Apache Parquet
Apache Parquet is a binary, columnar storage format. HERE Anonymizer Self-Hosted supports Parquet as a source and sink data format.
Configuration
When you set SOURCE_FORMAT=PARQUET or SINK_FORMAT=PARQUET, you can provide an optional field mapping with SOURCE_FORMAT_CONFIG or SINK_FORMAT_CONFIG. Use a JSON file path or an inline JSON string.
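The two ways of supplying the mapping can be sketched in Python as follows. This is only an illustration: `build_source_env` is a hypothetical helper, not part of the product, and how the resulting variables reach your deployment (shell, container runtime, orchestration tool) is not covered here.

```python
import json
from typing import Dict, Optional

def build_source_env(mapping: Dict[str, str],
                     config_path: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: returns the environment variables for a Parquet source.

    If config_path is given, the mapping is written to that file and the
    variable holds the path; otherwise the mapping is passed as inline JSON.
    """
    if config_path is not None:
        with open(config_path, "w", encoding="utf-8") as fh:
            json.dump(mapping, fh, indent=2)
        config_value = config_path
    else:
        # Compact separators keep the inline string on a single line.
        config_value = json.dumps(mapping, separators=(",", ":"))
    return {"SOURCE_FORMAT": "PARQUET", "SOURCE_FORMAT_CONFIG": config_value}

# Illustrative mapping; the column names are examples only.
field_mapping = {
    "timestamp": "sample_timestamp(epoch_millis)",
    "other_field": "ingestion_time",
}
env = build_source_env(field_mapping)
```

The inline form avoids managing a separate file, while the file form is easier to review and reuse across deployments.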
Field mapping
The JSON file uses key-value pairs where:
| JSON field | Description | Example |
|---|---|---|
| key | One of the internal field names: provider, trace_id, timestamp, latitude, longitude, heading, speed. Any other key is treated as an extended attribute. | timestamp |
| value | A Parquet column name, optionally with a format specifier in parentheses. To specify fallback columns, separate names with \|. Format specifiers are supported for timestamp and speed only. | sample_timestamp, sample_timestamp(epoch_millis), mm_lat \| lat |
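To make the value syntax concrete, here is a minimal Python sketch of how such a value could be split into column names and format specifiers. The function names are hypothetical. Two things are assumptions rather than documented behavior: that each fallback alternative may carry its own specifier, and that fallbacks are tried left to right with the first non-null column winning.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

# Column name, optionally followed by a format specifier in parentheses.
_SPEC = re.compile(r"^(?P<column>[^()]+?)\s*(?:\((?P<fmt>.+)\))?$")

def parse_mapping_value(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split 'a(fmt) | b' into [(column, format-or-None), ...] (hypothetical helper)."""
    candidates = []
    # Note: splitting on '|' means a format string containing '|' is not handled.
    for part in value.split("|"):
        part = part.strip()
        match = _SPEC.match(part)
        if not match:
            raise ValueError(f"cannot parse mapping value: {part!r}")
        candidates.append((match.group("column"), match.group("fmt")))
    return candidates

def first_present(record: Dict[str, Any],
                  candidates: List[Tuple[str, Optional[str]]]):
    """Assumed fallback rule: the first listed column with a non-null value wins."""
    for column, fmt in candidates:
        if record.get(column) is not None:
            return record[column], fmt
    return None, None

print(parse_mapping_value("sample_timestamp(epoch_millis)"))
print(first_present({"lat": 52.5}, parse_mapping_value("mm_lat | lat")))
```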
Custom key-value pairs override the default field mapping for Parquet, which is:
{
"provider": "provider",
"trace_id": "device_id",
"timestamp": "sample_time(epoch_seconds)",
"latitude": "mm_lat | lat",
"longitude": "mm_lng | lng",
"heading": "heading",
"speed": "speed(kmh)",
"dt": "device_type",
"iso": "iso",
"drop_reason": "drop_reason"
}
For example, to read timestamp from the sample_timestamp column with epoch_millis format and map the ingestion_time column to the extended attribute other_field, create parquet_config.json:
{
"timestamp": "sample_timestamp(epoch_millis)",
"other_field": "ingestion_time"
}
Set SOURCE_FORMAT_CONFIG to a path or an inline JSON string:
SOURCE_FORMAT_CONFIG=s3:///bucket-of-my-configs/path/to/parquet_config.json
Supported formats for the timestamp field
If no format is specified, the default is epoch_seconds.
epoch_seconds and epoch_millis apply to numeric fields, including string fields that contain numeric values.
Any other format string, for example yyyy-MM-dd'T'HH:mm:ssX, is interpreted as a Java DateTimeFormatter pattern. The internal representation is ZonedDateTime, so zone offsets are supported (see X, x, and z pattern letters).
| Format name | Description | Example |
|---|---|---|
| epoch_seconds | Timestamp in seconds since Unix epoch (January 1, 1970 00:00:00 UTC) | 1622547800 |
| epoch_millis | Timestamp in milliseconds since Unix epoch (January 1, 1970 00:00:00 UTC) | 1622547800000 |
| yyyy-MM-dd'T'HH:mm:ssX | Timestamp in basic W3C format, where Z is used for UTC timestamps | output: 1970-01-01T12:34:56Z, input might be 1970-01-01T09:34:56+03:00 |
| yyyy-MM-dd'T'HH:mm:ssxxx | Timestamp in basic W3C format, where +00:00 is used for UTC timestamps | output: 1970-01-01T12:34:56+00:00, input might be 1970-01-01T09:34:56+03:00 |
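As a rough illustration of the two epoch formats, the following Python sketch converts the example values from the table into UTC datetimes. `parse_epoch` is a hypothetical name, and the product itself uses Java types, so this only mirrors the arithmetic. DateTimeFormatter patterns are deliberately not handled here.

```python
from datetime import datetime, timezone

def parse_epoch(value, fmt: str = "epoch_seconds") -> datetime:
    """Convert a number or numeric string to a timezone-aware UTC datetime."""
    number = float(value)  # accepts 1622547800 as well as "1622547800"
    if fmt == "epoch_seconds":
        seconds = number
    elif fmt == "epoch_millis":
        seconds = number / 1000.0
    else:
        # Any other format would be a Java DateTimeFormatter pattern,
        # which this sketch does not attempt to interpret.
        raise ValueError(f"not an epoch format: {fmt!r}")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(parse_epoch("1622547800").isoformat())                   # 2021-06-01T11:43:20+00:00
print(parse_epoch(1622547800000, "epoch_millis").isoformat())  # 2021-06-01T11:43:20+00:00
```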
Supported formats for the speed field
The speed field supports three formats:
- kmh: speed in kilometers per hour. Internally converted to mps by dividing by 3.6.
- mps: speed in meters per second.
- mph: speed in miles per hour. Internally converted to mps by dividing by 2.23694.
If no format is specified, the default is kmh.
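The conversions above amount to a single division per unit. A minimal Python sketch, where `to_mps` is a hypothetical name and the default unit follows the documented default of kmh:

```python
# Units per one meter per second, using the coefficients quoted above.
UNITS_PER_MPS = {"mps": 1.0, "kmh": 3.6, "mph": 2.23694}

def to_mps(value: float, unit: str = "kmh") -> float:
    """Convert a speed value to meters per second."""
    try:
        return value / UNITS_PER_MPS[unit]
    except KeyError:
        raise ValueError(f"unsupported speed format: {unit!r}") from None

print(to_mps(90.0))          # 90 km/h is 25 m/s
print(to_mps(5.0, "mps"))    # already in m/s, unchanged
```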
Note
Parquet source files should use a block size of at most 128 MB.