Apache Parquet

Apache Parquet is a binary, columnar storage format. HERE Anonymizer Self-Hosted supports Parquet as a source and sink data format.

Configuration

When you set SOURCE_FORMAT=PARQUET or SINK_FORMAT=PARQUET, you can provide an optional field mapping with SOURCE_FORMAT_CONFIG or SINK_FORMAT_CONFIG. Use a JSON file path or an inline JSON string.

Field mapping

The JSON file uses key-value pairs where:

  • key: One of the internal field names: provider, trace_id, timestamp, latitude, longitude, heading, speed. Any other key is treated as an extended attribute. Example: timestamp
  • value: A Parquet column name, optionally with a format specifier in parentheses. To specify fallback columns, separate names with |. Format specifiers are supported for timestamp and speed only. Examples: sample_timestamp, sample_timestamp(epoch_millis), mm_lat | lat
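To make the value syntax concrete, the following is a minimal Python sketch of how such a string could be split into fallback columns and optional format specifiers. It is an illustration only, not the Anonymizer's actual parser; the function name parse_mapping_value and the choice to keep a specifier per column are assumptions.

```python
import re
from typing import List, Optional, Tuple

# A column name, optionally followed by "(format)".
# Hypothetical pattern for illustration; the real product may parse differently.
_PART_RE = re.compile(r"^(?P<name>[^()|]+?)\s*(?:\((?P<fmt>.+)\))?$")


def parse_mapping_value(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split a mapping value into (column, format) pairs in fallback order."""
    result = []
    for part in value.split("|"):
        part = part.strip()
        match = _PART_RE.match(part) if part else None
        if match is None:
            raise ValueError(f"Cannot parse mapping value part: {part!r}")
        # fmt is None when no specifier was given in parentheses.
        result.append((match.group("name"), match.group("fmt")))
    return result


print(parse_mapping_value("sample_timestamp(epoch_millis)"))
print(parse_mapping_value("mm_lat | lat"))
```

In this sketch the first pair in the list is the preferred column and later pairs are fallbacks.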

Custom key-value pairs override the default field mapping for Parquet. The default mapping is:

{
  "provider": "provider",
  "trace_id": "device_id",
  "timestamp": "sample_time(epoch_seconds)",
  "latitude": "mm_lat | lat",
  "longitude": "mm_lng | lng",
  "heading": "heading",
  "speed": "speed(kmh)",
  "dt": "device_type",
  "iso": "iso",
  "drop_reason": "drop_reason"
}

For example, to read timestamp from the sample_timestamp column with epoch_millis format and map ingestion_time to the extended attribute other_field, create parquet_config.json:

{
  "timestamp": "sample_timestamp(epoch_millis)",
  "other_field": "ingestion_time"
}

Set SOURCE_FORMAT_CONFIG to a path or an inline JSON string:

SOURCE_FORMAT_CONFIG=s3://bucket-of-my-configs/path/to/parquet_config.json
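The inline form carries the same JSON as the file, passed as a single string. As a hedged sketch, the mapping from the example above could be serialized to a compact one-line string with Python's standard json module. The helper name to_inline_config is hypothetical, and how the variable is quoted depends on how your deployment passes environment variables.

```python
import json


def to_inline_config(mapping: dict) -> str:
    """Serialize a field mapping to compact, single-line JSON."""
    return json.dumps(mapping, separators=(",", ":"))


mapping = {
    "timestamp": "sample_timestamp(epoch_millis)",
    "other_field": "ingestion_time",
}

inline = to_inline_config(mapping)
# Prints: SOURCE_FORMAT_CONFIG={"timestamp":"sample_timestamp(epoch_millis)","other_field":"ingestion_time"}
print(f"SOURCE_FORMAT_CONFIG={inline}")
```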

Supported formats for the timestamp field

If no format is specified, the default is epoch_seconds.

epoch_seconds and epoch_millis apply to numeric fields, including string fields that contain numeric values. Any other format string, for example yyyy-MM-dd'T'HH:mm:ssX, is interpreted as a Java DateTimeFormatter pattern. The internal representation is ZonedDateTime, so zone offsets are supported (see X, x, and z pattern letters).

  • epoch_seconds: Timestamp in seconds since the Unix epoch (January 1, 1970 00:00:00 UTC). Example: 1622547800
  • epoch_millis: Timestamp in milliseconds since the Unix epoch (January 1, 1970 00:00:00 UTC). Example: 1622547800000
  • yyyy-MM-dd'T'HH:mm:ssX: Timestamp in basic W3C format, where Z is used for UTC timestamps. Example output: 1970-01-01T12:34:56Z; input might be 1970-01-01T09:34:56+03:00
  • yyyy-MM-dd'T'HH:mm:ssxxx: Timestamp in basic W3C format, where +00:00 is used for UTC timestamps. Example output: 1970-01-01T12:34:56+00:00; input might be 1970-01-01T09:34:56+03:00
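The two epoch formats differ only in their unit. The sketch below, written in Python rather than the product's own implementation, shows how the example values above map to the same instant. The function name epoch_to_utc is hypothetical, and pattern-based formats are deliberately left out because they follow Java DateTimeFormatter rules.

```python
from datetime import datetime, timezone


def epoch_to_utc(value, fmt: str = "epoch_seconds") -> datetime:
    """Convert an epoch value (number or numeric string) to a UTC datetime."""
    number = float(value)  # accepts ints, floats, and numeric strings
    if fmt == "epoch_seconds":
        seconds = number
    elif fmt == "epoch_millis":
        seconds = number / 1000.0
    else:
        raise ValueError(f"Not an epoch format: {fmt!r}")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_utc(1622547800))                       # 2021-06-01 11:43:20+00:00
print(epoch_to_utc("1622547800000", "epoch_millis"))  # 2021-06-01 11:43:20+00:00
```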

Supported formats for the speed field

The speed field supports three formats:

  • kmh: speed in kilometers per hour. Internally converted to mps by dividing by 3.6.
  • mps: speed in meters per second. No conversion is needed.
  • mph: speed in miles per hour. Internally converted to mps by dividing by 2.23694.

If no format is specified, the default is kmh.
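The unit conversions can be sketched as follows. This is an illustrative Python snippet using the coefficients stated above, with kmh as the default; the function name to_mps and the constant names are hypothetical and not part of the product.

```python
# Coefficients from the documentation: 1 m/s = 3.6 km/h = 2.23694 mph.
KMH_PER_MPS = 3.6
MPH_PER_MPS = 2.23694


def to_mps(value: float, unit: str = "kmh") -> float:
    """Convert a speed value to meters per second."""
    if unit == "kmh":
        return value / KMH_PER_MPS
    if unit == "mps":
        return value
    if unit == "mph":
        return value / MPH_PER_MPS
    raise ValueError(f"Unsupported speed format: {unit!r}")


print(to_mps(72.0))          # about 20 m/s, using the default unit kmh
print(to_mps(12.5, "mps"))   # unchanged
```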

Note: Parquet source files should use a block size of at most 128 MB.