How to read versioned layer data
The Data Client Library provides the class LayerDataFrameReader, a custom Spark DataFrameReader for creating DataFrames that contain the data for all supported layer types, including versioned layers.
All the formats supported by DataFrameReader are also supported by the LayerDataFrameReader. Additionally, it supports formats such as Apache Avro, Apache Parquet, Protobuf, and raw byte arrays (octet-stream).
Read process
The read operation works according to the following steps:
- The Spark connector first contacts the server to get information about the layer, for example the layer type, layer schema, and layer encoding format.
- Partitions within the layer are filtered using the provided filter query. If no query is provided, the value "mt_version==LATEST" is used by default, which matches all the partitions in the latest version.
- At this stage, the layer format is known. The connector creates the corresponding Spark file format and, together with the partition data, produces an iterator of rows (records).
- Implicit columns are added to each row depending on the layer type and partition metadata.
- The resulting rows are handed over to the Spark framework, which returns the finalized DataFrame.
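The filter query used in the second step is a plain string, so it can be assembled before it is passed to the reader. The following is a minimal sketch, not part of the Data Client Library: the class PartitionQuery and its method forPartitions are hypothetical names, and returning the default query explicitly for an empty list is an assumption made for illustration (the connector applies that default itself when no query is provided). The query shapes are the ones used on this page.

```java
import java.util.List;

public class PartitionQuery {
    // Default value named in the read process: matches all partitions of the latest version.
    static final String DEFAULT_QUERY = "mt_version==LATEST";

    // Hypothetical helper: builds a query string for zero, one, or several partition IDs.
    static String forPartitions(List<String> partitionIds) {
        if (partitionIds == null || partitionIds.isEmpty()) {
            return DEFAULT_QUERY;
        }
        if (partitionIds.size() == 1) {
            return "mt_partition==" + partitionIds.get(0);
        }
        return "mt_partition=in=(" + String.join(", ", partitionIds) + ")";
    }

    public static void main(String[] args) {
        System.out.println(forPartitions(List.of()));
        System.out.println(forPartitions(List.of("23618402")));
        System.out.println(forPartitions(List.of("DEU", "CUB")));
    }
}
```

The resulting string can then be passed to the query method of the reader, as in the snippets further down this page.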
DataFrame columns
Besides the user-defined columns, which derive from the partition data, the Spark connector provides additional columns that represent the data partitioning information and the partition payload attributes.
Data columns
These correspond to the user-defined columns and derive from the partition data.
Layer partitioning columns
| Column name | Data Type | Meaning |
|---|---|---|
| mt_partition | String | Partition ID |
| mt_version | Long | Layer version |
Partition payload attribute columns
| Column name | Data Type | Meaning |
|---|---|---|
| mt_metadata | Map[String, String] | Metadata of partition |
| mt_timestamp | Long | Timestamp of creation (UTC) |
| mt_checksum | String | Checksum of payload |
| mt_crc | String | CRC of payload |
| mt_dataSize | Long | Size of payload |
| mt_compressedDataSize | Long | Compressed size of payload |
Project dependencies
If you want to create an application that uses the HERE platform Spark Connector to read data from a versioned layer, add the required dependencies to your project as described in the chapter Dependencies for Spark Connector.
Read Parquet-Encoded data
The following snippet demonstrates how to access a Parquet-encoded DataFrame
from a versioned layer of a catalog. Note that the Parquet schema is expected to
be bundled with the data. Therefore, you don't need to specify the format
explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing parquet-encoded SDII data)
val reader = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition==$tileId")
if (compressed)
  reader.option("olp.connector.data-decompression-timeout", 300000)
val df = reader.load()
df.printSchema()
val messagesWithAtLeastOneSignRecognition = df
  .select("pathEvents.signRecognition")
  .where("size(pathEvents.signRecognition) > 0")
val count = messagesWithAtLeastOneSignRecognition.count()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing parquet-encoded SDII data)
Dataset<Row> dataFrame =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query(String.format("mt_partition==%s", tileId))
        .load();
dataFrame.printSchema();
Dataset<Row> messagesWithAtLeastOneSignRecognition =
    dataFrame
        .select("pathEvents.signRecognition")
        .where("size(pathEvents.signRecognition) > 0");
Long count = messagesWithAtLeastOneSignRecognition.count();
Read Avro-Encoded data
The following snippet demonstrates how to access an Avro-encoded DataFrame
from a versioned layer of a catalog. Note that the Avro schema is expected to be
bundled with the data. Therefore, you don't need to specify the format
explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{DataFrame, SparkSession}
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing avro-encoded SDII data)
val reader = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition==$tileId")
if (compressed)
  reader.option("olp.connector.data-decompression-timeout", 300000)
val df: DataFrame = reader.load()
df.printSchema()
val messagesWithAtLeastOneSignRecognition = df
  .select("pathEvents.signRecognition")
  .where("size(pathEvents.signRecognition) > 0")
val count = messagesWithAtLeastOneSignRecognition.count()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing avro-encoded SDII data)
Dataset<Row> dataFrame =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query(String.format("mt_partition==%s", tileId))
        .load();
dataFrame.printSchema();
Dataset<Row> messagesWithAtLeastOneSignRecognition =
    dataFrame
        .select("pathEvents.signRecognition")
        .where("size(pathEvents.signRecognition) > 0");
Long count = messagesWithAtLeastOneSignRecognition.count();
Read Protobuf-Encoded data
The following snippet demonstrates how to access a Protobuf-encoded DataFrame
from a versioned layer of a catalog. Note that the Protobuf schema is expected
to be referenced from the layer configuration. Therefore, you don't need to
specify the format explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing protobuf-encoded SDII data that is a copy of real data
// from the "indexed-locations" layer of the "hrn:here:data:::rib-2" catalog)
val dataFrame = sparkSession
  .readLayer(catalogHrn, layerId)
  .query("mt_partition=in=(DEU, CUB)")
  .load()
val tileIdLists = dataFrame
  .select(col("partition_name"), explode(col("tile_id")).as("tile"))
tileIdLists.show()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing protobuf-encoded SDII data that is a copy of real data
// from the "indexed-locations" layer of the "hrn:here:data:::rib-2" catalog)
Dataset<Row> dataFrame =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query("mt_partition=in=(DEU, CUB)")
        .load();
Dataset<Row> tileIdLists =
    dataFrame.select(
        dataFrame.col("partition_name"),
        org.apache.spark.sql.functions.explode(dataFrame.col("tile_id")).as("tile"));
tileIdLists.show();
Note that to read Protobuf data from a layer, the schema must be specified in the layer configuration and must be available in the Artifact Service. Furthermore, the schema must have a ds variant. For more information on how to maintain schemas, see Set up the Maven project.
Read JSON-Encoded data
The following snippet demonstrates how to access a JSON-encoded DataFrame from
a versioned layer of a catalog.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{Encoders, SparkSession}
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing json-encoded data)
val result = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition=in=($partitionId, $anotherPartitionId)")
  .load()
  .map { row =>
    val (tileId, index) = (row.getAs[Long]("tileId"), row.getAs[Long]("index"))
    (tileId, index)
  }(Encoders.tuple[Long, Long](Encoders.scalaLong, Encoders.scalaLong))
  .collect()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
import scala.Tuple2;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing json-encoded data)
List<Tuple2<Long, Long>> groupedData =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query("mt_partition=in=(" + partitionId + ", " + anotherPartitionId + ")")
        .load()
        .map(
            (MapFunction<Row, Tuple2<Long, Long>>)
                row -> {
                  Long tileId = row.getAs("tileId");
                  Long index = row.getAs("index");
                  return new Tuple2<>(tileId, index);
                },
            Encoders.tuple(Encoders.LONG(), Encoders.LONG()))
        .collectAsList();
Read Text-Encoded data
The following snippet demonstrates how to access a Text-encoded DataFrame from
a versioned layer of a catalog. In this example, the row object contains the field
data as a string.
Note: Restrictions
While reading Text data, each line becomes a row with a single string column named "value" by default. Therefore, the Text data source has only a single column, "value", per row.
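Because each row carries only one string column, any structure inside a line has to be parsed by the caller. The sketch below is a standalone illustration that assumes lines of the form tileId_index, as in the snippets of this section. TextValueParser and parse are hypothetical names and are not part of the Data Client Library. Unlike a bare split, it rejects values that do not consist of exactly two parts.

```java
public class TextValueParser {
    // Hypothetical helper: parses a "tileId_index" line into {tileId, index}.
    static int[] parse(String value) {
        // Limit -1 keeps trailing empty parts, so "1_2_" is rejected instead of silently accepted.
        String[] parts = value.split("_", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 parts but got " + parts.length + " in value '" + value + "'");
        }
        return new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
    }

    public static void main(String[] args) {
        int[] parsed = parse("377893751_3");
        System.out.println(parsed[0] + " " + parsed[1]);
    }
}
```

A function like this could be called from inside the map step on the string read from the "value" column, so that malformed lines fail with a descriptive message.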
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{Encoders, SparkSession}
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing plain text data)
val result = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition=in=($partitionId, $anotherPartitionId)")
  .load()
  .map { row =>
    val parts = row.getAs[String]("value").split("_").map(_.toInt)
    require(
      parts.length == 2,
      s"Expected 2 parts but got ${parts.length} in value '${row.getAs[String]("value")}'")
    val (tileId, index) = (parts(0), parts(1))
    (tileId, index)
  }(Encoders.tuple[Int, Int](Encoders.scalaInt, Encoders.scalaInt))
  .collect()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
import scala.Tuple2;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing plain text data)
List<Tuple2<Integer, Integer>> groupedData =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query("mt_partition=in=(" + partitionId + ", " + anotherPartitionId + ")")
        .load()
        .map(
            (MapFunction<Row, Tuple2<Integer, Integer>>)
                row -> {
                  String value = row.getAs("value");
                  String[] result = value.split("_");
                  Integer tileId = Integer.valueOf(result[0]);
                  Integer index = Integer.valueOf(result[1]);
                  return new Tuple2<>(tileId, index);
                },
            Encoders.tuple(Encoders.INT(), Encoders.INT()))
        .collectAsList();
Read CSV-Encoded data
The following snippet demonstrates how to access a CSV-encoded DataFrame from
a versioned layer of a catalog. In this example, the CSV rows contain the columns
tileId and index, both read as strings.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{Encoders, SparkSession}
// val sparkSession: org.apache.spark.sql.SparkSession
// val catalogHrn: HRN (HRN of a catalog that contains the layer $layerId)
// val layerId: String (ID of a versioned layer containing CSV text data)
val result = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition=in=($partitionId, $anotherPartitionId)")
  .load()
  .map { row =>
    val (tileId, index) = (row.getAs[String]("tileId"), row.getAs[String]("index"))
    (tileId, index)
  }(Encoders.tuple[String, String](Encoders.STRING, Encoders.STRING))
  .collect()

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
import scala.Tuple2;
// org.apache.spark.sql.SparkSession sparkSession
// HRN catalogHrn (HRN of a catalog that contains the layer $layerId)
// String layerId (ID of a versioned layer containing CSV text data)
List<Tuple2<String, String>> groupedData =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query("mt_partition=in=(" + partitionId + ", " + anotherPartitionId + ")")
        .load()
        .map(
            (MapFunction<Row, Tuple2<String, String>>)
                row -> {
                  String tileId = row.getAs("tileId");
                  String index = row.getAs("index");
                  return new Tuple2<>(tileId, index);
                },
            Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
        .collectAsList();
Read other formats
The following snippet demonstrates how to access data in any arbitrary format from a versioned layer of a catalog:
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{Encoders, Row, SparkSession}
// Needed for .asScala on the java.util.List returned by collectAsList()
import scala.collection.JavaConverters._
val reader = sparkSession
  .readLayer(catalogHrn, layerId)
  .query(s"mt_partition==$tileId")
if (compressed)
  reader.option("olp.connector.data-decompression-timeout", 300000)
val dataFrame = reader.load()
dataFrame.printSchema()
val dataFrameStringContent = dataFrame
  .map[String]((r: Row) => r.getAs[String]("value"))(Encoders.STRING)
  .collectAsList()
  .asScala

import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
Dataset<Row> dataFrame =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .query(String.format("mt_partition==%s", tileId))
        .load();
dataFrame.printSchema();
List<String> dataFrameStringContent =
    dataFrame
        .map((MapFunction<Row, String>) row -> row.getAs("value"), Encoders.STRING())
        .collectAsList();
Known issues
DataFrame should contain the columns representing the partitioning information, such as the partition ID and the layer version. Currently, these columns are missing.
Note
The raw format refers to application/octet-stream in the layer configuration and is not to be confused with the raw layer configuration.
For information on RSQL, see RSQL.