Batch Processing
The HERE platform supports the Apache Spark framework for running batch pipelines. We offer two different modules for this: Spark Connector and Spark Support.
Note: HERE strongly suggests using Spark Connector whenever possible, as it lets you make use of the full power of the Apache Spark framework.
Spark Connector
Spark Connector implements the standard Spark interfaces that allow you to read from a catalog into a DataFrame (Dataset[Row]) and to write a DataFrame back to a catalog.
As a result, you can use all standard Spark APIs and functions such as select, filter, map, and collect to work with the data. This means your business logic does not need to contain any HERE-specific function calls.
For a detailed explanation of Spark Connector, see Spark Connector.
Spark Support
Spark Support is a HERE-proprietary implementation for using data from HERE platform catalogs and layers with the Apache Spark framework. The distribution of processing jobs to workers is done by Spark, but the data model is a HERE-proprietary format. There is no SQL-like interface, so you cannot select and filter data using an RSQL query.
There are only a few specific use cases where this module should be used, for example when you need full control over the data format because you want to use a very compact one. This principle is used by the Data Processing Library. If you want to implement batch pipelines optimized for maximum performance, we suggest using the Data Processing Library.
There are plans, but no defined dates yet, to retire the Spark Support module in the long term.