Batch Processing
The HERE platform supports the Apache Spark framework for running batch pipelines. We offer two different modules for this: Spark Connector and Spark Support.
Note: HERE strongly suggests using Spark Connector whenever possible, as it lets you make use of the full power of the Apache Spark framework.
Spark Connector
Spark Connector implements the standard Spark interfaces that allow you to read from a catalog into a DataFrame (Dataset[Row]) and to write a DataFrame back to a catalog.
As a result, you can use all standard Spark APIs and functions such as select, filter, map, and collect to work with the data. This means your business logic does not need to contain any HERE-specific function calls.
For a detailed explanation of Spark Connector, see Spark Connector.
Spark Support
Spark Support is a HERE-proprietary implementation for using data from HERE platform catalogs and layers with the Apache Spark framework. The distribution of processing jobs to workers is done by Spark, but the data model is a HERE-proprietary format. There is no SQL-like interface, so you cannot select and filter data using an RSQL query.
There are only a few specific use cases where this module should be used, for example when you need full control over the data format because you want to use a very compact one. This principle is used by the Data Processing Library. If you want to implement batch pipelines optimized for maximum performance, we suggest using the Data Processing Library.
There are plans, but no defined dates yet, to retire the Spark Support module in the long term.