GuidesPipeline API ReferenceChangelog
Guides

Pipelines API pipeline components

A pipeline is the top-level entity in the HERE platform that performs your data processing tasks. When you develop your own special purpose pipeline, multiple versions of it can be stored and managed by the HERE platform system (see Pipeline version below). Developing a pipeline is an iterative process, and it is typical to upload new versions of the pipeline code for testing each pipeline with different input and output catalogs or settings. After testing, you can put the pipeline into production with live data.

Many elements go into making a pipeline. The following list defines the most important components.

  • Pipeline template – A pipeline template is the logical entity representing the definition of an executable pipeline and its run-time properties on the platform.

Depending on which API is used to create the pipeline template there can be either a single or multiple versions of the pipeline template.

  • Pipelines API v2: Under this API a pipeline template will always have exactly one pipeline template version. The pipeline template and pipeline template version are managed as a single entity.
  • Pipelines API v3: Under this API a pipeline template can consist of one or more pipeline template versions. Pipeline template versions can be created and deleted individually.

In the Pipelines API v2, pipeline templates are identified by their UUID.

As of v3, all pipeline templates are identified by their HERE Resource Name (HRN).

  • Pipeline template version (V3 only) - This is the immutable definition of a version of an executable pipeline and its run-time properties on the platform.

Each time a pipeline template is created, the initial pipeline template version is created for it.

Creating additional versions for a pipeline template is similar to creating additional versions for a pipeline.

However, pipeline versions and pipeline template versions are independent entities - a pipeline template version can be used by many pipeline versions.

A pipeline template version is uniquely identified by its parent's pipeline template HRN and a unique version ID, that is typically a sequence number.

In addition to others, the pipeline template version allows you to specify several properties introduced by the v3 API:

  • Ports: These are the generalized input and output resource identifiers of the pipeline template.
  • Port compatibility: This property allows you to specify the configuration requirements for the ports expected by the pipeline implementation.
  • Runtime configuration schema: This is an optional JSON schema specification that describes the runtime configuration. It can contain relevant information for each supported property, such as type and whether it is required.

Sections below contain more information on the properties mentioned previously.

  • Pipeline template – A pipeline template is the logical entity representing the definition of an executable pipeline and its run-time properties on the platform.

It defines logical identifiers for the input and output catalogs it will use. One pipeline template can be used to create multiple pipeline versions.

Each template is assigned a unique template ID (UUID) by the pipeline when it is created. Pipelines and pipeline templates are independent entities.

A pipeline template can be used by many pipeline versions.

The pipeline template contains the JAR file as well as all the configuration information necessary to access, process, and store data.


  • Pipeline version – This is an immutable entity that is the executable form of a pipeline within the HERE platform.

Each pipeline version is created from a pipeline template version that contains a JAR file.

Each pipeline version is assigned its own identifier by the pipeline when created - {{ 'UUID for Pipelines API v2 or a sequence number for Pipelines API v3' if book.product == 'internal' else 'UUID'}}.

Multiple pipeline versions can be defined based on a single pipeline template version.

However, two instances of a pipeline version within the same pipeline cannot run at the same time.

  • Package – This is an immutable entity that represents a JAR file with executable pipeline code that has been uploaded onto the HERE platform.

The JAR file must contain both compiled pipeline code and supporting libraries. This is commonly referred to as a Fat JAR file.

A package contains metadata, such as the name of the file, in addition to the binary artifact. The filename of the JAR file has a 200-character limit, and the size of the binary artifact cannot exceed 500MB.

The best practice for selecting a filename for the JAR file is to use a semantically meaningful name and include some kind of unique versioning system.

The JAR file is identified within the pipeline template and is uploaded to the pipeline, where it is assigned a unique package ID (UUID). As of Pipelines API v3, packages are no longer an external component. Instead, the JAR file is uploaded as part of the creation of the pipeline template version, simplifying the steps to create a pipeline template version.

It is also possible to reuse a previously uploaded package of the same pipeline template for a pipeline template version.

  • Job – A job is an immutable entity that represents the one-time configuration parameters and input catalogs that are submitted to the cluster by a running pipeline version for that job only. When it specifies the exact versions of the input catalogs to be processed, that information overrides similar information specified in the pipeline template. It's possible to obtain the list of running or historical Jobs. Jobs have a state that may change over time as the job runs and terminates.
    • Stream job – For stream processing, a job is not intended to terminate by itself. These jobs typically have a continuing stream of input data for processing.
    • Batch job – For batch processing, a job terminates by either succeeding or failing to process the available input data. These jobs typically are run on one or more finite collections of data. {% if book.product == 'internal' %}
  • Port (V3 only) - A port is an abstraction for an input or output resource. A catalog is an example of a port.

A port must be declared in the pipeline template version as input or output. A port must be bound to a physical resource as part of the pipeline version.

For example, in case of a catalog port the port identifier must be bound to a catalog HRN.

A port should also meet the port compatibility requirements specified at the pipeline template version level, if there are any.

For more information, see the Configurations available for pipeline developers - Ports and bindings.

  • Catalog – A catalog is a specific port type, and the only type currently supported. It refers to platform data catalogs. An input catalog is the data source for the pipeline, and the output catalog is the data destination of the pipeline. A pipeline can have more than one input catalog, but can only have one output catalog.

  • Input and output catalogs – Catalogs refer to platform data catalogs. The input catalog is the data source for the pipeline, and the output catalog is the data destination from the pipeline. A pipeline can have more than one input catalog, but can only have one output catalog.

  • For a stream pipeline, you may need to specify catalog versions, depending on the type of catalog layer used (that is, versioned, volatile, or stream).
  • For a batch pipeline, you can choose to run the pipeline immediately or schedule it to run when the input catalog data is updated. To run the pipeline immediately, you must specify the catalog versions. To schedule the pipeline mode, you do not need to specify the catalog versions. Instead, the pipeline scheduler checks the input and output catalogs every five minutes to capture changes and check for consistent versions for all the catalogs to be processed. For example, assume that the pipeline has two input catalogs and one has changes from an upstream catalog version 5 and the other input catalog also includes the same upstream catalog, but has not yet processed version 5. Thus, you cannot run the pipeline because the two input catalogs' versions are not consistent.

Batch jobs can be scheduled as:

  • Run Now - The pipeline will run only once and return to the Ready state.
  • Schedule/Data Change - The pipeline will run when data changes are available, and will remain scheduled.
  • Schedule/Time Schedule - The pipeline will run at specific times and will remain scheduled (see Scheduler configuration in the section below). There are other equally important components in the form of commands and configurations that enable you to create your customized pipeline.
  • Operation – These are special operational commands that can be submitted to a pipeline version. The pipeline version can accept them or not, typically because they may be invalid or another Operation is already pending. Operations have a state that can be checked to see if an operation is still in progress or has been completed with a result.
  • Runtime configuration – A set of parameters that can be specified at run-time to configure the pipeline runtime environment.

The pipeline template specifies the default configuration parameters.

However, some of these parameters may be defined in specific job's configuration to override the default parameters of the template.

In a pipeline version, the runtime configuration specifies the actual configuration passed to jobs submitted for processing by that pipeline version.

Custom configuration parameters are placed on the classpath of the pipeline, as application.properties, which you can reference from within the pipeline code. For more information, see the Configurations available for pipeline developers.

  • Runtime configuration schema (V3 only)RuntimeConfigurationSchema is a property used with the pipeline template version. It allows the author of a pipeline template version to publish the exact specifications of the supported runtime configuration parameters. For more information, see the Configurations available for pipeline developers - Runtime parameters.

  • Scheduler configurationSchedulerConfig is a property used in each pipeline version. The scheduler controls when a job is created and submitted to the Flink or Spark cluster for processing. The scheduler may start a new job when the previous one completes as expected, or not as expected. The scheduler polls or waits for changes from upstream catalogs. It can also operate due to timers or other external triggers. It includes properties like when to start a job, whether to restart terminated jobs, and polling intervals for upstream catalogs.
  • Cluster configuration - ClusterConfig is a property used with the pipeline template and pipeline version. In a pipeline template, it represents the suggested minimum size of the cluster needed to run a pipeline version based on that pipeline template. It represents the actual size and configuration of the cluster dedicated to the execution of that specific pipeline version. The configuration specifies properties like the size of the cluster in processing units of a fixed number of CPUs and of memory. Also, the configuration includes a customizable Garbage Collector (GC) profile to optimize performance and resource management for each pipeline, improving application efficiency and responsiveness. You can also specify overhead memory fraction to be used for Batch pipelines. The following table lists the cluster configuration parameters. For more details, see the Deploy a pipeline via web portal - Resource profiles article.
Config ParameterMeaning
supervisorUnitsNumber of resource units per Supervisor (Flink JobManager or Spark Driver)
supervisorAdditionalOverheadMemoryFraction of proportionality of additional non-heap memory per Supervisor (Spark Driver)
workerUnitsNumber of resource units per Worker (Flink TaskManager or Spark Executor)
workersNumber of Workers (number of Flink TaskManagers or Spark Executors)
workerAdditionalOverheadMemoryFraction of proportionality of additional non-heap memory per Worker (Spark Executor)
gcProfileOption available for GC profile selection
📘

Note

For supervisorUnits and workerUnits, the following size limits apply:

  • minimum size: 1 unit
  • maximum size: 15 units

Another limitation is that a pipeline can contain only single supervisor and not more than 199 workers.

See also