Pipeline components
A pipeline is the top-level entity in the HERE platform that performs your data processing tasks. When you develop your own special-purpose pipeline, the HERE platform can store and manage multiple versions of it (see Pipeline version below). Developing a pipeline is an iterative process, and it is typical to upload new versions of the pipeline code and test each one with different input and output catalogs or settings. After testing, you can put the pipeline into production with live data.
Many elements go into making a pipeline. The following list defines the most important components.
- Pipeline version – This is an immutable entity that is the executable form of a pipeline within the HERE platform. Each pipeline version is created from a pipeline template version that contains a JAR file, and is assigned its own identifier (a UUID in Pipelines API v2) when it is created. Multiple pipeline versions can be defined based on a single pipeline template version. However, two instances of a pipeline version within the same pipeline cannot run at the same time.
- Package – This is an immutable entity that represents a JAR file with executable pipeline code that has been uploaded to the HERE platform. The JAR file must contain both the compiled pipeline code and its supporting libraries; this is commonly referred to as a fat JAR file. In addition to the binary artifact, a package contains metadata such as the filename. The filename of the JAR file has a 200-character limit, and the size of the binary artifact cannot exceed 500 MB.

  The best practice for naming the JAR file is to use a semantically meaningful name that includes some form of version identifier. The JAR file is referenced by the pipeline template and, when uploaded to the platform, is assigned a unique package ID (UUID).
- Job – A job is an immutable entity that represents the one-time configuration parameters and input catalogs that a running pipeline version submits to the cluster for that job only. When a job specifies the exact versions of the input catalogs to be processed, that information overrides similar information specified in the pipeline template. You can obtain the list of running or historical jobs. A job has a state that may change over time as the job runs and terminates.
- Stream job – A stream-processing job is not intended to terminate by itself. These jobs typically process a continuing stream of input data.
- Batch job – A batch-processing job terminates by either succeeding or failing to process the available input data. These jobs typically run on one or more finite collections of data.
- For a stream pipeline, you may need to specify catalog versions, depending on the type of catalog layer used (that is, versioned, volatile, or stream).
- For a batch pipeline, you can choose to run the pipeline immediately or schedule it to run when the input catalog data is updated. To run the pipeline immediately, you must specify the catalog versions. To schedule the pipeline, you do not need to specify the catalog versions. Instead, the pipeline scheduler checks the input and output catalogs every five minutes, capturing changes and checking that the versions of all catalogs to be processed are consistent. For example, suppose a pipeline has two input catalogs that both derive from the same upstream catalog: if one input catalog has already incorporated version 5 of the upstream catalog but the other has not yet processed version 5, the pipeline cannot run because the two input catalogs' versions are not consistent.
Batch jobs can be scheduled as:
- Run Now - The pipeline will run only once and return to the Ready state.
- Schedule/Data Change - The pipeline will run when data changes are available, and will remain scheduled.
- Schedule/Time Schedule - The pipeline will run at specific times and will remain scheduled (see Scheduler configuration in the section below).
There are other equally important components in the form of commands and configurations that enable you to create your customized pipeline.
- Operation – Operations are special operational commands that can be submitted to a pipeline version. The pipeline version may accept or reject them; a rejection typically occurs because the operation is invalid or another operation is already pending. Operations have a state that you can check to see whether an operation is still in progress or has completed with a result.
- Runtime configuration – A set of parameters that can be specified at run time to configure the pipeline runtime environment. The pipeline template specifies the default configuration parameters; some of them can be overridden in a specific job's configuration. In a pipeline version, the runtime configuration specifies the actual configuration passed to jobs submitted for processing by that pipeline version. Custom configuration parameters are placed on the pipeline's classpath as application.properties, which you can reference from within the pipeline code. For more information, see Configurations available for pipeline developers.
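Because the custom parameters end up on the classpath as application.properties, pipeline code can read them with standard Java resource loading. The following sketch is illustrative only: the RuntimeConfig helper class and the key name are not part of the HERE SDK, and the platform may offer its own configuration accessors.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper that loads the custom runtime parameters the
// platform places on the pipeline classpath as application.properties.
class RuntimeConfig {
    static Properties load() {
        Properties props = new Properties();
        try (InputStream in = RuntimeConfig.class.getClassLoader()
                .getResourceAsStream("application.properties")) {
            if (in != null) {
                props.load(in); // file present: parse key=value pairs
            }
            // file absent (e.g. local testing): return empty properties
        } catch (IOException e) {
            throw new RuntimeException("Failed to read application.properties", e);
        }
        return props;
    }

    // Convenience accessor with a fallback for missing keys.
    static String get(String key, String defaultValue) {
        return load().getProperty(key, defaultValue);
    }
}
```

In pipeline code you would then call something like `RuntimeConfig.get("my.custom.threshold", "10")`, where the key is whatever you defined in the runtime configuration.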
- Scheduler configuration – SchedulerConfig is a property used in each pipeline version. The scheduler controls when a job is created and submitted to the Flink or Spark cluster for processing. The scheduler may start a new job when the previous one completes, whether or not it completed as expected. The scheduler polls or waits for changes from upstream catalogs, and it can also act on timers or other external triggers. It includes properties such as when to start a job, whether to restart terminated jobs, and polling intervals for upstream catalogs.
- Cluster configuration – ClusterConfig is a property used with both the pipeline template and the pipeline version. In a pipeline template, it represents the suggested minimum size of the cluster needed to run a pipeline version based on that template. In a pipeline version, it represents the actual size and configuration of the cluster dedicated to executing that specific pipeline version. The configuration specifies properties such as the size of the cluster in processing units, each with a fixed number of CPUs and amount of memory. The configuration also includes a customizable Garbage Collector (GC) profile to optimize performance and resource management for each pipeline, improving application efficiency and responsiveness. You can also specify an overhead memory fraction for batch pipelines. The following table lists the cluster configuration parameters. For more details, see the Deploy a pipeline via web portal - Resource profiles article.
| Config Parameter | Meaning |
|---|---|
| supervisorUnits | Number of resource units per Supervisor (Flink JobManager or Spark Driver) |
| supervisorAdditionalOverheadMemory | Fraction of additional non-heap memory per Supervisor (Spark Driver) |
| workerUnits | Number of resource units per Worker (Flink TaskManager or Spark Executor) |
| workers | Number of Workers (number of Flink TaskManagers or Spark Executors) |
| workerAdditionalOverheadMemory | Fraction of additional non-heap memory per Worker (Spark Executor) |
| gcProfile | Option available for GC profile selection |
Note: For supervisorUnits and workerUnits, the following size limits apply:
- minimum size: 1 unit
- maximum size: 15 units
Another limitation is that a pipeline can contain only a single supervisor and no more than 199 workers.
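As an illustration of how these parameters fit together, a cluster configuration fragment might look like the following JSON. This is a sketch only: the exact request schema and the valid gcProfile values are defined by the Pipelines API reference, and the numbers shown are arbitrary values within the documented limits.

```json
{
  "supervisorUnits": 2,
  "supervisorAdditionalOverheadMemory": 0.1,
  "workerUnits": 4,
  "workers": 10,
  "workerAdditionalOverheadMemory": 0.1,
  "gcProfile": "default"
}
```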
- pipeline-config.conf – This file contains the parameters describing the input catalogs, the output catalog, and billing tags. For more information, see Configurations available for pipeline developers.
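To give a feel for this file, a minimal pipeline-config.conf might look like the sketch below. The HRNs, the billing tag, and the logical input catalog name ("source") are placeholders; the logical names must match those defined by your pipeline template.

```
pipeline.config {
  billing-tag = "my-billing-tag"
  output-catalog { hrn = "hrn:here:data::myorg:output-catalog" }
  input-catalogs {
    // "source" is the logical name the pipeline code uses for this catalog
    source { hrn = "hrn:here:data::myorg:input-catalog" }
  }
}
```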
- pipeline-job.conf – This file contains a set of parameters used to customize the execution mode of batch pipelines. For more information, see Configurations available for pipeline developers.
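For illustration, a pipeline-job.conf for a batch run might resemble the following sketch. The catalog names, version numbers, and processing-type value are placeholders; consult Configurations available for pipeline developers for the authoritative set of keys.

```
pipeline.job.catalog-versions {
  output-catalog { base-version = 41 }
  input-catalogs {
    // pin the "source" input catalog to a specific version for this job
    source {
      processing-type = "reprocess"
      version = 5
    }
  }
}
```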