Best practices for high availability
Critical applications that rely on collecting and processing large amounts of data require the underlying infrastructure to be reliable and responsive. To ensure that pipelines continue to operate reliably and meet business needs, the HERE platform offers several options for creating and deploying highly available pipelines. This guide will help you understand these options and how best to use them.
Update pipeline contact email
It is recommended to provide an email address for your pipeline, so that if it is restarted by the HERE Platform for some reason, such as a planned outage, an email notification is sent to the assigned email address. You can specify a contact email when creating a pipeline, or set an email for an existing pipeline by updating it.
Without updated contact information, you risk missing important notifications about your pipeline, which can lead to unexpected disruptions and a lack of awareness regarding the pipeline's status.
To learn more about these notifications, see the Enable notifications for stream pipelines chapter in Stream processing best practices.
Use pipeline monitoring and alerts
Pipelines generate certain standard metrics that can be used to track their status over time. To monitor the status of a pipeline,
a Pipeline Status dashboard is available in Grafana. Based on that dashboard, you can configure failure alerts to be sent via email or other channels.
It's also possible to create custom metrics for pipelines, so you can monitor specific expected outputs and be alerted about them.
For additional insight into pipeline performance and potential areas of improvement, use the Spark UI for batch pipelines
and the [Flink Dashboard](https://docs.here.com/workspace/docs/flink-ui) for stream pipelines. These tools provide performance data to help you assess and improve the efficiency of your pipelines.
By following these practices, you ensure that you are always informed of your pipeline's status and can proactively address issues.
For more information, see the Pipeline monitoring section and the Logs, Monitoring, and Alerts User Guide.
Enable checkpointing for stream pipelines
For a failed stream pipeline to continue processing, it's important to restore it from a previous running state. Flink checkpointing (which is disabled by default) is required to create such states. When enabled, Flink takes consistent snapshots (checkpoints) of the pipeline at specified intervals. These checkpoints are used to restore the pipeline to its last running state when it is restarted, preventing data from being reprocessed or lost.
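As an illustration, at the Apache Flink level checkpointing is controlled by configuration keys such as the following. The interval values here are illustrative, and the way these settings are supplied to a HERE pipeline may differ from a plain Flink deployment:

```yaml
# Take a consistent snapshot (checkpoint) of the pipeline state every 60 seconds
execution.checkpointing.interval: 60 s
# Restore with exactly-once processing guarantees
execution.checkpointing.mode: EXACTLY_ONCE
# Leave at least 10 seconds between the end of one checkpoint and the start of the next
execution.checkpointing.min-pause: 10 s
```

A shorter interval reduces the amount of data reprocessed after a restart, at the cost of more frequent snapshot overhead.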
For more information, see the following guides:
Enable a restart strategy for stream pipelines
By default, failed stream pipelines do not restart. To increase the availability of your stream pipelines, it is recommended that you enable a restart strategy for them. With an appropriate restart strategy, you can ensure that your pipelines automatically restart after a failure, reducing downtime and maintaining operational stability.
Flink supports the following restart strategies:
- No restart strategy
- Fixed delay restart strategy
- Exponential delay restart strategy
- Failure rate restart strategy
- Fallback restart strategy
Stream 5.0
If checkpointing is enabled, the default restart strategy is Fixed Delay; otherwise, No Restart is used.
Stream 6.0
The default restart strategy is Failure Rate, with a 30-minute interval, a 15-second delay, and a maximum of 3 failures per interval.
This strategy applies to jobs both with and without checkpointing enabled.
For more information, see the following guides:
- Stream processing best practices - Restart strategies
- Apache Flink documentation on restart strategies
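For illustration, the strategies above map to Apache Flink configuration keys such as the following. The fixed-delay values are illustrative, the failure-rate values mirror the Stream 6.0 defaults described above, and how these settings are surfaced for a HERE pipeline may differ:

```yaml
# Option A: fixed delay, retry up to 3 times, waiting 10 s between attempts
# (attempt count and delay are illustrative)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Option B: failure rate, at most 3 failures within a 30-minute window,
# with 15 s between restart attempts (comment out Option A to use this)
# restart-strategy: failure-rate
# restart-strategy.failure-rate.max-failures-per-interval: 3
# restart-strategy.failure-rate.failure-rate-interval: 30 min
# restart-strategy.failure-rate.delay: 15 s
```

Only one strategy can be active at a time; the failure-rate strategy is useful when occasional failures are expected but a persistent failure loop should stop the pipeline.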
Provision extra TaskManagers with a short restart strategy
To ensure your stream pipeline remains resilient in the event of TaskManager failures, it is recommended to provision extra TaskManagers and configure a short restart strategy. This approach helps minimize disruptions and maintain continuous data processing.
When a Flink TaskManager fails or becomes unreachable, the pipeline tries to obtain a replacement. This can take several minutes, and during this time data processing is on hold. The pipeline can also fail outright if the retries exceed the time or frequency thresholds before a new TaskManager is spun up and assigned. To avoid these situations, it is recommended to provision additional standby TaskManagers when configuring the pipeline's resource cluster. The additional TaskManagers do not process data during normal pipeline operation and are used only if an active TaskManager fails.
Note: Additional cost
Provisioning additional TaskManagers introduces additional cost. For critical pipelines, several extra TaskManagers can be provisioned.
For example, if a pipeline application has a parallelism of 8 and is assigned 9 TaskManagers, only 8 TaskManagers process the data. The 9th is on standby until one of the 8 active TaskManagers fails. To take advantage of this setup, it is recommended to set a shorter time interval in the Flink restart strategy. If an active TaskManager fails, the pipeline quickly selects the already-available standby TaskManager to continue processing, rather than retrying and waiting for a new one.
For more information on restart strategies and stream application parallelism, see the following guides:
- Stream processing best practices - Restart strategies
- Stream processing best practices - Flink parallelism and cluster configuration
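The standby setup from the example above can be sketched with Apache Flink settings like the following, assuming each TaskManager provides one task slot. The attempt count and delay are illustrative, and the way TaskManager counts and these settings are configured for a HERE pipeline cluster may differ:

```yaml
# Application parallelism: 8 parallel subtasks, so 8 TaskManagers are active
parallelism.default: 8
# One processing slot per TaskManager; a 9th TaskManager stays on standby
taskmanager.numberOfTaskSlots: 1
# Short restart delay so the standby TaskManager is picked up quickly on failure
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 5 s
```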
Enable high availability option for stream pipelines
The Flink JobManager coordinates the scheduling and resource management for each stream pipeline. By default, a single JobManager is created for each pipeline. This creates a single point of failure - if the JobManager crashes, the running pipeline fails.
To prevent this from happening, enable the high availability option of the Flink JobManager. When high availability is enabled, a second JobManager is deployed as a standby for the pipeline, in a different availability zone from the first. The JobManagers are managed by ZooKeeper, which coordinates leader election and pipeline state. If the primary JobManager fails, the standby JobManager quickly takes over and the pipeline continues to run. The failed JobManager is then restarted and becomes the new standby, reestablishing high availability against future failures. This reduces processing downtime to near zero and is especially beneficial for time-sensitive stream data processing applications.
The option to enable High Availability is available during the Activate, Resume, and Upgrade operations.
Note that enabling this feature introduces additional cost for the extra resources.
The additional resources required to run a stream pipeline JobManager with high availability are:
- Resources for the second JobManager (same size as the first one).
- Resources for ZooKeeper: 1.5 CPU and 1.5 GB of RAM.
The cost of these additional resources is added to the original cost of the pipeline.
More information is available in the Apache Flink documentation on high availability.
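Under the hood, this corresponds to Apache Flink's ZooKeeper-based high availability services. A minimal configuration sketch follows; the quorum address and storage path are placeholders, and on the HERE platform these resources are provisioned for you when the option is enabled:

```yaml
# Use ZooKeeper for JobManager leader election and failover coordination
high-availability: zookeeper
# ZooKeeper quorum address (placeholder)
high-availability.zookeeper.quorum: zk-host:2181
# Durable storage for JobManager metadata the standby needs on failover (placeholder)
high-availability.storageDir: hdfs:///flink/ha/
```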
Enable multi-region setup for pipelines
Both batch and stream pipelines that require minimal downtime in the event of a primary region failure can be configured with the multi-region option: if the primary region fails, the pipeline is automatically transferred to the secondary region.
Running pipelines are restarted in the secondary region, and non-running pipelines are transferred as they are. When the primary region becomes available again, the pipelines are transferred back to it in the same manner.
For example, if the primary region fails, an On-demand (Run now) batch pipeline in the Running state is automatically reactivated in the secondary region.
However, a Failed On-demand batch pipeline needs to be manually reactivated.
For a batch pipeline, the Spark history details are also transferred during a region failure.
Note: Additional cost for multi-region setup
The transfer of the pipeline state between two regions is billed as part of Pipeline I/O.
To enable multi-region setup, make sure that the following settings are configured:
- For a pipeline to work successfully in the secondary region, the input and output catalogs used by the pipeline must be available in the secondary region.
- For a stream pipeline to successfully utilize the multi-region option, it's important to enable checkpointing within the application code to allow Flink to take periodic savepoints while running the pipeline in the primary region. When the primary region fails, the last available savepoint is used to restart the pipeline in the secondary region. For more information on checkpointing and savepoints, see the Enable checkpointing for stream pipelines chapter.
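As a sketch of the application-side requirement in the second point, the relevant Apache Flink settings look like the following. The interval is illustrative and the storage paths are placeholders; on the HERE platform, snapshot storage and region replication are managed for you:

```yaml
# Periodic checkpoints so a recent snapshot exists to restore from in the secondary region
execution.checkpointing.interval: 60 s
# Durable locations for checkpoint and savepoint state (placeholder paths)
state.checkpoints.dir: s3://example-bucket/checkpoints/
state.savepoints.dir: s3://example-bucket/savepoints/
```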
Without a multi-region setup, your pipelines are vulnerable to downtime if the primary region experiences a failure. This can result in operational disruptions, impacting overall pipeline reliability and performance.