How to monitor a pipeline

The HERE platform pipelines generate a number of standard statistics and metrics that can be used to track their status over time. They are visualized by different tools, such as the Spark UI, the Flink Dashboard, and Grafana-powered dashboards, which allow you to monitor and analyze these metrics, logs, and traces. These metrics are also used to generate alerts for specific events associated with a pipeline, such as a failed pipeline job.
This section demonstrates how to monitor pipeline status with the tools mentioned above and how to customize an existing Grafana alert, for example, one that sends you an email whenever a job fails.

For more information on the standard metrics available for batch and stream pipelines, refer to the Logs, Monitoring, and Alerts User Guide.

Use Spark UI to inspect batch data processing jobs

The Spark framework includes a web console that is active for all Spark jobs that have reached the Running state. It is called the Spark UI and it provides insight into batch pipeline processing, including jobs, stages, execution graphs, and logs from the executors. The Data Processing Library components also publish various statistics (see Spark AccumulatorV2), such as the number of metadata partitions read and the number of data bytes downloaded or uploaded. These statistics appear in the stages where the corresponding operations were performed.
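The same information that the Spark UI renders is also exposed by Spark's monitoring REST API under `/api/v1` on the driver, which is handy for scripted checks. Below is a minimal sketch of querying it from Python; the base URL and application ID are placeholder assumptions for illustration:

```python
import json
import urllib.request

# Hypothetical driver address; by default the Spark UI is served on port 4040.
SPARK_UI_BASE = "http://localhost:4040"

def stages_url(base: str, app_id: str) -> str:
    """Build the Spark monitoring REST endpoint that lists all stages of an application."""
    return f"{base}/api/v1/applications/{app_id}/stages"

def failed_stages(stages: list) -> list:
    """Return the names of stages whose status is FAILED."""
    return [s["name"] for s in stages if s.get("status") == "FAILED"]

# Example usage against a running driver (requires a live Spark application):
#   with urllib.request.urlopen(stages_url(SPARK_UI_BASE, "app-20240101120000-0001")) as resp:
#       print(failed_stages(json.load(resp)))
```

Polling the stage list this way lets you spot failed or unusually long stages without keeping the browser UI open.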

The Spark UI is a useful tool for optimizing the performance of your Spark jobs, troubleshooting job failures, and identifying issues such as data skew and slow stages.

For additional information on the Spark UI, see the following chapters:

Use Flink Dashboard to inspect stream data processing jobs

The Flink framework includes a web interface that is available for all Flink jobs in the Running state. It is called the Flink Dashboard and it allows you to inspect the execution and status of Flink jobs, tasks, and operators, visualize job performance, and check the status of job checkpoints, savepoints, and recovery mechanisms.

The Flink Dashboard is a useful tool for optimizing the performance of your Flink jobs, troubleshooting task failures, and identifying issues, such as slow tasks, backpressure, and resource bottlenecks.
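Like the Spark UI, the Flink Dashboard is backed by a REST API served on the same port as the web interface (8081 by default), so job states can also be checked programmatically. A minimal sketch, assuming a locally reachable JobManager:

```python
from collections import Counter

# Hypothetical JobManager address; the Flink Dashboard and REST API share this port.
FLINK_REST_BASE = "http://localhost:8081"

def jobs_overview_url(base: str) -> str:
    """Build the REST endpoint that lists all jobs with their current state."""
    return f"{base}/jobs/overview"

def jobs_by_state(overview: dict) -> Counter:
    """Count jobs per state (RUNNING, FAILED, ...) from a /jobs/overview payload."""
    return Counter(job["state"] for job in overview.get("jobs", []))

# Example usage against a running JobManager (requires a live Flink cluster):
#   import json, urllib.request
#   with urllib.request.urlopen(jobs_overview_url(FLINK_REST_BASE)) as resp:
#       print(jobs_by_state(json.load(resp)))
```

A count of jobs per state is a quick way to notice failed or restarting jobs before digging into individual tasks in the Dashboard.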

For additional information on the Flink Dashboard, see the following chapters:

Use Grafana dashboard to monitor pipeline status

Metrics generated by the data processing pipelines can be displayed on different dashboards powered by Grafana, which is a platform for monitoring and visualizing real-time data.
To monitor the status of pipelines, the OLP Pipeline Status dashboard is available. To access it from the platform portal home page, first open the Launcher menu. In the Manage section, you will see two monitoring pages, one for the EU (Ireland) region and one for the US (Oregon) region. This section uses the Grafana instance for the EU (Ireland) region, so the next step is to select the Regional monitoring EU (Ireland) page:

pipeline-monitoring-1.png

This will take you to the following Grafana home page:

pipeline-monitoring-2.png

As you can see, several default dashboards are listed on the left side of the page while the available User Defined Alerts are listed on the right side.

Now we need to open the OLP Pipeline Status dashboard. To do this, first go to the Browse page by selecting the appropriate option from the Dashboards menu:

pipeline-monitoring-3.png

This page contains all the dashboards available for this Grafana instance. To find the OLP Pipeline Status dashboard among them, enter its name in the search bar and open the dashboard with the label OLPDefault:

pipeline-monitoring-4.png

The OLP Pipeline Status dashboard has been created for demonstration purposes and shows the pipeline jobs with the following statuses:

  • Failed
  • Completed
  • Canceled
  • Submitted
  • Running

Each pipeline status is color-coded for quick identification:

pipeline-monitoring-5.png

You can use the filters available for this dashboard to retrieve the data more precisely. For example, the Pipeline Job Status filter displays only jobs with a particular status:

pipeline-monitoring-6.png

Note

Default dashboard settings
Default time period: Last 24 hours.
Default refresh interval: 30 minutes.

If you choose a larger time period in Grafana, it downsamples the data and shows fewer data points than were actually recorded. This makes the dashboard respond more quickly, but to see more accurate data, shorten the time period you are looking at.

Use Grafana alerts to monitor and respond to pipeline issues

A Grafana alert is a notification triggered by specific events associated with a pipeline, allowing users to respond to potential issues or anomalies. Grafana lets you set alerts and request email notifications when a condition or threshold is met.

There is one configured alert in the above dashboard. To access its settings, edit the Pipeline Jobs' Status panel:

pipeline-monitoring-7-1.png

All alert settings are available on the Alert tab:

pipeline-monitoring-8.png

This alert is triggered if at least one pipeline job has failed within the last one-minute interval. When this happens, an email is generated and sent to the recipient list specified in the notification channel attached to the alert. This particular alert has no attached notification channels because it was created for demonstration purposes only and is not intended to act as a real alert. However, alerts on similar dashboards are usually configured with notification channels, and those settings look like this:
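The alert condition itself is simple: count the job failures in a one-minute window and fire when there is at least one. A minimal Python sketch of that evaluation rule (an illustration of the condition, not Grafana's implementation):

```python
def alert_should_fire(failure_timestamps, now, window_seconds=60):
    """Return True when at least one job failure falls within the evaluation window.

    failure_timestamps: Unix timestamps (seconds) of observed job failures.
    now: Unix timestamp at which the condition is evaluated.
    """
    return any(0 <= now - t <= window_seconds for t in failure_timestamps)
```

Grafana evaluates such a condition on a schedule; a failure older than the window no longer keeps the alert firing.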

pipeline-monitoring-9.png

Let's take a closer look at this notification channel. First, open the Notification channels tab from the Alerting menu:

pipeline-monitoring-10.png

Select the OLP Pipeline Failure Notification channel to open its settings:

pipeline-monitoring-10-1.png

A new tab opens with all the settings related to this notification channel, including notification type, list of recipients, notification-specific settings, etc.:

pipeline-monitoring-11.png

If you are interested in creating Grafana dashboards from scratch, creating alerts, and setting up notification channels, see the following articles:

Alert behavior and limitations

Failure emails are only sent when the alert state changes. For example, if a pipeline job fails, the alert enters the Alerting state and a failure email is sent to the specified recipients. If another pipeline job fails within the default alert interval (1 minute for the alert described above), a second email is not sent because the alert is already in the Alerting state. Before any subsequent failure can trigger another alert email, the alert must first transition to the No Data state, which happens automatically at the end of the 1-minute interval. So if two pipeline jobs fail within the same 1-minute interval, only one Alerting email is sent, followed by a No Data email at the end of the interval.

Note

In this case, the No Data alert that is sent at the end of the 1-minute interval does not indicate any problems with the data processing pipelines.

This is an inherent behavior of Grafana and not a limitation of the HERE platform. The diagram below illustrates this behavior and shows why the second pipeline failure does not trigger a second email:

Sequence diagram of Grafana alert handling.

See also