Operational modes of HERE Anonymizer Self-Hosted

HERE Anonymizer Self-Hosted supports two operational modes: streaming and batch. The modes differ in some of their parameters, and some connectors work in only one of the two modes.
The core anonymization algorithms are the same in both modes.

Streaming mode

In streaming mode, HERE Anonymizer Self-Hosted runs continuously and the Apache Flink process never ends.

Apache Flink's streaming mode is designed for real-time processing and analysis of unbounded data streams.

This means that neither the end of the stream nor the total size of the data to be processed is known in advance.

Data is processed as it arrives, within time windows, which allows accurate, low-latency computations even when events arrive out of order.
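To illustrate how windowed processing can cope with out-of-order events, the following Python sketch groups timestamped events into fixed-size event-time windows and emits a window only after a watermark has passed its end. It is a simplified model of the general concept, not Apache Flink's API and not the HERE Anonymizer Self-Hosted implementation; the names `tumbling_windows`, `window_size`, and `allowed_lateness` are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_windows(events, window_size, allowed_lateness):
    """Group (timestamp, value) events into event-time tumbling windows.

    The watermark is the highest timestamp seen so far minus
    allowed_lateness. A window is emitted once the watermark reaches
    its end; events for an already emitted window are dropped.
    """
    open_windows = defaultdict(list)
    watermark = float("-inf")
    for ts, value in events:
        start = ts - (ts % window_size)
        if start + window_size <= watermark:
            continue  # too late: this window was already emitted
        open_windows[start].append(value)
        watermark = max(watermark, ts - allowed_lateness)
        closable = sorted(
            s for s in open_windows if s + window_size <= watermark
        )
        for s in closable:
            yield s, open_windows.pop(s)
    # A real stream never ends; this flush only lets the demo terminate.
    for s in sorted(open_windows):
        yield s, open_windows.pop(s)


# Hypothetical events: timestamps 2 and 9 arrive out of order,
# timestamp 4 arrives after its window has already been emitted.
events = [(1, "a"), (3, "b"), (2, "c"), (11, "d"), (9, "e"),
          (14, "f"), (25, "g"), (4, "late")]
windows = list(tumbling_windows(events, window_size=10, allowed_lateness=3))
```

In this sketch, a larger `allowed_lateness` tolerates more disorder at the cost of emitting each window later.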

Batch mode

In batch mode, HERE Anonymizer Self-Hosted processes a fixed set of data and then finishes the job. It is active only while the data is being processed.

Apache Flink's batch mode handles bounded datasets and thereby supports traditional batch processing.

This means that the size and the end of the batch data are known when processing starts.

Note that batch mode works only with connectors for persistent storage.

Batch mode requires the input data to be organized by trace: a single file can contain
one or more complete traces, but a single trace cannot be split across multiple files.
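The constraint that no trace spans several files can be checked before a batch run. The following Python sketch is one way to do that, assuming the trace IDs of each input file have already been extracted into a mapping; the function name `find_split_traces`, the file names, and the trace IDs are hypothetical and not part of the product.

```python
def find_split_traces(files):
    """Return {trace_id: [file names]} for traces found in more than one file.

    `files` maps a file name to the trace IDs of the records it contains.
    """
    seen = {}
    for name, trace_ids in files.items():
        # set() so a trace with many records in one file counts once
        for trace_id in set(trace_ids):
            seen.setdefault(trace_id, []).append(name)
    return {t: sorted(names) for t, names in seen.items() if len(names) > 1}


# Hypothetical input: trace "t2" is split across two files.
files = {
    "part-0.csv": ["t1", "t1", "t2"],
    "part-1.csv": ["t3"],
    "part-2.csv": ["t2", "t4"],
}
split = find_split_traces(files)
```

An empty result means that every trace is contained in exactly one file.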

Batch data is usually very large, so it requires upfront indexing and sorting to be processed efficiently.

The HERE Anonymizer Self-Hosted deliverable package contains a dedicated preprocessor application
that performs this indexing and sorting and arranges the input data so that HERE Anonymizer Self-Hosted can process it
efficiently in batch mode. This two-phase approach makes it possible to process data sets that are far larger than the memory
capacity of the processing cluster. The same preprocessed input data set can also be reused for multiple anonymization runs
with different parameters.
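The idea behind the two-phase approach can be sketched in Python as follows. This is an illustration of the general technique (partition by trace, sort each partition, then process one partition at a time), not the actual preprocessor; the record layout, the bucket files, and the `keep_every` parameter are assumptions made for this example.

```python
import json
import tempfile
import zlib
from pathlib import Path


def preprocess(records, out_dir, num_buckets):
    """Phase 1: write each record to a bucket file chosen by its trace ID,
    so every trace ends up complete in exactly one file, then sort each
    bucket by trace ID and timestamp."""
    paths = [Path(out_dir) / f"bucket-{i}.jsonl" for i in range(num_buckets)]
    handles = [p.open("w") for p in paths]
    try:
        for rec in records:
            i = zlib.crc32(rec["trace_id"].encode()) % num_buckets
            handles[i].write(json.dumps(rec) + "\n")
    finally:
        for h in handles:
            h.close()
    for p in paths:  # only one bucket is held in memory at a time
        rows = [json.loads(line) for line in p.read_text().splitlines()]
        rows.sort(key=lambda r: (r["trace_id"], r["ts"]))
        p.write_text("".join(json.dumps(r) + "\n" for r in rows))
    return paths


def run(paths, keep_every):
    """Phase 2 stand-in: read one bucket at a time and keep every n-th
    point of each trace (a placeholder for a real anonymization step)."""
    kept = []
    for p in paths:
        position = {}
        for line in p.read_text().splitlines():
            r = json.loads(line)
            n = position.get(r["trace_id"], 0)
            position[r["trace_id"]] = n + 1
            if n % keep_every == 0:
                kept.append((r["trace_id"], r["ts"]))
    return kept


# Hypothetical unsorted input with two interleaved traces.
sample = [
    {"trace_id": "a", "ts": 2}, {"trace_id": "b", "ts": 1},
    {"trace_id": "a", "ts": 1}, {"trace_id": "b", "ts": 2},
    {"trace_id": "a", "ts": 3},
]
with tempfile.TemporaryDirectory() as tmp:
    buckets = preprocess(sample, tmp, num_buckets=2)
    # Two runs with different parameters reuse the same preprocessed data.
    full = run(buckets, keep_every=1)
    thinned = run(buckets, keep_every=2)
```

Because each bucket is sorted and processed independently, memory use in this sketch depends on the largest bucket rather than on the total data size.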