Data Processing Library incremental compilation
The Data Processing Library runs complex compile patterns incrementally,
whenever possible. This is based on the principle that an incremental run
produces output that is identical to that of a non-incremental
run; only faster. If an incremental compilation from version N to N+1 produces
a different commit than just compiling version N+1 and publishing the changed
payloads, this is considered an error. The Driver may prevent incremental
compilation if any of the following conditions is met:
- The library is instructed by the calling environment to compile one or more
  input catalogs fully, for example, `Reprocess` is present in the `Job`
  description instead of `Changes` or `NoChange`.
- The output catalog is empty; the first run can therefore never be incremental.
- The output catalog contains dependencies that are logically incompatible with
  the changes being compiled. For example, suppose the dependencies of the
  latest commit on the output catalog state that it was derived from input
  catalog A at version N. If the changes being compiled match, that is, the
  library is scheduled to compile changes of catalog A from version N to
  version N+X, then incremental compilation is allowed. If instead the library
  is scheduled to compile changes of catalog A from version N+Y to version N+Z,
  the dependencies are considered invalid, so incremental compilation is not
  attempted; version N+Z is still compiled.
- Software changes: fingerprints of the library and of your code from the last
  run are compared with the current ones; incremental compilation is not
  attempted if the fingerprints do not match.
- Configuration changes: fingerprints of the configuration from the last run
  are compared with the current ones from the active configuration; incremental
  compilation is not attempted if the fingerprints do not match.
- Changes in shared variables: compilers may depend on global variables or
  external content that comes from outside of the library. See the next
  chapter.
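The checks above can be sketched as follows. This is a hypothetical, simplified illustration, not the library's implementation: the `Fingerprints`, `InputState`, and `canRunIncrementally` names and shapes here are stand-ins chosen for this example.

```scala
// Simplified stand-in types; the real Driver and Fingerprints differ.
case class Fingerprints(libraryHash: String, codeHash: String, configHash: String)

sealed trait InputState
case object NoChange extends InputState
case class Changes(fromVersion: Long, toVersion: Long) extends InputState
case class Reprocess(version: Long) extends InputState

def canRunIncrementally(
    inputs: Seq[InputState],
    outputCatalogEmpty: Boolean,
    derivedFromVersion: Option[Long], // version N recorded in the last commit
    previous: Option[Fingerprints],   // fingerprints persisted by the last run
    current: Fingerprints): Boolean = {
  // No input catalog may be scheduled for full reprocessing.
  val noReprocess = inputs.forall {
    case Reprocess(_) => false
    case _            => true
  }
  // Changes must start at the version the output was derived from.
  val dependenciesCompatible = inputs.forall {
    case Changes(from, _) => derivedFromVersion.contains(from)
    case _                => true
  }
  // Software and configuration fingerprints must match the last run.
  val fingerprintsMatch = previous.contains(current)
  noReprocess && !outputCatalogEmpty && dependenciesCompatible && fingerprintsMatch
}
```

For example, compiling `Changes(3, 5)` over an output derived from version 3 with unchanged fingerprints passes all checks, while `Changes(4, 6)` over the same output fails the dependency check.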
Access external content
It is possible for compilers to access external content, that is, content not available in the input catalogs. The processing library does not and cannot block access to external data, but your compiler must make the library aware of this data.
Unless the compiler is a non-incremental compiler, the processing library must
be aware of this external content. This is because, if the external content
changes, it is not safe to run incrementally: some output partitions would
contain content calculated from the updated externals, while unchanged
partitions would keep content calculated from the previous externals. This may
render the output catalog inconsistent, invalid, or result in unintended
behavior. Therefore, when the external content changes, it is important to
notify the Driver so it runs non-incrementally once, to update
all the output partitions. Subsequent runs are then incremental again.
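The inconsistency can be made concrete with a toy model, not the library API: here each output partition records the external value it was compiled with, and an incremental run recompiles only the partitions whose inputs changed.

```scala
// Toy model: a payload remembers the external value used to compile it.
case class Payload(input: Int, external: Int)

def compilePartition(input: Int, external: Int): Payload =
  Payload(input, external)

// Full run: every partition is compiled with the same external value.
val externalV1 = 1
val fullRun = Map(
  "a" -> compilePartition(10, externalV1),
  "b" -> compilePartition(20, externalV1))

// The external content changes, but only partition "a" has an input
// change, so an (unsafe) incremental run touches only "a":
val externalV2 = 2
val incrementalRun = fullRun + ("a" -> compilePartition(11, externalV2))

// The output catalog is now inconsistent: "a" reflects the new external
// value, while "b" still reflects the old one.
```

A forced non-incremental run would recompile both partitions with `externalV2`, restoring consistency.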
While implementing one of the setup pattern children of the DriverSetup
interface, you can access a DriverContext and its Fingerprints.
Using the addCustomHash method, you register the hash of external content.
If this hash is different from the one registered in the previous run, the
Driver will not run incrementally. Hashes are persisted together with
Fingerprints and checked automatically.
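A stable digest of the external content is the kind of value you could register. The sketch below is illustrative only: it uses the JDK's `MessageDigest` to derive a hash and shows the comparison the library performs for you; `contentHash` and `externalContentChanged` are hypothetical helper names, not library API.

```scala
import java.security.MessageDigest

// Compute a stable SHA-256 hex digest of external content; a value of
// this kind could be registered via addCustomHash.
def contentHash(content: Array[Byte]): String =
  MessageDigest
    .getInstance("SHA-256")
    .digest(content)
    .map(b => f"${b & 0xff}%02x")
    .mkString

// If the digest differs from the one persisted with the previous run's
// fingerprints (or none was persisted), a non-incremental run is needed.
// The real check is performed by the Driver, not by user code.
def externalContentChanged(previousHash: Option[String], content: Array[Byte]): Boolean =
  !previousHash.contains(contentHash(content))
```

Hashing the content itself, rather than, say, a timestamp, ensures that re-downloading identical external data does not needlessly force a non-incremental run.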
Not registering the hash of external content may make incremental compilation unsafe or lead to unwanted final behavior. Registering the hash is not mandatory, as long as you are aware of the consequences.
If the content is big, you can use the toBroadcast method in the broadcast
package. This creates a Spark broadcast in a safe manner for incremental
compilation from the value you provide. The Spark broadcast mechanism
ensures efficient distribution of the content to the nodes. For more details, see
Broadcast Input Layers to Executor Nodes.