Data Processing Library incremental compilation
The Data Processing Library runs complex compile patterns incrementally,
whenever possible. This is based on the principle that an incremental run
produces output that is identical to that of a non-incremental
run; only faster. If an incremental compilation from version N to N+1 produces
a different commit than just compiling version N+1 and publishing the changed
payloads, this is considered an error. The Driver may prevent incremental
compilation if any of the following conditions is met:
- The library is instructed by the calling environment to compile one or more
  input catalogs fully, for example, `Reprocess` is present in the `Job`
  description instead of `Changes` or `NoChange`.
- The output catalog is empty; the first run can therefore never be incremental.
- The output catalog contains dependencies that are logically incompatible with
  the changes being compiled. For example, suppose the dependencies of the
  latest commit on the output catalog state that it was derived from input
  catalog A at version N. If the changes being compiled match, that is, the
  library is scheduled to compile changes of catalog A from version N to
  version N+X, then incremental compilation is allowed. If instead the library
  is scheduled to compile changes of catalog A from version N+Y to version N+Z,
  the dependencies are considered invalid, so incremental compilation is not
  attempted; version N+Z is still compiled.
- Software changes: fingerprints of the library and of your code from the last
  run are compared with the current ones; incremental compilation is not
  attempted if the fingerprints do not match.
- Configuration changes: fingerprints of the configuration from the last run
  are compared with the current ones from the active configuration; incremental
  compilation is not attempted if the fingerprints do not match.
- Changes in shared variables: compilers may depend on global variables or
  external content that comes from outside of the library. See the next
  chapter.
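The checks above can be sketched as follows. This is a hypothetical, simplified illustration, not the library's implementation: the `Fingerprints`, `InputState`, and `canRunIncrementally` names and shapes here are stand-ins chosen for this example.

```scala
// Simplified stand-in types; the real Driver and Fingerprints differ.
case class Fingerprints(libraryHash: String, codeHash: String, configHash: String)

sealed trait InputState
case object NoChange extends InputState
case class Changes(fromVersion: Long, toVersion: Long) extends InputState
case class Reprocess(version: Long) extends InputState

def canRunIncrementally(
    inputs: Seq[InputState],
    outputCatalogEmpty: Boolean,
    derivedFromVersion: Option[Long], // version N recorded in the last commit
    previous: Option[Fingerprints],   // fingerprints persisted by the last run
    current: Fingerprints): Boolean = {
  // No input catalog may be scheduled for full reprocessing.
  val noReprocess = inputs.forall {
    case Reprocess(_) => false
    case _            => true
  }
  // Changes must start at the version the output was derived from.
  val dependenciesCompatible = inputs.forall {
    case Changes(from, _) => derivedFromVersion.contains(from)
    case _                => true
  }
  // Software and configuration fingerprints must match the last run.
  val fingerprintsMatch = previous.contains(current)
  noReprocess && !outputCatalogEmpty && dependenciesCompatible && fingerprintsMatch
}
```

For example, compiling `Changes(3, 5)` over an output derived from version 3 with unchanged fingerprints passes all checks, while `Changes(4, 6)` over the same output fails the dependency check.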
Access external content
It is possible for compilers to access external content, that is, content not available in the input catalogs. The processing library does not and cannot block access to external data, but your compiler must make the library aware of this data.
Unless the compiler is a non-incremental compiler, the processing library must
be aware of this external content. This is because, if the external content
changes, it is not safe to run incrementally: some output partitions would
contain content calculated from the updated externals, while unchanged
partitions would keep content calculated from the previous externals. This may
render the output catalog inconsistent, invalid, or result in unintended
behavior. Therefore, when the external content changes, it is important to
notify the Driver so it runs non-incrementally once, to update
all the output partitions. Subsequent runs are then incremental again.
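The inconsistency can be made concrete with a toy model, not the library API: here each output partition records the external value it was compiled with, and an incremental run recompiles only the partitions whose inputs changed.

```scala
// Toy model: a payload remembers the external value used to compile it.
case class Payload(input: Int, external: Int)

def compilePartition(input: Int, external: Int): Payload =
  Payload(input, external)

// Full run: every partition is compiled with the same external value.
val externalV1 = 1
val fullRun = Map(
  "a" -> compilePartition(10, externalV1),
  "b" -> compilePartition(20, externalV1))

// The external content changes, but only partition "a" has an input
// change, so an (unsafe) incremental run touches only "a":
val externalV2 = 2
val incrementalRun = fullRun + ("a" -> compilePartition(11, externalV2))

// The output catalog is now inconsistent: "a" reflects the new external
// value, while "b" still reflects the old one.
```

A forced non-incremental run would recompile both partitions with `externalV2`, restoring consistency.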
While implementing one of the setup pattern children of the DriverSetup
interface, you can access a DriverContext and its Fingerprints.
Using the addCustomHash method, you register the hash of external content.
If this hash is different from the one registered in the previous run, the
Driver will not run incrementally. Hashes are persisted together with
Fingerprints and checked automatically.
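A stable digest of the external content is the kind of value you could register. The sketch below is illustrative only: it uses the JDK's `MessageDigest` to derive a hash and shows the comparison the library performs for you; `contentHash` and `externalContentChanged` are hypothetical helper names, not library API.

```scala
import java.security.MessageDigest

// Compute a stable SHA-256 hex digest of external content; a value of
// this kind could be registered via addCustomHash.
def contentHash(content: Array[Byte]): String =
  MessageDigest
    .getInstance("SHA-256")
    .digest(content)
    .map(b => f"${b & 0xff}%02x")
    .mkString

// If the digest differs from the one persisted with the previous run's
// fingerprints (or none was persisted), a non-incremental run is needed.
// The real check is performed by the Driver, not by user code.
def externalContentChanged(previousHash: Option[String], content: Array[Byte]): Boolean =
  !previousHash.contains(contentHash(content))
```

Hashing the content itself, rather than, say, a timestamp, ensures that re-downloading identical external data does not needlessly force a non-incremental run.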
Not registering the hash of external content may make incremental compilation unsafe or lead to unwanted final behavior. Registering the hash is not mandatory, as long as you are aware of the consequences.
If the content is big, you can use the toBroadcast method in the broadcast
package. This creates a Spark broadcast in a safe manner for incremental
compilation from the value you provide. The Spark broadcast mechanism
ensures efficient distribution of the content to the nodes. For more details, see
Broadcast Input Layers to Executor Nodes.