How to filter input partitions
How to filter input partitions
Sometimes, you may have tasks that only need a subset of the input layer partitions to produce their output layers. In such cases, it is usually desirable to exclude unnecessary partition keys as early as possible in the execution process to avoid wasting CPU, memory, and network bandwidth.
While it is possible to manually filter the input keys in the compiler's front-end implementation for most compilation patterns, for some patterns, such as the RefTreeCompiler, it can be difficult to know if a key should be filtered, or not, since we typically do not know if a subject or a referenced partition is being processed.
For these cases, the library provides a way to configure a partition filter to select partition keys and metadata from the list of partitions to process. This configuration works as follows:
- For the RefTreeCompiler, only the subject
partitions are filtered. This is something to remember, particularly if an
input layer is itself the product of an upstream compiler in a
multi-compiler driver task with filtering applied. If a compiler down the
chain references neighbor partition names from a layer produced by an
upstream compiler, make sure that the filtering is permissive enough to
output all partitions to be referenced by any following compilation in the
driver task. You can use the
byIdexecutor config to configure a different filter for a compiler. - The
here.platform.data-processing.executors.partitionKeyFiltersconfiguration is part of fingerprints, which means that changing this configuration triggers a non-incremental run for the compilation to remain deterministic. - Partition key filters can also be used with the
DeltaSetinterface. Partition key filters defined inhere.platform.data-processing.deltasets.partitionKeyFiltersare used in allqueryandreadBackDeltaSet transformations to filter the input partitions.
Configure filters
Partition key filtering can be configured in the application.conf.
For example, this filter can configure a task to only process partitions
contained within a given latitude/longitude bounding box:
// Note: use here.platform.data-processing.deltasets.partitionKeyFilters to filter deltasets.
here.platform.data-processing.executors.partitionKeyFilters = [
{
className = "BoundingBoxFilter"
param.boundingBox { north = 24.8, south = 24.68, east = 121.8, west = 121.7 }
}
]The root of the property is a list, multiple filters specified at this level
are combined as their union (OR logic). If there is a need to combine
them with AND logic, they can be put under a single AndFilter at the root.
This is the list of built-in filters:
BoundingBoxFilterAllowListFilterAndFilterOrFilterNotFilter
A custom filter can also be applied from applications by extending the
PartitionKeyFilter. These filters return either true or false from
shouldProcess. Boolean logic is applied to combine them up to the root
filter, such as:
Orof two bounding boxes is a unionAndof two bounding boxes is an intersectionNotof two bounding boxes means that only partitions outside of the underlying bounding box are considered
Filters can only be applied on partition key parameters, for example, catalog and layer IDs, and partition name.
NoteFor performance reasons, filters do not filter based on partition metadata or payloads.
Override filters for specific compilers
If one of the compilers needs a different partition filter applied, you can
use the byId mechanism to configure a different set of filter for the
executor that wraps it. Overriding these filters means replacing the whole
default set of filters, as they are not combined.
Example:
// Application-wide default filter.
here.platform.data-processing.executors.partitionKeyFilters = [
{
className = "BoundingBoxFilter"
param.boundingBox { north = 24.8, south = 24.68, east = 121.8, west = 121.7 }
}
]
// Filters can be overridden for specific executors/compilers using their Executor.Id.
here.platform.data-processing.executors.byId {
// Apply a larger bounding box filter to the input layers of this compiler to
// include neighbor partitions.
intermediate-compiler.partitionKeyFilters = [
{
className = "BoundingBoxFilter"
param.boundingBox { north = 24.9, south = 24.58, east = 121.9, west = 121.6 }
}
]
// This compiler isn't using filtering.
another-compiler.partitionKeyFilters = []
}Apply a filter to specific layers only
The AllowListFilter can be used to filter based on a fixed list of partition
names. When combined with Boolean operation filters, AllowListFilter can also
be used to apply some filters to specific layers only.
For example, to apply a bounding box only to inLayer of inCatalog, you can
configure your application as follows:
// Note: use here.platform.data-processing.deltasets.partitionKeyFilters to filter deltasets.
// Match (process) a partition key if:
here.platform.data-processing.executors.partitionKeyFilters = [
// NOT layer in inCatalog:inLayer
{
className = "NotFilter"
param.operand = {
className = "AllowListFilter"
param.catalogsAndLayers = {"inCatalog": ["inLayer"]}
}
},
// OR (layer in inCatalog:inLayer AND partition intersects boundingBox)
{
className = "AndFilter"
param.operands = [
{
className = "AllowListFilter"
param.catalogsAndLayers = {"inCatalog": ["inLayer"]}
}, {
className = "BoundingBoxFilter"
param.boundingBox { north = 2.8, south = 2.68, east = 121.8, west = 121.7 }
}
]
}
]For additional information about configuring partition key filters see Configure The Library.
Updated 21 days ago