Opening the Black Boxes in Data Flow Optimization

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today’s systems.

💡 Research Summary

The paper tackles a fundamental challenge in modern big-data processing systems that rely on data-flow abstractions: how to optimize pipelines when many operators are implemented as user-defined functions (UDFs) whose internal semantics are unknown. Traditional query optimizers assume full knowledge of algebraic properties (commutativity, associativity, etc.) and therefore cannot safely reorder or push down operations for black-box components. The authors argue that a small set of observable properties (filtering effect, schema transformation, and cardinality change) is sufficient to reason about operator reordering without a complete algebraic specification.

To obtain these properties, the authors develop a static code analysis framework that inspects the general‑purpose code of UDFs (implemented in Java for the experiments). The analysis extracts which input columns are accessed, whether conditional predicates are present, whether the UDF adds or removes fields, and whether it reduces or expands the number of records. It also checks for side effects that could break reordering safety. The extracted information is stored as metadata (e.g., “selection‑pushdown possible”, “join‑commutable”, “aggregation‑pushdown feasible”).
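To make the idea of deriving properties from UDF code concrete, here is a minimal sketch in Python (3.9 or later). It is not the authors' analysis, which targets Java UDFs; it only illustrates the principle on a toy record model where a record is a dictionary passed in a parameter named `rec`. The function `analyze_udf`, the sample UDF `keep_large_orders`, and the result keys are hypothetical names chosen for this example.

```python
import ast
import textwrap


def analyze_udf(source: str, record_param: str = "rec") -> dict:
    """Statically collect which record fields a UDF reads and writes.

    Assumption (not from the paper): records are dicts indexed by string
    literals, e.g. rec["price"]. Anything else is ignored by this sketch.
    """
    tree = ast.parse(textwrap.dedent(source))
    reads, writes = set(), set()
    has_condition = False

    for node in ast.walk(tree):
        # A conditional hints that the UDF may filter records.
        if isinstance(node, (ast.If, ast.IfExp)):
            has_condition = True

        # Look for rec["field"] in load (read) or store/delete (write) position.
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == record_param
        ):
            key = node.slice
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                if isinstance(node.ctx, (ast.Store, ast.Del)):
                    writes.add(key.value)
                else:
                    reads.add(key.value)

    return {"reads": reads, "writes": writes, "has_condition": has_condition}


# Hypothetical UDF, given as source text so it is analyzed without being run.
UDF_SOURCE = '''
def keep_large_orders(rec):
    if rec["price"] * rec["quantity"] > 1000:
        rec["total"] = rec["price"] * rec["quantity"]
        return rec
    return None
'''

if __name__ == "__main__":
    print(analyze_udf(UDF_SOURCE))
```

The sketch never executes the UDF; it only inspects the syntax tree, which mirrors the static nature of the approach. A realistic analysis would also have to handle aliasing, computed field names, and calls into other code, typically by falling back to conservative answers.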

Using this metadata, the optimizer builds a cost‑based search space analogous to that of relational DBMS optimizers. It can (1) reorder selections to apply the most selective filters early, (2) enumerate bushy join orders when join operators are deemed commutable, and (3) push aggregations down when they are guaranteed to reduce cardinality without altering required attributes. The optimizer respects safety checks: if a UDF has side effects or modifies the schema in a way that later operators depend on, the corresponding rewrite is disallowed.
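The safety check and the "most selective filters early" rule can be sketched as follows. This is a simplified greedy illustration under stated assumptions, not the cost-based enumeration described in the paper: the `OpProps` record, the conflict test on read and write sets, and the `selectivity` estimates are all hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpProps:
    """Hypothetical per-operator metadata produced by a code analysis."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    selectivity: float = 1.0   # assumed fraction of records passed on
    side_effects: bool = False


def can_swap(a: OpProps, b: OpProps) -> bool:
    """Two adjacent operators may be swapped only if neither has side
    effects and neither touches a field the other writes."""
    if a.side_effects or b.side_effects:
        return False
    conflict = (a.reads & b.writes) or (a.writes & b.reads) or (a.writes & b.writes)
    return not conflict


def reorder_filters(ops):
    """Greedily move more selective operators earlier, swapping only
    neighbors that pass the safety check."""
    plan = list(ops)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            left, right = plan[i], plan[i + 1]
            if right.selectivity < left.selectivity and can_swap(left, right):
                plan[i], plan[i + 1] = right, left
                changed = True
    return plan


# Example pipeline: 'enrich' writes field b, which 'filter_b' later reads.
pipeline = [
    OpProps("enrich", reads=frozenset({"a"}), writes=frozenset({"b"})),
    OpProps("filter_c", reads=frozenset({"c"}), selectivity=0.5),
    OpProps("filter_b", reads=frozenset({"b"}), selectivity=0.1),
]

optimized = reorder_filters(pipeline)
print([op.name for op in optimized])
```

In this example `filter_c` moves ahead of `enrich` because the two operators touch disjoint fields, while `filter_b` stays behind `enrich` even though it is more selective, since it reads the field that `enrich` writes.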

The implementation is evaluated on two fronts. First, relational benchmark queries (TPC-H, TPC-DS) are run through the optimizer and compared with a state-of-the-art relational optimizer. The optimizer produces plans of comparable cost, demonstrating that the property-based approach can recover the same rewrite power as traditional algebraic optimizers. Second, a suite of non-relational pipelines (text tokenization followed by word-count aggregation, graph connectivity analysis, and feature extraction for machine learning) is optimized. In these cases, existing systems cannot reorder UDFs, leading to sub-optimal data movement. The proposed optimizer reduces execution time by 15-25% on average and cuts network I/O by a similar margin, especially when bushy join enumeration is applicable.

Key contributions of the work are: (1) identification of a minimal property set that enables safe operator reordering for black‑box UDFs, (2) a static analysis technique that automatically derives these properties from user code, (3) a generic optimizer that integrates property‑based rules into a cost‑based planner, and (4) empirical evidence that the approach matches relational optimizers on classic workloads while uniquely improving non‑relational data‑flow pipelines.

By demonstrating that full algebraic knowledge is unnecessary for many practical optimizations, the paper opens a research direction toward automatic tuning of heterogeneous data pipelines and more cost-effective cloud deployment, with possible future extensions such as dynamic property inference and adaptive runtime re-optimization.

📜 Original Paper Content