SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows

Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for dataflow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy dataflows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: We arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of UDF-heavy dataflows from different domains and compare its performance to three other algorithms for dataflow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to 6.

💡 Research Summary

The paper presents SOFA, an extensible logical optimizer designed specifically for dataflows that rely heavily on user‑defined functions (UDFs) in Map/Reduce‑style systems. Traditional optimizers focus on relational operators and treat UDFs as opaque black boxes, which severely limits their ability to reorder operators, merge them, or push filters down in UDF‑rich pipelines. SOFA addresses this gap by introducing a concise yet expressive set of semantic properties for UDFs (such as input/output schemas, read/write sets, selectivity, side‑effect freedom, and I/O ratios) and organizing these properties in a subsumption hierarchy.
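To illustrate how such properties can license a rewrite, the following minimal sketch checks whether two record-at-a-time UDFs may be swapped, using a standard read/write conflict test. This is an assumption about how the annotations could be used, not SOFA's actual rule set; the class `UdfProps`, the function `can_reorder`, and the example operators are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdfProps:
    """Hypothetical property annotation for a single UDF."""
    name: str
    reads: frozenset            # attributes the UDF reads
    writes: frozenset           # attributes the UDF creates or modifies
    side_effect_free: bool = True


def can_reorder(a: UdfProps, b: UdfProps) -> bool:
    """Return True if swapping a and b cannot change the result.

    Assumed condition: both UDFs are side-effect free, neither writes
    an attribute the other reads, and they write disjoint attributes.
    """
    if not (a.side_effect_free and b.side_effect_free):
        return False
    return (
        a.writes.isdisjoint(b.reads)
        and b.writes.isdisjoint(a.reads)
        and a.writes.isdisjoint(b.writes)
    )


# Illustrative operators (not taken from the paper).
tokenize = UdfProps("tokenize", frozenset({"text"}), frozenset({"tokens"}))
lang_filter = UdfProps("lang_filter", frozenset({"lang"}), frozenset())
tag_entities = UdfProps("tag_entities", frozenset({"tokens"}), frozenset({"entities"}))

print(can_reorder(tokenize, lang_filter))   # no shared attributes
print(can_reorder(tokenize, tag_entities))  # tag_entities reads what tokenize writes
```

In this sketch the filter can move ahead of the tokenizer because it touches none of the tokenizer's attributes, whereas the entity tagger cannot, since it depends on the tokenizer's output.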

The core of the system is the Presto taxonomy, which arranges operators and their properties in a subsumption hierarchy, together with a collection of rewrite templates that encode generic transformation rules (e.g., commutativity, filter push‑down, operator fusion). Because operators are placed in a hierarchy, a newly added operator only needs to declare a single subsumption relationship to an existing parent; it then automatically inherits all rewrite rules that apply to that parent. This design reduces the engineering effort required to support new domain‑specific operators while still exposing a large plan space for optimization.
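The inheritance idea can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the hierarchy is a single-parent tree stored in a dictionary, rules are plain strings, and all operator and rule names (`PARENT`, `RULES`, `register`, `applicable_rules`) are hypothetical rather than part of SOFA or Presto.

```python
# Hypothetical single-parent subsumption hierarchy: child -> parent.
PARENT = {
    "filter": "operator",
    "transform": "operator",
    "lang_filter": "filter",
}

# Hypothetical rewrite rules attached to concepts in the hierarchy.
RULES = {
    "operator": ["reorder_if_no_conflict"],
    "filter": ["push_down"],
}


def lineage(op):
    """Return op followed by all of its ancestors, nearest first."""
    chain = [op]
    while op in PARENT:
        op = PARENT[op]
        chain.append(op)
    return chain


def applicable_rules(op):
    """Collect rules declared on op and inherited from its ancestors."""
    rules = []
    for node in lineage(op):
        for rule in RULES.get(node, []):
            if rule not in rules:
                rules.append(rule)
    return rules


def register(op, parent):
    """Add a new operator by declaring one subsumption relationship."""
    PARENT[op] = parent


# A new operator declares only its parent and inherits the rest.
register("spam_filter", "filter")
print(applicable_rules("spam_filter"))
```

Registering `spam_filter` requires no new rules: it picks up the filter rule and the generic operator rule through its lineage.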

SOFA is implemented on top of the Stratosphere platform. Users write queries in the high‑level Meteor language, which is compiled into a logical algebra called Sopremo. At the Sopremo layer, SOFA accesses the Presto metadata and runtime statistics (selectivity estimates, average output cardinalities) to drive a cost‑based search. The optimizer generates candidate DAGs by applying Presto templates, evaluates their estimated execution cost, and selects the cheapest plan. The search is capable of handling both elementary and complex operators; complex operators may have semantics that differ from the sum of their components, and SOFA’s model captures these differences through separate property annotations.
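The generate, estimate, and select loop described above can be sketched for the simplest case, a linear pipeline. This is a toy model under loud assumptions: plans are operator orderings rather than DAGs, validity is a fixed set of precedence constraints, and cost is per-record work multiplied by the number of records reaching each operator. All names and numbers are hypothetical and are not taken from SOFA's cost model.

```python
from itertools import permutations

# Hypothetical statistics: name -> (cost per input record, selectivity),
# where selectivity is output records divided by input records.
OPS = {
    "tokenize": (5.0, 1.0),
    "lang_filter": (1.0, 0.25),
    "tag_entities": (50.0, 1.0),
}

# Assumed precedence constraints: (a, b) means a must run before b.
MUST_PRECEDE = {("tokenize", "tag_entities")}


def is_valid(plan):
    """A plan is valid if it respects every precedence constraint."""
    pos = {op: i for i, op in enumerate(plan)}
    return all(pos[a] < pos[b] for a, b in MUST_PRECEDE)


def estimated_cost(plan, input_records):
    """Sum per-record cost times the records reaching each operator."""
    total, n = 0.0, float(input_records)
    for op in plan:
        cost, selectivity = OPS[op]
        total += cost * n
        n *= selectivity
    return total


def cheapest_plan(ops, input_records):
    """Enumerate valid orderings and return the cheapest one."""
    candidates = [p for p in permutations(ops) if is_valid(p)]
    return min(candidates, key=lambda p: estimated_cost(p, input_records))


best = cheapest_plan(list(OPS), 1000)
print(best, estimated_cost(best, 1000))
```

With these made-up statistics, the selective and cheap filter is moved to the front so that the expensive entity tagger sees only a quarter of the records. Exhaustive enumeration is used here only for brevity; it would not scale to large plans.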

The evaluation covers twelve real‑world dataflows from four domains: news‑article relationship extraction, web‑log cleansing, biomedical text mining, and machine‑learning feature engineering. The authors compare SOFA against three baseline optimizers: the default Stratosphere optimizer, a Hive‑MapReduce optimizer, and a Spark‑SQL optimizer. All baselines treat UDFs as black boxes. Results show that SOFA consistently produces better plans, achieving an average speed‑up of 2.8× and up to 6× in the most favorable cases. Moreover, SOFA explores a plan space three times larger than the baselines, thanks to the combinatorial power of the subsumption hierarchy and Presto templates.

Limitations are acknowledged. SOFA currently supports only deterministic UDFs; handling nondeterministic or stateful functions would require additional reasoning about side effects. The property annotations still need to be supplied manually, although the hierarchy reduces the amount of work per operator. The cost model relies on static statistics, so significant data distribution changes may necessitate re‑optimization. Future work includes extending the framework to nondeterministic UDFs, incorporating dynamic, runtime‑driven cost estimation, and employing machine‑learning techniques to infer operator properties automatically.

In summary, SOFA demonstrates that a modest set of well‑structured semantic annotations, combined with a hierarchical rewrite taxonomy, can unlock substantial optimization opportunities in UDF‑heavy analytical pipelines. Its extensible design allows new operators to be integrated with minimal effort, and its experimental results show plans that outperform the best plans of competing optimizers by a factor of up to 6.
