One Size Does NOT Fit All: On the Importance of Physical Representations for Datalog Evaluation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Datalog is an increasingly popular recursive query language that is declarative by design, meaning its programs must be translated by an engine into the actual physical execution plan. When generating this plan, a central decision is how to physically represent all involved relations, an aspect in which existing Datalog engines are surprisingly restrictive and often resort to one-size-fits-all solutions. This is surprising because the typical execution plan of a Datalog program performs not just a single type of operation against the physical representations, but a mixture of operations, such as insertions, lookups, and containment checks. Further, the relevance of each operation type depends heavily on the workload characteristics, which range from familiar properties such as the size, multiplicity, and arity of the individual relations to very specific Datalog properties, such as the “interweaving” of rules when relations occur multiple times, and in particular the recursiveness of the query, which might generate new tuples on the fly during evaluation. This indicates that a variety of physical representations, each with its own strengths and weaknesses, is required to meet the specific needs of different workload situations. To evaluate this, we conduct an in-depth experimental study of the interplay between potentially suitable physical representations and seven dimensions of workload characteristics that vary across actual Datalog programs, revealing which properties actually matter. Based on these insights, we design an automatic selection mechanism that utilizes a set of decision trees to identify suitable physical representations for a given workload.


💡 Research Summary

The paper addresses a fundamental yet under‑explored aspect of Datalog execution: the choice of physical representations for relations. While Datalog is declarative, its evaluation mixes insertions, lookups, and containment checks, and the relative importance of these operations varies with workload characteristics such as relation size, multiplicity, arity, key width, the ratio of initialization to query effort, rule interweaving, and recursion intensity. Existing engines typically adopt a one‑size‑fits‑all approach (e.g., a covered B‑Tree for every relation), which can lead to excessive memory consumption or poor random‑access performance in many realistic scenarios.
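To make the operation mix concrete, the following is a minimal sketch of semi-naive evaluation for a two-rule transitive-closure program. It is an illustration only, not PlayLog's generated code: the function name `transitive_closure`, the counter names, and the use of Python's built-in `set` and `dict` as the physical representations are all assumptions made for this example.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of the two-rule program
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    Returns the derived path relation and per-operation counters."""
    # Index edge on its first column so the join can probe it by key.
    edge_index = defaultdict(list)
    for src, dst in set(edges):
        edge_index[src].append(dst)

    path = set(edges)    # full relation, target of containment checks
    delta = set(edges)   # tuples that were new in the previous iteration
    counts = {"insertions": len(path), "lookups": 0, "containment_checks": 0}

    while delta:
        new_delta = set()
        for x, y in delta:
            counts["lookups"] += 1                 # probe edge by key y
            for z in edge_index.get(y, ()):
                counts["containment_checks"] += 1  # is (x, z) already known?
                if (x, z) not in path:
                    path.add((x, z))               # insert the new fact
                    new_delta.add((x, z))
                    counts["insertions"] += 1
        delta = new_delta
    return path, counts


path, counts = transitive_closure([(1, 2), (2, 3), (3, 4)])
print(sorted(path))
print(counts)
```

Even in this tiny program, the `path` relation is hit by containment checks and insertions while `edge` is hit by keyed lookups, so the two relations could plausibly favor different physical representations.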
To overcome this limitation, the authors design a highly modular engine called PlayLog that supports a catalog of 13 physical representations, drawn from combinations of four access types (covered index, unclustered key‑based index, unclustered pointer‑based index, full scan) and five data structures (sorted array, B+‑Tree, radix tree, hash table, and a variant of sorted array with on‑demand growth); not every one of the 20 possible pairings is part of the catalog. PlayLog compiles Datalog programs into specialized C++ code, ensuring that the flexibility of representation does not incur runtime overhead.
The study evaluates these representations across seven workload dimensions using both synthetic benchmarks and four real‑world Datalog applications (program analysis, network monitoring, distributed computing, and storage management). Key findings include: (1) sorted arrays excel for small, read‑heavy relations; (2) pointer‑based indexes are superior for wide, multi‑key relations; (3) key‑based indexes combined with hash tables dominate when keys are narrow and insert rates are high; (4) recursive relations that generate many new facts benefit from structures that support fast bulk appends and deferred sorting.
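Finding (4) can be illustrated with a small sketch of a relation that accepts appends in constant time and defers sorting and deduplication until the next read. This is a hypothetical structure written for this summary; the class name `DeferredSortRelation` and its methods are assumptions and do not describe PlayLog's actual implementation.

```python
import bisect


class DeferredSortRelation:
    """Sorted array of tuples with an unsorted append buffer."""

    def __init__(self):
        self._sorted = []    # sorted, duplicate-free tuples
        self._pending = []   # tuples appended since the last merge

    def append(self, tup):
        # O(1): no ordering is maintained while facts stream in.
        self._pending.append(tup)

    def _merge(self):
        # Pay the sorting and deduplication cost once, on demand.
        if self._pending:
            self._sorted = sorted(set(self._sorted).union(self._pending))
            self._pending = []

    def contains(self, tup):
        self._merge()
        i = bisect.bisect_left(self._sorted, tup)
        return i < len(self._sorted) and self._sorted[i] == tup

    def lookup_prefix(self, key):
        """Return all tuples whose first column equals key."""
        self._merge()
        # (key,) sorts before every tuple (key, ...), so this finds the start.
        i = bisect.bisect_left(self._sorted, (key,))
        result = []
        while i < len(self._sorted) and self._sorted[i][0] == key:
            result.append(self._sorted[i])
            i += 1
        return result


rel = DeferredSortRelation()
for fact in [(3, 1), (1, 2), (1, 1), (1, 2)]:
    rel.append(fact)
print(rel.lookup_prefix(1))
```

The design trades cheap writes for an occasional expensive merge, which suits iterations that derive many facts before the relation is read again. A workload that alternates single appends with lookups would trigger a merge on nearly every read and would likely be better served by a different structure.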
Based on these insights, the authors construct a decision‑tree‑based automatic selection mechanism. By extracting a concise signature of the workload’s seven characteristics, the mechanism traverses pre‑trained decision trees to recommend the most suitable access‑type/data‑structure pair. When applied to the four real workloads, the selected configurations outperform baseline systems (including Soufflé and RecStep) by an average factor of 1.8× and up to 3×, while the selection itself takes only a few milliseconds.
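The selection step can be pictured as a function from a workload signature to a representation. The sketch below hand-codes one small tree that mirrors the four findings listed above. Everything specific in it is an assumption for illustration: the field names, the numeric thresholds, the order of the tests, and the mapping of finding (4) to the on-demand-growth sorted array are invented here, whereas the paper's trees are pre-trained from experimental data. A `None` entry means the findings as summarized do not determine that half of the pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkloadSignature:
    """Hypothetical encoding of the seven workload dimensions."""
    relation_size: int           # number of tuples
    multiplicity: float          # average tuples per key
    arity: int                   # number of columns
    key_width: int               # number of key columns
    init_to_query_ratio: float   # insert effort relative to query effort
    interweaving: int            # occurrences of the relation across rules
    recursion_intensity: float   # share of tuples derived recursively


@dataclass(frozen=True)
class Recommendation:
    access_type: Optional[str]
    data_structure: Optional[str]


def select_representation(sig: WorkloadSignature) -> Recommendation:
    # All thresholds below are made up for illustration.
    if sig.recursion_intensity > 0.5:
        # Finding (4): fast bulk appends, deferred sorting (assumed mapping).
        return Recommendation(None, "sorted array with on-demand growth")
    if sig.relation_size < 10_000 and sig.init_to_query_ratio < 1.0:
        # Finding (1): small, read-heavy relation.
        return Recommendation(None, "sorted array")
    if sig.arity >= 4 and sig.key_width >= 2:
        # Finding (2): wide, multi-key relation.
        return Recommendation("unclustered pointer-based index", None)
    if sig.key_width == 1 and sig.init_to_query_ratio >= 1.0:
        # Finding (3): narrow keys, high insert rate.
        return Recommendation("unclustered key-based index", "hash table")
    # Fallback: the one-size-fits-all baseline of a covered B-Tree.
    return Recommendation("covered index", "B-Tree")


sig = WorkloadSignature(5_000_000, 1.2, 2, 1, 3.0, 2, 0.1)
print(select_representation(sig))
```

A real selector would learn the split points from measurements rather than fix them by hand, but the control flow is the same: a handful of comparisons on the signature, which is consistent with selection taking only a few milliseconds.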
The work is limited to single‑node, main‑memory evaluation; distributed Datalog scenarios, where index replication and partitioning costs become critical, are left for future research. Nonetheless, the paper convincingly demonstrates that a diversified catalog of physical representations, coupled with lightweight automatic selection, can substantially improve Datalog performance and should become a standard component of next‑generation Datalog query planners.

