Associative Array Model of SQL, NoSQL, and NewSQL Databases

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The success of SQL, NoSQL, and NewSQL databases reflects their ability to provide significant functionality and performance benefits for specific domains, such as financial transactions, internet search, and data analysis. The BigDAWG polystore seeks to let applications transparently obtain the benefits of diverse databases while insulating them from the details of those databases. Associative arrays provide a common approach to the mathematics found in different databases: sets (SQL), graphs (NoSQL), and matrices (NewSQL). This work presents the SQL relational model in terms of associative arrays and identifies the key mathematical properties that are preserved within SQL. These properties include associativity, commutativity, distributivity, identities, annihilators, and inverses. Performance measurements on distributivity and associativity show the impact these properties can have on associative array operations. These results demonstrate that associative arrays could provide a mathematical model for polystores to optimize the exchange of data and the execution of queries.


💡 Research Summary

The paper proposes a unified mathematical framework for the three dominant families of modern databases, SQL (relational), NoSQL (key‑value, document, graph), and NewSQL (matrix‑oriented analytical engines), by representing all of them as associative arrays. An associative array is a two‑dimensional sparse structure in which each entry is a value identified by a pair of keys (row key, column key). The authors first show how a traditional relational table maps directly onto this structure: each attribute value of a tuple becomes a triple (row‑key, column‑key, value), where the row‑key encodes the primary or composite key of the tuple and the column‑key encodes the attribute name, so one tuple yields one triple per attribute. With this mapping, the classic relational algebra operators, namely selection (σ), projection (π), join (⨝), union, and difference, are expressed as elementary associative‑array operations such as row/column filtering, key‑based merging, and sparse matrix multiplication.
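The table-to-triples mapping can be illustrated with a minimal Python sketch. The dictionary representation, the function names (`table_to_assoc`, `select`, `project`), and the sample `employees` rows are all assumptions made for illustration; they are not the paper's notation or implementation.

```python
# Hypothetical sketch: an associative array as a dict keyed by (row key, column key).

def table_to_assoc(rows, key_col):
    """Flatten relational rows into (row key, column key) -> value entries."""
    assoc = {}
    for row in rows:
        row_key = str(row[key_col])          # row key encodes the primary key
        for col, val in row.items():
            if val is not None:              # sparse: missing values are simply absent
                assoc[(row_key, col)] = val
    return assoc

def select(assoc, col, pred):
    """Selection (sigma): keep every row whose value in `col` satisfies `pred`."""
    keep = {r for (r, c), v in assoc.items() if c == col and pred(v)}
    return {(r, c): v for (r, c), v in assoc.items() if r in keep}

def project(assoc, cols):
    """Projection (pi): keep only the named column keys."""
    cols = set(cols)
    return {(r, c): v for (r, c), v in assoc.items() if c in cols}

# Illustrative data only.
employees = [
    {"id": 1, "name": "Ada", "dept": "eng", "salary": 120},
    {"id": 2, "name": "Bob", "dept": "ops", "salary": 90},
    {"id": 3, "name": "Cy", "dept": "eng", "salary": None},
]

A = table_to_assoc(employees, "id")
eng = select(A, "dept", lambda d: d == "eng")
names = project(eng, ["name"])
print(names)  # {('1', 'name'): 'Ada', ('3', 'name'): 'Cy'}
```

In this sketch, selection and projection reduce to filtering on row keys and column keys respectively, which is the sense in which relational operators become "row/column filtering".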

Beyond the syntactic translation, the core contribution lies in identifying the algebraic properties that the associative‑array algebra preserves and that query optimisation relies on: associativity, commutativity, distributivity, existence of identity elements, annihilators, and inverses. Associativity guarantees that the grouping of operations does not affect the final result, enabling the query planner to regroup complex expression trees. Commutativity (where applicable) further expands the space of admissible reorderings. Distributivity links the “multiplicative” join‑like operation with the “additive” union‑like operation, allowing the planner to push selections through joins or to factor common sub‑expressions. The empty (all‑zero) array acts as both the additive identity and the multiplicative annihilator, which provides natural shortcuts for eliminating redundant work, while additive inverses enable cancellation techniques useful in incremental view maintenance.
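These properties can be checked on small examples. The sketch below assumes integer‑valued arrays stored as Python dictionaries, with element‑wise addition standing in for the union‑like operation and sparse array multiplication over ordinary (+, ×) standing in for the join‑like operation. The helper names `add`, `mul`, and `neg` and the sample arrays are hypothetical, and the checks are spot tests on one input, not proofs.

```python
from collections import defaultdict

def add(A, B):
    """Element-wise sum; a key present in either array survives unless it sums to 0."""
    out = dict(A)
    for k, v in B.items():
        s = out.get(k, 0) + v
        if s != 0:
            out[k] = s
        else:
            out.pop(k, None)
    return out

def mul(A, B):
    """Array multiply: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    b_rows = defaultdict(list)
    for (k, j), v in B.items():
        b_rows[k].append((j, v))
    out = defaultdict(int)
    for (i, k), a in A.items():
        for j, b in b_rows.get(k, ()):
            out[(i, j)] += a * b
    return {key: v for key, v in out.items() if v != 0}

def neg(A):
    """Additive inverse of an array."""
    return {k: -v for k, v in A.items()}

# Illustrative arrays only.
A = {("alice", "x"): 1, ("bob", "y"): 2}
B = {("x", "p"): 3, ("y", "q"): 4}
C = {("x", "p"): 5, ("y", "p"): 1}
D = {("p", "m"): 2, ("q", "m"): 1}
ZERO = {}  # the empty array

assert add(B, C) == add(C, B)                          # commutativity of +
assert mul(mul(A, B), D) == mul(A, mul(B, D))          # associativity of *
assert mul(A, add(B, C)) == add(mul(A, B), mul(A, C))  # distributivity
assert add(A, ZERO) == A                               # additive identity
assert mul(A, ZERO) == ZERO                            # multiplicative annihilator
assert add(A, neg(A)) == ZERO                          # additive inverse
```

Integer values are used so the equalities hold exactly; with floating‑point values, associativity and distributivity would hold only up to rounding error.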

The authors validate these theoretical claims with a series of performance experiments on synthetic datasets that emulate realistic workloads. They construct large associative arrays representing relational tables and then execute sequences of operations that exploit associativity and distributivity. By re‑ordering joins and selections according to the algebraic laws, they achieve up to a 30 % reduction in execution time compared with a naïve left‑to‑right evaluation order. The experiments also demonstrate that cache‑friendly implementations of associative‑array addition and multiplication, which honour the identity and annihilator properties, dramatically lower memory‑bandwidth consumption. These results illustrate that the algebraic properties are not merely abstract; they have concrete, measurable impact on the efficiency of data‑movement and query execution in a polystore context.
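Why regrouping matters can be seen by counting scalar multiplications instead of measuring wall‑clock time. The sketch below is not the paper's benchmark: the array sizes, the random data, the dense single‑column array `v`, and the helper names `mul_counted` and `random_array` are all assumptions chosen for illustration. It compares the left‑to‑right grouping (AB)v with the regrouped A(Bv), which associativity guarantees to be equal.

```python
import random
from collections import defaultdict

def mul_counted(A, B):
    """Multiply two sparse arrays; return (product, number of scalar multiplies)."""
    b_rows = defaultdict(list)
    for (k, j), val in B.items():
        b_rows[k].append((j, val))
    out = defaultdict(int)
    ops = 0
    for (i, k), a in A.items():
        for j, b in b_rows.get(k, ()):
            out[(i, j)] += a * b
            ops += 1
    return {key: val for key, val in out.items() if val != 0}, ops

def random_array(n, nnz, seed):
    """Random n-by-n sparse array with at most nnz positive integer entries."""
    rng = random.Random(seed)
    return {(rng.randrange(n), rng.randrange(n)): rng.randint(1, 9)
            for _ in range(nnz)}

n = 200
A = random_array(n, 2000, seed=1)
B = random_array(n, 2000, seed=2)
v = {(k, 0): 1 for k in range(n)}   # a single dense column

# Naive left-to-right grouping: (A B) v
AB, ops_ab = mul_counted(A, B)
left, ops_abv = mul_counted(AB, v)
left_ops = ops_ab + ops_abv

# Regrouped using associativity: A (B v)
Bv, ops_bv = mul_counted(B, v)
right, ops_a_bv = mul_counted(A, Bv)
right_ops = ops_bv + ops_a_bv

assert left == right                 # same result either way
print("(AB)v multiplies:", left_ops)
print("A(Bv) multiplies:", right_ops)
```

In this setup the regrouped form avoids materialising the large intermediate product AB, so it performs far fewer multiplications. The size of the gap depends entirely on the shapes and sparsity chosen here and says nothing about the figures reported in the paper.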

Finally, the paper argues that associative arrays can serve as a lingua franca for polystore systems such as BigDAWG, which aim to let applications transparently combine multiple heterogeneous databases. By translating each backend’s native data model into a common associative‑array representation, a polystore can reason about cross‑engine queries using a single set of algebraic rules. This enables automatic optimisation strategies that move, split, or merge operations across engines based on cost models derived from the associative‑array algebra. In effect, the associative‑array model bridges the gap between set‑based SQL, graph‑oriented NoSQL, and matrix‑oriented NewSQL, providing a mathematically sound foundation for future multi‑engine query optimisers.

In summary, the work demonstrates that associative arrays faithfully capture the semantics of the three major database paradigms, preserve crucial algebraic properties, and can be leveraged to improve performance in heterogeneous data‑management environments. The findings suggest a promising direction for research on universal data models and optimisation techniques for next‑generation polystore architectures.

