Scalable Probabilistic Databases with Factor Graphs and MCMC


Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or restrict the class of relational algebra formulas under which they are closed. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a distribution over possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary queries over probabilistic databases with arbitrary dependencies expressed by graphical models with structure that changes during inference. MCMC sampling provides efficiency by hypothesizing *modifications* to possible worlds rather than generating entire worlds from scratch. Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world. A significant innovation of this work is the connection between MCMC sampling and materialized view maintenance techniques: we find empirically that using view maintenance techniques is several orders of magnitude faster than naively querying each sampled world. We also demonstrate our system’s ability to answer relational queries with aggregation, and demonstrate additional scalability through the use of parallelization.


💡 Research Summary

The paper tackles a fundamental limitation of existing probabilistic database (PDB) systems: the need to embed uncertainty directly into the relational engine, which forces a compromise among modeling power, scalability, and the class of relational algebra expressions under which the representation remains closed. The authors propose a clean separation: the underlying relational database stores a single deterministic world, while an external factor graph encodes a probability distribution over all possible worlds. This factor graph treats each tuple (or attribute value) as a random variable and models dependencies—foreign‑key constraints, domain restrictions, or domain‑specific correlations—as factors. Because the factor graph can change its structure during inference, it can represent arbitrarily complex dependencies that traditional PDBs cannot.
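To make the factor-graph view concrete, here is a minimal sketch of how factors score possible worlds. The schema (an "order" tuple referencing a "customer" tuple), the weights, and the function names are all illustrative, not taken from the paper: unary factors score each tuple's existence, and a pairwise factor softly penalizes a world that violates the foreign-key dependency.

```python
import itertools
import math

def log_score(order, customer):
    """Log-score of a world over two boolean tuple variables (toy example).
    Unary factors weight each tuple's presence; a pairwise factor penalizes
    an order tuple existing without the customer it references."""
    s = 0.5 * order + 1.0 * customer      # unary factors (log-weights)
    if order and not customer:
        s -= 5.0                          # soft foreign-key constraint factor
    return s

# The factor graph defines P(world) ∝ exp(log_score) over the 4 possible worlds.
Z = sum(math.exp(log_score(o, c)) for o, c in itertools.product([0, 1], repeat=2))
for o, c in itertools.product([0, 1], repeat=2):
    print(o, c, round(math.exp(log_score(o, c)) / Z, 3))
```

Even in this toy case, enumerating all worlds is only feasible because there are two variables; with one variable per candidate tuple the world count is exponential, which is exactly why the paper turns to MCMC.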

Inference is performed by Markov chain Monte Carlo (MCMC) sampling. Rather than generating a full possible world at each step, the sampler proposes a modification to the current world (e.g., insert/delete a tuple, change an attribute). The Metropolis‑Hastings acceptance rule decides whether to adopt the modification, thereby defining a Markov chain whose stationary distribution matches the factor‑graph‑defined world distribution. The key efficiency gain comes from recognizing that each MCMC step only changes a small subset of tuples, which the authors call the “delta”.
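The propose-and-accept loop can be sketched as follows. This is a deliberately tiny model (independent per-tuple log-weights, a symmetric one-tuple flip proposal), not the paper's actual sampler: with a symmetric proposal the Metropolis-Hastings acceptance ratio reduces to the score delta of the factors touching the single changed tuple.

```python
import math
import random

# Toy factor graph: each candidate tuple t has an independent log-weight w[t];
# a possible world is the subset of tuples currently "in" the database, and
# P(world) ∝ exp(sum of w[t] for t in world), so P(t present) = sigmoid(w[t]).
weights = {"t1": 1.0, "t2": -1.0, "t3": 0.0}

def mh_step(world, rng):
    """One Metropolis-Hastings step: propose flipping one tuple's presence.
    The proposal is symmetric, so the log acceptance ratio is just the score
    delta contributed by the factors touching the changed tuple."""
    t = rng.choice(sorted(weights))
    delta = -weights[t] if t in world else weights[t]
    if math.log(rng.random()) < delta:   # accept: apply the one-tuple delta
        world ^= {t}
    return world                         # reject: world left unchanged

rng = random.Random(0)
world, counts = set(), {t: 0 for t in weights}
for _ in range(50_000):
    world = mh_step(world, rng)
    for t in world:
        counts[t] += 1

for t in sorted(weights):
    est = counts[t] / 50_000
    exact = 1 / (1 + math.exp(-weights[t]))
    print(f"{t}: estimated P={est:.3f}, exact P={exact:.3f}")
```

Note that each accepted step touches exactly one tuple — this small "delta" is what the view-maintenance machinery described next exploits.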

To exploit this, the paper draws a novel connection between MCMC sampling and materialized view maintenance. In traditional view maintenance, when base tables are updated, derived views are incrementally refreshed rather than recomputed from scratch. The authors apply the same principle to probabilistic query evaluation: after each MCMC proposal, only the query results that depend on the delta need to be updated. Consequently, the cost of evaluating a query over thousands of sampled worlds collapses to the cost of a handful of incremental updates. Empirically, this approach outperforms naïve repeated querying by several orders of magnitude (often 10²–10⁴× faster).
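A minimal sketch of the idea, using an equi-join count as the materialized view (the class and method names are invented for illustration): instead of re-running the join after every sampled world, per-key multiplicities are kept so that each delta adjusts the answer in time proportional to its matching tuples, not the table sizes.

```python
from collections import Counter

class JoinCountView:
    """Incrementally maintained answer to
       SELECT COUNT(*) FROM R JOIN S ON R.k = S.k
    under single-tuple inserts/deletes (dn = +1 or -1)."""

    def __init__(self):
        self.r, self.s = Counter(), Counter()
        self.count = 0                      # the materialized query answer

    def delta_r(self, key, dn):
        self.count += dn * self.s[key]      # only matching S tuples matter
        self.r[key] += dn

    def delta_s(self, key, dn):
        self.count += dn * self.r[key]      # only matching R tuples matter
        self.s[key] += dn

view = JoinCountView()
view.delta_r("a", +1); view.delta_s("a", +1)   # first join result appears
view.delta_s("a", +1)                          # a second match
view.delta_r("a", -1)                          # an MCMC step deletes the R tuple
print(view.count)                              # 0
```

Each MCMC proposal translates into a handful of `delta_*` calls, so the cost per sample is proportional to the delta rather than to a full query evaluation — the core of the speedup the paper reports.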

The framework also supports aggregate queries (SUM, COUNT, AVG, etc.). Aggregates are initially computed on the deterministic base world; when a delta occurs, the aggregate is adjusted by adding or subtracting the contribution of the changed tuples. This incremental aggregation preserves exactness (up to the sampling error) while avoiding full recomputation, which is crucial for large‑scale data.
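The incremental-aggregation idea can be sketched in a few lines (names are illustrative): SUM and COUNT are computed once on the base world, each accepted delta adds or subtracts only the changed tuples' contributions, and AVG is derived from the two maintained quantities.

```python
class IncrementalAggregate:
    """Maintains SUM and COUNT under tuple-level deltas; AVG is derived."""

    def __init__(self, values):
        self.total = sum(values)    # SUM computed once on the base world
        self.n = len(values)        # COUNT computed once on the base world

    def apply_delta(self, inserted=(), deleted=()):
        # Adjust by the delta's contribution only — no full recomputation.
        self.total += sum(inserted) - sum(deleted)
        self.n += len(inserted) - len(deleted)

    @property
    def avg(self):
        return self.total / self.n if self.n else 0.0

agg = IncrementalAggregate([10, 20, 30])      # base world: SUM=60, COUNT=3
agg.apply_delta(inserted=[40], deleted=[10])  # one accepted MCMC modification
print(agg.total, agg.n, agg.avg)              # 90 3 30.0
```

MIN/MAX are trickier under deletions (a deleted extremum forces a rescan), which is one reason decomposable aggregates like SUM, COUNT, and AVG are the natural fit for this scheme.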

Scalability is further enhanced through parallelism. Independent MCMC chains are launched on multiple CPU cores, each with a distinct initial world to reduce correlation. Because each chain operates on its own delta set, there is no contention during sampling. After a predefined number of iterations, the results from all chains are combined by averaging the estimated probabilities of query answers. The authors demonstrate near‑linear speed‑up on an 8‑core machine.
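A sketch of the combination step, reusing the one-tuple toy model from above (seeds stand in for the distinct initial worlds): the chains are run sequentially here for clarity, but each `run_chain` call shares no state with the others, so in the real system each would occupy its own core.

```python
import math
import random

def run_chain(seed, weight=1.0, steps=20_000):
    """One independent MCMC chain estimating P(tuple present) for a single
    tuple with log-weight `weight`, using the symmetric flip proposal."""
    rng = random.Random(seed)
    present = rng.random() < 0.5            # distinct initial world per chain
    hits = 0
    for _ in range(steps):
        delta = -weight if present else weight
        if math.log(rng.random()) < delta:
            present = not present
        hits += present
    return hits / steps

# Chains share nothing, so they can run on separate cores without contention;
# their per-chain estimates are then combined by simple averaging.
estimates = [run_chain(seed) for seed in range(8)]
combined = sum(estimates) / len(estimates)
print(round(combined, 3))                   # ≈ sigmoid(1.0) ≈ 0.731
```

Averaging independent chains also reduces the variance of the estimate, which is why distinct starting worlds (lower cross-chain correlation) help beyond raw throughput.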

Implementation-wise, the system sits as a thin layer atop a conventional relational DBMS, preserving the standard SQL interface. Probabilistic queries are expressed via extended operators that trigger the external factor‑graph engine. The factor graph and MCMC engine run as separate processes, communicating with the DBMS through lightweight data‑exchange mechanisms, thereby keeping the core DBMS unchanged while still achieving high I/O throughput.

Experimental evaluation uses three benchmark domains: a social‑network graph, a geographic dataset, and an e‑commerce click‑stream log. Across all benchmarks, the view‑maintenance‑based incremental evaluation skips more than 99.9 % of the work that naïve evaluation would perform, while maintaining an average relative error below 0.1 % for aggregate queries. Parallel execution yields almost ideal speed‑up, confirming that the approach scales both in data size and in computational resources.

In summary, the paper makes three major contributions: (1) a clean architectural split that lets a relational DB store a single deterministic world while an external factor graph captures arbitrary probabilistic dependencies; (2) the novel use of materialized view maintenance to turn costly per‑sample query evaluation into cheap incremental updates; and (3) a demonstration that the approach supports aggregates and parallel MCMC, delivering orders‑of‑magnitude performance gains. This work opens a practical path for deploying probabilistic databases in real‑world, high‑throughput settings such as uncertain sensor streams, fraud detection, and probabilistic knowledge‑base querying, where both expressive models and fast query response are essential.

