Benchmarking Declarative Approximate Selection Predicates

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data-quality primitives on top of any relational data source. A primary advantage of such an approach is its ease of use and integration with existing applications. Several similarity predicates have been proposed in the past for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates based on notions of probabilistic information retrieval are proposed, along with their declarative realization. Full declarative specifications of similarity predicates previously proposed in the literature are then presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study is conducted, comparing a large number of similarity predicates for data-cleaning operations.


💡 Research Summary

The paper investigates the problem of approximate selection—finding records that are “similar enough” to a query—in the context of declarative data‑quality processing. The authors argue that expressing similarity predicates directly in SQL offers a seamless way to integrate data‑cleaning operations into existing relational systems, avoiding the need for external scripts or specialized tools.

First, the work surveys the landscape of similarity functions that have been used for approximate selections. These are grouped into four major families: (1) pure string‑distance measures such as Levenshtein, Jaro‑Winkler, and Damerau‑Levenshtein; (2) token‑based set similarity metrics like Jaccard, Cosine, and TF‑IDF weighted variants; (3) probabilistic information‑retrieval models, exemplified by BM25 and language‑model similarity; and (4) hybrid approaches that combine aspects of the previous categories (e.g., SoftTF‑IDF, Q‑gram extensions). For each function the authors provide a complete SQL implementation using only standard relational operators, window functions, and common table expressions (CTEs). This demonstrates that even sophisticated probabilistic models can be expressed without leaving the database engine.
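As an illustration of what such declarative realizations look like, here is a minimal sketch (not the paper's exact SQL) of a token-based Jaccard predicate, run through Python's `sqlite3`. Table and column names (`base_tokens`, `query_tokens`, `rid`) are invented for the example, and each record's tokens are assumed distinct:

```python
import sqlite3

# Records and the query pattern are pre-tokenized into (id, token) rows;
# Jaccard similarity is |A ∩ B| / |A ∪ B| over the token sets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_tokens (rid INTEGER, token TEXT);
CREATE TABLE query_tokens (token TEXT);
INSERT INTO base_tokens VALUES
  (1, 'john'), (1, 'smith'),
  (2, 'jon'),  (2, 'smith'), (2, 'jr');
INSERT INTO query_tokens VALUES ('john'), ('smith');
""")

jaccard_sql = """
WITH sizes AS (
  SELECT rid, COUNT(*) AS n FROM base_tokens GROUP BY rid
),
q AS (
  SELECT COUNT(*) AS n FROM query_tokens
),
overlap AS (
  SELECT b.rid, COUNT(*) AS shared
  FROM base_tokens b JOIN query_tokens qt ON b.token = qt.token
  GROUP BY b.rid
)
SELECT o.rid,
       1.0 * o.shared / (s.n + q.n - o.shared) AS jaccard
FROM overlap o
JOIN sizes s ON o.rid = s.rid
CROSS JOIN q
ORDER BY jaccard DESC;
"""
scores = dict(conn.execute(jaccard_sql).fetchall())
```

The whole computation stays inside the engine: the CTEs compute per-record and query token counts, and the union size is derived by inclusion–exclusion rather than a second join.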

The core contribution is a new class of similarity predicates derived from probabilistic information retrieval theory. By treating each record as a “document” and the query pattern as a “search term”, the authors estimate token‑level prior probabilities and conditional probabilities, then apply Bayes’ rule to compute a posterior similarity score. Three concrete instantiations are presented: Probabilistic Jaccard, a BM25‑like scoring function, and a language‑model‑based similarity. All three are realized in pure SQL by pre‑computing token frequencies, joining them with the query tokens, and aggregating log‑probabilities. The probabilistic approach simultaneously captures token frequency, inverse document frequency, and term‑specific weighting, which is especially beneficial for heterogeneous, long‑text attributes.

To evaluate the proposals, the authors design an extensive benchmark. They generate synthetic data sets ranging from 10 K to 5 M rows, inject controlled noise (character swaps, token deletions, and semantic substitutions), and also use real‑world corpora such as product descriptions and customer reviews. Accuracy is measured via precision, recall, F1‑score, and top‑k ranking loss against a ground‑truth mapping. Performance is measured in terms of query execution time under various indexing strategies (no index, B‑Tree on token columns, GIN inverted indexes) and hardware configurations.
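The set-based accuracy metrics used in such a benchmark can be computed as follows; this helper is a generic sketch, not code from the study:

```python
def prf1(retrieved, relevant):
    """Precision, recall, and F1 for retrieved ids vs. a ground-truth set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: 4 records retrieved, 3 relevant in the ground truth, 2 in common.
p, r, f1 = prf1({1, 2, 3, 4}, {2, 3, 5})
```

F1 is the harmonic mean of precision and recall, so a predicate scores well only when it avoids both spurious matches and misses.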

Results show that the probabilistic predicates consistently outperform traditional string‑distance measures, achieving 12%–18% higher F1‑scores on average, with the most pronounced gains on token‑rich fields. In scenarios with short strings (e.g., names, addresses) the classic distances remain competitive, but their recall drops sharply when token order varies. Execution‑time overhead for the probabilistic methods is modest: with appropriate indexing the queries complete within a few seconds even on the 5 M‑row data set, representing a 1.5×–2× slowdown compared with simple Jaccard but still acceptable for interactive cleaning workflows. The study also highlights that structuring the SQL as a series of CTEs and materialized views improves optimizer cost estimates, reducing runtime by roughly 10%–15% relative to monolithic sub‑queries.
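The precomputation idea behind that structuring can be sketched as follows. Since `sqlite3` lacks materialized views, a `TEMP TABLE` stands in for one here, and the table names are invented for the example:

```python
import sqlite3

# Precompute per-token document frequencies once, so that similarity
# queries join against the stored table instead of recomputing the
# aggregate inside every query (the role a materialized view plays).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_tokens (rid INTEGER, token TEXT);
INSERT INTO base_tokens VALUES
  (1, 'john'), (1, 'smith'), (2, 'jon'), (2, 'smith'), (2, 'jr');
-- "Materialized" document frequencies: one row per distinct token.
CREATE TEMP TABLE token_df AS
  SELECT token, COUNT(DISTINCT rid) AS df
  FROM base_tokens GROUP BY token;
""")
df = dict(conn.execute("SELECT token, df FROM token_df"))
```

An IDF-weighted predicate can then join `token_df` on the token column, which also gives the optimizer accurate row counts for the frequency table.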

Finally, the paper discusses practical implications. By keeping similarity computation inside the relational engine, organizations can eliminate separate ETL pipelines for data cleaning, reduce maintenance overhead, and leverage existing security and transaction mechanisms. The approach is portable across major RDBMS platforms because it relies only on standard SQL features. However, the authors acknowledge limitations: very complex probabilistic models can produce unwieldy query plans, and in some cases user‑defined functions or external extensions may be needed for scalability. They suggest future work on adaptive parameter tuning, integration of learned similarity models, and testing on distributed SQL engines such as Apache Calcite or Spark SQL. In sum, the study provides a thorough taxonomy, concrete declarative implementations, and a rigorous empirical evaluation that together advance the state of the art in declarative data‑quality processing.