DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
A data product is created to solve a specific problem, address a specific business use case, or meet a particular need, going beyond serving data as a raw asset. Data products enable end users to gain greater insights from their data. Since the concept was first introduced over a decade ago, there has been considerable work, especially in industry, on creating data products manually or semi-automatically. However, hardly any benchmarks exist for evaluating automatic data product creation. In this work, we present the first benchmark for this task, which we call DP-Bench. We describe how the benchmark was created by building on existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM-based approaches that can serve as baselines for generating data products automatically. We make DP-Bench and supplementary materials available at https://huggingface.co/datasets/ibm-research/dp-bench .
💡 Research Summary
DP‑Bench introduces the first publicly available benchmark for evaluating systems that automatically generate data products. A data product, unlike a raw dataset, is a curated, packaged collection of tables, columns, and derived attributes designed to answer specific business questions. While industry has produced many data products manually or semi‑automatically, there has been no systematic way to assess the quality of fully automated pipelines. This paper fills that gap by constructing a benchmark that combines resources from two existing corpora: ELT‑Bench, which provides 100 ELT pipelines and associated data models, and BIRD, a large‑scale text‑to‑SQL benchmark covering 95 databases and over 12 k question‑SQL pairs across more than 37 domains.
Task definition
Given (i) one or more relational database schemas (and optionally unstructured documents) and (ii) a natural‑language description of a business use case (the Data Product Request, DPR), an automatic system must output a data product. In this benchmark a data product is limited to (a) a set of selected tables, (b) a set of columns that are either directly selected, derived from existing columns, or (optionally) created from extracted information in unstructured text, and (c) a provenance specification for each derived column expressed as SQL that can compute the column from the source schema. This formulation captures the core challenges of data product creation while remaining objectively measurable.
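The expected output can be pictured as a small structured object. The sketch below is an illustrative rendering of parts (a)-(c), not the benchmark's actual schema; all class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DerivedColumn:
    name: str
    provenance_sql: str  # SQL that computes this column from the source schema

@dataclass
class DataProduct:
    tables: list[str]                     # (a) selected source tables
    selected_columns: list[str]           # (b) columns taken directly from the schema
    derived_columns: list[DerivedColumn]  # (b)/(c) derived columns with provenance

# Hypothetical product answering a DPR about average order value per customer
dp = DataProduct(
    tables=["orders", "customers"],
    selected_columns=["customers.customer_id", "customers.name"],
    derived_columns=[
        DerivedColumn(
            name="avg_order_value",
            provenance_sql=(
                "SELECT customer_id, AVG(total) AS avg_order_value "
                "FROM orders GROUP BY customer_id"
            ),
        )
    ],
)
```

Because every derived column carries executable SQL provenance, a system's output can be checked mechanically against the source schema.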
Benchmark construction
The authors aligned the 78 databases that appear in both ELT‑Bench and BIRD. From these they extracted 4 306 columns in BIRD and 921 columns in ELT‑Bench data models. Using a large language model (LLaMA‑3.3‑70B‑Instruct), they generated SQL provenance statements for the 634 derived columns (≈69 % of ELT‑Bench columns) and manually validated them, discarding those with erroneous or ambiguous provenance. After filtering, 582 derived columns remained, together with 287 non‑derived columns. These were grouped by table name to form 78 preliminary data products.
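The final grouping step can be sketched as bucketing validated columns by their table name. This is a minimal illustration of that idea; the column records and field names are invented for the example.

```python
from collections import defaultdict

def group_into_products(columns):
    """Group validated columns into preliminary data products by table name.

    columns: iterable of (table_name, column_name, is_derived) tuples.
    """
    products = defaultdict(lambda: {"derived": [], "non_derived": []})
    for table, col, is_derived in columns:
        bucket = "derived" if is_derived else "non_derived"
        products[table][bucket].append(col)
    return dict(products)

# Toy input standing in for the 582 derived + 287 non-derived columns
validated = [
    ("orders", "avg_order_value", True),
    ("orders", "order_id", False),
    ("customers", "lifetime_spend", True),
]
products = group_into_products(validated)
```

Applied to the full filtered column set, one bucket per table yields the 78 preliminary data products.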
DPR generation and curation
For each data product, the LLM was prompted to produce three concise natural‑language summaries of the tables and columns. Each summary was then transformed into a candidate DPR using a second prompt that rewrites the summary into a business‑oriented request. This yielded 234 candidate DPRs (3 per data product). A two‑phase human annotation process was applied: (1) annotators examined five randomly selected BIRD question‑SQL pairs per data product and judged whether each candidate DPR would enable answering those questions; if not, they edited the DPR; (2) a third annotator adjudicated any disagreements and produced the final gold‑standard DPR. Approximately 71 % of candidate DPRs required no edits; 67 DPRs across 27 data products were edited.
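The two-prompt generation step described above can be sketched as follows. Here `llm` is a stand-in for a call to LLaMA‑3.3‑70B‑Instruct, and the prompt wording is invented for illustration; only the summarize-then-rewrite structure comes from the paper.

```python
def generate_candidate_dprs(product_description: str, llm, n_summaries: int = 3):
    """Produce candidate DPRs for one data product via two prompting stages."""
    # Stage 1: three concise natural-language summaries of the tables/columns
    summaries = [
        llm(f"Summarize these tables and columns concisely: {product_description}")
        for _ in range(n_summaries)
    ]
    # Stage 2: rewrite each summary into a business-oriented request
    return [
        llm(f"Rewrite this summary as a business-oriented data product request: {s}")
        for s in summaries
    ]

# With a stub in place of the model, 78 products x 3 candidates = 234 DPRs
stub_llm = lambda prompt: "stub response"
candidates = generate_candidate_dprs("orders + customers tables", stub_llm)
```

The human annotation phases then act as a filter and repair pass over these candidates.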
Baseline approaches
The paper proposes several LLM‑based baselines that follow a three‑step pipeline: (1) table and column selection, (2) derived‑column generation, and (3) SQL provenance generation. All steps use LLaMA‑3.3‑70B‑Instruct. Experiments reveal that while the model can correctly select many relevant tables/columns, the provenance generation step still produces a non‑trivial error rate, underscoring the necessity of human verification. The benchmark deliberately restricts the data product definition to table/column level, leaving dashboards, reports, and visualizations for future extensions.
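One plausible way to score step (1), table and column selection, is set-based precision and recall of predictions against the gold data product. The paper's exact metrics are not detailed in this summary, so the function below is an assumption about how such scoring could work.

```python
def precision_recall(predicted, gold):
    """Set-based precision/recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: one correct column out of two predicted, two gold
p, r = precision_recall(
    ["orders.total", "orders.customer_id"],   # system output
    ["orders.total", "customers.name"],       # gold data product columns
)
# p == 0.5, r == 0.5
```

SQL provenance (step 3) would need execution-based checking rather than set overlap, which is where the reported error rate and the need for human verification arise.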
Key insights and impact
- Integration of existing resources – By aligning ELT‑Bench and BIRD, the authors efficiently constructed a rich, multi‑domain dataset without building everything from scratch.
- Explicit provenance – Providing SQL definitions for derived columns makes the output interpretable, reproducible, and amenable to automatic scoring.
- Human‑in‑the‑loop necessity – Even with a powerful LLM, manual validation remains crucial for correctness, especially for derived‑column logic.
- Scalability and extensibility – The current benchmark covers 78 databases and 234 DPRs, but the underlying pipelines can be expanded to include more domains, larger numbers of DPRs, and higher‑level artifacts such as dashboards.
In summary, DP‑Bench offers a well‑structured, human‑validated benchmark that enables systematic comparison of data‑product generation systems. It bridges the gap between text‑to‑SQL research and ELT automation, establishing a foundation for future work on end‑to‑end data product creation, evaluation metrics, and the integration of LLMs with data engineering pipelines.