Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text. Existing benchmarks only tackle individual pieces of this problem – e.g., translating natural-language questions into SQL queries, answering questions over small tables provided in context – but do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries across 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are published at github.com/ucbepic/DataAgentBench.


💡 Research Summary

Enterprises are rapidly adopting AI‑powered data agents that allow users to ask natural‑language questions over their corporate data. While the commercial interest is evident—examples include Uber's QueryGPT handling over a million queries per month and OpenAI's internal agent accessing 70,000 datasets—the technical reality is far more challenging. Real‑world data is typically spread across heterogeneous database management systems (DBMSs), identifiers are inconsistently formatted, and crucial information is often embedded in unstructured text fields. Existing evaluation suites such as text‑to‑SQL or table‑question‑answering benchmarks address only isolated aspects (single‑DB query generation, or reasoning over tables supplied in the prompt) and therefore cannot measure the end‑to‑end capabilities required of a production data agent.

To fill this gap, the authors introduce the Data Agent Benchmark (DAB). The benchmark design is grounded in a formative study conducted with Hasura’s PromptQL platform, where semi‑structured interviews with enterprise customers across six industries (technology, finance, food services, e‑commerce, SaaS, healthcare) revealed four recurring challenges: (C1) multi‑database integration, (C2) semantic operations over text, (C3) domain‑specific knowledge, and (C4) open‑ended analytical reasoning. Because open‑ended reasoning and live API calls are not amenable to deterministic evaluation, the benchmark focuses on the first three themes and extracts four concrete properties: (i) multi‑DB integration, (ii) ill‑formatted join keys, (iii) unstructured‑text transformation, and (iv) domain knowledge.

DAB comprises 54 natural‑language queries built on 12 open‑source datasets spanning nine domains (news, e‑commerce, CRM, software engineering, local business, music, finance, medical research, patents). For each dataset the authors deliberately perturb the data to induce the target properties: join keys are renamed or prefixed, columns that would trivially answer a query are removed and their values re‑embedded as free‑text sentences generated by LLMs, and tables are split across at least two DBMSs (PostgreSQL, MongoDB, SQLite, DuckDB). The resulting benchmark mirrors the complexity observed in production workloads while preserving deterministic ground‑truth answers through extensive manual verification.
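The perturbation recipe above can be illustrated with a toy sketch. The table, column names, and template sentence below are hypothetical (the paper generates the free‑text sentences with LLMs and applies its own scripts); this only shows the shape of properties (ii) and (iii): mangling a join key and folding a structured column into free text.

```python
def perturb(rows):
    """Apply DAB-style perturbations to a toy table (illustrative only)."""
    out = []
    for row in rows:
        new = dict(row)
        # (ii) Ill-formatted join keys: prefix the key so a naive equi-join fails.
        new["customer_id"] = f"CUST-{new['customer_id']}"
        # (iii) Unstructured text: remove the column that would trivially answer
        # a query and re-embed its value as a free-text note.
        # (The paper uses LLM-generated sentences; a fixed template stands in here.)
        new["notes"] = f"The customer is on the {new.pop('plan')} plan."
        out.append(new)
    return out

rows = [{"customer_id": 1, "plan": "pro"}, {"customer_id": 2, "plan": "free"}]
print(perturb(rows))
```

An agent querying the perturbed table must now both normalize the `CUST-` prefix before joining and extract the plan name back out of the `notes` field.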

The evaluation uses five frontier large language models (LLMs): GPT‑5.2, GPT‑5‑mini, Gemini‑3‑Pro, Gemini‑2.5‑Flash, and Kimi‑K2. Each model is wrapped in a ReAct‑style agent architecture, which iteratively decides on an action, invokes a tool (SQL query, MongoDB query, Python script, etc.), observes the result, and repeats until a final answer is produced. For each of the 54 queries, 50 independent trials are run per agent, and accuracy is reported using the pass@k metric (the probability that at least one of k attempts succeeds).
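The ReAct loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual harness: the `llm` and tool interfaces are hypothetical stand-ins for the model call and the SQL/MongoDB/Python tools.

```python
def react_agent(question, llm, tools, max_steps=20):
    """Minimal ReAct loop: decide on an action, invoke a tool, observe,
    and repeat until the model emits a final answer.

    `llm` maps the transcript to either ("final", answer) or
    ("tool", name, args); `tools` maps tool names (e.g. "sql",
    "mongo", "python") to callables. Illustrative sketch only.
    """
    transcript = [("question", question)]
    for _ in range(max_steps):
        decision = llm(transcript)
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision
        observation = tools[name](args)  # run the chosen tool
        transcript.append(("action", name, args))
        transcript.append(("observation", observation))
    return None  # step budget exhausted without a final answer
```

In the benchmark, each of the five models is dropped into an architecture of this shape, so differences in accuracy reflect the model's planning and tool use rather than the scaffolding.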

Results are sobering. The best-performing model, Gemini‑3‑Pro, achieves only 38% pass@1, and even with 50 attempts the success probability caps at 69%. One entire dataset remains unsolved by any model. Error analysis shows that 85% of failures stem from incorrect planning or faulty implementation (e.g., wrong join order, failure to normalize keys), while selection of the wrong data source is rare. All agents rely on regular‑expression‑based extraction for free‑text fields; none employ LLM‑driven or semantic extraction primitives. Moreover, agents that explore the schema either too little or too much underperform, with the two highest‑accuracy models allocating roughly 20% of their tool calls to data exploration.
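The pass@k numbers above follow the definition given earlier (the probability that at least one of k sampled attempts succeeds). With n = 50 trials per query, this is typically computed with the standard unbiased estimator; the example value below is hypothetical, just showing that a query solved in 19 of 50 trials scores pass@1 = 0.38.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn from n trials (c of which succeeded) is a success."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-draw contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-query example with n = 50 trials:
print(round(pass_at_k(50, 19, 1), 2))   # 0.38
```

The benchmark's headline numbers are then averages of these per-query values over all 54 queries.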

From these observations the authors derive actionable takeaways. First, data‑exploration strategies must be balanced: excessive probing wastes budget, while insufficient probing leaves necessary schema information undiscovered. Second, the toolset for agents should be enriched beyond SQL and Python to include robust text‑extraction primitives (e.g., entity recognizers, LLM‑based parsers). Third, incorporating a semantic layer or knowledge graph can offload domain‑specific reasoning from the LLM, reducing planning complexity. Finally, a case study with PromptQL’s proprietary agent shows a modest 7 percentage‑point improvement over the vanilla ReAct baseline when using the same LLM, yet it still fails completely on queries requiring unstructured‑text extraction, underscoring the need for better text‑processing capabilities.
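The gap between regex-only extraction and richer text primitives can be made concrete. In the sketch below, the sentence template, field name, and `llm_extract` callable are all hypothetical; the point is that a bare pattern (what the benchmarked agents relied on) matches only one phrasing, while an LLM-backed parser could serve as a fallback for paraphrases.

```python
import re

def extract_plan(note, llm_extract=None):
    """Pull a plan name out of a free-text note.

    A regular expression handles the phrasing it was written for;
    an optional LLM-based extractor (hypothetical `llm_extract`)
    could catch paraphrases the pattern misses.
    """
    m = re.search(r"on the (\w+) plan", note)
    if m:
        return m.group(1)
    if llm_extract is not None:
        return llm_extract(note)  # e.g. prompt an LLM for the plan name
    return None

print(extract_plan("The customer is on the pro plan."))  # pro
print(extract_plan("They recently upgraded to Pro."))    # None: regex misses the paraphrase
```

This brittleness is exactly why the authors argue for entity recognizers and LLM-based parsers as first-class tools rather than leaving extraction to ad hoc regexes.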

In summary, DAB is the first publicly released benchmark that evaluates AI agents on the full pipeline of real‑world data analytics: multi‑DB integration, key reconciliation, text transformation, and domain knowledge application. The benchmark reveals that current frontier LLM‑based agents are far from reliable, achieving well below 50% success on realistic queries. The paper therefore calls for research focused on (1) more efficient schema and data discovery, (2) richer, possibly LLM‑augmented, extraction toolchains, and (3) systematic integration of domain expertise, to move toward truly dependable AI data agents.

