BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as FIBEN, Spider, and Bird. Our earlier work showed that LLMs are much less effective at querying large private enterprise data warehouses, and it released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on the additional work of constructing and validating corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
💡 Research Summary
The paper introduces BenchPress, a human‑in‑the‑loop annotation platform designed to streamline the creation of domain‑specific text‑to‑SQL benchmarks for private enterprise data warehouses. While large language models (LLMs) such as GPT‑4o and Llama‑3.1 have achieved high performance on public benchmarks (Spider, Bird, FIBEN), their execution accuracy drops dramatically on real‑world enterprise workloads, as demonstrated by the authors’ earlier Beaver benchmark. The primary obstacle is the lack of natural‑language (NL) utterances paired with the abundant SQL logs that enterprises already possess; manually crafting these NL descriptions requires highly trained database administrators and is prohibitively costly.
BenchPress addresses this bottleneck by combining Retrieval‑Augmented Generation (RAG) with LLM‑driven NL generation and a tightly integrated human verification loop. The system’s workflow consists of a one‑time project setup (local API‑key storage, schema and log ingestion, task configuration) followed by a repeated annotation loop. For each SQL query, BenchPress first retrieves semantically similar existing annotations from a vector store, injects them into a prompt, and asks a selected LLM (e.g., GPT‑4o, GPT‑3.5‑Turbo, DeepSeek) to produce multiple NL drafts. Human experts then select the most accurate draft, rank alternatives, and optionally edit the text. For nested or particularly complex queries, BenchPress automatically performs a decomposition‑recomposition step: it splits the query into simpler sub‑queries, generates NL for each, and finally recomposes a coherent description that the expert validates.
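The retrieval-and-prompting step of this loop can be sketched in a few lines. The sketch below is illustrative, not BenchPress's actual implementation: it substitutes a simple token-overlap (Jaccard) similarity for the vector-store embeddings the paper describes, and the `Annotation` class, function names, and prompt wording are all assumptions. The LLM call itself is omitted; the sketch stops at the few-shot prompt that would be sent to a model such as GPT-4o.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A previously verified (SQL, natural-language) pair from the store."""
    sql: str
    nl: str


def similarity(a: str, b: str) -> float:
    # Token-overlap (Jaccard) similarity: a dependency-free stand-in for
    # the embedding-based similarity a real vector store would compute.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query_sql: str, store: list[Annotation], k: int = 2) -> list[Annotation]:
    # Return the k existing annotations whose SQL is most similar to the
    # query being annotated; these become few-shot examples in the prompt.
    return sorted(store, key=lambda ann: similarity(query_sql, ann.sql),
                  reverse=True)[:k]


def build_prompt(query_sql: str, examples: list[Annotation]) -> str:
    # Inject the retrieved (SQL, NL) pairs so the LLM's drafts reuse the
    # target schema's terminology, then ask for multiple NL candidates.
    shots = "\n\n".join(f"SQL: {e.sql}\nNL: {e.nl}" for e in examples)
    return (
        "Write three natural-language questions that the final SQL query "
        "answers, using the terminology of the examples below.\n\n"
        f"{shots}\n\nSQL: {query_sql}\nNL:"
    )
```

A human expert would then see the model's drafts for this prompt and select, rank, or edit them before the pair enters the benchmark; the decomposition-recomposition step for nested queries is not shown here.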
The authors evaluated BenchPress on enterprise SQL logs from MIT’s data warehouse and an internal Intel warehouse, covering over 300 schemas and roughly 4,000 queries. Compared with a baseline of pure manual annotation, BenchPress reduced average annotation time per query by about 45 % (from ~12 minutes to ~6.5 minutes) while improving post‑verification accuracy to 92 % (≈15 % higher than manual). The generated benchmarks were then used to assess several state‑of‑the‑art LLMs, revealing performance patterns that differ markedly from those observed on public datasets, underscoring the importance of domain‑specific evaluation.
Privacy and security are integral to the design: API keys never leave the user’s browser, data is stored server‑side in encrypted form, and project‑level access controls enforce enterprise compliance. The entire codebase is released under an open‑source license, enabling organizations to deploy BenchPress on‑premises or integrate it with existing AI pipelines.
In summary, BenchPress offers (1) RAG‑enhanced, LLM‑generated NL suggestions that are aware of the target schema and terminology, (2) an intuitive human‑in‑the‑loop interface for selection, ranking, and editing, (3) automated handling of nested queries through decomposition, and (4) a privacy‑preserving architecture. By dramatically lowering the cost of building high‑quality, domain‑specific text‑to‑SQL benchmarks, BenchPress empowers enterprises to reliably evaluate, fine‑tune, and deploy LLM‑based query generation systems on their own proprietary data.