SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications


Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task’s complexity, with the leading reasoning model O3-Mini achieving only a 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not fully exploit the available supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves a 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available: https://bird-critic.github.io/


💡 Research Summary

The paper tackles the largely unexplored problem of automatically debugging user‑submitted SQL queries, a task that is far more complex than the well‑studied text‑to‑SQL translation. To enable systematic research, the authors introduce BIRD‑CRITIC, a benchmark built from real‑world Stack Overflow bug‑fix threads. After a multi‑stage curation process involving ten qualified annotators and three senior database experts, the benchmark contains 530 PostgreSQL‑only tasks (BIRD‑CRITIC‑PG) and 570 multi‑dialect tasks (BIRD‑CRITIC‑MULTI) covering PostgreSQL, MySQL, SQL Server, and Oracle. Each task is defined by a triple (P, S, σ_issue) – a natural‑language problem description, the database schema, and the erroneous SQL – together with a ground‑truth corrected SQL (σ*) and a Python‑SQL evaluation script that checks functional correctness via multiple test cases. This design moves beyond simple execution‑success metrics and captures whether the corrected query preserves the user’s intent.
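The per-task functional-correctness check described above can be sketched as a small harness that replays the candidate fix and the ground-truth query against several test-case database states and compares their result sets. This is a minimal sketch under assumptions of ours: SQLite stands in for the benchmark's PostgreSQL (and other dialect) engines, and the function and parameter names are illustrative, not the paper's actual evaluation-script API.

```python
import sqlite3

def evaluate_fix(schema_sql, candidate_sql, gold_sql, test_case_data):
    """Hypothetical functional-correctness check: the candidate fix must
    reproduce the gold query's result set under every test-case state."""
    for data_sql in test_case_data:
        conn = sqlite3.connect(":memory:")   # fresh replay environment
        conn.executescript(schema_sql)       # database schema S
        conn.executescript(data_sql)         # per-test-case data variation
        got = sorted(conn.execute(candidate_sql).fetchall())
        want = sorted(conn.execute(gold_sql).fetchall())
        conn.close()
        if got != want:
            return False                     # user intent not preserved
    return True
```

Sorting makes the comparison order-insensitive; a real evaluation script would presumably compare ordered results whenever the task's intent includes an `ORDER BY`.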

Baseline experiments with state‑of‑the‑art chain‑of‑thought models (e.g., O3‑Mini) achieve only 38.87% success on PG and 33.33% on the multi‑dialect set, underscoring the difficulty of the task. To improve open‑source LLMs, the authors propose SIX‑GYM (SQL‑FIX‑Gym), a training environment that generates large‑scale, executable issue‑solution pairs and provides richer supervision. The core of SIX‑GYM is the SQL‑Rewind strategy: starting from a verified correct query, it programmatically injects realistic syntactic and logical errors (e.g., missing joins, wrong predicates, type mismatches) to synthesize plausible debugging scenarios. This automated data creation dramatically reduces annotation cost while preserving realism.
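In that spirit, the rewind step can be sketched as a set of mutation rules applied to a verified query, keeping only mutants whose behavior observably diverges from the original (so the synthesized "issue" is real). The mutation list, function names, and SQLite backend below are our assumptions for illustration, not the paper's actual implementation.

```python
import random
import re
import sqlite3

# Hypothetical mutation rules in the spirit of SQL-Rewind: each rewrites a
# verified query into a plausibly buggy variant.
MUTATIONS = [
    lambda q: q.replace("INNER JOIN", "LEFT JOIN", 1),   # join-type slip
    lambda q: re.sub(r"\bAND\b", "OR", q, count=1),      # wrong connective
    lambda q: q.replace(">=", ">", 1),                   # off-by-one bound
]

def rewind(conn, verified_sql, rng=random):
    """Return (buggy_sql, expected_rows) or (None, expected_rows) if no
    mutation produces an observable divergence for this query."""
    want = sorted(conn.execute(verified_sql).fetchall())
    for mutate in rng.sample(MUTATIONS, len(MUTATIONS)):
        buggy = mutate(verified_sql)
        if buggy == verified_sql:
            continue                 # rule did not apply to this query
        try:
            got = sorted(conn.execute(buggy).fetchall())
        except sqlite3.Error:
            return buggy, want       # execution error: a valid issue
        if got != want:
            return buggy, want       # logical error with observable effect
    return None, want
```

The divergence filter is the important part: a mutant that still returns the correct rows would make a useless training issue, so it is discarded.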

However, conventional trajectory‑based fine‑tuning underutilizes the information contained in the ground‑truth solution. To address this, the paper introduces Functional‑Plan (f‑plan) Boosting. By comparing σ_issue with σ*, an LLM extracts a high‑level debugging plan expressed as step‑by‑step pseudo‑code (e.g., “detect missing parent key → add LEFT JOIN → verify row count”). This plan guides a teacher LLM within a structured agent scaffold called SQL‑ACT to interact with the database environment, yielding 73.7% more successful execution trajectories than unguided sampling. The resulting successful trajectories form a high‑quality fine‑tuning corpus.
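The plan itself is plain structured text; what matters is how it conditions the teacher's sampling. A minimal sketch of that scaffolding follows, under assumptions of ours: the dataclass fields and prompt layout are illustrative, and the actual plan extraction (comparing σ_issue with σ*) is done by an LLM, which we do not model here.

```python
from dataclasses import dataclass, field

@dataclass
class FPlan:
    """Hypothetical container for an extracted f-plan."""
    issue_summary: str
    steps: list = field(default_factory=list)  # ordered high-level actions

def build_guided_prompt(problem, schema, buggy_sql, plan):
    """Append the f-plan as guidance so the teacher agent's sampled
    trajectories follow a known road from the issue to the fix."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(plan.steps, 1))
    return (
        f"Problem: {problem}\n"
        f"Schema: {schema}\n"
        f"Buggy SQL: {buggy_sql}\n"
        f"Diagnosis: {plan.issue_summary}\n"
        f"Debugging plan to follow:\n{numbered}"
    )
```

Because the plan is stated at the level of intent rather than dialect-specific syntax, the same guidance can steer trajectories across PostgreSQL, MySQL, SQL Server, and Oracle tasks.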

Using this corpus, the authors fine‑tune Qwen‑2.5‑Coder‑14B, yielding the Bird‑Fixer agent. Bird‑Fixer attains 38.11% success on BIRD‑CRITIC‑PG and 29.65% on BIRD‑CRITIC‑MULTI, surpassing proprietary models such as Claude‑3.7‑Sonnet and GPT‑4.1 under the same evaluation protocol. Notably, the performance gap narrows across dialects, indicating that the model learns dialect‑agnostic debugging reasoning.
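Under this protocol, the agent's interaction with the environment reduces to a short observe-propose-check loop. The sketch below is our simplification of an SQL-ACT-style scaffold: SQLite stands in for the real engines, and `propose_fix` is a hypothetical callable standing in for the fine-tuned model.

```python
import sqlite3

def debug_loop(conn, buggy_sql, propose_fix, check, max_turns=5):
    """Minimal observe-propose-check loop: execute the current query, feed
    the observation back to the model, stop when the checker accepts."""
    sql, history = buggy_sql, []
    for _ in range(max_turns):
        try:
            rows = conn.execute(sql).fetchall()
            if check(rows):
                return sql, history          # accepted fix
            obs = f"rows={rows!r}"           # wrong result, keep going
        except sqlite3.Error as e:
            obs = f"error={e}"               # execution error as feedback
        history.append((sql, obs))
        sql = propose_fix(sql, obs)          # model proposes a revision
    return None, history                     # budget exhausted
```

Feeding execution errors and wrong result sets back as observations is what lets the model recover from its own intermediate mistakes within the turn budget.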

The paper also provides extensive analysis of issue categories (recursive queries, JSON handling, window functions, etc.), dataset statistics (average query length, distinct test cases), and ablation studies confirming the contributions of SQL‑Rewind and f‑plan boosting. Limitations are acknowledged: the benchmark currently excludes cloud‑native dialects like Snowflake or BigQuery; automatically generated errors may not capture the full diversity of real user mistakes; and the evaluation focuses on functional correctness, leaving performance‑related aspects (e.g., query plan optimization) untouched.

In summary, this work establishes SQL debugging as a first‑class LLM research task, delivers a rigorously curated benchmark, proposes novel data‑generation and supervision techniques, and demonstrates that open‑source models can reach and even exceed the capabilities of leading commercial systems when equipped with targeted training pipelines. Future directions include expanding dialect coverage, incorporating non‑functional metrics, and exploring human‑in‑the‑loop debugging workflows.

