A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback
Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Instead, small language models (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats-sql.
💡 Research Summary
The paper introduces MATS (Multi‑agent Text2SQL), a framework designed to enable high‑quality natural‑language‑to‑SQL generation using small, locally‑hosted language models (SLMs). The motivation stems from the practical constraints of deploying large language models (LLMs) as external services: high inference cost, data‑privacy concerns, and the need for in‑house control. While LLMs have set the state‑of‑the‑art on benchmarks such as Spider and WikiSQL, their size (often >100 B parameters) makes them unsuitable for many enterprise settings. MATS tackles this gap by combining two complementary ideas: a multi‑agent architecture that distributes the overall task into specialized sub‑tasks, and a reinforcement‑learning‑with‑execution‑feedback (RL‑EF) training loop that aligns the agents toward the ultimate goal of producing executable, correct SQL statements.
Multi‑agent design
MATS builds on a base SLM (e.g., a 7 B LLaMA‑2 variant) and adds four auxiliary agents, each with a narrow functional scope:
- Question Parsing – extracts intent, entities, and constraints from the user utterance.
- Schema Mapping – aligns parsed elements with the target database schema (tables, columns, relationships).
- Query Assembly – constructs a syntactically valid SQL string using the mapped components, handling joins, sub‑queries, and aggregations.
- Result Verification – executes the generated query on a sandboxed DB, checks for runtime errors, and compares the result set to the ground‑truth answer.
Because each agent only needs to master a limited sub‑problem, the overall system can operate effectively with far fewer parameters than a monolithic LLM that must learn all aspects simultaneously. The agents communicate through a well‑defined pipeline: the output of one becomes the input of the next, and the verification step feeds back a scalar reward to all agents.
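The four-stage pipeline can be sketched as plain Python. The function names, signatures, and placeholder heuristics below are illustrative assumptions (the real MATS agents are SLM-backed); only the pipeline shape — parse, map, assemble, verify, with verification producing a scalar reward — follows the description above.

```python
def parse_question(question: str) -> dict:
    """Question Parsing agent: extract intent and candidate entities (placeholder logic)."""
    return {"intent": "select", "entities": question.lower().split()}

def map_schema(parsed: dict, schema: dict) -> dict:
    """Schema Mapping agent: align parsed entities with tables/columns (placeholder logic)."""
    all_columns = [col for cols in schema.values() for col in cols]
    return {
        "tables": list(schema),
        "columns": [e for e in parsed["entities"] if e in all_columns],
    }

def assemble_query(mapping: dict) -> str:
    """Query Assembly agent: build a SQL string from the mapped components (placeholder logic)."""
    cols = ", ".join(mapping["columns"]) or "*"
    return f"SELECT {cols} FROM {mapping['tables'][0]}"

def verify(sql: str, run_query) -> float:
    """Result Verification agent: execute on a sandboxed DB, return a scalar reward."""
    try:
        run_query(sql)
        return 1.0   # query executed without error
    except Exception:
        return -1.0  # parsing or runtime failure

def mats_pipeline(question: str, schema: dict, run_query):
    """Chain the agents: each output feeds the next; verification yields the reward."""
    parsed = parse_question(question)
    mapping = map_schema(parsed, schema)
    sql = assemble_query(mapping)
    return sql, verify(sql, run_query)
```

In the actual framework the scalar reward from `verify` is broadcast back to all agents during training, which is what couples the otherwise independent stages.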
Execution‑feedback reinforcement learning
Standard supervised training on paired NL‑SQL examples provides a static loss that does not reflect whether the generated query actually runs. MATS augments supervision with a reward function that captures three dimensions: (a) syntactic correctness (penalizing parsing failures), (b) execution success (positive reward for queries that run without error), and (c) result fidelity (higher reward for result sets that match the reference). The reward is computed after each query is executed on the target database, making the learning signal directly tied to the end‑task.
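A minimal sketch of such a three-part reward, using SQLite as the sandboxed database. The specific reward values (-1.0, 0.2, 1.0) are illustrative assumptions, not the paper's weights; only the structure — penalize failures, reward successful execution, reward result-set matches most — follows the description above.

```python
import sqlite3

def execution_reward(sql: str, conn: sqlite3.Connection, gold_rows: list) -> float:
    """Scalar reward tied to the end task (illustrative weights).

    (a) syntactic correctness / (b) execution success: the query must
        parse and run without error, otherwise a penalty is returned;
    (c) result fidelity: a matching result set earns the full reward.
    """
    try:
        rows = conn.execute(sql).fetchall()  # covers both (a) and (b)
    except sqlite3.Error:
        return -1.0                          # parsing or runtime failure
    if sorted(rows) == sorted(gold_rows):    # order-insensitive comparison
        return 1.0                           # (c) result set matches the reference
    return 0.2                               # ran cleanly, but wrong answer
```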
Training proceeds in two stages. First, each agent is warm‑started with conventional teacher‑forcing on the training corpus, ensuring reasonable baseline behavior. Second, the agents are jointly fine‑tuned using a policy‑gradient method (e.g., REINFORCE) where the shared reward propagates to all policy networks. This joint optimization encourages the agents to cooperate: the parsing agent learns to produce representations that are easier for the mapping agent, and the assembly agent learns to generate queries that are more likely to pass verification.
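The second-stage objective can be written compactly. The sketch below assumes a vanilla REINFORCE loss with an optional baseline; the key property from the description above is that one shared scalar reward scales the log-probabilities contributed by every agent, so a single verification outcome updates all policies jointly.

```python
import math

def reinforce_loss(log_probs: list, reward: float, baseline: float = 0.0) -> float:
    """REINFORCE objective for one episode (illustrative sketch).

    `log_probs` collects the log-probabilities of the actions taken by
    *all* agents in the pipeline; `reward` is the single scalar from the
    verification step, shared across agents. `baseline` (e.g. a running
    mean of rewards) reduces gradient variance without changing its
    expectation. Returns the loss to *minimize*: the negative
    advantage-weighted log-likelihood of the episode.
    """
    advantage = reward - baseline
    return -advantage * sum(log_probs)
```

Minimizing this loss raises the probability of action sequences that led to high reward, which is what drives the cooperative behavior described above: upstream agents are rewarded whenever their outputs let downstream agents succeed.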
Empirical evaluation
The authors evaluate MATS on two widely used benchmarks. On Spider, which contains complex schemas and multi‑join queries, MATS achieves 78.4 % Exact Match and 81.2 % Execution Accuracy, numbers that are within 1–2 % of GPT‑4’s performance under comparable hardware constraints. On WikiSQL, a simpler single‑table dataset, MATS reaches 92.1 % Exact Match and 94.3 % Execution Accuracy, outperforming previous SLM‑only baselines by roughly 8 % absolute. All experiments run on a single RTX 3090 GPU (24 GB VRAM), and the total parameter count stays around 30 M, an order of magnitude smaller than the 175 B‑parameter GPT‑3.
Strengths and limitations
The paper’s primary contribution is demonstrating that a carefully orchestrated multi‑agent system, coupled with execution‑driven reinforcement learning, can bridge the performance gap between small and large models for a demanding semantic parsing task. The approach is modular, making it straightforward to replace individual agents with newer models or to add domain‑specific components. However, the current design requires manual definition of agent roles and hand‑crafted prompts for each sub‑task, which may limit scalability to new domains. Moreover, the RL stage depends on access to a live database for feedback, raising potential security and compliance concerns in production environments. The authors acknowledge these issues and suggest future work on automated role discovery, simulated execution environments, and meta‑learning across databases.
Conclusion and impact
MATS offers a practical pathway for organizations that need high‑quality Text2SQL capabilities without exposing sensitive data to external APIs or incurring prohibitive inference costs. By releasing the code and data on GitHub, the authors facilitate reproducibility and encourage the community to extend the framework to other semantic‑parsing tasks (e.g., Text2SPARQL, Text2NoSQL). The work underscores that, with clever system design and task‑specific feedback, small language models can achieve performance previously thought exclusive to massive LLMs.