
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18622
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats-sql.

Full Content

Text2SQL, the task of translating natural language into SQL queries, is a long-standing research challenge [49,50,51,54]. While the increasing complexity of user queries and database schemas contributes to the task's difficulty [39], recent solutions based on Large Language Models (LLMs) have achieved notable results for Text2SQL [6,23,33,43]. Text2SQL approaches promise to enable non-experts to query databases using a natural language interface [30,45]. Most existing solutions for this task, however, rely on external LLMs offered as a service [28,41], primarily variants of OpenAI GPT, to generate SQL queries. These approaches combine user queries and schema representations with instructional text to generate SQL queries [22,40]. However, the use of external LLM services comes with drawbacks. Privacy concerns arise when sensitive data, such as database schemas or query logs, are shared with third-party platforms, potentially violating confidentiality or exposing data to security risks. Also, the recurring costs of such services can become a substantial financial burden, especially for small organizations.

To be independent of external LLM services, recent approaches for Text2SQL fine-tune open-source LLMs with instructional data [5,14,38]. Fine-tuning improves the task-specific model performance, but requires significant computational resources and technical expertise. In addition, even with models as large as 15 billion parameters, the accuracy obtained with open-source LLMs is considerably lower than the one achieved with external LLM services [16,19]. Striving for a cost-effective solution, one may adopt small language models (SLMs) with smaller numbers of parameters (typically 100M-5B) [25]. Such models are optimized to run on a single-node server that features a single GPU. While the use of SLMs improves the efficiency and, hence, the applicability of a Text2SQL solution, one faces challenges in terms of model effectiveness. SLMs often struggle with tasks that require deep reasoning or understanding of complex contexts, as required for Text2SQL [25]. Their limited capacity makes it challenging to maintain relationships between tables or fields, especially with large schemas, and they frequently miss nuances in natural language inputs, leading to syntactic or semantic errors in SQL generation.

In this paper, we follow the idea of using SLMs for Text2SQL and propose the Multi-Agent Text2SQL (MATS) framework to operationalize it. MATS employs a multi-agent mechanism [11,13,35] to decompose the Text2SQL task into sub-tasks, each handled by a specialized agent: a schema investigator filters irrelevant schema elements and retrieves relevant column values; a query planner generates multiple SQL queries step-by-step; a validator evaluates SQL outputs using database responses; a fix agent refines SQL based on validator feedback; and a selection agent, at the end of the pipeline, selects the best SQL query from the final candidates. To enhance the collaboration between the agents, we design a collaborative training scheme, coined Reinforcement Learning with Execution Feedback (RLEF). Unlike traditional Reinforcement Learning from Human Feedback [29], RLEF aligns the agents from multiple generated responses using automated database feedback, avoiding the need for costly human-labeled data.

The divide-and-conquer strategy realized in MATS is beneficial in terms of efficiency and effectiveness. Due to the specialization of agents and their focus on a single sub-task, the generalization capabilities of SLMs, which can be managed efficiently, are sufficient to yield high accuracy. At the same time, the integration of the agents using reinforcement learning enables our framework to effectively handle complex user queries and large-scale datasets. Furthermore, our framework facilitates the adaptation of open-source SLMs, thereby supporting wider applicability on resource-constrained devices and under restricted budgets. We summarize the contributions of our paper as follows:

• We propose a novel multi-agent framework in which specialized agents rely on SLMs to collaboratively solve Text2SQL tasks. The framework defines sub-tasks for element filtering, query planning, validation of query results, refinement of queries, and query selection.

• We introduce Reinforcement Learning with Execution Feedback (RLEF) as a mechanism that enables SLM agents to collaborate during training, significantly improving their performance in Text2SQL tasks. It relies on recent advancements in preference optimization [12] and instantiates them based on a sampling scheme for appropriate responses.

• We construct training data for SLM agents by extending the Spider and BIRD datasets through manual labeling, few-shot prompting, and fine-tuning, ensuring high-quality examples for robust learning.

• We evaluate MATS in comprehensive experiments and observe that it achieves results that are on par with large-scale LLMs, such as GPT-4o + CoT [20] and CHESS [44], while relying on significantly smaller models.

• Our source code and data are available at https://github.com/thanhdath/mats-sql.

Fig. 1 illustrates the main insight from our experiments: the relation between execution accuracy and the total number of model parameters. With a total model size of 9B, MATS is optimized for resource-constrained environments. Our experimental results show that MATS enables efficient inference without sacrificing performance, which renders it well-suited for cost-sensitive deployments.

In the remainder of the paper, §2 formulates the addressed problem. §3 outlines the key components of the MATS framework and their instantiation. §4 introduces our approach to Reinforcement Learning with Execution Feedback (RLEF). Evaluation results are presented in §5, before we review our contributions in the light of related work in §6 and conclude the paper in §7.

We first characterize the problem addressed in this work (§2.1), before elaborating on requirements for solutions to it (§2.2).

Text2SQL addresses the task of generating an SQL query Y that corresponds to a given natural language question q. This query is constructed based on a database schema S and, optionally, an external knowledge base K. The database schema S is defined by a set of tables {T_1, T_2, ..., T_m}, a set of columns {C_1, C_2, ..., C_n}, and a set of foreign key relations {R_1, R_2, ..., R_k}. The optional external knowledge base K provides context for the schema, aiding in generating more accurate SQL in ambiguous situations. Mathematically, the Text2SQL task is formulated as:

    Y = f(q, S, K | θ)

where the function f(· | θ) represents a generative model (e.g., a neural network) with learnable parameters θ.

We argue that any SLM-based solution for Text2SQL shall address the following requirements:

(R1) Large Database Schema. A Text2SQL solution shall handle large database schemas. This is challenging as the sheer number of tables and columns can exceed the model’s context length, impairing comprehension. Real-world schemas often include overlapping column names and extensive metadata, further complicating the respective task. For example, the BIRD dataset features databases with up to 65 tables and 455 columns, increasing the likelihood of errors in schema linking and SQL query generation [17].

(R2) Ambiguous Column Names and Values. A Text2SQL solution shall cope with ambiguity in column names and values, especially when multiple columns share similar meanings or overlapping values. For example, names of organizations may appear in different roles and, hence, as different columns in a database. Correctly linking queries to columns is then challenging and increases the risk of incorrect SQL generation.

(R3) Weak Reasoning Capability of SLMs. A Text2SQL solution based on SLMs needs to address the limited reasoning capabilities of the respective models. Specifically, chain-of-thought prompting, which enhances reasoning in large models, is less effective for SLMs and can even produce fluent, yet illogical reasoning outputs [47]. In Text2SQL tasks, this limitation becomes particularly important and SLMs have been observed to frequently generate inaccurate SQL queries for complex database schemas [17].

(R4) Low Instruction Following Capability. SLMs struggle with instruction following due to their limited parameter size [27], which must be accounted for in Text2SQL solutions that use these models. SLMs tend to overfit to specific training formats and typically lack exposure to diverse instruction-tuning datasets, such as InFoBench [36] or IFEval [55]. Hence, SLMs are limited in their generalization to new or varied instruction types, as well as to instructions that involve dependencies and sequential logic.

This section presents our Multi-Agent Text2SQL (MATS) framework. As illustrated in Fig. 2, it adopts the paradigm of multi-agent collaboration, i.e., it splits the Text2SQL task into sub-tasks that are handled by individual agents. Given a user query and database schema, the process begins with the Schema Insight Agent, which extracts relevant tables and columns, even when the query does not exactly match stored values. As such, this agent explicitly addresses the requirements of handling large database schemas (R1) and ambiguous column names and values (R2). Next, the Planner Agent decomposes the reasoning process into a chain of thoughts. It addresses the weak reasoning capabilities (R3) of SLMs by generating SQL candidates. The Validator Agent then evaluates these candidates and their execution results, identifying potential errors. Any detected issues are refined by the Fix Agent. Finally, the Selection Agent chooses the best SQL query based on execution responses.
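To make this control flow concrete, the sketch below wires the agents together in the order just described. It is a minimal illustration under our own naming assumptions: the agent callables, their signatures, and the feedback convention are not part of the released implementation.

```python
# Illustrative MATS-style orchestration (assumed names/signatures, not the released API).
from typing import Callable, Sequence


def run_pipeline(
    x: str,                                                      # question + pruned schema prompt
    plan_and_generate: Callable[[str], Sequence[str]],           # Planner: x -> K SQL candidates
    execute: Callable[[str], str],                               # run SQL, return response text
    validate: Callable[[str, str, str], str],                    # (x, sql, response) -> feedback, "" if none
    fix: Callable[[str, str, str, str], str],                    # (x, sql, response, feedback) -> corrected SQL
    select: Callable[[str, Sequence[str], Sequence[str]], str],  # (x, sqls, responses) -> best SQL
) -> str:
    candidates = list(plan_and_generate(x))                      # Planner Agent
    refined = []
    for sql in candidates:
        response = execute(sql)                                  # database feedback
        feedback = validate(x, sql, response)                    # Validator Agent
        refined.append(fix(x, sql, response, feedback) if feedback else sql)  # Fix Agent
    responses = [execute(sql) for sql in refined]
    return select(x, refined, responses)                         # Selection Agent
```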

In the remainder of this section, we elaborate on the realization of the individual agents (§3.1 - §3.5), before turning to the creation of training data for fine-tuning (§3.6), which also caters for the weak reasoning capabilities (R3) of SLMs.

Given a question posed by a user as input, the Schema Insight Agent filters out irrelevant schema elements and retrieves relevant column values. To this end, it includes two components: Schema Filtering and Value Matching. Schema Filtering eliminates tables and columns that are unlikely to contribute to generating the correct SQL query. In our framework, we adopt CodeS [19] for this purpose. It uses a bidirectional encoder to rank tables and columns, and discards those with low relevance to the given user question. Value Matching leverages the BM25 algorithm [1] to identify column values that closely align with the input query. This functionality is essential for selecting the appropriate columns for the generation of accurate SQL queries. Specifically, given an input question q, for each column c_i in the pruned schema C = {c_1, c_2, ..., c_n} that is obtained by Schema Filtering, we retrieve a set of candidate values V_{c_i} = {v_{i1}, v_{i2}, ..., v_{im}}. We compute the BM25 relevance score between the question and each value:

    score(q, v_{ij}) = BM25(q, v_{ij}),  for v_{ij} ∈ V_{c_i}

Then, we select the k values with the highest scores for each column:

    V*_{c_i} = top-k_{v ∈ V_{c_i}} score(q, v)

where typically k = 2. If no values yield a positive BM25 score, we select a representative example value from V_{c_i}. The selected values V*_{c_i} are then incorporated into the database schema prompt, providing the model with contextual cues for SQL generation.
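As a minimal illustration of the value-matching step, the sketch below scores one column's stored values against the question with BM25 and keeps the top-k positive matches. It assumes the rank_bm25 package as one possible BM25 implementation; the question and values are invented for the example.

```python
# Toy value matching with BM25 (rank_bm25 assumed as the BM25 implementation).
from rank_bm25 import BM25Okapi


def top_k_values(question: str, candidate_values: list[str], k: int = 2) -> list[str]:
    tokenized_values = [v.lower().split() for v in candidate_values]
    bm25 = BM25Okapi(tokenized_values)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(candidate_values, scores), key=lambda p: p[1], reverse=True)
    top = [v for v, s in ranked[:k] if s > 0]
    # Fall back to one representative example value if no value scores positively.
    return top if top else candidate_values[:1]


print(top_k_values("Which schools are located in Alameda County?",
                   ["Alameda County", "Contra Costa County", "Fresno County"]))
```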

The Planner Agent generates SQL queries by decomposing the reasoning process into small, step-by-step operations, enabling SLMs to construct accurate and well-structured queries. As the central component of the system, this agent is responsible for translating user questions into SQL queries to fulfill the given task.

We manually design a reasoning process based on few-shot examples that consists of three steps: 1) identifying the selection goal, 2) analyzing conditions for the WHERE clause, and 3) determining the necessary tables for the FROM and JOIN clauses. This systematic approach guides the model through the process of deciding what to select, which conditions to apply, and, thus, which tables to use. An example of the thought process is illustrated in Fig. 3.

Let x represent a data sample containing a question, a database schema, and, optionally, an external knowledge base. The Planner Agent generates a plan p and a SQL query s as follows:

    p ← π_p(x)    (1)
    s ← π_p(x, p)    (2)

In our framework, the Planner produces K SQL queries: one using greedy decoding and K-1 using multinomial decoding with a temperature T:

    S = {s_1, s_2, ..., s_K} ← π_p(x, p, T)    (3)
Figure 3: Example thought process of the Planner.
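The decoding scheme above (one greedy completion plus K-1 temperature samples) can be reproduced with a standard generation API. The sketch below uses Hugging Face transformers; the checkpoint name and token budget are placeholders, not necessarily the configuration used in the paper.

```python
# Sketch of generating K planner candidates: 1 greedy + (K-1) sampled completions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder for the fine-tuned Planner
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)


def generate_candidates(prompt: str, k: int = 10, temperature: float = 1.0) -> list[str]:
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    def decode(sequences):
        return [tok.decode(s[prompt_len:], skip_special_tokens=True) for s in sequences]

    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    sampled = model.generate(**inputs, do_sample=True, temperature=temperature,
                             num_return_sequences=k - 1, max_new_tokens=256)
    return decode(greedy) + decode(sampled)
```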

The Validator Agent re-evaluates the generated SQL query based on the response received from the database to identify errors. As such, the Validator not only verifies the Planner's reasoning but also detects issues that require modifications, e.g., related to syntax errors in the query or queries that yield empty responses. While the Validator resembles the Planner in that it also reasons about selection goals, relevant tables, and selection conditions, this redundancy is crucial, as LLMs often fail on their first attempt at a task [26]. The Validator uses the responses from query evaluation to detect discrepancies, to refine the SQL queries, and to provide targeted feedback when queries fail. Specifically, we employ two specialized validator agents.

Validator Selection. This validator assesses whether the generated SQL query accurately selects the correct columns based on the intent of the question. LLM-generated SQL queries often contain mistakes, such as the selection of incorrect or unnecessary columns, the omission of necessary ones, or the arrangement of columns in the wrong order.

We overcome these issues based on a dedicated thought process for the selection validator, as illustrated in Fig. 4. First, the validator checks the selected columns of the input SQL query. Next, it performs phrase extraction on the input question and maps the phrases to the columns that need to be selected. Finally, the validator compares the selected columns with the intended ones and flags potential issues.

This validator considers only queries that do not include operations such as MIN, MAX, COUNT, AVG, SUM, division, or CASE WHEN. The reason for this restriction is that multiple queries selecting different columns may still produce the same correct result, which could otherwise lead to misjudgment by the validator. At the end of the validation process, the validator determines whether the SQL query is correct or not.

Validator Condition. This validator aims at identifying mistakes related to logical conditions in SQL queries. In an SQL query, conditions can be used in the WHERE clause or the SELECT clause (e.g., queries using CASE WHEN or IF statements).

The thought process behind the validation mechanism for conditions is illustrated in Fig. 5. First, the validator extracts the condition from the SELECT clause and interprets its meaning. Then, it analyzes the condition in the WHERE clause. After interpreting both conditions, the validator evaluates the execution response. In most cases, if the execution response contains None or an empty set, it likely indicates an incorrect condition (e.g., filtering incorrect values or using the wrong column in the filter). The validator then identifies potential mistakes in the condition and suggests ways to fix them. For example, it may recommend adding conditions such as “column A IS NOT NULL” to filter out None results or correcting mismatched conditions that were misinterpreted by the LLM. Finally, the validator takes a decision on whether the SQL query is considered correct or not.
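The empty-result heuristic can be illustrated with a few lines of Python: the snippet below executes a candidate query against a toy SQLite table and flags empty or all-NULL results as a likely condition error. The table, the queries, and the wording of the feedback are invented for the example; in MATS, the validator is an SLM that reasons over such responses rather than a fixed rule.

```python
# Toy illustration of condition validation based on execution feedback (SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (name TEXT, county TEXT, charter INTEGER);
    INSERT INTO schools VALUES ('A High', 'Alameda', 1), ('B High', 'Fresno', NULL);
""")


def condition_feedback(sql: str) -> str:
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"
    if not rows or all(v is None for row in rows for v in row):
        return "empty/None result: the condition may filter a wrong value or column"
    return "no issue detected"


# A wrong literal yields an empty result, which the heuristic flags.
print(condition_feedback("SELECT name FROM schools WHERE county = 'alameda county'"))
print(condition_feedback("SELECT name FROM schools WHERE county = 'Alameda'"))
```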

Combined Validator. Let v_s, v_c denote the instructions for the validator selection and the validator condition, respectively. The validator agent π_v processes these instructions along with the input x, the SQL query s, and its corresponding execution response er, generating feedback signals as follows:

    f_s ← π_v(x, s, er, v_s),    f_c ← π_v(x, s, er, v_c)

Here, f_s and f_c represent the feedback for the validator selection and the validator condition, respectively.

The Fix Agent refines SQL queries using the feedback obtained from the Validator Agent. It processes the feedback to adjust and improve the SQL query, aiming at ensuring that no further errors occur in the final output.

The Fix Agent takes the initial input x, the SQL query s as generated by the Planner, and a set of feedback signals as input to generate a corrected SQL query. This process of the Fix Agent π_f is captured as:

    s_f ← π_f(x, s, {f_s, f_c})

where s_f is the corrected SQL query.

The Selection Agent is an SLM that chooses the best SQL query from multiple candidates based on the responses obtained when evaluating them over the database.

The Selection Agent takes a prompt containing a list of SQL queries S = {s_1, s_2, ..., s_K} and their execution responses ER = {er_1, er_2, ..., er_K} as input, and outputs the index of the best query (or indicates that no query is correct). When K > k, the agent splits S into smaller subsets (of size up to k), selects the best query in each subset, and repeats this process until only one candidate remains. Formally, the functionality of the Selection Agent π_s is captured as:

    s* ← π_s(x, S, ER)

where s* denotes the chosen query (and its index).
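The subset-and-repeat procedure can be sketched as a small tournament over candidate (SQL, response) pairs. The chooser callable stands in for the Selection Agent's SLM call and is an assumption for illustration; the trivial stand-in below simply prefers the first candidate with a non-empty response.

```python
# Sketch of tournament-style selection when K candidates exceed the subset size k.
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, str]  # (sql, execution_response)


def tournament_select(candidates: Sequence[Candidate],
                      choose: Callable[[Sequence[Candidate]], int],
                      k: int = 4) -> Candidate:
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), k):            # split into subsets of size <= k
            subset = pool[i:i + k]
            winners.append(subset[choose(subset)])  # best query per subset
        pool = winners                              # repeat until one candidate remains
    return pool[0]


best = tournament_select(
    [("SELECT 1", ""), ("SELECT name FROM singer", "[('Ann',)]"), ("SELECT *", "")],
    choose=lambda subset: next((i for i, (_, r) in enumerate(subset) if r), 0),
)
print(best)
```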

To mitigate limitations in the reasoning of SLMs, we devise an approach to generate training data for fine-tuning the models. To this end, we manually label a small set of training examples per task, enabling SLMs to adopt a structured reasoning process and reduce errors. For each task, we construct training data based on the Spider and BIRD datasets, as discussed in detail in our evaluation. Since these datasets only provide input questions x and ground-truth SQL queries ŝ, we extend them with additional annotations: the planning process p for the Planner, validator feedback v_s, v_c for the different validation types, and corrected SQL queries based on the feedback. The data creation process follows three structured steps:

  1. Manual Labeling: We annotate up to five representative examples for each task manually. These high-quality examples act as references to guide each agent in performing small, incremental reasoning steps, thereby minimizing error propagation.

  2. Few-Shot Prompting: Building on the manually labeled examples, we use few-shot prompting techniques with OpenAI’s GPT-4o-mini to generate additional training samples for the rest of the dataset. This approach ensures broader coverage of various scenarios in the dataset.

  3. Fine-Tuning of SLMs: We fine-tune SLMs using the prompt-response pairs generated during the few-shot prompting phase. The fine-tuning process focuses on optimizing the completion part of the model's output. Here, we adopt a supervised fine-tuning loss, which is computed as:

    L_SFT = - (1 / (τ - C)) · Σ_{t=C+1}^{τ} log P_θ(y_t | χ, y_{C+1:t-1})

where C is the token index marking the end of the prompt; τ is the total number of tokens in the sequence, including the prompt and the completion tokens; y_t denotes the target token at position t within the completion; χ represents the prompt tokens; and P_θ is the probability distribution over the vocabulary predicted by the model with parameters θ.
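In practice, restricting the loss to the completion is usually implemented by masking the prompt positions out of the label sequence. The sketch below shows this with the conventional -100 ignore index of PyTorch's cross-entropy; the token ids and the random "model output" are toy stand-ins.

```python
# Sketch of completion-only SFT: prompt tokens are masked with -100 and ignored.
import torch
import torch.nn.functional as F


def completion_only_labels(prompt_ids: list[int], completion_ids: list[int]) -> torch.Tensor:
    return torch.tensor([-100] * len(prompt_ids) + completion_ids)


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift by one position, as in causal language modeling.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)


vocab_size, prompt_ids, completion_ids = 100, list(range(10)), [11, 12, 13, 14, 15]
labels = completion_only_labels(prompt_ids, completion_ids)
logits = torch.randn(len(labels), vocab_size)  # stand-in for model outputs
print(sft_loss(logits, labels))
```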

In the light of the limited capabilities of SLMs in terms of reasoning (R3) and instruction following (R4), we propose an approach for Reinforcement Learning with Execution Feedback (RLEF) to further refine the agents of the MATS framework (Fig. 6). RLEF targets cases where agents fail to generate correct SQL queries, and relies on feedback from the evaluation of the queries to identify issues and explore effective corrections. Below, we first introduce a respective model for the various agents (§4.1), before we target the identification of actions to train them (§4.2). Finally, we elaborate on the actual training process (§4.3).

Given a user question and a database schema, we aim to generate an SQL query that best fulfills the intent. We formulate the task as a goal-augmented Partially Observable Markov Decision Process M = (S, A, T, R, G, O). The process definition includes the following components:

• S is a set of states;

• A ⊂ V^L represents the action space, sampled from the language model's vocabulary V, with L as the maximum length of the generated text;

• T : S × A → S is the transition function;

• G ⊂ V^N denotes the goal space;

• R : S × A × G → ℝ represents the goal-conditioned reward function; and

• O is a set of observations o ∈ O.

Note that the design of the reward function depends on the specific algorithm used for training. For instance, the DPO [37] and ORPO [12] algorithms do not require an explicit reward model, as the reward is inherently derived from the language model itself [37]. In this study, we adopt the ORPO algorithm, so that, in the remainder, we use the terms 'chosen action' and 'rejected action' to align with the algorithm's framework.

We instantiate the process for the different types of agents in the MATS framework, as follows:

For the Planner, the goal, observations, and actions are defined as:

• G is the set of completions containing correct SQL queries.

• The observation o_p ∈ O_P is the input sample x.

• The chosen action is a completion with a correct SQL query matching the ground-truth execution; the rejected action is a query resulting in a syntax error or differing execution (a minimal sketch of this execution-based labeling follows the agent instantiations below).

For the Validator, the instantiation is done as follows:

• G is the set of completions that enable the Fix Agent to modify the SQL query to a correct version.

• The observation o_v ∈ O_V is the input prompt to the validator, consisting of x, the execution response of the SQL query, and a validation instruction v_i ∈ {v_s, v_c}.

• The chosen action is feedback that correctly identifies issues and assists the Fix Agent in modifying the SQL query. The rejected action is feedback that is either incorrect or insufficiently useful for the Fix Agent.

For the Fix Agent, we obtain the following realization:

• G is the set of correct SQL queries.

• The observation o_f ∈ O_F consists of the input sample x, an execution response, and a list of feedback signals indicating that the SQL is incorrect.

• The chosen action is an SQL query that matches the ground-truth result. The rejected action is a query that results in a syntax error or deviates from the ground truth.
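As announced above, a minimal sketch of the execution-based labeling for the Planner follows. It runs each sampled candidate against a toy SQLite database and compares the order-insensitive result with the ground-truth query's result; the schema and queries are invented for the example.

```python
# Toy construction of chosen/rejected planner completions from execution feedback.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES ('Ann', 'IT', 90), ('Bob', 'HR', 70);
""")


def run(sql: str):
    try:
        return sorted(conn.execute(sql).fetchall()), None
    except sqlite3.Error as exc:
        return None, str(exc)


def label_candidates(gold_sql: str, candidates: list[str]):
    gold_rows, _ = run(gold_sql)
    chosen, rejected = [], []
    for sql in candidates:
        rows, err = run(sql)
        (chosen if err is None and rows == gold_rows else rejected).append(sql)
    return chosen, rejected or [""]  # empty-string fallback if nothing was rejected


print(label_candidates(
    "SELECT name FROM emp WHERE dept = 'IT'",
    ["SELECT name FROM emp WHERE dept = 'IT'",   # matching result  -> chosen
     "SELECT name FROM emp WHERE dept = 'it'",   # empty result     -> rejected
     "SELEC name FROM emp"],                     # syntax error     -> rejected
))
```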

The action space per agent is very large, with a size of |V|^L, so that the identification of effective actions to train the agents is important. However, random sampling or relying solely on the agent's exploration of the action space will generally not yield effective actions. We therefore adopt two strategies for action generation:

• Multinomial Decoding Strategy: We use a temperature parameter T to introduce controlled randomness during the decoding process of the text generated by SLMs. This helps the trained agent in exploring more effective actions.

• Advanced Agent Assistance: A larger, well-trained language model generates higher-quality actions for the task.

Using a set of chosen-rejected pairs from our sampling strategy, we improve each agent with ORPO [12]. We choose ORPO for its streamlined approach, which simplifies training and reduces computational complexity compared to methods like PPO [42] or DPO [37]. By integrating the supervised fine-tuning loss with an odds ratio-based penalty, ORPO preserves the accuracy gains from supervised fine-tuning (SFT), while aligning model outputs to preferences. This effectively avoids the catastrophic forgetting that is commonly observed when reinforcement learning follows SFT.

The ORPO algorithm introduces a monolithic preference alignment approach that eliminates the need for both a reference model and a reward model, thereby streamlining the preference optimization process. It employs a loss function that integrates the supervised fine-tuning loss with a log-odds ratio penalty, enabling the model to effectively differentiate between chosen and rejected actions. By default, ORPO computes the SFT loss over the entire prompt and completion. However, we modify the loss function to apply the SFT loss only to the completion, allowing the model to focus more directly on generating high-quality outputs rather than learning the structure of the prompt.

Given the training set D, the assist agent A_P, the temperature T, and the number of training iterations T, the Planner policy π_p is refined as follows:

    for i = 1, ..., T do
        D_p ← ∅
        for (x, ŝ) ∈ D do
            true_response ← execute(ŝ)
            A ← Sampling(x, π_p, A_P, T)
            C ← ∅, R ← ∅                      ⊳ initialize chosen/rejected actions
            for a ∈ A do
                pred_response ← execute(a)
                if true_response = pred_response then
                    C ← C ∪ {a}
                else
                    R ← R ∪ {a}
            if R = ∅ then
                R ← {""}                       ⊳ add empty string if R is empty
            for c ∈ C do
                for r ∈ R do
                    D_p ← D_p ∪ {(x, c, r)}
        Update policy π_p using ORPO: π_{p,i+1} ← ORPO(π_p, D_p)
    return π_{p,T+1}

The objective function of ORPO is defined as:

    L_ORPO = E_{(x, y_w, y_l)} [ L_completion + λ · L_OR ]

where:

• L_completion is the standard supervised fine-tuning loss that is computed only on the completion part, implemented as the negative log-likelihood of generating the favored response y_w:

    L_completion = - (1 / |y_w|) · Σ_t log P_θ(y_{w,t} | x, y_{w,<t})

• L_OR is the log odds ratio penalty, encouraging the model to increase the odds of generating favored responses over disfavored ones:

    L_OR = - log σ( log( odds_θ(y_w | x) / odds_θ(y_l | x) ) )

• odds_θ(y | x) is the likelihood odds of generating the sequence y, which is computed as:

    odds_θ(y | x) = P_θ(y | x) / (1 - P_θ(y | x))

• λ is a hyperparameter controlling the weight of the log odds ratio penalty.
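A compact PyTorch sketch of this modified objective is given below. It assumes length-normalized sequence log-probabilities when forming the odds, logits that are already aligned with their target labels, and a batch layout of our own choosing; it is an illustration, not the authors' training code.

```python
# Sketch of the completion-only ORPO objective: L = L_completion + lambda * L_OR.
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor,
                     completion_mask: torch.Tensor) -> torch.Tensor:
    """Average log-probability over completion tokens only (prompt positions masked)."""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1)


def orpo_loss(chosen_logits, chosen_labels, chosen_mask,
              rejected_logits, rejected_labels, rejected_mask, lam: float = 0.5):
    logp_w = sequence_logprob(chosen_logits, chosen_labels, chosen_mask)
    logp_l = sequence_logprob(rejected_logits, rejected_labels, rejected_mask)
    # Completion-only SFT term: NLL of the favored (chosen) completion.
    loss_completion = -logp_w.mean()
    # log odds(y|x) = log P - log(1 - P), with P the length-normalized likelihood.
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w).clamp(max=1 - 1e-6))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l).clamp(max=1 - 1e-6))
    loss_or = -F.logsigmoid(log_odds_w - log_odds_l).mean()
    return loss_completion + lam * loss_or


# Toy shapes: batch of 2 sequences, 8 positions, vocabulary of 50 tokens.
B, T, V = 2, 8, 50
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1]] * B, dtype=torch.float)
print(orpo_loss(torch.randn(B, T, V), torch.randint(V, (B, T)), mask,
                torch.randn(B, T, V), torch.randint(V, (B, T)), mask))
```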

Datasets. We rely on two English Text2SQL benchmarks: Spider [52] and BIRD [21].

• Spider is a popular benchmark for NL2SQL.

Evaluation Metrics. For the Spider benchmark, we measure the execution accuracy (EX). It evaluates whether the predicted and ground-truth SQL queries yield the same execution results on the database. However, EX can occasionally produce false positives, when incorrect SQL queries accidentally yield the same results as the ground truth. To address this, the Spider benchmark also utilizes the Test-Suite accuracy (TS) [19]. It assesses whether the predicted SQL query passes execution across multiple database instances generated through automated augmentations, making it a more reliable metric by reducing false positives.

The BIRD benchmark primarily relies on execution accuracy (EX) as its evaluation metric. Additionally, BIRD introduces the Valid Efficiency Score (VES) to evaluate the execution efficiency of correctly generated SQL queries. Unlike EX, in VES, the score for a correct SQL query is calculated as the execution time of the ground-truth divided by the execution time of the predicted SQL. If the predicted SQL is more efficient, its VES score surpasses EX.
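The two metrics, as described above, can be mimicked with a few lines against a toy SQLite database: EX compares execution results, and VES weights a correct query by the ratio of ground-truth to predicted runtime. This follows the textual description only; the official benchmark implementations differ in details such as result comparison and averaging.

```python
# Toy computation of EX and VES as described in the text (SQLite, single query pair).
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])


def timed_run(sql: str):
    start = time.perf_counter()
    rows = sorted(conn.execute(sql).fetchall())
    return rows, time.perf_counter() - start


def ex_and_ves(pred_sql: str, gold_sql: str):
    pred_rows, pred_time = timed_run(pred_sql)
    gold_rows, gold_time = timed_run(gold_sql)
    ex = int(pred_rows == gold_rows)                 # same execution result?
    ves = (gold_time / pred_time) if ex else 0.0     # runtime ratio for correct queries
    return ex, ves


print(ex_and_ves("SELECT a FROM t WHERE a < 10", "SELECT a FROM t WHERE a < 10 ORDER BY a"))
```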

Baselines. We compare our method with the following baselines:

• Open-Source Fine-Tuning Models. T5 [21] is a fine-tuned baseline for Text2SQL tasks, tested at various scales (Base, Large, 3B). CodeS [19] is an open-source model (1B-15B parameters) incrementally pre-trained on NL2SQL data to specialize in Text2SQL. We do not compare with DeepSeek-R1, even though it is considered an open-source model and has a large parameter size (671B), because its API was temporarily unavailable at the time of our research due to excessive traffic.

• Closed-Source API-Based Methods. ChatGPT + CoT [21] enhances SQL generation with chain-of-thought reasoning for improved multi-step problem-solving. DIN-SQL [34] adopts a decomposed in-context learning approach, breaking queries into subtasks such as schema linking, query classification, and self-correction. DAIL-SQL [9] integrates prompt engineering with systematic fine-tuning, optimizing example selection based on question-query similarity and a balanced organizational strategy. MAC-SQL [46] employs a multi-agent framework featuring a Decomposer for sub-question processing. Finally, CHESS [44] utilizes a modular pipeline that includes entity/context retrieval, schema selection, and SQL generation, leveraging LLMs for schema pruning to minimize noise.

Setup. The Schema Insight Agent uses RoBERTa-large [24] (355M parameters), while the Planner and Selection Agents are fine-tuned on LLaMA-3.2 3B. The Validator and Fix Agents are fine-tuned on LLaMA-3.2 1B, resulting in a total parameter size of 9B for MATS. All agents, except the Schema Insight Agent, utilize bfloat16 precision and FlashAttention [3].

For inference, all experiments were conducted on a machine equipped with an A5000 GPU featuring 24GB of memory. The models were deployed using the vLLM framework [15]. For supervised fine-tuning, we set the learning rate to 2.0 × 10^-5 and the batch size to 128, and train the model for 4 epochs. For ORPO training, we set the learning rate to 5 × 10^-6, λ to 0.5, and the batch size to 64, and train for 1 epoch or at most 800 steps. For the Planner agent, we sample K = 10 solutions, one by greedy decoding and the others by multinomial sampling with a temperature T = 1.0.
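For deployment, the same sampling setup can be reproduced with vLLM; the sketch below loads a checkpoint in bfloat16 and issues one greedy and nine sampled requests per prompt. The checkpoint path, the prompt, and the token budget are placeholders.

```python
# Sketch of serving a planner checkpoint with vLLM in bfloat16 (K = 10 candidates).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-planner-3b", dtype="bfloat16")  # placeholder path

greedy = SamplingParams(temperature=0.0, max_tokens=256)
sampled = SamplingParams(temperature=1.0, n=9, max_tokens=256)

prompt = "..."  # question + pruned schema + few-shot thought examples
candidates = [o.text for o in llm.generate([prompt], greedy)[0].outputs]
candidates += [o.text for o in llm.generate([prompt], sampled)[0].outputs]
print(len(candidates))  # 10 SQL candidates for the validation/selection pipeline
```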

Table 1 and Table 2 present the performance of MATS in comparison to other baselines. In these tables, CHESS_open-source refers to the CHESS method that uses Llama-3-70B and fine-tuned DeepSeek in its workflow.

Accuracy. MATS shows strong performance on both the Spider and BIRD datasets. On the Spider development set, MATS reaches an EX% of 87.1, outperforming other methods, including larger models such as CodeS-15B and MAC-SQL + GPT-4. On the BIRD development dataset, MATS achieves an EX% of 64.73 and a VES% of 66.75. This performance is comparable to CHESS, which incorporates multiple modules leveraging OpenAI's GPT-4-turbo. Overall, MATS outperforms all open-source models in both EX% and VES%, while being on par with closed-source methods that rely heavily on proprietary models, such as OpenAI GPT.

Time. We compare inference time only with other open-source fine-tuning methods, as closed-source approaches host LLMs on unknown physical devices. We note that the inference time of MATS is slower than that of CodeS-15B due to (1) structural differences between Llama and CodeS, and (2) the presence of multiple agents in MATS. Additionally, the inference time of MATS on Spider and BIRD differs significantly, averaging 13.9 s/sample on Spider compared to 22.03 s/sample on BIRD. This is because BIRD contains many large databases, leading to higher execution times.

Model Complexity. The autoregressive models are served using vLLM, with memory requirements determined by model size and floating-point precision, as shown in Table 3. The MATS Planner requires 7.6 GB of GPU memory, comparable to CodeS-3B, while the full MATS framework requires 21.2 GB, so that, unlike CodeS-15B, it can be deployed on a single GPU. While the MATS framework has higher memory demands than most of the other open-source methods, it delivers superior execution accuracy and inference efficiency, as shown in Table 1 and Table 2.

In this experiment, we evaluate the execution accuracy of MATS on noisy and more realistic questions using Spider variants and the Dr.Spider datasets. Table 4 showcases the performance of MATS on Spider-Syn, Spider-Realistic, and Spider-DK, highlighting its capability to handle diverse and challenging query variations.

To further assess MATS's robustness, we evaluate it on Dr.Spider. Recall that this benchmark comprises 17 types of perturbations across questions, database schemas, and SQL queries, including schema-synonym replacements, value synonyms, and abbreviation updates. Note that our model is trained exclusively on the original Spider dataset, i.e., without any question or database augmentation.

Table 5 compares the execution accuracy of MATS with other baseline approaches. MATS outperforms all baselines on DB and SQL perturbations. However, on NLQ perturbations, MATS achieves a slightly lower accuracy (75.5%) compared to CodeS-15B (76.3%). This highlights the limitations of SLMs in understanding natural language, particularly when handling synonyms and linguistic variations. Overall, MATS proves robust and adaptable to noisy, realistic queries, making it effective for real-world Text2SQL tasks.

Number of candidates. As the Planner produces multiple responses, increasing the number of candidates may improve the chances of finding the correct SQL query. We explore the selection effectiveness as follows: we evaluate the upper bound by measuring recall when generating K SQL queries, and we assess the Selection Agent's accuracy in identifying the correct one.

Fig. 7a presents the execution accuracy of MATS on the BIRD development dataset as the number of candidates increases. The upper bound represents the recall, indicating the maximum possible accuracy if the correct query is always included in the candidate set. Specifically, if at least one of the K generated candidates contains the correct SQL query, the sample is considered correct. As expected, increasing K improves recall, leading to a higher upper bound. However, the accuracy of MATS does not consistently follow the same trend. While recall reaches 77.8% at K = 30, MATS achieves only 62.78%, highlighting a gap between potential and actual execution accuracy. The reason is likely that more incorrect SQL queries are generated, increasing the likelihood that the Selection Agent chooses an incorrect query.

Hyperparameter λ. We vary λ to balance the SFT loss and the log odds ratio penalty. Fig. 7b presents the impact of varying λ on EX% and the number of syntax errors after fine-tuning the Planner with RLEF Iteration 1. In this figure, λ = 0 means that no log odds ratio penalty is applied and only SFT is used. Introducing λ > 0 greatly improves accuracy, reaching a peak at λ = 0.25. Meanwhile, the number of syntax errors decreases significantly for λ > 0.2. However, our results also highlight that the choice of λ needs to be made carefully to balance accuracy and syntactic correctness.

Temperature for candidate generation. Fig. 7c reports the mean and standard deviation (std) of EX% across different temperature settings during candidate generation. Lower temperatures exhibit lower std, indicating more stable outputs. However, they also make it harder to find a correct solution, as the generated candidates are less diverse. In contrast, higher temperatures increase exploration, making it easier to generate correct solutions among the candidates, though at the cost of greater variability. Note that even with temperature 0.0, the results remain slightly non-deterministic due to underlying sampling mechanisms.

Fig. 8 illustrates the impact of each RLEF iteration on the execution accuracy of both the Planner Agent and the entire framework. For the Planner Agent, we report accuracy using greedy decoding. The results demonstrate that RLEF significantly enhances the ability to generate correct SQL queries. On the BIRD dev dataset, RLEF improves the Planner Agent’s EX% from 53.65 to 59.32, while on the Spider dev dataset, it increases from 83.3 to 85.4. The entire framework, MATS, also benefits from RLEF, exhibiting notable improvements across iterations. On Spider dev, MATS progresses from an initial EX% of 85.5 to 87.1, while on BIRD dev, it improves from 59.06 to a peak of 64.73 before slightly stabilizing. These findings confirm that execution feedback is both helpful and essential for improving the execution accuracy of small language models.

We evaluate the effectiveness of different algorithms, including PPO, DPO, and ORPO, in aligning the Planner policy. Here, ORPO_original is the original version of ORPO, and ORPO_ours is our modified version of ORPO with the loss computed on the completion part only (see §4). For PPO, we use binary rewards, assigning 1 for correct execution and 0 otherwise. For DPO, we set β = 0.1 with a learning rate of 5 × 10^-6. All models are trained for one epoch.

As shown in Table 6, our modified ORPO achieves the highest execution accuracy (56.32%), outperforming PPO and DPO. Additionally, ORPO_ours consumes less GPU memory than ORPO_original, while achieving better execution accuracy. PPO provides a slight improvement over the SFT baseline, increasing accuracy from 53.65% to 54.95%, but incurs a significantly higher training time (12.2 hours per epoch). DPO, in contrast, suffers from catastrophic forgetting, causing accuracy to drop to 44.98%. Even if this issue were mitigated, its GPU memory consumption (67 GB) and training time (1.8 hours) would remain higher than those of ORPO.

Notably, ORPO eliminates the reference model, reducing GPU memory usage by nearly half. This enables larger batch sizes, resulting in faster training (1.5 hours per epoch) while maintaining compatibility with limited GPU resources, making it a more practical choice in constrained environments.

In this experiment, we compare different prompting strategies for supervised fine-tuning of the Planner. The input for supervised fine-tuning includes both the prompt and its corresponding completion. We construct three supervised fine-tuning datasets with distinct prompting strategies: (1) No Thought, where the ground-truth SQL queries from the BIRD train dataset are directly used as completions;

(2) Chain-of-Thoughts, where SQL queries are generated using CoT prompting on GPT-4o-mini and used for fine-tuning; and (3) Few-shot Thoughts, where our proposed data generation mechanism, as described in §3.6, is used. The results in Fig. 9 show that Few-shot Thoughts (3) consistently outperforms CoT and No Thought. For simple queries, adding thought-based prompting provides only a slight improvement, with execution accuracy increasing from 59.46% (1) to 60.11% (2) and 63.03% (3). However, for more complex queries, the impact is more significant. For moderate queries, Few-shot Thoughts achieves 43.75%, improving by 6.25% over No Thought. For challenging queries, Few-shot Thoughts (3) reaches 38.62%, compared to 37.93% for CoT and 30.34% for No Thought, highlighting the importance of well-structured thought processes for SQL generation.

As shown in Fig. 10, our method, MATS, achieves the best overall EX% across database domains in the Spider benchmark. While MATS performs better than existing baselines such as DAIL-SQL(SC) and RESDSQL-3B + NatSQL in domains like Competition and Social, its performance is slightly lower in certain domains such as Geography. This may be because the larger models used in DAIL-SQL, as well as CodeS-7B and CodeS-15B, have better background knowledge of Geography. Despite these variations, MATS demonstrates strong generalization across different domains.

Fig. 11 and Fig. 12 compare EX% across SQL characteristics on the Spider and BIRD datasets. MATS (Ours) consistently surpasses baselines like DAIL-SQL(SC), CodeS-15B, and RESDSQL-3B, especially on simpler queries without JOINs, ORDER BY, or logical connectors. However, subqueries remain a major challenge for all methods, with EX% dropping by 10-30% compared to non-nested queries. While MATS outperforms other approaches in this aspect, it still experiences a noticeable decline, indicating the inherent difficulty of handling complex SQL structures.

While we already discussed existing methods for Text2SQL, we here focus on work that is related to the underlying techniques of MATS, i.e., reinforcement learning, preference alignment, and agent collaboration.

Reinforcement Learning with Human Feedback. Reinforcement Learning with Human Feedback (RLHF) is a widely used approach for aligning language models with human preferences by integrating human evaluations directly into the training process. In RLHF, a reward model is trained on human-labeled data to score outputs based on their alignment with user intent. This reward signal is then used by reinforcement learning algorithms, such as Proximal Policy Optimization (PPO) [42], to optimize the model's behavior. RLHF has been instrumental in enabling language models to generate responses that are more aligned with human expectations and preferences. For example, this methodology has been successfully applied to instruction-following models like InstructGPT, significantly improving their capability to adhere to user guidelines while performing a broad spectrum of tasks [29]. This technique bridges the gap between general-purpose pretraining and task-specific utility, solidifying RLHF's role as a pivotal tool in modern AI systems.

Inspired by this paradigm, we propose a new reinforcement learning mechanism using execution feedback (RLEF) to further refine the MATS agents. By using the responses of query execution, we avoid the human labeling effort required by RLHF.

Alignment without Reward Models. To overcome the complexities and instability associated with Reinforcement Learning with Human Feedback (RLHF), particularly the challenges posed by Proximal Policy Optimization (PPO) [42], alternative alignment methods have been proposed. Direct Preference Optimization (DPO) simplifies the alignment process by integrating reward modelling directly into the preference learning stage, eliminating the need for multi-stage training [37]. ORPO [12] builds on DPO by integrating an odds ratio-based penalty into supervised fine-tuning. ORPO combines the SFT loss with a relative ratio loss, streamlining training by removing the need for separate reward models or alignment stages, thus enhancing efficiency and scalability.

In this paper, we propose a new application of ORPO in Text2SQL tasks by imposing the loss on the completion part only. This keeps the framework lightweight while improving the accuracy of the agents and simplifying the training process, without requiring a separate reward model.

Multi-Agent Systems. Recent advancements in multi-agent systems have demonstrated their effectiveness in solving complex tasks by breaking them down into smaller, specialized subtasks, each handled by an individual agent [32,48]. These systems leverage the capabilities of Large Language Models to enable efficient collaboration and communication among agents, leading to improved task accuracy and efficiency. Multi-agent frameworks have been successfully applied across various domains, such as software engineering, where tools like ChatDev [35] facilitate collaborative coding and debugging, and societal simulations, where multi-agent systems model human interactions to evaluate policy outcomes [31]. Moreover, multi-agent systems have also been applied in recommendation systems, where agents work collaboratively to process user preferences, filter content, and generate personalized recommendations through effective coordination and task allocation [53].

Inspired by these advancements, our framework adopts a multi-agent approach for Text2SQL tasks, assigning dedicated agents for schema filtering, query generation, validation, and error correction. This collaborative structure significantly enhances SQL generation accuracy, especially for small language models, while improving scalability through reinforcement learning techniques.

In this work, we introduced MATS, a novel multi-agent framework designed for Text2SQL tasks. The framework is optimized for small language models. Unlike traditional LLM-based approaches, MATS efficiently decomposes SQL generation into specialized agent roles, allowing for effective execution in resource-constrained environments. That is, agents filter irrelevant schema elements and retrieve relevant column values; they generate multiple SQL queries; they evaluate their outputs using database responses; they refine SQL queries; and they eventually select the best SQL query from a set of candidates. Moreover, to further enhance performance, we proposed Reinforcement Learning with Execution Feedback, which improves SQL generation by leveraging execution results as feedback to guide agent alignment. Through extensive evaluations on the BIRD and Spider datasets, we demonstrated that MATS achieves execution accuracy comparable to larger, closed-source LLMs while being deployable on a single 24GB GPU. Our results indicate that MATS outperforms other open-source models in execution accuracy and robustness, narrowing the gap to closed-source solutions that employ proprietary, external models.

Despite these advancements, challenges remain in further improving accuracy. Future work shall explore enhancing agent collaboration and refining selection strategies to maximize execution accuracy. We may also improve the responses of the database system itself to increase the quality of execution feedback, helping the agents learn faster and better. Additionally, we will integrate broader domain-specific knowledge to further improve SQL generation for real-world applications.

