MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS-Lab/MATA.
💡 Research Summary
The paper introduces MATA, a multi‑agent framework designed to make Table Question Answering (TableQA) both reliable and efficient, especially in settings where resources are limited or privacy concerns preclude the use of large proprietary language models. MATA separates its system into lightweight “tools” (under 500 M parameters) and heavyweight “agents” (3 B+ parameters). The tools consist of a Scheduler, a Confidence Checker, and a Format Matcher. The Scheduler, built from MobileBERT and a small MLP (≈24 M parameters), examines table metadata (size, schema, data types) and the semantic content of the question to decide whether a Program‑of‑Thought (PoT) or a text‑to‑SQL (t2SQL) reasoning path should be executed first. Simultaneously, a Chain‑of‑Thought (CoT) agent generates a pure‑text answer and reasoning trace, but is invoked only once to keep costs low.
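To make the Scheduler's role concrete, here is a minimal sketch of routing a (table, question) pair to PoT or t2SQL first. The paper's Scheduler is a fine-tuned MobileBERT encoder with a small MLP head; the hand-written features and weights below are purely illustrative stand-ins for that learned model.

```python
# Hypothetical sketch of MATA's Scheduler. The real component is a
# MobileBERT + MLP classifier (~24M parameters); this toy scorer only
# illustrates the kind of metadata-plus-question signal it consumes.

def table_features(table, question):
    """Cheap metadata features: size, numeric ratio, and a crude
    signal for aggregation-style questions."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    numeric = sum(
        1 for row in table for cell in row
        if isinstance(cell, (int, float))
    )
    total = max(n_rows * n_cols, 1)
    agg_words = {"sum", "average", "count", "how many", "total"}
    has_agg = any(w in question.lower() for w in agg_words)
    return n_rows, numeric / total, has_agg

def schedule(table, question):
    """Return 't2sql' or 'pot' as the reasoning path to execute first.
    The weights are illustrative, not the paper's learned parameters."""
    n_rows, numeric_ratio, has_agg = table_features(table, question)
    # Intuition: SQL suits large, mostly numeric tables with
    # aggregation queries; Python (PoT) suits free-form manipulation.
    sql_score = (0.4 * numeric_ratio
                 + (0.4 if has_agg else 0.0)
                 + 0.2 * min(n_rows / 100, 1.0))
    return "t2sql" if sql_score >= 0.5 else "pot"

table = [[2019, 12.5], [2020, 14.1], [2021, 15.8]]
print(schedule(table, "What is the total revenue across all years?"))  # → t2sql
```

The key design point survives even in this toy form: the routing decision costs no LLM call at all, so the expensive PoT/t2SQL agents are only invoked in the order most likely to succeed.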
The PoT and t2SQL agents generate Python code and SQL queries, respectively. Because code generation often yields syntactic or runtime errors, each is paired with a dedicated Debug Agent (a Python Debug Agent for PoT, an SQL Debug Agent for t2SQL). These debug agents iteratively refine the generated code for up to N = 3 cycles, halting early once the code stabilizes and produces the same execution result. All candidate answers—including the CoT answer and the outputs of the debug loops—are fed to the Confidence Checker, a fine‑tuned DeBERTaV3‑large model (≈435 M parameters) that assigns a confidence score to each reasoning path.
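The debug loop for the PoT path can be sketched as follows. Here `refine` stands in for one call to the LLM Debug Agent, and the convention of returning the answer in a `result` variable is an assumption for illustration, not the paper's actual prompt format.

```python
# Minimal sketch of the PoT debug loop: rewrite failing code for up to
# N = 3 cycles, stopping early when two consecutive executions yield
# the same result. `refine(code, error)` models one Debug Agent call.

N_CYCLES = 3

def run_python(code):
    """Execute generated code in an isolated namespace; the answer is
    expected in a variable named `result` (an illustrative convention)."""
    env = {}
    try:
        exec(code, env)
        return env.get("result"), None
    except Exception as exc:
        return None, repr(exc)

def debug_loop(code, refine):
    """Return the stabilized execution result, or the last one seen."""
    prev_result = None
    for _ in range(N_CYCLES):
        result, error = run_python(code)
        if error is None and result == prev_result:
            break                       # stabilized: same result twice
        prev_result = result
        if error is not None:
            code = refine(code, error)  # ask the Debug Agent for a fix
    return prev_result
```

The early-exit condition is what keeps the loop cheap: well-formed code pays for at most two executions and zero debug calls, while the N = 3 cap bounds the worst case for stubbornly broken generations.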
If any candidate’s confidence exceeds a threshold θ (set to 0.1 after tuning), MATA skips the final Judge Agent and directly selects the highest‑scoring answer, thereby avoiding an extra LLM call. If no candidate passes the threshold, the Judge Agent aggregates the candidates and makes a final decision, possibly using the confidence scores as additional evidence. Finally, when the selected answer is overly verbose (over 100 characters), a Format Matcher (Qwen2.5‑Instruct, 0.5 B parameters) extracts a concise entity to match the typical short ground‑truth format of TableQA datasets.
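The selection step above reduces to a few lines of control flow. In this sketch, θ = 0.1 and the 100-character cutoff follow the values stated above, while `judge` and `format_matcher` stand in for the corresponding LLM calls; the function names and signatures are illustrative.

```python
# Sketch of MATA's answer-selection step: the Judge Agent is invoked
# only when no candidate clears the confidence threshold, and the
# Format Matcher only when the chosen answer is overly verbose.

THETA = 0.1           # confidence threshold from the paper
MAX_ANSWER_LEN = 100  # verbosity cutoff for the Format Matcher

def select_answer(candidates, judge, format_matcher):
    """candidates: list of (answer, confidence) pairs scored by the
    Confidence Checker. judge/format_matcher model the two LLM tools."""
    best_answer, best_conf = max(candidates, key=lambda c: c[1])
    if best_conf > THETA:
        answer = best_answer             # skip the expensive Judge Agent
    else:
        answer = judge(candidates)       # one extra LLM call
    if len(answer) > MAX_ANSWER_LEN:
        answer = format_matcher(answer)  # extract a concise entity
    return answer
```

Both conditionals are cost gates: in the common case (a confident, short answer) the pipeline finishes with zero additional LLM invocations beyond the reasoning paths themselves.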
To train the Scheduler and Confidence Checker, the authors constructed a large synthetic dataset (173 664 samples) by running three LLMs (phi‑4‑14B, Qwen2.5‑Coder‑14B, CodeLLaMA‑13B) on three public TableQA corpora (WikiTQ, TabMWP, TabFact) across three reasoning modes (CoT, PoT, t2SQL). Each (table, question) pair thus has labeled correctness for each reasoning path, enabling supervised fine‑tuning of the lightweight tools.
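The labeling scheme can be sketched in a few lines: every (table, question) pair is answered under every reasoning mode, and each mode's correctness becomes a binary supervision signal. `run_mode` below stands in for running one LLM in one reasoning style; the record layout is an assumption, not the paper's released schema.

```python
# Illustrative sketch of the tool-training data construction: per-mode
# correctness labels for each (table, question) pair, suitable for
# supervising the Scheduler and Confidence Checker.

MODES = ("cot", "pot", "t2sql")

def build_labels(examples, run_mode):
    """examples: list of (table, question, gold_answer) triples.
    run_mode(mode, table, question) models one LLM inference."""
    records = []
    for table, question, gold in examples:
        labels = {
            mode: int(run_mode(mode, table, question) == gold)
            for mode in MODES
        }
        records.append({"question": question, "labels": labels})
    return records
```

Because correctness is recorded per path rather than per example, the same dataset can train both tools: the Scheduler learns which path to try first, and the Confidence Checker learns when a path's answer is trustworthy.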
MATA was evaluated on two benchmarks of differing difficulty using ten LLMs ranging from small open‑source models (≈7 B parameters) to large proprietary ones (>30 B). Compared with the strongest baselines, MATA achieved up to 40.1 % absolute improvement in Exact Match, 46.7 % in fuzzy matching, and 33.1 % in token‑level F1. Notably, even the smallest 7 B model reached competitive scores, demonstrating the framework’s model‑agnostic strength. Efficiency analyses showed that MATA reduces the average number of LLM calls by more than 45 % relative to Self‑Consistency approaches, thanks to the Scheduler’s early path pruning and the Confidence Checker’s thresholding. Ablation studies confirmed that the Scheduler cuts unnecessary PoT/SQL calls by ~30 %, the debugging loop reduces code‑related failures by ~70 %, and the Confidence Checker eliminates ~55 % of Judge Agent invocations.
In summary, MATA offers a principled orchestration of diverse reasoning strategies—textual CoT, code‑based PoT, and SQL generation—augmented by lightweight verification tools. This design yields a TableQA system that is both highly accurate across a wide range of LLMs and substantially more cost‑effective, making it suitable for real‑world deployments where privacy, latency, and budget constraints are paramount. The code and the newly released training dataset are publicly available at https://github.com/AIDAS-Lab/MATA.