TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning
Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce \textbf{TableGPT-R1}, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.
💡 Research Summary
TableGPT‑R1 tackles the longstanding gap between large language models (LLMs) fine‑tuned on supervised data and the demanding requirements of real‑world tabular tasks, which often involve multi‑step reasoning, code generation, and closed‑loop execution. The authors first identify three fundamental obstacles that have prevented reinforcement learning (RL) from being widely adopted in the tabular domain: (1) a paucity of high‑quality, agentic trajectories that include actual code execution and environment feedback across diverse table schemas; (2) extreme heterogeneity in feedback signals, ranging from deterministic SQL execution results to open‑ended data interpretation judgments; and (3) the risk that vertical specialization causes catastrophic forgetting of the broad linguistic knowledge LLMs acquire during pretraining. To overcome these hurdles, TableGPT‑R1 introduces a systematic RL framework composed of three tightly integrated components.
- Data Engineering Pipeline – The authors construct a synthetic trajectory generator that parametrically varies table size, column types, and query difficulty. For each generated scenario, the pipeline produces a full interaction loop: (i) a natural‑language query, (ii) a code snippet (SQL, Python pandas, or DSL), (iii) execution in a sandboxed environment, and (iv) a feedback signal (success/failure, execution time, result correctness). Trajectories are stratified by difficulty, enabling the model to first master simple look‑ups before progressing to complex transformations and visualizations. This pipeline supplies both supervised fine‑tuning (SFT) data and RL rollouts, addressing the scarcity of realistic agentic examples.
- Task‑Adaptive Reward System – Reward computation blends rule‑based verification with a learned reward model that injects domain‑specific criteria. The rule‑based component checks syntactic validity, execution success, and adherence to output formats, delivering an immediate binary or graded signal. The learned component, trained on human‑annotated preference data, evaluates semantic consistency, logical soundness, and answer relevance. Crucially, the system implements step‑level shaping: each intermediate reasoning step receives a small positive reward if it moves the execution state closer to the goal, encouraging the agent to produce coherent, incremental reasoning rather than “one‑shot” guesses. Behavioral regularization penalizes excessive deviation from the SFT policy, stabilizing training and mitigating mode collapse.
- Multi‑Stage Training Framework – Training proceeds in three phases. Phase 1 performs conventional SFT on a mixture of general‑domain text and the synthetic table data, preserving broad linguistic competence. Phase 2 introduces RL with a modest reward weight, allowing the model to explore the execution environment while still anchored to the SFT policy. Phase 3 ramps up the reward weight for table‑specific tasks, freezes a subset of parameters, and applies a low learning‑rate schedule with exponential moving average (EMA) smoothing to prevent catastrophic forgetting. Parameter‑freezing and EMA together ensure that the model’s general knowledge remains intact even as it specializes.
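The closed‑loop trajectory generation described in the first component can be illustrated with a minimal sketch. This is not the authors' pipeline: the table sizes, the fixed SQL template standing in for model‑generated code, and the `run_episode` helper are all hypothetical, and an in‑memory SQLite database plays the role of the sandboxed execution environment.

```python
import random
import sqlite3

def make_table(conn, n_rows, n_cols):
    """Create a toy table whose size is parametrically varied (illustrative only)."""
    cols = ", ".join(f"c{i} INTEGER" for i in range(n_cols))
    conn.execute(f"CREATE TABLE t ({cols})")
    for _ in range(n_rows):
        vals = ", ".join(str(random.randint(0, 100)) for _ in range(n_cols))
        conn.execute(f"INSERT INTO t VALUES ({vals})")

def run_episode(n_rows, n_cols, difficulty):
    """One closed-loop trajectory: query -> code -> sandboxed execution -> feedback."""
    conn = sqlite3.connect(":memory:")  # sandbox stand-in
    make_table(conn, n_rows, n_cols)
    # (i) natural-language query and (ii) a code snippet; a fixed SQL template
    # stands in here for code the model would generate.
    query = "What is the sum of column c0?"
    sql = "SELECT SUM(c0) FROM t"
    try:
        result = conn.execute(sql).fetchone()[0]      # (iii) execution
        feedback = {"success": True, "result": result}
    except sqlite3.Error as e:                        # (iv) feedback signal
        feedback = {"success": False, "error": str(e)}
    return {"difficulty": difficulty, "query": query,
            "code": sql, "feedback": feedback}

# Difficulty-stratified rollouts: small, easy look-ups first, larger tables later.
trajectories = [run_episode(10 * (d + 1), 3, d) for d in range(3)]
```

Each returned dictionary bundles the full interaction loop, which is the shape of record the pipeline would supply both for SFT alignment and as RL rollouts.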
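The task‑adaptive reward can likewise be sketched as a blend of the signals the second component names. Everything below is an assumption for illustration: the 0.5/0.5 mixing weights, the `bonus` and `beta` constants, and `model_score` standing in for the criteria‑injected reward model's output are all hypothetical, not values from the paper.

```python
def rule_reward(executed_ok, format_ok):
    # Rule-based verification: execution success and output-format adherence.
    return 1.0 if (executed_ok and format_ok) else 0.0

def step_shaping(dist_before, dist_after, bonus=0.1):
    # Process-level shaping: small positive reward when a step moves the
    # execution state closer to the goal.
    return bonus if dist_after < dist_before else 0.0

def behavior_penalty(logp_policy, logp_sft, beta=0.05):
    # Behavioral regularization: penalize drift from the SFT reference policy
    # (a KL-style term per sampled token/step).
    return beta * (logp_policy - logp_sft)

def total_reward(executed_ok, format_ok, model_score, step_progress,
                 logp_policy, logp_sft):
    """Blend rule-based and learned signals, then add shaping and subtract
    the regularizer. `model_score` is a placeholder for the learned reward
    model; `step_progress` is a list of (dist_before, dist_after) pairs."""
    r = 0.5 * rule_reward(executed_ok, format_ok) + 0.5 * model_score
    r += sum(step_shaping(b, a) for b, a in step_progress)
    r -= behavior_penalty(logp_policy, logp_sft)
    return r

# A successful, well-formatted rollout with two goal-approaching steps and no
# policy drift: 0.5*1.0 + 0.5*0.8 + 0.1 + 0.1 - 0 = 1.1
r = total_reward(True, True, 0.8, [(3, 2), (2, 1)], -1.0, -1.0)
```

The design point the sketch captures is that deterministic checks and model‑based judgments feed one scalar, so rigid SQL tasks and open‑ended interpretation tasks can share a single RL objective.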
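Finally, the stabilization machinery of the third component (staged reward weight, parameter freezing, EMA smoothing) can be sketched with plain dictionaries standing in for model tensors. The phase weights, the frozen‑parameter names, and the decay/learning‑rate values are all hypothetical.

```python
def reward_weight(phase):
    """Hypothetical schedule: SFT only, modest RL exploration, full table weight."""
    return {1: 0.0, 2: 0.3, 3: 1.0}[phase]

def ema_update(ema_params, params, decay=0.999):
    """EMA smoothing of parameters, damping noisy late-stage RL updates."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k] for k in params}

FROZEN = {"embed", "block_0"}  # hypothetical subset kept fixed in phase 3

def sgd_step(params, grads, lr=1e-6, phase=3):
    """Low-learning-rate update that skips frozen parameters in phase 3."""
    return {k: v if (phase == 3 and k in FROZEN) else v - lr * grads[k]
            for k, v in params.items()}
```

Used together, frozen parameters keep their pretrained values while the EMA copy trails the trained weights, which is the mechanism the summary credits with preserving general knowledge during specialization.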
The authors evaluate TableGPT‑R1 on four authoritative benchmarks: TabFact (fact verification over tables), WikiTableQuestions (complex question answering), Spider (cross‑domain SQL generation), and TabMWP (math word problems with tables). Across all datasets, TableGPT‑R1 achieves state‑of‑the‑art results, improving absolute accuracy by 4.3–7.1 percentage points over the strongest baselines. Notably, on multi‑step reasoning splits of WikiTableQuestions, the model’s gain exceeds 10 pp, highlighting the effectiveness of step‑level reward shaping. To assess knowledge retention, the authors also test on MMLU and ARC; performance drops by less than 0.2 pp, confirming that the multi‑stage regime successfully avoids catastrophic forgetting. Ablation studies reveal that removing the criteria‑injected reward model reduces performance by ~2 pp, while eliminating step‑wise shaping leads to unstable training curves and lower final scores.
Finally, the paper releases the full model weights, the synthetic trajectory generator, and the reward‑model code on Hugging Face, fostering reproducibility and future research. TableGPT‑R1 demonstrates that a carefully engineered RL pipeline—combining high‑quality closed‑loop data, heterogeneous yet structured rewards, and staged training—can unlock sophisticated tabular reasoning without sacrificing the general language capabilities of modern LLMs. This work paves the way for LLM‑driven data analysis, business intelligence, and scientific discovery where tables are the primary information substrate.