TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

Reading time: 5 minutes
...

📝 Original Info

  • Title: TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning
  • ArXiv ID: 2512.20312
  • Date: 2025-12-23
  • Authors: Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao

📝 Abstract

Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce TableGPT-R1, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.

💡 Deep Analysis

Figure 1

📄 Full Content

Model abbreviations: Q3-8B = Qwen3-8B; Q3-32B = Qwen3-32B; TGPT2 = TableGPT2-7B; TGPT-R1 = TableGPT-R1-8B.

Tabular data provide a foundational structure for storing, organizing, and presenting information [4,25], underpinning data analysis and data engineering, which are essential in fields like business intelligence and scientific research [21]. The advent of Large Language Models (LLMs) [11,3,7,1] has revolutionized how users interact with this structured data, shifting the paradigm from manual coding to intuitive natural language interaction [10]. While Supervised Fine-Tuning (SFT) establishes basic instruction-following capabilities [33,30], it often falls short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks [20]. Reinforcement Learning (RL) offers a promising path to enhance these reasoning capabilities and generalization power [23,17,36,14]. However, applying RL to the tabular domain introduces three distinct challenges.

First, Data Scarcity for Table Agents remains a significant hurdle. Unlike general chat, tabular tasks require the model to master a closed loop of thinking, coding, and execution with environment feedback on tables of diverse forms. Traditional corpora rarely contain these precise, executable trajectories (Question-Code-Observation-Answer), and real-world tables possess diverse structures that general data cannot cover.

Second, Feedback Heterogeneity complicates the reward mechanism. Tabular tasks exhibit a high degree of heterogeneity in feedback signals, spanning a spectrum from rigid code execution (where correctness is binary and strict) to open-ended data interpretation (where quality is subjective) [16,13]. Moreover, agentic trajectories induce long-horizon credit assignment, making pure terminal supervision brittle; process-level step reward shaping is often necessary to stabilize optimization. A single reward mode cannot effectively guide the model across these disparate objectives.

Third, Catastrophic Forgetting poses a risk to model performance [29]. Balancing domain-specific expertise with general intelligence is difficult, as naive training strategies often lead to catastrophic forgetting of general reasoning abilities or instability during the intensive optimization of tabular skills.

To address these challenges, we introduce TableGPT-R1, a specialized tabular model built upon a systematic RL framework. Our approach is grounded in three core technical pillars corresponding to the challenges above. To systematically tackle the issue of data scarcity, we develop a comprehensive Data Engineering Pipeline that encompasses data acquisition, synthetic agentic generation, and rigorous quality control. We first aggregate heterogeneous raw data from diverse sources, including general instruction corpora, table-specific benchmarks, and agent interaction logs. To enable robust code execution, we synthesize high-quality agentic trajectories that explicitly model the closed-loop process of reasoning, coding, and execution with environment feedback, and we leverage these trajectories for both supervised alignment and RL rollouts.

To make the tabular setting concrete, consider a typical agentic trajectory for table analysis. Given a user query and a table (often only partially visible in context), the model must plan, execute code to retrieve the necessary evidence, and then synthesize the final answer grounded in execution results. During training, this process is represented with a structured format that interleaves reasoning, code execution, and observations, enabling closed-loop learning for complex table tasks. Furthermore, a difficulty-aware stratification strategy balances sample complexity, ensuring the model receives a structured and learnable curriculum that covers both simple queries and complex analytical tasks.
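
The sketch below illustrates one way such an interleaved Question-Code-Observation-Answer record could be represented. It is a minimal illustration, not the paper's actual schema: the `Turn` and `AgentTrajectory` classes, the field names, and the example query are all assumptions made for clarity.

```python
# A minimal sketch (not the authors' exact schema) of an interleaved
# Question-Code-Observation-Answer trajectory for agentic table analysis.
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Turn:
    role: Literal["reasoning", "code", "observation", "answer"]
    content: str

@dataclass
class AgentTrajectory:
    question: str                 # user query over a (possibly partially visible) table
    table_preview: str            # e.g. the first few rows serialized as text
    turns: List[Turn] = field(default_factory=list)
    difficulty: str = "medium"    # bucket used for difficulty-aware stratification

# Example trajectory: the model plans, runs code against the table,
# observes the execution result, and grounds its final answer in it.
traj = AgentTrajectory(
    question="Which region had the highest total sales in 2024?",
    table_preview="region,year,sales\nEMEA,2024,1.2\nAPAC,2024,1.9\n...",
    turns=[
        Turn("reasoning", "Group 2024 rows by region, sum sales, take the argmax."),
        Turn("code", "df[df.year == 2024].groupby('region')['sales'].sum().idxmax()"),
        Turn("observation", "'APAC'"),
        Turn("answer", "APAC had the highest total sales in 2024."),
    ],
)
```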

To manage the heterogeneity of feedback signals, we design a Task-Adaptive Reward System that adaptively routes tasks to the most appropriate verification pathway. For open-ended reasoning tasks lacking definitive ground truth, we employ a Criteria-Injected Reward Model, which is trained via a rigorous pipeline involving teacher-student distillation and reinforcement learning to generate objective, criteria-based evaluations rather than subjective scalar scores. Conversely, for deterministic tasks with clear labels (e.g., SQL generation, math problems), we utilize a Rule-based Reward Function that integrates strict execution verification with composite regularization terms to enforce behavioral norms. In addition, we incorporate lightweight process-level step reward shaping during agentic RL rollouts to provide denser guidance for long-horizon optimization. This dual-track system ensures precise guidance across the full spectrum of tabular tasks, while robust constraints effectively mitigate reward hacking behaviors such as opportunistic plotting.
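
The routing logic can be illustrated with a short sketch. Everything below is an assumption for illustration: `execute_sql`, `reward_model_score`, the step-shaping bonus, and the plotting penalty weight are hypothetical stand-ins for the rule-based verifier, the criteria-injected reward model, and the behavioral regularization terms described above, whose exact forms the summary does not specify.

```python
# Hedged sketch of a task-adaptive reward router (illustrative, not the paper's code).

def rule_based_reward(pred_code: str, gold_result, execute_sql) -> float:
    """Binary execution-match reward for deterministic tasks (e.g. SQL generation)."""
    try:
        return 1.0 if execute_sql(pred_code) == gold_result else 0.0
    except Exception:
        return 0.0  # invalid or failing code earns no reward

def criteria_injected_reward(answer: str, criteria: list[str], reward_model_score) -> float:
    """Average per-criterion scores from a (hypothetical) criteria-injected reward model."""
    return sum(reward_model_score(answer, c) for c in criteria) / max(len(criteria), 1)

def task_adaptive_reward(sample: dict, rollout: dict, execute_sql, reward_model_score) -> float:
    # Route: deterministic tasks -> rule-based verification; open-ended tasks -> reward model.
    if sample["task_type"] in {"sql", "math"}:
        r = rule_based_reward(rollout["final_code"], sample["gold_result"], execute_sql)
    else:
        r = criteria_injected_reward(rollout["final_answer"], sample["criteria"], reward_model_score)

    # Process-level step reward shaping: small bonus for each successfully executed step.
    r += 0.05 * sum(step["exec_ok"] for step in rollout["steps"])

    # Behavioral regularization: penalize reward hacking such as gratuitous plotting.
    if rollout.get("num_plots", 0) > 0 and not sample.get("plot_required", False):
        r -= 0.2
    return r
```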

Finally, to resolve the optimization instability inherent in vertical specialization, we adopt a multi-stage training framework that progressively stabilizes general reasoning before specializing in table-specific tasks, thereby mitigating catastrophic forgetting and retaining robust general capabilities.
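
As an illustration of what such a progression might look like, the following sketch lays out a three-stage schedule consistent with that description. The stage names, data mixtures, and KL coefficients are assumptions for illustration, not the paper's reported configuration.

```python
# Illustrative staged schedule: stabilize reasoning first, then specialize on tables.
# All values below are assumed placeholders, not reported hyperparameters.
TRAINING_STAGES = [
    {   # Stage 1: supervised alignment on synthesized agentic trajectories.
        "name": "sft_alignment",
        "objective": "supervised_fine_tuning",
        "data_mix": {"agentic_table_trajectories": 0.6, "general_instructions": 0.4},
    },
    {   # Stage 2: RL that stabilizes general reasoning before specialization.
        "name": "rl_general_reasoning",
        "objective": "rl",
        "data_mix": {"general_reasoning": 0.7, "table_tasks": 0.3},
        "kl_coef": 0.05,   # stronger anchor to the reference model
    },
    {   # Stage 3: table-specific RL with the task-adaptive reward system.
        "name": "rl_table_specialization",
        "objective": "rl",
        "data_mix": {"table_tasks": 0.8, "general_replay": 0.2},  # replay limits forgetting
        "kl_coef": 0.02,
    },
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: {stage['objective']} on {stage['data_mix']}")
```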


Reference

This content is AI-processed based on open access ArXiv data.
