A Closer Look into LLMs for Table Understanding

Jia Wang 1,2∗, Chuanyu Qin 1,2∗, Mingyu Zheng 1,2∗, Qingyi Si 3, Peize Li 1, Zheng Lin 1,2‡
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 JD.COM
{wangjia,linzheng}@iie.ac.cn

Abstract

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions: the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern: early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks. Our code and analysis pipeline are publicly available at https://github.com/JiaWang2001/closer-look-table-llm

1 Introduction

Tables, as a representative form of structured data, are widely used across various real-world fields to store and present information.
In recent years, Large Language Models (LLMs) have shown powerful instruction-following and complex reasoning abilities, and have thus become the dominant paradigm for table understanding, supporting a wide range of downstream application scenarios, such as table question answering (TQA) (Pasupat and Liang, 2015; Zhang et al., 2024; Deng and Mihalcea, 2025), table fact verification (TFV) (Chen et al., 2020; Wang et al., 2024b), advanced data analysis (Li et al., 2025; Weng et al., 2025; OpenAI, 2024) and spreadsheet manipulation (Li et al., 2023; Microsoft, 2025).

∗ Indicates equal contribution. ‡ Corresponding author: Zheng Lin.

Although remarkable success has been achieved with existing LLM-based table understanding approaches, they predominantly focus on continually improving performance metrics across downstream tasks, e.g., boosting LLMs' performance with the most suitable prompt designs (Sui et al., 2024; Ye et al., 2023) and developing specialist tabular LLMs with supervised fine-tuning (SFT) (Deng and Mihalcea, 2025; Zhang et al., 2024, 2025) or reinforcement learning (RL) (Yang et al., 2025b; Wu et al., 2025; Kang et al., 2025b). Unlike previous performance-oriented work, in this paper we conduct an in-depth empirical study to achieve a deeper understanding of the underlying mechanisms of LLMs' table understanding. Our experiments cover a wide spectrum of LLMs of different sizes and types, and focus on four perspectives: the attention dynamics, the effective layer depth, the impact of MoE architectures, and the influence of input designs. The explored research questions are shown below.

Q1: How do intra-table and extra-table attention patterns of LLMs evolve when performing table understanding tasks?

Q2: How many effective layers are utilized by LLMs to achieve stable predictions, and are there differences between tabular and general tasks?
Q3: Are there specialized table experts in MoE models, and where are they located?

Q4: How do different table formats and reasoning strategies influence internal representations?

We conclude our key findings as follows:

(1) Different LLM layers exhibit distinct attention patterns during table understanding. The early layers broadly scan the overall table content, the middle layers then concentrate attention on query-related cells, and the top layers further amplify the focused content representations to produce the final answer.

(2) Tabular tasks require more layers for output distribution refinement than math reasoning. While the depth at which final answer content crystallizes is broadly comparable across tasks, LLMs refine output distributions across more layers when processing tabular data, a pattern consistent across model scales and training strategies.

(3) MoE models develop specialized experts for tabular tasks in the middle layers, which show minimal overlap with math-oriented experts, but early and top layers share general-purpose experts across different tasks.

(4) Input formats and reasoning strategies indeed influence attention patterns. HTML tables result in more dispersed attention distributions in early layers than Markdown tables. Chain-of-Thought (CoT) prompting allocates more attention to the table content, which can be further amplified by table-specific fine-tuning and thus leads to better performance.

To the best of our knowledge, we present the first thorough investigation of the mechanisms underlying LLM-based table understanding. Our findings not only provide new insights into LLM interpretability in the tabular domain, but also offer guidelines for future research, from optimal input configurations that maximize attention to relevant table content, to inference-time interventions in later layers, to MoE optimizations for table-specialized experts.
2 Related Work

2.1 LLM-based Table Understanding

Table understanding (TU) aims to enable models to automatically comprehend tables and perform various tabular tasks based on user requests (Shigarov, 2023; Lu et al., 2025). With the rapid iteration of LLMs, their abilities have opened new possibilities for more intelligent TU applications. One line of research endeavours to enhance LLMs' table understanding ability through different strategies, for instance, eliciting correct reasoning steps with prompt engineering and carefully selected in-context examples (Chen, 2023; Ye et al., 2023; Wang et al., 2024b; Ji et al., 2024; Jiang et al., 2023), collecting table instruction-tuning data for supervised fine-tuning (Zhang et al., 2024, 2025), exploring new tasks and reward strategies for reinforcement learning (Yang et al., 2025b; Wu et al., 2025; Zhou et al., 2025a), and building powerful table agents for multi-turn data analysis and Excel manipulation (Li et al., 2023; Weng et al., 2025).

In addition to the performance-oriented studies mentioned above, another valuable direction is to explore the robustness and interpretability of LLMs' table-related capabilities, for example, evaluating models' performance against divergent perturbations such as table formats (Sui et al., 2024), noisy operators (Singha et al., 2023) and cell replacement (Bhandari et al., 2025). Compared with existing work that mainly focuses on performance robustness, we aim at a better understanding of the underlying mechanisms of LLM-based table understanding and provide valuable insights for future follow-ups.

2.2 Interpretability and Analysis of LLMs

Despite the strong capabilities of LLMs, understanding their internal mechanisms remains a key challenge.
Some studies focus on the roles of different layers in LLMs, examining how individual layers contribute to information processing (Skean et al., 2025) and how semantic representations are gradually constructed across layers (Csordás et al., 2025; Hu et al., 2025). Other lines of work investigate specific internal components, such as attention heads in self-attention mechanisms and feed-forward network (FFN) modules, with the aim of clarifying their roles in knowledge representation (Geva et al., 2021; Meng et al., 2022) and information acquisition (Wu et al., 2024; Kobayashi et al., 2023), revealing various intriguing phenomena within LLMs, such as attention sink (Kobayashi et al., 2020; Xiao et al., 2023; Gu et al., 2024; Qiu et al., 2025; Queipo-de Llano et al., 2025). Furthermore, with the growing adoption of Mixture-of-Experts (MoE) architectures, recent research (Wang et al., 2024a; Bai et al., 2025; Lo et al., 2025; Bandarkar et al., 2025; Su et al., 2025) analyzes expert modules within MoE models, exploring their specialization patterns and activation behaviors during task execution.

While these efforts have provided valuable insights (Si et al., 2023; Bavaresco et al., 2025), they predominantly focus on unstructured text tasks such as math reasoning and factual recall. In this work, we extend interpretability research to structured table data, conducting analyses to explore how LLMs process tables and offering preliminary findings across several dimensions.

Table 1: Summary of 2,000 evaluation samples used in our empirical study.
Dataset   Task  Structure     Avg Tokens  #Samples
WTQ       TQA   Flat          1012        500
HiTab     TQA   Hierarchical   964        500
AIT-QA    TQA   Hierarchical   651        500
TabFact   TFV   Flat           681        500

3 Experimental Setups

Evaluation Data  We randomly select 500 samples from 3 TQA benchmarks and 1 TFV benchmark for our empirical study: WTQ (Pasupat and Liang, 2015), HiTab (Cheng et al., 2021), AIT-QA (Katsis et al., 2021) and TabFact (Chen et al., 2020). The resulting 2,000 test samples cover common table structures with flat headers (WTQ, TabFact) and complex hierarchical headers (HiTab, AIT-QA). The input tables are serialized into Markdown format by default, and we also analyze the impact of alternative formats (e.g., HTML) in Section 7.1. The complete input prompt templates are shown in Appendix C.1.

Analyzed LLMs  We analyze 16 LLMs of three categories: (1) general-purpose instruct LLMs such as Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct; (2) tabular LLMs fine-tuned via SFT or RL such as TAMA (Deng and Mihalcea, 2025) and Table-R1-Zero (Yang et al., 2025b); (3) Mixture-of-Experts (MoE) models like DeepSeek-V2-Lite (DeepSeek-AI et al., 2024) and Qwen3-30B-A3B (Yang et al., 2025a). The complete model list is provided in Appendix C.3.

4 Analysis of Attention Dynamics

To understand how LLMs allocate attention during table question answering, we design a controlled experimental setup where each input instance comprises three segments: (1) a system prompt providing general instructions, (2) a table content segment containing the serialized table, and (3) a user question specifying the query. This segmentation allows us to trace attention flow across different input components. We investigate three progressive questions: §4.1: How does the model attend to the three input segments, especially different table cells? §4.2: How much does the table content actually contribute to the model's final output?
§4.3: Does this attention pattern causally affect model predictions?

The following metrics are employed in our analysis: (1) segment attention ratio measures the proportion of attention allocated to each segment per layer; (2) table attention entropy quantifies how concentrated the attention is within the table (lower entropy indicates a more focused attention distribution on specific cells); and (3) attention contribution captures the actual influence of each segment on the final output via the L2 norm of value-weighted representations. Formal definitions and formulas of these metrics are provided in Appendix B. For all metrics, we first average across all generated tokens of each sample, and then average across all 2,000 test samples. The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct are shown in Figure 1. Results of more models are shown in Appendix F.

4.1 How Does the Model Attend to the Table?

Takeaway 1: Although the overall attention ratio to table content is modest compared to other segments, LLMs can precisely focus on query-relevant cells in the middle layers, suggesting that accurate localization matters more than total attention volume.

Allocated attention ratio across input segments. As shown in Figure 1 (lower left), the attention allocated to the table segment is highest in the earliest layers, then it drops before rising again to a secondary peak in the lower-middle layers (layers 10-15) and finally decreases to a stable level in the top layers (layers 20-30). Notably, the overall changes of table-oriented attention across different layers are modest. We observe a similar trend of attention allocated to
Figure 4: Expert activation analysis in MoE models across three architectures (DeepSeek-V2, Qwen3-30B-A3B, OLMoE). Left: activation heatmaps showing table-specific experts per layer (darker = higher activation probability). Middle: entropy of the expert activation distribution for table tasks (lower entropy indicates concentrated activation on fewer experts). Right: number of overlapping experts between table and math (GSM8K) tasks per layer.

Figure 5: Comparison between Markdown and HTML table formats.

tags) disperses the
model's attention across more tokens, making it harder to concentrate on specific cells.

Despite these differences in early layers, t-SNE visualizations reveal that the internal representations of different formats gradually converge as layer depth increases. This suggests that while early layers process format-specific features, deeper layers extract format-agnostic semantic information. This convergence may explain why LLMs can achieve reasonable performance across different table formats. As shown in Table 3, the performance gap between Markdown and HTML formats is minimal across all models (typically within 1-2%), empirically confirming that LLMs can effectively handle both formats despite their different syntactic structures.

7.2 How Does Reasoning Strategy Affect Attention?

Takeaway 6: Chain-of-Thought (CoT) reasoning leads to higher attention to table content in middle and late layers compared to direct answering, an effect that persists even when analyzing only final answer tokens. Table-specific fine-tuning further amplifies this effect.

We compare the attention patterns under direct answering and Chain-of-Thought (CoT) prompting for both vanilla LLMs and their table-tuned variants.
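For concreteness, the two attention metrics compared above (segment attention ratio and table attention entropy; formal definitions are in Appendix B) can be sketched as follows. This is a minimal sketch, assuming head-averaged attention weights for a single generated token are available as a list; the function names and toy numbers are ours, not the paper's code.

```python
import math

def segment_attention_ratio(attn_row, segments):
    # Share of a generated token's attention mass falling on each input
    # segment (system prompt, table content, user question).
    total = sum(attn_row)
    return {name: sum(attn_row[start:end]) / total
            for name, (start, end) in segments.items()}

def table_attention_entropy(attn_row, table_span):
    # Shannon entropy of the attention distribution renormalized over the
    # table tokens; lower entropy = attention focused on fewer cells.
    start, end = table_span
    z = sum(attn_row[start:end])
    probs = [a / z for a in attn_row[start:end] if a > 0]
    return -sum(p * math.log(p) for p in probs)

# Toy example: 6 input positions, system = [0:2), table = [2:5), question = [5:6)
row = [0.05, 0.05, 0.40, 0.30, 0.10, 0.10]
ratios = segment_attention_ratio(
    row, {"system": (0, 2), "table": (2, 5), "question": (5, 6)})
entropy = table_attention_entropy(row, (2, 5))
```

In the paper, these per-token quantities are further averaged over attention heads, all generated tokens of a sample, and all 2,000 test samples.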
As shown in Figure 6 (bottom), CoT reasoning leads to a higher proportion of attention allocated to table content, especially in the middle and late layers. Comparing vanilla and table-tuned models (Figure 6, top), table-specific fine-tuning generally increases attention to tables across all layers, with RL-tuned models showing the largest increase when combined with CoT. The performance results in Table 3 corroborate this: CoT consistently improves accuracy across all models, with particularly striking gains for table-tuned models (e.g., Table-R1-7B improves from 69.75% to 82.30%).¹

Figure 6: Comparison between direct answering and Chain-of-Thought (CoT) reasoning.

Table 3: Average performance over 2,000 test samples under different input formats and inference strategies.

                                   Markdown            HTML
Model                           Vanilla    CoT     Vanilla    CoT
Instruct LLMs
  Qwen2.5-7B                     62.15    67.20     61.35    67.80
  Llama3.1-8B                    56.95    62.35     57.95    58.35
Table-Specific Fine-tuned LLMs
  Table-R1-7B                    69.75    82.30     70.30    81.60
  Table-R1-8B                    70.85    82.25     69.90    80.60
MoE LLMs
  Qwen3-30B-A3B                  66.30    70.15     68.35    72.10

¹ All CoT vs. Direct differences are statistically significant (non-overlapping 95% bootstrap CIs, 1,000 iterations, N=2,000). Markdown vs. HTML differences fall within the margin of uncertainty for most models. Complete results with confidence intervals are provided in Appendix D.1.

To further isolate CoT's effect on table grounding from the natural table references within reasoning chains, we recompute the Table Attn Ratio using only the attention weights at final answer tokens, excluding all reasoning chain tokens. In early layers (layer 1 for LLaMA-3.1-8B, +1.85%; layers 1–4 for Qwen2.5-7B, +13.56%, +9.12%, +1.52%, +7.45%), CoT maintains a substantially higher Table Attn Ratio than direct answering even under this restricted setting, confirming that CoT genuinely induces stronger table grounding at the moment of answer generation. In middle-to-late layers, the final-answer-only ratio under CoT falls slightly below that of direct answering, suggesting a division of labor: the reasoning chain has already distilled relevant table information into the residual stream, allowing later layers to attend to the nearby reasoning trajectory rather than re-attending to distant table tokens.

8 Conclusion

We present the first systematic empirical study of the internal mechanisms of LLM-based table understanding across 16 LLMs. Our analysis reveals a coherent three-phase workflow: early layers broadly encode the table, middle layers precisely localize query-relevant cells, and late layers amplify the focused content for answer generation. This pattern manifests consistently across architectures: MoE models mirror it through table-specific expert activation in middle layers, while CoT reasoning and table-specific fine-tuning further enhance table engagement. These interconnected findings offer actionable guidance, from optimal input configurations, to inference-time interventions in later layers, to targeted optimization of table-specialized experts.
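The significance check used above (per-condition 95% bootstrap confidence intervals over per-sample accuracies, compared for overlap) can be sketched as follows. This is an illustrative sketch with made-up accuracy vectors, not the paper's evaluation code; the function name and toy numbers are our assumptions.

```python
import random

def bootstrap_mean_ci(scores, iters=1000, alpha=0.05, seed=0):
    # Percentile-bootstrap CI for the mean of per-sample 0/1 accuracy scores:
    # resample with replacement, take the mean each time, and read off the
    # alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(iters)
    )
    return means[int(iters * alpha / 2)], means[int(iters * (1 - alpha / 2)) - 1]

# Toy data: CoT answers 82/100 samples correctly, direct answering 62/100.
cot = [1] * 82 + [0] * 18
direct = [1] * 62 + [0] * 38
cot_lo, cot_hi = bootstrap_mean_ci(cot)
dir_lo, dir_hi = bootstrap_mean_ci(direct)
significant = cot_lo > dir_hi  # non-overlapping CIs
```

With N=2,000 samples per condition, as in the paper, the intervals are far narrower than in this 100-sample toy example.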
9 Limitations

Though this paper presents a systematic empirical study of the internal mechanisms of LLM-based table understanding, there are certain limitations and promising directions that deserve future research. (1) Analyzing LLMs of larger scales and different series. Due to resource limitations, we mainly analyze advanced open-source models with parameters below 32B. Models of larger scales such as DeepSeek-V3-671B and Qwen3-235B-A22B also deserve further investigation to analyze the scaling effect on table understanding. (2) Extending to MLLMs. Previous work and recent proprietary models have demonstrated that multimodal LLMs (MLLMs) also possess strong table understanding ability based on table images (Zheng et al., 2024; Deng et al., 2024; Zhou et al., 2025b). As a result, it is worthwhile to extend our analysis to the MLLM scope to discover unique characteristics of multimodal table understanding. (3) Analyzing more tabular tasks. This study primarily focuses on table question answering and table fact verification, the two most typical tasks in existing studies. However, it is pertinent to investigate whether our findings generalize to more complex tabular tasks, such as data analysis and table summarization. (4) Enriching table formats and reasoning strategies. We do not intend to exhaust every possible table format (such as CSV and JSON) and every reasoning strategy (such as program-of-thoughts with in-context examples), which could be further explored by future follow-ups.

10 Ethical Considerations

Our empirical study and all experiments are conducted on 4 public academic benchmarks, which are free and open-source data for research use. The analyzed LLMs are also open-source models that we downloaded from their official websites. As a result, the authors foresee no ethical concerns.

References

Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. 2025.
Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21938–21953, Suzhou, China. Association for Computational Linguistics.

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, and Nanyun Peng. 2025. Multilingual routing in mixture-of-experts. arXiv preprint arXiv:2510.04694.

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 238–255, Vienna, Austria. Association for Computational Linguistics.

Kushal Raj Bhandari, Sixue Xing, Soham Dan, and Jianxi Gao. 2025. Exploring the robustness of language models for tabular question answering via attention analysis. Preprint.

Wenhu Chen. 2023. Large language models are few(1)-shot table reasoners. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1120–1130, Dubrovnik, Croatia. Association for Computational Linguistics.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. TabFact: A large-scale dataset for table-based fact verification. Preprint.

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2021. HiTab: A hierarchical table dataset for question answering and natural language generation. arXiv preprint arXiv:2108.06712.
Róbert Csordás, Christopher D Manning, and Christopher Potts. 2025. Do language models use their depth efficiently? arXiv preprint arXiv:2505.13898.

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, and 81 others. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434.

Naihao Deng and Rada Mihalcea. 2025. Rethinking table instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21757–21780, Vienna, Austria. Association for Computational Linguistics.

Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, and Rada Mihalcea. 2024. Tables as texts or images: Evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 407–426, Bangkok, Thailand. Association for Computational Linguistics.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. 2024. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781.

Yi Hu, Cai Zhou, and Muhan Zhang. 2025. What affects the effective depth of large language models? arXiv preprint arXiv:2512.14064.

Deyi Ji, Lanyun Zhu, Siqi Gao, Peng Xu, Hongtao Lu, Jieping Ye, and Feng Zhao.
2024. Tree-of-Table: Unleashing the power of LLMs for enhanced large-scale table understanding. Preprint.

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251, Singapore. Association for Computational Linguistics.

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. 2025a. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321.

Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, and Qiufeng Wang. 2025b. Can GRPO boost complex multimodal table understanding? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12631–12644, Suzhou, China. Association for Computational Linguistics.

Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2021. AIT-QA: Question answering dataset over complex tables in the airline industry. Preprint.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2023. Analyzing feed-forward blocks in transformers through the lens of attention maps. arXiv preprint arXiv:2302.00456.

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. 2023. SheetCopilot: Bringing software productivity to the next level through large language models.
In Thirty-seventh Conference on Neural Information Processing Systems.

Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Bowen Qin, Yurong Wu, Xiaodong Li, Chenhao Ma, Jian-Guang Lou, and Reynold Cheng. 2025. Are large language models ready for multi-turn tabular data analysis? In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 34795–34835. PMLR.

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. 2025. A closer look into mixture-of-experts in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4427–4447.

Weizheng Lu, Jing Zhang, Ju Fan, Zihao Fu, Yueguo Chen, and Xiaoyong Du. 2025. Large language model for table processing: a survey. Frontiers of Computer Science, 19(2).

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Microsoft. 2025. Copilot in Excel. Accessed: 2025-12-29.

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, and 5 others. 2024. OLMoE: Open mixture-of-experts language models. CoRR, abs/2409.02060.

Nostalgebraist. 2020. Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.

OpenAI. 2024. Improvements to data analysis in ChatGPT. Accessed: 2025-12-29.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. Preprint.

Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.

Alexey Shigarov. 2023. Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13(1):e1482.

Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang. 2023. An empirical study of instruction-tuning large language models in Chinese. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4086–4107, Singapore. Association for Computational Linguistics.

Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, and Chris Parnin. 2023. Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs. Preprint.

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, and 14 others. 2024. TableGPT2: A large multimodal model with tabular data integration.
Preprint.

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. 2025. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279.

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. Preprint.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, and Yu Wu. 2024a. Let the expert stick to his last: Expert-specialized fine-tuning for sparse architectural large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 784–801.

Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024b. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In ICLR.

Han Weng, Zhou Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen, and Wentao Zhang. 2025. UniDataBench: Evaluating data analytics agents across structured and unstructured data. Preprint.

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.

Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, and Zhoujun Li. 2025. Table-R1: Region-based reinforcement learning for table understanding. Preprint.

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Shang Yang, Haotian Tang, Yao Fu, Song Han, and 1 others.
Duoattention: Efficient long-context llm inference with retrie v al and streaming heads. In The Thirteenth International Confer ence on Learning Repr esenta- tions . Guangxuan Xiao, Y uandong Tian, Beidi Chen, Song Han, and Mik e Le wis. 2023. Efficient streaming language models with attention sinks. arXiv . An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bo wen Y u, Chang Gao, Chengen Huang, Chenxu Lv , and 1 others. 2025a. Qwen3 technical report. arXiv pr eprint arXiv:2505.09388 . An Y ang, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Y u, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran W ei, Huan Lin, Jian Y ang, Jian- hong Tu, Jianwei Zhang, Jianxin Y ang, Jiaxi Y ang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 oth- ers. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 . Zheyuan Y ang, L yuhao Chen, Arman Cohan, and Y ilun Zhao. 2025b. T able-r1: Inference-time scaling for ta- ble reasoning tasks . In Proceedings of the 2025 Con- fer ence on Empirical Methods in Natur al Language Pr ocessing , pages 20616–20635, Suzhou, China. As- sociation for Computational Linguistics. Y unhu Y e, Binyuan Hui, Min Y ang, Binhua Li, Fei Huang, and Y ongbin Li. 2023. Large language mod- els are versatile decomposers: Decomposing e vi- dence and questions for table-based reasoning . In Pr oceedings of the 46th International ACM SIGIR 12 Confer ence on Resear ch and De velopment in Infor- mation Retrieval , SIGIR ’23, page 174–184, Ne w Y ork, NY , USA. Association for Computing Machin- ery . T ianshu Zhang, Xiang Y ue, Y ifei Li, and Huan Sun. 2024. T ableLlama: T o wards open large general- ist models for tables . In Proceedings of the 2024 Confer ence of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologies (V olume 1: Long P apers) , pages 6024–6044, Mexico City , Mexico. Association for Computational Linguistics. 
Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Yang Li, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, and Jie Tang. 2025. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10315–10344, Vienna, Austria. Association for Computational Linguistics.

Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, and Weiping Wang. 2024. Multimodal table understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9102–9124, Bangkok, Thailand. Association for Computational Linguistics.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025a. p2-tqa: A process-based preference learning framework for self-improving table question answering models. Preprint.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025b. Texts or images? A fine-grained analysis on the effectiveness of input representations and models for table question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2307–2318, Vienna, Austria. Association for Computational Linguistics.

A Background

A.1 Task Formalization

Given a table T and a natural language question Q, Table Question Answering (TableQA) requires the model G(·) to generate a final answer A based on the above information. Within the context of Large Language Models (LLMs), the input sequence x is composed of three segments serving distinct semantic roles:

• System Prompt Segment (S_qst): Comprises the system instructions and model-specific conversation templates (e.g., <|begin_of_text|> for Llama-3.1-8B-Instruct (Dubey et al., 2024)).

• Table Content Segment (S_tab): Contains the table data serialized in a specific format (e.g., Markdown or HTML).
• Question Description Segment (S_qst): Contains the specific user query Q.

Let I_sys, I_tab, and I_qst denote the sets of indices corresponding to these segments in the input sequence; the total input index set is I_in = I_sys ∪ I_tab ∪ I_qst. The set of indices for tokens generated by the model is denoted as I_gen.

A.2 Architecture of LLMs

The input sequence is processed by the LLM through L transformer blocks (Vaswani et al., 2017). The update of the residual stream at each layer l can be unified as follows:

$$\hat{x}^{l} = \mathrm{MHA}^{l}(\mathrm{LN}(x^{l-1})) + x^{l-1} \tag{1}$$

$$x^{l} = \mathrm{Trans}^{l}(\mathrm{LN}(\hat{x}^{l})) + \hat{x}^{l} \tag{2}$$

where x^l ∈ R^{n×d} represents the hidden state at layer l, d is the embedding dimension, and n is the total sequence length. MHA denotes the Multi-Head Attention module, and Trans represents the transition layer, which is a Feed-Forward Network (FFN) in dense LLMs or a Mixture-of-Experts (MoE) layer in MoE LLMs.

In the l-th layer, the MHA mechanism allocates attention weights by computing the similarity between Queries (Q) and Keys (K):

$$\alpha^{l,h}_{i,j} = \mathrm{softmax}\!\left(\frac{(x^{l-1}_{i} W^{l,h}_{Q})(x^{l-1}_{j} W^{l,h}_{K})^{\top}}{\sqrt{d_k}}\right) \tag{3}$$

where α^{l,h}_{i,j} denotes the attention weight from token i to token j in the h-th attention head of the l-th layer. The final output of a single attention head is the aggregation of contributions from all preceding tokens:

$$\mathrm{MHA}^{l,h}(x_i) = \sum_{j \le i} \alpha^{l,h}_{i,j}\,(x^{l-1}_{j} W^{l,h}_{OV}) \tag{4}$$

Here, W^{l,h}_{OV} = W^{l,h}_{V} W^{l,h}_{O} represents the joint value-output projection matrix, which defines the specific path through which information is mapped from input tokens into the residual stream.

For MoE LLMs, the transition layer introduces a sparse activation mechanism in which a routing function G selects the Top-k experts for each token. The output is scaled by the routing weights and aggregated:

$$\mathrm{MoE}(x_i) = \sum_{e \in \text{Top-}k(G_i)} G_{i,e} \cdot E_{e}(\mathrm{LN}(x_i)) \tag{5}$$

where G_i = softmax(x_i W_G) and E_e denotes the e-th expert.
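As a concrete illustration, the Top-k routing in Eq. (5) can be sketched in a few lines of NumPy. This is a minimal sketch: the gate matrix, expert functions, and value of k are illustrative placeholders rather than any particular model's configuration, and the layer normalization is omitted for brevity.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, W_G, experts, top_k=2):
    """Sparse MoE transition layer (Eq. 5): each token is routed to its
    top-k experts, whose outputs are scaled by the gate weights and summed.
    `experts` is a list of callables standing in for the expert FFNs E_e."""
    gates = softmax(x @ W_G)               # G_i = softmax(x_i W_G), shape (n, n_experts)
    out = np.zeros_like(x)
    for i, g in enumerate(gates):
        for e in np.argsort(g)[-top_k:]:   # indices of the top-k experts for token i
            out[i] += g[e] * experts[e](x[i])
    return out
```

With identity experts, each token's output is simply its input scaled by the sum of its top-k gate weights, which makes the routing easy to verify by hand.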
Through this hierarchical process, the LLM transforms internal representations into the final response.

B Interpretability Analysis Methods

This section introduces our core methodology for revealing the internal mechanisms of LLMs during tabular reasoning, categorized into three dimensions: Inner Prompt, Inner Table, and Contribution.

B.1 Inner Prompt: Segment Attention Ratio

To quantify the allocation of attention resources across different input segments during inference, we define the segment attention ratio D, which measures the degree of attention weight assigned to a specific segment S at layer l, averaged across all attention heads and the entire generation sequence I_gen:

$$D^{l}(S) = \frac{1}{|\mathcal{I}_{gen}| \cdot H} \sum_{i \in \mathcal{I}_{gen}} \sum_{h=1}^{H} \sum_{j \in S} \hat{\alpha}^{l,h}_{i,j} \tag{6}$$

where $\hat{\alpha}^{l,h}_{i,j} = \alpha^{l,h}_{i,j} \,/\, \sum_{k \in \mathcal{I}_{in}} \alpha^{l,h}_{i,k}$ represents the attention weight rescaled over the input indices I_in at generation step i ∈ I_gen. By observing the trend of D^l(S) across layers, we can identify the dynamic transition of attention from the structured background (S_tab) to the task objective (S_qst).

B.2 Inner Table: Table Attention Entropy

To quantitatively describe the degree of focus on relevant cells within the table, we define the average table attention entropy at layer l as:

$$H^{l}_{table} = \frac{1}{|\mathcal{I}_{gen}|} \sum_{i \in \mathcal{I}_{gen}} \left( - \sum_{j \in \mathcal{I}_{tab}} \bar{P}^{l}_{i}(j) \log \bar{P}^{l}_{i}(j) \right) \tag{7}$$

where the inner-table attention distribution $\bar{P}^{l}_{i}(j)$ is defined as:

$$\bar{P}^{l}_{i}(j) = \frac{\sum_{h=1}^{H} \alpha^{l,h}_{i,j}}{\sum_{k \in \mathcal{I}_{tab}} \sum_{h=1}^{H} \alpha^{l,h}_{i,k}}, \quad \forall j \in \mathcal{I}_{tab} \tag{8}$$

A higher entropy indicates a uniform distribution of attention across the table, whereas a lower entropy signifies that attention is concentrated on a few specific cells, implying that the layer may have identified the task-relevant information.

B.3 Value-Weighted Contribution

Following the perspective of previous research (Kobayashi et al., 2020; Gu et al., 2024; Kang et al., 2025a) that attention weights alone do not quantify the actual magnitude of change in the residual stream, we introduce the Value-Weighted Contribution analysis. To capture the model's behavior throughout the generation process, we compute the averaged contribution over I_gen:

$$C^{l}(S) = \frac{1}{|\mathcal{I}_{gen}| \cdot H} \sum_{i \in \mathcal{I}_{gen}} \sum_{j \in S} \sum_{h=1}^{H} \left\| \alpha^{l,h}_{i,j}\,x^{l-1}_{j} W^{l,h}_{OV} \right\|_{2} \tag{9}$$

where S ∈ {I_sys, I_tab, I_qst}. By integrating the hidden states x_j with the projection matrix W_OV, this formula faithfully reconstructs the intensity of the information injected by a specific segment S at layer l to drive the subsequent reasoning.

C Experimental Details

C.1 Prompt Templates

We provide the complete prompt templates used in our experiments. The input structure follows the standard practice of prompting LLMs for table tasks, consisting of three segments: (1) a system prompt segment, (2) a table content segment, and (3) a user question segment.

Overall Prompt Template

System Prompt: You are a helpful assistant.
User Prompt:
TABLE: {table_content}
INPUT: {question_content}

Prompt Template for Table Question Answering (WTQ, HiTab, AITQA)

TABLE: {table_content (markdown / html)}
INPUT: Given the table titled '{table_title}', answer the following question based on the given table. The final answer should be concise and use the following format:
```json
{
  "answer": ["answer1", "answer2", ...]
}
```
Question: {question_content}

Prompt Template for Table Fact Verification (TabFact)

TABLE: {table_content (markdown / html)}
INPUT: Given the table titled '{table_title}', determine whether the following statement is entailed or refuted by the given table (Output 1 for entailed and 0 for refuted):
Statement: {statement_content}

C.2 Table Serialization Formats

Unless otherwise specified, we use Markdown as the default table serialization format. Figure 7 illustrates the two serialization formats (Markdown and HTML) used in our experiments.
Markdown Format

| Name  | Age | City     |
|-------|-----|----------|
| Alice | 25  | Beijing  |
| Bob   | 30  | Shanghai |

HTML Format

<table>
  <tr><th>Name</th><th>Age</th><th>City</th></tr>
  <tr><td>Alice</td><td>25</td><td>Beijing</td></tr>
  <tr><td>Bob</td><td>30</td><td>Shanghai</td></tr>
</table>
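The two serializations above can be produced from a header and a list of rows with a small helper. This is a generic sketch of the conversion, not the paper's actual preprocessing code; the function names are illustrative.

```python
def to_markdown(header, rows):
    """Serialize a table in GitHub-flavored Markdown (the paper's default format)."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Serialize the same table as a minimal HTML <table>."""
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>" for row in rows
    )
    return f"<table>{head}{body}</table>"
```

Because both functions take the same (header, rows) pair, the two formats can be swapped without touching the rest of the prompt-construction pipeline.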