A Closer Look into LLMs for Table Understanding

Jia Wang 1,2∗, Chuanyu Qin 1,2∗, Mingyu Zheng 1,2∗, Qingyi Si 3, Peize Li 1, Zheng Lin 1,2‡
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 JD.COM
{wangjia,linzheng}@iie.ac.cn

Abstract

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on four dimensions: attention dynamics, effective layer depth, expert activation, and the impact of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern: early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in the middle layers, while early and late layers share general-purpose experts; (4) Chain-of-Thought prompting increases table attention, an effect further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks. Our code and analysis pipeline are publicly available at https://github.com/JiaWang2001/closer-look-table-llm

1 Introduction

Tables, as a representative form of structured data, are widely used across various real-world fields to store and present information.
In recent years, Large Language Models (LLMs) have shown powerful instruction-following and complex reasoning abilities, and have thus become the dominant paradigm for table understanding, supporting a wide range of downstream application scenarios such as table question answering (TQA) (Pasupat and Liang, 2015; Zhang et al., 2024; Deng and Mihalcea, 2025), table fact verification (TFV) (Chen et al., 2020; Wang et al., 2024b), advanced data analysis (Li et al., 2025; Weng et al., 2025; OpenAI, 2024), and spreadsheet manipulation (Li et al., 2023; Microsoft, 2025).

∗ Indicates equal contribution. ‡ Corresponding author: Zheng Lin.

Although remarkable success has been achieved with existing LLM-based table understanding approaches, they predominantly focus on continually improving performance metrics across downstream tasks, e.g., boosting LLMs' performance with the most suitable prompt designs (Sui et al., 2024; Ye et al., 2023) and developing specialist tabular LLMs with supervised fine-tuning (SFT) (Deng and Mihalcea, 2025; Zhang et al., 2024, 2025) or reinforcement learning (RL) (Yang et al., 2025b; Wu et al., 2025; Kang et al., 2025b). Unlike previous performance-oriented work, in this paper we conduct an in-depth empirical study to achieve a deeper understanding of the underlying mechanisms of LLMs' table understanding. Our experiments cover a wide spectrum of LLMs of different sizes and types, and focus on four perspectives: attention dynamics, effective layer depth, the impact of MoE architectures, and the influence of input designs. The research questions we explore are listed below.

Q1: How do intra-table and extra-table attention patterns of LLMs evolve when performing table understanding tasks?
Q2: How many effective layers do LLMs use to reach stable predictions, and are there differences between tabular and general tasks?
Q3: Do MoE models develop specialized table experts, and where are they located?
Q4: How do different table formats and reasoning strategies influence internal representations?

We summarize our key findings as follows:

(1) Different LLM layers exhibit distinct attention patterns during table understanding. The early layers broadly scan the overall table content, the middle layers then concentrate attention on query-related cells, and the top layers further amplify the focused content representations to produce the final answer.

(2) Tabular tasks require more layers for output distribution refinement than math reasoning. While the depth at which the final answer content crystallizes is broadly comparable across tasks, LLMs refine their output distributions across more layers when processing tabular data, a pattern consistent across model scales and training strategies.

(3) MoE models develop specialized experts for tabular tasks in the middle layers, which show minimal overlap with math-oriented experts, while early and top layers share general-purpose experts across different tasks.

(4) Input formats and reasoning strategies indeed influence attention patterns. HTML tables result in more dispersed attention distributions in early layers than Markdown tables. Chain-of-Thought (CoT) prompting allocates more attention to the table content, an effect that is further amplified by table-specific fine-tuning and thus leads to better performance.

To the best of our knowledge, we present the first thorough investigation of the mechanisms underlying LLM-based table understanding. Our findings not only provide new insights into LLM interpretability in the tabular domain, but also offer guidelines for future research: from optimal input configurations that maximize attention to relevant table content, to inference-time interventions in later layers, to MoE optimizations for table-specialized experts.
2 Related Work

2.1 LLM-based Table Understanding

Table understanding (TU) aims to enable models to automatically comprehend tables and perform various tabular tasks based on user requests (Shigarov, 2023; Lu et al., 2025). With the rapid iteration of LLMs, their abilities have opened new possibilities for more intelligent TU applications. One line of research endeavours to enhance LLMs' table understanding ability through different strategies, for instance, eliciting correct reasoning steps with prompt engineering and carefully selected in-context examples (Chen, 2023; Ye et al., 2023; Wang et al., 2024b; Ji et al., 2024; Jiang et al., 2023), collecting table instruction-tuning data for supervised fine-tuning (Zhang et al., 2024, 2025), exploring new tasks and reward strategies for reinforcement learning (Yang et al., 2025b; Wu et al., 2025; Zhou et al., 2025a), and building powerful table agents for multi-turn data analysis and Excel manipulation (Li et al., 2023; Weng et al., 2025).

In addition to the performance-oriented studies mentioned above, another valuable direction is to explore the robustness and interpretability of LLMs' table-related capabilities, for example, evaluating models' performance against divergent perturbations such as table formats (Sui et al., 2024), noisy operators (Singha et al., 2023), and cell replacement (Bhandari et al., 2025). Compared with existing work that mainly focuses on performance robustness, we aim to develop a better understanding of the underlying mechanisms of LLM-based table understanding and provide valuable insights for future follow-ups.

2.2 Interpretability and Analysis of LLMs

Despite the strong capabilities of LLMs, understanding their internal mechanisms remains a key challenge.
Some studies focus on the roles of different layers in LLMs, examining how individual layers contribute to information processing (Skean et al., 2025) and how semantic representations are gradually constructed across layers (Csordás et al., 2025; Hu et al., 2025). Other lines of work investigate specific internal components, such as attention heads in self-attention mechanisms and feed-forward network (FFN) modules, aiming to clarify their roles in knowledge representation (Geva et al., 2021; Meng et al., 2022) and information acquisition (Wu et al., 2024; Kobayashi et al., 2023), and revealing various intriguing phenomena within LLMs, such as attention sink (Kobayashi et al., 2020; Xiao et al., 2023; Gu et al., 2024; Qiu et al., 2025; Queipo-de Llano et al., 2025). Furthermore, with the growing adoption of Mixture-of-Experts (MoE) architectures, recent research (Wang et al., 2024a; Bai et al., 2025; Lo et al., 2025; Bandarkar et al., 2025; Su et al., 2025) analyzes expert modules within MoE models, exploring their specialization patterns and activation behaviors during task execution.

While these efforts have provided valuable insights (Si et al., 2023; Bavaresco et al., 2025), they predominantly focus on unstructured text tasks such as math reasoning and factual recall. In this work, we extend interpretability research to structured table data, conducting analyses to explore how LLMs process tables and offering preliminary findings across several dimensions.

Table 1: Summary of the 2,000 evaluation samples used in our empirical study.
Dataset   Task  Structure     Avg Tokens  #Samples
WTQ       TQA   Flat          1012        500
HiTab     TQA   Hierarchical   964        500
AIT-QA    TQA   Hierarchical   651        500
TabFact   TFV   Flat           681        500

3 Experimental Setups

Evaluation Data. We randomly select 500 samples from each of 3 TQA benchmarks and 1 TFV benchmark for our empirical study: WTQ (Pasupat and Liang, 2015), HiTab (Cheng et al., 2021), AIT-QA (Katsis et al., 2021), and TabFact (Chen et al., 2020). The resulting 2,000 test samples cover common table structures with flat headers (WTQ, TabFact) and complex hierarchical headers (HiTab, AIT-QA). The input tables are serialized into Markdown format by default, and we also analyze the impact of alternative formats (e.g., HTML) in Section 7.1. The complete input prompt templates are shown in Appendix C.1.

Analyzed LLMs. We analyze 16 LLMs of three categories: (1) general-purpose instruct LLMs such as Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct; (2) tabular LLMs fine-tuned via SFT or RL, such as TAMA (Deng and Mihalcea, 2025) and Table-R1-Zero (Yang et al., 2025b); and (3) Mixture-of-Experts (MoE) models like DeepSeek-V2-Lite (DeepSeek-AI et al., 2024) and Qwen3-30B-A3B (Yang et al., 2025a). The complete model list is provided in Appendix C.3.

4 Analysis of Attention Dynamics

To understand how LLMs allocate attention during table question answering, we design a controlled experimental setup where each input instance comprises three segments: (1) a system prompt providing general instructions, (2) the table content containing the serialized table, and (3) a user question specifying the query. This segmentation allows us to trace attention flow across the different input components. We investigate three progressive questions:

§4.1: How does the model attend to the three input segments, especially different table cells?
§4.2: How much does the table content actually contribute to the model's final output?
§4.3: Does this attention pattern causally affect model predictions?

The following metrics are employed in our analysis: (1) segment attention ratio measures the proportion of attention allocated to each segment per layer; (2) table attention entropy quantifies how concentrated the attention is within the table (lower entropy indicates a more focused attention distribution on specific cells); and (3) attention contribution captures the actual influence of each segment on the final output via the L2 norm of value-weighted representations. Formal definitions and formulas for these metrics are provided in Appendix B. For all metrics, we first average across all generated tokens of each sample, and then average across all 2,000 test samples. The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct are shown in Figure 1; results for more models are shown in Appendix F.

4.1 How Does the Model Attend to the Table?

Takeaway 1: Although the overall attention ratio to table content is modest compared to other segments, LLMs can precisely focus on query-relevant cells in the middle layers, suggesting that accurate localization matters more than total attention volume.

Allocated attention ratio across input segments. As shown in Figure 1 (lower left), the attention allocated to the table content is highest in the earliest layers, then drops before rising again to a secondary peak in the lower-middle layers (layers 10-15), and finally decreases to a stable level in the top layers (layers 20-30). Notably, the overall changes in table-oriented attention across layers are modest. We observe a similar trend for the attention allocated to the user question, with both reaching their peak in the middle layers, which may suggest that the model is integrating question information to interpret table structures and identify critical table content.
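The three metrics above can be sketched in a few lines. Below is a minimal numpy illustration over a single layer's head-averaged attention matrix; the segment spans, array shapes, and helper names are our own simplifications, and Appendix B gives the formal definitions.

```python
import numpy as np

def segment_attention_ratio(attn, segments):
    """Share of total attention mass each segment receives.
    attn: one layer's attention, averaged over heads, shape [tgt_len, src_len].
    segments: {name: (start, end)} token spans over the source sequence."""
    total = attn.sum()
    return {name: float(attn[:, lo:hi].sum() / total)
            for name, (lo, hi) in segments.items()}

def table_attention_entropy(attn, table_span):
    """Shannon entropy of attention over table tokens, averaged across
    target positions; lower entropy = more concentrated on specific cells."""
    lo, hi = table_span
    p = attn[:, lo:hi]
    p = p / p.sum(axis=-1, keepdims=True)        # renormalise within the table
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)  # entropy per target token
    return float(ent.mean())

def attention_contribution(attn, values, span):
    """L2 norm of the value-weighted representation a segment injects into
    the residual stream. values: per-token value vectors, [src_len, d]."""
    lo, hi = span
    out = attn[:, lo:hi] @ values[lo:hi]         # [tgt_len, d]
    return float(np.linalg.norm(out, axis=-1).mean())
```

In the full pipeline these quantities would be computed per layer and per generated token, then averaged over tokens and samples as described above.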
Outside the middle layers, the system prompt consistently attracts relatively higher attention scores than the user question and the table content, which could be attributed to the attention sink phenomenon observed in prior work (Xiao et al., 2023; Gu et al., 2024; Kang et al., 2025a). The step-wise segment attention ratio in specific layers throughout the token-by-token generation process of one sample is also visualized in the upper left sub-panels of Figure 1, providing further evidence of the observed patterns.

The influence of model scales. Comparing models of different scales within the same family (detailed results in Appendix F), we find that the overall attention distribution trends remain consistent. However, two notable differences emerge: (1) the secondary peak of table attention shifts toward deeper layers as model size increases, and (2) the layer range with low table attention entropy becomes broader in larger models, suggesting that they maintain focused attention across more layers.

Figure 1: The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct on tabular tasks. Each input consists of three segments: system prompt, table content, and user question. Upper: a case study at selected layers; left sub-panels show the step-wise attention ratio throughout generation, and right sub-panels show cell-level attention heatmaps. Lower: aggregated results. Layer-wise Segment Attn Ratio shows the proportion of attention allocated to each segment per layer; Table Attn Entropy measures the degree of focus on table cells, with lower entropy indicating a more concentrated attention distribution on specific cells; Attn Contribution measures the influence of each segment on the final output. Notably, the entropy minima at Layers 10 and 14 for Llama3.1-8B-Instruct align with the concentrated attention on the answer cell "China" in the upper heatmap.

The influence of model architectures.
Comparing Llama3.1-8B and Qwen2.5-7B (Figure 1), we observe that Qwen allocates higher attention to table content in the early layers. MoE models exhibit a more gradual decline in table attention entropy, with a distinctive three-phase pattern: attention first covers partial cells, then spreads across the entire table, and finally converges to answer-relevant cells (detailed in §6). For tabular LLMs like TableGPT2 (Su et al., 2024), table-specific fine-tuning leads to higher table attention and weaker attention sink effects, though the overall patterns remain similar to their base models.

Attention distribution within the table. While the overall attention allocated to the table is limited, does the model at least focus on the query-related table cells? To answer this, we measure the table attention entropy over table tokens at each layer. As shown in Figure 1 (lower middle), the table attention entropy follows a U-shaped pattern: it decreases from the early layers, reaches a minimum in the middle layers, and finally increases again. The attention heatmaps in Figure 1 (upper right sub-panels) also illustrate this pattern. In early layers (e.g., Layer 0), attention is spread broadly across the entire table, suggesting that the model scans all content. In middle layers (e.g., Layer 14 for Llama-3.1-8B-Instruct), attention becomes concentrated on specific cells: for the question "which nation won the most gold medals?", the model focuses on the "China" cell and the "Gold" column header, which directly correspond to the answer. In later layers, attention spreads slightly again as the model prepares to generate the answer. This pattern is consistent across different models and benchmarks; we provide more case studies and model comparisons in Appendix F.

These findings suggest that LLMs do not require heavy attention to all table content.
Instead, they selectively focus on query-relevant cells, much like humans skimming a table for key information. This implies that low-attention table regions could be pruned to develop efficient tabular LLMs with minimal performance loss.

4.2 How Much Does the Table Content Contribute to the Final Output?

Takeaway 2: Despite receiving modest attention, table content makes substantial contributions to the model's output, with its influence increasing in later layers and ultimately driving answer generation.

Attention weights show where the model "looks", but not how much the attended content actually influences the output. As noted in prior work (Kobayashi et al., 2020; Gu et al., 2024), the true impact depends on both the attention weight and the transformed value vector. Therefore, we measure the value-weighted contribution, the L2 norm of each segment's representation injected into the transformer residual stream (see Appendix B for the formal definition).

Figure 2: Performance drop when masking table attention in different layer ranges across 5 models (averaged over 3 benchmarks × 2 formats). Llama-family models show minimal impact from Deep masking, while Qwen-family models degrade uniformly across all ranges.

As shown in Figure 1 (lower right), despite the modest attention ratio observed in §4.1, the table segment makes substantial contributions that increase in later layers, peaking near the final layers.
In contrast, the system prompt's contribution remains low across all layers despite its high attention ratio, confirming its role as an attention sink rather than an information source.

This finding connects the previous observations into a coherent narrative of the model's internal workflow. While early layers broadly encode the table (high ratio, high entropy), the middle layers act as a critical reasoning phase characterized by "concentrated influence": the model narrows its focus to relevant cells (low entropy) and amplifies their impact on the residual stream (rising norm). Finally, in the late layers, the question and the focused table content jointly drive answer generation.

4.3 Does the Attention Pattern Causally Affect Output?

Takeaway 3: The three-phase attention pattern is functionally essential rather than a mere processing correlate: masking table attention in early-to-middle layers causes severe performance degradation, and middle-layer attention focus on answer cells is predictive of output correctness.

Causal validation via attention masking. We zero out attention weights to all table tokens within consecutive quarter-layer blocks: Shallow (1–L/4), S-M (L/4–L/2), M-D (L/2–3L/4), and Deep (3L/4–L), across 5 models, 3 benchmarks, and 2 formats. As shown in Figure 2, masking early layers (Shallow, S-M) universally causes the largest performance drops (44–65 points), directly demonstrating that the broad scanning phase is functionally essential to downstream prediction. Llama-family models (Llama-3.1-8B, TAMA, Table-R1-8B) show the most severe degradation from Shallow and S-M masking with minimal late-layer impact (TAMA drops by 58.8 and 56.3 points for Shallow and S-M, respectively, but only 6.5 points for Deep), empirically validating our interpretation that late layers serve an amplification rather than a retrieval role.
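The masking intervention just described can be sketched as follows. This is a minimal numpy illustration of zeroing attention to table tokens within a chosen layer range and renormalising; in practice this would be wired into the model via forward hooks on each attention module, and the function and argument names here are ours, not the paper's code.

```python
import numpy as np

def masked_attention(scores, table_span, layer, mask_range):
    """One layer's attention probabilities with the masking intervention.
    scores: pre-softmax attention scores, shape [tgt_len, src_len].
    table_span: (start, end) of table tokens in the source sequence.
    mask_range: (lo, hi) layer indices where table attention is zeroed."""
    # standard softmax over source positions
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    lo_l, hi_l = mask_range
    if lo_l <= layer < hi_l:
        t0, t1 = table_span
        probs[:, t0:t1] = 0.0                       # cut attention into the table
        probs /= probs.sum(axis=-1, keepdims=True)  # renormalise over the rest
    return probs
```

Running the model with this intervention over the Shallow, S-M, M-D, or Deep quarter of layers reproduces the masking conditions compared in Figure 2.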
In contrast, Qwen-family models (Qwen2.5-7B, Table-R1-7B) show uniformly severe degradation across all masking ranges, reflecting their more distributed attention profiles (§4.1).

Attention-answer correspondence. To complement the causal analysis, we select 200 lookup-type examples and identify the Top-20 attention-weighted cells in low-entropy layers (layers 10–15) for Llama-3.1-8B and Qwen2.5-7B. The cells focused on by low-entropy layers directly correspond to the model's final response in 53.0% (Llama) and 43.9% (Qwen) of cases. When low-entropy layers successfully attend to the correct answer cell, final prediction accuracy reaches 65.6% (Llama) and 71.2% (Qwen). Conversely, when models answer incorrectly, 52.8% (Llama) and 74.4% (Qwen) of errors are associated with low-entropy layers failing to attend to the correct cell. This suggests that misfocused middle-layer attention is a frequent predictor of incorrect predictions, pointing to potential utility for attention-based error detection.

5 Effective Depth for Tabular Tasks

The previous analysis revealed that middle layers concentrate attention on query-relevant cells while later layers amplify their contributions. A natural follow-up question is: how many layers do LLMs actually need to complete table understanding tasks? To investigate this, we apply LogitLens (Nostalgebraist, 2020), which decodes the hidden representations at each layer into vocabulary distributions. We measure the KL divergence between each layer's prediction and the final output, along with the top-5 token overlap.
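These two per-layer measurements can be sketched as follows, assuming the hidden state at each layer has already been decoded into logits through the model's unembedding matrix (the LogitLens step); helper names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def logitlens_curves(layer_logits, final_logits, k=5):
    """Per-layer KL divergence to the final output distribution and
    top-k token overlap with the final output.
    layer_logits: [n_layers, vocab] decoded logits per layer.
    final_logits: [vocab] logits of the model's final prediction."""
    p_final = softmax(final_logits)
    top_final = set(np.argsort(final_logits)[-k:])
    # KL(final || layer); the epsilon guards against zero probabilities
    kls = [float((p_final * np.log(p_final / (softmax(z) + 1e-12))).sum())
           for z in layer_logits]
    overlaps = [len(top_final & set(np.argsort(z)[-k:])) / k
                for z in layer_logits]
    return kls, overlaps
```

A layer whose decoded distribution matches the final output yields a near-zero KL and a complete top-k overlap, which is what the vertical dashed lines in Figure 3 mark.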
Figure 3: Prediction stability analysis using LogitLens. Bars show the KL divergence between each layer's decoded distribution and the final output (lower = closer to the final prediction). Lines show the top-5 token overlap with the final output. Vertical dashed lines mark where predictions stabilize.

Takeaway 4: Table tasks engage more layers for output distribution refinement than math reasoning, though the depth at which answer content crystallizes is comparable across tasks. This pattern is consistent across model scales and fine-tuning strategies, suggesting that layer-wise functionality is largely determined during pre-training.

As shown in Figure 3, the KL divergence remains relatively high throughout the middle layers and only drops sharply in the final layers. To directly validate whether this pattern distinguishes table tasks from other reasoning tasks, we run the same LogitLens pipeline on GSM8K across 7 representative models spanning the general instruction-tuned, table-specific fine-tuned, and MoE categories. We decompose stabilization depth into two complementary metrics: KL-based depth (where the output distribution converges) and Top-k overlap depth (where the final answer content crystallizes).
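Given per-layer KL and overlap curves, the two depths can be read off as the first layer after which each criterion holds for all remaining layers. A small sketch with an illustrative KL tolerance (the exact convergence criteria are our assumption, not the paper's):

```python
def stabilization_depths(kls, overlaps, kl_tol=0.1):
    """KL-based depth: first layer after which the KL divergence to the
    final output stays below kl_tol for all deeper layers.
    Overlap depth: first layer after which the top-k overlap with the
    final output stays complete (== 1.0)."""
    n = len(kls)
    kl_depth = next(i for i in range(n) if max(kls[i:]) < kl_tol)
    ov_depth = next(i for i in range(n) if min(overlaps[i:]) >= 1.0)
    return kl_depth, ov_depth
```

Dividing each depth by the total layer count gives the ratios reported in Table 2.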
As shown in Table 2, the KL-based stabilization depth is consistently higher for table tasks than for math across the majority of models (5 of 7, marked with †), indicating that LLMs continue refining their output distributions across more layers when processing tabular data. Top-k overlap depth, by contrast, is broadly comparable across both task types, suggesting that while the distribution refinement process engages more layers for tables, the point at which the final answer content crystallizes is similar.

Table 2: Stabilization depth comparison between table and math (GSM8K) tasks. Values are reported as effective layers / total layers (ratio). KL-based depth is higher for table tasks in 5 of 7 models (†), while Top-k overlap depth is broadly comparable.

                               KL Depth                  Overlap Depth
Model           Layers   Table        Math         Table        Math
Instruct LLMs
Qwen2.5-3B      36       32 (0.89)†   27 (0.75)    33 (0.92)    31 (0.86)
Qwen2.5-7B      28       25 (0.89)†   23 (0.82)    25 (0.89)    25 (0.89)
Qwen2.5-14B     48       44 (0.92)†   43 (0.90)    44 (0.92)    43 (0.90)
Llama-3.1-8B    32       26 (0.81)    29 (0.91)    25 (0.78)    29 (0.91)
Table-Specific Fine-tuned LLMs
TAMA            32       26 (0.81)†   23 (0.72)    25 (0.78)    26 (0.81)
Table-R1-7B     28       25 (0.89)†   24 (0.86)    25 (0.89)    25 (0.89)
MoE LLMs
Qwen3-30B-A3B   48       43 (0.90)    43 (0.90)    45 (0.94)    43 (0.90)

This decomposition provides a more precise characterization: table understanding does not simply "take longer" uniformly, but specifically requires more layers for the fine-grained distribution adjustments that follow answer identification.

Notably, this pattern holds consistently across different model scales (7B to 32B) and training strategies (SFT and RL), suggesting that the layer-wise functionality for table understanding is largely established during pre-training (detailed results in Appendix D.2). This finding has practical implications: unlike math reasoning, where early-exit strategies can reduce computation (Csordás et al., 2025; Hu et al.
, 2025), table tasks may benefit less from such optimizations. Conversely, it suggests that targeted interventions in the later layers (e.g., inference-time steering) could be particularly effective for improving table understanding capabilities.

6 Seeking Table Experts in MoE Models

Takeaway 5: MoE models activate a distinct set of table-specific experts concentrated in the middle layers, mirroring the attention patterns observed in RQ1. These experts show minimal overlap with math-related experts.

Mixture-of-Experts (MoE) models route each token to a subset of experts based on learned gating functions. A natural question arises: do MoE models develop specialized experts for tabular tasks? To identify table-specific experts, we record the activation frequency of each expert across all table tokens and layers, following prior work (Wang et al., 2024a; Bai et al., 2025). As shown in Figure 4 (left), table-relevant experts are distributed across all layers, but their activation patterns vary by depth. The middle column of Figure 4 reveals a striking pattern: the entropy of the expert activation distribution reaches its minimum in the middle layers across all three MoE architectures. Lower entropy indicates that the model concentrates its routing on fewer, more specialized experts. This pattern mirrors the attention entropy findings in RQ1, where attention within the table also becomes most concentrated in the middle layers. Furthermore, these middle-layer table-specific experts show minimal overlap with the experts activated for math reasoning tasks (Figure 4, right), confirming their domain-specific nature.

Together, these findings suggest that MoE models achieve efficient multi-task processing by sharing general-purpose experts in the early and late layers while routing to specialized experts in the middle layers for domain-specific reasoning.
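The expert-frequency bookkeeping behind these figures can be sketched as follows, assuming the router's selected expert ids have been logged per layer for the analysed (e.g., table) tokens; this is a simplification of the procedure above, and the names are ours.

```python
import numpy as np
from collections import Counter

def expert_stats(routing, n_experts):
    """routing: list over layers; each entry is an array of expert ids the
    router selected for the analysed tokens at that layer. Returns the
    per-layer activation frequency matrix and its distribution entropy
    (lower entropy = routing concentrated on fewer experts)."""
    freq = np.zeros((len(routing), n_experts))
    for layer, ids in enumerate(routing):
        for expert, count in Counter(ids.tolist()).items():
            freq[layer, expert] = count
        freq[layer] /= freq[layer].sum()
    entropy = -(freq * np.log(freq + 1e-12)).sum(axis=-1)
    return freq, entropy

def expert_overlap(freq_a, freq_b, top=8):
    """Per-layer count of shared top-activated experts between two domains
    (e.g., table vs. math)."""
    return [len(set(np.argsort(fa)[-top:]) & set(np.argsort(fb)[-top:]))
            for fa, fb in zip(freq_a, freq_b)]
```

Comparing the entropy curve across layers, and the overlap between the table-domain and math-domain frequency matrices, yields the middle and right panels of Figure 4.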
7 Impact of Input Formats and Reasoning Strategies

The previous sections analyzed LLMs' internal mechanisms under a fixed setting (Markdown format, direct answering). In practice, however, tables can be represented in various formats (e.g., Markdown, HTML), and models can be prompted with different reasoning strategies (e.g., direct answering, Chain-of-Thought). Here we examine how these choices influence internal processing and model behavior.

7.1 How Does Table Format Affect Attention?

Takeaway 5: HTML tables lead to higher attention entropy than Markdown tables. However, internal representations of different formats converge in deeper layers.

To investigate how table format affects internal processing, we convert our analysis data from Markdown to HTML format and compare the attention patterns. As shown in Figure 5, LLMs assign a slightly higher attention ratio to HTML tables, but the attention entropy is significantly higher. This may be because the verbose structural syntax in HTML (e.g., `<tr>` and `<td>`
tags) disperses the model's attention across more tokens, making it harder to concentrate on specific cells.

Figure 4: Expert activation analysis in MoE models across three architectures (DeepSeek-V2, Qwen3-30B-A3B, OLMoE). Left: Activation heatmaps showing table-specific experts per layer (darker = higher activation probability). Middle: Entropy of the expert activation distribution for table tasks (lower entropy indicates concentrated activation on fewer experts). Right: Number of overlapping experts between table and math (GSM8K) tasks per layer.

Figure 5: Comparison between Markdown and HTML table formats: per-layer table attention entropy, table attention ratio, and t-SNE visualizations of hidden states at early, middle, and late layers for LLaMA-3.1-8B-Ins and Qwen-2.5-7B-Ins.

Despite these differences in early layers, t-SNE visualizations reveal that the internal representations of different formats gradually converge as layer depth increases. This suggests that while early layers process format-specific features, deeper layers extract format-agnostic semantic information. This convergence may explain why LLMs can achieve reasonable performance across different table formats. As shown in Table 3, the performance gap between Markdown and HTML formats is minimal across all models (typically within 1-2%), empirically confirming that LLMs can effectively handle both formats despite their different syntactic structures.

7.2 How Does Reasoning Strategy Affect Attention?

Takeaway 6: Chain-of-Thought (CoT) reasoning leads to higher attention to table content in middle and late layers compared to direct answering, an effect that persists even when analyzing only final answer tokens. Table-specific fine-tuning further amplifies this effect.

We compare the attention patterns under direct answering and Chain-of-Thought (CoT) prompting for both vanilla LLMs and their table-tuned variants.
As shown in Figure 6 (bottom), CoT reasoning leads to a higher proportion of attention allocated to table content, especially in the middle and late layers. Comparing vanilla and table-tuned models (Figure 6, top), table-specific fine-tuning generally increases attention to tables across all layers, with RL-tuned models showing the largest increase when combined with CoT. The performance results in Table 3 corroborate this: CoT consistently improves accuracy across all models, with particularly striking gains for table-tuned models (e.g., Table-R1-7B improves from 69.75% to 82.30%).¹

Figure 6: Comparison between direct answering and Chain-of-Thought (CoT) reasoning: per-layer Table Attn Ratio for LLaMA-based (8B) and Qwen-based (7B) models under Direct and CoT prompting, and the per-layer gain (CoT - Direct) for the instruct models.

Table 3: Average performance over 2,000 test samples under different input formats and inference strategies.

| Model | Markdown Vanilla | Markdown CoT | HTML Vanilla | HTML CoT |
|---|---|---|---|---|
| Instruct LLMs | | | | |
| Qwen2.5-7B | 62.15 | 67.20 | 61.35 | 67.80 |
| Llama3.1-8B | 56.95 | 62.35 | 57.95 | 58.35 |
| Table-Specific Fine-tuned LLMs | | | | |
| Table-R1-7B | 69.75 | 82.30 | 70.30 | 81.60 |
| Table-R1-8B | 70.85 | 82.25 | 69.90 | 80.60 |
| MoE LLMs | | | | |
| Qwen3-30B-A3B | 66.30 | 70.15 | 68.35 | 72.10 |

¹ All CoT vs. Direct differences are statistically significant (non-overlapping 95% bootstrap CIs, 1,000 iterations, N=2,000). Markdown vs. HTML differences fall within the margin of uncertainty for most models. Complete results with confidence intervals are provided in Appendix D.1.

To further isolate CoT's effect on table grounding from the natural table references within reasoning chains, we recompute the Table Attn Ratio using only the attention weights at final answer tokens, excluding all reasoning chain tokens. In early layers—layer 1 for LLaMA-3.1-8B (+1.85%) and layers 1–4 for Qwen2.5-7B (+13.56%, +9.12%, +1.52%, +7.45%)—CoT maintains a substantially higher Table Attn Ratio than direct answering even under this restricted setting, confirming that CoT genuinely induces stronger table grounding at the moment of answer generation. In middle-to-late layers, the final-answer-only ratio under CoT falls slightly below that of direct answering, suggesting a division of labor: the reasoning chain has already distilled relevant table information into the residual stream, allowing later layers to attend to the nearby reasoning trajectory rather than re-attending to distant table tokens.

8 Conclusion

We present the first systematic empirical study on the internal mechanisms of LLM-based table understanding across 16 LLMs. Our analysis reveals a coherent three-phase workflow: early layers broadly encode the table, middle layers precisely localize query-relevant cells, and late layers amplify the focused content for answer generation. This pattern manifests consistently across architectures—MoE models mirror it through table-specific expert activation in middle layers, while CoT reasoning and table-specific fine-tuning further enhance table engagement. These interconnected findings offer actionable guidance: from optimal input configurations, to inference-time interventions in later layers, to targeted optimization of table-specialized experts.
9 Limitations

Though this paper presents a systematic empirical study of the internal mechanisms of LLM-based table understanding, there are certain limitations and promising directions that deserve future research. (1) Analyzing LLMs of larger scales and different series. Due to resource limitations, we mainly analyze advanced open-source models with parameters below 32B. Models of larger scales such as DeepSeek-V3-671B and Qwen3-235B-A22B also deserve further investigation to analyze the scaling effect on table understanding. (2) Extending to MLLMs. Previous work and recent proprietary models have demonstrated that multimodal LLMs (MLLMs) also possess strong table understanding ability based on table images (Zheng et al., 2024; Deng et al., 2024; Zhou et al., 2025b). As a result, it is worthwhile to extend our analysis to the MLLM scope to discover unique characteristics of multimodal table understanding. (3) Analyzing more tabular tasks. This study primarily focuses on table question answering and table fact verification, the two most typical tasks in existing studies. However, it is pertinent to investigate whether our findings generalize to more complex tabular tasks, such as data analysis and table summarization. (4) Enriching table formats and reasoning strategies. We do not intend to exhaust every possible table format (e.g., CSV and JSON) or every reasoning strategy (e.g., program-of-thoughts with in-context examples), which could be further explored by future follow-ups.

10 Ethical Considerations

Our empirical study and all experiments are conducted on 4 public academic benchmarks, which are free and open-source data for research use. The analyzed LLMs are also open-source models that we downloaded from their official websites. As a result, the authors foresee no ethical concerns.

References

Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. 2025.
Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21938–21953, Suzhou, China. Association for Computational Linguistics.

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, and Nanyun Peng. 2025. Multilingual routing in mixture-of-experts. arXiv preprint arXiv:2510.04694.

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 238–255, Vienna, Austria. Association for Computational Linguistics.

Kushal Raj Bhandari, Sixue Xing, Soham Dan, and Jianxi Gao. 2025. Exploring the robustness of language models for tabular question answering via attention analysis. Preprint.

Wenhu Chen. 2023. Large language models are few(1)-shot table reasoners. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1120–1130, Dubrovnik, Croatia. Association for Computational Linguistics.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. TabFact: A large-scale dataset for table-based fact verification. Preprint.

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2021. HiTab: A hierarchical table dataset for question answering and natural language generation. arXiv preprint arXiv:2108.06712.
Róbert Csordás, Christopher D Manning, and Christopher Potts. 2025. Do language models use their depth efficiently? arXiv preprint arXiv:2505.13898.

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, and 81 others. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434.

Naihao Deng and Rada Mihalcea. 2025. Rethinking table instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21757–21780, Vienna, Austria. Association for Computational Linguistics.

Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, and Rada Mihalcea. 2024. Tables as texts or images: Evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 407–426, Bangkok, Thailand. Association for Computational Linguistics.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. 2024. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781.

Yi Hu, Cai Zhou, and Muhan Zhang. 2025. What affects the effective depth of large language models? arXiv preprint arXiv:2512.14064.

Deyi Ji, Lanyun Zhu, Siqi Gao, Peng Xu, Hongtao Lu, Jieping Ye, and Feng Zhao.
2024. Tree-of-Table: Unleashing the power of LLMs for enhanced large-scale table understanding. Preprint.

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251, Singapore. Association for Computational Linguistics.

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. 2025a. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321.

Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, and Qiufeng Wang. 2025b. Can GRPO boost complex multimodal table understanding? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12631–12644, Suzhou, China. Association for Computational Linguistics.

Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2021. AIT-QA: Question answering dataset over complex tables in the airline industry. Preprint.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2023. Analyzing feed-forward blocks in transformers through the lens of attention maps. arXiv preprint arXiv:2302.00456.

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. 2023. SheetCopilot: Bringing software productivity to the next level through large language models.
In Thirty-seventh Conference on Neural Information Processing Systems.

Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Bowen Qin, Yurong Wu, Xiaodong Li, Chenhao Ma, Jian-Guang Lou, and Reynold Cheng. 2025. Are large language models ready for multi-turn tabular data analysis? In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 34795–34835. PMLR.

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. 2025. A closer look into mixture-of-experts in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4427–4447.

Weizheng Lu, Jing Zhang, Ju Fan, Zihao Fu, Yueguo Chen, and Xiaoyong Du. 2025. Large language model for table processing: A survey. Frontiers of Computer Science, 19(2).

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Microsoft. 2025. Copilot in Excel. Accessed: 2025-12-29.

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, and 5 others. 2024. OLMoE: Open mixture-of-experts language models. CoRR, abs/2409.02060.

Nostalgebraist. 2020. Interpreting GPT: The logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.

OpenAI. 2024. Improvements to data analysis in ChatGPT. Accessed: 2025-12-29.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. Preprint.

Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.

Alexey Shigarov. 2023. Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13(1):e1482.

Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang. 2023. An empirical study of instruction-tuning large language models in Chinese. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4086–4107, Singapore. Association for Computational Linguistics.

Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, and Chris Parnin. 2023. Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs. Preprint.

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, and 14 others. 2024. TableGPT2: A large multimodal model with tabular data integration.
Preprint.

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. 2025. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279.

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. Preprint.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, and Yu Wu. 2024a. Let the expert stick to his last: Expert-specialized fine-tuning for sparse architectural large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 784–801.

Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024b. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In ICLR.

Han Weng, Zhou Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen, and Wentao Zhang. 2025. UniDataBench: Evaluating data analytics agents across structured and unstructured data. Preprint.

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.

Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, and Zhoujun Li. 2025. Table-R1: Region-based reinforcement learning for table understanding. Preprint.

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Shang Yang, Haotian Tang, Yao Fu, Song Han, and 1 others.
DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zheyuan Yang, Lyuhao Chen, Arman Cohan, and Yilun Zhao. 2025b. Table-R1: Inference-time scaling for table reasoning tasks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20616–20635, Suzhou, China. Association for Computational Linguistics.

Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, pages 174–184, New York, NY, USA. Association for Computing Machinery.

Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. 2024. TableLlama: Towards open large generalist models for tables. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6024–6044, Mexico City, Mexico. Association for Computational Linguistics.
Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Yang Li, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, and Jie Tang. 2025. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10315–10344, Vienna, Austria. Association for Computational Linguistics.

Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, and Weiping Wang. 2024. Multimodal table understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9102–9124, Bangkok, Thailand. Association for Computational Linguistics.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025a. P2-TQA: A process-based preference learning framework for self-improving table question answering models. Preprint.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025b. Texts or images? A fine-grained analysis on the effectiveness of input representations and models for table question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2307–2318, Vienna, Austria. Association for Computational Linguistics.

A Background

A.1 Task Formalization

Given a table T and a natural language question Q, Table Question Answering (TableQA) requires the model G(·) to generate a final answer A based on the above information. Within the context of Large Language Models (LLMs), the input sequence x is composed of three segments serving distinct semantic roles:

• System Prompt Segment (S_sys): Comprises the system instructions and model-specific conversation templates (e.g., <|begin_of_text|> for Llama-3.1-8B-Instruct (Dubey et al., 2024)).

• Table Content Segment (S_tab): Contains the table data serialized in a specific format (e.g., Markdown or HTML).
• Question Description Segment (S_qst): Contains the specific user query Q.

Let I_sys, I_tab, and I_q denote the sets of indices corresponding to these segments in the input sequence; the total input index set is I_in = I_sys ∪ I_tab ∪ I_q. The set of indices for tokens generated by the model is denoted as I_gen.

A.2 Architecture of LLMs

The input sequence is processed by the LLM through L transformer blocks (Vaswani et al., 2017). The update process of the residual stream at each layer l can be unified as follows:

$$\hat{x}^{l} = \mathrm{MHA}^{l}(\mathrm{LN}(x^{l-1})) + x^{l-1} \quad (1)$$

$$x^{l} = \mathrm{Trans}^{l}(\mathrm{LN}(\hat{x}^{l})) + \hat{x}^{l} \quad (2)$$

where x^l ∈ R^{n×d} represents the hidden state at layer l, d is the embedding dimension, and n is the total sequence length. MHA denotes the Multi-Head Attention module, and Trans represents the transition layer, which is a Feed-Forward Network (FFN) in dense LLMs or a Mixture-of-Experts (MoE) layer in MoE LLMs.

In the l-th layer, the MHA mechanism allocates attention weights by computing the similarity between Queries (Q) and Keys (K):

$$\alpha^{l,h}_{i,j} = \mathrm{softmax}\left(\frac{(x^{l-1}_{i} W^{l,h}_{Q})(x^{l-1}_{j} W^{l,h}_{K})^{T}}{\sqrt{d_k}}\right) \quad (3)$$

where α^{l,h}_{i,j} denotes the attention weight from token i to token j in the h-th attention head of the l-th layer. The final output of a single attention head is the aggregation of contributions from all preceding tokens:

$$\mathrm{MHA}^{l,h}(x_i) = \sum_{j \le i} \alpha^{l,h}_{i,j}\,(x^{l-1}_{j} W^{l,h}_{OV}) \quad (4)$$

Here, W^{l,h}_{OV} = W^{l,h}_{V} W^{l,h}_{O} represents the joint value-output projection matrix, which defines the specific path through which information is mapped from input tokens into the residual stream.

For MoE LLMs, the transition layer introduces a sparse activation mechanism where a routing function G selects the Top-k experts for each token. The output is scaled by the routing weights and aggregated:

$$\mathrm{MoE}(x_i) = \sum_{e \in \mathrm{Top}\text{-}k(G_i)} G_{i,e} \cdot E_{e}(\mathrm{LN}(x_i)) \quad (5)$$

where G_i = softmax(x_i W_G) and E_e denotes the e-th expert.
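Equation (5) can be illustrated with a toy routing step. This sketch omits LayerNorm and uses scalar stand-in experts and a hypothetical gate function, so it shows only the top-k selection and gate-weighted aggregation, not a real MoE layer:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, gate_logits_fn, experts, k=2):
    """Toy sparse MoE step in the spirit of Eq. (5): compute gate scores
    for token x, keep the top-k experts, and sum their outputs weighted
    by the gate scores. `gate_logits_fn` plays the role of x W_G."""
    gates = softmax(gate_logits_fn(x))
    # indices of the k largest gate scores
    topk = sorted(range(len(gates)), key=lambda e: gates[e])[-k:]
    out = [0.0] * len(x)
    for e in topk:
        y = experts[e](x)  # E_e(x); LayerNorm omitted for brevity
        out = [o + gates[e] * v for o, v in zip(out, y)]
    return out
```

Only the selected experts are evaluated, which is what makes MoE inference cheaper than running every expert.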
Through this hierarchical process, the LLM transforms internal representations into the final response.

B Interpretability Analysis Methods

This section introduces our core methodology for revealing the internal mechanisms of LLMs during tabular reasoning, categorized into three dimensions: Inner Prompt, Inner Table, and Contribution.

B.1 Inner Prompt: Segment Attention Ratio

To quantify the allocation of attention resources across different input segments during inference, we define the segment attention ratio D, which measures the degree of attention weight assigned to a specific segment S at layer l, averaged across all attention heads and the entire generation sequence I_gen:

$$D^{l}(S) = \frac{1}{|\mathcal{I}_{gen}| \cdot H} \sum_{i \in \mathcal{I}_{gen}} \sum_{h=1}^{H} \sum_{j \in S} \hat{\alpha}^{l,h}_{i,j} \quad (6)$$

where $\hat{\alpha}^{l,h}_{i,j} = \alpha^{l,h}_{i,j} / \sum_{k \in \mathcal{I}_{in}} \alpha^{l,h}_{i,k}$ represents the weights rescaled over the input segment I_in at generation step i ∈ I_gen. By observing the trend of D^l(S) across layers, we can identify the dynamic transition of attention from the structured background (S_tab) to the task objective (S_q).

B.2 Inner Table: Table Attention Entropy

To qualitatively describe the degree of focus on relevant cells within the table, we define the average table attention entropy at layer l as:

$$H^{l}_{table} = \frac{1}{|\mathcal{I}_{gen}|} \sum_{i \in \mathcal{I}_{gen}} \left( - \sum_{j \in \mathcal{I}_{tab}} \bar{P}^{l}_{i}(j) \log \bar{P}^{l}_{i}(j) \right) \quad (7)$$

where the inner-table attention distribution $\bar{P}^{l}_{i}(j)$ is defined as:

$$\bar{P}^{l}_{i}(j) = \frac{\sum_{h=1}^{H} \alpha^{l,h}_{i,j}}{\sum_{k \in \mathcal{I}_{tab}} \sum_{h=1}^{H} \alpha^{l,h}_{i,k}}, \quad \forall j \in \mathcal{I}_{tab} \quad (8)$$

A higher entropy indicates a uniform distribution of attention across the table, whereas a lower entropy signifies that attention is concentrated in a few specific cells, implying that the layer may have identified the task-relevant information.

B.3 Value-Weighted Contribution

Following the perspective of previous research (Kobayashi et al., 2020; Gu et al., 2024; Kang et al.
, 2025a) that attention weights alone do not quantify the actual magnitude of change in the residual stream, we introduce the Value-Weighted Contribution analysis. To capture the model's behavior throughout the generation process, we compute the averaged contribution over I_gen:

$$C^{l}(S) = \frac{1}{|\mathcal{I}_{gen}| \cdot H} \sum_{i \in \mathcal{I}_{gen}} \left\| \sum_{j \in S} \sum_{h=1}^{H} \alpha^{l,h}_{i,j}\, x^{l-1}_{j} W^{l,h}_{OV} \right\|_{2} \quad (9)$$

where S ∈ {I_sys, I_tab, I_q}. By integrating the hidden states x_j with the projection matrix W_OV, this formula faithfully reconstructs the intensity of the information injected by a specific segment S at layer l to drive the following reasoning.

C Experimental Details

C.1 Prompt Templates

We provide the complete prompt templates used in our experiments. The input structure follows the standard practice of prompting LLMs for table tasks, consisting of three segments: (1) a prompt template, (2) a table content segment, and (3) a user question segment.

Overall Prompt Template

System Prompt: You are a helpful assistant.
User Prompt:
TABLE: {table_content}
INPUT: {question_content}

Prompt Template for Table Question Answering (WTQ, HiTab, AITQA)

TABLE: {table_content (markdown / html)}
INPUT: Given the table titled '{table_title}', answer the following question based on the given table. The final answer should be concise and use the following format:
```json
{
  "answer": ["answer1", "answer2", ...]
}
```
Question: {question_content}

Prompt Template for Table Fact Verification (TabFact)

TABLE: {table_content (markdown / html)}
INPUT: Given the table titled '{table_title}', determine whether the following statement is entailed or refuted by the given table (Output 1 for entailed and 0 for refuted):
Statement: {statement_content}

C.2 Table Serialization Formats

Unless otherwise specified, we use Markdown as the default table serialization format.
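As a minimal illustration, the two serializations can be produced by helpers like the following. These are hypothetical functions, not part of the released pipeline, and the choice of `<th>` for header cells is one common convention rather than a detail confirmed by the paper:

```python
def to_markdown(header, rows):
    """Serialize a table as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Serialize the same table with HTML tags."""
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    body = "".join("<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>"
                   for r in rows)
    return f"<table>{head}{body}</table>"
```

The HTML variant spends many more tokens on structural tags than the Markdown variant for the same cell content, which is the verbosity contrast discussed in Section 7.1.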
Figure 7 illustrates the two serialization formats (Markdown and HTML) used in our experiments.

Markdown Format

| Name  | Age | City     |
|-------|-----|----------|
| Alice | 25  | Beijing  |
| Bob   | 30  | Shanghai |

HTML Format

<table>
  <tr><th>Name</th><th>Age</th><th>City</th></tr>
  <tr><td>Alice</td><td>25</td><td>Beijing</td></tr>
  <tr><td>Bob</td><td>30</td><td>Shanghai</td></tr>
</table>
Figure 7: Examples of table serialization formats.

C.3 Analyzed Models

To provide a comprehensive analysis across different model architectures and training paradigms, we categorize the investigated LLMs into three groups:

(1) Instruct LLMs. We analyze general-purpose instruction-tuned models that represent the mainstream LLM paradigm, including the Llama series (Dubey et al., 2024) (e.g., Llama-3.1-8B-Instruct) and the Qwen series (Yang et al., 2024) (e.g., Qwen-2.5-7B-Instruct).

(2) Table-Specific Fine-tuned LLMs. To understand how specialized training affects internal mechanisms, we include models specifically optimized for tabular tasks. We further distinguish between SFT-only models trained through supervised fine-tuning (e.g., the TAMA series (Deng and Mihalcea, 2025)) and RL-only models optimized via reinforcement learning (e.g., the Table-R1-Zero series (Yang et al., 2025b)).

(3) Mixture-of-Experts (MoE) LLMs. Beyond dense architectures, we extend our analysis to MoE models to examine whether sparse expert activation leads to different table understanding behaviors. Representative models include OLMoE-1B-7B (Muennighoff et al., 2024), DeepSeek-V2-Lite-Chat (DeepSeek-AI et al., 2024), and Qwen3-30B-A3B (Yang et al., 2025a).

D Model Analysis Details

D.1 Benchmark Performance

Table 4 reports the accuracy of all analyzed LLMs on our four evaluation benchmarks. Several observations emerge: (1) larger models generally achieve better performance; (2) table-specific fine-tuned models (TAMA, TableGPT2, Table-R1) show notable improvements on hierarchical table datasets (HiTab, AITQA); and (3) RL-tuned models (Table-R1 series) demonstrate particularly strong performance on complex table structures.
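The bootstrap confidence intervals reported in this appendix can be sketched as a percentile bootstrap over binary per-sample outcomes; this is a simplified illustration, and the paper's exact resampling implementation may differ:

```python
import random

def bootstrap_ci(correct, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy: resample the per-sample
    0/1 outcomes with replacement n_iter times and take the
    alpha/2 and 1 - alpha/2 quantiles of the resampled accuracies."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_iter)
    )
    lo = accs[int((alpha / 2) * n_iter)]
    hi = accs[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return sum(correct) / n, (lo, hi)
```

Two conditions are then called significantly different when their intervals do not overlap, as in the comparisons of Table 5.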
Table 5 reports the performance of representative models under different input format and inference strategy combinations, with 95% bootstrap confidence intervals (1,000 iterations, N=2,000 per condition). All CoT vs. Direct differences are statistically significant (non-overlapping CIs), confirming the reliability of prompting strategy conclusions. Markdown vs. HTML differences fall within the margin of uncertainty for most models, consistent with the observation that format-specific differences in early layers gradually converge in deeper layers, resulting in minimal overall performance gaps.

| Model | WTQ | TabFact | HiTab | AITQA |
|---|---|---|---|---|
| *Instruct LLMs* | | | | |
| LLaMA3.2-1B-Instruct | 6.00 | 5.40 | 6.40 | 9.80 |
| LLaMA3.2-3B-Instruct | 33.00 | 58.00 | 30.80 | 52.00 |
| LLaMA3.1-8B-Instruct | 50.20 | 70.80 | 49.00 | 57.80 |
| Qwen2.5-3B-Instruct | 52.40 | 78.00 | 49.80 | 55.60 |
| Qwen3-4B-Instruct-2507 | 52.00 | 75.20 | 49.60 | 60.00 |
| Qwen2.5-7B-Instruct | 51.00 | 73.00 | 53.40 | 71.20 |
| Qwen2.5-14B-Instruct | 56.40 | **88.60** | 55.80 | 74.40 |
| Qwen2.5-32B-Instruct | **64.60** | 85.80 | 68.80 | 83.60 |
| *Table-Specific Fine-tuned LLMs* | | | | |
| TAMA-Qwen2.5 | 52.40 | 74.80 | 56.80 | 71.60 |
| TAMA | 50.60 | 72.60 | 66.80 | 82.00 |
| TableGPT2 | 56.80 | 74.80 | 59.80 | 84.20 |
| Table-R1-7B-Zero | 53.80 | 76.20 | 63.80 | **85.20** |
| Table-R1-8B-Zero | 60.40 | 70.40 | **74.00** | 78.60 |
| *MoE LLMs* | | | | |
| OLMoE-1B-7B-Instruct | 6.82 | 2.20 | 5.01 | 14.60 |
| DeepSeek-V2-Lite-Chat | 28.60 | 63.40 | 30.40 | 53.00 |
| Qwen3-30B-A3B-Instruct-2507 | 56.40 | 78.20 | 57.40 | 73.20 |

Table 4: LLMs' performance (accuracy %) on evaluation data of four benchmarks. Best results are bolded.

Table 5: Performance (%) with 95% bootstrap confidence intervals under different input formats and inference strategies. † marks statistically significant differences between CoT and Direct (non-overlapping CIs).
| Model | Markdown Direct | Markdown CoT | HTML Direct | HTML CoT |
|---|---|---|---|---|
| *Instruct LLMs* | | | | |
| Qwen2.5-7B | 62.2 ± 2.1 | 67.2 ± 2.0† | 61.4 ± 2.1 | 67.8 ± 2.0† |
| Llama-3.1-8B | 57.0 ± 2.2 | 62.4 ± 2.1† | 58.0 ± 2.1 | 58.4 ± 2.1† |
| *Table-Specific Fine-tuned LLMs* | | | | |
| Table-R1-7B | 69.8 ± 2.0 | 82.3 ± 1.7† | 70.3 ± 1.9 | 81.6 ± 1.7† |
| Table-R1-8B | 70.8 ± 2.0 | 82.2 ± 1.7† | 69.9 ± 2.1 | 80.6 ± 1.8† |
| *MoE LLMs* | | | | |
| Qwen3-30B-A3B | 66.3 ± 2.1 | 70.2 ± 2.0† | 68.3 ± 2.0 | 72.1 ± 1.9† |

D.2 Analysis of Effective Depth

We extend the effective depth analysis from Section 5 to all analyzed models. Table 6 summarizes the layer effectiveness statistics, while Figure 27 provides detailed layer-wise prediction stability curves for each model. As described in the main text, we define the effective layer as the first layer where the top-5 token overlap between the intermediate prediction and the final output exceeds 0.3. The effective ratio (Eff. Ratio = Eff. Layers / Total Layers) indicates how deeply the model processes before reaching a stable prediction; a lower ratio suggests earlier stabilization.

Key observations from the extended analysis:

• Most models stabilize their predictions only after roughly 75–90% of their total layers, confirming that table tasks require deep processing.

• Table-specific fine-tuning (TAMA, Table-R1) does not significantly alter the effective depth compared to the corresponding base models, suggesting that layer-wise functionality is largely determined during pre-training.

• MoE models (OLMoE, DeepSeek-V2-Lite, Qwen3-30B-A3B) exhibit similar effective ratios to dense models of comparable capacity.

| Model | Total Layers | Eff. Layers | Eff. Ratio |
|---|---|---|---|
| *Instruct LLMs* | | | |
| Llama-3.2-1B-Instruct | 16 | 14 | 0.88 |
| Llama-3.2-3B-Instruct | 28 | 23 | 0.82 |
| Meta-Llama-3.1-8B-Instruct | 32 | 25 | 0.78 |
| Qwen2.5-3B-Instruct | 36 | 33 | 0.92 |
| Qwen2.5-7B-Instruct | 28 | 25 | 0.89 |
| Qwen2.5-14B-Instruct | 48 | 44 | 0.92 |
| Qwen2.5-32B-Instruct | 64 | 57 | 0.89 |
| Qwen3-4B-Instruct-2507 | 36 | 24 | 0.67 |
| *Table-Specific Fine-tuned LLMs* | | | |
| TableGPT2-7B | 28 | 25 | 0.89 |
| TAMA-Qwen2.5 | 28 | 25 | 0.89 |
| Table-R1-Zero-7B | 28 | 25 | 0.89 |
| TAMA-1e-6 | 32 | 25 | 0.78 |
| Table-R1-Zero-8B | 32 | 24 | 0.75 |
| *MoE LLMs* | | | |
| OLMoE-1B-7B-0924-Instruct | 16 | 14 | 0.88 |
| DeepSeek-V2-Lite-Chat | 27 | 21 | 0.78 |
| Qwen3-30B-A3B-Instruct-2507 | 48 | 45 | 0.94 |

Table 6: Layer effectiveness analysis on table-related tasks using LogitLens. Eff. Layers identifies the first layer where the predicted vocabulary distribution aligns with the final output (defined as the top-5 token overlap first exceeding 0.3). Eff. Ratio (Eff. Layers / Total Layers) reflects the stage at which the model stabilizes its prediction; a lower ratio indicates that the model reaches its final decision in its shallower layers.

E More Discussions

E.1 Connection to Retrieval Heads

Recent work has identified retrieval heads, a sparse set of attention heads responsible for copying relevant information from context to output (Wu et al., 2024; Xiao et al.). Given that table QA inherently requires locating and extracting specific cell values, we hypothesize that table understanding may share similar mechanisms.

To investigate this, we compare the attention heads most critical for table tasks (measured by attention norm intensity) with the Top-32 retrieval heads identified by Wu et al. (2024), using the tabular data from our analysis. As shown in Figure 9, the retrieval heads (marked with red boxes) predominantly align with regions of high attention norm intensity (yellow areas) across all examined models. This overlap is particularly pronounced in the middle-to-late layers: for instance, in Qwen2.5-7B, retrieval heads cluster around layers 19–27; in LLaMA3.1-8B, they concentrate in layers 14–15 and 26–28; in TAMA and Table-R1-8B, similar patterns emerge in layers 13–17 and 21–28.
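The head-level comparison above can be sketched as a simple set overlap. This is a minimal illustration, not the paper's pipeline: `attn_norm` stands in for measured per-head attention-norm scores, and `retrieval_heads` for a subset of the Top-32 (layer, head) pairs reported by prior work.

```python
import numpy as np

# Sketch: overlap between the heads with highest attention-norm intensity on
# tabular inputs and a given set of retrieval heads ((layer, head) pairs).
# attn_norm and retrieval_heads are illustrative placeholders.

rng = np.random.default_rng(0)
n_layers, n_heads = 28, 28                    # e.g., a Qwen2.5-7B-sized model
attn_norm = rng.random((n_layers, n_heads))   # stand-in for measured norms

retrieval_heads = {(19, 3), (21, 7), (24, 12), (26, 0)}  # hypothetical subset

# Top-32 heads ranked by attention-norm intensity.
flat = attn_norm.ravel().argsort()[::-1][:32]
top_by_norm = {(int(i) // n_heads, int(i) % n_heads) for i in flat}

overlap = retrieval_heads & top_by_norm
print(f"{len(overlap)}/{len(retrieval_heads)} retrieval heads are also "
      f"among the Top-32 attention-norm heads")
```

A high overlap under this measure is what Figure 9 visualizes spatially: red boxes (retrieval heads) falling inside yellow regions (high attention norm).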
Notably, this alignment pattern persists across both vanilla instruction-tuned models (Qwen2.5, LLaMA3) and table-specific fine-tuned models (TAMA, Table-R1), suggesting that table-specific training does not fundamentally alter which heads are responsible for information retrieval; it may instead enhance the efficiency of these pre-existing circuits.

This finding suggests that LLMs leverage their pre-existing retrieval mechanisms for table understanding, rather than developing entirely separate circuits. It also has practical implications: KV cache compression techniques designed for retrieval heads (Xiao et al.) may be directly applicable to accelerate table reasoning tasks without significant performance degradation.

F Attention Dynamics for All Models

To validate the generalizability of our findings in Section 4, we present the complete attention dynamics analysis for all 16 LLMs examined in this study. Each figure follows the same layout as Figure 1: upper panels show case-level attention patterns at selected layers, while lower panels display aggregated metrics including segment attention ratio, table attention entropy, and attention contribution across all layers. Despite variations in model architecture (dense vs. MoE), scale (1B to 32B), and training paradigm (instruction-tuning, SFT, RL), all models exhibit the consistent three-phase pattern: dispersed attention in early layers, concentrated focus on query-relevant cells in middle layers, and amplified contributions in late layers. Notably, table-specific fine-tuned models (TAMA, Table-R1, TableGPT2) show slightly higher attention ratios to table content compared to their base models, while MoE models (DeepSeek-V2-Lite, OLMoE, Qwen3-30B-A3B) display more gradual entropy transitions across layers.

TABLE:

| Rank | Nation | Gold | Silver | Bronze | Total |
| --- | --- | --- | --- | --- | --- |
| 1 | China | 13 | 9 | 13 | 35 |
| 2 | Japan | 7 | 10 | 7 | 24 |
| 3 | Uzbekistan | 1 | 2 | 3 | 6 |
| 4 | Kazakhstan | 2 | 2 | 0 | 4 |
| 5 | North Korea | 1 | 0 | 1 | 2 |
| 6 | South Korea | 0 | 0 | 2 | 2 |
| Total | Total | 24 | 23 | 26 | 73 |

INPUT: Given the table titled 'Figure skating at the Asian Winter Games', Answer the following question based on the given table. The final answer should be concise and use the following format: ```json { "answer": ["answer1", "answer2", ...] } ```

Question: which nation won the most gold medals?

Qwen2.5-7B-Instruct: ```json\n{\n "answer": [\n "China"\n ]\n}\n```

Meta-LLaMA-3.1-8B-Instruct: {\n "answer": [\n "China"\n ]\n}

Figure 8: The qualitative case example for Q1.

[Figure 9 shows head-level attention norm heatmaps (layers × attention heads) for QWEN2.5-3B, QWEN2.5-7B, TAMA-QWEN2.5, TABLE-R1-7B, LLAMA3.2-3B, LLAMA3.1-8B, TAMA, and TABLE-R1-8B, with the search-located key retrieval heads overlaid.]

Figure 9: Top-32 retrieval heads' overlap with attention norm heatmaps across LLMs.
Background color indicates attention norm intensity (yellow = higher); red boxes mark the Top-32 retrieval heads identified by prior work. The substantial overlap, especially in middle-to-late layers, suggests that table understanding reuses pre-existing retrieval circuits.

[Figures 10–22 each show, for one model, the case-level attention dynamics on the question "which nation won the most gold medals?" (answer: China): attention ratio over generation steps and attention distribution over table cells at selected layers, together with layer-wise segment attention ratio, table attention entropy, and attention contribution (L2 norm) for the system instruction, table content, and user question.]

Figure 10: Attention dynamics of Qwen3-30B-A3B-Instruct-2507.

Figure 11: Attention dynamics of TAMA.

Figure 12: Attention dynamics of Table-R1-Zero-8B.

Figure 13: Attention dynamics of Table-R1-Zero-7B.

Figure 14: Attention dynamics of TAMA-Qwen2.5.

Figure 15: Attention dynamics of Qwen3-4B-Instruct.

Figure 16: Attention dynamics of Qwen2.5-3B-Instruct.

Figure 17: Attention dynamics of Qwen2.5-14B-Instruct.

Figure 18: Attention dynamics of Qwen2.5-32B-Instruct.

Figure 19: Attention dynamics of TableGPT2-7B.

Figure 20: Attention dynamics of LLaMA3.2-3B-Instruct.

Figure 21: Attention dynamics of LLaMA3.2-1B-Instruct.

Figure 22: Attention dynamics of DeepSeek-V2-Lite-Chat.
DeepSeek-V2-Lite-Chat. 25 Q: which nation won the most gold medals? M: OLMoE-1B-7B-Instruct (MoE) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L0 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L3 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L3 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L6 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L6 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L10 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L10 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L12 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L12 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L15 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L15 T able Cells Attn Distri 0 2 4 6 8 10 12 14 Layer Index 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0 2 4 6 8 10 12 14 Layer Index 4.5 5.0 Entropy T able Attn Entropy 0 2 4 6 8 10 12 14 Layer Index 0.0 0.1 0.2 0.3 L2 Norm Attn Contribution System Instruction 
T able Content User Question Figure 23: Attention dynamics of OLMoE-1B-7B-Instruct. Q: which nation won the most gold medals? M: Gated_Attn_1B_Baseline (Dense) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L0 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L3 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L3 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L8 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L8 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L13 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L13 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L21 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L21 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L27 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L27 T able Cells Attn Distri 0 5 10 15 20 25 Layer Index 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0 5 10 15 20 25 Layer Index 4.0 4.5 5.0 Entropy T able Attn Entropy 0 5 10 15 20 25 Layer Index 0 5 
10 L2 Norm Attn Contribution System Instruction T able Content User Question Figure 24: Attention dynamics of Qwen’ s 1B gated attention baseline. 26 Q: which nation won the most gold medals? M: Gated_Attn_1B_Elementwise (Dense) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L0 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L6 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L6 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L11 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L11 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L18 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L18 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L21 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L21 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L27 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L27 T able Cells Attn Distri 0 5 10 15 20 25 Layer Index 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0 5 10 15 20 25 Layer Index 
3.0 3.5 4.0 4.5 Entropy T able Attn Entropy 0 5 10 15 20 25 Layer Index 0 10 20 30 40 L2 Norm Attn Contribution System Instruction T able Content User Question Figure 25: Attention dynamics of Qwen’ s 1B model with element-wise gated attention. Q: which nation won the most gold medals? M: Gated_Attn_1B_Headwise (Dense) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L0 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L6 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L6 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L13 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L13 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L18 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L18 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L21 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L21 T able Cells Attn Distri Start End Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L27 Attn Ratio Dyna Rank Natio Gold Silve Bronz T otal 1 China 13 9 13 35 2 Japan 7 10 7 24 3 Uzbek 1 2 3 6 4 Kazak 2 2 0 4 5 North 1 0 1 2 6 South 0 0 2 2 T otal T otal 24 23 26 73 L27 T able Cells Attn Distri 0 5 10 15 20 25 Layer Index 
0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0 5 10 15 20 25 Layer Index 3 4 5 Entropy T able Attn Entropy 0 5 10 15 20 25 Layer Index 0 5 10 15 L2 Norm Attn Contribution System Instruction T able Content User Question Figure 26: Attention dynamics of Qwen’ s 1B model with Headwise Gated Attention. 27 0 4 8 12 16 20 24 28 Layer Index 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 KL Divergence Llama-3.2-3B-Instruct 0 4 8 12 16 20 24 28 32 Layer Index 0 2 4 6 8 10 12 14 KL Divergence Meta-Llama-3.1-8B-Instruct 0 4 8 12 16 20 24 28 32 Layer Index 0 2 4 6 8 10 12 14 KL Divergence T AMA 0 4 8 12 16 20 24 28 32 Layer Index 0 5 10 15 20 25 KL Divergence T able-R1-Zero-8B 0 4 8 12 16 20 24 28 Layer Index 0 5 10 15 20 25 KL Divergence T ableGPT2-7B 0 4 8 12 16 20 24 28 Layer Index 0 5 10 15 20 25 30 35 40 KL Divergence Qwen2.5-7B-Instruct 0 4 8 12 16 20 24 28 Layer Index 0 5 10 15 20 25 30 35 KL Divergence T AMA-Qwen2.5 0 4 8 12 16 20 24 28 Layer Index 0 10 20 30 40 KL Divergence T able-R1-Zero-7B 0 4 8 12 16 20 24 28 32 36 Layer Index 0 5 10 15 20 25 30 35 40 KL Divergence Qwen3-4B-Instruct 0 4 8 12 16 20 24 28 32 36 Layer Index 0 5 10 15 20 25 30 35 KL Divergence Qwen2.5-3B-Instruct 0 6 12 18 24 30 36 42 48 Layer Index 0 5 10 15 20 25 30 35 40 KL Divergence Qwen2.5-14B-Instruct 0 8 16 24 32 40 48 56 64 Layer Index 0 5 10 15 20 25 30 35 KL Divergence Qwen2.5-32B-Instruct 0 2 4 6 8 10 12 14 16 Layer Index 0 2 4 6 8 10 12 14 16 KL Divergence Llama-3.2-1B-Instruct 0 2 4 6 8 10 12 14 16 Layer Index 0 2 4 6 8 10 12 14 KL Divergence OLMoE-1B-7B -Instruct 0 3 6 9 12 15 18 21 24 27 Layer Index 0 5 10 15 20 25 30 35 KL Divergence DeepSeek- V2-Lite-Chat 0 6 12 18 24 30 36 42 48 Layer Index 0 5 10 15 20 25 30 35 40 KL Divergence Qwen3-30B-A3B-Instruct 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 Overlap 0.0 0.2 0.4 0.6 0.8 1.0 
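The segment attention ratio and table attention entropy plotted in these figures can be approximated directly from a model's attention maps. The sketch below is a minimal illustration on synthetic weights, not the paper's exact pipeline: the function names, the head-averaging choice, and the use of the final query position are our own assumptions.

```python
import numpy as np

def segment_attn_ratio(attn, segments):
    """Share of the final query position's attention mass on each input segment.

    attn: (heads, seq, seq) attention weights from one layer (rows sum to 1).
    segments: {name: (start, end)} token index ranges, end exclusive.
    """
    last_row = attn[:, -1, :].mean(axis=0)  # average the final-token row over heads
    total = last_row.sum()
    return {name: float(last_row[s:e].sum() / total)
            for name, (s, e) in segments.items()}

def table_attn_entropy(attn, table_span):
    """Shannon entropy of the (renormalized) attention over the table tokens.

    Lower entropy means attention is concentrated on fewer table cells.
    """
    s, e = table_span
    p = attn[:, -1, s:e].mean(axis=0)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy example: 3 heads, 10 tokens, with hypothetical segment boundaries
# system=[0,2), table=[2,8), question=[8,10).
rng = np.random.default_rng(0)
attn = rng.random((3, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize rows like a softmax output
segments = {"system": (0, 2), "table": (2, 8), "question": (8, 10)}
ratios = segment_attn_ratio(attn, segments)
H = table_attn_entropy(attn, (2, 8))
```

In practice the attention tensors would come from a real forward pass (e.g. a Transformers model called with `output_attentions=True`), applied once per layer to produce the layer-wise curves.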
[Figure 27: Analysis results of LLMs' effective layers. For each of the 16 models (LLaMA-3.2-1B/3B-Instruct, Meta-Llama-3.1-8B-Instruct, TAMA, TAMA-Qwen2.5, Table-R1-Zero-7B/8B, TableGPT2-7B, Qwen2.5-3B/7B/14B/32B-Instruct, Qwen3-4B-Instruct, Qwen3-30B-A3B-Instruct, OLMoE-1B-7B-Instruct, and DeepSeek-V2-Lite-Chat), panels plot per-layer KL divergence and prediction overlap against layer index.]
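Figure 27's effective-layer analysis tracks how quickly intermediate-layer predictions converge to the model's final output. Below is a minimal sketch of one plausible such metric, the per-layer KL divergence between an early-exit ("logit lens"-style) distribution and the final-layer distribution, demonstrated on synthetic logits; the function names and the convergence setup are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return z / z.sum(axis=-1, keepdims=True)

def layerwise_kl(layer_logits, final_logits):
    """KL(final || layer) for each layer's early-exit distribution.

    A layer can be considered 'effective' once this divergence from the
    final prediction becomes small and stays small at deeper layers.
    """
    p = softmax(final_logits)
    return [float((p * (np.log(p + 1e-12) - np.log(softmax(l) + 1e-12))).sum())
            for l in layer_logits]

# Toy example: a final logit vector plus noisy intermediate-layer logits
# whose noise shrinks with depth, mimicking convergence to the final output.
rng = np.random.default_rng(0)
final = rng.normal(size=32)
layers = [final + rng.normal(scale=4.0 / (i + 1), size=32) for i in range(8)]
kls = layerwise_kl(layers, final)
```

On a real model, `layer_logits` would be obtained by projecting each layer's hidden state through the unembedding matrix; the layer index where the KL curve flattens gives the effective depth compared across tasks.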
