Title: Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
ArXiv ID: 2512.06681
Date: 2025-12-07
Authors: Amartya Hatua
📝 Abstract
We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2's sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.
📄 Full Content
Mechanistic Interpretability of GPT-2: Lexical and
Contextual Layers in Sentiment Analysis
Amartya Hatua ∗
AI Center of Excellence
Fidelity Investments
Boston, MA 02210
amartyahatua@gmail.com
∗Code and data available at: https://github.com/amartyahatua/MI_Sentiment_Analysis
Abstract
We present a mechanistic interpretability study of GPT-2 that causally examines
how sentiment information is processed across its transformer layers. Using
systematic activation patching across all 12 layers, we test the hypothesized two-
stage sentiment architecture comprising early lexical detection and mid-layer
contextual integration. Our experiments confirm that early layers (0-3) act as
lexical sentiment detectors, encoding stable, position-specific polarity signals that
are largely independent of context. However, all three contextual integration
hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed
Processing) are falsified. Instead of mid-layer specialization, we find that contextual
phenomena such as negation, sarcasm, and domain shifts are integrated primarily in
late layers (8-11) through a unified, non-modular mechanism. These experimental
findings provide causal evidence that GPT-2’s sentiment computation differs from
the predicted hierarchical pattern, highlighting the need for further empirical
characterization of contextual integration in large language models.
1 Introduction
Large language models demonstrate impressive capabilities across a wide range of linguistic
tasks. Despite this progress, existing interpretability research primarily relies on correlational
evidence from probing or attention analysis. Consequently, the internal causal structure through
which these models encode and transform linguistic information has not been widely explored.
Early research focused on identifying how distinct layers within transformers contribute to different
stages of linguistic processing. Tenney et al. [2019] found that BERT processes language in stages:
early layers handle syntactic information, while later layers understand semantic relationships. This
suggests that transformers operate similarly to a pipeline, progressing from simple features to a
complex understanding. It was the first clear evidence that these models have organized, step-by-step
processing. Building upon this foundation, Jawahar et al. [2019] formalized a three-tier
hierarchical framework: early layers handle basic word features, middle layers deal with grammar
and sentence structure, and late layers understand meaning and how distant words relate to each
other. Simultaneously, Clark et al. [2019] revealed that individual attention heads develop specialized
functions for specific linguistic phenomena, following the same early-to-late progression. Rogers et al.
[2020] provided a comprehensive synthesis that established a general consensus on middle-layer
specialization for syntactic structure, while highlighting that semantic processing remains
more distributed and less well understood. All these studies showed that transformers seem to
process language in organized, step-by-step ways. However, their methodologies were predominantly
correlational, relying on probing classifiers and attention analysis to identify what information exists
in representations rather than what models actually use during inference.
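To make the distinction concrete, the sketch below shows the kind of correlational probe those earlier studies relied on: a logistic-regression classifier trained on frozen GPT-2 hidden states to predict sentiment labels. The toy examples, the choice of layer 6, and the last-token pooling are illustrative assumptions rather than any particular study's setup; high probe accuracy would show only that sentiment is linearly decodable from that layer, not that the model relies on it.

```python
# Minimal sketch of a correlational probing classifier (illustrative, not any paper's exact setup).
# A logistic-regression probe is trained on frozen GPT-2 hidden states from one layer to predict
# sentiment labels; accuracy shows decodability, not that GPT-2 causally uses the information.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

# Toy labeled examples (hypothetical); real probing studies use corpora such as SST-2.
texts = ["The movie was wonderful", "The movie was terrible",
         "A delightful experience", "An utterly boring film"]
labels = [1, 0, 1, 0]

LAYER = 6  # which hidden layer to probe; an arbitrary illustrative choice

features = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden_states = model(**inputs).hidden_states          # tuple: embeddings + 12 layers
        features.append(hidden_states[LAYER][0, -1].numpy())   # last-token representation

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```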
Newer research has highlighted this gap. Researchers now recognize that finding patterns does not prove
the model actually uses them. As Belinkov et al. [2023] put it, there is a gap between what we can detect
in the model and what the model actually relies on; just because we can find information does not
mean the model uses it. When Elazar and Goldberg [2018] tried removing features they thought were
important, the models often worked just fine without them, suggesting that probes were picking up
spurious patterns rather than ones the models actually rely on. Makelov et al. [2024] identified
“interpretability illusions”: interventions that seemed to reveal how models work but were actually
triggering backup mechanisms that had nothing to do with normal processing.
Recent years have brought major improvements in solving the correlation-causation problem. The
field of mechanistic interpretability [Rai et al., 2024] has developed techniques such as activation
patching [Heimersheim and Nanda, 2024] that let researchers directly test cause and effect, while
automated tools have made the analysis process more systematic. Companies like Anthropic and
OpenAI have successfully applied these mechanistic methods to real models, finding millions of
interpretable features in their large language models.
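As a concrete illustration of what activation patching involves, the sketch below patches the residual stream from a clean sentiment prompt into a corrupted run, one layer at a time, and measures how much of the clean prediction is restored. The TransformerLens library, the toy prompts, the residual-stream hook point, the final-position patch, and the logit-difference metric are all illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of layer-wise activation patching on GPT-2.
# Assumptions for illustration: TransformerLens hooks, residual-stream patching at the
# final token position, toy prompts, and a logit-difference metric.
import torch
from transformer_lens import HookedTransformer, utils

torch.set_grad_enabled(False)  # inference only
model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "The movie was really good, so I felt very"
corrupt_prompt = "The movie was really bad, so I felt very"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)  # same length: only one token differs

# Score the next-token prediction by how strongly it favors a positive continuation.
good_id = model.to_single_token(" happy")
bad_id = model.to_single_token(" sad")

def logit_diff(logits):
    return (logits[0, -1, good_id] - logits[0, -1, bad_id]).item()

clean_logits, clean_cache = model.run_with_cache(clean_tokens)
corrupt_logits = model(corrupt_tokens)
print(f"clean: {logit_diff(clean_logits):+.3f}  corrupted: {logit_diff(corrupt_logits):+.3f}")

# Patch the clean residual stream into the corrupted run at the final position,
# one layer at a time, and see how much of the clean behavior is restored.
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_post", layer)

    def patch_final_position(resid, hook):
        resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
        return resid

    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_final_position)]
    )
    print(f"layer {layer:2d}: patched logit diff = {logit_diff(patched_logits):+.3f}")
```

Layers whose patched logit difference approaches the clean baseline are those at which the information driving the prediction has already reached the final position, which is the kind of causal localization question this paper asks about sentiment.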
Among the phenomena studied with these methods, sentiment analysis presents a particularly
instructive case, as it poses a complex challenge for mechanistic interpretability. Unlike syntactic
phenomena that localize to s