Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis

We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These findings provide causal evidence that GPT-2’s sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.


💡 Research Summary

This paper presents a causal mechanistic‑interpretability investigation of GPT‑2 (12 layers, 124M parameters) focused on how sentiment information is processed across the model’s depth. The authors begin by noting that many prior hypotheses about sentiment computation in transformer language models assume a hierarchical two‑stage architecture: early layers detect lexical polarity, middle layers integrate contextual modifiers (negation, sarcasm, domain shifts), and later layers produce the final sentiment output. To test these ideas, the authors employ systematic activation patching, a technique that allows one to replace the hidden activations of a given layer with those obtained from a different input, thereby measuring the causal contribution of that layer to the model’s final prediction.
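The core patching operation can be sketched on a toy layer stack. Everything below is an illustrative assumption, not GPT‑2 itself: a small stack of random tanh layers stands in for the transformer, and a random readout vector stands in for the sentiment head. The mechanic is the same, though: run a "donor" input once, cache its per-layer activations, then rerun a second input while overwriting one layer's output with the cached donor state.

```python
import numpy as np

# Toy stand-in for a transformer: a stack of "layers", each a fixed
# linear map plus a nonlinearity. Weights are random assumptions here,
# not GPT-2's actual parameters.
rng = np.random.default_rng(0)
N_LAYERS, D = 12, 16
weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
readout = rng.standard_normal(D)  # maps the final hidden state to a "sentiment logit"

def forward(x, patch_layer=None, patch_value=None):
    """Run the layer stack, optionally overwriting one layer's output."""
    h = x
    activations = []
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        if i == patch_layer:
            h = patch_value  # activation patching: swap in the donor state
        activations.append(h)
    return readout @ h, activations

# Two inputs standing in for a lexical-only vs. a context-rich sentence.
x_lexical = rng.standard_normal(D)
x_context = rng.standard_normal(D)

_, acts_lexical = forward(x_lexical)          # donor run: cache activations
clean_logit, _ = forward(x_context)           # clean run: no intervention

# Patch layer 5 of the context-rich run with the lexical-only activation
# and measure how far the final logit moves.
patched_logit, _ = forward(x_context, patch_layer=5, patch_value=acts_lexical[5])
effect = abs(patched_logit - clean_logit)
```

In a real experiment the overwrite would be done with forward hooks on the model's residual stream rather than an explicit loop, but the causal logic is exactly this swap-and-rerun.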

The experimental setup uses multiple sentiment datasets (SST‑2, IMDB, and domain‑specific reviews) and creates two families of inputs: “lexical‑only” sentences that contain sentiment‑bearing words but lack contextual modifiers, and “context‑rich” sentences that embed negation, sarcasm, or domain‑specific cues. For each layer, the authors patch the activations from a lexical‑only example into a context‑rich example (and vice versa) and record the change in the final sentiment logit. Complementary analyses include token‑wise polarity vectors, position‑specific stability metrics, gradient‑based attribution, and layer‑wise relevance propagation to verify that the patching results are not artifacts.
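The recorded change in the final sentiment logit is typically normalized so that per-layer effects are comparable. A minimal sketch of one common normalization follows; this is an assumption for illustration, not necessarily the paper's exact metric, and the numbers are made up:

```python
def patching_effect(logit_patched, logit_clean, logit_source):
    """Normalized patching effect for one layer.

    0.0 -> the patch left the prediction unchanged;
    1.0 -> the patch fully moved the prediction to the donor run's logit.
    """
    return (logit_patched - logit_clean) / (logit_source - logit_clean)

# Illustrative values (not from the paper): a patch that moves the
# sentiment logit most of the way toward the donor run's prediction.
score = patching_effect(logit_patched=0.9, logit_clean=0.1, logit_source=1.1)
```

A value near 1 at some layer says that overwriting that layer's activations is sufficient to flip the model toward the donor input's behavior, which is the causal signal the layer-wise sweep looks for.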

Key findings are as follows:

  1. Early lexical detection (layers 0‑3). These layers encode stable, position‑specific polarity signals that are virtually identical for lexical‑only and context‑rich inputs. Patching these layers has negligible impact on the final sentiment prediction, confirming that they act as robust lexical sentiment detectors independent of surrounding context.

  2. Middle‑layer hypotheses falsified (layers 4‑7). Contrary to the “Middle Layer Concentration”, “Phenomenon Specificity”, and “Distributed Processing” hypotheses, the middle layers show low activation magnitude, minimal contribution to sentiment logits, and no evidence of dedicated neurons for particular contextual phenomena. Both activation‑patching and gradient analyses indicate that these layers do not perform any substantial contextual integration.

  3. Late‑layer unified integration (layers 8‑11). The most striking result is that all complex contextual effects—negation (“not”), double negation, sarcasm (“yeah right”), and domain shifts—are integrated primarily in the final four layers. Patching a late‑layer activation from a lexical‑only example into a context‑rich example dramatically degrades sentiment accuracy, while inserting a context‑rich activation restores correct handling of the modifier. This suggests a non‑modular, unified mechanism whereby the model fuses diverse contextual cues into a high‑dimensional representation that directly drives the output head.
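Reading the three findings off a layer-wise sweep reduces to asking where the patching effects concentrate. The sketch below uses hypothetical per-layer effect values, shaped only to mirror the qualitative pattern reported above (small effects in layers 0-7, large effects in layers 8-11); the threshold rule is likewise an assumption for illustration:

```python
# Hypothetical per-layer patching effects, one per GPT-2 layer (0-11),
# shaped to mirror the paper's qualitative finding.
effects = [0.02, 0.03, 0.02, 0.04,   # early lexical layers: near-zero effect
           0.05, 0.04, 0.06, 0.05,   # middle layers: no concentration
           0.61, 0.72, 0.68, 0.75]   # late layers: large effects

# Flag layers whose effect exceeds half the maximum observed effect.
threshold = 0.5 * max(effects)
integration_layers = [i for i, e in enumerate(effects) if e > threshold]
```

Under this reading, `integration_layers` picks out the final four layers, matching the late-layer unified-integration result rather than the hypothesized mid-layer stage.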

The authors argue that this pattern likely emerges from the next‑token prediction objective combined with massive, heterogeneous training data: the model learns to keep lexical polarity separate early on, then postpones the costly, combinatorial reasoning about context to the deepest layers where richer representations are available. They also discuss implications for interpretability: probing or intervening on middle layers will not reveal the model’s handling of negation or sarcasm, whereas interventions on late layers are far more informative.

In conclusion, the study provides causal evidence that GPT‑2’s sentiment computation deviates from the canonical hierarchical view. While early layers indeed act as lexical sentiment detectors, contextual integration does not occur in a dedicated middle stage; instead, it is performed in a late, unified fashion. This challenges existing mechanistic models of language‑model reasoning and underscores the need for broader empirical characterizations across tasks, languages, and larger models (e.g., GPT‑3, GPT‑4). Future work should replicate the activation‑patching methodology on other linguistic phenomena (fact verification, logical inference) and on larger architectures to determine whether the late‑layer unified integration is a general principle of transformer‑based language models.

