Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis
Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes negation, examining its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences covering multiple linguistic templates and forms of negation. To quantify model behavior, we define the Negation Effect Score (NES), a metric that measures the model’s sensitivity in distinguishing affirmative statements from their negations. We carry out two causal interventions. In activation patching, internal activations from affirmative sentences are inserted into their negated counterparts to see how meaning shifts. In ablation, specific attention heads are temporarily disabled to observe how logical polarity changes. Together, these interventions reveal how negation signals move and evolve through GPT-2’s layers. Our findings indicate that this capability is not widespread; it is concentrated in a small number of mid-layer attention heads, primarily in layers 4 to 6. Ablating these components directly disrupts the model’s negation sensitivity: on our in-domain dataset, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior. On the external xNot360 benchmark, ablation slightly decreased NES and rescue restored performance above baseline. These causal patterns are thus consistent across negation forms and remain detectable out of domain, though with smaller magnitude.
💡 Research Summary
This paper investigates how GPT‑2 Small internally represents and processes linguistic negation, a phenomenon that remains a persistent source of errors in modern language models. The authors construct a curated dataset of 12,000 sentence pairs, each consisting of an affirmative statement and its logically opposite version created by swapping only the negation cue. The dataset spans eight semantic templates (e.g., “X is the capital of Y”, “X can Y”, “X likes Y”) and seven negation forms (“not”, “never”, “no”, “does not”, “doesn’t”, “cannot”, “can’t”), ensuring coverage of both syntactic and lexical variations.
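As a minimal sketch, pair construction amounts to inserting a single negation cue into an otherwise identical template sentence. The function below is an illustrative assumption (the paper's actual templates and cue handling are richer, e.g., "does not" requires verb-form changes):

```python
# Illustrative sketch of matched-pair construction: the affirmative and negated
# sentences differ only in the presence of one negation cue.
# make_pair and its simple copular template are toy assumptions, not the paper's code.

def make_pair(subject: str, predicate: str, cue: str = "not") -> tuple[str, str]:
    """Build one (affirmative, negated) sentence pair from a copular template."""
    affirmative = f"{subject} is {predicate}"
    negated = f"{subject} is {cue} {predicate}"
    return affirmative, negated

aff, neg = make_pair("Paris", "the capital of France")
# aff: "Paris is the capital of France"
# neg: "Paris is not the capital of France"
```

Lexical cues like "never" slot in the same way, while auxiliary forms ("does not", "cannot") would need per-template verb handling.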
To quantify the model’s sensitivity to negation, the authors introduce the Negation Effect Score (NES):
NES = log P(t | negated) − log P(t | affirmative),
where t is a target token (typically the next token the model is asked to predict). A negative NES indicates that the model assigns lower probability to the target under the negated context, i.e., it correctly distinguishes the polarity; a positive NES signals weaker negation sensitivity. The metric is aggregated per template, reporting mean, median, failure rate (proportion of NES > 0) and 95 % confidence intervals.
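The per-pair score and per-template summary statistics could be computed roughly as follows. This is a minimal sketch under the prose convention above (negative NES = correct polarity handling, failure when NES > 0); the function names, the normal-approximation 95% CI, and the toy log-probabilities are illustrative assumptions, not the paper's code:

```python
import math
import statistics

def nes(logp_negated: float, logp_affirmative: float) -> float:
    """Negation Effect Score for one pair: log P(t | negated) - log P(t | affirmative).
    Negative values mean the model correctly down-weights the target under negation."""
    return logp_negated - logp_affirmative

def aggregate(scores: list[float]) -> dict:
    """Per-template summary: mean, median, failure rate (NES > 0), 95% CI (normal approx.)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(n) if n > 1 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(scores),
        "failure_rate": sum(s > 0 for s in scores) / n,
        "ci95": (mean - half, mean + half),
    }

# Toy log-probabilities (negated, affirmative) for illustration, not real model outputs:
pairs = [(-5.2, -2.1), (-4.8, -2.5), (-1.9, -2.2)]
scores = [nes(lp_neg, lp_aff) for lp_neg, lp_aff in pairs]
summary = aggregate(scores)  # third pair has NES > 0, so failure_rate = 1/3
```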
The core methodological contribution is a two‑stage causal tracing pipeline based on activation patching. First, for each transformer layer L, the post‑attention representation of the last token in the affirmative run is cached. During a forward pass on the negated prefix, this cached vector replaces the original activation at the same position. The resulting change in NES (ΔNES(L)) reveals how much that layer contributes to the polarity shift. The authors find that layers 4, 5, and 6 produce the largest ΔNES, suggesting that the bulk of the negation signal is transformed in the middle of the network.
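The cache-then-patch loop can be sketched on a toy stand-in for the transformer stack, where scalar transformations replace real residual-stream vectors and the last-token position is implicit. Everything here (the `run` helper, the toy layers) is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of layer-level activation patching on a toy "model":
# a stack of scalar transformations standing in for transformer layers.

def run(layers, x, patch_layer=None, patch_value=None):
    """Forward pass; optionally overwrite the activation after `patch_layer`
    with `patch_value` (the cached activation from the affirmative run)."""
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == patch_layer:
            x = patch_value          # activation patching: swap in the cached value
        acts.append(x)
    return x, acts

layers = [lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x]

# 1) Clean affirmative run: cache every layer's post-attention output.
_, aff_acts = run(layers, 1.0)       # cached activations: [2.0, 3.0, 9.0]

# 2) Negated run with the layer-1 activation patched in from the cache.
patched_out, _ = run(layers, -1.0, patch_layer=1, patch_value=aff_acts[1])
# From layer 1 onward the negated run follows the affirmative trajectory,
# so the final output matches the affirmative run's (9.0); comparing NES
# with and without the patch gives ΔNES(L) for that layer.
```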
Second, the analysis drills down to the head level. Each attention head’s output slice h(L, H) is individually swapped with its affirmative counterpart, and the corresponding ΔNES(L, H) is measured. Ranking heads by average absolute ΔNES across the development set isolates a small set of highly influential heads, most of which reside in the same middle layers (e.g., L5H11, L4H3, L6H9). Patching affirmative activations into these heads dramatically increases NES (making the model less sensitive to negation), while patching negated activations decreases NES, confirming that these heads encode polarity information.
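Head-level patching narrows the swap to one slice of the attention output, since that output at a position is the concatenation of `n_heads` slices of width `d_head`. The sketch below uses toy sizes and plain lists; the layout and `patch_head` helper are illustrative assumptions:

```python
# Sketch of head-level patching: replace only head H's slice of the
# flattened attention output with its cached affirmative counterpart.

n_heads, d_head = 4, 2   # toy sizes (GPT-2 Small has 12 heads of width 64)

def patch_head(negated_out, affirmative_out, head):
    """Copy the negated attention output, swapping in one head's affirmative slice."""
    out = list(negated_out)
    lo, hi = head * d_head, (head + 1) * d_head
    out[lo:hi] = affirmative_out[lo:hi]
    return out

neg = [0.0] * (n_heads * d_head)
aff = [float(i) for i in range(n_heads * d_head)]

patched = patch_head(neg, aff, head=2)
# Only positions 4..5 (head 2's slice) now carry affirmative values; repeating
# this for every (L, H) and measuring the NES change yields the head ranking.
```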
To test causality, the authors perform ablation‑rescue experiments on the top‑k heads identified above. In the ablation phase, the post‑attention outputs of the selected heads are zeroed during the negated forward pass. This intervention consistently raises NES on the in‑domain dataset, indicating that removing these heads weakens the model’s ability to recognize negation. In the rescue phase, the cached affirmative activations are re‑inserted into the same heads after ablation. This “rescue” further raises NES, showing that the affirmative signal carried by these heads is sufficient to drive the model’s polarity bias.
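Using the same flattened head layout, the two phases can be sketched as below: ablation zeroes the selected heads' slices, and rescue writes the cached affirmative slices back into those same positions. The helpers are illustrative assumptions, not the paper's code:

```python
# Sketch of the ablation-rescue protocol on a flattened per-head attention output.

d_head = 2  # toy head width

def ablate(attn_out, heads):
    """Zero the output slices of the selected heads during the negated pass."""
    out = list(attn_out)
    for h in heads:
        out[h * d_head:(h + 1) * d_head] = [0.0] * d_head
    return out

def rescue(attn_out, cached_affirmative, heads):
    """After ablation, re-insert the cached affirmative slices into the same heads."""
    out = ablate(attn_out, heads)
    for h in heads:
        lo, hi = h * d_head, (h + 1) * d_head
        out[lo:hi] = cached_affirmative[lo:hi]
    return out

neg = [1.0] * 8        # toy negated-run attention output (4 heads)
aff = [9.0] * 8        # toy cached affirmative output
top_heads = [1, 3]     # toy "top-k" heads

ablated = ablate(neg, top_heads)        # heads 1 and 3 zeroed
rescued = rescue(neg, aff, top_heads)   # those slices now carry affirmative signal
```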
External validation is conducted on the xNot360 benchmark, a widely used negation test set. Here, ablation slightly lowers NES (a modest improvement in negation sensitivity), while rescue restores performance above the baseline, mirroring the pattern observed on the in‑domain data but with reduced magnitude. This demonstrates that the identified circuit generalizes across datasets and negation forms, albeit with weaker effect sizes due to distributional differences.
The paper’s contributions are fourfold: (1) the NES metric provides a principled, token‑level measure of negation sensitivity; (2) a systematic layer‑ and head‑level activation‑patching framework is introduced for causal tracing; (3) empirical evidence shows that a compact subnetwork of mid‑layer attention heads is the primary locus of negation processing in GPT‑2 Small; (4) the same subnetwork’s influence is confirmed on an external benchmark, supporting its generality.
Limitations are acknowledged. The study focuses exclusively on GPT‑2 Small; larger models such as GPT‑3 or GPT‑4 may exhibit more distributed or hierarchical negation circuits. The dataset covers only seven simple negation cues and does not address complex constructions (e.g., “not only … but also …”, double negation, or implicit negation). Moreover, activation patching and head zeroing may have side effects on other linguistic features that are not fully isolated.
Future work suggested includes scaling the methodology to larger language models to examine how the circuit evolves with model size, expanding the dataset to encompass richer negation phenomena (modal, discourse‑level, and pragmatic negation), and leveraging the identified heads for model editing—e.g., directly modifying or fine‑tuning these components to correct systematic negation errors. The authors also propose linking the discovered mechanistic patterns with linguistic theories of polarity to deepen our theoretical understanding of how transformer architectures encode logical meaning.
In summary, this research provides the first detailed mechanistic account of how a transformer model internally handles negation, revealing a localized, interpretable circuit in the middle layers. The findings not only advance the field of mechanistic interpretability but also offer practical pathways for improving the logical reliability of current and future large language models.