Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach
Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning, or in-context learning. While investigating these phenomena has recently received attention, explaining why they appear remains overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to identify what changed internally across checkpoints and which explanatory claims about that change are warranted. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end we formulate a Comparative XAI (Δ-XAI) framework with a set of desiderata to be taken into account when designing suitable explanation methods. To illustrate how Δ-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete Δ-XAI experiment.
💡 Research Summary
The paper addresses a critical gap in the explainability of large language models (LLMs): while many studies have documented “behavioral shifts” that occur after interventions such as scaling, fine‑tuning, reinforcement learning from human feedback (RLHF), or in‑context prompting, there is little work on why these shifts happen. Traditional XAI methods (feature attribution, probing, concept activation, mechanistic interpretability, etc.) are designed to explain a single model’s decision process in isolation. They cannot systematically compare two successive checkpoints and attribute observed changes to specific internal modifications.
To fill this gap, the authors propose a Comparative XAI framework, denoted Δ‑XAI. The core idea is to treat a behavioral shift as the difference between a reference model (M_pre) and an intervened model (M_post), rather than as a property of either model alone. The framework proceeds in three stages:
1. Shift Detection – Define a behavior b (e.g., safety violation rate, task accuracy, deception score) and a quantitative metric B that maps each checkpoint M_t to a real‑valued evaluation. A shift is declared when the absolute change ΔB = |B(M_t) – B(M_{t‑1})| exceeds a task‑specific threshold ε_B.
2. Comparative Explanation Generation – Choose an existing explainer Φ (e.g., Integrated Gradients, Concept Activation Vectors, activation patching). Apply Φ separately to M_pre and M_post under matched conditions (same inputs, same layer selections) to obtain explanations e_pre and e_post.
3. Δ‑Explainer Mapping – Introduce a comparative explainer Φ_Δ that takes the pair (e_pre, e_post) and produces a shift‑focused explanation e_Δ. This mapping can be as simple as a difference of attribution maps or a more sophisticated alignment of concept vectors, but its purpose is to highlight how the explanation changes, not just what each explanation looks like in isolation.
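The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the dict-based attribution format, and the toy "models" are all assumptions made here for concreteness.

```python
# Minimal Δ-XAI pipeline sketch. All names and data shapes are illustrative.

def detect_shift(b_pre: float, b_post: float, eps_b: float) -> bool:
    """Stage 1: declare a shift when |ΔB| exceeds the threshold ε_B."""
    return abs(b_post - b_pre) > eps_b

def explain(model, prompt):
    """Stage 2 stand-in for an explainer Φ (e.g., Integrated Gradients):
    here, any callable mapping a prompt to per-token attributions."""
    return model(prompt)

def delta_explainer(e_pre: dict, e_post: dict) -> dict:
    """Stage 3: the simplest possible Φ_Δ — a difference of attribution maps."""
    tokens = set(e_pre) | set(e_post)
    return {t: e_post.get(t, 0.0) - e_pre.get(t, 0.0) for t in tokens}

# Toy usage: two "models" whose explanations are fixed attribution maps.
m_pre = lambda prompt: {"chest": 0.6, "pain": 0.5, "mild": 0.1}
m_post = lambda prompt: {"chest": 0.2, "pain": 0.1, "mild": 0.7}

if detect_shift(b_pre=0.9, b_post=0.2, eps_b=0.5):
    e_delta = delta_explainer(explain(m_pre, "..."), explain(m_post, "..."))
    # e_delta records how each token's importance shifted, e.g. mild: +0.6
```

In a real instantiation, `explain` would wrap an actual attribution method applied under matched conditions (same inputs, same layers) to both checkpoints, and Φ_Δ could be replaced by a more sophisticated alignment of concept vectors.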
The authors articulate four desiderata that any Δ‑XAI method should strive to satisfy:
- Accuracy – The identified change must correspond to genuine alterations in model parameters or architecture, not to noise or random fluctuations.
- Interpretability – The resulting e_Δ should be presented in a form that domain experts can readily understand (e.g., token‑level importance shifts, concept activation increases/decreases).
- Robustness – The method should yield consistent e_Δ across different random seeds, input subsets, or minor perturbations, ensuring reproducibility.
- Causality – The explanation should be linked to the behavioral metric B, allowing one to argue that the observed internal change caused the observed shift in behavior.
To illustrate the framework, the paper conducts a concrete experiment in the medical‑advice domain. A base model M₀ recommends immediate medical assistance for urgent prompts with 80% frequency. After a fine‑tuning step that improves factual knowledge (M₁), the rate rises to 90%; a subsequent prompt‑conditioning step that encourages step‑by‑step reasoning (M₂) then causes the rate to drop sharply to 20%, crossing the predefined threshold ε_B = 50%. Using Integrated Gradients as Φ, the authors generate token‑level attribution maps for a set of urgent prompts before and after the second intervention. The Δ‑explainer Φ_Δ reveals that the post‑intervention model assigns higher importance to words that downplay symptoms (“mild”, “wait”) and lower importance to classic urgency cues (“chest pain”, “shortness of breath”). This comparative insight directly explains why the model’s behavior changed, something that would be invisible if one examined only the post‑intervention explanations in isolation.
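The arithmetic of this example can be checked with the numbers given in the text (80% → 90% → 20%, ε_B = 0.5). The attribution values below are invented for illustration; in the paper they would come from Integrated Gradients over real prompts.

```python
# Walk-through of the medical-advice example. Rates and ε_B come from the
# text; the token-level attributions are hypothetical stand-ins.

def shifted(b_prev: float, b_curr: float, eps_b: float = 0.5) -> bool:
    return abs(b_curr - b_prev) > eps_b

rates = {"M0": 0.80, "M1": 0.90, "M2": 0.20}
assert not shifted(rates["M0"], rates["M1"])  # fine-tuning: |ΔB| = 0.10 ≤ 0.5
assert shifted(rates["M1"], rates["M2"])      # prompt step: |ΔB| = 0.70 > 0.5

# Hypothetical attributions before/after the second intervention.
e_pre  = {"chest pain": 0.62, "shortness of breath": 0.55,
          "mild": 0.08, "wait": 0.05}
e_post = {"chest pain": 0.18, "shortness of breath": 0.12,
          "mild": 0.51, "wait": 0.47}

delta = {t: e_post[t] - e_pre[t] for t in e_pre}
ranked = sorted(delta, key=delta.get, reverse=True)
# Downplaying tokens rise in importance while urgency cues fall:
assert ranked[:2] == ["mild", "wait"]
assert delta["chest pain"] < 0 and delta["shortness of breath"] < 0
```

Ranking tokens by the signed change in attribution is what makes the shift legible: each model's map on its own looks plausible, and only the difference exposes the reweighting.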
The paper also surveys related work on multi‑checkpoint analyses (CKA similarity, layerwise probing, contrastive explanations) and argues that these approaches are fragmented and lack a unified set of evaluation criteria. Δ‑XAI, by contrast, provides a principled pipeline—from shift detection through explanation generation to causal validation—together with clear desiderata that guide method development and assessment.
In conclusion, the authors argue that comparative XAI is essential for responsible deployment of LLMs, especially under regulatory regimes such as the EU AI Act that demand traceability and meaningful explanations after model updates. By focusing on differences rather than static snapshots, Δ‑XAI enables practitioners to diagnose, anticipate, and mitigate emergent misalignments, thereby improving safety, transparency, and trustworthiness of large language models.