Jacobian Scopes: token-level causal attributions in LLMs
Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model’s prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.
💡 Research Summary
The paper introduces Jacobian Scopes, a family of gradient‑based, token‑level causal attribution methods designed to explain which input tokens most strongly influence a large language model’s (LLM) next‑token prediction. Modern LLMs contain many layers and attention heads, making it difficult to trace influence back to individual tokens. Jacobian Scopes address this by leveraging the Jacobian matrix that captures the first‑order relationship between the model’s final hidden state y (the representation used for the next‑token logits) and each input token embedding xₜ.
The core idea is to compute a vector‑Jacobian product vᵀJₜ, where Jₜ = ∂y/∂xₜ and v encodes the particular aspect of the output we wish to study. The L2 norm of this product, ‖vᵀJₜ‖₂, is defined as the influence score for token t. Crucially, this score can be obtained for all positions with a single backward pass by differentiating a scalar loss L = vᵀy with respect to the input embeddings, avoiding the need for d_model separate backward passes.
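The single-backward-pass computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `forward` callable and the toy stand-in model are hypothetical, standing in for a real transformer that maps input embeddings to the final hidden state y.

```python
import torch

def influence_scores(forward, input_embeds, v):
    """Per-token influence scores ||v^T J_t||_2 from one backward pass.

    forward: maps embeddings of shape (1, T, d) to the final hidden state y of shape (d,).
    v: direction of shape (d,) in output space; its choice defines the scope.
    """
    x = input_embeds.clone().detach().requires_grad_(True)
    y = forward(x)                  # final hidden state
    loss = v @ y                    # scalar L = v^T y
    loss.backward()                 # x.grad[0, t] now holds v^T J_t for every t
    return x.grad[0].norm(dim=-1)   # (T,) influence scores

# Toy nonlinear map standing in for the transformer (hypothetical).
torch.manual_seed(0)
d = 8
W = torch.randn(d, d)

def toy_forward(x):
    return torch.tanh(x @ W).sum(dim=1).squeeze(0)

T = 5
emb = torch.randn(1, T, d)
v = torch.randn(d)
scores = influence_scores(toy_forward, emb, v)
print(scores.shape)  # torch.Size([5])
```

Because the loss is a scalar, one call to `backward()` populates gradients for all T positions at once, which is exactly why no d_model separate backward passes are needed.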
Three concrete instantiations are proposed:
- Semantic Scope – v is set to the embedding of a target vocabulary token (w_target). The loss becomes the logit of that token, so the influence score measures how sensitive the target logit is to each input token. This reveals which words most directly drive a specific prediction, exposing implicit biases (e.g., “liberal” being driven by “Columbia”).
- Fisher Scope – v is the principal eigenvector u₁ of the Fisher Information Matrix (FIM) derived from the softmax distribution p = softmax(Wy). The FIM quantifies how a small perturbation of y changes the full predictive distribution measured by KL divergence. Projecting the Jacobian onto u₁ produces an influence score that captures a token’s effect on the entire distribution, which is especially useful for tasks with non‑unique correct answers such as translation. Experiments show that source‑language words dominate the influence on their direct translations, while phrase‑level translations involve coordinated influence from multiple source tokens.
- Temperature Scope – v is the normalized hidden direction ŷ (the direction of y after factoring out its norm). The scalar loss is the norm β_eff = ‖y‖₂, interpreted as an effective inverse temperature. When the predictive distribution is approximately Gaussian, β_eff⁻¹ is proportional to the variance, so this scope attributes the model’s confidence (or uncertainty) to particular context tokens. In in‑context learning (ICL) for time‑series forecasting, Temperature Scope uncovers distinct strategies: for deterministic chaotic dynamics (Lorenz system) the model leans on past segments that resemble the current pattern, whereas for stochastic Brownian motion it focuses on the most recent tokens, explaining why ICL loss plateaus early for unbounded stochastic processes.
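The three choices of v above can be sketched in a few lines. This is an illustrative reconstruction from the definitions in the summary, assuming access to the unembedding matrix W (shape vocab × d) and the final hidden state y (shape d); for the Fisher Scope, the FIM of a categorical softmax with logits z = Wy is F = Wᵀ(diag(p) − p pᵀ)W, a standard identity.

```python
import torch

def semantic_v(W, target_id):
    # Semantic Scope: v is the unembedding row of the target token,
    # so v^T y is exactly that token's logit.
    return W[target_id]

def fisher_v(W, y):
    # Fisher Scope: v is the principal eigenvector of the softmax FIM
    # F = W^T (diag(p) - p p^T) W  (dense construction; fine for small d).
    p = torch.softmax(W @ y, dim=-1)
    F = W.T @ (torch.diag(p) - torch.outer(p, p)) @ W
    eigvals, eigvecs = torch.linalg.eigh(F)   # ascending eigenvalues
    return eigvecs[:, -1]                     # eigenvector of the largest one

def temperature_v(y):
    # Temperature Scope: v = y / ||y||, so v^T y = ||y||_2 = beta_eff.
    return y / y.norm()
```

In practice one would avoid materializing the full d × d Fisher matrix for a real model and use a matrix-free eigensolver instead; the dense form here is only for clarity.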
Quantitative evaluation uses the Area Over the Perturbation Curve (AOPC) on LAMBADA and IWSLT 2017 DE→EN benchmarks. Jacobian Scopes consistently outperform random ablation and Integrated Gradients, and match the performance of Input × Gradient, demonstrating that first‑order influence scores are reliable proxies for true causal relevance.
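The AOPC metric used above can be sketched as follows: tokens are ablated in descending order of attribution score, and the average drop in the model's probability for its original prediction is recorded. This is a generic sketch of the standard metric, not the paper's evaluation code; `predict_prob` and `mask_id` are hypothetical stand-ins.

```python
import torch

def aopc(predict_prob, input_ids, scores, mask_id, K):
    """Area Over the Perturbation Curve.

    predict_prob: maps a 1-D tensor of token ids to the probability
                  of the model's originally predicted token.
    scores:       per-token attribution scores (higher = more influential).
    """
    base = predict_prob(input_ids)
    order = torch.argsort(scores, descending=True)
    ids = input_ids.clone()
    drops = []
    for k in range(K):
        ids[order[k]] = mask_id              # ablate next most influential token
        drops.append(base - predict_prob(ids))
    return sum(drops) / K                    # higher AOPC = better attribution
```

A faithful attribution method ranks truly causal tokens first, so ablating them early produces large probability drops and hence a high AOPC.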
The authors discuss limitations: (i) the method provides local linear attributions, which may miss higher‑order interactions; (ii) it is architecture‑blind, ignoring transformer‑specific circuitry, so interpretations must be tempered with knowledge of possible attention‑sink effects; (iii) it relies on back‑propagation, though the overhead is modest (≈0.027 s backward vs 0.069 s forward on an RTX A4000).
Future work is suggested in exploring the spectral structure of both the Jacobian and the Fisher matrix to define new explananda, and in combining Jacobian Scopes with non‑gradient, intervention‑based techniques (e.g., activation patching, circuit tracing) for a more complete mechanistic picture.
All code is open‑sourced, and an interactive demo is hosted on HuggingFace Spaces, allowing users to upload custom text and visualize Semantic, Fisher, and Temperature Scopes in real time. Jacobian Scopes thus provide a mathematically grounded, interpretable, and computationally efficient toolkit for probing token‑level causal dynamics in modern LLMs.