We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
Scaling has driven major advances in artificial intelligence, with ever-larger models trained on internet-scale datasets achieving remarkable capabilities across domains. Large language models (LLMs) now underpin applications from text generation to question answering, yet their increasing complexity renders their internal mechanisms largely opaque (Bommasani, 2021). Methods of mechanistic interpretability have been developed to address this gap by reverse-engineering neural networks to uncover how internal components implement specific computations and behaviors. Recent advances in this area have successfully identified interpretable circuits, features, and algorithms within LLMs (Nanda et al., 2023; Olsson et al., 2022), showing that large complex models can, in part, be understood mechanistically, opening avenues for improving transparency, reliability, and alignment (Bereska & Gavves, 2024).

Figure 1. Visualised attention patterns for a 4-layer toy model trained on a simple 2-digit addition task (panels: "Sparse Model"; "Sparsity-Regularised Finetuning"). The main idea of this work is to induce sparse attention between tokens via a post-training procedure that optimizes for attention sparsity while maintaining model performance. In this example, while both models are able to correctly predict the sum, the sparse model solves the problem with a naturally interpretable circuit. Details of this toy setup and more examples are provided in Appendix A.
However, interpretability is bottlenecked by the model itself: even with sophisticated reverse-engineering techniques that can faithfully reveal internal algorithms, the underlying computations implemented by large models can still remain highly complex and uninterpretable. Circuits for seemingly simple tasks may span hundreds of interacting attention heads and MLPs with densely intertwined contributions across layers (Conmy et al., 2023), and features can influence each other along combinatorially many attention-mediated paths, complicating attention attribution (Kamath et al., 2025). To exemplify this, Figure 1 (top) illustrates the attention patterns of a small, single-head transformer trained on a simple two-digit addition task. Here, the model has learned to solve the task in a highly diffuse manner, where information about each token is dispersed across all token locations, rendering the interpretation of the underlying algorithm extremely difficult even in this simple case.
The crux of the problem is that models are not incentivised to employ simple algorithms during training. In this work, we advocate for directly embedding interpretability constraints into model design in a way that induces simple circuits while preserving performance. We focus our analysis on attention mechanisms and investigate sparsity regularisation on attention patterns, originally proposed in (Lei et al., 2025), as an inductive bias. To demonstrate how sparse attention patterns can give rise to interpretable circuits, we return to the two-digit addition example: Figure 1 (bottom) shows the attention patterns induced by penalising attention edges during training. Here, the sparsity inductive bias forces the model to solve the problem with much smaller, intrinsically interpretable computation circuits.
In this work, we investigate using this sparsity regularisation scheme as a post-training strategy for pre-trained LLMs. We propose a practical method for fine-tuning existing models without re-running pretraining, offering a flexible way to induce sparse attention patterns and enhance interpretability. We show, on models of up to 7B parameters, that our proposed procedure preserves the performance of the base models on pretraining data while reducing the effective attention map to less than 0.5% of its edges. To evaluate our central hypothesis that sparse attention facilitates interpretability, we consider two complementary settings. First, we study circuit discovery, where the objective is to identify the minimal set of components responsible for task performance (Conmy et al., 2023). We find that sparsified models yield substantially simpler computational graphs: the resulting circuits explain model behaviour using up to four times fewer attention heads and up to two orders of magnitude fewer edges. Second, using cross-layer transcoders (Ameisen et al., 2025), we analyse attribution graphs, which capture feature-level interactions across layers. In this setting, sparse attention mitigates the attention attribution problem by making it possible to identify which attention heads give rise to a given edge, owing to the reduced number of components mediating each connection. We argue that this clarity enables a tighter integration of feature-based and circuit-based perspectives, allowing feature interactions to be understood through explicit, tractable circuits. Taken together, these results position attention sparsity as an effective and practical inductive tool for surfacing the minimal functional backbone underlying model behaviour.
As self-attention is a key component of the ubiquitous Transformer architecture, a large number of variants of attention mechanisms have been explored in the literature. Related to our approach are sparse attention methods, which are primarily designed to alleviate the quadratic scaling of vanilla self-attention. These methods typically rely on masks based on fixed local and strided patterns (Child et al., 2019) or sliding-window and global attention patterns (Beltagy et al., 2020;Zaheer et al., 2020) to constrain the receptive field of each token. While these approaches are successful in reducing the computational complexity of self-attention, they require hand-defined heuristics that do not reflect the internal computations learned by the model.
Beyond these fixed-pattern sparse attention methods, Top-k attention, which enforces sparsity by dynamically selecting the k most relevant keys per query based on their attention scores, has also been explored (Gupta et al., 2021; DeepSeek-AI, 2025). While Top-k attention enables learnable sparse attention, the necessity to specify k limits its scope for interpretability for two reasons. First, selecting the optimal k is difficult, and setting k too low can degrade model performance. Second, and more fundamentally, Top-k attention does not allow the model to choose different k for different attention heads based on the context. We argue that this flexibility is crucial for maintaining model performance.
More recently, gated attention mechanisms (Qiu et al., 2025) provide a scalable and performant framework for inducing sparse attention. In particular, Lei et al. (2025) introduce a sparsity regularisation scheme for world modelling that reveals sparse token dependencies. We adopt this method and examine its role as an inductive bias for interpretability.
Mechanistic interpretability seeks to uncover how internal components of LLMs implement specific computations. Ablation studies assess performance drops from removing components (Nanda et al., 2023), activation patching measures the effect of substituting activations (Zhang & Nanda, 2023), and attribution patching scales this approach via local linearisation (Syed et al., 2024). Together, these approaches allow researchers to isolate sub-circuits, minimal sets of attention heads and MLPs that are causally responsible for a given behavior or task (Conmy et al., 2023). Attention itself plays a dual role: it both routes information and exposes interpretable relational structure, making it a key substrate for mechanistic study. Our work builds on this foundation by leveraging sparsity to simplify these circuits, amplifying the interpretability of attention-mediated computation while preserving model performance.
Mechanistic interpretability has gradually shifted from an emphasis on explicit circuit discovery towards the analysis of internal representations and features. Recent work on attribution graphs and circuit tracing seeks to reunify these perspectives by approximating MLP outputs as sparse linear combinations of features and computing causal effects along linear paths between them (Dunefsky et al., 2024;Ameisen et al., 2025;Lindsey et al., 2025b). This framework enables the construction of feature-level circuits spanning the computation from input embeddings to final token predictions. Within attribution graphs, edges correspond to direct linear causal relationships between features. However, these relationships are mediated by attention heads that transmit information across token positions. Identifying which attention heads give rise to a particular edge, and understanding why they do so, is essential, as this mechanism forms a fundamental component of the computational graph (Kamath et al., 2025). A key limitation of current attribution-based approaches is that individual causal edges are modulated by dozens of attention components. We show that this leads to feature-to-feature influences that are overly complex, rendering explanations in terms of other features in the graph both computationally expensive and conceptually challenging.
Our main hypothesis is that post-training existing LLMs to encourage sparse attention patterns leads to the emergence of more interpretable circuits. In order to instantiate this idea, we require a post-training pipeline that satisfies three main desiderata:
1. To induce sparse message passing between tokens, we need an attention mechanism that can ‘zero out’ attention edges, which in turn enables effective $L_0$ regularisation on the attention weights. This is in contrast to the standard softmax attention mechanism, where naive regularisation would result in small but non-zero attention weights that still allow information flow between tokens.
2. The model architecture needs to be compatible with the original LLM such that the pre-trained LLM weights can be directly loaded at initialisation.
3. The post-training procedure needs to ensure that the post-trained models do not lose prediction performance compared to their fully-connected counterparts.
To this end, we leverage the Sparse Transformer architecture in the SPARTAN framework proposed in (Lei et al., 2025), which uses sparsity-regularised hard attention instead of the standard softmax attention. In the following subsections, we describe the Sparse Transformer architecture and the optimisation setup, highlighting how this approach satisfies the above desiderata.
Given a set of token embeddings, the Sparse Transformer layer computes the key, query, and value embeddings, $\{k_i, q_i, v_i\}$, via linear projections, analogous to the standard Transformer. Based on these embeddings, we sample a binary gating matrix $A$ from a learnable distribution parameterised by the keys and queries: each entry is a Bernoulli random variable, $\mathrm{Bern}(\cdot)$, whose parameter is given by the logistic sigmoid $\sigma(\cdot)$ of a key-query score. This sampling step can be made differentiable via the Gumbel-Softmax trick (Jang et al., 2017). The binary matrix acts as a mask that controls the information flow across tokens. Next, the message-passing step is carried out in the same way as standard softmax attention, with the usual $1/\sqrt{d_k}$ scaling (where $d_k$ is the dimension of the key embeddings), except that the value embeddings are masked element-wise ($\odot$) by the sampled binary matrix. During training, we regularise the expected number of edges between tokens under the distribution over the gating matrix; for each layer, this expectation is the sum of the Bernoulli gate probabilities over all key-query pairs.
Note that during the forward pass, each entry of $A$ is a hard binary sample that zeros out attention edges, which serves as an effective $L_0$ regularisation. Moreover, since the functional form of the sparse attention layer after the hard sampling step is the same as standard softmax attention, pre-trained model weights can be directly used without alterations.
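To make the mechanism concrete, the following is a minimal PyTorch sketch of a gated sparse attention layer of this kind. Reusing the scaled key-query logits as the gate logits, the Gumbel-sigmoid temperature, and the function names are illustrative assumptions; the exact parameterisation follows Lei et al. (2025) and is not reproduced here.

```python
# Hedged sketch of a sparsity-gated attention layer (PyTorch). Reusing the
# scaled key-query logits as gate logits is an illustrative assumption, not
# the exact parameterisation of Lei et al. (2025).
import torch


def gumbel_sigmoid(logits: torch.Tensor, temp: float = 1.0) -> torch.Tensor:
    """Sample hard binary gates with straight-through gradients."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)            # Logistic(0, 1) noise
    soft = torch.sigmoid((logits + noise) / temp)     # relaxed sample in (0, 1)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()                # hard forward, soft backward


def sparse_attention(q, k, v, causal_mask, temp: float = 1.0):
    """q, k, v: [batch, heads, seq, d_k]; causal_mask: bool [seq, seq]."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5       # standard attention logits
    scores = scores.masked_fill(~causal_mask, float("-inf"))

    # Binary gate A_ij ~ Bern(sigmoid(logits)), sampled per key-query pair.
    gate_logits = scores.masked_fill(~causal_mask, -30.0)
    gates = gumbel_sigmoid(gate_logits, temp) * causal_mask

    out = (torch.softmax(scores, dim=-1) * gates) @ v  # pointwise-masked attention

    # Expected number of open gates in this layer: the quantity being penalised.
    expected_edges = (torch.sigmoid(gate_logits) * causal_mask).sum()
    return out, expected_edges
```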
In order to ensure that the models do not lose prediction performance during the post-training procedure, as per desideratum 3, we follow the approach proposed in (Lei et al., 2025), which employs the GECO algorithm (Rezende & Viola, 2018). Originally developed in the context of regularising VAEs, the GECO algorithm places a constraint on the performance of the model and uses a Lagrange multiplier to automatically find the right strength of regularisation during training. Concretely, we formulate learning as a constrained optimisation problem: minimise the expected number of attention edges, summed over layers, subject to the prediction loss not exceeding a target value. Here, $A_l$ denotes the gating matrix at layer $l$, $\mathrm{CE}$ is the standard next-token prediction cross-entropy loss, $\tau$ is the target loss, and $\theta$ denotes the model parameters. In practice, we set this target to the loss of the pre-trained baseline models. We solve this optimisation problem via Lagrangian relaxation, yielding a max-min objective over the model parameters $\theta$ and a multiplier $\lambda$.
This can be solved by taking gradient steps on $\theta$ and $\lambda$ alternately. During training, updating $\lambda$ automatically balances the strength of the sparsity regularisation: when $\mathrm{CE}$ is lower than the threshold, $\lambda$ decreases, and hence more weight is given to the sparsity regularisation term. This effectively acts as an adaptive schedule that continues to increase the strength of the regularisation until the model performance degrades. The value of $\tau$ is selected as a hyperparameter to ensure that the sparse model’s performance remains within a certain tolerance of the original base model. In practice, the choice of $\tau$ controls a trade-off between sparsity and performance: picking a tight $\tau$ can lead to a slower training process, whereas a higher tolerance can substantially speed up training at the cost of potentially harming model performance. In Appendix C, we provide further discussion on this optimisation process and its training dynamics.
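For concreteness, one way to write this constrained problem and its Lagrangian relaxation that is consistent with the description above (the exact normalisation of the edge penalty in Lei et al. (2025) may differ) is

$$\min_{\theta} \; \sum_{l} \mathbb{E}\!\left[\sum_{i,j} (A_l)_{ij}\right] \quad \text{s.t.} \quad \mathrm{CE}(\theta) \le \tau,$$

$$\max_{\lambda \ge 0} \; \min_{\theta} \; \sum_{l} \mathbb{E}\!\left[\sum_{i,j} (A_l)_{ij}\right] + \lambda\,\big(\mathrm{CE}(\theta) - \tau\big).$$

Under this form, a gradient ascent step on $\lambda$ is proportional to $\mathrm{CE}(\theta) - \tau$, which reproduces the dynamics described above: $\lambda$ shrinks while the loss is below the target, shifting relative weight onto the sparsity term.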
One of the main strengths of our proposed method is that, architecturally, the only difference between a sparse Transformer and a normal one lies in how the dot-product attention is computed. As such, most practical training techniques for optimising Transformers can be readily adapted to our setting. In our experiments, we find the following techniques helpful for improving computational efficiency and training stability.
LoRA finetuning (Hu et al., 2022). Low rank finetuning techniques can significantly reduce the computational requirements for training large models. In our experiments, we verify on a 7B parameter model that LoRA finetuning is sufficiently expressive for inducing sparse attention patterns.
FlashAttention (Dao, 2023). FlashAttention has become a standard method for reducing the memory footprint of dot-product attention mechanisms. In Appendix B, we discuss how the sampled sparse attention can be implemented in an analogous manner.
Distillation (Gu et al., 2024). Empirically, we find that adding an auxiliary distillation loss based on the KL divergence between the base model and the sparse model improves training stability and ensures that the behaviour of the model remains unchanged during post-training.
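As an illustration, a minimal form of such an auxiliary term might look as follows; the temperature, KL direction, and weighting are assumptions for this sketch, not the paper's reported settings.

```python
# Hedged sketch of an auxiliary distillation term: KL between the frozen base
# model's next-token distribution and the sparse model's. Temperature, KL
# direction, and weighting are illustrative assumptions.
import torch.nn.functional as F


def distillation_loss(sparse_logits, base_logits, temp: float = 1.0):
    log_q = F.log_softmax(sparse_logits / temp, dim=-1)   # sparse (student) model
    p = F.softmax(base_logits.detach() / temp, dim=-1)    # frozen base (teacher) model
    return F.kl_div(log_q, p, reduction="batchmean") * temp**2
```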
To evaluate the effectiveness of our post-training pipeline, we finetune pre-trained LLMs and compare their prediction performance and interpretability before and after applying sparsity regularisation. We perform full finetuning of a GPT-2 base model (Radford et al., 2019) (124M parameters) on the OpenWebText dataset (Gokaslan & Cohen, 2019). To investigate the generality and scalability of our method, we perform LoRA finetuning of the larger OLMo-7B model (Groeneveld et al., 2024) on the Dolma dataset (Soldaini et al., 2024), the dataset on which the base model was trained. The GPT-2 and OLMo models are trained on sequences of length 64 and 512, respectively. In the following subsections, we first present a quantitative evaluation of model performance and sparsity after sparse post-training. We then conduct two interpretability studies, using activation patching and attribution graphs, to demonstrate that our method enables the discovery of substantially smaller circuits.
We begin by evaluating both performance retention and the degree of sparsity achieved by post-training. We set cross-entropy targets of 3.50 for GPT-2 (base model: 3.48) and 2.29 for OLMo (base model: 2.24). After training, the mean cross-entropy loss for both models remains within ±0.01 of the target, indicating that the dual optimisation scheme effectively enforces a tight performance constraint. To quantify the sparsity achieved by the models, we evaluate them on the validation split of their respective datasets and compute the mean number of non-zero attention edges per attention head. We find that the sparsified GPT-2 model activates, on average, only 0.22% of its attention edges, while the sparsified OLMo model activates 0.44%, indicating substantial sparsification in both cases. Table 1 provides a summary of the results. To further verify that this drastic reduction in message passing between tokens does not substantially alter model behaviour, we evaluate the sparsified OLMo model on a subset of the benchmarks used to assess the original model. As shown in Figure 2, the sparse model largely retains the performance of the base model across a diverse set of tasks. In sum, our results demonstrate that sparse post-training is effective in consolidating information flow into a small number of edges while maintaining a commensurate level of performance.
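For reference, the edge-sparsity statistic can be estimated from the sampled gates along the following lines. This is a sketch only: `gates` is assumed to be the sampled binary mask $A$ with shape [batch, heads, seq, seq], and `causal_mask` marks the admissible key-query pairs.

```python
# Hedged sketch: fraction of active attention edges, averaged over heads and
# a validation batch, given the sampled binary gate mask.
import torch


def active_edge_fraction(gates: torch.Tensor, causal_mask: torch.Tensor) -> float:
    admissible = causal_mask.sum() * gates.shape[0] * gates.shape[1]
    return ((gates * causal_mask).sum() / admissible).item()
```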
We begin by outlining the experimental procedure used for circuit discovery. Activation patching (Nanda et al., 2023) is a widely used technique for identifying task-specific circuits in transformer models. In a typical setup, the model is evaluated on pairs of prompts: a clean prompt, for which the model predicts a correct target token, and a corrupted prompt that shares the overall structure of the clean prompt but is modified to induce an incorrect prediction. Here, the goal is to find the set of model components that is responsible for the model’s preference for the correct answer over the wrong one, as measured by the logit difference between the corresponding tokens. In activation patching, individual model components, such as attention heads and individual edges, can be ‘switched off’ by patching activations at specific positions. Circuit discovery amounts to finding a set of components whose replacement causes the model’s prediction to shift from the correct to the corrupted answer. Since searching over every possible subset of model components is infeasible due to the exponential number of potential subsets, we adopt a common heuristic to rank each model component. Specifically, for each individual component, we compute an importance score by replacing the activations of the component with the corrupted activations and measuring the effect on the logit difference. In our experiments, we use this ranking to select the top-k components and intervene on the model by freezing all remaining components, with the goal of identifying the minimal set that accounts for at least 90% of the model’s preference for the correct prediction. Note that these importance scores can be computed at two levels: (i) a single-sentence level, using a single pair of clean and corrupted inputs, and (ii) a global level, obtained by averaging scores across many task variants. In our experiments, we report results using single-sentence scores. In Appendix D, we also provide results using the global scores, which are largely consistent with our main results. There are also two standard approaches to freezing component activations: setting the activation to zero or replacing it with a mean activation value (Conmy et al., 2023). We evaluate both variants for each model and report results for the patching strategy that yields the smallest circuits.
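Schematically, the ranking step amounts to the following sketch. Here `run_with_patches` and `logit_diff` are hypothetical helpers standing in for a hooks-based implementation (e.g. with a patching library); they are not part of any released code.

```python
# Hedged sketch of the ranking heuristic: score each component by patching in
# its corrupted activation and measuring the drop in the clean-vs-corrupt
# logit difference. Helper functions are hypothetical placeholders.
def rank_components(model, clean, corrupt, components, run_with_patches, logit_diff):
    baseline = logit_diff(model(clean))
    scores = {}
    for c in components:                                  # e.g. (layer, head) pairs
        patched_out = run_with_patches(model, clean, corrupt, patch=[c])
        scores[c] = baseline - logit_diff(patched_out)    # drop in logit difference
    return sorted(scores, key=scores.get, reverse=True)   # most important first
```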
We first focus on the copy task with the following prompt: “AJEFCKLMOPQRSTVWZS, AJEFCKLMOPQRSTVWZ”, where the model has to copy the letter S to the next token position. This task is well studied and is widely believed to be implemented by emergent induction heads (Elhage et al., 2021), which propagate token information forward in the sequence. Figure 3 illustrates the attention patterns of the set of attention heads that explains this prompt for the sparse and base GPT-2 models. See Appendix D for analogous results for the OLMo models. The sparse model admits a substantially smaller set of attention heads (9 heads) than its fully connected counterpart (61 heads). Moreover, the identified heads in the sparse model exhibit cleaner induction head patterns, with each token attending to a single prior position at a fixed relative offset. These results illustrate how sparsification facilitates interpretability under simple ranking-based methods and support our hypothesis that sparse post-training yields models that are more amenable to mechanistic interpretability techniques.
To further verify our hypothesis, we repeat the experiment on classical circuit discovery tasks. For GPT-2, we evaluate variants of the Indirect Object Identification (IOI) task, in which the model copies a person’s name from the start of a sentence, and the Greater Than task, in which the model predicts a number that is larger than a previously mentioned number. To further assess the scalability of our approach, we investigate more challenging and longer-horizon tasks for OLMo, including a longer-context IOI task and a Docstring task, where the model needs to predict an argument name in a docstring based on an implemented function. Details of each task can be found in Appendix E. Figures 4 and 5 show the fraction of model behaviour explained as a function of the number of retained model components (attention heads and attention edges, respectively). Across all tasks and models, the sparse models consistently produce significantly smaller circuits, as measured by the number of model components needed to explain 90% of model prediction. This further corroborates our claim that sparse models lead to simpler and more interpretable internal circuits.
Next, we present a more fine-grained, feature-level investigation of whether sparsity in attention leads to interpretable circuits in practice using cross-layer transcoders (CLTs).
Since training CLTs on OLMo-7B is computationally prohibitive, we focus our analysis on the GPT-2 models. For the rest of the section, we analyse CLTs trained on the sparse and base GPT-2 models with an expansion factor of 32, which achieve a replacement score above 80% as measured with Circuit Tracer (Hanna et al., 2025). See Appendices F and G for details on training and visualisation.
We study the problem of attention attribution, which seeks to understand how edges between features are mediated.
The key challenge here is that any given edge can be affected by a large number of model components, making mediation circuits difficult to analyse both computationally and conceptually: computationally, exhaustive enumeration is costly; conceptually, the resulting circuits are often large and uninterpretable. In this experiment, we demonstrate that sparse attention patterns induced via post-training substantially alleviate these challenges, as the vast majority of attention components have zero effect on the computation.
As in (Ameisen et al., 2025), we define the total attribution score between feature $n$ at layer $\ell$ and position $k$, and feature $n'$ at layer $\ell'$ and position $k'$, in terms of the decoder vector $f^{\ell}_{k,n}$ of the source feature, the encoder vector $g^{\ell'}_{k',n'}$ of the target feature, and the Jacobian $J^{\ell',k'}_{\ell,k}$ from the MLP output at $(\ell, k)$ to the MLP input at $(\ell', k')$. This Jacobian is computed during a forward pass in which all nonlinearities are frozen using stop-gradient operations. Under this linearisation, the attribution score represents the sum over all linear paths from the source feature to the target feature.
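In this notation, the total attribution can be written as follows (a reconstruction from the definitions above; the indexing and sign conventions are assumptions):

$$\mathrm{Attr}\big((\ell,k,n)\!\to\!(\ell',k',n')\big) \;=\; \big(g^{\ell'}_{k',n'}\big)^{\top} J^{\ell',k'}_{\ell,k}\, f^{\ell}_{k,n}.$$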
To analyse how this total effect between two features is mediated by each model component, we define the component-specific attribution by subtracting the contribution of all paths that do not pass through the component. Here, $J^{\ell',k'}_{\ell,k,h}$ denotes a modified Jacobian computed under the same linearisation as above, but with the specific attention component $h$ additionally frozen via stop-gradient. These component-specific scores quantify how much each model component contributes to a particular edge between features.
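Correspondingly, a component-specific score consistent with this description is (again a hedged reconstruction using the symbols above)

$$\mathrm{Attr}_h \;=\; \big(g^{\ell'}_{k',n'}\big)^{\top} \big(J^{\ell',k'}_{\ell,k} - J^{\ell',k'}_{\ell,k,h}\big)\, f^{\ell}_{k,n},$$

so that subtracting the Jacobian computed with $h$ frozen isolates exactly the paths mediated by $h$.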
Empirically, we evaluate the method on ten pruned attribution graphs, computed on the IOI, Greater Than, completion, and category tasks. As in our circuit discovery experiments, we compute attribution scores at the level of attention heads as well as individual key-query pairs. In practice, attention sparsity yields substantial computational savings: because inactive key-query pairs are known a priori to have exactly zero attribution score, attribution need only be computed for a small subset of components. This reduces the computation time per attribution graph from several hours to several minutes.
In terms of circuit size, Figure 6 shows the mean cumulative distribution of component attribution scores for each edge in the attribution graph. We find that, to reach a cumulative attribution threshold of 90%, the sparse model on average requires 16.1× fewer key-query pairs and 3.4× fewer attention heads than the dense GPT-2 model, supporting our hypothesis that sparse attention patterns lead to simpler mediation circuits.
Next, we present a qualitative case study to showcase the benefits of sparse attention patterns. For a given key-query pair, we compute the causal effect from all other features in the attribution graph to both the key and the query vectors. Figure 7 illustrates this analysis for the prompt “The opposite of ’large’ is”. The resulting attribution graph decomposes into four coherent clusters of features: features related to opposite, features related to large, features activating on bracketed tokens, and the final next-token logit corresponding to small (see Appendix H for examples of features and visualisations).
Here, the features in the large cluster are directly connected to the small logit. The key question is then to understand how this connection from the large cluster to the small logit comes about. To this end, we analyse their mediation structure. We find that 80% of the cumulative attribution score of the edges connecting the large cluster to the small logit is mediated by the same five late-layer attention key-query pairs. These attention components map features from token position 5 directly into the final-layer residual stream at position 8, and thus operate in parallel.
For these five key-query pairs, we then compute the causal influence of all other features in the graph on their key and query vectors. The query vectors are primarily modulated by features associated with bracketed tokens at the last token position, while the key vectors are driven by strongly active features in both the opposite and large clusters, as shown in Figure 8. These results are in agreement with recent work on attention attribution and the “opposite of” attribution graph (Kamath et al., 2025). In stark contrast, Figure 7 (left) shows that a similar (and more computationally expensive) analysis on the dense model produces a much more complicated circuit. This case study illustrates the potential of sparse attention in the context of attribution graphs, as it enables a unified view of features and circuits. By jointly analyzing feature activations, attention components, and their mediating roles, we obtain a more faithful picture of the computational graph underlying the model’s input-output behavior.
Achieving interpretability requires innovations in both interpretation techniques and model design. We investigate how large models can be trained to be intrinsically interpretable.
We present a flexible post-training procedure that sparsifies transformer attention while preserving the original pretraining loss. By minimally adapting the architecture, we apply a sparsity penalty under a constrained-loss objective, allowing the pre-trained model to reorganise its connectivity into a much more selective and structured pattern.
Mechanistically, this induced sparsity gives rise to substantially simpler circuits: task-relevant computation concentrates into a small number of attention heads and edges.
Across a range of tasks and analyses, we show that sparsity improves interpretability at the circuit level by reducing the number of components involved in specific behaviours.
In circuit discovery experiments, most of the model’s behaviour can be explained by circuits that are orders of magnitude smaller than in dense models; in attribution graph analyses, the reduced number of mediating components renders attention attribution tractable. Together, these results position sparse post-training of attention as a practical and effective tool for enhancing the mechanistic interpretability of pre-trained models.
Limitations and Future Work. One limitation of the present investigation is that, while we deliberately focus on sparsity as a post-training intervention, it remains an open question whether injecting a sparsity bias directly during training would yield qualitatively different or simpler circuit structures. A comprehensive exploration of the performance trade-offs for larger models and for tasks that require very dense or long-range attention patterns would also be beneficial, even if beyond the computational means currently at our disposal. Moreover, while our study is restricted to sparsifying attention patterns, the underlying principle of leveraging sparsity to promote interpretability naturally extends to other components of the transformer architecture. As such, combining the proposed method with complementary approaches for training intrinsically interpretable models, such as Sparse Mixture-of-Experts (Yang et al., 2025), sparsifying model weights (Gao et al., 2024), or limiting superposition, offers a promising direction for future work. Another exciting avenue for future work is to apply the sparsity regularisation framework developed here within alternative post-training paradigms, such as reinforcement learning (Ouyang et al., 2022; Zhou et al., 2024) or supervised fine-tuning (Pareja et al., 2025).
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
For the experiments, we implemented efficient GPU kernels for the sparse attention layers using the Helion domain-specific language. We refer to this implementation as Splash Attention (Sparse Flash Attention). Our implementation follows the same core algorithmic structure as FlashAttention-2 (Dao, 2023), including the use of online softmax computation and tiling. Note that the sparse attention variant (Eq. 2) differs from standard attention only by a pointwise multiplication with the adjacency matrix, which can be easily integrated into FlashAttention by computing $A_{ij}$ on the fly. We additionally fuse the Gumbel-Softmax computation, the straight-through gradient, and the computation of the expected number of edges (required for the penalty) into a single optimised kernel, the implementation of which will be released together with the experiment code. Figure 10 compares our Splash Attention implementation against a naive baseline based on PyTorch-native operations. We provide the key hyperparameters for our experiments in Table 2. All training is performed on NVIDIA H100 GPUs.
The GPT-2 model is trained on a single GPU while the OLMo model is trained on a node of 8 GPUs. The total training time for both models is roughly 14 days. The main sparse attention code will be made available as a Transformer library wrapper.
The implementation code as well as the model weights will also be released. A key feature of our post-training framework is that the strength of the sparsity regularisation is automatically controlled via a constrained optimisation scheme. By pre-specifying an accepted level for the cross-entropy target, $\tau$, the training procedure can be written as the max-min objective introduced in the main text,
which can be optimised by taking alternating gradient steps in the model weight space and in the $\lambda$ space. The resulting training dynamics are such that the sparsity regularisation strength increases when the model cross-entropy is lower than the target, and decreases when it is above the threshold. Figure 11 shows the training curves for the OLMo-7B model. Here, we observe that the strength of the sparsity regularisation keeps increasing slowly while the model cross-entropy is clipped at the desired level. Note that during a spike in the training loss (at around 100K steps), the sparsity regularisation automatically decreases to let the model recover.
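A minimal sketch of the alternating updates is given below, assuming a simple dual-ascent rule for the multiplier; learning rates, the tuple returned by the model, and the exact update rule for $\lambda$ are illustrative assumptions rather than the settings used in our experiments.

```python
# Hedged sketch of the alternating (dual) updates: a descent step on the model
# weights followed by an ascent step on the Lagrange multiplier.
def post_training_step(model, batch, lam, opt_theta, lr_lambda=1e-3, tau=2.29):
    ce, expected_edges = model(batch)             # assumed to return CE loss and edge penalty
    loss = expected_edges + lam * (ce - tau)      # Lagrangian of the constrained problem
    opt_theta.zero_grad()
    loss.backward()
    opt_theta.step()                              # descent step on model weights

    # Ascent step on the multiplier: lambda grows when CE exceeds the target
    # and shrinks towards zero when the constraint is satisfied.
    lam = max(0.0, lam + lr_lambda * (ce.item() - tau))
    return lam
```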
In this section, we provide additional results for the activation patching circuit discovery experiment presented in the main text.
Figure 12 shows the attention patterns of the heads required to explain 90% of model behaviour on a copy task. To fully test the longer context window afforded by OLMo, we use a longer prompt than the one used for GPT-2 in the main text. The result is consistent with the GPT-2 experiment: the sparsified model facilitates the discovery of smaller circuits of induction heads that implement the copy task. Here, the ranking is performed at the task level, meaning that the importance score for each component is pooled across different instances of the same task. Overall, the results are commensurate with those presented in the main paper, with the ranking strategy consistently discovering smaller circuits in sparse models. The only exception is the Greater Than task for GPT-2, where the number of attention heads required for the sparse model is larger than that of the base model. We hypothesise that this is because the sparse model chooses different circuits to implement different instances of the same task, rendering the task-level importance score less suitable for circuit discovery in this case. Finally, in Figure 15, we provide a qualitative visualisation of the edges required to complete the IOI task.
Figure 15. An example of the attention-head edges required to reach a 0.9 cumulative score based on the averaged scores for the IOI task (panels: Sparse GPT-2; GPT-2 baseline).
In the following, we provide the details and the prompts for the various tasks used in Section 4.2.
Each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations, answers and wrong answers. A typical entry is:
{
  "clean": "The demonstrations lasted from the year 1363 to 13",
  "corrupt": "The demonstrations lasted from the year 1301 to 13",
  "answers": ["64", "65", …, "99"],
  "wrong_answers": ["00", "01", …, "63"]
}
For the clean prompt, any token in answers yields an end year strictly greater than the start year (e.g. “1364”-“1399”), whereas tokens in wrong answers correspond to years that are less than or equal to the start year. The corrupt prompt changes only the starting year, shifting which continuations correspond to valid end years. We use the logit difference between the aggregated probability mass on answers vs. wrong answers in clean vs. corrupt contexts as our signal, in the spirit of prior mechanistic studies on simple algorithmic tasks (Elhage et al., 2021; Nanda et al., 2023).
Our IOI setup follows the standard indirect object identification paradigm for mechanistic interpretability (Elhage et al., 2021;Conmy et al., 2023). Each example is generated by combining:
• a pair of names (A, B), e.g. (" Mary", " John");
• a natural-language template with placeholders for the names.
For example, with (A, B) = (" John", " Mary") and S = B, one instance is:
“Then, Mary and John went to the park. Mary gave a ball to”
The correct continuation is " John", while " Mary" and any distractor names are treated as incorrect candidates.
In the OLMo experiments, in order to further test the capability of our approach, we use a different set of IOI tasks with increased complexity and prompt length. Example templates include:
“After several months without any contact due to conflicting schedules and unexpected personal obligations, [B] and [A] finally met again at the park, where they spent a long afternoon catching up on past events, sharing stories, and reflecting on how much had changed. As the day came to an end, [S] gave a ball to”
“Although [B] and [A] had previously been involved in a long and emotionally charged argument that left several issues unresolved, they agreed to meet in order to clarify their misunderstandings. After a tense but honest conversation, [S] said to”
We also test the OLMo models on a more complex Docstring task (Heimersheim & Janiak, 2023;Conmy et al., 2023), where the model needs to attend to a specific argument for a specified function in order to complete a Docstring. Similarly to the Greater Than task, each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations. A typical entry is:
{
  "clean": "def model(self, results, old, option): \"\"\" stage agency security vision spot tone joy session river unit :param results: bone paper selection sky :param old: host action hell miss :param",
  "corrupt": "def model(self, command, output, state): \"\"\" stage agency security vision spot tone joy session river unit :param old: bone paper selection sky :param results: host action hell miss :param",
  "answers": [" option"],
  "wrong_answers": [" results", " old"]
}
The cross-layer transcoders are trained to balance reconstruction accuracy with sparsity and dead-feature regularization. In the training objective, $W^{\ell}_{\mathrm{dec}}$ denotes the concatenated decoder weights associated with layer $\ell$, $h^{\mathrm{pre}}_{\ell}$ are the corresponding pre-activation values, $\tau$ is a threshold parameter, and $C$ is a scaling constant. The hyperparameters $\lambda_0$ and $\lambda_{\mathrm{df}}$ control the strength of the sparsity and dead-feature regularization terms. We initialise the weights following the circuits updates of Conerly et al. (2025). The encoder bias is initialised so that a fixed proportion of the features is active at initialisation. In Figure 16, we provide the training curves of the sparsity value, the sparsity coefficient, the explained variance, and the number of dead features. We hope this can help the community in training their own cross-layer transcoders.
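For reference, a loss of this general shape, combining reconstruction error with a decoder-norm-weighted sparsity penalty and a pre-activation penalty for dead features, could be written as follows. This is a hedged sketch using the symbols above, with the reconstruction targets $y^{\ell}$, predictions $\hat{y}^{\ell}$, and feature activations $a^{\ell}_{i}$ introduced here for illustration; it is not necessarily the exact objective used.

$$\mathcal{L} \;=\; \sum_{\ell} \big\lVert y^{\ell} - \hat{y}^{\ell} \big\rVert_2^2 \;+\; \lambda_0 \sum_{\ell, i} \tanh\!\big(C\, a^{\ell}_{i}\, \lVert W^{\ell}_{\mathrm{dec},\,i} \rVert_2\big) \;+\; \lambda_{\mathrm{df}} \sum_{\ell, i} \mathrm{ReLU}\!\big(\tau - h^{\mathrm{pre}}_{\ell, i}\big).$$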
Following Ameisen et al. (2025), we define the attribution score between feature $n$ at layer $\ell$ and position $k$, and feature $n'$ at layer $\ell'$ and position $k'$, as a sum over the layers $s$ targeted by the cross-layer decoder. Here, $f^{\ell \to s}_{k,n}$ denotes the decoder vector associated with feature $n$ projecting from layer $\ell$ to layer $s$, $J^{\ell',k'}_{s,k}$ is the Jacobian mapping the MLP output at $(s, k)$ to the MLP input at $(\ell', k')$, and $g^{\ell'}_{k',n'}$ is the corresponding encoder vector for feature $n'$ at layer $\ell'$ and position $k'$. The sum over $s$ reflects the cross-layer mapping of the cross-layer transcoder.
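In this notation, the cross-layer attribution score can be written as (a reconstruction from the definitions above, consistent with the main-text attribution score)

$$\mathrm{Attr}\big((\ell,k,n)\!\to\!(\ell',k',n')\big) \;=\; \sum_{s} \big(g^{\ell'}_{k',n'}\big)^{\top} J^{\ell',k'}_{s,k}\, f^{\ell \to s}_{k,n}.$$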
The Jacobian is computed during a modified forward pass in which all nonlinear operations, including normalization layers, attention mechanisms, and MLPs, are frozen using stop-gradient operations. The resulting attribution graph is pruned by retaining only those features that cumulatively explain 80% of the contribution to the final logit, and only those edges that account for 95% of the total edge-level effect. All attribution computations are performed using the circuit-tracer library (Hanna et al., 2025). For a complete description of the attribution graph computation and pruning, we refer the reader to Ameisen et al. (2025).
For the visualization and the autointerp, we implement our own pipeline. In Figure 17, we show a screenshot of the interface for the ‘The opposite of “large” is “’ attribution graph. The features are colored with respect to their corresponding clusters.
Figure 17. Circuit-tracing interface example for ‘The opposite of “large” is “’ with GPT-2-sparse. Clusters: opposite, large, brackets, say small.
H. Graph: The opposite of “large” is "
We obtain a replacement score of 0.82, with 459 features identified before pruning and 82 features remaining after pruning. The majority of features in the resulting attribution graph fall into four dominant clusters:
• Opposition cluster: features associated with opposition and comparison, primarily localized at the token position corresponding to opposite.
• Magnitude cluster: features related to notions of size (e.g., large, big, full, medium), predominantly located in the residual stream at the large token position.
• Bracket cluster: features that activate on tokens enclosed in brackets.
• Final-logit cluster: mainly the final logit itself and a couple of features that activate before the token “small” or related terms.
In the boxes below, we present the top activations of representative feature sets for each cluster.
Feature 1117 (“Opposite” cluster)
in Washington has now adopted the wider measure of student debt outstanding. This new the situation in Syria, Iran and the wider region. “The recharged by the wider dense forests of Sanjay Van and its overflow drained public, with interesting accounts of Oswald’s demeanor at this significant moment has a slightly wider range. Specifically, the Atom-powered NANO 56 becoming part of the wider Seven Years’ War in which Britain and France
Feature 1337 (“Opposite” cluster)
opposite, piece of Mexico’s cultural identity. I made the hour opposite shows, or something bigger, “where there’s villains opposite sides of Mars in 2004 and used their instruments to discover geologic evidence opposite, but not anymore. Now everything he says to me is some kind opposite direction, and had little trouble finding space at the campsites.
always seem to be just the opposite.
show a growing trend to cast “no” votes , opposing how much salary and the occupation of the opposing forces was generally limited to mutual observation.
work hand in hand for the purpose of opposing all movements of the thinking part the defense’s inability to stop opposing run games. The Bills have ing opposing quarterbacks. The Seahawks not only had depth, they were versatile.
to win more hand battles particularly when the opposing tackle neutralizes his initial
Technically, the sampled A affects the computation. This can be mitigated by adding a positive bias term inside the sigmoid function to ensure all gates are open at initialisation. Experimentally, we found this to be unnecessary as the models quickly recover their original performance within a small number of gradient steps.
The largest open-source CLT is on Gemma-2B at the time of writing.