Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic cost of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSMs using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 5 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks indeed hinder performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.


💡 Research Summary

State‑space models (SSMs) have emerged as a compelling alternative to transformers, offering linear‑time training and constant‑time inference per token by maintaining a fixed‑size hidden state. Despite these efficiency gains, the recurrent nature of SSMs makes them opaque: they lack explicit neuron‑level representations that have been the focus of most mechanistic interpretability work on transformers. This paper addresses the gap by introducing a systematic interpretability pipeline for the Mamba family of SSMs and by demonstrating how the insights can be turned into practical performance improvements.

The authors first train sparse autoencoders (SAEs) on the hidden activations of Mamba, then apply dictionary learning to obtain a set of sparse latent features that define meaningful activation subspaces. Using Stochastic Parameter Decomposition (SPD), they compute layer‑wise statistics such as activation mean, variance, sparsity, entropy, effective rank, coefficient of variation, gradient sensitivity, and post‑ablation KL divergence. An entropy spike observed at layer 20 signals a “bottleneck” where diverse information is forced through a narrow set of parameters.
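The layer-wise statistics described above can be sketched on a single activation matrix. This is an illustrative implementation, not the paper's code: the exact normalizations the authors use (e.g. how entropy and effective rank are defined over activations) are assumptions here.

```python
import numpy as np

def activation_stats(acts: np.ndarray) -> dict:
    """Illustrative layer-wise statistics over a (tokens, d_model)
    activation matrix: mean, variance, sparsity, entropy,
    effective rank, and coefficient of variation."""
    eps = 1e-12
    # Sparsity: fraction of near-zero activations.
    sparsity = float((np.abs(acts) < 1e-6).mean())
    # Entropy of the normalized mean absolute activation per dimension.
    p = np.abs(acts).mean(axis=0)
    p = p / (p.sum() + eps)
    entropy = float(-(p * np.log(p + eps)).sum())
    # Effective rank: exponentiated entropy of the normalized
    # singular-value spectrum of the activation matrix.
    s = np.linalg.svd(acts, compute_uv=False)
    q = s / (s.sum() + eps)
    eff_rank = float(np.exp(-(q * np.log(q + eps)).sum()))
    cv = float(acts.std() / (abs(acts.mean()) + eps))
    return {"mean": float(acts.mean()), "var": float(acts.var()),
            "sparsity": sparsity, "entropy": entropy,
            "effective_rank": eff_rank, "cv": cv}
```

A sharp rise in this entropy at one layer, relative to its neighbors, is the kind of signal the paper interprets as a bottleneck.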

To pinpoint the functional components responsible for this bottleneck, the paper introduces “Delta‑Sensitive” subspaces. In Mamba, the Δ‑parameter governs the discretized time‑step of the state‑space update, controlling how strongly new inputs influence the recurrent hidden state. By recording the activation subspace values across a large corpus and selecting those with the highest variance, the authors identify 668 out of 768 subspaces as highly Δ‑sensitive. Ablating these subspaces dramatically degrades next‑token prediction accuracy, confirming their critical role.
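The selection-by-variance step and the ablation check can be sketched as follows. This is a minimal illustration of the procedure as summarized above; the function names and the zero-ablation choice are assumptions, not the paper's implementation.

```python
import numpy as np

def top_variance_subspaces(acts: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k activation dimensions with the highest
    variance across a corpus; acts has shape (tokens, d_model)."""
    return np.argsort(acts.var(axis=0))[::-1][:k]

def ablate(acts: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Zero out the selected subspaces (a simple ablation; the paper
    may use a different ablation scheme, e.g. mean-ablation)."""
    out = acts.copy()
    out[:, idx] = 0.0
    return out
```

In this framing, the paper's finding is that with d_model = 768, selecting k = 668 captures the Δ-sensitive subspaces whose ablation sharply degrades next-token prediction.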

Armed with this knowledge, the authors propose a test‑time steering intervention: multiply the activations of the identified Delta‑Sensitive subspaces by a scalar factor. A simple grid search on a held‑out portion of The Pile reveals that a factor of 5 yields the best results, while a factor of 2 also provides modest gains. Importantly, this steering requires no fine‑tuning and is applied only during inference. Across five SSM variants (Mamba‑130M, Mamba‑2, DenseMamba, Hyena, MiniPLM‑Mamba) and six diverse benchmarks (TriviaQA, SQuAD, MuSiQue, IFEval, RULER, DROP), the steering improves average performance by 8.27% relative to the unmodified models.
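The intervention itself is just a scalar multiply on the selected dimensions. A minimal sketch, shown here on a standalone activation matrix (in practice it would be applied inside the model's forward pass, e.g. via a hook); the function name and default factor mirror the summary above but are otherwise illustrative.

```python
import numpy as np

def steer(acts: np.ndarray, idx, factor: float = 5.0) -> np.ndarray:
    """Test-time steering: scale the activations of the identified
    Delta-Sensitive subspaces by a scalar factor, leaving the
    remaining dimensions unchanged."""
    out = np.array(acts, dtype=float, copy=True)
    out[..., idx] *= factor
    return out
```

Because the edit is a single elementwise multiply on a fixed index set, it adds essentially no inference cost and requires no gradient updates.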

To address limitations that persist even after steering—particularly poor long‑context behavior—the paper introduces Stable‑Mamba, a minimally altered architecture that explicitly removes the identified bottlenecks. Stable‑Mamba adds only 256 parameters and incorporates several design changes: multi‑timescale state dynamics for simultaneous short, medium, and long‑range processing; an ensembled output projection that adaptively weights contributions from each timescale; sparse global‑context injection via a lightweight attention mechanism; learned gating to selectively activate task‑relevant features; adaptive compression length to control phase‑specific capacity; and gradient scaling with scaled residual connections to improve training stability. These modifications have negligible impact on inference speed or memory. When trained from scratch on The Pile, Stable‑Mamba achieves notable gains on long‑context benchmarks such as Long Range Arena, RULER, and LongContext V2, outperforming the original Mamba by roughly 10–15% absolute accuracy.
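Of these changes, the ensembled output projection is the easiest to sketch: per-timescale hidden states are combined with learned softmax weights. This is a guess at the mechanism consistent with the description above, not the actual Stable-Mamba code; the shapes and the softmax parameterization are assumptions.

```python
import numpy as np

def mix_timescales(states: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Softmax-weighted ensemble of per-timescale hidden states.
    states: (n_scales, d_model) stack of short/medium/long-range
    states; logits: (n_scales,) learned mixing scores."""
    w = np.exp(logits - np.max(logits))  # numerically stable softmax
    w = w / w.sum()
    return (w[:, None] * states).sum(axis=0)
```

A parameterization like this is cheap: with three timescales the mixing weights contribute only a handful of parameters, consistent with the paper's claim of a tiny (256-parameter) overhead.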

In summary, the paper makes three major contributions: (1) a novel mechanistic interpretability framework for SSMs that identifies activation subspace bottlenecks; (2) a simple, task‑agnostic test‑time steering technique that amplifies the identified subspaces and yields consistent performance improvements without additional training; and (3) the design of Stable‑Mamba, which structurally eliminates the bottlenecks and further enhances long‑context capabilities while adding only a tiny parameter overhead. This work demonstrates that SSMs can be both efficient and interpretable, opening the door to more transparent and controllable sequence models that rival transformers in accuracy.

