Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Building on recent progress in fine-tuning language models around a fixed sparse autoencoder (SAE), we disentangle the decoder matrix into nearly orthogonal features. This reduces interference and superposition between the features while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with a stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under $\texttt{https://github.com/mrtzmllr/sae-icm}$.


💡 Research Summary

This paper tackles two longstanding challenges in mechanistic interpretability of large language models (LLMs): the lack of identifiability of sparse autoencoder (SAE) features and the difficulty of performing isolated interventions on those features. Building on recent work that fine‑tunes LLMs around a fixed SAE, the authors introduce an orthogonality regularization term applied to the decoder matrix of the SAE. The regularizer penalizes the off‑diagonal entries of DᵀD, encouraging the feature vectors (the columns of D) to become nearly orthogonal.
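The paper does not spell out the exact functional form of the regularizer in this summary, but a penalty on the off-diagonal entries of DᵀD can be sketched minimally as follows (the function name and column normalization are assumptions for illustration):

```python
import numpy as np

def orthogonality_penalty(D: np.ndarray) -> float:
    """Sum of squared off-diagonal entries of D^T D, where the
    columns of D (shape: d_model x n_features) are the feature vectors."""
    # Normalize columns so the diagonal of the Gram matrix is exactly 1.
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    gram = Dn.T @ Dn
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

# A perfectly orthogonal dictionary incurs zero penalty.
print(orthogonality_penalty(np.eye(4)))  # 0.0
```

Driving this quantity toward zero pushes every pair of distinct feature columns toward mutual orthogonality, which is exactly the property the paper exploits.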

The theoretical motivation draws from classical dictionary learning. In over‑complete dictionaries, low self‑coherence (μ) guarantees that K‑sparse representations are unique (Donoho et al., 2005). By shrinking μ through the orthogonality penalty, the authors ensure that the learned features become identifiable: no two distinct sets of sparse coefficients can reproduce the same signal. They formalize this connection using finite frame theory: the analysis operator T maps a residual stream vector to its inner products with the feature set, while the synthesis operator T* reconstructs the stream from coefficients. When a feature fₗ is intervened on (adding a scalar α to its coefficient), the coefficients of all other features fⱼ shift by α⟨fⱼ, fₗ⟩. Theorem 3.1 quantifies this “interference” as proportional to the inner products between features, showing that more orthogonal features lead to less spill‑over during interventions.
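The interference claim can be checked numerically: if coefficients are read off via the analysis operator (inner products with unit-norm features), then adding α·fₗ to the residual stream shifts coefficient j by exactly α⟨fⱼ, fₗ⟩. A small sketch (all names here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 8
# Unit-norm (generally non-orthogonal) feature directions as rows.
F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

x = rng.normal(size=d)
alpha, l = 2.0, 3

# Analysis operator T: coefficients are inner products with the features.
c = F @ x
# Intervene on feature l by adding alpha * f_l to the residual stream.
c_new = F @ (x + alpha * F[l])

# Coefficient j shifts by alpha * <f_j, f_l> (the theorem's interference
# term), so near-orthogonal features suffer less spill-over.
shift = c_new - c
assert np.allclose(shift, alpha * (F @ F[l]))
```

For an exactly orthonormal feature set, F @ F[l] is the l-th standard basis vector, so only the intervened coefficient moves.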

Experimentally, the authors follow a two‑step pipeline. First, they fine‑tune a pre‑trained Top‑K SAE on a reconstruction loss augmented with the orthogonality penalty, exploring λ ∈ {0, 10⁻⁶, 10⁻⁵, 10⁻⁴}. Second, they embed the SAE into a 2B‑parameter Gemma transformer (after layer 12) and fine‑tune the whole model with low‑rank adaptation on the MetaMathQA dataset for a single epoch. They evaluate four aspects: (1) orthogonality loss, (2) downstream math reasoning performance on GSM8K, (3) interpretability via human‑readable explanations generated by Llama 3.1 8B‑Instruct, and (4) the cosine similarity between embedded feature explanations.
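Putting the first pipeline step together, the fine-tuning objective pairs a Top-K SAE's reconstruction error with the λ-weighted orthogonality term. A minimal sketch, with all parameter names assumed for illustration (the authors' actual encoder architecture and loss details may differ):

```python
import numpy as np

def topk_sae_loss(x, W_enc, b_enc, D, k, lam):
    """Reconstruction loss of a Top-K SAE plus an orthogonality
    penalty on the decoder D (illustrative sketch, not the paper's code)."""
    z = (W_enc @ x + b_enc).copy()
    # Top-K sparsity: zero all but the k largest activations.
    z[np.argsort(z)[:-k]] = 0.0
    x_hat = D @ z
    recon = np.sum((x - x_hat) ** 2)
    # Penalize deviation of the column-normalized Gram matrix from identity.
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    ortho = np.sum((Dn.T @ Dn - np.eye(D.shape[1])) ** 2)
    return recon + lam * ortho
```

With λ = 0 this reduces to ordinary SAE fine-tuning; the swept values λ ∈ {10⁻⁶, 10⁻⁵, 10⁻⁴} trade reconstruction fidelity against decoder orthogonality.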

Results show a monotonic decrease in the orthogonality loss as λ increases, confirming that the regularizer successfully pushes the dictionary toward orthogonality. Crucially, GSM8K accuracy remains essentially unchanged across all λ values (≈ 0.68 for λ = 0, ≈ 0.67 for λ = 10⁻⁴), demonstrating that the regularization does not sacrifice task performance. Interpretability scores—measured as the ability of a language model to match a generated explanation to one of five example snippets—stay around 40 % for all λ, well above the 20 % random baseline, indicating that orthogonalization does not degrade the human‑readability of features. Most strikingly, the average cosine similarity between embedded explanations drops from ~0.60 at λ = 0 to below 0.58 at λ = 10⁻⁴, suggesting that orthogonal features are more semantically distinct.
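The final metric above, the average cosine similarity between embedded explanations, is straightforward to compute; a sketch of one plausible implementation (the paper's exact averaging scheme is not specified in this summary):

```python
import numpy as np

def mean_pairwise_cosine(E: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows,
    where each row is the embedding of one feature explanation."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T
    # Take only the strict upper triangle: each unordered pair once.
    iu = np.triu_indices(E.shape[0], k=1)
    return float(sims[iu].mean())

# Orthogonal explanation embeddings score 0.0; identical ones score 1.0.
print(mean_pairwise_cosine(np.eye(2)))  # 0.0
```

Under this metric, the reported drop from ~0.60 to below 0.58 means the explanations' embedding vectors point in slightly more distinct directions as λ grows.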

The authors also conduct concrete intervention experiments. They identify a feature corresponding to the concept “Jerry”, zero it out, and replace it with a feature that activates on the prefix “aqua”. The model still solves the arithmetic problem (producing the correct answer 624) while swapping the name “Jerry” for “Aquaman” in the generated text, and the downstream effect on other concepts is minimal. This demonstrates that the orthogonalized features satisfy the Independent Causal Mechanisms (ICM) principle: intervening on one latent mechanism does not alter the functional form of the others.
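In latent space, the described swap amounts to zeroing one SAE activation and turning on another before decoding. A minimal sketch of that edit (function name, indices, and the activation-reuse heuristic are illustrative assumptions):

```python
import numpy as np

def swap_feature(z, src, dst, strength=None):
    """Zero out feature `src` and activate feature `dst` in the
    SAE latent vector z, mimicking the "Jerry" -> "aqua" swap."""
    z = z.copy()
    if strength is None:
        strength = z[src]  # reuse the original activation scale
    z[src] = 0.0
    z[dst] = strength
    return z

z = np.array([0.0, 3.5, 0.0, 0.0])
z_new = swap_feature(z, src=1, dst=3)
# Feature 1 is zeroed; feature 3 now carries the original strength 3.5.
```

By Theorem 3.1, the decoded edit D @ (z_new - z) barely perturbs the coefficients of unrelated features when the decoder columns are near-orthogonal, which is why the arithmetic answer survives the name swap.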

In summary, the paper makes three key contributions: (1) it introduces an orthogonality regularizer that yields identifiable, nearly orthogonal SAE features; (2) it shows empirically that this regularization preserves downstream performance and interpretability; and (3) it provides both theoretical analysis and practical experiments confirming that more orthogonal features enable isolated, low‑interference interventions, aligning with causal representation learning goals. The work offers a compelling pathway toward more controllable and explainable language models.

