Discovering Interpretable Algorithms by Decompiling Transformers to RASP
Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing Transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.
💡 Research Summary
The paper introduces a systematic pipeline for extracting human‑readable algorithms from trained transformer models by translating them into a RASP‑style language called Decompiled RASP (D‑RASP) and then simplifying the resulting program through causal interventions. The authors first define D‑RASP, a dialect that mirrors the primitives of transformers—selectors, aggregations, element‑wise operations, and projection to logits—while exposing intermediate variables in token‑ and position‑bases. Crucially, D‑RASP works with real‑valued activations and softmax‑based aggregation, extending earlier RASP variants that were limited to counts or booleans.
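The primitives attributed to D‑RASP above — softmax-based selection, aggregation, and elementwise operations — can be sketched roughly as follows. This is a minimal NumPy illustration; the function names `select`, `aggregate`, and `elementwise` are chosen to mirror the summary's terminology and are not the paper's actual API:

```python
import numpy as np

def select(keys, queries, score_fn):
    """Build a soft selector: score every (query, key) pair, then
    normalize with a causal softmax over key positions (D-RASP uses
    real-valued activations and softmax-based aggregation)."""
    n = len(queries)
    scores = np.array([[score_fn(q, k) for k in keys] for q in queries])
    # causal mask: position i may only attend to positions <= i
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = scores + mask
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def aggregate(selector, values):
    """Weighted average of values under the selector's attention weights."""
    return selector @ values

def elementwise(fn, *variables):
    """Position-wise map over intermediate variables (the MLP's role)."""
    return fn(*variables)
```

With a constant score function, `select` yields uniform causal attention, so `aggregate` computes a running prefix mean of the values.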
The first stage of the pipeline, “Faithful Re‑parameterization,” shows that any GPT‑2‑style transformer satisfying a Linear Layer‑Norm Assumption (LLN‑A) can be exactly expressed as a D‑RASP program. Under LLN‑A, each layer‑norm is replaced by a linear transformation with negligible effect on the model’s input‑output behavior. The authors prove (Theorem 3.2) that the residual stream at every layer can be reconstructed as a linear combination of D‑RASP variables. Concretely, token and position embeddings become the initial variables token and pos. For each attention head, four selectors are built to capture all key‑query interactions (token‑to‑token, token‑to‑position, position‑to‑token, and position‑to‑position). These selectors feed into an aggregate operation that produces head‑specific position and token summaries. The head output is then a linear combination of these summaries, and the MLP is represented as an elementwise operation. By recursively applying this construction layer‑by‑layer, the entire transformer is mapped to a D‑RASP program whose final project statements use the model’s unembedding matrix to produce next‑token logits.
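The four-selector decomposition has a simple algebraic core: because the residual stream entering the first head is the sum of token and position embeddings, each key–query score splits exactly into four bilinear terms. A small NumPy check of this identity (all variable names here are illustrative, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
tok = rng.normal(size=(n, d))   # token embeddings
pos = rng.normal(size=(n, d))   # position embeddings
W_Q = rng.normal(size=(d, d))   # query projection
W_K = rng.normal(size=(d, d))   # key projection

# residual stream input to the head
x = tok + pos
full_scores = (x @ W_Q) @ (x @ W_K).T

def scores(a, b):
    """Bilinear key-query score between two embedding components."""
    return (a @ W_Q) @ (b @ W_K).T

# the four interaction terms: token-token, token-pos, pos-token, pos-pos
decomposed = (scores(tok, tok) + scores(tok, pos)
              + scores(pos, tok) + scores(pos, pos))

assert np.allclose(full_scores, decomposed)
```

Each of the four terms corresponds to one of the selectors the construction builds per head; summing them recovers the head's full attention scores.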
The second stage, “Causal Pruning & Simplification,” aims to reduce the exponentially large program to a minimal, interpretable sub‑program while preserving functional fidelity. The authors perform causal ablations: selectors are replaced with zeros or key‑only versions if doing so does not significantly affect predictions; inputs to elementwise ops are removed and absorbed into the function’s bias; and unused variables are dropped entirely. Faithfulness is measured by “match accuracy,” the proportion of inputs for which the pruned program’s top‑token predictions match those of the original model. A pruned program that retains at least 90% match accuracy counts as a successful decompilation.
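The pruning stage can be pictured as a greedy ablation search against the match-accuracy criterion: try removing each component and keep the removal only if agreement with the original model stays above the threshold. The `ToyProgram` class below is a hypothetical stand-in for the paper's program objects, not its implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyProgram:
    """Hypothetical stand-in: one weight per component; 0 = ablated."""
    weights: tuple

    def components(self):
        return [i for i, w in enumerate(self.weights) if w != 0]

    def ablate(self, i):
        w = list(self.weights)
        w[i] = 0
        return ToyProgram(tuple(w))

    def predict(self, x):
        # "top token" stand-in: sign of a weighted sum of input features
        return sum(w * xi for w, xi in zip(self.weights, x)) >= 0

def match_accuracy(program, reference, inputs):
    """Fraction of inputs where top-token predictions agree."""
    agree = sum(program.predict(x) == reference.predict(x) for x in inputs)
    return agree / len(inputs)

def greedy_prune(program, reference, inputs, threshold=0.9):
    """Ablate components one at a time, keeping an ablation only if
    match accuracy against the reference stays above the threshold."""
    for comp in program.components():
        candidate = program.ablate(comp)
        if match_accuracy(candidate, reference, inputs) >= threshold:
            program = candidate
    return program
```

On inputs where two of three components never contribute, the search strips those components while keeping agreement at 100%, mirroring how dead selectors and unused variables are dropped.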
Experiments are conducted on small GPT‑2‑style models (1‑layer 4‑head and 4‑layer 4‑head) trained on algorithmic tasks such as “most frequent character,” induction‑head copying, and bounded‑depth Dyck language bracket counting. For models that successfully length‑generalize, the decompilation pipeline recovers compact programs that align with known theoretical explanations: a selector‑free aggregation of token counts followed by an identity projection for the most‑frequent‑character task; an induction‑head pattern that attends to the token following an earlier occurrence of the current token and copies it; and a counting program that aggregates opening and closing brackets. These recovered programs are often only a few lines long and consist of primitives from the provided library, demonstrating that the models internally implement simple RASP‑like algorithms. Conversely, models that fail to generalize either cannot be pruned without a sharp drop in match accuracy or yield programs that remain dense and unintuitive, suggesting reliance on more entangled mechanisms.
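For intuition, the recovered most‑frequent‑character program plausibly behaves like the following sketch: a uniform (selector‑free) causal aggregation of one‑hot token encodings yields running frequencies, and an identity projection turns them into logits. This is an illustrative reconstruction of the mechanism the summary describes, not the paper's verbatim output:

```python
import numpy as np

def most_frequent_char_program(tokens, vocab):
    """Selector-free aggregation + identity projection (illustrative)."""
    one_hot = np.eye(len(vocab))[[vocab.index(t) for t in tokens]]
    # uniform causal aggregation = running average of one-hot vectors,
    # i.e. the empirical frequency of each character so far
    n = len(tokens)
    freqs = np.cumsum(one_hot, axis=0) / np.arange(1, n + 1)[:, None]
    logits = freqs  # identity projection to the vocabulary
    return vocab[int(np.argmax(logits[-1]))]
```

For example, on the input `"abacab"` the running frequencies at the final position are a: 3/6, b: 2/6, c: 1/6, so the program predicts `'a'`.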
The paper also situates D‑RASP within the broader theoretical landscape. By imposing rounded semantics and restricting parameters to rational logarithms, D‑RASP becomes equivalent to C‑RASP (Theorem 2.1), linking it to prior work on length‑generalization and to the logical framework FO(M). This connection shows that the decompilation method bridges the gap between abstract RASP‑based analyses and concrete mechanistic interpretability.
In summary, the authors provide (1) a provably exact translation from transformers to a RASP dialect, (2) a causal pruning methodology that yields minimal, human‑readable programs, and (3) empirical evidence that length‑generalizing transformers indeed embody simple, extractable algorithms. Limitations include reliance on the linear layer‑norm assumption, focus on small models, and the need for scalable pruning strategies for larger architectures. Future work could extend the approach to non‑linear layer norms, larger language models, and real‑world NLP tasks, potentially offering a powerful tool for mechanistic interpretability and for designing models with provable generalization properties.