SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models
Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
💡 Research Summary
The paper tackles the longstanding problem of interpreting the internal representations of large language models (LLMs) by focusing on the features extracted by sparse autoencoders (SAEs). While recent work such as Neuronpedia uses off‑the‑shelf LLMs (e.g., GPT‑4, Claude 4.5) to generate a single natural‑language description for each SAE feature, this approach suffers from two major drawbacks: (1) lack of consistency—different LLMs often produce divergent explanations for the same feature, and (2) inability to capture polysemantic features that activate for multiple distinct patterns, because only one explanation is provided.
To overcome these limitations, the authors propose SAGE (SAE Agentic Explainer), an agent‑based framework that reframes feature interpretation as an active, hypothesis‑testing loop rather than a passive, one‑shot generation task. The system consists of four specialized agents, all powered by the same high‑capacity LLM (GPT‑5):
- Explainer – extracts the top‑k (k = 10) text snippets that most strongly activate a target SAE feature f_j and, using a prompt (P_init), generates n (n = 4) initial, testable hypotheses H_i about what semantic or structural property might cause the activation.
- Designer – for each hypothesis H_i, creates a concrete test sentence T_i (via prompt P_test) that should, if the hypothesis is correct, produce a high activation of f_j when processed by the target LLM.
- Analyzer – runs T_i through the target LLM, passes the resulting hidden state through the SAE encoder, and records the activation magnitude a_i = SAE_j(TargetLLM(T_i)).
- Reviewer – evaluates the activation evidence (using prompt P_analyze) and decides one of four state transitions for each hypothesis: Accept, Reject, Refine, or Refute.
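The Analyzer's measurement step, a_i = SAE_j(TargetLLM(T_i)), can be sketched as follows. This is a minimal sketch with a standard ReLU SAE encoder; the function names and toy dimensions are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

def sae_activation(hidden_state, W_enc, b_enc, j):
    """Encode a hidden state with the SAE encoder and return feature j's activation.

    hidden_state: (d_model,) residual-stream vector from the target LLM
    W_enc: (d_model, d_sae) encoder weights; b_enc: (d_sae,) encoder bias
    """
    features = np.maximum(hidden_state @ W_enc + b_enc, 0.0)  # ReLU SAE encoder
    return float(features[j])

# Toy example: a 4-dim hidden state and an 8-feature SAE
rng = np.random.default_rng(0)
h = rng.normal(size=4)            # stand-in for TargetLLM(T_i)
W = rng.normal(size=(4, 8))
b = np.zeros(8)
a_j = sae_activation(h, W, b, j=3)  # activation evidence a_i for test text T_i
```

The ReLU guarantees non-negative activations, so a hypothesis's evidence reduces to how far above zero the target feature fires.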
The loop iterates: accepted hypotheses are kept as final explanations; rejected ones are discarded; refined hypotheses are updated (both the textual description and the test sentence) to better match the observed activation pattern; refuted hypotheses trigger alternative probing to understand why the prediction failed. All evidence (hypothesis, test text, activation) is accumulated in a history set E(t), which informs subsequent iterations. The process terminates when all hypotheses reach a terminal state or a maximum number of turns is reached.
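The loop above can be sketched as a small state machine. The agent calls are abstracted as plain callables here; all names are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass, field

TERMINAL = {"accept", "reject"}

@dataclass
class Hypothesis:
    description: str          # candidate explanation H_i
    test_text: str = ""       # current test sentence T_i
    state: str = "open"       # open | accept | reject
    evidence: list = field(default_factory=list)  # history E(t): (T_i, a_i) pairs

def run_sage_loop(hypotheses, design, measure, review, refine, max_turns=8):
    """Iterate design -> measure -> review until all hypotheses are terminal."""
    for _ in range(max_turns):
        active = [h for h in hypotheses if h.state not in TERMINAL]
        if not active:
            break
        for h in active:
            h.test_text = design(h)               # Designer: craft T_i from H_i
            activation = measure(h.test_text)     # Analyzer: a_i = SAE_j(LLM(T_i))
            h.evidence.append((h.test_text, activation))
            verdict = review(h, activation)       # Reviewer: accept/reject/refine/refute
            if verdict in ("refine", "refute"):
                h.description = refine(h)         # update H_i from activation feedback
            else:
                h.state = verdict                 # terminal: kept or discarded
    return [h for h in hypotheses if h.state == "accept"]
```

A usage sketch with stub agents: `run_sage_loop([Hypothesis("caps")], design=..., measure=..., review=..., refine=...)` returns the accepted hypotheses, whose accumulated `evidence` lists feed the final synthesis step.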
After convergence, the Reviewer synthesizes the accepted hypotheses and their supporting evidence into a final natural‑language explanation E. For monosemantic features this yields a single, well‑validated description; for polysemantic features it produces multiple distinct facets, each with concrete trigger examples.
Experimental Setup
The authors evaluate SAGE on three open‑source LLMs—Qwen‑3‑4B, Gemma‑2‑2B, and GPT‑OSS‑20B—sampling four transformer layers (3, 7, 11, 23) from each model. From each layer they randomly select 10 SAE features out of d_sae = 16,000, for 10 features × 4 layers × 3 models = 120 features in total. All agents use GPT‑5, ensuring a fair comparison with the baseline. The baseline is Neuronpedia, which also uses GPT‑5 for explanation generation but relies on a single‑pass approach. The authors keep the experimental conditions identical across methods: same top‑k activation exemplars, same LLM for generating text, same activation measurement API, and identical test‑sentence sampling procedures.
Evaluation Metrics
Two complementary metrics are introduced:
- Generative Accuracy – measures causal validity: given an explanation, can one generate novel text that reliably activates the target feature?
- Predictive Accuracy – measures descriptive power: how well does the explanation predict feature activations on held‑out data?
Implementation details are in the appendix.
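Since the exact scoring is deferred to the appendix, the following is one plausible formulation, an assumption rather than the paper's definition: generative accuracy as the fraction of explanation-conditioned generations that push the feature above an activation threshold, and predictive accuracy as agreement between predicted and observed firing on held-out text:

```python
def generative_accuracy(activations, threshold):
    """Fraction of generated texts whose feature activation exceeds threshold.

    Assumed formulation: the paper's exact definition is in its appendix.
    """
    if not activations:
        return 0.0
    return sum(a > threshold for a in activations) / len(activations)

def predictive_accuracy(predicted, observed, threshold):
    """Fraction of held-out texts where the explanation-based prediction and
    the observed activation agree on whether the feature fires."""
    hits = sum((p > threshold) == (o > threshold)
               for p, o in zip(predicted, observed))
    return hits / len(observed)

# Toy example with made-up activation values
gen = generative_accuracy([0.9, 0.1, 0.7, 0.8], threshold=0.5)        # -> 0.75
pred = predictive_accuracy([0.6, 0.2, 0.9], [0.7, 0.1, 0.4], threshold=0.5)
```

Framing both metrics as thresholded agreement keeps them comparable across models whose raw activation scales differ.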
Results
Table 2 shows that SAGE consistently outperforms Neuronpedia across all models and layers. Generative accuracy improvements range from 29 % to 458 %, with the most dramatic gains at deeper layers where Neuronpedia’s performance collapses (e.g., layer 23: 0.67 vs 0.12). Predictive accuracy also improves modestly but consistently (0.65–0.83 for SAGE vs 0.52–0.70 for Neuronpedia). Qualitative analysis reveals that SAGE can uncover nuanced, conditional behaviors such as “activates on capitalized proper‑name tokens with a strong lexical bias” or “responds to brand‑specific morphemes”, which single‑pass methods miss.
Key Insights and Contributions
- Methodologically, SAGE introduces a scientific‑style hypothesis‑testing cycle into neural interpretability, turning explanations into empirically validated claims.
- By maintaining multiple parallel hypotheses, the framework naturally captures polysemantic features and provides multi‑faceted explanations.
- The feedback‑driven refinement loop yields explanations that are both more consistent (different runs converge on the same validated hypothesis) and more actionable (they come with concrete test sentences that reliably trigger the feature).
- Empirically, SAGE demonstrates substantial gains in both generative and predictive metrics across a diverse set of LLMs and transformer depths.
Limitations and Future Work
The current implementation relies on a single large LLM (GPT‑5) for all agents, which is computationally expensive and may limit scalability. The search space for test sentences can grow rapidly for complex hypotheses, suggesting a need for more efficient probing strategies. Moreover, the study evaluates only 10 features per layer; extending to a larger portion of the 16 K‑dimensional feature space will be necessary to assess coverage. Finally, integrating cheaper, specialized probing models or reinforcement‑learning‑based test generation could reduce cost while preserving the rigorous scientific workflow.
Conclusion
SAGE redefines SAE feature interpretation as an active, agentic process that iteratively formulates, tests, and refines multiple explanations using direct activation feedback. This approach yields explanations that are empirically grounded, more consistent, and capable of handling polysemantic phenomena. The reported improvements in both generative and predictive accuracy suggest that SAGE sets a new benchmark for interpretability of large language models, and its hypothesis‑testing paradigm may inspire future work across a broad range of neural‑network analysis tasks.