GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Reading time: 6 minutes

📝 Original Info

  • Title: GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
  • ArXiv ID: 2512.06655
  • Date: 2025-12-07
  • Authors: Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri

📝 Abstract

Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Many defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extends SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs that assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining >= 90% refusal of harmful content.
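The graph-Laplacian smoothness penalty described above can be sketched as a toy training objective. The snippet below is a minimal, hypothetical illustration, not the paper's implementation: the weights are random, the co-activation graph is one simple choice (co-firing frequency), and the penalty coefficients `lam1`/`lam2` are invented. It combines the standard SAE reconstruction and L1 sparsity terms with a Dirichlet-energy term tr(Z L Zᵀ) over the neuron co-activation graph, which is small when co-activating latent neurons take similar values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): d = hidden size, k = SAE latent width
d, k, n = 16, 32, 64
H = rng.normal(size=(n, d))            # batch of LLM hidden activations

# Hypothetical SAE parameters (random, untrained, for illustration only)
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)
W_dec = rng.normal(size=(k, d)) / np.sqrt(k)

Z = np.maximum(H @ W_enc, 0.0)         # sparse latent codes (ReLU encoder)
H_hat = Z @ W_dec                      # reconstruction

# Co-activation graph over the k latent neurons: edge weight = fraction of
# samples on which two features fire together (one simple construction;
# the paper's exact graph may differ).
A = (Z > 0).astype(float)
W_graph = (A.T @ A) / n
np.fill_diagonal(W_graph, 0.0)

# Graph Laplacian L = D - W; tr(Z L Z^T) is the Dirichlet energy of the
# codes on the graph, penalizing dissimilar values on strongly linked neurons.
L = np.diag(W_graph.sum(axis=1)) - W_graph
recon = np.mean((H - H_hat) ** 2)      # SAE reconstruction term
sparsity = np.abs(Z).mean()            # L1 sparsity term
smooth = np.trace(Z @ L @ Z.T) / n     # Laplacian smoothness term

lam1, lam2 = 1e-3, 1e-3                # hypothetical penalty weights
loss = recon + lam1 * sparsity + lam2 * smooth
print(loss)
```

In a real training loop these terms would be minimized jointly by gradient descent over `W_enc` and `W_dec`; the sketch only evaluates the objective once to show how the Laplacian term enters alongside the usual SAE losses.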


📄 Full Content

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Jehyeok Yeon (University of Illinois Urbana-Champaign, jehyeok2@illinois.edu), Federico Cinus (CENTAI Institute, federico.cinus@centai.eu), Yifan Wu (University of Southern California, ywu32880@usc.edu), Luca Luceri (University of Southern California, Information Sciences Institute, lluceri@isi.edu)

Abstract. Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Many defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs that assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining ≥90% refusal of harmful content.

Preprint. arXiv:2512.06655v2 [cs.LG] 4 Feb 2026

1 Introduction

Modern large language models (LLMs) excel at diverse tasks like question answering and reasoning (Touvron et al., 2023), yet their deployment faces significant safety challenges. LLMs can be manipulated into generating harmful content through adversarial prompts and jailbreak attacks (Wei et al., 2023). Effective defenses must both block unsafe generations and preserve the model's utility on benign queries (Ganguli et al., 2022).

Existing safety approaches generally fall into two categories: black-box guardrails and internals-based methods. Black-box guardrails, such as prompt engineering (Bai et al., 2022) or output classifiers (Inan et al., 2023), offer quick defenses but are often brittle to distributional shifts (Zou et al., 2023) and lack interpretability. Internals-based methods (Turner et al., 2023) aim to leverage the model's hidden representations. Sparse autoencoders (SAEs) have become a prominent tool in this category, allowing the decomposition of hidden activations into sparse, often interpretable, latent features (Cunningham et al., 2023; Templeton et al., 2024; Bricken et al., 2023).

Despite their utility for interpreting concrete concepts, standard SAEs may have limitations when applied to complex domains like time or safety. This is because SAEs are inherently local, encouraging each latent dimension to represent a single "monosemantic" feature. This often fragments an abstract concept into disconnected sub-concepts (like 'refusal' or 'danger') or creates redundant features that overlap in meaning, failing to learn a coherent representation (Bricken et al., 2023; Pach et al., 2025).

Recent studies highlight this representational gap for abstract concepts. While concrete concepts (e.g., objects) often align with single, axis-like features, higher-level abstract concepts are typically encoded in a distributed and nonlinear fashion (Liao et al., 2023). For instance, temporal concepts manifest as nonlinear circular manifolds (Engels et al., 2025), and refusal behavior involves multiple independent directions and nonlinear geometries (Wollschläger et al., 2025; Hildebrandt et al., 2025). This evidence suggests that abstract concepts are better modeled as distributed properties. We argue that safety, as an abstract, socially grounded concept dependent on context and human judgment (Slavich, 2023), requires a distributed representation.

Our proposed approach. To model safety as a distributed concept, we introduce the Graph-Regularized Sparse Autoencoder (GSAE). GSAE extends standard SAEs by incorporating a graph Laplacian regularizer (Belkin et al., 2006). This treats each neuron as a node, with edges defined by activation similarity. The Laplacian penalty enforces smoothness acro

…(Full text truncated)…
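The two-stage gating mechanism from the abstract can also be sketched. The code below is a simplified, hypothetical illustration: it uses a single safety direction (the paper assembles a weighted set of safety-relevant directions), and the threshold `tau`, the steering strength `alpha`, and the helper names `risk_score`/`gated_steer` are all invented for this sketch. The idea is that a latent code is scored against the safety direction, benign codes pass through untouched, and codes exceeding the threshold are steered away from the harmful direction.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 32

# Hypothetical safety direction in GSAE latent space (unit-normalized).
w_safety = rng.normal(size=k)
w_safety /= np.linalg.norm(w_safety)

def risk_score(z):
    """Projection of a latent code onto the safety direction."""
    return float(z @ w_safety)

def gated_steer(z, tau=1.0, alpha=1.0):
    """Gated intervention: steer only when the risk score exceeds tau.

    In a two-stage scheme, stage 1 applies this test to the prompt's code
    and stage 2 applies it to each generated token's code; alpha scales
    the steering strength (alpha=1 removes the harmful component entirely).
    """
    s = risk_score(z)
    if s <= tau:
        return z                       # benign: leave activations untouched
    return z - alpha * s * w_safety    # subtract the harmful component

# Deterministic toy codes: a weakly and a strongly safety-aligned latent.
z_benign = 0.1 * w_safety
z_harmful = 3.0 * w_safety

print(risk_score(z_benign), risk_score(z_harmful))
print(risk_score(gated_steer(z_harmful)))
```

With `alpha=1.0` the intervention projects out the safety-direction component, so a steered harmful code ends up with a risk score of zero while benign codes are returned unchanged; smaller `alpha` values would trade refusal strength against disruption of the rest of the representation.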

📸 Image Gallery

GSAE.png · bar.png · dirichlet_energy_pdf_cdf.png · risk_score_distribution.png · scatter.png · violin.png

Reference

This content is AI-processed based on ArXiv data.
