📝 Original Info
- Title: GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
- ArXiv ID: 2512.06655 (v2, cs.LG)
- Date: 2025-12-07 (v2: 2026-02-04)
- Authors: Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri
📝 Abstract
Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Existing defenses are typically either black-box guardrails that filter outputs or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting: recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs, which assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining ≥90% refusal of harmful content.
📄 Full Content
GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
Jehyeok Yeon
University of Illinois Urbana-Champaign
jehyeok2@illinois.edu
Federico Cinus
CENTAI Institute
federico.cinus@centai.eu
Yifan Wu
University of Southern California
ywu32880@usc.edu
Luca Luceri
University of Southern California
Information Sciences Institute
lluceri@isi.edu
Abstract
Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Existing defenses are typically either black-box guardrails that filter outputs or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting: recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs, which assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining ≥90% refusal of harmful content.
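The two-stage gating described in the abstract can be sketched in miniature: score an activation against a weighted set of safety directions, and steer only when the score crosses a threshold. This is an illustrative reconstruction, not the authors' implementation; the function names, threshold, and steering coefficient `alpha` are all assumptions.

```python
import math

def safety_score(activation, directions, weights):
    """Stage 1: weighted projection of a hidden activation onto safety-relevant directions."""
    score = 0.0
    for w, d in zip(weights, directions):
        norm = math.sqrt(sum(x * x for x in d))
        score += w * sum(a * x for a, x in zip(activation, d)) / norm
    return score

def gated_steer(activation, directions, weights, threshold=1.0, alpha=4.0):
    """Stage 2: intervene along the weighted directions only if the score flags harm."""
    if safety_score(activation, directions, weights) < threshold:
        return activation  # benign prompt/continuation: leave the model untouched
    # flagged as harmful: push the activation toward refusal
    steered = list(activation)
    for w, d in zip(weights, directions):
        for i, x in enumerate(d):
            steered[i] += alpha * w * x
    return steered
```

The point of the gate is the selectivity reported in the abstract: benign activations pass through unchanged (preserving task accuracy), and the steering cost is paid only on flagged inputs.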
1 Introduction
Modern large language models (LLMs) excel at diverse tasks like question answering and reasoning (Touvron et al., 2023), yet their deployment faces significant safety challenges. LLMs can be manipulated into generating harmful content through adversarial prompts and jailbreak attacks (Wei et al., 2023). Effective defenses must both block unsafe generations and preserve the model's utility on benign queries (Ganguli et al., 2022).
Existing safety approaches generally fall into two categories: black-box guardrails and internals-based methods. Black-box guardrails, such as prompt engineering (Bai et al., 2022) or output classifiers (Inan et al., 2023), offer quick defenses but are often brittle to distributional shifts (Zou et al., 2023) and lack interpretability. Internals-based methods (Turner et al., 2023) aim to leverage the model's hidden representations. Sparse autoencoders (SAEs) have become a prominent tool in this category, allowing the decomposition of hidden activations into sparse, often interpretable, latent features (Cunningham et al., 2023; Templeton et al., 2024; Bricken et al., 2023).
Despite their utility for interpreting concrete concepts, standard SAEs may have limitations when applied to complex domains like time or safety. This is because SAEs are inherently local, encouraging each latent dimension to represent a single "monosemantic" feature. As a result, a complex concept can be fragmented into disconnected sub-concepts (like 'refusal' or 'danger') or spread across redundant features that overlap in meaning, so the model fails to learn a coherent representation (Bricken et al., 2023; Pach et al., 2025).
Recent studies highlight this representational gap for abstract concepts. While concrete concepts (e.g., objects) often align with single, axis-like features, higher-level abstract concepts are typically encoded in a distributed and nonlinear fashion (Liao et al., 2023). For instance, temporal concepts manifest as nonlinear circular manifolds (Engels et al., 2025), and refusal behavior involves multiple independent directions and nonlinear geometries (Wollschläger et al., 2025; Hildebrandt et al., 2025). This evidence suggests that abstract concepts are better modeled as distributed properties. We argue that safety, as an abstract, socially grounded concept dependent on context and human judgment (Slavich, 2023), requires a distributed representation.
Our proposed approach. To model safety as a distributed concept, we introduce the Graph-Regularized Sparse Autoencoder (GSAE). GSAE extends standard SAEs by incorporating a graph Laplacian regularizer (Belkin et al., 2006). This treats each neuron as a node, with edges defined by activation similarity. The Laplacian penalty enforces smoothness acro
…(Full text truncated)…
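The graph Laplacian penalty introduced just before the truncation has a standard form: for a symmetric co-activation weight matrix A, degree matrix D, and Laplacian L = D − A, the penalty z^T L z equals ½ Σ_ij A_ij (z_i − z_j)², so strongly connected neurons are pulled toward similar activation levels. A minimal sketch of how such an objective could be assembled, under assumed hyperparameter names (`lam_sparse`, `lam_graph` are not the paper's notation):

```python
def laplacian_penalty(z, A):
    """z^T L z via the equivalent pairwise form (1/2) * sum_ij A_ij * (z_i - z_j)^2,
    which holds for L = D - A with symmetric A."""
    n = len(z)
    return 0.5 * sum(A[i][j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))

def gsae_loss(x, x_hat, z, A, lam_sparse=1e-3, lam_graph=1e-2):
    """Reconstruction + L1 sparsity (the standard SAE terms) + graph smoothness
    (the GSAE addition): coupled features are encouraged to co-activate coherently."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return recon + lam_sparse * sparsity + lam_graph * laplacian_penalty(z, A)
```

The pairwise form makes the intended effect visible: the penalty is zero only when every pair of connected features agrees, which is how the regularizer turns isolated monosemantic features into the smooth, distributed patterns the paper describes.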
This content is AI-processed based on ArXiv data.