Fail-Closed Alignment for Large Language Models
We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature, for example via prompt-based jailbreaks, can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
💡 Research Summary
The paper begins by diagnosing a fundamental structural flaw in contemporary large language model (LLM) alignment: the refusal (or safety) mechanisms are “fail‑open.” Although prior work has shown that refusal behavior is distributed across multiple latent features, the authors demonstrate that in practice a single dominant linear direction—identified via a difference‑in‑means (DIM) estimate—accounts for the bulk of the model’s refusal response. By suppressing this direction through prompt‑based jailbreaks (e.g., GCG, AutoDAN, HumanJailbreak, and a template‑based attack), the refusal collapses and the model complies with harmful requests, confirming that current alignment behaves like a single point of failure.
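The difference-in-means estimate mentioned above is simple enough to sketch directly. The following is a minimal NumPy illustration, not the authors' implementation; the array shapes and synthetic activations are assumptions for the example:

```python
import numpy as np

def dim_refusal_direction(harmful_acts: np.ndarray,
                          harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means (DIM) estimate of a refusal direction.

    harmful_acts / harmless_acts: (n_prompts, d_model) hidden states
    collected at a fixed layer and token position (illustrative shapes).
    """
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)  # return a unit vector

# Toy data: harmful activations shifted along the first axis only.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(100, 8)) + np.eye(8)[0] * 5.0
harmless = rng.normal(size=(100, 8))
r = dim_refusal_direction(harmful, harmless)  # r concentrates on axis 0
```

On this toy data the recovered unit vector points almost entirely along the axis that separates the two activation clusters, which is the intuition behind using DIM as a cheap first estimate of the dominant refusal direction.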
To address this, the authors introduce the design principle of “fail‑closed alignment.” The principle requires that safety be encoded redundantly across several causally independent pathways so that partial failures do not automatically lead to unsafe behavior. They instantiate this principle with a progressive alignment algorithm that iteratively discovers, removes, and replaces refusal directions, thereby forcing the model to construct new, independent refusal subspaces at each step.
The algorithm proceeds as follows:
1. Identify the next dominant refusal direction – Using Refusal Direction Optimization (RDO), a gradient‑based method that outperforms simple activation‑space estimates, the algorithm finds a direction r_k that maximally influences refusal when added to or projected out of hidden states. The initial seed for RDO is the DIM estimate, and orthogonalization ensures linear independence from previously discovered directions.
2. Construct a multi‑feature ablation (MFA) operator – All identified directions {r_1, …, r_k} are stacked into a matrix R_k, orthogonalized via QR decomposition to obtain Q_k, and then an orthogonal projection h → h − Q_k Q_kᵀ h is applied to every hidden state during forward passes on harmful prompts. This simultaneously suppresses every known refusal direction, preventing the model from relying on any of them.
3. Learn a new refusal mechanism under ablation – The model is fine‑tuned on a safety dataset (CircuitBreaker adversarial prompts) and a utility dataset (Alpaca instructions plus XSTest benign prompts) with a combined loss L_safe + λ·L_util. The safety loss encourages the model to refuse harmful inputs even when the MFA operator is active, while the utility loss preserves performance on benign tasks.
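The MFA operator from step 2 reduces to standard linear algebra and can be sketched in NumPy. This is a minimal illustration of the QR-based projection, assuming the directions are stacked as rows; it is not the authors' training code:

```python
import numpy as np

def mfa_projector(directions: np.ndarray) -> np.ndarray:
    """Build the multi-feature ablation projector from stacked directions.

    directions: (k, d_model) matrix R_k of discovered refusal directions.
    Returns P such that P @ h removes every component of h in span(R_k),
    i.e. h -> h - Q_k Q_k^T h with Q_k an orthonormal basis from QR.
    """
    q, _ = np.linalg.qr(directions.T)       # Q_k: (d_model, k), orthonormal columns
    d = directions.shape[1]
    return np.eye(d) - q @ q.T

# Toy example: three refusal directions in a 16-dimensional hidden space.
rng = np.random.default_rng(1)
R = rng.normal(size=(3, 16))
P = mfa_projector(R)
h = rng.normal(size=16)                     # an arbitrary hidden state
h_abl = P @ h                               # hidden state with all three suppressed
```

After projection, the ablated hidden state has zero component along every known refusal direction at once, which is what forces the model in step 3 to build a genuinely new pathway rather than re-amplifying an old one.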
The loop repeats for K iterations (the authors use K ≈ 4). After each iteration the model possesses an additional independent refusal direction, yielding a set of K mutually orthogonal mechanisms. Empirically, this approach reduces attack success rates (ASR) across four jailbreak attacks by 92–97% compared to baseline fine‑tuning, while achieving the highest compliance rate on benign prompts among robust methods (average 86%). Moreover, a LoRA‑based low‑rank fine‑tuning with only ~5% of the parameters matches full‑model performance, indicating modest computational overhead.
Mechanistic analyses confirm the core claim: cosine similarity between the learned directions is near zero, and each jailbreak only attenuates a subset of them, never all simultaneously. Ablation studies show that removing any single direction does not dramatically increase ASR, whereas removing the entire set does, evidencing true redundancy. In contrast, prior defenses that merely strengthen a single refusal direction or perform adversarial training still collapse when the dominant direction is suppressed.
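The redundancy argument can be made concrete with a toy numerical example. Everything below is synthetic: the "refusal score" is a hypothetical proxy (maximum activation along any refusal direction), not a measurement from the paper, and the orthonormal directions stand in for the learned mechanisms:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 32, 4
q, _ = np.linalg.qr(rng.normal(size=(d, k)))
dirs = q.T                                   # (k, d): k orthonormal refusal directions

def refusal_score(h: np.ndarray, directions: np.ndarray) -> float:
    """Toy proxy: refusal survives as long as any direction still fires."""
    return float(np.abs(directions @ h).max())

def ablate(h: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Suppress the components of h along the given (row-orthonormal) directions."""
    return h - directions.T @ (directions @ h)

h = dirs.sum(axis=0)                          # hidden state activating all k mechanisms
score_full = refusal_score(h, dirs)                  # all pathways intact
score_one = refusal_score(ablate(h, dirs[:1]), dirs) # attack suppresses one pathway
score_all = refusal_score(ablate(h, dirs), dirs)     # attack suppresses all pathways
```

Suppressing a single direction leaves the score unchanged because the remaining orthogonal pathways still fire; only ablating the entire set drives it to zero, mirroring the paper's finding that removing any one direction barely moves ASR while removing the whole set does.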
The paper’s contributions are threefold: (1) a quantitative demonstration that current LLM alignment is fail‑open; (2) the formulation of fail‑closed alignment as a principled safety design; (3) a concrete, scalable training framework that builds multiple independent refusal pathways and validates its effectiveness both empirically and mechanistically. The work suggests a shift in LLM safety research from “training more data” toward “architecturally structuring safety,” offering a promising route to robust, trustworthy language models.