Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous ℓ₀-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
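The ℓ₀ certificate via the Hypergeometric distribution can be sketched as follows. In the standard randomized-ablation argument, if an attacker modifies r of the d payload tokens, the probability that a uniformly random size-k kept subset avoids all of them is C(d−r, k) / C(d, k); pessimistically conceding every ablation that touches an attacked token then yields a certified radius. The function names and the exact acceptance condition below are our illustrative reconstruction, not taken verbatim from the paper:

```python
from math import comb

def miss_probability(d: int, r: int, k: int) -> float:
    """P that a uniform size-k subset of d payload tokens avoids all r
    attacked tokens: C(d-r, k) / C(d, k) (hypergeometric mass at zero overlap)."""
    if d - r < k:
        return 0.0
    return comb(d - r, k) / comb(d, k)

def certified_radius(d: int, k: int, p_lower: float) -> int:
    """Largest r for which the smoothed majority class provably wins, under
    the pessimistic assumption that every ablation containing an attacked
    token votes for the adversary: require p_lower - (1 - P_miss) > 1/2."""
    r = 0
    while p_lower - (1.0 - miss_probability(d, r + 1, k)) > 0.5:
        r += 1
    return r
```

For example, with d = 40 payload tokens, k = 8 kept per ablation, and a 0.95 lower bound on the safe-vote rate, this condition certifies robustness up to r = 2 modified tokens; shrinking k enlarges the radius at the cost of sparser context, which is the degradation NAAT is designed to repair.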


💡 Research Summary

The paper tackles the pressing problem of jailbreak attacks on large language models (LLMs), which can force a model to produce harmful content despite safety filters. Existing defenses are largely empirical—perplexity‑based filters, “erase‑and‑check” pipelines, or character‑level randomization—and are quickly circumvented by adaptive gradient‑based attacks such as Greedy Coordinate Gradient (GCG) and AutoDAN. To move beyond this cat‑and‑mouse dynamic, the authors propose a provable defense framework that shifts safety guarantees from a single forward pass to the statistical stability of an ensemble of model evaluations.

The core of the framework is Certified Semantic Smoothing (CSS), a novel adaptation of randomized smoothing for discrete token spaces. Inputs are split into two disjoint sets: immutable structural tokens (system prompts, chat templates, delimiters) and mutable semantic payload tokens (the user query). CSS performs Stratified Randomized Ablation: structural tokens are always retained, while a random subset of k semantic tokens is kept and the rest are masked via a randomized attention mask. This mask prevents the model from attending to omitted tokens without deleting them, thereby preserving positional embeddings and avoiding the out‑of‑distribution symbols that explicit token replacement would introduce.
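The stratified sampling step can be sketched in a few lines. This is a minimal illustration with names of our own choosing (the paper's implementation details are not given here): given a boolean flag per token marking it as structural, we keep every structural token and a uniform random size-k subset of the semantic ones, returning a keep-mask to be used as the attention mask:

```python
import random

def stratified_ablation_mask(structural, k, rng=random):
    """Stratified Randomized Ablation keep-mask (illustrative sketch).

    structural: list of bools, True where the token is immutable
        (system prompt, chat template, delimiters).
    Returns a boolean mask: structural tokens are always kept; exactly
    min(k, #semantic) semantic tokens are kept uniformly at random.
    Masked tokens stay in the sequence (positions preserved) but would
    receive no attention when this mask is applied.
    """
    semantic_idx = [i for i, s in enumerate(structural) if not s]
    kept = set(rng.sample(semantic_idx, min(k, len(semantic_idx))))
    return [s or (i in kept) for i, s in enumerate(structural)]
```

Because omitted tokens are hidden by the mask rather than deleted or replaced, the remaining tokens retain their original positional embeddings, which is the property the paragraph above emphasizes.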

