Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric, leaving vulnerabilities in non-English contexts, especially for low-resource languages. We introduce a novel application of knowledge distillation (KD) to multilingual jailbreak prevention and examine its efficacy. Using Low-Rank Adaptation (LoRA), we distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, fine-tuning on ~28,000 multilingual jailbreak prompts from XSafety via black-box, response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases the Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments also reveal divergent generalization to unseen languages during distillation, with outcomes varying by base model. By removing a primary source of safety degradation, nuanced "boundary" refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
💡 Research Summary
The paper investigates whether response-based knowledge distillation (KD) can improve multilingual jailbreak resistance in large language models (LLMs). The authors use OpenAI's proprietary o1-mini model as a teacher, generating safe refusal responses to roughly 28,000 jailbreak prompts drawn from the XSafety dataset, which covers ten languages and fourteen safety categories. These prompt–response pairs form a supervised "distillation dataset." Three open-source student models—Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B—are fine-tuned on this dataset using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) method that updates only about 0.5% of the model's parameters (rank 16, scaling factor 32). Training runs for two epochs with a learning rate of 2e-4 on an H100 GPU.
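To see why LoRA updates only a tiny fraction of the weights, the parameter accounting can be sketched in a few lines. This is a purely illustrative calculation, not the paper's exact configuration: the hidden size, layer count, and choice of adapted projections are assumptions; only the rank (16) comes from the paper.

```python
# Hypothetical LoRA parameter accounting (dimensions are illustrative).
def lora_params(d_in, d_out, rank):
    """Trainable params for one LoRA adapter: A (d_in x rank) + B (rank x d_out)."""
    return d_in * rank + rank * d_out

# Assume rank-16 adapters on the q/k/v/o projections of a 32-layer model
# with hidden size 4096 -- plausible for an ~8B model, but an assumption here.
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # q, k, v, o projections
trainable = layers * per_layer
total = 8_000_000_000                               # ~8B base parameters
print(f"trainable: {trainable:,} ({100 * trainable / total:.2f}% of base)")
```

Under these assumptions the adapters hold roughly 17M parameters, on the order of a few tenths of a percent of the base model, consistent with the "about 0.5%" figure quoted above.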
Evaluation is performed on the MultiJail benchmark, which contains 3,150 prompts across ten languages (including low-resource Swahili and Javanese, neither present in the distillation data) and eighteen safety scenarios. GPT-4o acts as an automated judge, classifying model outputs as "safe," "unsafe," or "invalid." The primary metric is Jailbreak Success Rate (JSR), the proportion of prompts that elicit unsafe responses (lower is better).
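The JSR metric as described reduces to a simple ratio over the judge's labels. A minimal sketch (the label list here is invented for illustration; real judgments would come from GPT-4o):

```python
# Minimal sketch of the Jailbreak Success Rate (JSR) as described above.
def jailbreak_success_rate(judgments):
    """Fraction of prompts whose responses were judged 'unsafe' (lower is better)."""
    return sum(1 for j in judgments if j == "unsafe") / len(judgments)

# Hypothetical judge labels for five prompts.
labels = ["safe", "unsafe", "safe", "invalid", "safe"]
print(f"JSR = {jailbreak_success_rate(labels):.1%}")  # 1 of 5 unsafe -> 20.0%
```

Note that "invalid" outputs count in the denominator but not the numerator, so a model that produces garbled refusals can lower its JSR without actually being safer.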
Results show a consistent degradation of safety after distillation. The teacher o1‑mini maintains a low baseline JSR of 3.1 %. All student models experience higher JSRs: Meta‑Llama‑3‑8B‑Instruct rises from 12.5 % to 13.9 % (+1.4 pp), Gemma‑2‑2B‑IT from 5.0 % to 21.6 % (+16.6 pp), and Qwen3‑8B from 5.7 % to 8.3 % (+2.6 pp). Increases are observed across high‑, medium‑, and low‑resource language groups, with the most severe impact on the smallest model (Gemma‑2‑2B‑IT). Larger variants (Llama‑2‑13B‑chat‑hf, Gemma‑3‑12B‑IT, Qwen3‑14B) also show JSR growth, though the magnitude is somewhat mitigated.
The authors attribute the safety drop to three intertwined factors: (1) “boundary” data—prompts that sit near the safe/unsafe decision line—lead the student models to over‑generalize, producing more “invalid” or unsafe outputs; (2) the teacher’s own latent vulnerabilities are transferred because response‑based KD relies on hard text labels rather than richer soft logits; and (3) catastrophic forgetting caused by freezing most of the base model during LoRA fine‑tuning, which erodes previously learned safety heuristics.
A purification experiment removes the boundary data from the distillation set. This mitigates the degradation: Gemma‑2‑2B‑IT’s JSR drops by ~14 pp and Qwen3‑8B’s by ~1.7 pp, while Meta‑Llama‑3‑8B‑Instruct shows modest gains in a few languages. However, all models suffer a decline in reasoning performance on GSM8K, highlighting a trade‑off between safety and general capability.
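The purification step amounts to filtering the distillation set before fine-tuning. A minimal sketch, with the caveat that the `boundary` flag and the examples are hypothetical; the paper does not specify its exact flagging criterion here:

```python
# Hypothetical purification of a distillation dataset: drop examples
# flagged as "boundary" (near the safe/unsafe decision line).
dataset = [
    {"prompt": "p1", "response": "refusal 1", "boundary": False},
    {"prompt": "p2", "response": "refusal 2", "boundary": True},   # dropped
    {"prompt": "p3", "response": "refusal 3", "boundary": False},
]

purified = [ex for ex in dataset if not ex["boundary"]]
print(f"kept {len(purified)} of {len(dataset)} examples")
```

Students are then fine-tuned on `purified` instead of the full set, which is what yields the ~14 pp JSR recovery for Gemma-2-2B-IT reported above.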
The study also reveals poor zero‑shot generalization to languages absent from the distillation data, especially low‑resource languages, suggesting that multilingual safety knowledge does not fully transfer through simple response‑based KD.
In the broader context, prior work has shown KD can improve safety in English-centric settings, but recent findings (e.g., a concurrent ICLR 2026 submission) indicate that both response-based and logit-based distillation may harm safety. This paper corroborates those observations in a multilingual scenario and emphasizes the need for more sophisticated approaches: soft-label distillation, multi-teacher ensembles, careful data curation, and language-specific adaptation.
Overall, the paper provides a cautionary empirical account: naïvely applying response‑based knowledge distillation to enhance multilingual jailbreak resistance can backfire, degrading safety while preserving or slightly improving performance in narrow cases. Future work must address the identified failure modes to develop robust, scalable safety alignment techniques for LLMs across the full spectrum of world languages.