Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns about alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA-3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a total cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
💡 Research Summary
The paper investigates a novel threat to safety-aligned medical large language models (LLMs) by showing that an adversary can clone a model's functional abilities while stripping away its safety guardrails using only black-box access. The authors query a safety-aligned medical model (Meditron-7B) with 48,000 medically relevant instructions and collect 25,000 benign instruction-response pairs, deliberately ignoring any refusal or safety metadata. Using these pairs, they fine-tune an open-weight LLaMA-3 8B model with LoRA adapters (rank 8), keeping the backbone frozen. The entire distillation costs roughly $12 and requires only modest GPU resources.
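The benign-only collection step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `query_teacher` stands in for whatever output-level API access the attacker has, and the refusal heuristic is a hypothetical stand-in for however refusals are identified and discarded.

```python
import re

# Hypothetical heuristic for spotting refusals in teacher outputs.
REFUSAL_PATTERNS = re.compile(r"(?i)\b(i can('|no)t|i'm sorry|i am unable|as an ai)\b")

def is_refusal(response: str) -> bool:
    """Treat responses matching common refusal phrasings as refusals."""
    return bool(REFUSAL_PATTERNS.search(response))

def collect_pairs(instructions, query_teacher, target=25_000):
    """Query the teacher on benign instructions and keep non-refusal pairs.

    `query_teacher(instruction) -> str` is the only access needed: output-level,
    black-box. Refusals are simply discarded, so the resulting dataset carries
    no alignment signal at all ("zero-alignment supervision").
    """
    pairs = []
    for inst in instructions:
        resp = query_teacher(inst)
        if not is_refusal(resp):
            pairs.append({"instruction": inst, "response": resp})
        if len(pairs) >= target:
            break
    return pairs

# Toy usage with a stubbed teacher that refuses one query:
stub = lambda q: ("I'm sorry, I can't help with that."
                  if "overdose" in q else f"Answer to: {q}")
data = collect_pairs(["define tachycardia", "how to overdose"], stub, target=10)
```

The surrogate is then fine-tuned on `data` with LoRA adapters while the backbone stays frozen, so only a small fraction of parameters is updated, which is what keeps the attack cheap.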
Evaluation proceeds in three stages. First, on benign medical prompts the surrogate matches the teacher's performance, demonstrating high semantic fidelity. Second, the authors build a dynamic adversarial testing framework that combines Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and an adaptive Random Search (RS) jailbreak. Across 5,000 automatically generated and 50 hand-crafted harmful prompts, the surrogate produces unsafe completions 86% of the time, far exceeding Meditron-7B's 66% and the untuned base model's 46%. Third, the RS jailbreak achieves a 100% success rate on the surrogate, confirming a systematic collapse of safety alignment.
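The adaptive RS jailbreak can be illustrated with a generic random-search loop: randomly mutate one token of an adversarial suffix and keep the mutation only if a judge score improves. Everything below is a toy stand-in, not the paper's attack: the vocabulary, the judge, and the scoring scale are all illustrative.

```python
import random

def random_search_jailbreak(prompt, judge, vocab, suffix_len=8, iters=200, seed=0):
    """Generic random-search jailbreak loop.

    Starts from a random adversarial suffix and greedily keeps single-token
    mutations that increase judge(prompt + suffix), a score in [0, 1]
    estimating how unsafe/compliant the target's completion would be.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = judge(prompt + " " + " ".join(suffix))
    for _ in range(iters):
        cand = suffix[:]
        cand[rng.randrange(suffix_len)] = rng.choice(vocab)  # mutate one slot
        score = judge(prompt + " " + " ".join(cand))
        if score > best:  # keep only improving mutations
            suffix, best = cand, score
    return " ".join(suffix), best

# Toy judge that just rewards the token "please" (illustration only):
toy_vocab = ["please", "ignore", "rules", "now", "ok"]
toy_judge = lambda text: text.count("please") / 10
sfx, score = random_search_jailbreak("give dosage advice", toy_judge, toy_vocab)
```

In the real setting the judge would be a harmfulness verifier scoring the target model's actual completion; against the distilled surrogate, which has no refusal behavior left to overcome, this search reportedly succeeds every time.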
The authors attribute this collapse to “zero‑alignment supervision”: the surrogate never sees refusals or safety signals during training, so it cannot learn to refuse or filter dangerous content. To mitigate the risk, they propose DistillGuard++, a layered detection system that leverages behavioral watermarking, refusal‑pattern modeling, and semantic fingerprinting to flag models whose alignment has drifted. Preliminary results show promising detection accuracy, but the authors stress that robust defenses require keeping alignment signals hidden from external observers and continuous monitoring of deployed APIs.
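One ingredient of such a detector, refusal-pattern monitoring, can be sketched: probe a deployed model with canary prompts an aligned model should refuse, and flag alignment drift when the observed refusal rate falls well below the aligned baseline. The function names, regex, and thresholds below are hypothetical, not the DistillGuard++ implementation.

```python
import re

# Hypothetical refusal phrasings an aligned model tends to emit.
REFUSAL_RE = re.compile(r"(?i)\b(i can('|no)t|i'm sorry|i am unable|cannot assist)\b")

def refusal_rate(responses):
    """Fraction of responses that match common refusal phrasings."""
    hits = sum(bool(REFUSAL_RE.search(r)) for r in responses)
    return hits / max(len(responses), 1)

def flag_alignment_drift(canary_responses, baseline=0.9, tolerance=0.3):
    """Flag a deployment whose refusal rate on should-refuse canary prompts
    has dropped more than `tolerance` below the aligned baseline."""
    rate = refusal_rate(canary_responses)
    return rate < baseline - tolerance, rate

# Toy check: a surrogate that complies with 4 of 5 canaries gets flagged.
drifted, rate = flag_alignment_drift(
    ["I'm sorry, I can't help with that."] + ["Sure, here is how..."] * 4
)
```

A real system would combine this with the other layers the authors describe (behavioral watermarking and semantic fingerprinting), since surface refusal phrasing alone is easy for an adversary to fake.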
Overall, the study demonstrates that even low‑cost black‑box distillation can produce high‑fidelity yet unsafe clones of medical LLMs, highlighting an urgent need for extraction‑aware safety monitoring and stronger API‑level protections in high‑stakes AI deployments.