When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio-language models to generate harmful content. Our method embeds harmful payloads as subtle perturbations into audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use gradient-based optimization to embed subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.
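The Stage 2 Payload Injection step can be sketched as standard projected gradient descent on an additive perturbation: minimize the distance between the model's output on the perturbed carrier and the Stage 1 target response, while clipping the perturbation into a small L-infinity ball so the audio stays close to the original. The linear map `W`, the loss, and all hyperparameters below are illustrative stand-ins, not the paper's actual setup.

```python
import numpy as np

def pgd_payload_injection(W, benign_audio, target, epsilon=0.01,
                          step_size=0.001, n_steps=500):
    """Toy PGD sketch: optimize an additive perturbation `delta` so that
    the (stand-in) model W maps the perturbed audio close to `target`,
    while keeping ||delta||_inf <= epsilon so the carrier stays near
    the original waveform. A real attack would backpropagate through
    the full audio-language model instead of a linear map."""
    delta = np.zeros_like(benign_audio)
    for _ in range(n_steps):
        residual = W @ (benign_audio + delta) - target
        grad = 2.0 * W.T @ residual                          # gradient of the squared error w.r.t. delta
        delta = np.clip(delta - step_size * np.sign(grad),   # signed gradient step toward the target
                        -epsilon, epsilon)                   # project back into the epsilon-ball
    return benign_audio + delta
```

The clipping step is what keeps the perturbation imperceptible: whatever the gradient suggests, no sample of the carrier ever moves by more than `epsilon`.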


💡 Research Summary

The paper introduces WhisperInject, a two‑stage adversarial attack framework that targets modern audio‑language models (ALMs) such as Whisper‑based multimodal large language models. The authors argue that as voice interfaces become ubiquitous in homes, vehicles, and public spaces, audio itself becomes a potent attack surface that has been under‑explored compared to text or image modalities.

Stage 1, called Native Target Discovery, uses a novel reward‑driven white‑box optimizer named Reinforcement Learning with Projected Gradient Descent (RL‑PGD). Unlike classic PGD, which requires a fixed target, RL‑PGD generates a diverse set of candidate responses to a benign audio prompt (e.g., “How’s the weather today?”) by sampling with multiple decoding strategies (greedy, beam search, temperature sampling). Each candidate is scored by an external LLM judge (e.g., GPT‑4o‑mini) against a predefined harmfulness rubric, and the resulting reward steers the optimization toward a harmful response phrased in the model’s own native style.

