A New Horizon in Multimodal LLM Jailbreaking: A Dual-Concealment Attack Using Image Steganography

Reading time: 7 minutes

📝 Abstract

By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also heightens the risk of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to align model behavior with human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Most existing jailbreak methods are tailored to open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and block malicious input and output content, significantly reducing the jailbreak threat. In this paper, we reveal that the success of these safety filters rests on a critical assumption: that malicious content is explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, creating a false sense of security.

To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Our method proceeds in four stages: (1) malicious query encoding, (2) steganography embedding, (3) model interaction, and (4) response extraction. We first encode the adversary-specified malicious prompt into binary matrices and embed them into an image using a steganography model; the modified image is then fed into the victim MLLM-integrated system. The victim system is prompted to implant the generated illegitimate content into a carrier image (again via steganography), which the attacker decodes locally to recover the hidden response. Extensive experiments on benchmark datasets demonstrate that Odysseus successfully jailbreaks several pioneering, real-world MLLM-integrated systems, including GPT-4o, Gemini-2.0-pro, Gemini-2.0-flash, and Grok-3, achieving up to a 99% attack success rate. It exposes a fundamental blind spot in existing defenses and calls for rethinking cross-modal security in MLLM-integrated systems.
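The four-stage pipeline above can be illustrated with a minimal sketch. This is an assumption-laden stand-in: the paper embeds binary matrices with a trained steganography model, whereas the toy code below uses plain least-significant-bit (LSB) embedding on a flat pixel list; all function names (`text_to_bits`, `embed_lsb`, `extract_lsb`) are illustrative, not from the paper.

```python
# Toy illustration of stages (1), (2), and (4): encode a query as bits,
# hide it in pixel LSBs, and recover it. Plain LSB embedding stands in
# for the paper's learned steganography model (hypothetical sketch).

def text_to_bits(text: str) -> list[int]:
    """Stage 1: encode UTF-8 text as a flat list of bits, MSB-first."""
    return [(byte >> i) & 1 for byte in text.encode("utf-8") for i in range(7, -1, -1)]

def embed_lsb(pixels: list[int], bits: list[int]) -> list[int]:
    """Stage 2: overwrite the LSB of each pixel with one payload bit."""
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for payload")
    stego = pixels[:]
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego

def extract_lsb(pixels: list[int], n_bytes: int) -> str:
    """Stage 4: reassemble n_bytes of hidden text from pixel LSBs."""
    bits = [p & 1 for p in pixels[: n_bytes * 8]]
    data = bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k : k + 8]))
        for k in range(0, len(bits), 8)
    )
    return data.decode("utf-8")

# Usage: round-trip a query through a synthetic 32x32 grayscale "image".
cover = [128] * 1024
query = "example query"
stego = embed_lsb(cover, text_to_bits(query))
assert extract_lsb(stego, len(query.encode())) == query
```

Because each pixel changes by at most 1 in value, the stego image is visually indistinguishable from the cover, which is exactly the property that lets such payloads slip past filters that only inspect visible content.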


📄 Content

Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography

Songze Li1, Jiameng Cheng1, Yiming Li2,†, Xiaojun Jia2, Dacheng Tao2
1Southeast University, 2Nanyang Technological University
{songzeli, jiamengcheng}@seu.edu.cn; {liyiming.tech, jiaxiaojunqaq, dacheng.tao}@gmail.com

Abstract—By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also raises heightened risks of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to align model behavior with human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and prevent malicious input and output content, significantly reducing jailbreak threats. In this paper, we reveal that the success of these safety filters heavily relies on a critical assumption that malicious content must be explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, leading to a false sense of security in existing MLLM-integrated systems. To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Our method proceeds through four stages: (1) malicious query encoding, (2) steganography embedding, (3) model interaction, and (4) response extraction. We first encode the adversary-specified malicious prompt into binary matrices and embed them into images using a steganography model. The modified image will be fed into the victim MLLM-integrated system. We encourage the victim MLLM-integrated system to implant the generated illegitimate content into a carrier image (via steganography), which will be used for attackers to decode the hidden response locally. Extensive experiments on benchmark datasets demonstrate that our Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, including GPT-4o, Gemini-2.0-pro, Gemini-2.0-flash, and Grok-3, achieving up to 99% attack success rate. It exposes a fundamental blind spot in existing defenses, and calls for rethinking cross-modal security in MLLM-integrated systems.

† Corresponding author: Yiming Li (liyiming.tech@gmail.com).
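The response side of the attack (stage 4) can also be sketched: the attacker receives a carrier image back from the victim system and reads the hidden reply out of it locally, where no output filter can intervene. The null-byte end-of-message sentinel and all names below are assumptions for illustration; the paper's actual decoding uses its steganography model, not raw LSBs.

```python
# Hypothetical sketch of stage (4): decode a hidden response from the
# LSBs of a returned carrier image. A null byte marks end-of-message
# (an illustrative convention, not taken from the paper).

def lsb_bytes(pixels: list[int]):
    """Yield one byte per 8 pixels, assembled MSB-first from pixel LSBs."""
    for k in range(0, len(pixels) - 7, 8):
        yield sum((pixels[k + i] & 1) << (7 - i) for i in range(8))

def extract_response(pixels: list[int]) -> str:
    """Collect hidden bytes until the sentinel, then decode as UTF-8."""
    out = bytearray()
    for byte in lsb_bytes(pixels):
        if byte == 0:  # end-of-message sentinel
            break
        out.append(byte)
    return out.decode("utf-8")

# Usage: simulate a returned carrier image with "hidden reply\x00" embedded.
payload = b"hidden reply\x00"
bits = [(b >> i) & 1 for b in payload for i in range(7, -1, -1)]
carrier = [200] * 256
stego = [(p & ~1) | bit for p, bit in zip(carrier, bits)] + carrier[len(bits):]
assert extract_response(stego) == "hidden reply"
```

The key point the sketch makes concrete is that the harmful text never appears in the model's visible output channel: it travels inside pixel values, so an output filter that scans only the generated text sees nothing objectionable.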

  • Code is available at GitHub (https://github.com/S3IC-Lab/Odysseus).

I. Introduction

Multimodal large language models (MLLMs), such as GPT-4o [2] and Gemini [59], integrate language, vision, and auditory modalities to enable cross-modal understanding and generation, establishing a transformative paradigm in the field of artificial intelligence (AI). These models have demonstrated unprecedented capabilities in tasks such as image captioning [37], visual question answering [51], and interleaved content generation [27]. Recent advancements underscore their broad applicability and versatility across various domains, positioning them as powerful tools [81].

As MLLMs become more powerful, they are increasingly misused to generate harmful content, leak sensitive information, and assist in prohibited or malicious activities, significantly lowering the technical barrier to harmful operations and enabling non-experts to carry out sophisticated attacks [78], [73], [72]. To mitigate such risks, developers have proposed a range of alignment techniques aimed at steering MLLMs toward generating responses that align with human values and social norms. These methods typically include supervised fine-tuning (SFT) [55], reinforcement learning from human feedback (RLHF) [47], and reinforcement learning from AI feedback (RLAIF) [5]. Specifically, SFT involves training the model on instruction-response pairs curated by human annotators. RLHF further refines model behavior beyond SFT by leveraging reward signals derived from human preferences. In RLAIF, human evaluators are replaced by a trained preference model that simulates human judgments to score candidate responses. Aligned MLLMs are designed to reject prompts that solicit malicious content or violate safety guidelines, thereby serving as a first and critical line of defense against misuse.
Despite these alignment efforts, recent studies [35], [64], [76], [52] demonstrated that alignment can be bypassed through jailbreak attacks. Jailbreaking refers to techniques that circumvent safety alignment mechanisms ...

This content is AI-processed based on ArXiv data.
