MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code is available on GitHub and data on Hugging Face.


💡 Research Summary

The paper introduces the Medical Prompt Injection Benchmark (MPIB), a comprehensive dataset and evaluation suite designed to assess the clinical safety of large language models (LLMs) and retrieval‑augmented generation (RAG) systems when faced with prompt injection attacks. While prior safety benchmarks focus on generic policy violations, toxicity, or the raw Attack Success Rate (ASR)—the proportion of times a model follows a malicious instruction—MPIB goes further by measuring downstream patient harm. To this end the authors develop a clinically grounded harm taxonomy (types H1‑H5) with severity levels 0‑4, and define the Clinical Harm Event Rate (CHER) as the proportion of high‑severity (≥ 3) events. CHER is reported alongside ASR to separate simple instruction compliance from actual clinical risk.
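To make the two metrics concrete, the following is a minimal sketch of how ASR and CHER could be computed from judged model outputs. The field names (`followed_injection`, `harm_severity`) are illustrative assumptions, not the paper's released schema; only the severity ≥ 3 threshold comes from the summary above.

```python
from dataclasses import dataclass

@dataclass
class JudgedOutput:
    # Hypothetical fields; the paper's exact schema may differ.
    followed_injection: bool  # did the model comply with the malicious instruction?
    harm_severity: int        # judged severity on the paper's 0-4 scale

def attack_success_rate(outputs: list[JudgedOutput]) -> float:
    """ASR: fraction of outputs that complied with the injected instruction."""
    return sum(o.followed_injection for o in outputs) / len(outputs)

def clinical_harm_event_rate(outputs: list[JudgedOutput], threshold: int = 3) -> float:
    """CHER: fraction of outputs whose judged harm severity meets the threshold."""
    return sum(o.harm_severity >= threshold for o in outputs) / len(outputs)

outputs = [
    JudgedOutput(followed_injection=True, harm_severity=4),   # complied, harmful
    JudgedOutput(followed_injection=False, harm_severity=3),  # refused, yet harmful
    JudgedOutput(followed_injection=False, harm_severity=0),  # refused, safe
    JudgedOutput(followed_injection=True, harm_severity=1),   # complied, low severity
]
print(attack_success_rate(outputs))       # 0.5
print(clinical_harm_event_rate(outputs))  # 0.5
```

Note the second output: the model refuses the injected command (it does not count toward ASR) but still produces a high-severity recommendation, so it does count toward CHER. This is exactly the compliance/risk divergence the benchmark is designed to expose.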

MPIB contains 9,697 curated instances spanning four realistic medical scenarios: (S1) explanation/summarization, (S2) medication dosing, (S3) emergency triage, and (S4) guideline/evidence verification. Each instance is a tuple (query q, context C, threat vector m, label ℓ). The threat vectors capture two attack modalities: V1 – direct injection where the adversarial directive is placed directly in the user query, and V2 – indirect (RAG‑mediated) injection where malicious instructions are embedded in retrieved documents (e.g., a poisoned guideline update). Benign (V0/V0’) anchors are also provided for baseline comparison.
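The instance structure described above can be sketched as a small data model. This is an illustrative representation under assumed names, not the released dataset schema:

```python
from dataclasses import dataclass
from enum import Enum

class ThreatVector(Enum):
    V0 = "benign_anchor"        # benign baseline instance
    V1 = "direct_injection"     # adversarial directive placed in the user query
    V2 = "indirect_injection"   # adversarial directive embedded in retrieved context

class Scenario(Enum):
    S1 = "explanation_summarization"
    S2 = "medication_dosing"
    S3 = "emergency_triage"
    S4 = "guideline_verification"

@dataclass
class MPIBInstance:
    query: str                   # q: the user-facing clinical question
    context: list[str]           # C: retrieved passages (may be poisoned under V2)
    threat_vector: ThreatVector  # m: which attack modality applies
    label: str                   # l: gold safety/behavior label

inst = MPIBInstance(
    query="What is the usual adult dose of drug X?",
    context=["[retrieved guideline passage embedding a hidden instruction]"],
    threat_vector=ThreatVector.V2,
    label="unsafe_if_followed",
)
print(inst.threat_vector.name)  # V2
```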

The dataset construction follows a multi‑stage pipeline: (1) selection of clinically relevant queries, (2) generation of benign, borderline, and malicious prompts using domain experts, (3) retrieval of context passages from simulated knowledge bases, (4) manual safety linting and taxonomy annotation, and (5) quality gating to ensure high‑fidelity labels. The resulting benchmark is thus both task‑diverse and safety‑rich.

Evaluation is performed on twelve LLMs, including open‑source models (LLaMA, Falcon) and proprietary systems (GPT‑4, Claude). For each model, the authors compute ASR (did the model obey the malicious instruction?) and CHER (did the model produce a high‑severity clinical error?). The key findings are:

  • Indirect injection (V2) consistently yields higher CHER than direct injection (V1), often by a factor of three to five, highlighting the danger of treating retrieved text as authoritative.
  • ASR and CHER can diverge dramatically; a model may refuse the malicious command (low ASR) yet still generate a harmful recommendation by citing the poisoned context, leading to high CHER.
  • Simple defenses—prompt‑filtering, instruction‑ignoring layers—reduce ASR but have limited impact on CHER. Context‑filtering and retrieval‑time sanitization are more effective against V2 attacks.
  • The authors implement an LLM‑as‑a‑judge framework to automatically label model outputs with harm type and severity. The judge model is fine‑tuned on a small human‑annotated subset, and deterministic post‑processing ensures reproducibility.
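The deterministic post-processing step in the judging pipeline could look roughly like the sketch below: raw judge text is mapped to a (harm type, severity) pair by a fixed parser, so repeated runs yield identical labels. The output convention, regex, and conservative fallback are all assumptions for illustration, not the paper's released implementation:

```python
import re

# Hypothetical judge output convention: "harm_type=H2 severity=3"
JUDGE_PATTERN = re.compile(r"harm_type\s*=\s*(H[1-5])\s+severity\s*=\s*([0-4])")

def parse_judgment(judge_output: str) -> tuple[str, int]:
    """Deterministically map raw judge text to (harm type, severity).

    Unparseable outputs fall back to a fixed conservative default rather
    than being dropped or re-sampled, keeping the labeling reproducible.
    """
    match = JUDGE_PATTERN.search(judge_output)
    if match is None:
        # Conservative fallback: flag at maximum severity for manual review.
        return ("H5", 4)
    return (match.group(1), int(match.group(2)))

print(parse_judgment("Verdict: harm_type=H2 severity=3"))  # ('H2', 3)
print(parse_judgment("garbled output"))                    # ('H5', 4)
```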

The paper also discusses responsible release practices: payload redaction, pointer‑based reconstruction hooks, and cryptographic integrity commitments are provided to mitigate dual‑use risks while preserving research utility.
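One standard way to realize the cryptographic integrity commitments mentioned above is to publish a hash digest of each redacted payload, so that a researcher who reconstructs the payload via the pointer hooks can verify an exact match. This is a generic SHA-256 sketch of that idea, not the paper's released tooling:

```python
import hashlib

def commit(payload: bytes) -> str:
    """Digest published alongside the redacted instance."""
    return hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, published_digest: str) -> bool:
    """Check a pointer-reconstructed payload against its commitment."""
    return hashlib.sha256(payload).hexdigest() == published_digest

secret = b"redacted adversarial payload"
digest = commit(secret)
print(verify(secret, digest))             # True: exact reconstruction
print(verify(b"tampered text", digest))   # False: any change breaks the match
```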

In summary, MPIB fills a critical gap in medical AI safety research by shifting the evaluation focus from surface‑level policy compliance to outcome‑level clinical risk. By pairing ASR with the novel CHER metric, the benchmark reveals that many existing defenses may give a false sense of security. The work underscores the need for robust retrieval pipelines, context‑aware defenses, and outcome‑centric auditing before deploying LLM‑driven tools in real‑world healthcare settings.

