A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties
Title: A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties
ArXiv ID: 2512.08185
Date: 2025-12-09
Authors: ** - Jinghao Wang (The Ohio State University) - Ping Zhang (The Ohio State University) - Carter Yagemann (The Ohio State University) **
📝 Abstract
Medical Large Language Models (LLMs) are increasingly deployed for clinical decision support across diverse specialties, yet systematic evaluation of their robustness to adversarial misuse and privacy leakage remains inaccessible to most researchers. Existing security benchmarks require GPU clusters, commercial API access, or protected health data -- barriers that limit community participation in this critical research area. We propose a practical, fully reproducible framework for evaluating medical AI security under realistic resource constraints. Our framework design covers multiple medical specialties stratified by clinical risk -- from high-risk domains such as emergency medicine and psychiatry to general practice -- addressing jailbreaking attacks (role-playing, authority impersonation, multi-turn manipulation) and privacy extraction attacks. All evaluation utilizes synthetic patient records requiring no IRB approval. The framework is designed to run entirely on consumer CPU hardware using freely available models, eliminating cost barriers. We present the framework specification including threat models, data generation methodology, evaluation protocols, and scoring rubrics. This proposal establishes a foundation for comparative security assessment of medical-specialist models and defense mechanisms, advancing the broader goal of ensuring safe and trustworthy medical AI systems.
💡 Deep Analysis
📄 Full Content
A Practical Framework for Evaluating Medical AI Security:
Reproducible Assessment of Jailbreaking and Privacy
Vulnerabilities Across Clinical Specialties
Jinghao Wang1, Ping Zhang1, and Carter Yagemann1
1The Ohio State University
Abstract
Medical Large Language Models (LLMs) are increasingly deployed for clinical decision support
across diverse specialties, yet systematic evaluation of their robustness to adversarial misuse
and privacy leakage remains inaccessible to most researchers. Existing security benchmarks
require GPU clusters, commercial API access, or protected health data—barriers that limit
community participation in this critical research area. We propose a practical, fully reproducible
framework for evaluating medical AI security under realistic resource constraints. Our framework
design covers multiple medical specialties stratified by clinical risk—from high-risk domains
such as emergency medicine and psychiatry to general practice—addressing jailbreaking attacks
(role-playing, authority impersonation, multi-turn manipulation) and privacy extraction attacks.
All evaluation utilizes synthetic patient records requiring no IRB approval. The framework is
designed to run entirely on consumer CPU hardware using freely available models, eliminating
cost barriers. We present the framework specification including threat models, data generation
methodology, evaluation protocols, and scoring rubrics. This proposal establishes a foundation for
comparative security assessment of medical-specialist models and defense mechanisms, advancing
the broader goal of ensuring safe and trustworthy medical AI systems.
Keywords: Medical AI, Adversarial Attacks, AI Safety, Privacy, Jailbreaking, LLM Security,
Reproducible Research, Clinical Specialties
1
Introduction
Large Language Models are rapidly transforming healthcare across all clinical specialties [Singhal et al.,
2024, 2023a]. GPT-4 achieves expert-level performance on medical licensing examinations [OpenAI,
2023], and AI assistants increasingly provide clinical decision support in domains ranging from
emergency medicine to psychiatry. However, these systems face critical security vulnerabilities that
directly threaten patient safety [Dong et al., 2024, Amodei et al., 2016].
The Problem.
Medical AI systems face two critical security vulnerabilities. First, jailbreaking
attacks bypass safety mechanisms through adversarial prompts, causing models to generate dangerous
treatment recommendations or lethal drug information [Wei et al., 2023, Zou et al., 2023]. Zhang
et al. [2024] demonstrated that medical-specialist models paradoxically show higher compliance with
harmful requests than general models—domain knowledge amplifies rather than mitigates security
risks. Second, privacy extraction attacks exploit the tendency of language models to memorize and
1
arXiv:2512.08185v1 [cs.CR] 9 Dec 2025
regurgitate training data [Carlini et al., 2021], creating HIPAA violations when models leak protected
health information [U.S. Department of Health and Human Services, 2003].
Despite these critical risks, systematic security evaluation remains inaccessible to most researchers.
Existing benchmarks such as HarmBench [Mazeika et al., 2024] and DecodingTrust [Wang et al.,
2023] require GPU clusters, commercial API budgets, or access to protected health information.
This accessibility barrier conflicts with the principle that security research benefits from broad
participation [Ganguli et al., 2022].
Why This Matters.
The consequences of medical AI security failures extend beyond typical AI
risks to direct patient harm. Jailbreaking attacks that elicit dangerous medical advice can cause
patient injury or death [Finlayson et al., 2019]. HIPAA violations carry penalties up to $1.5 million
per incident [U.S. Department of Health and Human Services, 2003]. Critically, risks are not uniform
across medical domains: emergency medicine involves time-critical decisions where errors can be
immediately fatal, psychiatry deals with vulnerable populations, and pharmacology presents risks of
dangerous drug interactions [Seyyed-Kalantari et al., 2021, Obermeyer et al., 2019]. A comprehensive
security framework must therefore evaluate vulnerabilities across the spectrum of clinical practice.
Contributions.
We address this gap by proposing a practical framework for evaluating medical
AI security that any researcher can replicate:
1. Multi-specialty threat model: Attack scenarios organized by clinical risk level and grounded
in domain-specific risks identified by foundational medical AI research.
2. Accessible design: Framework designed to run on consumer hardware without GPU require-
ments, using freely available models.
3. Synthetic data methodology: Patient record generation approach requiring no IRB approval,
enabling fully reproducible evaluation.
4. Evaluation protocol: Standardized metrics and scoring rubrics adapted from established
security research.
2
Related Work
Foundations in AI Safe