CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
Integrating language models (LMs) into healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained on high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, and posing significant challenges for deployment in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all major continents), and encompassing a wide array of critical healthcare topics such as disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
💡 Research Summary
The paper introduces CLINIC, a comprehensive multilingual benchmark designed to evaluate the trustworthiness of language models (LMs) in the healthcare domain. Recognizing that current LMs are predominantly trained on high‑resource languages and that existing evaluations focus on narrow aspects or single languages, the authors aim to fill a critical gap by providing a systematic, multi‑dimensional assessment across 15 languages spanning all major continents and six key medical sub‑domains (patient conditions, preventive care, diagnostics, pharmacology, surgery, and emergency medicine).
Dataset Construction
The authors source high‑quality English medical content from MedlinePlus and FDA drug documents, both of which include professionally vetted translations into the target languages. Languages are categorized as high‑resource (e.g., English, Chinese, Spanish), mid‑resource (Russian, Vietnamese, Bengali), and low‑resource (Swahili, Hausa, Nepali, Somali). For each language, an equal number of samples is curated to avoid bias.
A two‑step prompting pipeline generates multilingual questions: (1) an LLM creates an English question from an English passage; (2) the same LLM, given the English question, the English passage, and the translated passage, produces the target‑language question. Generated items are reviewed by two clinicians (≥8 years of experience) and 22 native speakers, achieving an average quality rating of 3.9/5 and Cohen’s κ = 0.82, indicating strong inter‑annotator agreement.
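The reported agreement statistic is standard and easy to reproduce for any pair of annotators. A minimal sketch of Cohen's κ (illustrative, not the authors' code; the example labels are made up):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Two annotators marking generated questions as "accept"/"reject":
a = ["accept", "accept", "reject", "accept", "reject", "accept"]
b = ["accept", "accept", "reject", "reject", "reject", "accept"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

κ corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when label distributions are skewed.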
Trustworthiness Dimensions
CLINIC operationalizes five trustworthiness pillars—truthfulness, fairness, safety, robustness, and privacy—through 18 concrete tasks: false‑confidence, false‑question, “None of the Above” (truthfulness); persona‑based and preference‑based sycophancy (fairness); toxicity, exaggerated safety claims, jailbreak (safety); out‑of‑distribution and adversarial attacks (robustness); and leakage rate (privacy). Evaluation metrics include accuracy, similarity scores, honesty scores, and leak rates, with an external LLM acting as a judge for open‑ended responses.
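Benchmark results like these are typically aggregated per task and per language-resource tier. A minimal sketch of that aggregation, assuming hypothetical per-example records (the tier grouping follows the paper's categorization; the records and task names here are illustrative):

```python
from collections import defaultdict

# Language tiers as grouped in the benchmark (subset shown).
TIER = {"English": "high", "Chinese": "high", "Russian": "mid",
        "Bengali": "mid", "Swahili": "low", "Hausa": "low"}

# Hypothetical evaluation records: (language, task, model answered correctly?).
records = [
    ("English", "false_confidence", True),
    ("English", "false_question", True),
    ("Russian", "false_confidence", True),
    ("Russian", "false_question", False),
    ("Swahili", "false_confidence", False),
    ("Swahili", "false_question", True),
]

def accuracy_by_tier(records):
    """Mean accuracy per language-resource tier."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, _task, correct in records:
        tier = TIER[lang]
        totals[tier] += 1
        hits[tier] += correct
    return {t: hits[t] / totals[t] for t in totals}

print(accuracy_by_tier(records))  # {'high': 1.0, 'mid': 0.5, 'low': 0.5}
```

Grouping by tier rather than by individual language is what makes the high- vs. low-resource degradation reported below directly comparable across tasks.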
Model Evaluation
Thirteen models are benchmarked: proprietary (Gemini‑2.5‑Pro, GPT‑4o‑mini, Gemini‑1.5‑Flash), open‑weight small models (LLaMA‑3.2‑3B, Qwen‑2.1‑5B, Phi‑4‑mini), open‑weight large models (Qwen3‑32B, DeepSeek‑R1, DeepSeek‑R1‑Llama, QwQ‑32B), and medical‑specialized models (OpenBioLLM‑8B, UltraMedical, MMed‑Llama).
Truthfulness: In false‑confidence, false‑question, and “None of the Above” tests, Gemini‑2.5‑Pro and Gemini‑1.5‑Flash achieve the highest accuracies (>90 % in high‑resource languages). Medical‑specific models lag behind (≈60‑70 %). Accuracy drops 10‑15 % for mid‑ and low‑resource languages, and sycophancy scores increase, indicating higher susceptibility to misleading prompts.
Fairness: Persona‑based tests reveal a tendency to align with authoritative “expert” personas, especially in low‑resource languages where cultural cues are weaker. Preference‑based sycophancy shows similar patterns, suggesting that models may amplify existing health disparities if deployed without safeguards.
Safety: Toxicity and exaggerated safety claim scores are markedly higher for low‑resource languages across most models, indicating insufficient exposure to language‑specific safety norms during pre‑training.
Robustness: Out‑of‑distribution and adversarial attack performance follows the same trend: high‑resource languages see >85 % accuracy, while low‑resource languages fall to ~65 %.
Privacy: Leak rates (percentage of generated responses that expose protected health information) reach up to 5 % for several open‑weight models, highlighting a non‑trivial risk of inadvertent data disclosure.
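A leak rate of this kind reduces to counting responses flagged by a PHI detector. A minimal sketch, assuming a toy regex-based detector (the patterns below are illustrative; the paper does not specify its detection method here):

```python
import re

# Illustrative PHI-like patterns, NOT the benchmark's actual detector:
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-style identifier
    re.compile(r"\bMRN[:\s]*\d{6,}\b", re.I),    # medical record number
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),    # DOB-style date
]

def leak_rate(responses):
    """Fraction of model responses containing at least one PHI-like pattern."""
    leaked = sum(any(p.search(r) for p in PHI_PATTERNS) for r in responses)
    return leaked / len(responses)

responses = [
    "The patient should take 5 mg daily.",
    "Patient MRN: 0012345 was admitted on 3/14/2021.",
    "No identifying details are available.",
    "SSN 123-45-6789 appears in the record.",
]
print(leak_rate(responses))  # 2 of 4 responses leak -> 0.5
```

In practice, a regex pass would catch only the most structured identifiers; names, addresses, and contextual identifiers require NER or an LLM judge, which is why reported leak rates depend heavily on the detector used.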
Insights & Limitations
The study demonstrates that even state‑of‑the‑art LLMs are not universally trustworthy across languages and medical tasks. Proprietary large models generally outperform open‑weight and medical‑specialized models, but all exhibit degradation in low‑resource settings. The authors note several limitations: reliance on LLMs for both question generation and evaluation may propagate hidden biases; translation quality for low‑resource languages, while vetted, may still contain subtle errors; and the external LLM judge, though convenient, is not a perfect surrogate for human clinical judgment.
Future Directions
The authors propose expanding human evaluation, incorporating more diverse native speakers, and refining the prompting pipeline to reduce generation bias. They also suggest a continuous feedback loop where model failures on CLINIC inform targeted fine‑tuning and data augmentation, ultimately moving toward globally reliable, safe, and equitable AI‑driven healthcare.
In sum, CLINIC provides the first large‑scale, multilingual, multi‑dimensional benchmark for assessing healthcare LMs, revealing critical gaps in truthfulness, fairness, safety, robustness, and privacy that must be addressed before widespread clinical deployment.