Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks

In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted performance discrepancies on various medical tasks in low-resource languages, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that widens with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.


💡 Research Summary

The paper presents a systematic cross‑lingual evaluation of several recent open‑source large language models (LLMs) on Arabic and English medical multiple‑choice question (MCQ) tasks. Using the MedAraBench dataset—a collection of Arabic medical exam questions annotated with difficulty level, specialty, and answer options—the authors create an English counterpart by translating each item with Google Translate. This parallel setup allows them to isolate the effect of language from medical content while keeping the prompting format identical (English prompts for both languages).

Six models are examined: three general‑purpose LLMs (DeepSeek‑V3.2, LLaMA 3.3 70B, Mistral‑Small‑3.2‑24B) and three medical‑domain LLMs (Meditron 3 70B, Med42‑70B, medgemma‑27B‑text‑it). None of the medical models were explicitly trained on multilingual medical data; they primarily rely on English‑centric corpora. All models are evaluated under a unified MCQ prompting scheme with greedy decoding and fixed token limits for option selection, answer generation, and explanation generation.
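The unified setup above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact prompt or limits: the template wording, field names, and token budgets are assumptions.

```python
# Sketch of a unified MCQ prompt and greedy-decoding config of the kind
# described in the setup. Template wording and numeric limits are
# placeholders, not the paper's exact values.

def build_mcq_prompt(question: str, options: dict[str, str]) -> str:
    """Render one exam item with an English instruction, regardless of
    the language of the question text itself."""
    lines = [
        "Answer the following medical multiple-choice question.",
        "Respond with the letter of the single best option.",
        "",
        f"Question: {question}",
    ]
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)

# Greedy decoding (temperature 0) with a fixed token budget per output type.
GENERATION_CONFIG = {
    "option_selection": {"temperature": 0.0, "max_new_tokens": 4},
    "answer_generation": {"temperature": 0.0, "max_new_tokens": 64},
    "explanation": {"temperature": 0.0, "max_new_tokens": 256},
}

prompt = build_mcq_prompt(
    "Which vitamin deficiency causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
)
```

Keeping the instruction in English for both language conditions is what lets the evaluation attribute accuracy differences to the question language rather than the prompt.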

Key Findings

  1. Persistent Language Gap – Across almost all models, accuracy on the English version exceeds that on the Arabic version. The gap ranges from a negligible 0.5 % for DeepSeek‑V3.2 to more than 20 % for Med42‑70B. The fact that even the largest 70 B‑parameter models suffer substantial drops indicates that sheer scale does not eliminate the cross‑lingual disparity.

  2. Impact of Input Length – When question length (measured in tokens) increases, Arabic performance degrades sharply, whereas English performance remains relatively stable. Token‑level analysis shows that Arabic sentences generate many more sub‑tokens due to rich morphology and cliticization, leading to longer input sequences that the models handle less effectively.

  3. Difficulty and Specialty Effects – Accuracy declines for later‑year (Y3‑Y5) questions compared with early‑year (Y1‑Y2) items, and the decline is consistently larger for Arabic. Specialty‑wise, clinically oriented fields (e.g., Emergency Medicine, Internal Medicine) yield higher scores, while foundational or detail‑intensive specialties (Microbiology, Embryology) produce the lowest scores. The Arabic‑English gap persists across specialties, suggesting that language‑related factors are not merely domain‑specific.

  4. Alignment and Output Format – The authors compare “soft matching” (letter‑based option selection) with “hard matching” (free‑form answer generation). Hard matching amplifies Arabic errors because the models must generate coherent Arabic sentences; tokenization fragmentation and limited Arabic generation capability cause higher mismatch rates.

  5. Reliability Signals – Model‑reported confidence scores and generated rationales correlate weakly with actual correctness (Pearson r≈0.2). In Arabic, high confidence does not guarantee correctness, and explanations often suffer from incoherent phrasing due to tokenization issues.
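The two scoring modes from finding 4 can be sketched as below. The helper names and matching rules are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal sketch of the two scoring modes compared in the paper:
# "soft matching" reads a single option letter from the model output,
# while "hard matching" requires the generated free-form answer to
# contain the gold option text. Helper names are illustrative.
import re

def soft_match(model_output: str, gold_letter: str) -> bool:
    """Extract the first standalone option letter (A-E) and compare."""
    m = re.search(r"\b([A-E])\b", model_output)
    return bool(m) and m.group(1) == gold_letter

def hard_match(model_output: str, gold_text: str) -> bool:
    """Require the gold option text to appear verbatim (case-insensitive)
    in the free-form answer; fragmented or paraphrased generation fails
    this check more often, which is one way the Arabic gap is amplified."""
    return gold_text.strip().lower() in model_output.strip().lower()
```

Under this sketch, an answer that is medically correct but phrased differently from the gold option passes soft matching yet fails hard matching, which is consistent with the higher Arabic mismatch rates the authors report.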

Interpretation

The performance gap is not solely attributable to the amount of Arabic pre‑training data. Instead, it emerges from an interaction of several factors: (a) Arabic’s complex morphology leads to token fragmentation and longer effective sequences; (b) the English‑centric prompting and instruction tuning bias models toward English comprehension and generation; (c) medical domain adaptation has been performed mainly on English corpora, leaving Arabic medical terminology under‑represented; (d) longer and more difficult questions exacerbate the representation mismatch; (e) confidence estimation mechanisms are not calibrated for multilingual contexts.
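Factor (a) can be illustrated with a toy fertility measurement (tokens produced per word). The vocabulary and fallback rule below are deliberately simplistic stand-ins for a real sub-word tokenizer, chosen only to show how an English-centric vocabulary inflates Arabic sequence length:

```python
# Toy illustration of token fragmentation: a tokenizer with an
# English-centric vocabulary keeps English words whole but shatters
# out-of-vocabulary Arabic words into characters. The vocabulary and
# metric here are illustrative, not the evaluated models' tokenizers.

ENGLISH_CENTRIC_VOCAB = {"acute", "renal", "failure", "patient", "the"}

def tokenize(text: str) -> list[str]:
    """Keep in-vocabulary words intact; split everything else into
    single characters, mimicking heavy sub-word fragmentation."""
    tokens = []
    for word in text.split():
        if word.lower() in ENGLISH_CENTRIC_VOCAB:
            tokens.append(word)
        else:
            tokens.extend(word)  # character-level fallback
    return tokens

def fertility(text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)

en = "acute renal failure"
ar = "فشل كلوي حاد"  # the same phrase in Arabic, outside the toy vocabulary
```

Here `fertility(en)` is 1.0 while `fertility(ar)` exceeds 3, mirroring how longer effective sequences arise for Arabic even when the underlying question is identical.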

Implications

To build trustworthy multilingual medical LLMs, future work should:

  • Develop Arabic‑aware tokenizers or sub‑word vocabularies that reduce fragmentation.
  • Curate large, high‑quality Arabic medical corpora (textbooks, clinical guidelines, exam archives) for pre‑training and fine‑tuning.
  • Design multilingual prompts and instruction‑tuning pipelines that treat Arabic inputs natively rather than relying on English‑only templates.
  • Implement language‑agnostic confidence calibration and explanation generation methods, possibly via cross‑lingual consistency checks.

Overall, the paper provides a thorough diagnostic framework that goes beyond aggregate accuracy, revealing how linguistic structure, task complexity, and model alignment jointly shape the reliability of LLMs in Arabic medical question answering.

