Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

Reading time: 1 minute

📝 Original Info

  • Title: Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making
  • ArXiv ID: 2512.13747
  • Date: 2025-12-15
  • Authors: Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane, Heng Huang, Dongkuan Xu, Haoteng Tang, Liang Zhan

📝 Abstract

With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
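Of the three mitigation strategies, the second decouples perception from reasoning: the image is first converted into a textual caption, and the diagnostic decision is then made from text alone, exploiting the finding that text-only reasoning is more reliable. A minimal sketch of that two-stage pipeline is shown below; `run_model` and `caption_then_classify` are hypothetical placeholders for whatever MLLM/LLM backend is used, not the paper's released implementation.

```python
# Sketch of strategy (2): caption the image, then reason over text only.
# `run_model` is a hypothetical stand-in for an MLLM / LLM backend call;
# it is not part of the paper's code.

from typing import Optional

# Standard 14-label CheXpert set commonly used with MIMIC-CXR.
MIMIC_CXR_LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion",
    "Lung Opacity", "No Finding", "Pleural Effusion", "Pleural Other",
    "Pneumonia", "Pneumothorax", "Support Devices",
]

def run_model(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical backend call; replace with your MLLM/LLM client."""
    raise NotImplementedError

def caption_then_classify(image_path: str) -> str:
    # Stage 1: perception only -- turn the radiograph into a textual report.
    report = run_model(
        "Describe all findings visible in this chest radiograph.",
        image_path=image_path,
    )
    # Stage 2: decision making from text alone, which the paper reports is
    # more reliable than passing the raw image to the model directly.
    return run_model(
        "Given the radiology description below, list every applicable "
        f"condition from: {', '.join(MIMIC_CXR_LABELS)}.\n\n{report}"
    )
```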

📄 Full Content

...(The full text is omitted here due to its length. Please see the complete article on the site.)
