SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe susceptibility to irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.


💡 Research Summary

The paper introduces SMMILE, the first expert‑driven benchmark for multimodal in‑context learning (ICL) in the medical domain. While large language models (LLMs) have demonstrated impressive few‑shot abilities in pure text, extending ICL to multimodal settings—especially in high‑stakes fields like medicine—remains largely unexplored. Clinicians often solve new cases by referring to a handful of prior examples or a limited differential diagnosis list, a process that mirrors ICL. To evaluate whether current multimodal large language models (MLLMs) can learn from such examples, the authors built a dataset of 111 problems, each consisting of a multimodal query (text question + image) and two or more multimodal in‑context examples (question‑image‑answer triplets). The problems were curated by 11 medical experts (nine physicians and two medical students) across six specialties (radiology, general medicine, pathology, etc.) and 13 imaging modalities (X‑ray, CT, MRI, ultrasound, histopathology, etc.). In total, the benchmark contains 517 question‑image‑answer triplets.

Two evaluation formats are supported: (1) open‑ended generation, where the model must produce a free‑text answer, and (2) closed‑ended generation, where the model selects the correct answer from a set of candidates derived from the in‑context examples. To probe the effect of example ordering, the authors also created SMMILE++, an augmented version that permutes the order of in‑context examples (up to 24 permutations per problem), yielding 1,038 problems.
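The SMMILE++ construction described above can be sketched as enumerating a capped number of orderings of each problem's in‑context examples. The triple format and the helper below are illustrative assumptions, not the authors' actual release code:

```python
from itertools import islice, permutations

def permute_examples(examples, cap=24):
    """Enumerate orderings of the in-context examples, capped at `cap`.

    `examples` is a list of (question, image_path, answer) triples;
    this schema is a hypothetical stand-in for the benchmark's format.
    """
    # islice stops after `cap` orderings, so longer example lists stay cheap
    # and each problem contributes at most `cap` permuted variants.
    return [list(p) for p in islice(permutations(examples), cap)]

# A problem with three in-context examples yields 3! = 6 orderings.
demo = [("Q1", "img1.png", "A1"), ("Q2", "img2.png", "A2"), ("Q3", "img3.png", "A3")]
print(len(permute_examples(demo)))  # 6
```

With four or more examples, the cap of 24 keeps the variant count bounded, matching the "up to 24 permutations per problem" described above.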

The authors evaluated 15 state‑of‑the‑art MLLMs, spanning open‑source models (LLaVA‑v1.5, LLaVA‑OneVision, LLaVA‑NeXT, LLaVA‑Med, Llama‑3.2‑Vision, MedVLM‑R1, MedGemma, Qwen2.5‑VL) and closed‑source models (GPT‑4o, Claude 3.7 Sonnet). For open‑ended tasks they used Exact Match (EM) and an LLM‑as‑a‑Judge approach (Llama 3.3 70B) to score correctness; for closed‑ended tasks they measured selection accuracy. Human expert evaluation (five clinicians) was also performed, with near‑perfect inter‑rater agreement, providing a reliable gold standard.
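As a concrete illustration of the Exact Match side of the open‑ended scoring, a minimal normalized string comparison might look like the following. The normalization steps (lowercasing, punctuation stripping, whitespace collapsing) are a common convention in VQA evaluation and an assumption here, not necessarily the paper's exact recipe:

```python
import string

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace before
    comparison (a plausible EM normalization, assumed for illustration)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, reference):
    """Return 1 if the normalized strings agree exactly, else 0."""
    return int(normalize(prediction) == normalize(reference))

print(exact_match("Pneumonia.", "pneumonia"))     # 1
print(exact_match("viral pneumonia", "pneumonia"))  # 0
```

The second case shows why EM alone is too strict for free‑text medical answers, and why the authors pair it with an LLM judge.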

Key findings:

  1. Limited ICL gains – Across all models, ICL yields an average absolute improvement of only 8% (31% relative) over zero‑shot. However, performance is highly heterogeneous. Seven models (including several LLaVA variants and the domain‑specific MedVLM‑R1) performed worse than a simple random baseline that picks an answer from the in‑context set (27.86%). Notably, LLaVA‑Med‑7B’s accuracy dropped from 21.65% in zero‑shot to 10.19% with ICL, indicating that naïve inclusion of examples can hurt.

  2. Best performers – GPT‑4o achieved the highest open‑ended score (49.88%) and the best closed‑ended accuracy (58.85%). Among open‑source models, Qwen2.5‑VL‑72B performed best (42.59% open‑ended). These results suggest that very large, general‑purpose models still have an edge over specialized medical MLLMs in multimodal ICL.

  3. Sensitivity to example quality – Adding a single irrelevant or noisy example can reduce accuracy by up to 9.5%. This underscores that the benchmark’s expert‑crafted examples are not merely decorative; their relevance is crucial for model performance.

  4. Recency bias – The ordering of examples dramatically influences outcomes. Placing the most relevant example last can boost accuracy by as much as 71%, revealing a strong recency bias in current MLLMs. Models tend to give disproportionate weight to the most recent tokens, which can be exploited (or mitigated) via prompt engineering.

  5. Dataset diversity – The benchmark covers a broad spectrum of clinical difficulty (rated by experts), rarity (common vs. uncommon presentations), and cognitive processes (visual pattern matching vs. reasoning). About one‑third of problems are labeled “rare,” providing a realistic test of generalization.
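The recency‑bias finding above suggests a simple prompt‑engineering mitigation: order the in‑context examples so the one judged most relevant to the query appears last. A minimal sketch, assuming relevance scores come from some external retriever (the function name and score source are hypothetical):

```python
def order_most_relevant_last(examples, scores):
    """Reorder in-context examples so the highest-scoring one appears
    last in the prompt, exploiting the recency bias reported above.

    `scores` are assumed per-example relevance scores, e.g. from a
    retriever; higher means more relevant to the query.
    """
    # Sort by ascending relevance, so the best example lands at the end.
    paired = sorted(zip(scores, range(len(examples))))
    return [examples[i] for _, i in paired]

examples = ["caseA", "caseB", "caseC"]
scores = [0.9, 0.1, 0.5]
print(order_most_relevant_last(examples, scores))  # ['caseB', 'caseC', 'caseA']
```

Sorting on (score, index) pairs rather than the examples themselves avoids comparing the example payloads, which may not be orderable.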

The authors discuss several implications. First, the modest ICL gains and frequent degradation indicate that existing MLLMs do not truly “learn” from multimodal demonstrations; they may merely attend to surface patterns. Second, the pronounced recency bias suggests that current attention mechanisms are not robust to example ordering, a serious concern for clinical deployment where consistency is essential. Third, the underperformance of domain‑specific models in ICL points to a mismatch between their pre‑training data and the structured, example‑driven inference required here.

Future research directions proposed include: developing methods for selecting or weighting in‑context examples, designing architectures or training objectives that reduce recency bias (e.g., position‑agnostic encodings), and exploring meta‑learning approaches that explicitly train models to adapt from a few multimodal demonstrations. Extending the benchmark beyond static images to videos, electronic health records, and multimodal time series is also envisioned.

In summary, SMMILE (and its expanded SMMILE++) provides a rigorous, expert‑validated platform to assess multimodal ICL in medicine. The comprehensive evaluation reveals that current MLLMs have limited ability to leverage multimodal examples, are vulnerable to noisy or poorly ordered demonstrations, and exhibit strong recency bias. These insights highlight critical gaps that must be addressed before multimodal ICL can be trusted in real‑world clinical decision support. The benchmark is publicly released, offering a valuable resource for the community to track progress and drive the next generation of medical‑ready multimodal LLMs.

