From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Reading time: 6 minutes

📝 Abstract

Multi-modal large language models (MLLMs) have shown tremendous promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in real-world clinical workflows. In practice, medical diagnosis and disease progression assessment often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images, widely available in biomedical literature, as a rich yet underutilized data source for training medical MLLMs in multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy that systematically transforms compound figures and their accompanying expert text into high-quality training instructions. By decomposing the complex task of multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For comprehensive benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios.
Notably, M3LLM exhibits strong generalization to real-world clinical settings, achieving superior performance on longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing next-generation medical MLLMs, capable of composite reasoning across complex multi-image scenarios, bridging the gap between biomedical literature and real-world clinical applications.
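The divide-and-conquer idea in the abstract (decompose a compound figure into per-panel sub-tasks, then recombine them into composite instructions) can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual five-stage pipeline: the `Panel`/`CompoundFigure` types, the question templates, and the panel/caption alignment are all illustrative assumptions.

```python
# Hypothetical sketch of divide-and-conquer instruction generation from one
# compound figure. The real pipeline has five context-aware stages; here only
# the divide (per-panel) and conquer (cross-panel) steps are shown.
from dataclasses import dataclass, field

@dataclass
class Panel:
    label: str          # e.g. "A", "B"
    description: str    # sub-caption text aligned to this panel

@dataclass
class CompoundFigure:
    caption: str        # full figure caption
    context: str        # surrounding article text referencing the figure
    panels: list[Panel] = field(default_factory=list)

def generate_instructions(fig: CompoundFigure) -> list[dict]:
    """Turn one compound figure into instruction-response pairs."""
    instructions = []
    # Divide: one single-panel pair per sub-figure.
    for p in fig.panels:
        instructions.append({
            "task": "single_panel",
            "question": f"Describe the findings shown in panel {p.label}.",
            "answer": p.description,
        })
    # Conquer: one composite pair relating all panels via caption + context.
    if len(fig.panels) > 1:
        labels = ", ".join(p.label for p in fig.panels)
        instructions.append({
            "task": "multi_image",
            "question": f"How do panels {labels} relate to each other?",
            "answer": f"{fig.caption} Context: {fig.context}",
        })
    return instructions

fig = CompoundFigure(
    caption="Axial CT (A) and follow-up CT (B) show interval decrease in lesion size.",
    context="Figure 2 illustrates treatment response over six months.",
    panels=[Panel("A", "Baseline CT with a 3 cm lesion."),
            Panel("B", "Follow-up CT with the lesion reduced to 1 cm.")],
)
pairs = generate_instructions(fig)
```

In this toy setting, a two-panel figure yields two single-panel pairs plus one multi-image pair, which mirrors how sub-tasks can be made tractable individually and then composed for cross-image reasoning.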


📄 Content

From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Zhen Chen1‡, Yihang Fu1‡, Gabriel Madera1,2, Mauro Giuffre1, Serina Applebaum1, Hyunjae Kim1, Hua Xu1, Qingyu Chen1∗
1Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
2School of Medicine, University of Puerto Rico, San Juan, PR 00921, USA
‡Contributed Equally
∗Corresponding author

arXiv:2511.22232v1 [cs.CV] 27 Nov 2025

2 Introduction

Multi-modal large language models (MLLMs) 1–3 combine natural language processing with multi-modal perception capabilities, and are capable of processing and reasoning across textual and visual data. In the general domain, MLLMs have demonstrated remarkable capability in understanding and integrating information across modalities, paving the way for their adaptation to specialized fields 4. Preliminary results in healthcare applications have revealed promising potential, particularly in processing clinical text, answering medical questions, and analyzing visual medical data 5–12. These advancements indicate the prospect of MLLMs to enhance diagnostic processes 13,14, streamline clinical decision-making 15, and support medical education 16. Despite these advances, a critical limitation persists: most existing MLLMs are primarily designed for single-image understanding, which significantly constrains their applicability in real-world medical scenarios involving complex multi-image, multi-modal data.
Compared to single-image tasks, multi-image tasks hold greater practical significance in real-world clinical workflows 17–19. For example, longitudinal monitoring requires comparing multiple images collected across different time points to track disease progression, while clinical diagnosis often integrates medical images from different modalities to provide a comprehensive understanding of a medical case 20,21. For instance, oncologists routinely analyze Magnetic Resonance Imaging (MRI) scans for tumor morphology, Positron Emission Tomography (PET) scans for metabolic activity, and histopathology slides collectively to formulate a comprehensive diagnostic picture 20, while cardiologists and neurologists similarly combine modalities like echocardiography, Computed Tomography (CT), and functional MRI to evaluate heart disease and brain disorders 22,23. These multiple-image scenarios, which constitute a substantial portion of clinical workflows, demand the composite understanding capabilities that synthesize information across multiple medical images. However, existing MLLMs 5–1

