MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
📝 Abstract
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability – frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
📄 Content
Recent advances in large multi-modal and language models have opened the door to general-purpose clinical agents capable of reasoning across diverse biomedical tasks (Moor et al., 2023). Vision-language models can describe pathology images (Lu et al., 2024a; Dai et al., 2025; Lu et al., 2024b), LLMs can summarize clinical notes (Choudhuri et al., 2025; Yang et al., 2024), and medical agents are increasingly able to query tools, retrieve knowledge, and even hold multi-turn clinical conversations (Schmidgall et al., 2024; Wang et al., 2025a). These developments have prompted growing interest in using agents to support complex workflows (Wang et al., 2024a, 2025b; Gao et al., 2024; Lee et al., 2024; Yue et al., 2024; Fallahpour et al., 2025) like those seen in molecular tumor boards (MTBs) (Tsimberidou et al., 2023), where oncologists, radiologists, pathologists, and geneticists jointly analyze a patient's evolving case (Fig. 4).
[Figure 1: a. MTBBench simulates molecular tumor board workflows, presenting agents with longitudinal, multi-modal patient data (H&E, IHC, hematology, and genomics) along with temporally distributed clinical events, posing questions such as "Is there evidence of perineural invasion in the primary tumor site?", "What can be concluded about T-cell (CD3+) infiltration in the tumor center?", and "Will the patient's cancer recur within two years, given all available data?". Agents are tasked with integrating this information to support complex decision-making. b. MTBBench allows benchmarking agents on their ability to reason across modalities and time in order to accurately tackle clinical questions concerning diagnosis, prognosis, and biomarker interpretation. Lastly, we introduce an agentic framework that enables querying both external tools and pretrained foundation models, allowing agents to more effectively reason over complex, multi-modal and temporally resolved clinical information.]
However, the evaluation of such agents remains underdeveloped. Existing benchmarks typically frame tasks as static, uni-modal, single-turn question-answering problems, where the model is given all necessary inputs at once and evaluated on its ability to predict a discrete answer. This setup diverges sharply from how clinical decisions are made in practice. Real-world oncology reasoning is interactive, temporal, and multi-modal: physicians accumulate information over time, integrate findings from multiple data types (e.g., hematoxylin and eosin (H&E) staining, immunohistochemistry (IHC) staining, radiology, blood values, genomics), and make provisional decisions that are updated as new evidence emerges (Fig. 1a). To be useful in these settings, AI agents must not only understand each modality, but also query, contextualize, and reconcile information across modalities and time: capabilities rarely assessed in current evaluations.
Recent works such as MedAgentBench (Jiang et al., 2025), MediQ (Li et al., 2024), and MedJourney (Wu et al., 2024) take steps toward interactive or longitudinal evaluation, but typically in limited or uni-modal contexts (e.g., textual EHRs) (Kweon et al., 2024) (Table 1). Likewise, emerging studies on multi-modal agents demonstrate strong promise but lack standardized evaluation across longitudinal patient trajectories (Li et al., 2024). Most importantly, these agents are not tested under the cognitive demands of tasks that mirror MTB decision-making: questions involving partial data, sequential updates, conflicting information, highly heterogeneous modalities, and high-stakes outcomes.
To address this gap, we introduce MTBBench, an oncology-focused benchmark for evaluating AI agents in longitudinal, multi-modal clinical reasoning. Inspired by the structure and decision flow of real molecular tumor boards, MTBBench simulates patient case reviews where agents must process heterogeneous patient data across time, including pathology slides, lab data, and pathological, surgical, and genomic reports, and answer clinically meaningful questions at each step. Questions span diverse task types, including diagnostic classification, spatial biomarker interpretation, and outcome, progression, or recurrence prediction (Fig. 1b). Importantly, the benchmark is validated by clinicians using a custom-built expert annotation platform (Fig. 5, for details see Appendix C.1), ensuring both the clinical plausibility of the data and the correctness of model evaluation.
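To make the benchmark's structure concrete, the longitudinal setup described above can be sketched as a minimal data model: a patient case is a sequence of timestamped, modality-tagged events, and each question is posed only after the events up to its timestep have been revealed. This is an illustrative sketch, not MTBBench's actual schema; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalEvent:
    day: int        # days since initial presentation (hypothetical timeline unit)
    modality: str   # e.g. "H&E", "IHC", "hematology", "genomics"
    payload: str    # report text or a reference to an image/table

@dataclass
class BenchmarkQuestion:
    day: int        # question is posed once data up to this day is revealed
    task: str       # e.g. "diagnosis", "biomarker", "recurrence"
    text: str
    answer: str     # clinician-validated ground truth

@dataclass
class PatientCase:
    case_id: str
    events: list = field(default_factory=list)
    questions: list = field(default_factory=list)

    def visible_events(self, day: int):
        """Only events observed up to `day` are shown to the agent,
        enforcing the sequential-reveal structure of the case review."""
        return [e for e in self.events if e.day <= day]

# Toy example with fabricated content, purely for illustration.
case = PatientCase(
    case_id="demo-001",
    events=[
        ClinicalEvent(0, "H&E", "primary tumor resection slide"),
        ClinicalEvent(14, "genomics", "somatic variant report"),
    ],
    questions=[
        BenchmarkQuestion(0, "diagnosis", "Is perineural invasion present?", "yes"),
        BenchmarkQuestion(14, "recurrence", "Will the cancer recur within two years?", "no"),
    ],
)

# At day 0 the agent sees one event; by day 14 it sees both.
assert len(case.visible_events(0)) == 1
assert len(case.visible_events(14)) == 2
```

The key design point this sketch captures is that answers made at earlier timesteps must be provisional: the agent answering the day-0 question cannot condition on the day-14 genomics report.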
Beyond benchmark construction, we also introduce a modular agentic framework designed to interface with different tools as well as pretrained foundation models (Fig. 1b). These include foundation models pretrained on specialized biomedical data. Agents can query these foundation models as part of their reasoning process, invoking them when needed to interpret image regions, extract genomic signatures, or cross-reference trial data, thus mirroring how expert clinicians rely on specialized resources in practice. This framework enables flexible, multi-step decision-making and substantially enhances the agent's ability to synthesize information across modalities and time.
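The tool-querying pattern described above can be sketched as a simple dispatch loop: the agent produces a plan of named tool calls, each call routes to a specialized model, and the returned evidence is accumulated for the final answer. Everything here is hypothetical; the tool names, stub implementations, and dispatcher are illustrative stand-ins, not the framework's actual API, and in the real system the plan would be generated step-by-step by an LLM rather than fixed in advance.

```python
# Stand-ins for specialized foundation models the agent can invoke.
def pathology_tool(region: str) -> str:
    # Hypothetical pathology model scoring an image region.
    return f"tumor cellularity high in {region}"

def genomics_tool(variants: list) -> str:
    # Hypothetical genomic-signature model flagging actionable variants.
    return "actionable: " + ", ".join(v for v in variants if "KRAS" in v)

TOOLS = {"pathology": pathology_tool, "genomics": genomics_tool}

def run_agent(plan: list) -> list:
    """Execute a sequence of (tool_name, argument) calls and collect
    the returned evidence for downstream reasoning."""
    evidence = []
    for tool_name, arg in plan:
        evidence.append(TOOLS[tool_name](arg))
    return evidence

evidence = run_agent([
    ("pathology", "tumor center"),
    ("genomics", ["KRAS G12D", "TP53 R175H"]),
])
assert evidence[0] == "tumor cellularity high in tumor center"
assert evidence[1] == "actionable: KRAS G12D"
```

The design choice this illustrates is separation of concerns: the agent decides *which* specialist to consult and *when*, while each foundation model handles one modality, which is how the framework can improve multi-modal and longitudinal reasoning without retraining the base LLM.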
This content is AI-processed based on ArXiv data.