Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis

Sheng Lu†,* (Ruijin Hospital), Hao Chen* (University of Cambridge), Rui Yin (Nanjing First Hospital), Juyan Ba (Shenzhen University), Yu Zhang (Shanghai Jiao Tong University), Yuanzhe Li† (Ruijin Hospital)

* Equal contribution. † Correspondence: Sheng Lu and Yuanzhe Li (ls12593@rjh.com.cn, yuanzhe@shuzhiweitech.com).

Abstract

Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, an endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of the clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs. Dataset link: https://huggingface.co/datasets/HaoChen2/Gastric-X.

1. Introduction

Gastric cancer remains one of the most fatal malignancies worldwide, responsible for over 769,000 deaths in recent global estimates [4, 13]. Its diagnosis requires integrating heterogeneous, multimodal sources of evidence, including multi-phase 3D CT imaging, endoscopic assessment, laboratory tests, and patient history [9, 10, 31, 53, 58]. Among these modalities, CT plays a central role in lesion localization and clinical staging [30], yet evaluations remain subject to substantial inter-observer variability [29]. As the volume and complexity of patient data increase, clinicians face a growing cognitive burden, and diagnostic outcomes may vary across institutions and expertise levels [8, 39, 54]. These challenges underscore an urgent need for automated systems capable of jointly perceiving and reasoning across diverse clinical data streams.

In parallel, rapid progress in vision-language models (VLMs) has reshaped our understanding of cross-modal learning in the general domain. Models such as CLIP [49], ALIGN [25], BLIP [33], and Flamingo [1] demonstrate that large-scale alignment of images and text can unlock powerful representations capable of classification, captioning, and visual question answering. These successes have naturally inspired the development of medical VLMs [3, 18, 37, 48].
Yet, the datasets that currently underpin medical VLM research paint a very different picture from real-world clinical practice. Most available benchmarks focus on single-modality 2D radiographs paired with free-text reports, such as MIMIC-CXR [28], CheXpert [23], and PadChest [5]. A few recent efforts have begun to explore volumetric data, e.g., MedVL-CT69K [55] and 3D-RAD [15], but these datasets still primarily emphasize image-report matching and rarely incorporate the richer set of modalities or multi-phase volumes (with the exception of MedVL-CT69K) that are indispensable for cancer diagnosis.

This leaves a fundamental gap: clinical decision-making, especially in oncology, is inherently multimodal. Radiologists and oncologists must integrate multi-phase 3D imaging, structured laboratory indicators, lesion localization, and clinical narratives to form a coherent diagnosis. Imaging alone provides only a partial view; without datasets that capture this complexity, VLMs often rely on superficial correlations and fail to generalize to real clinical reasoning.

Figure 1. Overview of the multi-modal information in the proposed Gastric-X. The center panel shows a schematic gastric representation alongside an endoscopic image. The left panel presents examples of structured laboratory data (e.g., blood counts, serum biochemistry, and tumor markers) and clinical textual reports (diagnostic and CT reports) that reflect real-world radiological reasoning. The right panel illustrates multi-phase 3D CT scans (non-contrast, arterial, venous, and equilibrium phases) with multi-class lesion annotations, including tumor, perigastric carcinoma, and stomach regions.

To address this critical gap, we introduce Gastric-X, a new multimodal benchmark constructed directly from real-world gastric cancer diagnostic workflows. Gastric-X integrates four key data modalities and annotations essential to clinical decision-making: (1) multi-phase 3D CT scans (non-contrast, arterial, venous, and equilibrium phases) capturing tumor characteristics across enhancement stages [30]; (2) endoscopic images that visualize the tumor; (3) structured patient profiles and biochemical laboratory measurements used in gastric cancer staging [57]; and (4) comprehensive clinical and radiology reports, along with expert-annotated ground truths including disease stages and precise 3D bounding boxes of tumors, which follow the standard clinical workflow for gastric cancer localization. Additionally, the dataset provides curated VQA pairs that probe cross-modal understanding. As shown in Figure 1, the right panel illustrates multi-phase CT imaging and annotated bounding box examples, the left panel presents structured laboratory indicators and clinical reports, and the center highlights the captured endoscopic image. This multimodal set mirrors the information pipeline encountered by radiologists in practice.

Contributions. The main contributions are:
• We introduce Gastric-X, a multimodal gastric cancer dataset designed to reflect how clinicians actually reason, integrating four modalities (multi-phase 3D CT scans, endoscopic images, laboratory measurements, and diagnostic reports) into one unified resource.
• To ground the dataset in real clinical semantics, we provide expert-crafted annotations, including precise 3D bounding boxes for tumors and perigastric lesions, detailed disease stages, and a curated set of VQA pairs that probe cross-modal understanding.
• Building on these components, we establish a comprehensive benchmark of five clinically meaningful tasks, offering a standardized platform for evaluating and advancing multimodal medical VLMs.

2. Related Work

2.1. Vision-Language Models

Vision-language models (VLMs) learn joint image-text representations from large-scale data, enabling strong zero-shot generalization [26, 50, 65] beyond traditional supervised learning [17]. The CLIP model [49] established contrastive learning as the foundation for aligning visual and linguistic spaces. Building on this, later works introduced multi-objective training [56, 67] and unified architectures [24, 60], extending VLM capabilities to dense prediction tasks such as object detection [35, 66]. For downstream adaptation, lightweight techniques such as prompt tuning [73, 74] and feature adaptation [16] efficiently tailor pretrained models to new tasks, enabling VLMs to serve as versatile foundations for multimodal reasoning.

Table 1. Comprehensive comparison of medical vision-language datasets. Multi-phase: scans or images captured at different stages or conditions. Biochemical data: structured laboratory measurements (e.g., serology results) or Electronic Health Records (EHRs). Lesion label: annotations of lesions, including masks or bounding boxes. Textual Modality: how text-based labels are provided; VQA Pairs indicates the dataset is primarily designed for visual question answering, while Reports indicates that original diagnostic reports are available.

| Dataset | Release | Domain | Link | Image Modalities | #Scans / Images | Multi-phase | Biochemical Data | Lesion Label | Textual Modality |
|---|---|---|---|---|---|---|---|---|---|
| PathVQA [21] | 2020 | Diverse | link | HP | 5.00K Images | ✗ | ✗ | ✗ | VQA Pairs |
| PadChest [5] | 2020 | Chest | link | XR | 160K Images | ✗ | ✗ | ✓ | Reports |
| SLAKE [40] | 2022 | Diverse | link | CT, MR, XR | 642 Items | ✗ | ✗ | ✓ | VQA Pairs |
| Merlin [3] | 2024 | Abdomen | link | CT | 25.5K Scans | ✗ | ✓ | ✗ | Reports |
| MIMIC-CXR v2 [27] | 2024 | Chest | link | XR | 377K Images | ✗ | ✓ | ✗ | Reports |
| CT-RATE [19] | 2024 | Chest | link | CT | 25.7K Scans | ✗ | ✗ | ✗ | Reports |
| GEMeX [41] | 2025 | Chest | link | XR | 151K Images | ✗ | ✗ | ✗ | VQA Pairs |
| MedVL-CT69K [7] | 2025 | Diverse | – | CT | 272K Scans | ✓ | ✗ | ✗ | Reports |
| 3D-RAD [15] | 2025 | Diverse | link | CT | 16.1K Scans | ✗ | ✗ | ✗ | VQA Pairs |
| Gastric-X | 2025 | Gastric | – | CT | 7.1K Scans, 1.7K Images | ✓ | ✓ | ✓ | Reports |

Notes. Diverse: includes multiple organs. CT: Computed Tomography; HP: Histopathology; MR: Magnetic Resonance; XR: X-ray.

2.2. VLMs in Medical Diagnosis

Inspired by the success of general-domain VLMs, medical adaptations have sought to extend them to support clinical diagnosis. Early studies primarily aligned 2D X-ray images with radiology reports through contrastive objectives [11, 12, 59, 72]. Subsequent work refined these representations via fine-grained region-text correspondence [22, 48, 61] and the integration of structured medical knowledge [36, 42, 64, 69, 71]. However, most efforts remain limited to 2D imagery, with only recent attempts exploring 3D visual-language learning from computed tomography (CT) volumes [3, 6, 18, 37]. A further challenge lies in the reliance on single-volume inputs, while real-world diagnosis often requires multiple imaging sequences, for instance four MRI sequences for brain tumors or two (ideally four) CT volumes for gastric cancer. These gaps highlight the need for domain-specific 3D VLM frameworks that better reflect clinical diagnostic practice.
2.3. Multi-Modal Datasets for VLM Diagnosis

Progress in medical VLMs is closely tied to the availability of high-quality, multi-modal datasets. Foundational resources such as TCIA [14] and TCGA [62] have provided extensive data that underpin cancer research. In radiology, dataset development has evolved rapidly, from early 2D efforts linking images to free-text reports, such as MIMIC-CXR [2] and PadChest [5], to vision-language benchmarks with structured supervision like SLAKE [40] and GEMeX [41]. More recently, 3D CT datasets including MedVL-CT69K [55] and 3D-RAD [15] have shifted attention toward volumetric reasoning and multi-modal integration. While these developments mark important progress toward holistic medical understanding, multi-modal datasets dedicated to cancer diagnosis, such as gastric cancer, remain scarce, limiting the advancement of specialized 3D VLMs in this critical domain.

3. Gastric-X: Design and Construction

In this section, we introduce Gastric-X, a clinically inspired multi-modal dataset designed to bridge real-world diagnostic workflows and vision-language modeling. We first describe its clinical motivation and guiding design principles, followed by a detailed overview of its multimodal composition and associated reasoning tasks.

3.1. Overview and Background

Gastric-X is a multimodal gastric cancer dataset specifically curated for vision-language model research. Its design is motivated by how clinicians diagnose gastric cancer in practice. Clinically, diagnosis is never based on a single modality: radiologists analyze multiple CT phases, while gastroenterologists interpret endoscopic findings. These imaging assessments are integrated with biochemical tests and laboratory results to form a comprehensive diagnostic conclusion.

This complex multi-modal reasoning process inspired the creation of Gastric-X: a dataset that mirrors real diagnostic reasoning by aligning these heterogeneous data sources within a unified framework. Each patient record in Gastric-X includes quad-phase CT scans, endoscopic images, biochemical indicators, expert bounding box annotations of lesions, detailed imaging and endoscopic reports, and final diagnostic summaries. To facilitate advanced reasoning tasks, Gastric-X is further enriched with Visual Question Answering (VQA) annotations, making it a comprehensive benchmark for clinical VLM development. The overall dataset distribution is shown in Figure 2; a schematic of how such a patient record might be organized is sketched below.

Figure 2. Gastric-X overview. (A) Overall stage distribution by gender (67.5% male, 32.5% female). (B) Distribution of 5 tumor markers on a logarithmic scale. (C) Overall distribution of CT slice counts across 4 different phases. (D) Sankey diagram illustrating the transitions between T, N, and M stages and corresponding overall stages. (E) Histogram of tumor sizes with a cumulative percentage curve. (F) Word cloud summarizing the most frequent terms from radiology reports. Zoom in for better visualization.
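For concreteness, the snippet below is a minimal sketch of how a single Gastric-X case could be represented in code. The field names, file layout, and example values are our own illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative schema only: field names and file layout are assumptions,
# not the actual Gastric-X release format.
@dataclass
class GastricXCase:
    patient_id: str
    ct_volumes: Dict[str, str] = field(default_factory=dict)       # phase name -> CT volume path
    endoscopy_images: List[str] = field(default_factory=list)      # paths to endoscopic images
    lab_indicators: Dict[str, float] = field(default_factory=dict) # structured biochemical measurements
    # Per-phase 3D boxes (x1, y1, z1, x2, y2, z2) for tumor / lymph nodes / stomach
    bboxes: Dict[str, Dict[str, List[float]]] = field(default_factory=dict)
    ct_report: str = ""
    endoscopy_report: str = ""
    diagnosis_report: str = ""
    tnm_stage: Optional[str] = None       # e.g. "T3N1M0"
    overall_stage: Optional[str] = None   # e.g. "Stage IIB"
    vqa_pairs: List[Dict[str, str]] = field(default_factory=list)  # [{"question": ..., "answer": ...}]

# Example with hypothetical values:
case = GastricXCase(
    patient_id="GX_0001",
    ct_volumes={p: f"GX_0001/ct/{p}.nii.gz"
                for p in ("non_contrast", "arterial", "venous", "equilibrium")},
    lab_indicators={"CEA": 7.2, "CA199": 41.5},
    tnm_stage="T3N1M0",
)
```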
3.2. Multi-modal Data

Our dataset is inherently multi-modal, encompassing diverse types of medical data that reflect how clinicians integrate information from multiple sources during diagnosis.

Imaging Modality. The foundation of Gastric-X is its diverse and complementary imaging data. We begin with the CT modality, collected across four key phases: non-contrast, arterial, venous, and equilibrium. Each phase offers a unique view of vascular perfusion and tissue enhancement, and together they capture the dynamic passage of contrast through the gastric wall. In clinical practice, radiologists integrate these dynamic differences to form a holistic understanding of disease progression; Gastric-X reflects this process in data form.

Complementing CT, Gastric-X also includes endoscopic images, which provide high-resolution, color-rich views of the gastric mucosa. These images expose fine-grained textures, color variations, and microvascular structures invisible to CT. Subtle irregularities, such as fold distortion, erosion, or discoloration, often signal the earliest stages of malignancy and help pinpoint tumor location and stage.

Gastric-X comprises 7.1K CT scans (a total of 83.48K slices) and 1.7K endoscopic images. We unify these complementary views to support a holistic understanding of gastric cancer through multi-modal imaging. The distribution of multi-phase CT slices is shown in Figure 2C.

Stages and Lesion Annotation. Gastric cancer is clinically staged using the TNM system, which characterizes disease progression by assessing the primary tumor (T), regional lymph nodes (N), and distant metastasis (M). This framework guides diagnosis, treatment planning, and prognosis estimation, ultimately determining the overall stage. The distribution of overall stages by gender is shown in Figure 2A, the transition from TNM components to overall stages is illustrated in Figure 2D, and the tumor size distribution is presented in Figure 2E.

For each CT phase, Gastric-X provides detailed 3D bounding box (BBox) annotations covering both tumor lesions and relevant organs. The annotations are organized into three hierarchical levels to capture different diagnostic focuses: the tumor core, the regional lymph nodes, and the entire stomach region. Each CT study includes three BBoxes per phase across four phases; over the 1.74K patients this yields a rich and comprehensive annotation set of 21,408 BBoxes for multi-scale lesion analysis.

Biochemical Data. Beyond imaging, clinicians' diagnostic reasoning often turns to the language of biochemical tests, signals that quietly reflect internal pathology. Gastric-X captures this dimension through 11 serum biochemistry indicators (e.g., liver and renal function markers) and 5 key tumor markers commonly used in gastric cancer screening. To complement these laboratory measures, we further include comprehensive electronic health records (EHR) detailing surgical procedures, medication history, and treatment progress across 134 structured items. The tumor marker distribution is shown in Figure 2B.

Reports. Gastric-X also incorporates rich clinical reports that mirror the reasoning chain of real-world diagnosis. Each patient record is paired with three complementary types of reports: (1) the CT report, describing lesion morphology, enhancement patterns, and staging impressions; (2) the endoscopic report, detailing mucosal appearance, color, and biopsy findings; and (3) the diagnosis report, which summarizes pathological confirmation and clinical outcomes. The word cloud of diagnosis reports is shown in Figure 2F.

Building upon these reports, we construct 26,760 visual question-answer (VQA) pairs, designed to transform narrative clinical observations into structured reasoning tasks.
3.3. Comparison to Other Datasets

As summarized in Table 1, existing medical vision-language datasets primarily focus on static imaging modalities (e.g., X-ray, CT, pathology slides) paired with diagnostic reports or VQA annotations. While datasets such as PathVQA [21], SLAKE [40], and GEMeX [41] have advanced multimodal research, they lack the dynamic information of multi-phase imaging and structured clinical measurements.

More recent datasets, including Merlin [3] and MIMIC-CXR [27], begin to incorporate biochemical data; structured EHR components such as serology results offer quantitative insights into physiological states not captured by imaging alone. For gastric cancer specifically, TCGA-STAD [46] provides 46 subjects with CT scans and accompanying clinical, genomic, and histopathology data, but remains limited in scale.

In contrast, Gastric-X introduces a previously missing multimodal configuration: multi-phase 3D CT, endoscopy, structured biochemistry, lesion annotations, and clinical text, all aligned at the patient level. No existing dataset offers such a comprehensive setting. This design provides the multimodal supervision necessary for realistic clinical reasoning and establishes a new standard for holistic medical understanding.

4. Adapting VLMs to Our Multi-Modal Data

To evaluate how current vision-language models (VLMs) handle our multi-modal medical data, we adapt both general-purpose and medical-specific models to Gastric-X. The overall adaptation and the associated tasks are illustrated in Figure 3.

Figure 3. VLM adaptation. In adapting VLMs to our dataset, the visual encoder incorporates multi-phase CT inputs, while test tables, textual queries, and lesion-localization cues serve as complementary multimodal inputs guiding the model's diagnostic reasoning. The VLMs must effectively adapt to these diverse input modalities to better capture and perform the targeted clinical tasks.

Model Selection. We select LLaVA-1.5-7B [43], BLIP-2 [34], and X2-VLM [68] as representative general VLMs, and LLaVA-Med v1.5 [32], Med-Flamingo [47], and MedVInT [70] as medical-domain models.

CT Adaptation. Our dataset comprises multi-phase 3D CT volumes, whereas most VLMs are designed for 2D or single-image inputs. For LLaVA-1.5 [43] and BLIP-2 [34], which natively support multi-channel image inputs, we concatenate CT slices across phases to form multi-channel representations. For X2-VLM [68], whose architecture is 2D, we replace its vision encoder with a 3D Swin Transformer [44, 45] and its text encoder with Med-BERT [51], both initialized with pre-trained weights; the adapted model is referred to as X2-VLM-Med. For medical-specific models, Med-Flamingo and MedVInT already support volumetric inputs in the form of sequential 2D (pseudo-3D) processing and therefore require only minimal modification. In contrast, LLaVA-Med natively accepts only 2D inputs, so we adopt the same strategy used for X2-VLM [68], replacing its vision encoder with a Swin Transformer [45] to enable effective 3D volume processing.
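To make the multi-phase packing concrete, the snippet below shows one plausible way to stack four registered CT phases into a multi-channel tensor for models that accept multi-channel 2D inputs. The tensor shapes and the slice-sampling strategy are our assumptions rather than the paper's exact recipe.

```python
import torch

def pack_phases_as_channels(phase_volumes, num_slices=32):
    """Stack four registered CT phases into a multi-channel input.

    phase_volumes: dict mapping phase name -> tensor of shape (D, H, W),
                   assumed already resampled and spatially aligned.
    Returns a tensor of shape (num_slices, 4, H, W): for each sampled
    axial slice, the four phases become the channel dimension.
    """
    phases = ("non_contrast", "arterial", "venous", "equilibrium")
    vols = [phase_volumes[p] for p in phases]          # each (D, H, W)
    depth = vols[0].shape[0]
    # Uniformly sample a fixed number of axial slices (illustrative choice).
    idx = torch.linspace(0, depth - 1, num_slices).long()
    slices = [v[idx] for v in vols]                    # each (num_slices, H, W)
    return torch.stack(slices, dim=1)                  # (num_slices, 4, H, W)

# Example with dummy volumes of 120 slices at 288x288 resolution:
dummy = {p: torch.randn(120, 288, 288)
         for p in ("non_contrast", "arterial", "venous", "equilibrium")}
print(pack_phases_as_channels(dummy).shape)  # torch.Size([32, 4, 288, 288])
```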
Retrieval Extension. Some models lack built-in retrieval capability. We add a lightweight retrieval head to enable bidirectional image-text matching, allowing consistent evaluation of all models on cross-modal retrieval tasks.

Additional Modalities. Clinical diagnosis rarely relies on imaging alone; clinicians naturally integrate visual cues with structured biomedical measurements. To mirror this workflow with minimal architectural changes, we introduce two lightweight but impactful sources of auxiliary information to support the VLMs: bounding-box cues and biomedical test tables. The bounding boxes are rendered directly on the CT slices as colored overlays, serving as soft spatial priors that nudge the VLM's attention toward clinically relevant regions without altering the visual encoder.

In contrast, biomedical test tables present a unique challenge: they contain dozens of measurements, many of which are irrelevant or fall within normal ranges. Rather than feeding the entire table, we mimic the behavior of clinicians, who first look for abnormal values, i.e., values that exceed physiological thresholds, and reason from there. We extract only these abnormal entries and convert them into concise textual descriptors that include the test name, its measured value, and the ratio by which it exceeds the normal limit.
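As an illustration of this filtering step, the following is a minimal sketch. The reference ranges, indicator names, and descriptor wording are hypothetical placeholders, not the rules actually used for Gastric-X.

```python
# Minimal sketch of abnormal-value extraction; reference ranges and the
# descriptor wording are illustrative assumptions, not the dataset's exact rules.
NORMAL_RANGES = {            # hypothetical reference ranges (lower, upper)
    "CEA": (0.0, 5.0),       # ng/mL
    "CA199": (0.0, 37.0),    # U/mL
    "ALT": (9.0, 50.0),      # U/L
    "Albumin": (40.0, 55.0), # g/L
}

def abnormal_descriptors(measurements):
    """Convert raw lab measurements into short textual descriptors,
    keeping only values that fall outside their reference range."""
    lines = []
    for name, value in measurements.items():
        lo, hi = NORMAL_RANGES.get(name, (float("-inf"), float("inf")))
        if value > hi:
            lines.append(f"{name} = {value} ({value / hi:.1f}x upper limit)")
        elif value < lo:
            lines.append(f"{name} = {value} (below lower limit {lo})")
    return "; ".join(lines)

print(abnormal_descriptors({"CEA": 12.4, "CA199": 20.0, "Albumin": 33.0}))
# CEA = 12.4 (2.5x upper limit); Albumin = 33.0 (below lower limit 40.0)
```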
5. Experiments

Implementation Details. All models are fine-tuned with the AdamW optimizer (learning rate 5e-5, weight decay 0.01, batch size 32) under a linear warm-up (10% of steps) and decay schedule; the text encoder is trained with a 2x learning rate relative to the visual encoder. All models are trained on a single NVIDIA RTX 3090 GPU, initialized from the X2-VLM checkpoint, and trained independently for each task. The dataset is split at the patient level into training, validation, and test sets (70/15/15).

Multi-Modal Input Settings. In clinical practice, multiple modalities are often examined together to support diagnosis. To emulate this process and evaluate the contribution of each modality, we design four input schemas: (1) Image Only, using visual information alone; (2) Image + Table, combining images with biomedical test results; (3) Image + BBox, incorporating both images and bounding box annotations; and (4) Image + Table + BBox, integrating all available modalities for a comprehensive diagnostic setting.

Task Settings. To comprehensively evaluate multi-modal understanding, we design five challenges within our dataset: report generation, visual question answering, cross-modal retrieval, phase classification, and lesion detection. In all task result tables, the best and second-best results are indicated in bold and underlined, respectively.

A summary of model performance across all configurations in the three main tasks is illustrated in Figure 4. The results are reported using X2-VLM-Med [68] and demonstrate consistent improvements when incorporating multimodal information. Notably, the Image + Table + BBox configuration achieves the most balanced and superior results across all evaluation metrics. The detailed configurations and results for each task are presented in the following subsections.

Figure 4. Radar plot comparing multimodal configurations across three medical vision-language tasks. The Image + Table + BBox configuration achieves the highest overall performance across all evaluation metrics.

5.1. Visual Question Answering

As shown in Table 2a, across all four evaluation settings a consistent upward trend is observed in model performance as the input conditions become progressively richer. Specifically, all models exhibit improvements in Precision, Accuracy, F1, and Area Under the Curve (AUC) from the image-only setting to the fully integrated one, indicating that exposure to more informative input combinations substantially enhances visual-language reasoning.

Among the compared methods, X2-VLM-Med [68] achieves the best overall performance across all metrics and settings, demonstrating superior generalization and multimodal alignment capabilities. Its AUC steadily increases from 85.3% (Image Only) to 91.5% (Image + Table + BBox), reflecting robust discriminative power under varying configurations. Med-Flamingo [47] consistently ranks second, highlighting its strong medical-domain adaptation, while MedVInT [70] closely follows with competitive gains across all metrics.

Table 2. Performance across modalities for visual question answering and report generation. We evaluate four input-modality settings: Image Only and combinations that add Table, Bounding Box (BBox), or both. In the paper, the best and second-best results in each column are shown in bold and underlined, respectively.

(a) Visual Question Answering. Prec.: Precision, Acc.: Accuracy, AUC: Area Under the ROC Curve (%).

| Model | Image Only (Prec. / Acc. / F1 / AUC) | Image + Table | Image + BBox | Image + Table + BBox |
|---|---|---|---|---|
| LLaVA-1.5-7B [43] | 42.5 / 53.1 / 46.2 / 67.8 | 48.3 / 59.7 / 52.4 / 72.1 | 50.1 / 61.0 / 54.3 / 73.0 | 55.2 / 66.5 / 59.1 / 77.8 |
| LLaVA-Med v1.5 [32] | 56.3 / 64.7 / 59.9 / 78.4 | 62.5 / 71.8 / 65.7 / 82.0 | 63.8 / 72.9 / 66.9 / 82.6 | 67.4 / 76.1 / 69.8 / 85.0 |
| Med-Flamingo [47] | 60.8 / 68.9 / 63.9 / 80.5 | 66.9 / 74.8 / 69.7 / 84.2 | 68.1 / 75.4 / 70.9 / 84.8 | 71.0 / 78.9 / 73.5 / 86.5 |
| BLIP-2 [34] | 50.2 / 59.4 / 53.7 / 73.1 | 56.0 / 64.8 / 58.9 / 77.0 | 58.0 / 66.4 / 60.7 / 78.2 | 61.3 / 69.7 / 63.9 / 80.1 |
| MedVInT [70] | 58.1 / 66.0 / 60.9 / 79.0 | 63.4 / 71.5 / 65.9 / 82.5 | 65.0 / 72.3 / 67.4 / 83.0 | 68.0 / 75.8 / 70.1 / 85.4 |
| X2-VLM-Med [68] | 66.7 / 75.2 / 69.8 / 85.3 | 72.4 / 80.1 / 74.5 / 88.7 | 74.0 / 81.3 / 76.2 / 89.2 | 77.8 / 84.6 / 79.9 / 91.5 |

(b) Report Generation. Evaluation metrics include ROUGE-L (R-L), BLEU-4 (B-4), METEOR (MTR), and BERTScore F1 (BS-F1).

| Model | Image Only (R-L / B-4 / MTR / BS-F1) | Image + Table | Image + BBox | Image + Table + BBox |
|---|---|---|---|---|
| LLaVA-1.5-7B [43] | 28.4 / 9.2 / 14.1 / 45.3 | 34.2 / 12.1 / 17.6 / 51.2 | 36.0 / 13.5 / 18.9 / 53.1 | 40.1 / 16.3 / 21.7 / 57.8 |
| LLaVA-Med v1.5 [32] | 35.7 / 13.8 / 18.9 / 53.6 | 42.0 / 17.2 / 23.4 / 60.0 | 44.1 / 18.7 / 25.1 / 62.6 | 48.7 / 21.9 / 28.4 / 66.5 |
| Med-Flamingo [47] | 42.1 / 18.6 / 23.4 / 60.2 | 49.0 / 22.4 / 28.2 / 67.0 | 51.2 / 24.1 / 30.1 / 69.4 | 55.6 / 27.8 / 34.6 / 73.1 |
| BLIP-2 [34] | 36.8 / 15.2 / 19.7 / 55.1 | 43.5 / 19.0 / 25.0 / 61.9 | 45.6 / 20.6 / 26.8 / 64.0 | 50.2 / 23.7 / 30.5 / 67.7 |
| MedVInT [70] | 38.9 / 16.9 / 21.0 / 57.3 | 45.1 / 20.8 / 26.9 / 63.5 | 47.2 / 22.3 / 28.7 / 65.9 | 52.0 / 25.5 / 32.4 / 69.8 |
| X2-VLM-Med [68] | 49.5 / 24.1 / 29.6 / 68.7 | 56.8 / 29.3 / 35.4 / 76.2 | 58.1 / 31.1 / 37.8 / 78.3 | 62.3 / 34.5 / 41.6 / 82.0 |

5.2. Report Generation

We evaluate report generation quality using four complementary metrics: ROUGE-L (R-L), BLEU-4 (B-4), METEOR (MTR), and BERTScore F1 (BS-F1), covering lexical overlap, fluency, and semantic fidelity between generated and reference reports.
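For reference, the snippet below shows one common way to compute two of these metrics with the open-source rouge-score and NLTK packages; it is a minimal sketch of standard implementations, not the benchmark's official evaluation code, and the example sentences are invented.

```python
# Minimal sketch using standard open-source metric implementations
# (rouge-score and NLTK); not the benchmark's exact evaluation code.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "irregular gastric wall thickening in the antrum with enhancement"
candidate = "gastric wall thickening is seen in the antrum with marked enhancement"

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
r_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = r_scorer.score(reference, candidate)["rougeL"].fmeasure

# BLEU-4: up-to-4-gram precision with smoothing for short clinical sentences.
bleu_4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L = {rouge_l:.3f}, BLEU-4 = {bleu_4:.3f}")
# METEOR and BERTScore can be computed analogously with
# nltk.translate.meteor_score and the bert-score package.
```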
As shown in Table 2b, all models exhibit a consistent upward trend across the four experimental settings, mirroring the pattern observed in Table 2a. X2-VLM-Med [68] achieves the best overall performance, with BERTScore F1 improving from 68.7 in the Image Only setting to 82.0 with Image + Table + BBox, highlighting its strong cross-modal reasoning and text generation capabilities.

5.3. Cross-modal Retrieval

Table 4 presents cross-modal retrieval results under the Image Only setting, in both the Image-to-Text and Text-to-Image directions, evaluated using Recall at K (Recall@K), Median Rank (MedR), Mean Rank (MnR), and mean Average Precision (mAP).

Table 4. Cross-modal retrieval. Retrieval results in both Image-to-Text and Text-to-Image directions. Metrics include Recall@K (%, higher is better), Median Rank and Mean Rank (lower is better), and mean Average Precision (mAP). In the paper, bold and underlined numbers denote the best and second-best performance in each column.

| Model | Image→Text (R@1 / R@5 / R@10 / MedR / MnR / mAP) | Text→Image (R@1 / R@5 / R@10 / MedR / MnR / mAP) |
|---|---|---|
| LLaVA-1.5-7B [43] | 24.3 / 52.7 / 63.1 / 11.2 / 28.4 / 38.5 | 22.1 / 50.3 / 61.7 / 12.4 / 29.8 / 36.9 |
| LLaVA-Med v1.5 [32] | 35.6 / 68.9 / 78.3 / 7.5 / 19.6 / 52.7 | 33.8 / 65.2 / 76.1 / 8.1 / 21.4 / 50.3 |
| Med-Flamingo [47] | 42.8 / 74.1 / 83.5 / 6.2 / 17.3 / 57.9 | 41.5 / 72.6 / 82.8 / 6.6 / 18.4 / 56.8 |
| BLIP-2 [34] | 39.7 / 70.5 / 80.9 / 6.9 / 18.1 / 55.3 | 37.6 / 68.9 / 79.2 / 7.4 / 19.7 / 53.2 |
| MedVInT [70] | 40.1 / 73.6 / 82.4 / 6.4 / 17.1 / 56.4 | 39.4 / 71.8 / 81.2 / 6.8 / 17.9 / 55.1 |
| X2-VLM-Med [68] | 48.9 / 80.7 / 88.2 / 4.9 / 13.5 / 63.1 | 47.5 / 79.3 / 87.4 / 5.2 / 14.1 / 61.7 |

Across all metrics, X2-VLM-Med [68] achieves the best performance, reaching an R@1 of 48.9% for Image→Text and 47.5% for Text→Image, substantially outperforming the other models. Med-Flamingo [47] consistently ranks second, reflecting effective multimodal reasoning within the medical domain. MedVInT [70] and BLIP-2 [34] deliver competitive results but remain behind the top two models, while LLaVA-Med v1.5 [32] and LLaVA-1.5-7B [43] lag further due to limited retrieval-specific adaptation.
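To make these retrieval metrics concrete, the snippet below computes Recall@K, Median Rank, and Mean Rank from a generic image-text similarity matrix; this is a small illustrative sketch with toy data, not the benchmark's official evaluation script.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K, MedR, and MnR from a similarity matrix.

    sim: (N, N) array where sim[i, j] is the similarity between image i
         and text j, and the ground-truth pair for image i is text i.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                    # texts ranked per image
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Toy example with random, noisily paired image/text embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
txt = img + 0.5 * rng.normal(size=(100, 64))
print(retrieval_metrics(img @ txt.T))
```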
5.4. Disease Stage Classification

Table 5 presents the results for disease stage classification across all evaluated models. We observe a consistent performance hierarchy across input configurations, broadly aligned with both model capacity and multimodal alignment strength. Conventional convolutional architectures such as ResNet-50 [20] lag considerably behind transformer-based approaches, reflecting their limited ability to capture long-range and cross-modal dependencies. In contrast, the Swin Transformer [44] achieves notable gains over the ResNet baseline, confirming that hierarchical self-attention effectively models localized medical features and spatial-scale variations.

Among multimodal encoders, MedVInT [70] and LLaVA-Med v1.5 [32] demonstrate moderate improvements, particularly when incorporating table cues; however, their performance saturates as the modality complexity increases (Image + Table + BBox). In contrast, X2-VLM-Med [68] consistently achieves the strongest results across all metrics and input configurations, outperforming the second-best model by a clear margin. The gain is most pronounced in AUC (+1.6 over Swin on Image + Table + BBox).

Table 5. Disease stage classification. For each configuration, we evaluate Precision, Recall, F1 score, and Area Under the Curve (AUC). These results show how integrating multimodal cues and medical-aware pretraining benefits fine-grained disease staging. In the paper, bold and underlined numbers indicate the best and second-best performance in each column.

| Model | Image Only (Prec. / Rec. / F1 / AUC) | Image + Table | Image + BBox | Image + Table + BBox |
|---|---|---|---|---|
| ResNet-50 [20] | 72.8 / 70.6 / 71.6 / 80.3 | 74.1 / 72.0 / 73.0 / 81.2 | 75.9 / 73.7 / 74.7 / 82.1 | 76.4 / 74.8 / 75.5 / 82.7 |
| Swin Transformer [44] | 79.5 / 78.0 / 78.7 / 87.1 | 80.4 / 79.2 / 79.8 / 88.0 | 82.1 / 80.9 / 81.5 / 89.2 | 83.4 / 82.0 / 82.7 / 90.1 |
| MedVInT [70] | 77.1 / 76.2 / 76.6 / 85.5 | 81.0 / 79.8 / 80.4 / 87.6 | 80.6 / 79.4 / 80.0 / 88.3 | 81.2 / 80.1 / 80.6 / 88.7 |
| LLaVA-Med v1.5 [32] | 75.8 / 74.3 / 75.0 / 84.0 | 78.0 / 76.6 / 77.3 / 85.3 | 78.8 / 77.5 / 78.1 / 86.1 | 79.5 / 78.0 / 78.7 / 86.6 |
| X2-VLM-Med [68] | 80.6 / 79.2 / 79.8 / 88.3 | 82.4 / 80.9 / 81.6 / 89.4 | 83.2 / 82.0 / 82.5 / 90.2 | 83.9 / 82.6 / 83.2 / 90.8 |

5.5. Lesion Detection

Table 6 summarizes lesion detection performance under COCO evaluation settings [38], reporting COCO-style AP and F1. Faster R-CNN [52] serves as a strong convolutional baseline but shows moderate precision, with an AP@0.5 of 64.1 and a localization accuracy of 70.4. The Swin Transformer [63] improves significantly across all metrics, benefiting from hierarchical self-attention that captures multi-scale lesion structures. MedVInT [70] further boosts performance, attaining the highest AP@0.5 of 72.1 and a mean AP of 50.2, reflecting superior spatial reasoning through multimodal medical pretraining. LLaVA-Med v1.5 [32] maintains strong overall detection quality, especially in F1@0.5, suggesting effective vision-language grounding despite limited adaptation. X2-VLM-Med [68] achieves the best results on most evaluation metrics, surpassing the baselines by a consistent margin.

Table 6. Lesion detection. Metrics are Average Precision (AP) and F1; Metric@IoU indicates computation at the given IoU threshold. mAP denotes mean AP averaged over IoU thresholds from 0.50 to 0.95 (step 0.05), and localization accuracy (Loc. Acc.) is computed at IoU = 0.5.

| Model | AP@0.5 | AP@0.75 | F1@0.5 | mAP | Loc. Acc. |
|---|---|---|---|---|---|
| Faster R-CNN [52] | 64.1 | 48.7 | 61.0 | 43.2 | 70.4 |
| Swin Transformer [63] | 70.5 | 53.9 | 66.7 | 48.9 | 77.1 |
| MedVInT [70] | 72.1 | 55.1 | 67.8 | 50.2 | 78.5 |
| LLaVA-Med v1.5 [32] | 68.3 | 51.7 | 67.9 | 46.8 | 75.2 |
| X2-VLM-Med [68] | 70.4 | 56.4 | 68.1 | 51.5 | 79.6 |
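Since the lesion annotations are 3D bounding boxes, the detection metrics above ultimately reduce to box-overlap computations. The following is a minimal sketch of an axis-aligned 3D IoU, as used when thresholding predictions at IoU = 0.5 or averaging AP over 0.50:0.95; the box format and example values are illustrative assumptions.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

# A prediction counts as a true positive at a given threshold if its IoU with
# an unmatched ground-truth box reaches that threshold (COCO-style matching).
pred = (40, 50, 12, 90, 110, 40)
gt = (45, 55, 15, 95, 115, 42)
for thr in (0.5, 0.75):
    print(f"IoU={iou_3d(pred, gt):.2f} -> TP at {thr}? {iou_3d(pred, gt) >= thr}")
```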
6. Ethical Concern and Publication

All data were collected by first-line clinicians under strict institutional ethical approval and in full compliance with relevant privacy and data-protection regulations. After collection, all samples were fully de-identified and manually checked to ensure that no personally identifiable information remained.

The annotations were created by experienced clinicians with domain expertise and were cross-checked for consistency. We aim to ensure that the clinicians' qualifications and the review process provide a solid basis for the reliability and clinical validity of the annotations, which may benefit future research and model development.

The dataset is well-structured, has obtained Institutional Review Board (IRB) approval, and is planned to be made publicly available after the paper is accepted. The release will follow institutional data-sharing policies, and no data that restrict public or research redistribution will be included.

We plan to host a small subset of samples on Hugging Face for demonstration purposes, while the complete dataset will be distributed through our project webpage after publication. Access to the dataset will require signing a consent form prior to distribution. The dataset is intended to be released under the CC BY-NC-ND 4.0 license.

7. Conclusion

We presented Gastric-X, a comprehensive multimodal benchmark designed to advance vision-language research in gastric cancer analysis. The dataset is carefully curated from real clinical workflows, integrating multi-phase 3D CT scans, endoscopic images, biochemical indicators, and clinical reports with bounding box and disease stage annotations. Gastric-X provides a unified platform encompassing five core tasks, including visual question answering, report generation, cross-modal retrieval, disease stage classification, and lesion detection, offering a holistic evaluation of multimodal reasoning in medical AI. We will publicly release the dataset and accompanying experimental code to support reproducibility and community development. We envision Gastric-X as a standard reference for developing robust and clinically aligned vision-language models in healthcare.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[2] Elisa Baratella, Cristina Marrocchio, Alessandro Marco Bozzato, Erik Roman-Pognuz, and Maria Assunta Cova. Chest x-ray in intensive care unit patients: what there is to know about thoracic devices. Diagnostic and Interventional Radiology, 27(5):633, 2021.
[3] Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, et al. Merlin: A vision language foundation model for 3d computed tomography. Research Square, pages rs–3, 2024.
[4] Freddie Bray, Jacques Ferlay, Isabelle Soerjomataram, Rebecca L Siegel, Lindsey A Torre, and Ahmedin Jemal. Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 68(6):394–424, 2018.
[5] Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria De La Iglesia-Vaya. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis, 66:101797, 2020.
[6] Weiwei Cao, Jianpeng Zhang, Yingda Xia, Tony CW Mok, Zi Li, Xianghua Ye, Le Lu, Jian Zheng, Yuxing Tang, and Ling Zhang. Bootstrapping chest ct image understanding by distilling knowledge from x-ray expert models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11238–11247, 2024.
[7] Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Qi Zhang, Tingbo Liang, et al. Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23041–23050, 2025.
[8] Hao Chen, Hongrun Zhang, U Wang Chan, Rui Yin, Xiaofei Wang, and Chao Li. Domain game: Disentangle anatomical feature for single domain generalized segmentation. In International Workshop on Computational Mathematics Modeling in Cancer Analysis, pages 41–51. Springer Nature Switzerland Cham, 2024.
[9] Hao Chen, Rui Yin, Yifan Chen, Qi Chen, and Chao Li. Learning patient-specific disease dynamics with latent flow matching for longitudinal imaging generation. ICLR 2026, 2025.
[10] Yifan Chen, Fei Yin, Hao Chen, Jia Wu, and Chao Li. Pmpbench: A paired multi-modal pan-cancer benchmark for medical image synthesis. arXiv preprint arXiv:2601.15884, 2026.
[11] Zhihong Chen, Guanbin Li, and Xiang Wan. Align, reason and learn: Enhancing medical vision-and-language pre-training with knowledge. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5152–5161, 2022.
[12] Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, and Xiaoying Tang. Prior: Prototype representation joint learning from medical images and reports. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21361–21371, 2023.
[13] Bhupender S Chhikara, Keykavous Parang, et al. Global cancer statistics 2022: the trends projection analysis. Chemical Biology Letters, 10(1):451–451, 2023.
[14] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.
[15] Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks. arXiv preprint arXiv:2506.11147, 2025.
[16] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
[17] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[18] Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, et al. A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities. CoRR, 2024.
[19] Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Doga, Omer Faruk Durugol, Weicheng Dai, Murong Xu, et al. Developing generalist foundation models from a multimodal dataset for 3d computed tomography. arXiv preprint arXiv:2403.17834, 2024.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint, 2020.
[22] Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021.
[23] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 590–597, 2019.
[24] Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, and Nojun Kwak. Unifying vision-language representation space with single-tower transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 980–988, 2023.
[25] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[26] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[27] Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr database (version 2.1.0), 2024. RRID:SCR 007345.
[28] Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019.
[29] Jung Hoon Kim, Hyo Won Eun, Jae Ho Choi, Seong Sook Hong, Weechang Kang, and Yong Ho Auh. Diagnostic performance of virtual gastroscopy using mdct in early gastric cancer compared with 2d axial ct: focusing on interobserver variation. American Journal of Roentgenology, 189(2):299–305, 2007.
[30] Robert Michael Kwee and Thomas Christian Kwee. Imaging in local staging of gastric cancer: a systematic review. Journal of Clinical Oncology, 25(15):2107–2116, 2007.
[31] Robert Michael Kwee and Thomas Christian Kwee. Imaging in local staging of gastric cancer: a systematic review. Journal of Clinical Oncology, 25(15):2107–2116, 2007.
[32] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023.
[33] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[35] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
[36] Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xiaodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023.
[37] Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Le Lu, Jiebo Luo, and Ling Zhang. Ct-glip: 3d grounded language-image pretraining with ct scans and radiology reports for full-body scenarios. arXiv preprint arXiv:2404.15272, 2024.
[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[39] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
[40] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.
[41] Bo Liu, Ke Zou, Li-Ming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu, and Huazhu Fu. Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21310–21320, 2025.
[42] Chang Liu, Yuanhe Tian, Weidong Chen, Yan Song, and Yongdong Zhang. Bootstrapping large language models for radiology report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18635–18643, 2024.
[43] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[44] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[45] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
[46] F. R. Lucchesi and N. D. Aredes. The cancer genome atlas stomach adenocarcinoma collection (tcga-stad). https://doi.org/10.7937/K9/TCIA.2016.GDHL9KIM, 2016. Data set, The Cancer Imaging Archive.
[47] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.
[48] Philip Müller, Georgios Kaissis, Congyu Zou, and Daniel Rueckert. Joint learning of localized representations from medical images and reports. In European Conference on Computer Vision, pages 685–701. Springer, 2022.
[49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[51] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1):86, 2021.
[52] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[53] Rachel E Sexton, Mohammed Najeeb Al Hallak, Maria Diab, and Asfar S Azmi. Gastric cancer: a comprehensive review of current and future treatment strategies. Cancer and Metastasis Reviews, 39(4):1179–1203, 2020.
[54] Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19(1):221–248, 2017.
[55] Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, et al. Large-scale and fine-grained vision-language pre-training for enhanced ct image understanding. arXiv preprint arXiv:2501.14548, 2025.
[56] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
[57] EC Smyth, M Verheij, W Allum, D Cunningham, A Cervantes, D Arnold, ESMO Guidelines Committee, et al. Gastric cancer: Esmo clinical practice guidelines for diagnosis, treatment and follow-up. Annals of Oncology, 27:v38–v49, 2016.
[58] EC Smyth, M Verheij, W Allum, D Cunningham, A Cervantes, D Arnold, ESMO Guidelines Committee, et al. Gastric cancer: Esmo clinical practice guidelines for diagnosis, treatment and follow-up. Annals of Oncology, 27:v38–v49, 2016.
[59] Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering, 6(12):1399–1406, 2022.
[60] Michael Tschannen, Basil Mustafa, and Neil Houlsby. Image-and-language understanding from pixels only. arXiv preprint arXiv:2212.08045, 2022.
[61] Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, and Lequan Yu. Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems, 35:33536–33549, 2022.
[62] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.
[63] Long Wen, Xinyu Li, and Liang Gao. A transfer convolutional neural network for fault diagnosis based on resnet-50. Neural Computing and Applications, 32(10):6111–6124, 2020.
[64] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21372–21383, 2023.
[65] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
[66] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Advances in Neural Information Processing Systems, 35:9125–9138, 2022.
[67] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
[68] Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. X2-vlm: All-in-one pre-trained model for vision-language tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3156–3168, 2023.
[69] Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14(1):4542, 2023.
[70] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
[71] Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12910–12917, 2020.
[72] Hong-Yu Zhou, Chenyu Lian, Liansheng Wang, and Yizhou Yu. Advancing radiograph representation learning with masked record modeling. arXiv preprint, 2023.
[73] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
[74] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

Appendix

A. Abstract

This supplementary document provides extended technical details and additional discussions for the Gastric-X benchmark. Specifically, we present: (1) a description of the multi-phase CT normalization and alignment pipeline in Sec. B; (2) detailed explanations of the provided clinical reports (CT imaging descriptions, endoscopy reports, and diagnostic conclusions) in Sec. C; (3) a description of the creation, verification, and prompting strategy for all VQA pairs in Sec. D; and (4) an illustration of the 134 biomedical indicators in Sec. E.

B. Multi-phase CT Standardization Details

Multi-phase CT scans encompass substantial variation across patients and acquisition phases. Our preprocessing pipeline aims to harmonize intensity patterns, standardize geometric properties, and ensure spatial alignment across phases.
Intensity normalization across phases. We apply a unified clipping window of [−100, 300] HU, consistent with recommended gastric soft-tissue imaging ranges. For each CT volume, per-volume z-score normalization is performed after clipping. In addition, we use histogram matching across phases to reduce heterogeneity.

Voxel spacing standardization. All phases are resampled to an isotropic spacing of 1.0 × 1.0 × 1.0 mm³ using trilinear interpolation for image data.

Handling different numbers of CT slices. Raw scans contain variable numbers of axial slices. Each patient is associated with a coarse 3D bounding region around the stomach, manually annotated by clinical readers. These bounding regions vary between 256 × 256 × 160 and 288 × 288 × 192 voxels depending on patient-specific anatomy. Volumes are cropped or padded to the unified shape of 288 × 288 × 192.

Multi-phase alignment. The arterial and delayed phases are rigidly registered to the venous phase using a 6-DOF transformation optimized via mutual information. Registration is implemented with SimpleITK (Elastix backend).

Quality control. Scans with corrupted slices, missing metadata, or excessive misalignment are excluded. Approximately 3–4% of volumes are filtered out by this process.
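The numpy/SciPy sketch below illustrates the clipping, z-score normalization, isotropic resampling, and crop/pad steps under the settings stated above. It is a simplified stand-in for the actual pipeline (which additionally performs histogram matching and Elastix-based registration), and the helper names are ours.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SHAPE = (192, 288, 288)  # (z, y, x): the unified 288 x 288 x 192 grid

def center_crop_or_pad(vol, target):
    """Symmetrically crop or zero-pad each axis to the target size."""
    out = np.zeros(target, dtype=vol.dtype)
    src, dst = [], []
    for s, t in zip(vol.shape, target):
        if s >= t:
            a = (s - t) // 2
            src.append(slice(a, a + t)); dst.append(slice(0, t))
        else:
            a = (t - s) // 2
            src.append(slice(0, s)); dst.append(slice(a, a + s))
    out[tuple(dst)] = vol[tuple(src)]
    return out

def standardize_ct(volume_hu, spacing_zyx):
    """Clip to [-100, 300] HU, per-volume z-score, resample to ~1 mm isotropic
    voxels, then crop/pad to the unified shape (simplified illustration)."""
    vol = np.clip(volume_hu.astype(np.float32), -100.0, 300.0)
    vol = (vol - vol.mean()) / (vol.std() + 1e-6)
    vol = zoom(vol, zoom=np.asarray(spacing_zyx, dtype=float), order=1)
    return center_crop_or_pad(vol, TARGET_SHAPE)

# Example: a dummy volume with 3 mm slices and 0.8 mm in-plane spacing.
dummy = np.random.randint(-1000, 1000, size=(60, 320, 320)).astype(np.float32)
print(standardize_ct(dummy, spacing_zyx=(3.0, 0.8, 0.8)).shape)  # (192, 288, 288)
```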
E. The Biomedical Indicators

The dataset includes 134 structured biomedical indicators covering demographic data, laboratory tests, tumor biomarkers, imaging metadata, surgical information, pathological staging, histological findings, and postoperative outcomes; the full list is shown in Table 8. These indicators originate from structured EHR records and were processed to ensure consistency across patients.

All sensitive identifiers (including patient names, ID numbers, phone numbers, and hospitalization codes) were removed or replaced with anonymized pseudonyms. Variables unrelated to model training (such as historical comorbidities or unused surgery-related entries) are retained for completeness but marked as not used in this study. An illustrative loading sketch follows Table 8.
Table 8. Full list of the 134 structured biomedical variables in the Gastric-X dataset. De-identified fields are marked "De-ID".

Demographics
Hospital ID (De-ID): Anonymized hospitalization identifier
Patient Name (De-ID): Anonymized patient code
Sex: Biological sex
Age (De-ID): Age at admission
Bed Number (De-ID): Anonymized bed assignment
Surgery Date (De-ID): Date of surgery
Imaging ID: CT imaging identifier
CT Description: Radiology description

CBC
CBC Test Date: Date of CBC test
CBC White Blood Cell Count: White blood cell count
CBC Neutrophil Count: Absolute neutrophil count
CBC Neutrophil Ratio: Neutrophil percentage
CBC Lymphocyte Count: Absolute lymphocyte count
CBC Lymphocyte Ratio: Lymphocyte percentage
CBC Hemoglobin: Hemoglobin concentration
CBC Platelet Count: Platelet count

Biochemistry
Biochemistry Test Date: Date of biochemistry test
Biochemistry Fasting Glucose: Fasting plasma glucose
Biochemistry Prealbumin: Serum prealbumin
Biochemistry ALT: Alanine aminotransferase
Biochemistry AST: Aspartate aminotransferase
Biochemistry Total Protein: Total serum protein
Biochemistry Albumin: Serum albumin
Biochemistry Total Bilirubin: Total bilirubin
Biochemistry Direct Bilirubin: Direct bilirubin
Biochemistry Creatinine: Serum creatinine
Biochemistry Urea (BUN): Blood urea nitrogen

Tumor Markers
Tumor Markers Test Date: Date of tumor marker test
Tumor Markers AFP: Alpha-fetoprotein
Tumor Markers CEA: Carcinoembryonic antigen
Tumor Markers CA125: Cancer antigen 125
Tumor Markers CA724: Cancer antigen 724
Tumor Markers CA199: Cancer antigen 19-9

Past Medical History: Past conditions (not used)

Surgery Details
Surgery Date: Date of surgery
Resection Range: Extent of resection
Gastrointestinal Reconstruction: Postoperative reconstruction type

Occupation: Patient occupation
Education Level: Highest educational level
Marital Status: Marital status
Ethnicity: Ethnic group
Admission Method: Mode of admission
Insurance Status: Insurance coverage
ID Number (De-ID): Anonymized ID number

Surgical and Admission Info
Contact Number (De-ID): Contact phone
Surgery Admission Date (De-ID): Admission date for surgery
Surgery Discharge Date (De-ID): Discharge date
Surgery Hospitalization Cost: Total hospital cost
Admission Temperature: Temperature at admission
Admission Pulse: Pulse rate at admission
Admission Respiration: Respiratory rate
Admission Systolic Pressure: Systolic BP
Admission Diastolic Pressure: Diastolic BP
Height: Height
Weight: Weight
BMI: Body mass index
General Condition: Performance status
Weight Loss: Recent weight loss
Reduced Food Intake: Reduced oral intake
Smoking Status: Smoking history
Drinking Status: Alcohol use
Endoscopy Date (De-ID): Endoscopy date
Endoscopy Tumor Location: Tumor location
Endoscopy Tumor Size: Tumor size
Endoscopy Gross Type: Gross morphology
Endoscopy Biopsy Pathology: Biopsy pathology
Endoscopy Appearance: Visual findings
Chief Surgeon (De-ID): Operating surgeon

Tumor Anatomy and Pathology
Tumor Anatomical Location: Tumor site
Maximum Tumor Diameter: Maximal diameter
Serosal Invasion: Serosal involvement
Gross Tumor Type: Macroscopic type
Linitis Plastica: Linitis plastica presence
Perigastric Lymph Nodes: Perigastric node status
Liver Metastasis: Liver metastasis
Adjacent Organ Invasion: Neighboring organ invasion
Peritoneal Seeding: Peritoneal metastasis
Ascites: Ascites presence
Pathology ID (De-ID): Specimen ID
Tumor Size (Long Axis): Long axis size
Tumor Size (Short Axis): Short axis size
Tumor Size (Height): Height
Histologic Grade: Differentiation grade
Additional Histologic Type: Additional components
Distance to Proximal Margin: Proximal margin distance
Distance to Distal Margin: Distal margin distance
Specimen Proximal Margin: Proximal margin status
Specimen Distal Margin: Distal margin status
Perineural Invasion: Perineural invasion
Vascular Cancer Thrombus: Vascular invasion

Staging and Lymph Nodes
T Stage: Pathologic T category
N Stage: Pathologic N category
M Stage: Pathologic M category
Overall Stage: Final stage
Positive Lymph Nodes: Count of positive nodes
Dissected Lymph Nodes: Total dissected nodes

Molecular and IHC
Ki-67: Proliferation index
HER2 Status: HER2 expression
CLDN Status: Claudin status
MLH1: MLH1 expression
PMS2: PMS2 expression
MSH2: MSH2 expression
MSH6: MSH6 expression
EBER: EBV RNA (ISH)
PD-L1 Score: PD-L1 score
MMR Status: Mismatch repair status

Complications and Outcomes
Complication Severity: Clavien-Dindo grade
Secondary Surgery: Reoperation
Complication Occurrence: Any complication
Severe Complication Occurrence: Severe complication
Total Hospital Stay: Total days
Postoperative Stay: Days after surgery
Postoperative Fever: Fever occurrence
Fever Days: Number of fever days
Complication Category: Type of complication
Intervention: Treatment measures
Complication Notes: Additional notes
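As a usage illustration, the snippet below sketches how a downstream user might load the structured indicators, apply the kind of pseudonymization described above to identifier fields, and drop the variables marked as not used before model training. The file name, column subset, and hashing scheme are assumptions for illustration and do not describe the released files.

```python
# Hypothetical handling of the structured indicators; column names follow Table 8.
import hashlib
import pandas as pd

DEID_COLUMNS = [  # partial list; see Table 8 for all fields marked "De-ID"
    "Hospital ID", "Patient Name", "Bed Number", "Surgery Date", "ID Number",
    "Contact Number", "Surgery Admission Date", "Surgery Discharge Date",
    "Endoscopy Date", "Chief Surgeon", "Pathology ID",
]
NOT_USED_COLUMNS = ["Past Medical History"]  # retained for completeness only

def pseudonymize(value) -> str:
    """Replace a sensitive value with a stable, non-reversible pseudonym."""
    return "ANON-" + hashlib.sha256(str(value).encode()).hexdigest()[:10]

def load_indicators(path: str = "gastric_x_indicators.csv") -> pd.DataFrame:
    """Load the per-patient indicator table and pseudonymize identifier fields."""
    df = pd.read_csv(path)
    for col in DEID_COLUMNS:
        if col in df.columns:
            df[col] = df[col].map(pseudonymize)
    return df

def modeling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier and 'not used' columns before feeding indicators to a model."""
    drop = [c for c in DEID_COLUMNS + NOT_USED_COLUMNS if c in df.columns]
    return df.drop(columns=drop)
```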