Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLM learning, including data augmentation strategies and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.
💡 Research Summary
This paper addresses two critical gaps in medical vision‑language modeling: the scarcity of functional imaging data, specifically PET/CT, and the under‑representation of low‑resource languages such as Vietnamese. The authors introduce ViMed‑PET, a novel multimodal dataset comprising 2,757 whole‑body PET/CT volumes (equivalent to 1,567,062 paired 2‑D slices) collected from a large tertiary hospital in Vietnam. Each volume is paired with a full‑length clinical report written in Vietnamese, which has been de‑identified, parsed, and converted into a structured JSON format. To increase the utility of the data, the authors split each study into three anatomical regions (head‑neck, chest, abdomen‑pelvis) with a 20‑slice overlap, yielding 8,271 image‑report pairs for training. Additional task‑specific augmentations generate 8,271 visual question answering (VQA) dialogues, 5,571 report‑generation samples, 10,000 study‑comparison pairs, and a clinically validated lung‑cancer test set containing 398 lesions.
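The region split with a 20‑slice overlap can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the per‑study boundary indices separating the three regions are assumed inputs (in practice they would come from anatomical landmarks or metadata).

```python
def split_regions(n_slices, boundaries, overlap=20):
    """Split a whole-body volume's axial slice indices into three
    anatomical regions (head-neck, chest, abdomen-pelvis), extending
    each region by `overlap` slices across its boundary.

    `boundaries` = (b1, b2): assumed cut indices between head-neck/chest
    and chest/abdomen-pelvis. Returns half-open index ranges.
    """
    b1, b2 = boundaries
    return {
        "head_neck": (0, min(b1 + overlap, n_slices)),
        "chest": (max(b1 - overlap, 0), min(b2 + overlap, n_slices)),
        "abdomen_pelvis": (max(b2 - overlap, 0), n_slices),
    }
```

Because each of the 2,757 studies yields three overlapping regions, this scheme accounts for the 8,271 (= 2,757 × 3) image‑report pairs reported above.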
The preprocessing pipeline includes rigorous privacy removal, metadata extraction, and quality control through manual review. Visual augmentation applies 3‑D rotations, translations, intensity perturbations, and modality‑specific noise, while textual augmentation introduces synonym swaps, sentence re‑ordering, and synthetic Q‑A generation to enrich linguistic diversity.
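The textual augmentation steps above can be illustrated with a toy sketch. The synonym table and the policy of keeping the first sentence fixed while shuffling the rest are illustrative assumptions, not the authors' actual implementation:

```python
import random

def augment_report(sentences, synonyms, seed=0):
    """Toy textual augmentation: per-word synonym swaps followed by
    re-ordering of all sentences after the first. `synonyms` maps a
    word to its replacement (a hypothetical lookup table)."""
    rng = random.Random(seed)
    swapped = [
        " ".join(synonyms.get(w, w) for w in s.split())
        for s in sentences
    ]
    # Keep the leading sentence in place, shuffle the remainder.
    head, rest = swapped[0], swapped[1:]
    rng.shuffle(rest)
    return [head] + rest
```

A real pipeline for Vietnamese clinical text would need a domain‑specific synonym lexicon and reordering constraints that preserve clinical meaning.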
Benchmark experiments evaluate several state‑of‑the‑art vision‑language models: LLaVA‑Med, M3D, RadFM, and GPT‑4o (few‑shot). When applied directly to the Vietnamese PET/CT task, these models achieve near‑zero BLEU‑4 scores (0.01–0.06 %) and low ROUGE/BERTScore, confirming that existing VLMs are ill‑suited for functional imaging and low‑resource language contexts. Fine‑tuning the same models on ViMed‑PET leads to substantial improvements: BLEU‑4 increases by 12–18 percentage points, ROUGE‑1/ROUGE‑L improve by 8–12 pp, and BERTScore rises by 5–7 pp. On the expert‑validated lung‑cancer test set, the fine‑tuned models show a 23 pp gain in correctly describing lesion location, size, and metabolic activity, yet overall clinical accuracy remains below 70 %, highlighting the need for further methodological advances.
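For readers unfamiliar with the surface metrics cited above, ROUGE‑1 F1 reduces to unigram overlap between candidate and reference text. A minimal sketch, assuming plain whitespace tokenization (real evaluations would use a reference implementation with proper tokenization for Vietnamese):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Minimal ROUGE-1 F1: harmonic mean of unigram precision
    and recall between candidate and reference strings."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

BLEU‑4 and BERTScore follow the same candidate-vs-reference pattern but use 4‑gram precision with a brevity penalty and contextual embedding similarity, respectively.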
Key technical insights emerge from the study. First, PET/CT images contain quantitative biomarkers (e.g., standardized uptake values) that must be aligned with textual expressions; incorporating a multimodal loss that explicitly penalizes mismatches between image‑derived metrics and report numbers improves convergence. Second, for low‑resource languages, direct pre‑training on native text outperforms translation‑based pipelines, underscoring the value of native‑language corpora. Third, conventional NLP metrics (BLEU, ROUGE) inadequately capture clinical relevance; the authors propose structured clinical evaluation (e.g., TNM staging alignment) as a more meaningful benchmark.
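The first insight, penalizing mismatches between image‑derived metrics and the numbers written in a report, can be sketched as a simple numeric‑consistency term. The SUVmax regex, in‑order pairing of values, and squared‑error form are illustrative assumptions; the paper's actual loss formulation may differ:

```python
import re

SUV_PATTERN = re.compile(r"SUVmax\s*[=:]?\s*(\d+(?:\.\d+)?)")

def suv_mismatch_penalty(generated, reference):
    """Mean squared error between SUVmax values mentioned in a
    generated report and those in the reference report. Values are
    paired in order of appearance; a missing mention is scored as 0.0,
    so it incurs the full squared reference value."""
    gen = [float(m) for m in SUV_PATTERN.findall(generated)]
    ref = [float(m) for m in SUV_PATTERN.findall(reference)]
    if not ref:
        return 0.0
    penalty = 0.0
    for i, r in enumerate(ref):
        g = gen[i] if i < len(gen) else 0.0
        penalty += (g - r) ** 2
    return penalty / len(ref)
```

Such a term would be added to the standard language‑modeling loss with a weighting coefficient, nudging the model toward reports whose quantitative biomarkers agree with the image.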
The paper also discusses limitations. The dataset originates from a single institution, which may limit demographic and scanner variability. It focuses exclusively on PET/CT, leaving other functional modalities (SPECT, PET/MRI) unexplored. The current slice‑wise approach, while increasing sample count, does not fully exploit 3‑D context; future work should investigate volumetric transformers or hybrid CNN‑Transformer architectures. Moreover, real‑world deployment will require inference speed optimization, integration with hospital PACS systems, and compliance with medical device regulations.
In conclusion, ViMed‑PET provides a much‑needed resource for multilingual, functional‑imaging‑aware vision‑language research. The accompanying augmentation strategies and expert‑validated benchmarks demonstrate that incorporating this dataset can markedly improve the performance of existing VLMs, while also revealing persistent challenges in clinical accuracy. The authors anticipate that the dataset will become a standard benchmark for future development of robust, equitable medical AI systems that serve Vietnamese‑speaking populations and potentially other low‑resource language communities.