CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
💡 Research Summary
CT‑Bench is introduced as the first large‑scale multimodal benchmark specifically designed for lesion‑level understanding in computed tomography (CT). The benchmark consists of two complementary components. The Lesion Image and Metadata Set contains 20,335 lesions drawn from 7,795 CT studies. For each lesion, a 2‑D axial slice (and optionally a lesion‑centered 3‑D sub‑volume) is paired with a radiologist‑verified textual description, size measurement, and a bounding‑box annotation extracted from PACS reports. This enriches the widely used DeepLesion bounding‑box data with clinically meaningful natural‑language labels, providing dense visual‑language supervision.
The second component, the CT‑Bench QA Benchmark, offers 2,850 multiple‑choice questions spanning seven tasks: (1) single‑slice lesion captioning, (2) text‑guided slice retrieval, (3) lesion localization with bounding boxes, (4) lesion size estimation, (5) single‑slice attribute classification, (6) multi‑slice lesion captioning, and (7) multi‑slice attribute classification. Each task includes hard‑negative examples that mimic real‑world diagnostic confusion, and all answers have been validated by senior radiologists. The benchmark therefore evaluates both visual grounding and higher‑level reasoning across single‑ and multi‑slice contexts.
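To make the task structure concrete, the sketch below shows one plausible way a multiple‑choice QA item with hard negatives could be represented and scored. The field names, task labels, and example options are illustrative assumptions, not CT‑Bench's actual schema.

```python
# Minimal sketch of a CT-Bench-style multiple-choice QA item and an
# accuracy metric over it. All field names and examples are hypothetical.
from dataclasses import dataclass

@dataclass
class QAItem:
    task: str            # e.g. "size_estimation", "attribute_classification"
    question: str
    options: list[str]   # includes hard-negative distractors
    answer_idx: int      # index of the radiologist-validated answer

def accuracy(items: list[QAItem], predictions: list[int]) -> float:
    """Fraction of items where the model picked the validated answer."""
    correct = sum(p == item.answer_idx for item, p in zip(items, predictions))
    return correct / len(items)

items = [
    QAItem("size_estimation", "What is the long-axis diameter of the lesion?",
           ["8 mm", "15 mm", "23 mm", "31 mm"], answer_idx=1),
    QAItem("attribute_classification", "What is the lesion's attenuation?",
           ["hypodense", "isodense", "hyperdense", "calcified"], answer_idx=0),
]
print(accuracy(items, [1, 2]))  # → 0.5 (first item correct, second wrong)
```

Because every task is multiple choice, a single accuracy metric applies uniformly across all seven tasks, which is what makes the cross‑task comparisons in the paper possible.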
To construct the dataset, the authors employed a three‑stage pipeline that combines GPT‑4 pre‑annotation with human‑in‑the‑loop refinement. After an initial manual annotation of 200 cases, GPT‑4 was fine‑tuned and used to generate provisional labels for subsequent batches. Annotators iteratively corrected these outputs across three feedback cycles, and a dual‑layer review (two annotators plus a medical expert) ensured high fidelity before the model was deployed on the remaining data. This approach balances scalability with clinical accuracy.
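The control flow of that pipeline can be sketched as an iterative loop. The stub functions below (`pre_annotate`, `human_correct`, `refine`) are placeholders standing in for GPT‑4 pre‑annotation, annotator corrections, and model refinement; they are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the three-stage human-in-the-loop annotation pipeline.
# All functions below are illustrative placeholders.

def pre_annotate(model_version: int, case: str) -> str:
    # Placeholder: a fine-tuned model proposes a provisional label.
    return f"label_v{model_version}({case})"

def human_correct(label: str) -> str:
    # Placeholder: annotators fix the provisional label.
    return label + "+corrected"

def refine(model_version: int) -> int:
    # Placeholder: corrections are fed back to improve the model.
    return model_version + 1

def annotate(batches: list[list[str]], feedback_cycles: int = 3) -> list[str]:
    model_version, labeled = 1, []
    for cycle, batch in enumerate(batches):
        corrected = [human_correct(pre_annotate(model_version, c)) for c in batch]
        if cycle < feedback_cycles:   # iterative feedback before full deployment
            model_version = refine(model_version)
        labeled.extend(corrected)     # dual-layer review would gate this step
    return labeled

print(annotate([["case1"], ["case2"]]))
```

The key design choice the loop captures is that model refinement happens between batches, so later batches benefit from earlier corrections while human review remains the final gate.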
A comprehensive evaluation was performed on a suite of state‑of‑the‑art multimodal models: general vision‑language models (Dragonfly, Gemini), medical‑domain adapters (LLaVA‑Med, RadFM), and medical CLIP variants (BiomedCLIP). Experiments examined the impact of providing bounding‑box information versus omitting it. Across most tasks, spatial supervision improved performance, especially for image‑to‑text (Img2txt), context‑to‑text (Context2txt), and attribute classification (Img2attrib). Tasks that rely on global semantic alignment, such as text‑guided slice retrieval (Txt2img), showed little sensitivity to bounding‑box inputs.
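This ablation amounts to comparing per‑task accuracy under the two prompting conditions. A minimal sketch, with placeholder numbers rather than the paper's reported results:

```python
# Sketch of the with/without-bounding-box comparison: given per-task
# accuracies under both conditions, report the gain attributable to
# spatial supervision. Accuracy values below are illustrative only.

def bbox_gain(acc_with: dict[str, float],
              acc_without: dict[str, float]) -> dict[str, float]:
    """Per-task accuracy improvement when bounding boxes are supplied."""
    return {task: acc_with[task] - acc_without[task] for task in acc_with}

acc_with = {"Img2txt": 0.58, "Txt2img": 0.40, "Img2attrib": 0.61}
acc_without = {"Img2txt": 0.44, "Txt2img": 0.39, "Img2attrib": 0.47}

for task, gain in bbox_gain(acc_with, acc_without).items():
    print(f"{task}: {gain:+.2f}")
```

Under this framing, a near‑zero gain (as for Txt2img here) is the signature of a task driven by global semantic alignment rather than spatial grounding.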
BiomedCLIP emerged as the strongest baseline, achieving 41 % average accuracy without bounding boxes and 62 % when boxes were supplied, outperforming all other untuned models. Gemini excelled in text‑to‑bbox grounding (36 % with boxes) and size estimation (50 %); Dragonfly displayed balanced performance, notably in attribute‑related questions. GPT‑4V delivered moderate results, with a relative strength in attribute recognition but weaker spatial grounding. LLaVA‑Med and the original RadFM performed near chance on most QA tasks, highlighting a mismatch between their pre‑training and the nuanced CT domain.
A striking finding was catastrophic forgetting: fine‑tuning RadFM solely on the image‑captioning subset caused its performance on all QA tasks to drop to zero, underscoring the difficulty of preserving multi‑task knowledge when adapting models. Moreover, a clear performance gap was observed between single‑slice and multi‑slice tasks, indicating that current architectures largely process slices independently and lack effective volumetric reasoning. The authors argue that future models should incorporate 3‑D encoders, cross‑slice attention mechanisms, or hybrid 2‑D/3‑D designs to capture the spatial continuity inherent in CT volumes.
Human evaluation involved two senior radiologists and one junior physician reviewing 100 randomly selected cases per task. Agreement with the benchmark exceeded 90 % for tasks that included bounding‑box information, confirming that CT‑Bench aligns closely with expert clinical judgment. Agreement dropped noticeably when spatial cues were absent, reinforcing the clinical importance of precise lesion localization.
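The agreement figures reported above are simple percent agreement between each reviewer's answers and the benchmark key over the sampled items. A minimal sketch, with made‑up answer vectors:

```python
# Percent-agreement sketch for the human evaluation: each reviewer answers
# the same sampled items, and responses are compared to the benchmark key.
# The answer vectors here are illustrative, not the study's data.

def percent_agreement(reviewer: list[int], benchmark: list[int]) -> float:
    """Percentage of items where the reviewer matches the benchmark answer."""
    matches = sum(r == b for r, b in zip(reviewer, benchmark))
    return 100.0 * matches / len(benchmark)

print(percent_agreement([1, 1, 0, 1], [1, 1, 1, 1]))  # → 75.0
```

Percent agreement is the simplest such measure; chance‑corrected statistics such as Cohen's kappa would be a natural extension when the number of options per question varies.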
In discussion, the authors emphasize that while CT‑Bench provides a high‑quality resource, challenges remain: annotation is labor‑intensive, scaling the dataset will require semi‑automated pipelines, and existing multimodal models still fall far short of expert performance (the best model reaches only 62 % accuracy). They advocate for dedicated CT‑specific modeling strategies, improved volumetric reasoning, and continual‑learning frameworks to mitigate catastrophic forgetting.
Overall, CT‑Bench offers a comprehensive platform for training and evaluating multimodal AI systems on clinically relevant CT lesion tasks. By coupling richly annotated image‑metadata pairs with a rigorously designed QA suite, it exposes the limitations of current vision‑language models and demonstrates the substantial gains achievable through targeted fine‑tuning. The benchmark is poised to become a foundational dataset for advancing trustworthy, high‑performing AI in CT‑based diagnosis.