Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search


Authors: Himadri Sekhar Samanta

Himadri Sekhar Samanta
Independent AI Researcher, Austin, Texas, USA

Abstract

Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports.

A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal.

Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation.
Keywords: Radiology report generation, multimodal retrieval, retrieval-augmented generation, chest X-ray, medical imaging, explainable AI

1. Introduction

Artificial intelligence has shown promise in assisting radiologists with image interpretation and report drafting. Deep learning has substantially improved performance in medical image analysis tasks [1, 2], including thoracic imaging, where chest radiographs are among the most widely performed diagnostic studies [3, 4]. More recently, large language models and multimodal systems have enabled automated clinical text generation, but they remain vulnerable to hallucinations, unsupported claims, and poor calibration in safety-critical settings [5, 6]. These limitations are especially problematic in radiology, where factual consistency and traceability are essential.

Retrieval-augmented generation (RAG) offers an alternative by constraining generation to retrieved evidence [7, 8]. Instead of producing free-form text solely from model priors, RAG systems retrieve semantically similar historical examples and condition generation on verifiable evidence. Such case-based reasoning aligns closely with radiology practice, where prior studies and similar historical cases frequently inform interpretation. In radiology, both the chest X-ray image and the report impression are informative: the image captures anatomical and pathological appearance, while the report impression encodes concise clinical interpretation.

In this work, we present a grounded multimodal retrieval-augmented system for chest radiograph impression drafting. The proposed framework combines image and report embeddings, performs late fusion for similarity retrieval, and indexes the resulting vectors with FAISS for efficient nearest-neighbor search [9]. Retrieved cases are then used to generate grounded draft impressions with explicit case citations.
To improve safety, the system applies a confidence threshold and refuses generation when similarity is insufficient. We additionally deploy the pipeline as a FastAPI service and package it for reproducible Docker-based inference.

The main contributions of this work are as follows:

• We construct a clean multimodal chest X-ray dataset subset from MIMIC-CXR with aligned image paths, study identifiers, and extracted impression text suitable for retrieval and drafting experiments [10].
• We demonstrate that multimodal fusion of image and text embeddings markedly improves retrieval performance over image-only retrieval.
• We introduce a grounded drafting mechanism with explicit case citations and confidence-based refusal to reduce unsupported output generation.
• We provide an end-to-end deployable implementation, including a REST API and containerized execution environment, illustrating practical translation beyond notebook experimentation.

2. Related Work

Automated radiology report generation has been studied using encoder-decoder models, transformer-based image captioning systems, and multimodal vision-language architectures. While these approaches have shown promising fluency, they often struggle with factual consistency and hallucination. Chest X-ray modeling has particularly benefited from large public datasets and weakly supervised labeling pipelines, such as CheXpert [4]. Contrastive language-image pretraining further demonstrated that aligned multimodal embeddings can support strong cross-modal representations for retrieval and transfer learning [11]. Medical-domain adaptations, including BioViL and XrayBERT, highlight the growing role of clinically aligned vision-language pretraining [12, 13].

In parallel, retrieval-based and case-based reasoning methods have been explored in radiology because they naturally align with clinical practice and can improve interpretability.
Efficient similarity search frameworks such as FAISS support practical deployment of large embedding databases [9]. More recently, RAG pipelines have been proposed as a way to improve factuality and evidence grounding in language generation [7, 8, 14, 15]. However, relatively few systems combine radiology images and report text within a unified retrieval architecture that also includes citation verification and confidence-based refusal. Our work addresses this gap by integrating multimodal retrieval, grounded drafting, safety gating, and deployable serving in a single workflow.

Explainability and deployment realism are also central concerns in clinical AI [16, 17, 18]. Prior work has emphasized that successful healthcare AI must not only achieve strong offline metrics but also provide interpretable outputs, calibrated uncertainty, and operational pathways to translation [6, 19]. Our design choices are motivated by these principles.

3. Dataset

We used a curated subset of the MIMIC-CXR dataset and the MIMIC-CXR-JPG release [10]. The development process intentionally avoided downloading the full multi-terabyte dataset. Instead, lightweight metadata tables and the compressed report archive were first downloaded from PhysioNet under credentialed access. A reproducible subset of 2,000 studies was sampled using metadata mappings between subject identifiers, study identifiers, and image identifiers. After expansion to DICOM-level rows, the initial subset contained 3,301 entries.

To construct a modeling-ready image dataset, the study manifest was then joined to MIMIC-CXR-JPG metadata in order to retain only studies with available JPG images. This yielded 3,149 rows with corresponding image files.
The IMPRESSION section was extracted from each report using rule-based section parsing, as the impression provides the most concise clinically relevant summary for retrieval and draft generation. Rows without a valid impression section were removed, resulting in a final clean multimodal dataset of 2,696 image–impression pairs. Table 1 summarizes the dataset construction process.

Table 1: Dataset construction summary.

    Stage                               Count
    Sampled studies                     2,000
    DICOM rows in initial manifest      3,301
    Rows with available JPG images      3,149
    Rows with valid impression text     2,696
    Final clean multimodal dataset      2,696
    Positive-label evaluation subset      839

4. Methods

4.1. Image representation learning

Each chest X-ray image was encoded using the CLIP ViT-B/32 image encoder [11]. Images were loaded as RGB inputs, preprocessed using the CLIP processor, and converted into 512-dimensional embeddings. All image embeddings were L2-normalized to facilitate cosine similarity retrieval through inner-product search.

4.2. Text representation learning

For each study, the extracted report impression was encoded using the CLIP text encoder. Texts were tokenized with padding and truncation, embedded into the same 512-dimensional space, and L2-normalized. Because both encoders are aligned through contrastive pretraining, the image and text vectors are directly comparable.

4.3. Multimodal fusion

Late fusion was used to combine image and text embeddings:

    e_fusion = α · e_image + (1 − α) · e_text,

where α ∈ [0, 1] controls the contribution of the visual branch. The fused vectors were normalized after combination. We evaluated multiple values of α and selected the setting with the best retrieval performance on the development subset. The best-performing setting was α = 0.5. Figure 3 summarizes the fusion-weight ablation.
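The late-fusion step can be sketched in a few lines; this is a minimal NumPy illustration that assumes the 512-dimensional CLIP image and text embeddings are already available as arrays, and the function names are illustrative rather than taken from the paper's codebase.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so inner product equals cosine similarity."""
    v = np.asarray(v, dtype=np.float32)
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def fuse_embeddings(img_emb, txt_emb, alpha=0.5):
    """Late fusion: e_fusion = alpha * e_image + (1 - alpha) * e_text,
    renormalized after combination so the fused vectors stay unit-length
    for inner-product (cosine) retrieval."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    return l2_normalize(alpha * img + (1.0 - alpha) * txt)
```

With alpha = 1.0 this reduces to image-only retrieval, and alpha = 0.0 to text-only, which is what makes the fusion-weight sweep in Section 5.2 a single-parameter ablation.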
4.4. Retrieval system

All normalized fused embeddings were indexed using FAISS with an inner-product similarity index [9]. Because the vectors were normalized, inner product is equivalent to cosine similarity. At inference time, the system computes an embedding for the query image and retrieves the top-K most similar cases. For deployment, we used an image-only query against the image-side index to avoid a modality mismatch between query-time inputs and indexed vectors, while multimodal fusion was retained for offline evaluation and ablation.

4.5. Grounded draft generation

For each query, the top-K retrieved cases were used as supporting evidence. Retrieved report impressions were converted into short evidence snippets, and a grounded draft impression was produced either by a lightweight language model or by a deterministic evidence-based summarizer. The deterministic fallback ensures that the draft remains constrained to retrieved evidence and can be used when the language model output is malformed or insufficiently grounded.

4.6. Citation verification

Each draft output was required to reference retrieved evidence using explicit case identifiers (e.g., [Case 1], [Case 2]). Citation coverage was computed as the fraction of expected case markers present in the final draft. Missing citations were recorded for error analysis. This provides a simple but effective mechanism to track evidence attribution.

4.7. Confidence-based refusal

The similarity score of the top retrieved case was used as a confidence signal. When the top-1 similarity score fell below a predefined threshold, the system refused to generate a report draft and instead returned a structured refusal response. This mechanism prevents low-confidence or out-of-distribution inputs from producing unsupported clinical text.
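Retrieval and refusal together amount to a top-K inner-product search followed by a threshold check. The sketch below uses a plain NumPy matrix product as a stand-in for the FAISS index (on L2-normalized vectors, FAISS's IndexFlatIP computes the same inner-product scores); the threshold value and the response fields are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def top_k_cosine(db, query, k=3):
    """Inner-product search over L2-normalized vectors; on unit vectors this
    matches what a FAISS IndexFlatIP search returns. `db` is (n_cases, dim)."""
    sims = db @ query                 # cosine similarity to every indexed case
    order = np.argsort(-sims)[:k]     # indices of the top-K most similar cases
    return sims[order], order

def retrieve_or_refuse(db, query, k=3, threshold=0.7):
    """Confidence-based refusal: abstain when the top-1 similarity is weak,
    rather than drafting from poorly matched evidence (threshold illustrative)."""
    scores, ids = top_k_cosine(db, query, k)
    if scores[0] < threshold:
        return {"status": "refused", "top1_score": float(scores[0])}
    return {"status": "ok",
            "cases": ids.tolist(),
            "scores": [float(s) for s in scores]}
```

The retrieved case indices then map back to impression snippets used as evidence for drafting, and the refused branch is what an out-of-distribution input (e.g., a non-chest photograph) should trigger.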
4.8. Deployment

The final system was deployed as a FastAPI-based REST service with two primary endpoints: /health for health checks and /predict for inference. The service returns prediction status, confidence score, latency, generated draft, and retrieved case identifiers. For reproducibility, the system was packaged in Docker with a version-controlled dependency file.

5. Experiments

5.1. Retrieval evaluation

We evaluated retrieval quality using Recall@K with K ∈ {1, 5, 10}. Relevance was defined using CheXpert-derived pathology labels [4]. A query was considered successful if at least one retrieved case among the top-K shared the same positive clinical label as the query. We additionally compared image-only retrieval against multimodal fusion retrieval.

5.2. Fusion weight study

To understand the contribution of image and text signals, we performed an alpha sweep over fusion weights and measured Recall@5. The best setting was retained for subsequent experiments.

5.3. Safety evaluation

We evaluated the deployed RAG layer using refusal rate, average top-1 retrieval similarity, and average citation coverage. Refusal rate quantifies how often the safety policy suppresses generation; citation coverage measures the extent to which the final report explicitly references retrieved evidence.

6. Results

6.1. Retrieval performance

Multimodal fusion substantially improved retrieval performance over image-only retrieval. Table 2 summarizes the main retrieval metrics, while Figures 1, 2, and 3 provide visual summaries.

Table 2: Retrieval performance comparison.

    Method                  Recall@1    Recall@5    Recall@10
    Image-only              —           0.633       —
    Fusion (α = 0.5)        0.739       0.956       0.981
    Best fusion setting     —           0.975       —

Figure 1: Recall@5 comparison between image-only retrieval and multimodal fusion.

Figure 2: Fusion retrieval performance across different values of K.
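The Recall@K criterion from Section 5.1 reduces to a per-query hit test. The sketch below simplifies each case to a single positive CheXpert-derived label (real studies can carry several positive labels, so this is an illustrative reduction rather than the exact evaluation code):

```python
def recall_at_k(retrieved_labels, query_label, k):
    """1 if any of the top-K retrieved cases shares the query's positive label,
    else 0 — the per-query success criterion of Section 5.1."""
    return int(query_label in retrieved_labels[:k])

def mean_recall_at_k(all_retrieved, query_labels, k):
    """Average the per-query hit indicator over the evaluation subset."""
    hits = [recall_at_k(r, q, k) for r, q in zip(all_retrieved, query_labels)]
    return sum(hits) / len(hits)
```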
Figure 3: Fusion-weight ablation showing peak retrieval performance near α = 0.5.

Image-only retrieval achieved Recall@5 of 0.633, indicating that visual similarity alone captures some clinically useful signal but misses a substantial fraction of relevant cases. In contrast, multimodal fusion achieved Recall@5 of 0.975 at the best fusion setting and Recall@5 of 0.956 on the positive-case evaluation subset. These results indicate that report semantics provide complementary information beyond image appearance alone.

6.2. Safety and reliability

The deployed RAG system demonstrated strong safety metrics. As shown in Table 3 and Figure 4, the refusal rate on the internal evaluation set was 0.000, average top-1 retrieval similarity was 0.980, and average citation coverage was 0.867. These results indicate that most in-distribution chest X-ray inputs retrieve strong matches and that the majority of generated statements remain explicitly grounded in cited evidence.

Table 3: Safety and grounding metrics.

    Metric                           Value
    Refusal rate                     0.000
    Average top-1 retrieval score    0.980
    Average citation coverage        0.867
    Best fusion weight α             0.5

Figure 4: Safety and grounding metrics: refusal rate, average best retrieval score, and average citation coverage.

6.3. Qualitative example

Table 4 presents a representative case illustrating the proposed grounded drafting approach.

Table 4: Representative qualitative example of grounded drafting.

    Component           Example
    Query finding       Mild bibasilar atelectatic change on frontal chest radiograph
    Retrieved Case 1    Bibasilar atelectasis. Otherwise, no acute cardiopulmonary abnormality.
    Retrieved Case 2    Mild bibasilar atelectasis. No other acute findings.
    Retrieved Case 3    Mild bibasilar atelectasis. Otherwise, no acute cardiopulmonary process.
    Generated draft     Mild bibasilar atelectasis. [Case 1][Case 2] No acute cardiopulmonary abnormality. [Case 1][Case 3]
6.4. Architecture overview

Figure 5 illustrates the complete multimodal RAG pipeline used in this work, from image and text encoding to retrieval, grounded drafting, citation verification, confidence gating, and API output.

Figure 5: System architecture for grounded multimodal retrieval-augmented radiology drafting.

7. Error Analysis

The main error patterns observed in development fell into three categories. First, image-only retrieval sometimes returned anatomically similar but clinically different cases, especially when subtle findings required report semantics for disambiguation. Second, the lightweight language model occasionally copied evidence markers or repeated retrieved content verbatim rather than producing a concise synthesized impression. Third, some drafts omitted one or more expected citations, reducing citation coverage. These issues motivated the use of a deterministic evidence-based fallback and explicit citation verification.

We also verified the behavior of the confidence-based refusal mechanism on out-of-domain images. When a non-chest photograph was submitted, the system returned a low similarity score and correctly refused to generate a draft under the default clinical threshold. Lowering the threshold enabled generation in demo mode, but this setting is not recommended for clinical use.

8. Discussion

This study demonstrates that multimodal retrieval is a strong foundation for safer radiology draft generation. The retrieval results suggest that the textual impression signal contributes critical clinical information that image-only representations do not fully capture. The strong Recall@5 and Recall@10 results indicate that the fused embedding space successfully organizes studies according to clinically meaningful similarity.
The RAG layer provides an important practical advantage: rather than asking a model to generate a report from scratch, the system constrains drafting to retrieved, previously observed evidence. This substantially improves interpretability. In addition, confidence gating provides a pragmatic safety control by allowing the system to abstain when retrieval evidence is weak. Together, these components align with real clinical requirements for traceability and conservative behavior [16, 6].

The deployed API and Docker packaging are also significant from an applied perspective. Many academic systems stop at offline metrics, whereas our implementation demonstrates a path toward reproducible and portable deployment. This increases the work's relevance to both academic and industry audiences [17, 18, 19].

9. Limitations

This work has several limitations. First, experiments were conducted on a curated subset rather than the full MIMIC-CXR dataset, which may limit generalization. Second, the report drafting component relied on simple prompt-based or deterministic summarization rather than a radiology-specialized large language model. Third, evaluation emphasized retrieval and evidence-grounding metrics, but did not yet include formal radiologist review of generated outputs. Finally, citation coverage is a useful but imperfect proxy for factual correctness; future work should incorporate stronger faithfulness evaluation, larger-scale validation, and external benchmarking [20, 21].

10. Conclusion

We presented a grounded multimodal retrieval-augmented radiology copilot for chest X-ray impression drafting. The proposed framework combines image and report embeddings, uses FAISS for efficient case retrieval, and generates citation-grounded drafts with a confidence-based refusal policy.
Experiments on a curated MIMIC-CXR subset show that multimodal fusion greatly improves retrieval performance over image-only baselines and that the deployed system achieves strong retrieval confidence and citation coverage. These results support the use of multimodal retrieval and explicit evidence grounding as promising directions for safer and more interpretable AI assistance in radiology workflows.

References

[1] G. Litjens, T. Kooi, B. Bejnordi, A. Setio, F. Ciompi, et al., A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017) 60–88.
[2] R. Miotto, F. Wang, S. Wang, X. Jiang, J. Dudley, Deep learning for healthcare: review, opportunities and challenges, Briefings in Bioinformatics 19 (6) (2018) 1236–1246.
[3] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, et al., CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning, arXiv preprint arXiv:1711.05225 (2017).
[4] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, Proceedings of AAAI (2019).
[5] OpenAI, GPT-4 technical report, arXiv preprint (2023).
[6] C. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, D. King, Key challenges for delivering clinical impact with artificial intelligence, BMC Medicine 17 (195) (2019).
[7] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems (2020).
[8] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, et al., Retrieval-augmented generation for large language models: A survey, arXiv preprint arXiv:2312.10997 (2024).
[9] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data (2021).
[10] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y.
Deng, R. G. Mark, S. Horng, MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports, Scientific Data 6 (317) (2019).
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, et al., Learning transferable visual models from natural language supervision, in: Proceedings of ICML, 2021.
[12] J.-B. Delbrouck, P. Chambon, G. Gohy, P. Sounack, J. Chaves, et al., BioViL: A knowledge-enriched vision-language model for medical image understanding and generation, Scientific Reports (2022).
[13] B. Boecking, N. Usuyama, S. Bannur, S. Hyland, Z. Liu, et al., Making the most of text semantics to improve biomedical vision-language processing, EMNLP (2022).
[14] R. Tang, et al., MedRAG: retrieval-augmented generation for medicine, arXiv preprint arXiv:2312.10912 (2023).
[15] Q. Chen, et al., Clinical retrieval-augmented generation for evidence-based decision support, NPJ Digital Medicine (2023).
[16] A. Holzinger, C. Biemann, C. Pattichis, D. Kell, What do we need to build explainable AI systems for the medical domain?, arXiv preprint arXiv:1712.09923 (2017).
[17] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, et al., A guide to deep learning in healthcare, Nature Medicine 25 (2019) 24–29.
[18] E. Topol, High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56.
[19] Z. Zhou, et al., Foundation models for generalist medical artificial intelligence, Nature (2023).
[20] Y. Wang, et al., A survey on multimodal learning in medical imaging: progress, challenges, and future directions, Nature Machine Intelligence (2023).
[21] A. Karargyris, J. T. Wu, A. Sharma, M. Morris, et al., RadImageNet: an open radiologic deep learning research dataset for effective transfer learning, Radiology: Artificial Intelligence 3 (5) (2021).
