S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation
📝 Abstract
Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose \textsc{S2D-Align}, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. \textsc{S2D-Align} implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public \textsc{MIMIC-CXR} and \textsc{IU X-Ray} benchmarks, where \textsc{S2D-Align} achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
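The shallow-to-deep strategy described above can be read as a three-stage curriculum over progressively richer training inputs. The sketch below is a minimal illustration of that idea, assuming hypothetical field names (`radiograph_tokens`, `reference_report`, `key_phrases`) and a simple prompt-concatenation scheme; it is not the authors' implementation.

```python
# Hypothetical sketch of a shallow-to-deep SFT curriculum: each stage
# adds a finer-grained auxiliary signal to the training input.

def build_prompt(sample, stage):
    """Assemble the training input for one alignment stage."""
    parts = [sample["radiograph_tokens"]]            # stage 1: radiograph-report pairing
    if stage >= 2:
        parts.append(sample["reference_report"])     # stage 2: instance-level guidance
    if stage >= 3:
        parts.extend(sample["key_phrases"])          # stage 3: anatomical grounding
    return " ".join(parts)

def run_curriculum(dataset, stages=(1, 2, 3)):
    """Yield (stage, prompt) pairs in shallow-to-deep order."""
    prompts = []
    for stage in stages:
        for sample in dataset:
            prompts.append((stage, build_prompt(sample, stage)))
    return prompts
```

Each pass over the data would fine-tune the MLLM on the prompts of one stage before moving to the next, so that coarse alignment is established before fine-grained grounding is introduced.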
📄 Content
Jiechao Gao1,*, Chang Liu2, Yuangang Li3
1Stanford University, 2University of Science and Technology of China, 3University of California, Irvine
jiechao@stanford.edu, christzhaung@gmail.com, yuanganl@uci.edu
Introduction

Medical imaging, such as X-rays and Computed Tomography (CT), serves as an indispensable non-invasive tool in modern diagnostics, offering a crucial way to visualize the internal structures of the human body. Following the interpretation of these images, radiologists are required to record detailed diagnostic reports that translate complex visual findings into precise medical language, forming a critical basis for subsequent clinical decision-making. This manual process, however, is not only time-consuming but also susceptible to errors and omissions, particularly for less experienced radiologists, which can potentially degrade the quality of patient care. To mitigate these challenges, the task of Radiology Report Generation (RRG) has been motivated by recent studies (Jing, Xie, and Xing 2018a; Li et al. 2018; Chen et al. 2020b; Liu et al. 2021a; Chen et al. 2021a; Qin and Song 2022a), aiming to develop automatic solutions that alleviate the workload of radiologists; this research direction has attracted great attention from both the artificial intelligence and clinical medicine communities.

*Corresponding author. Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recent breakthroughs in Large Language Models (LLMs) (Touvron et al. 2023) have motivated Multimodal Large Language Models (MLLMs) (Zhu et al. 2023; Liu et al. 2023) as the cornerstone for RRG, effectively overcoming the alignment challenges inherent in earlier methods (Liu et al. 2023) trained from scratch on limited datasets. Adapting these general MLLMs to the medical domain primarily involves two competing strategies, i.e., In-Context Learning (ICL) and Supervised Fine-Tuning (SFT). ICL methods (Yan et al. 2023), which keep the LLM parameters frozen, typically rely on external annotators such as RadGraph (Jain et al. 2021) to convert visual information into structured text (e.g., entities and relations), upon which few-shot demonstrations guide the generation. However, their performance is highly sensitive to the quality of these text-based representations and to the choice of demonstration examples, limiting their robustness in complex clinical scenarios. Consequently, SFT has emerged as the dominant paradigm, establishing end-to-end alignment by directly fine-tuning the MLLM on radiograph-report pairs (Liu et al. 2024; Wang et al. 2025; Hyland et al. 2023; Tu et al. 2023). Despite its prevalence, the standard SFT framework faces a critical bottleneck: it performs alignment only at a coarse granularity, between the entire image and its corresponding report. This coarse-grained approach, confounded by the templated and often redundant nature of radiology reports, fails to establish precise correspondence between specific pathological findings and their anatomical locations. This deficiency in alignment granularity directly undermines the factual correctness and clinical reliability of the generated reports.
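The memory-based adapter that bridges the alignment stages is described only at a high level in the abstract. A minimal attention-over-memory sketch is given below; the slot count, dimensions, residual read-out, and moving-average write rule are all assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MemoryAdapter:
    """Hypothetical memory bank read via attention, so features from an
    earlier alignment stage can guide a later one."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)

    def read(self, features):
        # features: (seq_len, dim); attend over memory slots
        scores = features @ self.memory.T / np.sqrt(features.shape[-1])
        attn = softmax(scores)                 # (seq_len, num_slots)
        return features + attn @ self.memory   # residual read-out

    def write(self, features, rate=0.1):
        # exponential moving average retains earlier-stage features
        self.memory = (1 - rate) * self.memory + rate * features.mean(axis=0)
```

In a staged setup, `write` would be called during a shallow stage and `read` during a deeper one, giving the fine-grained stage access to coarse-alignment features without retraining them.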