EIR: Enhanced Image Representations for Medical Report Generation

Reading time: 6 minutes

📝 Abstract

Generating medical reports from chest X-ray images is a critical and time-consuming task for radiologists, especially in emergencies. To alleviate the stress on radiologists and reduce the risk of misdiagnosis, numerous research efforts have been dedicated to automatic medical report generation in recent years. Most recent studies have developed methods that represent images by utilizing various medical metadata, such as the clinical document history of the current patient and the medical graphs constructed from retrieved reports of other similar patients. However, all existing methods integrate additional metadata representations with visual representations through a simple “Add and LayerNorm” operation, which suffers from the information asymmetry problem due to the distinct distributions between them. In addition, chest X-ray images are usually represented using pre-trained models based on natural domain images, which exhibit an obvious domain gap between general and medical domain images. To this end, we propose a novel approach called Enhanced Image Representations (EIR) for generating accurate chest X-ray reports. We utilize cross-modal transformers to fuse metadata representations with image representations, thereby effectively addressing the information asymmetry problem between them, and we leverage medical domain pre-trained models to encode medical images, effectively bridging the domain gap for image representation. Experimental results on the widely used MIMIC and Open-I datasets demonstrate the effectiveness of our proposed method.


📄 Content

In today’s medical practice, especially during critical situations such as COVID-19 [1] or similar pandemics, a medical report serves as the primary medium for conveying the doctor’s diagnosis [2]. Radiologists analyze both normal and abnormal regions in radiology images from different views, drawing on their medical expertise and accumulated professional experience to write detailed reports [3]. However, this process is time-consuming and laborious for radiologists. To alleviate the burden of growing demand for imaging examinations and to assist less experienced radiologists in identifying abnormalities, there is increasing demand for research on the automatic generation of medical reports that are both clinically accurate in describing the associated disease and corresponding symptoms, and fluent in generating realistic text [4].

In recent years, fueled by progress in image captioning [5], [6], a highly relevant task in computer vision, many approaches [7]-[11] have been proposed to generate medical reports automatically. These approaches follow principles similar to those employed in the image captioning task. Early methods, such as CNN-HRNN [12], [13], usually adopt an encoder-decoder structure to generate medical reports directly: image features are extracted by Convolutional Neural Networks (CNNs) [14], [15], which serve as the encoder, and then fed into Recurrent Neural Networks (RNNs) [16], which serve as the decoder, converting the visual features from the medical images into reports. More recently, to strengthen generation capabilities, the choice of text decoder has shifted from vanilla RNNs to more powerful models, such as Long Short-Term Memory (LSTM) networks [17] and Transformers [18]. However, these image captioning methods [7]-[9] only take images as input to generate simplistic descriptive sentences and disregard other available metadata, such as the clinical document history of the current patient and the existing reports of other similar patients, which are crucial for producing comprehensive, contextually rich, and structured reports.
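The paper does not include code, but the encoder-decoder scheme above can be sketched in a few lines. The snippet below is a toy illustration only: random matrices stand in for a trained CNN encoder and RNN decoder, the vocabulary is six hypothetical token ids, and all names are assumptions rather than the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 16, 6  # toy sizes; real systems use CNN features and large vocabularies

# Stand-in for pooled CNN encoder features of a chest X-ray (random here).
image_feature = rng.normal(size=d)

# Tiny RNN decoder: its state is initialized from the image feature,
# then tokens are produced greedily, one per step.
w_hh = rng.normal(size=(d, d)) * 0.1      # recurrent weights
w_xh = rng.normal(size=(vocab, d)) * 0.1  # input (token) weights
w_out = rng.normal(size=(d, vocab)) * 0.1 # output projection to vocabulary
embed = np.eye(vocab)                      # one-hot token embeddings

state = np.tanh(image_feature)  # encoder output conditions the decoder
token = 0                       # hypothetical <bos> token id
generated = []
for _ in range(5):
    state = np.tanh(state @ w_hh + embed[token] @ w_xh)
    token = int(np.argmax(state @ w_out))  # greedy choice of next word
    generated.append(token)
print(generated)  # five token ids from the toy vocabulary
```

In a real system the one-hot embeddings would be learned, the decoder would be an LSTM or Transformer as the paragraph notes, and decoding would typically use beam search rather than this greedy loop.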

Recently, Nguyen et al. [19] proposed an approach to enhance the image representation, aiming to generate more accurate reports by incorporating the historical medical reports of the current patient as additional input. Subsequently, Liu et al. [20] and Li et al. [21] utilized a universal graph constructed from the retrieved reports of other similar patients as additional input. However, all of these methods merely combine the representations of the additional metadata with visual representations through a simple “Add and LayerNorm” operation, which raises the issue of information asymmetry due to the distinct distributions between the image and other metadata representations. In addition, the effectiveness of incorporating all the different kinds of metadata as additional input for medical report generation remains unexplored in existing methods.

[Fig. 1. An example of chest X-ray images and their metadata from MIMIC-CXR [24]. The metadata for an image may include a medical report with multiple sections or a pre-constructed graph as in [25].]
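To see why naive “Add and LayerNorm” fusion is sensitive to distribution mismatch, consider summing two feature vectors drawn from different distributions. The scales below are hypothetical (the paper only states that the distributions differ), but the sketch shows the larger-scale metadata signal dominating the fused representation even after normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean / unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(0)
d = 64
# Visual and metadata features from different distributions
# (scales are assumptions for illustration only).
visual = rng.normal(loc=0.0, scale=1.0, size=d)
metadata = rng.normal(loc=3.0, scale=5.0, size=d)

# The "Add and LayerNorm" fusion used by prior work:
fused = layer_norm(visual + metadata)

# The fused vector correlates far more strongly with the larger-scale
# metadata signal than with the visual signal.
corr_visual = np.corrcoef(fused, visual)[0, 1]
corr_meta = np.corrcoef(fused, metadata)[0, 1]
print(abs(corr_meta) > abs(corr_visual))  # metadata dominates the sum
```

Because LayerNorm is an affine rescaling of the sum, it cannot rebalance the two modalities; whichever input has the larger variance dominates, which is one way to read the “information asymmetry” the paper describes.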

Furthermore, most existing methods [19], [22], [23] solely rely on the pre-trained models based on general domain images, such as ImageNet, to extract visual representations from images. However, they overlook the significant domain gap that exists between images from the general domain and the medical domain. As a result, these methods are unable to generate reports that accurately describe specific crucial abnormalities within the medical dataset. This constraint stems from the necessity for detailed recognition in medical tasks, the intricate and specialized nature of numerous complex medical terminologies, and the inadequate representation of medical images by pre-trained models trained on general domain data.

To address these concerns, we present an approach called Enhanced Image Representations (EIR) for generating accurate chest X-ray reports. Our approach integrates various metadata as supplementary input and leverages a pre-trained model specifically trained on medical images. Specifically, it consists of three main modules: the encoding module, the aggregation module, and the decoding module. The encoding module encodes the images along with their metadata, which includes the clinical documents of the current patient and the universal graph constructed from the retrieved reports of other similar patients. We leverage pre-trained models trained on medical domain data to encode the medical images, which bridges the domain gap and ensures accurate representation of the medical images. The aggregation module utilizes cross-modal transformers to combine the metadata representations
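The core of the cross-modal fusion described above is cross-attention: image tokens query the metadata tokens, so the two modalities are combined by a learned alignment rather than element-wise addition. The following is a minimal single-head sketch with random weights; the token counts, dimensions, and function names are assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """One cross-modal attention step: visual tokens attend over
    metadata tokens, producing metadata-aware image representations."""
    q = queries @ wq          # (n_img, d)
    k = keys_values @ wk      # (n_meta, d)
    v = keys_values @ wv      # (n_meta, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # per-image-token weights over metadata
    return attn @ v

rng = np.random.default_rng(1)
d = 32
image_tokens = rng.normal(size=(49, d))  # e.g. a 7x7 patch grid (assumed shape)
meta_tokens = rng.normal(size=(12, d))   # e.g. encoded report sections / graph nodes
wq, wk, wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
fused_tokens = cross_attention(image_tokens, meta_tokens, wq, wk, wv)
print(fused_tokens.shape)  # (49, 32)
```

A full cross-modal transformer layer would add multiple heads, residual connections, LayerNorm, and a feed-forward block around this step, but the attention mechanism above is what lets each visual token weight the metadata adaptively instead of receiving it as a fixed additive offset.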

This content is AI-processed based on ArXiv data.
