Lightweight Distillation for Visual Alignment in Medical Vision-Language Models

Reading time: 5 minutes

📝 Abstract

Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.

📄 Content

Medical Large Vision-Language Models (Med-LVLMs), such as LLaVA-Med-1.5 and HuatuoGPT-Vision, have shown strong potential in clinical applications (Li et al. 2024; Chen et al. 2024a,c; Thawakar et al. 2024; Moor et al. 2023). However, recent studies (Xia et al. 2024; Gu et al. 2024; Chen et al. 2024b; Chang et al. 2025a; Wang et al. 2024) have revealed that these models often produce inaccurate or hallucinated responses that fail to faithfully reflect the input medical images. To the best of our knowledge, no existing work has proposed targeted methods to mitigate hallucinations specifically in Med-LVLMs. Existing hallucination mitigation strategies primarily focus on general-purpose LVLMs and follow two main directions: (1) enhancing visual grounding and reducing over-reliance on textual input through contrastive decoding, applied at either the attention or input level (Leng et al. 2024; Favero et al. 2024; Liu, Zheng, and Chen 2025; Tu et al. 2025; Chen et al. 2025); and (2) correcting attention biases, such as the overemphasis on background elements or “register” tokens among visual inputs (Darcet et al. 2024; Woo et al. 2024; Gong et al. 2024). While these techniques may alleviate hallucinations, they do not explicitly improve the distribution of visual attention or ensure that the model focuses on clinically relevant regions. More critically, they overlook a key contributor to hallucination in Med-LVLMs: the quality of the learned visual representations.

Preliminary Analysis. To address these limitations, we conduct a preliminary analysis to investigate two fundamental factors driving hallucinations in Med-LVLMs: (1) the quality of the visual representations learned by the model, and (2) the alignment of visual attention during generation.

(1) Insufficient Visual Representation Learning. Unlike images in the general domain that contain diverse objects, medical images often feature recurring anatomical structures, such as the lungs, heart, and ribs in chest X-rays. Ideally, a well-trained Med-LVLM should learn similar representations for the same organ across different images. To evaluate the quality of visual representation learning in existing Med-LVLMs, we adopt LLaVA-Med-1.5 as a representative model. We randomly sample 100 abdominal CT scans from the SLAKE (Liu et al. 2021) dataset, which includes Region-of-Interest (RoI) annotations for image patches. For analysis, we extract the visual token representations from various layers of the Transformer-based large language model (LLM) used in LLaVA-Med-1.5 and visualize five key entities using t-SNE. The results, shown in Figure 1 (a), illustrate how visual representations evolve across layers. We can observe that LLaVA-Med-1.5 fails to clearly distinguish key entities in medical images, resulting in entangled and dispersed visual representations. For example, the representations of “liver cancer” are heavily mixed with those of “liver” and other nearby organs. These results suggest that current Med-LVLMs exhibit insufficient visual representation learning, particularly for clinically critical concepts, which may lead to poor visual reasoning and an increased risk of hallucinations.
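The paper visualizes this entanglement with t-SNE; the same idea can also be quantified numerically. Below is a minimal sketch (not from the paper) that scores how well per-entity visual token features cluster: mean intra-entity cosine similarity minus mean inter-entity cosine similarity. The function names and the toy data are illustrative assumptions; in practice the inputs would be token embeddings grouped by RoI annotation.

```python
import numpy as np

def mean_cosine(a, b):
    """Mean pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

def separation_score(features_by_entity):
    """Intra-entity minus inter-entity similarity: higher means the model
    keeps visual tokens of the same anatomical entity closer together."""
    names = list(features_by_entity)
    intra = np.mean([mean_cosine(features_by_entity[n], features_by_entity[n])
                     for n in names])
    inter = np.mean([mean_cosine(features_by_entity[a], features_by_entity[b])
                     for i, a in enumerate(names) for b in names[i + 1:]])
    return float(intra - inter)

# Toy check: well-separated entity clusters score higher than entangled ones.
rng = np.random.default_rng(0)
tight = {"liver": rng.normal(5, 0.1, (20, 8)), "lung": rng.normal(-5, 0.1, (20, 8))}
mixed = {"liver": rng.normal(0, 1.0, (20, 8)), "lung": rng.normal(0, 1.0, (20, 8))}
print(separation_score(tight) > separation_score(mixed))  # → True
```

A low score on real Med-LVLM features would reflect exactly the entangled, dispersed representations reported for LLaVA-Med-1.5.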

(2) Visual Attention Misalignment. A well-trained Med-LVLM should understand both the input image and the corresponding text prompt and assign higher attention weights to image regions relevant to the medical concepts mentioned in the prompt. However, as previously discussed, the issue of insufficient visual representation learning in Med-LVLMs leads to a secondary problem in the LLM component: visual attention misalignment.

[Figure: example query “…any signs of liver cancer?” Ground Truth: Yes]

Given these findings, a straightforward strategy to enhance Med-LVLMs is to replace the original CLIP encoder with a domain-specific expert encoder like UniMed-CLIP. However, this requires re-training the visual projection layer and adapting the entire Med-LVLM to the new feature space using large-scale data, which is computationally intensive. This limitation motivates the design of a lightweight, non-invasive approach to transfer alignment knowledge from expert CLIP models without fully replacing the original visual encoder.

Our Approach. We propose MEDALIGN, a novel alignment distillation framework designed to enhance Med-LVLMs by transferring both visual representations and attention patterns from a domain-specific expert CLIP model. As illustrated in Figure 3, given an input image-prompt pair, we extract two alignment signals from the expert CLIP: (1) visual representations and (2) visual attention maps. These signals are then distilled into intermediate layers of the Med-LVLM to improve its alignment with clinically relevant content. To enable lightweight and non-invasive integration, we introduce two core components. First, a spatial-aware visual alignment loss captures the pairwise similarity structure among image patches, reflected in the expert CLIP’s visual features, and transfers it to the Med-LVLM’s internal representation
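The two distillation signals described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: the spatial-aware loss matches the student's and teacher's patch-patch cosine-similarity matrices, and the attention-aware loss is rendered here as a KL divergence between attention distributions over patches (the exact loss forms, feature dimensions, and layer choices are assumptions).

```python
import numpy as np

def similarity_matrix(patch_feats):
    """Row-normalized patch features -> pairwise cosine-similarity matrix."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    return f @ f.T

def spatial_alignment_loss(student_feats, teacher_feats):
    """MSE between the student's (Med-LVLM) and the teacher's (expert CLIP)
    patch-similarity structures: transfers relational structure, not raw
    features, so the two models may live in different feature spaces."""
    diff = similarity_matrix(student_feats) - similarity_matrix(teacher_feats)
    return float((diff ** 2).mean())

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL divergence pushing the student's visual-attention distribution
    toward the teacher's (each a nonnegative vector over image patches)."""
    p = teacher_attn / teacher_attn.sum()
    q = student_attn / student_attn.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy check: features already aligned with the teacher incur a lower loss.
rng = np.random.default_rng(1)
teacher = rng.normal(size=(16, 32))              # 16 patches, 32-dim features
aligned = teacher + rng.normal(0, 0.01, (16, 32))
random_ = rng.normal(size=(16, 32))
print(spatial_alignment_loss(aligned, teacher)
      < spatial_alignment_loss(random_, teacher))  # → True
```

Matching similarity structures rather than raw features is what keeps the approach non-invasive: no projection layer needs to be re-trained to map the Med-LVLM into the expert CLIP's feature space.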

This content is AI-processed based on ArXiv data.
