Boosting Medical Visual Understanding From Multi-Granular Language Learning
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our large-scale multi-granular datasets and evaluated on multiple downstream benchmarks, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
💡 Research Summary
The paper introduces Multi‑Granular Language Learning (MGLL), a contrastive learning framework designed to overcome the limitations of conventional image‑text pre‑training methods such as CLIP when applied to medical imaging. In clinical contexts, a single image often carries several high‑level disease categories (e.g., diabetic macular edema, diabetic retinopathy) and multiple layers of annotation ranging from coarse diagnostic labels to fine‑grained clinical explanations. CLIP’s single‑label, single‑granularity alignment is insufficient for these scenarios.
MGLL addresses this gap by (1) incorporating structured multi‑label supervision, (2) integrating textual descriptions across different granularities, and (3) introducing three complementary loss components: a soft CLIP loss, a point‑wise binary cross‑entropy loss, and a smooth Kullback‑Leibler (KL) divergence loss. The soft CLIP loss replaces the hard one‑to‑one pairing of CLIP with a probabilistic alignment that allows each image feature Vᵢ to be associated with a set of text features {Tᵢ₁, …, Tᵢₘ}, where m = Mᵢ is the number of texts paired with image i. The association weights wᵢₖ are derived from a co‑occurrence matrix, ensuring that more frequently co‑occurring label‑text pairs receive higher influence. This formulation enables multi‑label optimization without biasing the model toward a single dominant label.
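The soft CLIP loss described above can be sketched as a cross‑entropy against a soft target distribution rather than a one‑hot pairing. The following is a minimal NumPy sketch, not the authors' implementation: the temperature value and the exact normalization of the co‑occurrence counts into weights wᵢₖ are assumptions.

```python
import numpy as np

def soft_clip_loss(sim, cooccur, tau=0.07):
    """Soft contrastive loss: each image row targets a *distribution*
    over texts instead of a single one-hot positive.

    sim     : (N, M) image-text similarity matrix
    cooccur : (N, M) nonnegative co-occurrence counts; cooccur[i, k] > 0
              when text k describes a label present in image i
    tau     : softmax temperature (value here is an assumption)
    """
    # Normalize co-occurrence counts into per-image target weights w_ik.
    w = cooccur / cooccur.sum(axis=1, keepdims=True)
    # Numerically stable log-softmax over texts for each image.
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Soft cross-entropy: -sum_k w_ik * log p_ik, averaged over images.
    return float(-(w * log_p).sum(axis=1).mean())
```

With a one‑hot `cooccur` this reduces exactly to the standard CLIP image‑to‑text loss, which is what makes the formulation a strict generalization.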
The point‑wise loss further refines alignment at the individual pair level. By applying a sigmoid to the similarity logits and computing binary cross‑entropy against ground‑truth match indicators yᵢⱼ, the model learns a calibrated match score for each image‑text pair, improving fine‑grained discrimination among closely related disease sub‑categories.
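Because this term treats every pair independently, it amounts to per‑element binary cross‑entropy over the similarity matrix. A minimal sketch, assuming raw similarity logits and a binary indicator matrix yᵢⱼ (details such as scaling of the logits are assumptions):

```python
import numpy as np

def pointwise_bce(sim, y, eps=1e-12):
    """Binary cross-entropy applied to each image-text pair independently.

    sim : (N, M) similarity logits
    y   : (N, M) binary ground-truth match indicators y_ij
    """
    p = 1.0 / (1.0 + np.exp(-sim))  # sigmoid turns logits into match probabilities
    # Standard BCE, averaged over all pairs; eps guards log(0).
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())
```

Unlike the softmax‑based contrastive term, this loss does not force texts within a row to compete, so several positives per image can all be pushed toward probability one.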
Cross‑granularity consistency is enforced through the smooth KL loss. For each granularity level, the model produces a probability distribution Pᵢ over the label space. MGLL computes the mean distribution M across all granularities and minimizes the KL divergence D_KL(Pᵢ‖M) for each i. This encourages the representations of coarse and fine annotations to converge toward a shared latent space, mitigating over‑fitting to any single granularity and enhancing generalization across heterogeneous label sets.
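The consistency term above — the average of D_KL(Pᵢ‖M) against the mean distribution M — can be sketched directly. This is an illustrative NumPy version (a Jensen‑Shannon‑style construction, consistent with the description; the epsilon smoothing is an assumption):

```python
import numpy as np

def cross_granularity_kl(dists, eps=1e-12):
    """Mean of D_KL(P_i || M) over granularity levels, where M is the
    element-wise average of the per-granularity distributions.

    dists : (G, C) array, one probability distribution over C labels
            per granularity level
    """
    P = np.asarray(dists, dtype=float)
    M = P.mean(axis=0, keepdims=True)          # shared mean distribution
    # KL(P_i || M) per granularity; eps keeps the logs finite.
    kl = (P * (np.log(P + eps) - np.log(M + eps))).sum(axis=1)
    return float(kl.mean())
```

The loss is zero exactly when all granularity levels produce the same distribution, which is the convergence behavior the paper attributes to this term.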
Architecturally, MGLL is plug‑and‑play: it retains the standard Vision Transformer (ViT) for image encoding and BERT for text encoding, adding no extra parameters. Only the loss functions are altered, preserving computational efficiency and allowing seamless integration with existing vision‑language foundations.
To evaluate the approach, the authors constructed two large‑scale multi‑granular datasets: MGLL‑Fundus (retinal images) and MGLL‑Xray (chest radiographs). Each image is paired with multiple textual annotations spanning disease categories, severity grades, and clinical narratives. After pre‑training MGLL on these datasets, the model was fine‑tuned on more than ten public medical benchmarks covering classification, segmentation, and retrieval tasks. Across all metrics—multi‑label accuracy, AUROC, F1‑score—MGLL consistently outperformed baselines including vanilla CLIP, MedCLIP, Multi‑Label‑CLIP, and recent multi‑label contrastive methods. Notably, the model excelled at distinguishing fine‑grained severity levels (e.g., severe vs. moderate diabetic macular edema) where CLIP typically collapses to coarse categories.
Theoretical analysis contrasts MGLL with CLIP’s objective. CLIP maximizes the similarity of a single image‑text pair while minimizing similarity to all negatives, leading to a single peak in the probability distribution. MGLL’s soft CLIP loss spreads probability mass over multiple relevant texts, and the KL alignment forces the distributions of different granularities to share a common mean, which mathematically reduces variance and improves robustness.
In summary, MGLL contributes three key innovations: (1) soft multi‑label alignment via co‑occurrence‑weighted contrastive loss, (2) point‑wise binary supervision for fine‑grained discrimination, and (3) smooth KL‑based cross‑granularity consistency. These components together enable a vision‑language model to learn richer, more flexible representations for medical images without incurring additional computational overhead. The work demonstrates that multi‑granular language supervision is a powerful avenue for advancing medical visual understanding and sets a new benchmark for future multimodal medical AI research.