Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling


Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.


💡 Research Summary

The paper tackles the problem of few‑shot, weakly supervised whole‑slide image (WSI) classification, a setting that is increasingly common in computational pathology due to privacy constraints, the rarity of certain diseases, and the prohibitive cost of obtaining slide‑level annotations. While recent works have begun to incorporate vision‑language models (VLMs) such as CLIP, BLIP, or domain‑adapted variants (PLIP, CONCH) into multiple‑instance learning (MIL) pipelines, they typically suffer from two fundamental shortcomings: (1) they do not explicitly model hierarchical interactions within each modality across different magnifications (e.g., 5× coarse tissue patterns vs. 20× cellular details), and (2) they fail to align visual and textual representations at the same scale, leaving the multimodal fusion weak and noisy.

HiVE‑MIL (Hierarchical Vision‑Language MIL) is introduced to fill this gap. The method first extracts patches at two magnifications: low‑resolution (5×) patches serve as coarse nodes, and each low‑resolution patch is further subdivided into a 4 × 4 grid of high‑resolution (20×) patches, yielding fine nodes. Both visual and textual embeddings are obtained from frozen VLM encoders. Textual prompts are generated automatically by a large language model (LLM) using a hierarchical template that asks for morphological descriptors at 5× and sub‑descriptors at 20× for each class. Learnable prompt tokens are prepended to each textual description to match the dimensionality of visual embeddings.
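As a minimal sketch of this two‑scale tiling, the snippet below maps one coarse 5× patch to the pixel coordinates of its 4 × 4 grid of 20× children. The 256‑pixel patch size and all names are illustrative assumptions for exposition, not the authors' implementation:

```python
def child_coords(parent_x, parent_y, patch_size=256, factor=4):
    """Top-left 20x pixel coordinates of the fine patches covering one
    coarse 5x patch located at (parent_x, parent_y) in 5x pixel space.

    20x has `factor` times the resolution of 5x, so one coarse patch
    spans factor * patch_size pixels at 20x and splits into a
    factor x factor grid of fine patches.
    """
    base_x, base_y = parent_x * factor, parent_y * factor
    return [(base_x + i * patch_size, base_y + j * patch_size)
            for i in range(factor) for j in range(factor)]
```

For example, the coarse patch at the slide origin yields 16 fine patches whose top-left corners range from (0, 0) to (768, 768) in 20× coordinates.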

All nodes are placed into a unified heterogeneous graph with two edge families: (i) Hierarchical edges connecting parent (coarse) and child (fine) nodes within the same modality, and (ii) Intra‑scale heterogeneous edges linking visual and textual nodes that share the same magnification. A Modality‑Scale Attention (MSA) module processes hierarchical edges, allowing information to flow from global to local scales while preserving modality‑specific context.
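The two edge families can be sketched as follows, under the simplifying (and purely illustrative) assumptions of one textual node per class per scale and string node identifiers:

```python
def build_graph_edges(num_coarse, fanout=16, num_classes=2):
    """Return (hier_edges, intra_edges) as lists of (src, dst) pairs.

    Visual nodes: v5_<i> at 5x, v20_<j> at 20x; textual nodes:
    t5_<k>/t20_<k>, one per class per scale (an assumption made
    here for clarity).
    """
    hier_edges, intra_edges = [], []
    for p in range(num_coarse):
        for c in range(fanout):
            # (i) hierarchical parent-child edge, visual modality
            hier_edges.append((f"v5_{p}", f"v20_{p * fanout + c}"))
    for k in range(num_classes):
        # (i) hierarchical edge within the textual modality
        hier_edges.append((f"t5_{k}", f"t20_{k}"))
        for p in range(num_coarse):
            # (ii) intra-scale heterogeneous edge at 5x
            intra_edges.append((f"v5_{p}", f"t5_{k}"))
        for f in range(num_coarse * fanout):
            # (ii) intra-scale heterogeneous edge at 20x
            intra_edges.append((f"v20_{f}", f"t20_{k}"))
    return hier_edges, intra_edges
```

In practice such edge lists would feed a heterogeneous GNN; the point here is only the two distinct edge families.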

To avoid spurious visual‑textual pairings, HiVE‑MIL incorporates a two‑stage Text‑Guided Dynamic Filtering (TGDF) module. In stage 1, low‑resolution patches whose cosine similarity with the corresponding low‑resolution textual prompt falls below a learned threshold are discarded. In stage 2, only high‑resolution patches that belong to retained low‑resolution patches and also exceed a second similarity threshold are kept. This top‑down filtering dramatically reduces noise before constructing intra‑scale heterogeneous edges.
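A sketch of the two‑stage filtering is given below; the fixed thresholds `tau5`/`tau20` stand in for the learned thresholds described above, and all signatures are assumptions for exposition:

```python
import numpy as np

def tgdf(coarse_feats, fine_feats, parent_of, t5, t20, tau5=0.3, tau20=0.3):
    """Two-stage top-down text-guided filtering sketch.

    coarse_feats: (N5, D) L2-normalized 5x patch embeddings
    fine_feats:   (N20, D) L2-normalized 20x patch embeddings
    parent_of:    (N20,) index of each fine patch's coarse parent
    t5, t20:      (D,) L2-normalized textual prompt embeddings
    Returns boolean keep-masks for each scale.
    """
    # Stage 1: discard coarse patches dissimilar to the 5x prompt
    # (dot product of L2-normalized vectors = cosine similarity).
    keep5 = (coarse_feats @ t5) >= tau5
    # Stage 2: a fine patch survives only if its parent survived
    # AND it matches the 20x prompt.
    keep20 = keep5[parent_of] & ((fine_feats @ t20) >= tau20)
    return keep5, keep20
```

Only patch-text pairs that survive both stages would then contribute intra‑scale heterogeneous edges.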

Semantic consistency across scales in the textual space is enforced by a Hierarchical Text Contrastive Loss (HTCL). HTCL treats low‑ and high‑resolution textual embeddings of the same class as positive pairs and pushes apart embeddings of different classes, thereby aligning the hierarchical textual semantics.
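An InfoNCE‑style sketch of this idea is shown below, assuming one textual embedding per class per scale; the temperature value and the exact loss form are illustrative, not the paper's definition:

```python
import numpy as np

def htcl(t5, t20, temperature=0.07):
    """Hierarchical text contrastive loss sketch.

    t5, t20: (C, D) L2-normalized textual embeddings for C classes at
    5x and 20x; row k of t5 and row k of t20 form a positive pair,
    all cross-class rows are negatives.
    """
    logits = (t5 @ t20.T) / temperature          # (C, C) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # pull same-class pairs together
```

When the 5× and 20× embeddings of each class already agree, the loss is near zero; mismatched class pairings drive it up.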

The overall training objective combines (a) a cross‑entropy loss on slide‑level logits, (b) the hierarchical visual‑textual alignment loss (derived from the graph), and (c) HTCL, all weighted appropriately. The graph neural network updates node representations iteratively, and the final slide‑level representation is obtained by aggregating the refined node embeddings.
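Schematically, the combined objective can be written as follows; the weighting symbols are illustrative, as the summary does not give the exact coefficients:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}
  + \lambda_{1}\,\mathcal{L}_{\text{align}}
  + \lambda_{2}\,\mathcal{L}_{\text{HTCL}}
```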

Extensive experiments were conducted on three TCGA cancer cohorts: breast (BRCA), lung (LUAD/LUSC), and kidney (KIRC/KIRP). The authors evaluate few‑shot scenarios with 4, 8, and 16 labeled slides per class. Baselines include classic MIL models (ABMIL, CLAM, TransMIL), recent VLM‑MIL approaches (TOP, FOCUS, Multi‑Scale CLIP), and domain‑adapted VLM backbones (PLIP, CONCH). HiVE‑MIL consistently outperforms all baselines, achieving up to a 4.1 percentage‑point increase in macro‑averaged F1 under the 16‑shot setting. Ablation studies demonstrate that removing TGDF, HTCL, or MSA each leads to a noticeable drop in performance, confirming the contribution of each component. Moreover, the method remains robust when swapping the underlying VLM encoder, indicating good generality.

In summary, HiVE‑MIL makes three key contributions: (1) a hierarchical graph that explicitly models parent‑child relationships across magnifications for both visual and textual modalities, (2) a text‑guided dynamic filtering mechanism that prunes weak visual‑text pairs before heterogeneous edge construction, and (3) a hierarchical contrastive loss that aligns textual semantics across scales. By jointly addressing intra‑modality hierarchy and inter‑modality alignment, HiVE‑MIL sets a new state‑of‑the‑art for few‑shot WSI classification and offers a blueprint for extending hierarchical multimodal learning to other gigapixel or high‑resolution domains.

