ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To leverage these heterogeneous sources effectively, we introduce a text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on the BCSS-WSSS dataset demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient, thanks to frozen foundation-model backbones and lightweight trainable adapters.
💡 Research Summary
The paper introduces ConStruct, a novel framework for weakly‑supervised semantic segmentation (WSSS) of histopathology images that synergistically combines a large‑scale vision‑language foundation model (CONCH) with a state‑of‑the‑art segmentation backbone (SegFormer). Traditional WSSS pipelines rely on class activation maps (CAMs) derived from classification backbones, which tend to highlight only the most discriminative regions and thus fail to capture the full spatial extent of tissue structures—especially problematic in histopathology where intra‑class heterogeneity and inter‑class homogeneity are prevalent. ConStruct addresses these limitations through three tightly integrated components:
- Structural Knowledge Distillation – A teacher‑student scheme in which a frozen SegFormer (MiT‑B1) provides multi‑scale structural cues while a frozen CONCH ViT‑B/16 supplies morphology‑aware semantic features. Lightweight adapters (≈6.3 M trainable parameters, 3.7 % of the total) refine the CONCH features. Instead of aligning raw feature vectors, the method aligns pairwise token‑to‑token affinity matrices (cosine similarities between token embeddings) across selected scales using an MSE loss. This relational distillation transfers boundary‑aware structural knowledge from SegFormer to CONCH without heavy computational overhead.
- Text‑Guided Prototype Initialization – For each tissue class, a detailed pathology description is encoded with CONCH’s frozen text encoder, producing class‑specific text embeddings. A shared two‑layer MLP projects these embeddings into the visual feature space, after which an adaptive layer maps them to the dimensionality of the refined visual tokens. The prototype bank is thus initialized with domain knowledge rather than random or purely image‑derived vectors. During training, cosine similarity between the refined visual tokens and the prototypes yields per‑class activation maps; global average pooling of these maps provides image‑level predictions supervised by a binary cross‑entropy loss.
- Mask Refinement & Contrastive Alignment – The prototype‑derived CAMs are thresholded adaptively (α = 0.5) to obtain foreground/background masks. Cropped foreground and background regions are re‑encoded by CONCH to obtain region embeddings. An InfoNCE‑style contrastive loss pulls foreground embeddings toward their class prototypes while pushing them away from other class prototypes and a memory bank of negatives; background embeddings are attracted to a dedicated background prototype. This contrastive term (L_sim) encourages spatially coherent, class‑consistent pseudo‑masks.
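The relational distillation in the first component can be sketched compactly. The following is a minimal NumPy illustration under assumptions (function names, shapes, and the exact normalization are ours, not the authors' code): student and teacher token embeddings are each turned into an N × N cosine-similarity affinity matrix, and the two matrices are matched with an MSE loss.

```python
import numpy as np

def affinity_matrix(tokens):
    """Pairwise token-to-token cosine similarity: (N, D) -> (N, N)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return normed @ normed.T

def structural_distill_loss(student_tokens, teacher_tokens):
    """MSE between student (CONCH) and teacher (SegFormer) affinities.

    Aligning N x N affinity matrices rather than raw D-dim features
    lets the two backbones use different embedding dimensions, as long
    as the token count N matches at the selected scale."""
    a_student = affinity_matrix(student_tokens)
    a_teacher = affinity_matrix(teacher_tokens)
    return float(np.mean((a_student - a_teacher) ** 2))
```

Note that the loss is zero when the student reproduces the teacher's relational structure exactly, regardless of the absolute feature values, which is what makes this a "relational" rather than a feature-matching distillation.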
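The prototype-to-token matching in the second component can likewise be sketched. This hedged NumPy sketch (shapes and names are assumptions) shows how cosine similarity against the text-initialized prototype bank yields per-class activation maps, and how global average pooling turns them into image-level scores for the binary cross-entropy loss.

```python
import numpy as np

def l2_normalize(x):
    """Normalize vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def prototype_cams(visual_tokens, prototypes):
    """Cosine similarity of refined visual tokens (N, D) against
    class prototypes (C, D) -> per-class activation maps (C, N)."""
    return l2_normalize(prototypes) @ l2_normalize(visual_tokens).T

def image_level_probs(cams):
    """Global average pooling over tokens, then a sigmoid, giving
    image-level class probabilities for a BCE-style loss."""
    pooled = cams.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-pooled))
```

Because both tokens and prototypes are L2-normalized, each activation lies in [-1, 1], and the (C, N) map can be reshaped to the spatial grid to serve as the CAM that later thresholding turns into a pseudo-mask.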
The overall training objective combines classification (L_cls), structural distillation (L_struct), and contrastive alignment (L_sim) with weights λ_cls = 1.0, λ_struct = 1.5, λ_sim = 0.2. All backbone weights remain frozen; only adapters, prototypes, and contrastive heads are updated, yielding a highly parameter‑efficient system.
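The stated objective reduces to a simple weighted sum; a trivial sketch with the reported weights (loss-term names are ours):

```python
def total_loss(l_cls, l_struct, l_sim,
               lam_cls=1.0, lam_struct=1.5, lam_sim=0.2):
    """Weighted combination of classification, structural distillation,
    and contrastive alignment terms, using the paper's reported weights."""
    return lam_cls * l_cls + lam_struct * l_struct + lam_sim * l_sim
```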
During inference, six test‑time augmentations (horizontal flip and brightness scaling) are applied, and the resulting CAMs are averaged. A dense Conditional Random Field (CRF) post‑processes the averaged maps to sharpen object boundaries.
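The test-time augmentation averaging could look like the sketch below, where six variants arise from crossing a horizontal flip with three brightness scales. The specific brightness factors are illustrative assumptions, `cam_fn` stands in for the full model forward pass, and the dense-CRF post-processing step is omitted.

```python
import numpy as np

def tta_average_cams(image, cam_fn, brightness_scales=(0.9, 1.0, 1.1)):
    """Average CAMs over horizontal flip x brightness scaling (6 variants).

    Assumes a single-channel (H, W) input; flipped predictions are
    flipped back before averaging so all CAMs share one orientation."""
    cams = []
    for scale in brightness_scales:
        scaled = image * scale
        cams.append(cam_fn(scaled))                  # original orientation
        flipped_cam = cam_fn(scaled[:, ::-1])        # horizontally flipped input
        cams.append(flipped_cam[:, ::-1])            # undo the flip
    return np.mean(cams, axis=0)
```

In the full pipeline, the averaged map would then be passed to a dense CRF to sharpen boundaries before the final pseudo-mask is extracted.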
Experimental validation is performed on the BCSS‑WSSS benchmark, which includes multiple tissue classes and varying annotation costs. ConStruct achieves superior mean Intersection‑over‑Union (mIoU) and Dice scores compared to CAM‑based methods (Grad‑CAM, Score‑CAM), MIL‑based approaches, and recent prototype‑based frameworks such as LPD and ProtoSeg. Notably, the structural distillation component contributes an 8–12 percentage‑point gain in boundary accuracy, while text‑guided prototype initialization improves class coverage and reduces false negatives. Ablation studies confirm that removing any of the three components degrades performance, underscoring their complementary roles.
Key contributions and implications:
- Demonstrates that domain‑specific textual descriptions can effectively initialize prototypes, bridging the semantic gap inherent in weak supervision.
- Introduces a lightweight relational distillation mechanism that transfers fine‑grained structural cues without requiring full fine‑tuning of large backbones.
- Shows that a frozen‑backbone, adapter‑only training regime can achieve state‑of‑the‑art WSSS performance while keeping the trainable parameter count low, facilitating deployment in resource‑constrained clinical settings.
Limitations and future directions include the reliance on image‑level labels (extension to whole‑slide image (WSI) level supervision remains open) and the manual crafting of pathology prompts. The authors suggest integrating large language models for automatic prompt generation and scaling the teacher‑student distillation to hierarchical WSI representations as promising avenues for further research.