DINO-Mix: Distilling Foundational Knowledge with Cross-Domain CutMix for Semi-supervised Class-imbalanced Medical Image Segmentation
Semi-supervised learning (SSL) has emerged as a critical paradigm for medical image segmentation, mitigating the immense cost of dense annotations. However, prevailing SSL frameworks are fundamentally “inward-looking”, recycling information and biases solely from within the target dataset. Under class imbalance, this design triggers a vicious cycle of confirmation bias, leading to catastrophic failures in recognizing minority classes. To dismantle this systemic issue, we propose a paradigm shift to a multi-level “outward-looking” framework. Our primary innovation is Foundational Knowledge Distillation (FKD), which looks outward beyond the confines of medical imaging by introducing a pre-trained visual foundation model, DINOv3, as an unbiased external semantic teacher. Instead of trusting the student’s biased high confidence, our method distills knowledge from DINOv3’s robust understanding of high semantic uniqueness, providing a stable, cross-domain supervisory signal that anchors the learning of minority classes. To complement this core strategy, we further look outward within the data by proposing Progressive Imbalance-aware CutMix (PIC), which creates a dynamic curriculum that adaptively forces the model to focus on minority classes in both labeled and unlabeled subsets. This layered strategy forms our framework, DINO-Mix, which breaks the vicious cycle of bias and achieves remarkable performance on the challenging semi-supervised class-imbalanced medical image segmentation benchmarks, Synapse and AMOS.
💡 Research Summary
The paper tackles two pervasive problems in semi‑supervised medical image segmentation: severe class imbalance and the resulting confirmation bias that causes minority structures to be ignored. Existing semi‑supervised methods are “inward‑looking”: they generate pseudo‑labels or consistency constraints solely from the target dataset, so any early preference for majority classes is amplified during training. To break this vicious cycle, the authors propose an “outward‑looking” framework called DINO‑Mix, which combines two complementary mechanisms.
- Foundational Knowledge Distillation (FKD). A frozen DINOv3 vision transformer, pre‑trained on massive natural‑image collections via self‑supervision, is used as an external semantic teacher. Because DINOv3 learns generic visual cues (texture, shape, structure) without being tied to any class distribution, its representations are unbiased with respect to the medical dataset’s skewed statistics. For each unlabeled 3D volume, the student encoder produces a deep feature map; the same volume is sliced and fed through the DINOv3 encoder, after which the slice‑wise features are re‑assembled into a 3D teacher volume. A lightweight 3‑D projector aligns channel dimensions, and an L2‑normalized mean‑squared error loss (L_distill) forces the student’s features to match the teacher’s. A stop‑gradient operation guarantees that only the student updates. This external signal supplies strong gradients for visually distinctive but low‑confidence minority regions, preventing the model from reinforcing its own erroneous predictions.
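The distillation objective described above can be sketched numerically. This is an illustrative NumPy version under stated assumptions: features are pre-flattened to `(N, C)` arrays, the lightweight 3‑D projector is stood in for by a simple matrix, and the teacher features are treated as constants (mirroring the paper’s stop‑gradient); the function names are hypothetical, not the authors’ API.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each feature vector to unit length before comparison."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fkd_distill_loss(student_feats, teacher_feats, proj):
    """L2-normalized MSE between projected student features and frozen
    DINOv3 teacher features (a sketch, not the authors' implementation).

    student_feats: (N, C_s) flattened voxel features from the student encoder.
    teacher_feats: (N, C_t) slice-wise DINOv3 features re-assembled into a
                   volume and flattened the same way; held fixed here, which
                   plays the role of the stop-gradient on the teacher.
    proj: (C_s, C_t) matrix standing in for the lightweight 3-D projector.
    """
    s = l2_normalize(student_feats @ proj)  # align channels, then normalize
    t = l2_normalize(teacher_feats)         # frozen teacher target
    return float(np.mean((s - t) ** 2))
```

When student and teacher features already agree (and the projector is the identity), the loss is zero; mismatched features yield a strictly positive value, supplying the corrective gradient described above.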
- Progressive Imbalance‑aware CutMix (PIC). At the data level, the method dynamically mixes patches from labeled and unlabeled images, biasing the sampling toward rare classes early in training. Class frequencies N_c are measured on a representative labeled subset, and an imbalance ratio I_c = min_j N_j / N_c is computed. A class‑balancing distribution P_bal is derived using a focusing exponent γ, emphasizing tail classes. During training, a progressive factor α_t = min(1, E/(η·E_max)) linearly interpolates between P_bal and a uniform distribution P_uni, yielding an epoch‑dependent sampling distribution P_E. Consequently, the curriculum starts with heavy minority‑class exposure and gradually shifts to a balanced, general‑purpose mixing regime, avoiding over‑fitting to rare classes while still ensuring sufficient representation.
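The quantities in this bullet compose into a single sampling distribution. The sketch below follows the summary’s notation (N_c, I_c, γ, α_t, P_bal, P_uni, P_E); the exact weighting form P_bal ∝ I_c^γ and the direction of the interpolation (minority-heavy at the start, uniform at the end) are assumptions consistent with the description, not a verified reimplementation.

```python
import numpy as np

def pic_sampling_distribution(class_counts, gamma, epoch, eta, max_epochs):
    """Epoch-dependent class-sampling distribution P_E for PIC (a sketch)."""
    N = np.asarray(class_counts, dtype=float)
    I = N.min() / N                        # imbalance ratio I_c = min_j N_j / N_c
    P_bal = I ** gamma                     # focusing exponent gamma emphasizes tail classes
    P_bal /= P_bal.sum()                   # normalize to a distribution
    P_uni = np.full_like(P_bal, 1.0 / P_bal.size)
    alpha = min(1.0, epoch / (eta * max_epochs))   # progressive factor alpha_t
    return (1.0 - alpha) * P_bal + alpha * P_uni   # start minority-heavy, end uniform
```

Early in training (α_t ≈ 0) the rarest class receives the largest sampling probability; once E ≥ η·E_max, α_t saturates at 1 and sampling becomes uniform, matching the curriculum described above.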
The overall DINO‑Mix pipeline builds on a standard EMA‑teacher consistency backbone, adds an auxiliary classifier to separate representation learning from imbalance handling, and jointly optimizes the consistency loss, the auxiliary classification loss, L_distill, and the PIC mixing loss.
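The backbone mechanics mentioned here are standard and can be summarized compactly. In this sketch the EMA momentum and the loss weights are illustrative assumptions; the paper’s actual coefficients and schedules are not specified in the summary.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """EMA-teacher update: teacher weights track a slow moving average of
    the student (momentum value here is a common default, an assumption)."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def dino_mix_total_loss(l_cons, l_aux, l_distill, l_pic,
                        w_cons=1.0, w_aux=1.0, w_d=1.0, w_pic=1.0):
    """Weighted sum of the four objectives named in the summary:
    consistency, auxiliary classification, L_distill, and the PIC mixing
    loss. The weights are placeholders, not the paper's values."""
    return w_cons * l_cons + w_aux * l_aux + w_d * l_distill + w_pic * l_pic
```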
Experimental validation is performed on two challenging benchmarks: Synapse (multi‑organ CT) and AMOS (multi‑institution MRI), under extremely low label ratios (1 %–5 %). DINO‑Mix consistently outperforms state‑of‑the‑art semi‑supervised methods such as Mean Teacher, UA‑MT, CPS, and CReST, achieving higher Dice and mean IoU scores. The most striking gains are observed on minority classes (e.g., small vessels, lesions), where IoU improvements exceed 10 % absolute in several cases, demonstrating that the external teacher effectively rescues under‑represented structures.
Strengths of the work include: (i) the novel use of a frozen, domain‑agnostic foundation model that supplies bias‑free semantic guidance; (ii) a curriculum‑style CutMix that adaptively focuses on rare classes without sacrificing overall generalization; (iii) a relatively simple integration of 2‑D foundation features into a 3‑D segmentation pipeline.
Limitations are noted: DINOv3 is trained on natural images, so its visual priors may not perfectly align with medical textures; the need for a projection head adds extra parameters; and processing volumes slice‑by‑slice discards some 3‑D contextual continuity.
Future directions could explore medical‑domain self‑supervised foundation models as teachers, design fully 3‑D token‑based teachers to preserve volumetric context, or combine vision‑language models to inject anatomical terminology. Overall, DINO‑Mix presents a compelling paradigm shift from inward to outward learning, offering a robust solution to the long‑standing class‑imbalance problem in semi‑supervised medical image segmentation.