Investigating the Robustness of Subtask Distillation under Spurious Correlation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Subtask distillation is an emerging paradigm in which compact, specialized models are extracted from large, general-purpose foundation models for deployment in resource-constrained environments or on standalone systems. Although distillation leverages a teacher model, the student still relies on a distillation dataset that is often limited in size and may lack representativeness or exhibit spurious correlations. In this paper, we evaluate established distillation methods, as well as the recent SubDistill method, when the distillation data contain spurious correlations. As the strength of these correlations increases, we observe a widening gap between advanced methods such as SubDistill, which remain fairly robust, and several baseline methods, which degrade to near-random performance. Overall, our study underscores the challenges of knowledge distillation on imperfect, real-world datasets, particularly those with spurious correlations.


💡 Research Summary

This paper investigates how spurious correlations in the data used for subtask distillation affect the robustness of various distillation techniques. Subtask distillation aims to compress a large foundation model into a compact student model specialized for a specific downstream task. While the teacher model provides rich knowledge, the student still relies on a finite dataset that may contain biases or spurious relationships. To systematically study this, the authors construct a synthetic benchmark based on ImageNet. They select five “wading bird” classes (spoonbill, flamingo, crane, limpkin, bustard) as the subtask and overlay a 56×56 MNIST digit in the top‑left corner of each 224×224 image. During training, the digit class is perfectly correlated with the bird class (e.g., spoonbill ↔ digit 0), creating a spurious correlation. At test time, digits are assigned randomly, breaking the correlation. The contamination rate ρ (percentage of images with the digit) is varied across 0 %, 50 %, and 100 % to simulate increasing levels of spurious bias.
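The benchmark construction described above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code: a 56×56 digit patch is pasted into the top-left corner of a 224×224 image, the digit index matches the bird-class index during training (perfect spurious correlation), and it is drawn at random at test time. The function and variable names are hypothetical, and a constant-valued array stands in for an actual MNIST image.

```python
import numpy as np

BIRD_CLASSES = ["spoonbill", "flamingo", "crane", "limpkin", "bustard"]

def paste_digit(image, digit_patch):
    """Overlay a 56x56 digit patch onto the top-left corner of a 224x224 image."""
    out = image.copy()
    out[:56, :56] = digit_patch
    return out

def make_example(image, class_idx, split, rng, contamination=1.0):
    """Return (image, digit) with the spurious pairing applied.

    `contamination` plays the role of the paper's rate rho: the fraction
    of images that receive a digit overlay at all.
    """
    if rng.random() >= contamination:
        return image, None  # clean image, no digit overlay
    if split == "train":
        digit = class_idx          # perfectly correlated: spoonbill <-> 0, etc.
    else:
        digit = int(rng.integers(5))  # correlation broken at test time
    # Stand-in for a real MNIST image of that digit:
    patch = np.full((56, 56), fill_value=digit, dtype=image.dtype)
    return paste_digit(image, patch), digit
```

With `contamination=0.5`, half of the images stay clean, mirroring the ρ = 50 % setting.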

The study evaluates five distillation methods: (1) Output‑Only (standard KL‑divergence on softmax logits), (2) Attention Transfer (AT), which aligns teacher‑student attention maps, (3) Variational Information Distillation (VID), which maximizes mutual information between teacher and student representations, (4) VKD, a recent method employing task‑dependent normalization and orthogonal adapters, and (5) SubDistill, a newly proposed approach that identifies subtask‑relevant subspaces in each teacher layer, projects both teacher and student activations onto these subspaces, and aligns them via an orthogonal rotation. SubDistill thus combines subtask awareness with layer‑wise alignment.
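The Output-Only baseline is the standard Hinton-style distillation objective: a KL divergence between temperature-softened teacher and student softmax outputs. The sketch below shows that loss in NumPy under those standard assumptions; it is not the paper's implementation, and the temperature value is illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax with a max-shift for numerical stability."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is the usual correction that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
```

The layer-wise methods (AT, VID, VKD, SubDistill) add further losses on intermediate activations on top of an output term like this one.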

Three teacher‑student pairs are examined to cover a range of architectures: (a) ResNet‑18 → ResNet‑18‑S (a slightly slimmer ResNet‑18), (b) WideResNet‑101 → MBNet‑v4 (a large CNN teacher distilled to a mobile‑friendly network), and (c) ViT‑B16 → EfficientFormer‑v2 (a transformer teacher distilled to an efficient transformer‑based student). For each configuration, hyper‑parameters are tuned by grid‑searching a single λ weight for all layer‑wise losses (λ ∈ {0.01, 0.1, 1, 10, 100}). Training runs for 100 epochs with AdamW (lr = 0.001, decay = 0.5 every 25 epochs) and early‑stopping on a validation set that mirrors the contamination statistics of the test set. Each experiment is repeated three times with different random seeds.
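The schedule and grid described above are simple to state in code. This sketch reflects the reported setup (AdamW at lr = 0.001, halved every 25 epochs over 100 epochs, one shared λ grid-searched over five values); the function name is illustrative.

```python
# Single weight lambda shared by all layer-wise losses, as reported:
LAMBDA_GRID = [0.01, 0.1, 1, 10, 100]

def learning_rate(epoch, base_lr=1e-3, decay=0.5, step=25):
    """Step decay: multiply the base learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 100-epoch run this yields four plateaus: 1e-3, 5e-4, 2.5e-4, and 1.25e-4.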

Quantitative results (Table I) show a clear degradation trend as ρ increases. With no contamination (ρ = 0 %) all methods achieve >90 % accuracy. At ρ = 50 % the Output‑Only baseline drops to 70–84 % depending on the pair, while AT, VID, and VKD lose roughly 15–25 % absolute accuracy. At the extreme ρ = 100 % the baseline collapses to near‑random performance (24–45 % accuracy). In stark contrast, SubDistill maintains high accuracy across all settings: 86 %–96 % for ResNet‑18‑S, 66 %–92 % for WideResNet‑101→MBNet‑v4, and 58 %–96 % for ViT‑B16→EfficientFormer‑v2. The gap is especially pronounced for the larger teacher‑student pairs, where the baseline methods sometimes fall to ~20 % (random guess for five classes) while SubDistill stays well above 80 %.

To understand why SubDistill is more robust, the authors visualize student representations using t‑SNE on the global average‑pooled layer. The teacher’s embeddings form tight clusters by bird class, independent of the MNIST digit. Baseline students (Output‑Only, AT, VID, VKD) instead cluster by digit, indicating that the spurious feature dominates their learned representation. SubDistill’s student embeddings closely resemble the teacher’s, preserving class‑based clusters and ignoring the digit.

Further qualitative analysis employs an occlusion‑based XAI method. Images are divided into a 4 × 4 grid of 56 × 56 patches; the impact of removing each patch on the model’s output is measured. SubDistill’s relevance maps highlight the central bird region and largely ignore the top‑left digit patch, whereas baseline models assign high relevance to the digit region, confirming that they rely on the spurious cue for prediction.
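The occlusion procedure above can be sketched generically: mask each 56×56 patch of a 4×4 grid in turn and record the drop in the model's class score as that patch's relevance. The snippet below is a minimal, model-agnostic version; `score_fn` is a hypothetical stand-in for a trained model's class probability, and the fill value is an assumption.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=56, fill=0.0):
    """Relevance of each patch = score drop when that patch is masked out."""
    base = score_fn(image)
    n = image.shape[0] // patch               # 224 // 56 = 4 patches per side
    relevance = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            relevance[i, j] = base - score_fn(occluded)
    return relevance
```

A model that relies on the spurious digit would show high relevance in cell (0, 0), the top-left patch, which is exactly the pattern reported for the baseline students.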

The authors conclude that (i) spurious correlations can severely compromise subtask distillation, especially for methods that only align outputs, and (ii) techniques that enforce tight, layer‑wise student‑teacher alignment and incorporate subtask‑specific information (e.g., SubDistill) are substantially more resilient. They also note that the effect is more pronounced for larger, more expressive teachers, suggesting that careful model selection and data curation are essential.

Limitations include the synthetic nature of the benchmark (ImageNet + MNIST) which may not capture the full complexity of real‑world spurious factors in domains such as medicine or finance, and the additional computational overhead of SubDistill’s subspace extraction. Future work is suggested to explore diverse domains, multi‑factor spurious settings, and more efficient subspace identification techniques.

Overall, the paper provides a thorough empirical assessment of distillation robustness under biased data, highlights the vulnerability of conventional methods, and demonstrates that SubDistill offers a promising path toward reliable, compact models even when training data contain strong spurious signals.

