REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student by aligning their predictive distributions. However, conventional KD formulations, typically based on the Kullback-Leibler (KL) divergence, assume that the teacher provides reliable soft targets. In practice, teacher predictions are often noisy or overconfident, and existing correction-based approaches rely on ad-hoc heuristics and extensive hyper-parameter tuning, which hinders generalization. We introduce REDistill (Robust Estimator Distillation), a simple yet principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power-divergence loss, a generalization of KL divergence that adaptively downweights unreliable teacher outputs while preserving informative logit relationships. This formulation provides a unified and interpretable treatment of teacher noise, requires only logits, integrates seamlessly into existing KD pipelines, and incurs negligible computational overhead. Extensive experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures. Remarkably, it achieves these gains without model-specific hyper-parameter tuning, underscoring its robustness and strong generalization to unseen teacher-student pairs.
💡 Research Summary
Knowledge Distillation (KD) traditionally relies on minimizing the Kullback–Leibler (KL) divergence between a teacher’s softened output distribution and that of a student model. While effective when the teacher is accurate, KL’s logarithmic penalty makes the training highly sensitive to noisy or over‑confident teacher predictions, leading to error propagation in the student. Existing remedies—such as swapping logits, amplifying ground‑truth probabilities, or masking classes—are heuristic, require extensive hyper‑parameter tuning, and often distort inter‑class relationships.
REDistill addresses these shortcomings by grounding KD in robust statistics. The authors replace the KL term with a power‑divergence loss, a parametric family that generalizes KL and includes a robustness parameter λ. When λ > 0 the loss uses a (1‑λ)‑logarithm (a Box–Cox transform of the natural log), which smoothly down‑weights large likelihood ratios p/q. This reduces the influence of outlier teacher outputs while preserving useful logit information.
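To make the idea concrete, here is a minimal sketch under an assumed parameterization: the natural log inside KL is swapped for the Box-Cox deformed logarithm log_{1-λ}(x) = (x^λ − 1)/λ, which recovers ln(x) as λ → 0 and bounds the penalty on classes where the student probability is very small. The names `deformed_log`, `power_divergence`, and `lam` are illustrative; the exact formulation in the paper may differ.

```python
import torch

def deformed_log(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Box-Cox transform of the natural log; equals torch.log(x) when lam == 0."""
    if abs(lam) < 1e-8:
        return torch.log(x)
    return (x.pow(lam) - 1.0) / lam

def power_divergence(p: torch.Tensor, q: torch.Tensor, lam: float) -> torch.Tensor:
    """KL-like divergence with the log replaced by its deformed counterpart.

    For lam == 0 this is exactly KL(p || q); for lam > 0 the contribution of
    classes where the student probability q is tiny stays bounded instead of
    blowing up like -log(q), tempering noisy or over-confident teacher targets.
    """
    return (p * (deformed_log(p, lam) - deformed_log(q, lam))).sum(dim=-1).mean()

# Sanity check on random "teacher" (p) and "student" (q) distributions.
p = torch.softmax(torch.randn(4, 100), dim=-1)
q = torch.softmax(torch.randn(4, 100), dim=-1)
kl = (p * (p.log() - q.log())).sum(dim=-1).mean()
print(power_divergence(p, q, lam=0.0).item(), kl.item())  # identical: KL limit
print(power_divergence(p, q, lam=2 / 3).item())           # robust variant
```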
Through influence‑function analysis, the paper shows that the per‑sample influence under the power‑divergence loss scales as qθ/(1+λ), meaning larger λ values directly improve robustness to noisy teacher signals. Theoretical work on asymptotic unbiasedness and empirical studies suggest λ = 2⁄3 as a near‑optimal trade‑off between statistical efficiency (small |λ|) and robustness (large λ). Consequently, REDistill’s loss is defined as:
L_REDISTILL = KL(y, qθ) + α·D_{2/3}(p_target, qθ_target) + β·D_{2/3}(p_nontarget, qθ_nontarget)
where D_{2/3} denotes the power‑divergence of order 2⁄3, and α, β balance target versus non‑target contributions (as in decoupled KD). The method also accommodates temperature scaling: when logits are softened by τ, the power‑divergence term is multiplied by τ² to keep gradient magnitudes comparable to the original KD formulation.
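The following hedged sketch assembles such a composite objective, handling the target/non-target split in the decoupled-KD style mentioned above (a binary target-vs-rest distribution plus a renormalized distribution over non-target classes). The hyper-parameter values (α, β, τ) are placeholders, and the actual REDistill implementation may organize the terms differently.

```python
import torch
import torch.nn.functional as F

def power_divergence(p, q, lam):
    """Deformed-log divergence from the earlier sketch (equals KL when lam -> 0)."""
    dlog = lambda x: torch.log(x) if abs(lam) < 1e-8 else (x.pow(lam) - 1.0) / lam
    return (p * (dlog(p) - dlog(q))).sum(dim=-1).mean()

def redistill_style_loss(student_logits, teacher_logits, labels,
                         lam=2 / 3, alpha=1.0, beta=8.0, tau=4.0):
    """Sketch of CE(y, q) + tau^2 * [alpha * D(p_t, q_t) + beta * D(p_nt, q_nt)]."""
    # Hard-label term; with one-hot y, KL(y, q) equals cross-entropy up to a constant.
    ce = F.cross_entropy(student_logits, labels)

    # Temperature-softened teacher and student distributions.
    p = torch.softmax(teacher_logits / tau, dim=-1)
    q = torch.softmax(student_logits / tau, dim=-1)

    target = F.one_hot(labels, num_classes=p.size(-1)).bool()

    # Binary target-vs-rest distributions (decoupled-KD convention).
    p_t = torch.stack([p[target], 1.0 - p[target]], dim=-1)
    q_t = torch.stack([q[target], 1.0 - q[target]], dim=-1)

    # Renormalized distributions over the non-target classes.
    p_nt = p.masked_fill(target, 0.0)
    q_nt = q.masked_fill(target, 0.0)
    p_nt = p_nt / p_nt.sum(dim=-1, keepdim=True)
    q_nt = q_nt / q_nt.sum(dim=-1, keepdim=True)

    # tau**2 keeps the soft-target gradients on a scale comparable to standard KD.
    kd = alpha * power_divergence(p_t, q_t, lam) + beta * power_divergence(p_nt, q_nt, lam)
    return ce + (tau ** 2) * kd

# Example usage with random tensors (batch of 8, 100 classes as in CIFAR-100).
s_logits, t_logits = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(redistill_style_loss(s_logits, t_logits, labels).item())
```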
Experimental evaluation spans CIFAR‑100 and ImageNet‑1k, covering 14 teacher‑student pairs across architectures such as ResNet, WideResNet, VGG, MobileNetV2, and SHN‑V2. REDistill consistently outperforms baseline KD, DKD, LSKD, and RLD, achieving 0.3–1.0 % higher top‑1 accuracy without any model‑specific hyper‑parameter search. Notably, the same λ = 2⁄3 and fixed α, β values are used throughout, demonstrating the method’s robustness and ease of deployment. Computational overhead is negligible; the power‑divergence term adds only a simple scalar transformation to the standard KL loss, preserving the training speed and memory footprint of conventional KD.
In summary, REDistill provides a principled, statistically‑grounded alternative to heuristic teacher‑correction strategies. By integrating a robust divergence into the KD objective, it automatically mitigates the adverse effects of noisy or over‑confident teachers while retaining the efficiency of logit‑based distillation. The approach is model‑agnostic, hyper‑parameter light, and readily extensible to scenarios such as multi‑teacher distillation, datasets with heavy label noise, or domains beyond vision, marking a significant step toward more reliable and generalizable knowledge transfer.