Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher’s own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $ξ$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $λ > 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^\star(λ)$ for any value of $λ$ and show that it obeys the sign rule: $\operatorname{sign}(ξ^\star(λ)) = -\operatorname{sign}(R'(λ))$. In particular, $ξ^\star(λ)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $ξ^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
💡 Research Summary
The paper provides a rigorous statistical analysis of self‑distillation (SD) for ridge regression, extending the concept beyond the usual convex combination of labels and teacher predictions. In the “unconstrained” setting the mixing weight ξ is allowed to take any real value, which enables the method to exploit regimes where the teacher is over‑regularized and a negative contribution from the teacher’s predictions can actually improve performance. The authors first fix the training data and consider the conditional squared prediction risk of the ridge teacher, R(λ), for any regularization level λ>0. They prove that whenever R(λ) is non‑stationary (i.e., its derivative R′(λ)≠0), there exists an optimal mixing weight ξ⁎(λ) such that the student’s risk R_sd(ξ⁎,λ) is strictly smaller than R(λ). The optimal ξ⁎(λ) admits a closed‑form expression and obeys the simple sign rule
sign(ξ⁎(λ)) = –sign(R′(λ)).
Consequently, if the teacher’s risk decreases with λ (the under‑regularized region), the optimal ξ⁎ is positive, while in the over‑regularized region, where the risk rises with λ, the optimal ξ⁎ becomes negative. This sign rule explains why “negative distillation” can be beneficial, a phenomenon previously observed only empirically.
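The mechanics above can be illustrated numerically. The sketch below is a minimal simulation, not the paper’s estimator: it fits a ridge teacher, retrains a student on the mixed labels (1 − ξ)y + ξŷ_teacher over an unconstrained grid of ξ (including negative values), and checks that the best student never does worse than the teacher, since ξ = 0 recovers the teacher exactly. The dimensions, λ, and noise level are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)          # a fixed (deterministic) signal for this run
y = X @ beta + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sd_student(X, y, lam, xi):
    """Student refit on mixed labels (1 - xi) * y + xi * teacher predictions."""
    b_teacher = ridge(X, y, lam)
    y_mix = (1.0 - xi) * y + xi * (X @ b_teacher)
    return ridge(X, y_mix, lam)

# Squared prediction risk on a held-out design (noiseless targets).
X_test = rng.standard_normal((5000, p))
risk = lambda b: np.mean((X_test @ b - X_test @ beta) ** 2)

lam = 5.0
teacher_risk = risk(ridge(X, y, lam))
xis = np.linspace(-2.0, 2.0, 401)                   # unconstrained: xi may be negative
risks = np.array([risk(sd_student(X, y, lam, xi)) for xi in xis])
xi_best = xis[np.argmin(risks)]
print(f"teacher risk {teacher_risk:.4f}, best SD risk {risks.min():.4f} at xi = {xi_best:.2f}")
```

Because ξ = 0 reproduces the teacher bit-for-bit, the grid minimum can only match or beat the teacher risk; whether the minimizer lands at positive or negative ξ depends on whether λ under- or over-regularizes this particular draw.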
To quantify the improvement, the authors move to the proportional asymptotics regime where both the sample size n and the feature dimension p tend to infinity with a fixed aspect ratio γ = p/n. They assume a general anisotropic covariance Σ for the features and a deterministic signal β. While classical ridge analysis provides second‑order deterministic equivalents for the risk (involving traces of (Σ+λI)⁻¹), SD introduces fourth‑order interactions between the teacher and student predictors. The paper’s technical core is a novel block‑linearization technique that expands the product of two resolvents (XᵀX+λI)⁻¹ and (XᵀX+λ′I)⁻¹ to fourth‑order accuracy. This yields exact deterministic equivalents for both the optimal mixing weight and the resulting SD risk:
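For context on what a second-order deterministic equivalent looks like, the sketch below checks the classical ridge resolvent approximation (1/p) tr (Σ̂ + λI)⁻¹ ≈ (1/p) tr (Σ/(1+δ) + λI)⁻¹, where δ solves the fixed-point equation δ = (1/n) tr Σ (Σ/(1+δ) + λI)⁻¹. This is the standard random-matrix equivalent that the paper extends to fourth order; the specific fixed-point parametrization, covariance spectrum, and sizes here are my choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1500, 750                      # proportional regime, aspect ratio p/n = 0.5
lam = 1.0
sigma = np.linspace(0.5, 2.0, p)      # eigenvalues of a diagonal anisotropic Sigma

# Empirical side: sample covariance of rows x_i = Sigma^{1/2} z_i.
Z = rng.standard_normal((n, p))
X = Z * np.sqrt(sigma)                # each row has covariance diag(sigma)
S_hat_eigs = np.linalg.eigvalsh((X.T @ X) / n)
m_emp = np.mean(1.0 / (S_hat_eigs + lam))          # (1/p) tr (S_hat + lam I)^{-1}

# Deterministic side: solve delta = (1/n) tr Sigma (Sigma/(1+delta) + lam I)^{-1}
# by plain fixed-point iteration (a contraction for lam > 0).
delta = 1.0
for _ in range(500):
    delta = np.sum(sigma / (sigma / (1.0 + delta) + lam)) / n
m_det = np.mean(1.0 / (sigma / (1.0 + delta) + lam))

print(f"empirical {m_emp:.4f} vs deterministic equivalent {m_det:.4f}")
```

The two traces agree up to O(1/p) fluctuations; in the isotropic case Σ = I this fixed point reduces to the Marchenko–Pastur self-consistent equation.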
ξ⁎(λ) = − R′(λ) / (τ₁(λ) + τ₂(λ)),
R_sd⁎(λ) = R(λ) − R′(λ)² / (τ₁(λ) + τ₂(λ)),
where τ₁(λ) and τ₂(λ) are data‑dependent quantities whose sum is the (positive) curvature of the student risk in ξ. The gain over the teacher is therefore strictly positive whenever R′(λ) ≠ 0, matching the strict‑improvement guarantee, and both quantities admit exact deterministic equivalents in the proportional limit.
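A useful structural fact behind one-shot tuning: the SD student’s predictor is affine in ξ, so any squared prediction risk is an exact quadratic in ξ, and the minimizer has a closed form. The sketch below is not the paper’s consistent estimator; it simply recovers the exact optimum from three risk evaluations by fitting that quadratic, then checks it beats a dense grid search. Setup choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)
X_test = rng.standard_normal((5000, p))

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lam = 5.0
b_teacher = ridge(X, y, lam)
b_distill = ridge(X, X @ b_teacher, lam)   # ridge fit purely on teacher predictions

def sd_risk(xi):
    # By linearity of ridge in the labels, the student on mixed labels equals
    # (1 - xi) * teacher + xi * distilled fit, i.e. it is affine in xi.
    b = (1.0 - xi) * b_teacher + xi * b_distill
    return np.mean((X_test @ b - X_test @ beta) ** 2)

# Risk is a quadratic a + b*xi + c*xi^2: three evaluations determine it exactly.
r_m, r_0, r_p = sd_risk(-1.0), sd_risk(0.0), sd_risk(1.0)
c = (r_p + r_m - 2.0 * r_0) / 2.0          # curvature (positive here)
b_lin = (r_p - r_m) / 2.0                  # linear coefficient
xi_star = -b_lin / (2.0 * c)
print(f"closed-form xi* = {xi_star:.3f}, risk {sd_risk(xi_star):.4f} vs teacher {r_0:.4f}")
```

Because the quadratic is exact (not an approximation), the three-point minimizer is globally optimal over all real ξ, which is the sense in which tuning can avoid grid search and refitting; the paper’s estimator achieves this consistently from data alone, without access to test risk.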