Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $ξ$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $λ > 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^\star(λ)$ for any value of $λ$ and show that it obeys the sign rule $\operatorname{sign}(ξ^\star(λ)) = -\operatorname{sign}(R'(λ))$. In particular, $ξ^\star(λ)$ can be negative, as is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge while their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $ξ^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
Knowledge distillation (KD), introduced by Buciluǎ et al. (2006); Ba and Caruana (2014); Hinton et al. (2015), is conventionally used for model compression, transferring knowledge from a large teacher to a smaller student. Recently, this paradigm has been adapted to the setting where teacher and student share the same architecture and training data, a process known as self-distillation (SD) (Furlanello et al., 2018; Zhang et al., 2021). While it may seem counterintuitive that a model would improve by learning from its own predictions, extensive empirical evidence shows that SD can in fact boost generalization (Chen et al., 2017, 2022; Li et al., 2017; Ahn et al., 2019; Li et al., 2021; Gou et al., 2021). Despite these successes, it remains unclear whether and when such improvements can be guaranteed.
Formally, let $f$ be a teacher trained on $\{(x_i, y_i)\}_{i=1}^{n}$ using a loss function $\ell$. Self-distillation trains a student $f_{\mathrm{sd}}$ on the same data by minimizing a mixed objective that is an affine interpolation of the losses incurred with respect to the ground-truth labels $y_i$ and the teacher's predictions $f(x_i)$. In detail, the SD procedure seeks the student model
$$f_{\mathrm{sd}} \in \operatorname*{arg\,min}_{g} \; \sum_{i=1}^{n} \Big[ (1-ξ)\, \ell\big(g(x_i),\, y_i\big) + ξ\, \ell\big(g(x_i),\, f(x_i)\big) \Big],$$
where $ξ$ is the mixing parameter (Lopez-Paz et al., 2015); see Figure 1. When $ξ = 1$, the student learns solely from the teacher's predictions; we call this pure distillation (PD) and denote the resulting predictor by $f_{\mathrm{pd}}$.
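For the squared loss, completing the square shows that the mixed objective is equivalent to an ordinary fit on the blended response $(1-ξ)\,y_i + ξ\, f(x_i)$, so for ridge regression the student is obtained by a single refit on the teacher's blended targets. The snippet below is a minimal sketch of this specialization (our own illustration, not the paper's code): the synthetic data, the ridge convention $\hat{β}_λ = (X^\top X + λ I)^{-1} X^\top y$, and all variable names are assumptions made for concreteness.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y (one common convention)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def self_distilled_ridge(X, y, lam, xi):
    """Self-distilled ridge: refit on the blended response (1 - xi) * y + xi * teacher predictions.

    xi = 0 recovers the teacher, xi = 1 is pure distillation, and xi may be any real number.
    """
    beta_teacher = ridge_fit(X, y, lam)
    y_blend = (1.0 - xi) * y + xi * (X @ beta_teacher)
    return ridge_fit(X, y_blend, lam)

# Tiny synthetic example (all settings hypothetical).
rng = np.random.default_rng(0)
n, p, lam = 200, 50, 5.0
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + 0.5 * rng.standard_normal(n)

beta_pd = self_distilled_ridge(X, y, lam, xi=1.0)    # pure distillation
beta_neg = self_distilled_ridge(X, y, lam, xi=-0.5)  # negative mixing is allowed
# Sanity check: xi = 0 coincides with the teacher.
print(np.allclose(self_distilled_ridge(X, y, lam, xi=0.0), ridge_fit(X, y, lam)))
```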
The mixing parameter $ξ$ balances the influence of ground-truth labels against teacher predictions. Standard distillation methods restrict $ξ$ to lie in $[0, 1]$, interpreting the target loss as a convex combination. Recent work by Das and Sanghavi (2023) shows that this constraint can be suboptimal under high label noise, where the optimal mixing weight $ξ^\star$ can in fact exceed 1.
Motivated by this, we adopt a fully unconstrained perspective and allow $ξ \in \mathbb{R}$, including negative values. Note that setting $ξ = 0$ recovers the teacher predictor, hence optimizing over $ξ$ cannot perform worse than the teacher. With this in mind, we pose the following key questions about SD:
(Q1) When does the optimally mixed student $f_{\mathrm{sd}}$, trained using an optimal $ξ^\star \in \mathbb{R}$, strictly outperform the teacher $f$, and how large can the gain be?

(Q2) Can optimal SD from a suboptimal teacher achieve performance comparable to an optimally tuned teacher?

(Q3) How can we efficiently tune the optimal $ξ^\star \in \mathbb{R}$ without a computationally expensive grid search?
We provide complete answers to all these questions for ridge regression, a model in which SD admits an explicit affine path (in the response) and the risks of both the teacher and the student can be characterized sharply, capturing the interplay between regularization and distillation.
Below we describe in detail the main contributions of the paper; see Figure 2 for a visual summary.
Figure 2: Strict improvement of SD risk with unconstrained mixing. Test squared prediction risk of ridge regression ($R$, in blue), pure-distilled ridge ($R_{\mathrm{pd}}$, in light blue), and optimal self-distilled ridge ($R^\star_{\mathrm{sd}}$, in green) as functions of the ridge penalty $λ$. Results are shown on raw features from the real-world BlogFeedback and Communities and Crime datasets, and on pretrained ResNet-18 features. The optimal mixing parameter $ξ^\star(λ)$ is in red, and the one-shot risk estimate of $R^\star_{\mathrm{sd}}(λ)$ computed from the training data is shown as a green dashed line. Note that $ξ^\star(λ)$ lies in $[0, 1]$ only for a narrow range of $λ$ and can be strongly negative for large $λ$. We also observe that: (i) $R^\star_{\mathrm{sd}}(λ)$ is strictly smaller than $R(λ)$ at every $λ$ that is not a stationary point of $R(λ)$; (ii) the sign of $ξ^\star(λ)$ is opposite to the sign of $R'(λ)$; and (iii) the sign change of $ξ^\star$ happens at the stationary point of $R(λ)$. (Experiments with $ξ$ restricted to $[0, 1]$ appear in Figure 16.)

Structural nonasymptotic guarantees (Section 2). Addressing (Q1), we derive deterministic identities for self-distilled ridge that hold conditionally on the observed training data, without any distributional assumptions, and for any squared prediction risk (including the out-of-distribution risk). In particular, we show that for every $λ > 0$ at which the teacher ridge-path risk $λ \mapsto R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$), optimal mixing yields a strict improvement over the teacher (Theorem 2.2). Addressing (Q2), we provide a curvature-based sufficient condition under which the global minimum over $λ$ of the SD risk $R^\star_{\mathrm{sd}}(λ)$, obtained using the optimal mixing $ξ^\star(λ)$, is strictly smaller than the smallest ridge risk of the teacher (Proposition 2.3).
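The sign rule can be checked numerically in the ridge specialization. Since the ridge fit is linear in the response, the self-distilled estimator is affine in $ξ$, namely $\hat{β}_{\mathrm{sd}}(ξ) = \hat{β}_λ - ξ\, λ\, (X^\top X + λ I)^{-1} \hat{β}_λ$, so the optimal $ξ$ for a quadratic risk has a closed form. The sketch below is our own illustration (not the paper's code or exact risk formulas): it assumes Gaussian anisotropic features, a fixed signal $β^\star$, and the conditional excess risk $R(λ) = (\hat{β}_λ - β^\star)^\top Σ\, (\hat{β}_λ - β^\star)$, and then checks that $\operatorname{sign}(ξ^\star(λ)) = -\operatorname{sign}(R'(λ))$ across several values of $λ$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 100
eigs = np.linspace(0.5, 3.0, p)                    # hypothetical anisotropic spectrum
Sigma = np.diag(eigs)
X = rng.standard_normal((n, p)) * np.sqrt(eigs)    # rows ~ N(0, Sigma)
beta_star = rng.standard_normal(p) / np.sqrt(p)    # fixed (deterministic) signal
y = X @ beta_star + 0.7 * rng.standard_normal(n)

def ridge(lam):
    M = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return M, M @ X.T @ y

def teacher_risk(lam):
    """Conditional excess risk R(lam) = (beta_hat - beta_star)^T Sigma (beta_hat - beta_star)."""
    _, b = ridge(lam)
    e = b - beta_star
    return e @ Sigma @ e

def optimal_xi(lam):
    """Exact minimizer over xi of the SD risk along the affine path beta_sd(xi) = beta_hat + xi * d."""
    M, b = ridge(lam)
    d = -lam * (M @ b)                             # direction of the SD path
    e = b - beta_star
    return -(d @ Sigma @ e) / (d @ Sigma @ d)

for lam in (0.05, 0.5, 5.0, 50.0, 500.0):
    eps = 1e-4 * lam
    dR = (teacher_risk(lam + eps) - teacher_risk(lam - eps)) / (2 * eps)   # finite-difference R'(lam)
    xi = optimal_xi(lam)
    print(f"lam={lam:7.2f}  R'(lam)={dR:+.3e}  xi_star={xi:+.3f}  "
          f"sign rule holds: {np.sign(xi) == -np.sign(dR)}")
```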
Precise proportional asymptotics (Section 3). Returning to (Q1) in the proportional regime, in which the sample and feature sizes $n, p \to \infty$ while their aspect ratio $p/n \to γ \in (0, \infty)$, we derive exact deterministic equivalents for the optimal SD risk and mixing weight under general anisotropic covariance and deterministic signals (Theorem 3.1). These formulas quantify the SD gains.
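Although the paper's deterministic-equivalent formulas are not reproduced here, the concentration they describe can be observed empirically: at a fixed aspect ratio $γ = p/n$, the optimal mixing weight and the optimal SD risk fluctuate less and less as $n$ and $p$ grow. The following Monte Carlo sketch is our own illustration under simplifying assumptions (isotropic Gaussian features, a random signal of roughly unit norm, the per-sample ridge convention $\hat{β}_λ = (X^\top X/n + λ I)^{-1} X^\top y / n$, and arbitrary choices of $γ$, $λ$, and the noise level).

```python
import numpy as np

def optimal_sd(n, p, lam, sigma, rng):
    """One draw: optimal mixing weight and optimal SD excess risk for isotropic ridge."""
    X = rng.standard_normal((n, p))
    beta_star = rng.standard_normal(p) / np.sqrt(p)         # ||beta_star|| is roughly 1
    y = X @ beta_star + sigma * rng.standard_normal(n)
    M = np.linalg.inv(X.T @ X / n + lam * np.eye(p))
    b = M @ (X.T @ y / n)                                   # teacher (per-sample convention)
    d = -lam * (M @ b)                                      # affine SD path: beta_sd(xi) = b + xi * d
    e = b - beta_star
    xi = -(d @ e) / (d @ d)                                 # optimal xi for Sigma = I
    e_sd = e + xi * d
    return xi, float(e_sd @ e_sd)

gamma, lam, sigma = 2.0, 1.0, 0.5                           # hypothetical aspect ratio, penalty, noise level
rng = np.random.default_rng(3)
for n in (100, 400, 1600):
    p = int(gamma * n)
    draws = np.array([optimal_sd(n, p, lam, sigma, rng) for _ in range(5)])
    print(f"n={n:5d}, p={p:5d}:  xi_star approx {draws[:, 0].mean():+.3f} (sd {draws[:, 0].std():.3f}),  "
          f"optimal SD risk approx {draws[:, 1].mean():.3f} (sd {draws[:, 1].std():.3f})")
```

The shrinking spread across draws as $n$ grows is consistent with the existence of deterministic limits, which Theorem 3.1 characterizes exactly in the general anisotropic case.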