Residual-as-Teacher: Mitigating Bias Propagation in Student--Teacher Estimation


Authors: Kakei Yamamoto, Martin J. Wainwright

Kakei Yamamoto† (kakei@mit.edu) and Martin J. Wainwright⋆ (mjwain@mit.edu)
Laboratory for Information and Decision Systems, Statistics and Data Science Center, EECS†,⋆ and Mathematics⋆
Massachusetts Institute of Technology
March 27, 2026

Abstract

We study statistical estimation in a student–teacher setting, where predictions from a pre-trained teacher are used to guide a student model. A standard approach is to train the student to directly match the teacher's outputs, which we refer to as student soft matching (SM). This approach directly propagates any systematic bias or mis-specification present in the teacher, thereby degrading the student's predictions. We propose and analyze an alternative scheme, known as residual-as-teacher (RaT), in which the teacher is used to estimate residuals in the student's predictions. Our analysis shows how the student can thereby emulate a proximal gradient scheme for solving an oracle optimization problem, and this provably reduces the effect of teacher bias. For general student–teacher pairs, we establish non-asymptotic excess risk bounds for any RaT fixed point, along with convergence guarantees for the student–teacher iterative scheme. For kernel-based student–teacher pairs, we prove a sharp separation: the RaT method achieves the minimax-optimal rate, while the SM method incurs constant prediction error for any sample size. Experiments on both synthetic data and ImageNette classification under covariate shift corroborate our theoretical findings.

1 Introduction

The student–teacher paradigm, in which one predictive model is used to guide the training of another, has become increasingly prevalent over the past decade. It has proven useful for a variety of problems, including semi-supervised learning, model compression and distillation, as well as adaptation to distribution shift.
The core idea in all of these settings is to use the predictions of a pre-trained teacher model, often relatively complex and/or opaque in nature, to train a student model that might be more computationally efficient, easier to interpret, or better behaved on a new covariate population. In broad terms, the idea of model distillation is relatively old, emerging alongside the rise of neural networks, boosting, and ensemble methods in the 1990s. It has attracted renewed interest over the past decade, fueled by the explosion in the complexity of prediction methods, most notably with deep neural networks. Complex models can generate high-quality predictions, but may lack interpretability, or require significant memory and computation to generate predictions. A natural idea, then, is to train a simpler model (the student) to mimic the predictions of the more complex model (the teacher). This form of student–teacher interaction has led to a substantial and evolving line of work (e.g., [BS96; BCNM06; BC14; HVD15; FH17; Fur+18; VSR20]). A notable early example is the "born-again tree" of Breiman and Shang [BS96], where a single decision tree is used to approximate the predictions of a tree ensemble; see also the contemporaneous work [CS96], as well as the follow-up papers [Fur+18; VSR20]. Bucila et al. [BCNM06] formalized the idea of model compression in more general settings, and studied how to generate synthetic covariates. For classification problems, there is a choice between having the student match the teacher's predicted class labels, known as "hard" information, versus "soft" information such as predicted probabilities or logits. For classification, Hinton et al. [HVD15] proposed minimizing the KL divergence to the teacher's probabilities, and this use of soft information has proven to be superior in general [FH17; Ara+20].
Student–teacher interactions are also used to address the problem of covariate shift: here a teacher trained on a source distribution is used to guide training on a (potentially different) target distribution. This approach appears in domain adaptation and semi-supervised learning, where teacher predictions provide auxiliary information to the student, especially relevant in regions of the covariate space lacking responses or labels. Finally, on a contemporary note, model distillation was an important component of training the DeepSeek large-language model [Dee25], and OpenAI alleges that ChatGPT was used as a teacher.

Soft matching and teacher biases: In standard student–teacher approaches, the student is trained to match the teacher's outputs, either by least-squares (for real-valued outputs), or cross-entropy loss (for probabilities arising from classification problems). We refer to direct imitation procedures of this type as student (soft) matching (SM). When the teacher's predictions are accurate, then direct imitation by the student can yield strong performance. In particular, as observed from the earliest work [BS96], a student trained on a teacher's outputs can outperform a student trained on the original data, since the teacher effectively denoises the response data. On the other hand, if the teacher's predictions are biased or systematically incorrect, then the SM approach propagates these errors to the student. As a result, even with abundant data, the student can inherit persistent prediction error from the teacher. This undesirable phenomenon has been referred to as confirmation bias, and the focus of this paper is new methodology and theory for tackling it. Many classes of complex predictive models are known to exhibit particular biases.
For example, tree-based methods [Bre+84; Bre01] and stump-based boosting methods [Bre98; Fri01; CG16] allow for step-function discontinuities, but with a strong preference for axis-aligned changes. On the other hand, local smoothing methods [Wat64; Nad64; FG96] are strongly biased against discontinuity, but without any axis preference. Regression procedures based on reproducing kernel Hilbert spaces [Gu02; Wah90; Wai19] use explicit penalization, which induces bias depending on the covariate distribution and kernel function. Shallow neural networks using rectified linear units (ReLUs) are biased towards functions with low oscillation, easily expressible as sums of piecewise linear functions [Tel16]. Deep neural networks, with many layers of hidden units, also exhibit various forms of inductive bias, notably a spectral bias towards lower frequency components of functions [Rah+19; Bas+20; XZX20]. Any such teacher biases can degrade the quality of the trained student, and various heuristics have been explored to mitigate this effect, including temperature scaling of the teacher's outputs [HVD15], confidence-based filtering [Ara+20], noise injection schemes [Xie+20; Ara+20], and weighting and ensembling methods [TV17]. All of these approaches seek to mitigate the effect of teacher biases, but do not alter the underlying imitation objective.

Residual-as-teacher: In this paper, we propose and analyze an alternative mechanism for leveraging teacher predictions that is designed to mitigate bias. Instead of applying the teacher to generate fresh responses for the student, we instead use the teacher to estimate the errors in the student's predictions. This change naturally leads to an iterative procedure, known as the residual-as-teacher (RaT) algorithm, in which the student model is successively refined by the teacher's feedback.
Although this residual feedback echoes boosting-style updates [FHT00; Fri01; BY03; ZY05], the RaT procedure does not construct a stagewise additive model; rather, our analysis shows that the RaT algorithm is a surrogate for a proximal update scheme over the student function class, with fixed points corresponding to the optima of an oracle penalized optimization problem (cf. equation (2)). Moreover, we show through a combination of theoretical analysis and empirical study that RaT estimates are superior to the SM approach in mitigating the teacher's biases.

As a preview of our results, Figure 1 illustrates the difference between student matching (SM) and the RaT approach for a simple one-dimensional regression problem based on the least-squares loss. In all cases, the student is a two-layer ReLU neural network with h = 128 hidden units, and is trained on the target data via least-squares regression. The teacher is fit on source data only, and Figure 1 shows results for three different teachers:

• gradient boosting with depth-2 trees, and 8 rounds of boosting;
• kernel ridge regression (KRR) using a Gaussian kernel function with bandwidth σ = 0.5 and regularization parameter γ = 3; and
• a two-layer ReLU neural network with 10 hidden units.

In all cases, these teacher classes exhibit significant forms of bias: the boosting scheme involves shallow trees with limited rounds; the KRR procedure has a fixed and overly large regularization parameter; and the small number of hidden units in the ReLU network induces bias, since the ReLU units are piecewise-linear. The SM procedure trains the student to match the teacher's predictions (see Section 3.2 for a precise description), whereas RaT iteratively fits the teacher to the student's residuals on source data, and uses these residual estimates to refine the student (see Section 2.2 for the exact procedure).
The resulting differences are quite clear in the figure: SM tends to inherit systematic teacher bias, while RaT progressively corrects it. The analysis of this paper provides a firm theoretical grounding for these empirical observations.

Our contributions: In this paper, we introduce the RaT procedure and analyze it in detail, including both the statistical properties of fixed points, and the computational properties of the iterative updates themselves. In Theorem 1, we prove non-asymptotic upper bounds on the excess risk of the RaT estimate; apart from leading to statistical consistency and rate guarantees, these bounds elucidate the way in which the teacher's bias enters the estimates. We also prove a related result (Proposition 1) that provides risk bounds for the student soft-matching (SM) estimator, and reveals that the teacher's bias enters in an entirely different way. This suggests that there should be a fundamental gap between the two procedures, and Theorem 2 formalizes this intuition. We consider student–teacher pairs based on kernel ridge regression (KRR) estimators, in which the teacher has a fixed bias whereas the student estimator can be tuned, and there is covariate shift, meaning that the teacher and student observe covariates from different distributions. By suitable tuning, despite the teacher bias and covariate shift, the RaT estimator is able to achieve the minimax-optimal risk, while the SM estimator has prediction error lower bounded by a universal constant.
In Theorem 3, we study the convergence properties of iterative schemes designed to compute the RaT fixed point. Finally, we complement our theoretical results with numerical studies on both synthetic data, and on covariate-shifted versions of the ImageNette dataset for image classification.

[Figure 1 appears here: a panel grid titled "RaT vs SM under Covariate Shift", with rows labeled No shift, Gaussian, Uniform, and Beta.]

Figure 1. Comparison of the RaT and SM estimates when the student class $\mathcal{F}$ is a two-layer neural network with 128 hidden units. Shown are results for four different covariate shifts (rows 1–4), and three different teacher models (columns 2–4). Each row corresponds to a different source–target distribution pair, shown in the leftmost column via their marginal densities. The remaining columns report results for three teacher classes: boosting, kernel ridge regression (KRR), and ReLU neural network fits. In each panel, the true regression function is shown as a dotted orange line, while the estimates obtained by SM and RaT are shown in red and blue, respectively. Gray points indicate source samples. Intermediate RaT iterates are shown as faint curves to illustrate the refinement process.

Paper organization: The remainder of this paper is organized as follows. We begin in Section 2 by setting up the student–teacher distillation more precisely, allowing for the possibility of covariate shift between student and teacher.
Section 3 is devoted to statements of our main theoretical results, including excess risk bounds (Theorem 1 in Section 3.1); comparison with student matching (Section 3.2); a separation result (Theorem 2 in Section 3.3); and algorithmic guarantees (Theorem 3 in Section 3.4). In Section 4, we provide numerical experiments that support our theory. Section 4.1 gives results on the separation between SM/RaT for synthetic problems, whereas we report results for a covariate-shifted version of the ImageNette dataset in Section 4.2. Proofs of our three main theorems are given in Section 5, with certain more technical calculations given in the appendices; we prove all the remaining propositions and corollaries in the appendices as well.

2 Problem and method formulation

In this section, we begin in Section 2.1 with a more precise specification of the student–teacher estimation problem of interest, before describing the residual-as-teacher estimator that we analyze in this paper (Section 2.2).

2.1 Target/source data and student/teacher pairs

We consider a general prediction problem, involving a covariate or feature vector $x \in \mathcal{X}$, and a response or output $y \in \mathcal{Y}$. When $y$ is real-valued, the prediction problem is a form of regression, whereas when $y$ takes discrete values, we are solving a classification problem. Our goal is to learn a mapping $f: \mathcal{X} \to \mathcal{Y}$ such that $f(x)$ is a "good predictor" of the response $y$. As usual, we formalize the quality of $f$ in terms of a scalar-valued loss function $\ell(f(x), y)$. Standard examples include the least-squares loss $\ell(f(x), y) = \tfrac{1}{2}\big(f(x) - y\big)^2$ for $y \in \mathbb{R}$, and the logistic regression loss $\ell(f(x), y) = \log\big(1 + e^{f(x)}\big) - y\, f(x)$ for binary labels $y \in \{0, 1\}$.
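As a concrete reference point, here is a minimal sketch (our own illustration, not the paper's code) of these two losses and their derivatives with respect to the prediction $z = f(x)$; this derivative is the quantity that the residual-as-teacher procedure later treats as a generalized residual.

```python
import numpy as np

def sq_loss(z, y):
    # Least-squares loss (1/2)(z - y)^2
    return 0.5 * (z - y) ** 2

def sq_grad(z, y):
    # d/dz of the least-squares loss: the familiar residual z - y
    return z - y

def logistic_loss(z, y):
    # Logistic loss log(1 + e^z) - y z for labels y in {0, 1}
    return np.log1p(np.exp(z)) - y * z

def logistic_grad(z, y):
    # d/dz of the logistic loss: sigmoid(z) - y
    return 1.0 / (1.0 + np.exp(-z)) - y
```

A finite-difference check of `logistic_grad` against `logistic_loss` confirms the derivative formula.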
At the core of the set-up are two pairs:

• the student–teacher pair of (potentially different) prediction classes $\mathcal{F}_{\mathrm{stud}} \equiv \mathcal{F}$ and $\mathcal{G}_{\mathrm{teach}} \equiv \mathcal{G}$, used to construct functions mapping covariates to predictions, and
• the target–source pair $(\mathbb{Q}_X, \mathbb{P}_X)$ of (potentially different) distributions over the covariates used by the student and teacher, respectively.

In the standard approach to covariate shift, there is no notion of a student–teacher pair, whereas in the typical student–teacher set-up, there is no covariate shift. The set-up described here allows for both possibilities. The source data consists of labeled pairs $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from an unknown distribution $\mathbb{P}_X \times \mathbb{P}_{Y \mid X}$ over the joint space $\mathcal{X} \times \mathcal{Y}$. Second, the target data consists of target covariates $\{\tilde{x}_j\}_{j=1}^{m}$ drawn according to $\mathbb{Q}_X$. Given these two datasets, the student operates over the function class $\mathcal{F}_{\mathrm{stud}}$, and has the goal of learning a function $f \in \mathcal{F}_{\mathrm{stud}}$ that performs well in predicting at the target covariates $\{\tilde{x}_j\}_{j=1}^{m}$. On the other hand, the teacher is defined by a function class $\mathcal{G}_{\mathrm{teach}}$, and implements empirical risk minimization (ERM) procedures based on the source data.

Smoothed target risk: For the bulk of this paper, we condition on the target covariates $\{\tilde{x}_j\}_{j=1}^{m}$, and we measure the student's performance in terms of the smoothed target risk, given by
\[
\bar{L}_m(f) = \sum_{j=1}^{m} \mathbb{E}_Y\big[\ell(f(\tilde{x}_j), Y) \mid X = \tilde{x}_j\big], \tag{1a}
\]
where we take expectations over each conditional distribution $Y \mid X = \tilde{x}_j$. When the samples $\tilde{x}_j$ are drawn i.i.d. from $\mathbb{Q}_X$, we observe that $(1/m)\,\bar{L}_m(f)$ is an unbiased estimate of the population quantity
\[
\bar{L}_{\mathbb{Q}}(f) := \mathbb{E}_{(X, Y) \sim \mathbb{Q}_X \times \mathbb{P}_{Y \mid X}}\big[\ell(f(X), Y)\big]. \tag{1b}
\]
Consequently, if we can bound $\bar{L}_m(\hat{f})$ for an estimate $\hat{f}$, then we can use standard empirical process theory (e.g., [Gee00; VW96; Wai19]) to obtain bounds on $\bar{L}_{\mathbb{Q}}(\hat{f})$.
For this reason, we focus primarily on $\bar{L}_m$ in our analysis.

Student oracle estimand: In many applications, the natural object to minimize is a constrained or regularized version of the smoothed target risk $\bar{L}_m$. In particular, let $\mathrm{Pen}: \mathcal{F} \to \mathbb{R}$ be some penalty function defined on the student class $\mathcal{F}$. We focus on approximating the best regularized estimate
\[
f^\dagger := \arg\min_{f \in \mathcal{F}} \big\{ \bar{L}_m(f) + \mathrm{Pen}(f) \big\}, \tag{2}
\]
which we refer to as the student oracle estimand. One special case of the penalty function $\mathrm{Pen}$ is a $\{0, \infty\}$-valued function encoding membership in some subset $\tilde{\mathcal{F}} \subset \mathcal{F}$ of the full student class. In this case, the student oracle estimand $f^\dagger$ corresponds to a projection operation, under the smoothed target risk $\bar{L}_m$, onto the constraint set $\tilde{\mathcal{F}}$. As a simple but concrete example, consider functions that are linear in some feature map $x \mapsto (\phi_1(x), \ldots, \phi_k(x)) \in \mathbb{R}^k$, say of the form $f_\theta(x) = \sum_{a=1}^{k} \theta_a \phi_a(x)$ for some weight vector $\theta \in \mathbb{R}^k$. In this case, the full student class takes the form $\mathcal{F} := \{ f_\theta \mid \theta \in \mathbb{R}^k \}$, and the penalty function could define projection onto the sparse sub-class $\tilde{\mathcal{F}} := \{ f_\theta \in \mathcal{F} \mid \|\theta\|_1 \le R \}$ for some radius $R$. This formulation is useful when the goal is to discover student functions that are more interpretable, in that they use some highly relevant subset of all possible features. Our general formulation (2) encompasses this particular case, as well as a wide range of other examples.

2.2 Residual-as-teacher

Having defined the target estimand $f^\dagger$, let us now describe the RaT estimator analyzed in this paper. The estimate is defined as a fixed point of an operator that can be decomposed into two steps: (i) a proximal update implemented by the student, and (ii) estimation of residuals by the teacher.
We describe each of these in turn, and then define the notion of a RaT fixed point, as well as a natural procedure (Picard iteration) for computing a fixed point.

Student proximal update and oracle consistency: We begin with the proximal update associated with the student. For a stepsize $\eta > 0$ and vector $u \in \mathbb{R}^m$, we define the proximal mapping $u \mapsto \mathrm{Prox}_\eta(u) \in \mathcal{F}$ via
\[
\mathrm{Prox}_\eta(u) := \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{2\eta} \big\| u - f(\tilde{x}_1^m) \big\|_2^2 + \mathrm{Pen}(f) \Big\}, \quad \text{where } f(\tilde{x}_1^m) := \big(f(\tilde{x}_1), \ldots, f(\tilde{x}_m)\big) \in \mathbb{R}^m. \tag{3}
\]
An important fact is that this update operator is directly related to the smoothed target risk (1a), and the student oracle estimand (2), as we now describe. In particular, define the functional gradient vector $\nabla \bar{L}_m(f) \in \mathbb{R}^m$ with components
\[
\big[\nabla \bar{L}_m(f)\big]_j := \mathbb{E}_Y\Big[ \tfrac{\partial}{\partial z}\,\ell(z, Y)\big|_{z = f(\tilde{x}_j)} \,\Big|\, X = \tilde{x}_j \Big]. \tag{4}
\]
Under convexity conditions, for any $\eta > 0$, the student oracle estimand $f^\dagger$ defined in equation (2) satisfies the fixed point relation:
\[
\text{Oracle self-consistency:} \qquad f^\dagger = \underbrace{\mathrm{Prox}_\eta\big( f^\dagger(\tilde{x}_1^m) - \eta \nabla \bar{L}_m(f^\dagger) \big)}_{\equiv\, \mathcal{H}_\eta(f^\dagger)}. \tag{5}
\]
This consistency condition follows by analysis of a proximal gradient scheme for computing the minimizer (2); see Section A for the details. The fixed point relation (5), while of conceptual value, is not practically useful, since the oracle functional gradient $\nabla \bar{L}_m(f)$ cannot be computed: it depends on expectations over the unknown conditional distribution $Y \mid X = \tilde{x}_j$. However, this gradient can be approximated by applying the teacher class to the student's empirical residuals, as we now describe.

Estimation of residuals: Given a student function $f \in \mathcal{F}$, the empirical residual associated with the source sample $(x_i, y_i)$ is given by
\[
\underbrace{e\big(f(x_i), y_i\big)}_{\equiv\, e_i} := \tfrac{\partial}{\partial z}\,\ell(z, y_i)\big|_{z = f(x_i)} \quad \text{for } i = 1, 2, \ldots, n. \tag{6a}
\]
Using the residuals $\{e_i\}_{i=1}^{n}$ as response targets, we solve the least-squares regression
\[
\hat{g} \in \arg\min_{g \in \mathcal{G}} \Big\{ \sum_{i=1}^{n} \big(e_i - g(x_i)\big)^2 \Big\}, \tag{6b}
\]
thereby obtaining the best-fitting teacher $\hat{g} \in \mathcal{G}$ to the student's residuals. Evaluating this estimate over the target samples $\{\tilde{x}_j\}_{j=1}^{m}$ yields the $m$-vector
\[
\big[\hat{G}(f)\big]_j := \hat{g}(\tilde{x}_j) \quad \text{for } j = 1, \ldots, m. \tag{6c}
\]

Residual-as-teacher estimate: We have defined two operators: the student proximal update (3), and the mapping (6c) from student residuals to $\hat{G}(f)$. The composition of these two steps defines an operator on the student function space $\mathcal{F}$, and for a stepsize $\eta > 0$, we define the RaT estimate in terms of the fixed point relation:
\[
\text{RaT self-consistency:} \qquad \hat{f}_{\mathrm{RaT}} = \underbrace{\mathrm{Prox}_\eta\big( \hat{f}_{\mathrm{RaT}}(\tilde{x}_1^m) - \eta\, \hat{G}(\hat{f}_{\mathrm{RaT}}) \big)}_{\equiv\, \hat{\mathcal{H}}_\eta(\hat{f}_{\mathrm{RaT}})}. \tag{7}
\]
As discussed in Section A, this definition is sensible in that fixed points exist under relatively mild conditions, and moreover, the set of fixed points does not depend on the stepsize $\eta > 0$. In particular, if a function $\hat{f}_{\mathrm{RaT}}$ satisfies the fixed point relation (7) for some stepsize $\eta > 0$, then the fixed point relation holds for any stepsize $\tilde{\eta} > 0$. See Section A.2 for details. In certain cases, the RaT fixed point is unique, and can be computed in closed form; see the discussion in Section 3.3 for a broad class of examples based on kernel student–teacher pairs. In the general setting, the fixed point needs to be computed by iterating between the student–teacher updates. The simplest such scheme is given by the following form of Picard iteration. Given an initial choice $f^0 \in \mathcal{F}$ of student function, it generates a sequence of student functions $\{f^k\}_{k \ge 0}$ as follows:

Proximal RaT algorithm: Given stepsize $\eta > 0$ and initial student function $f^0$, repeat for $k = 0, 1, 2, \ldots, K$:

(1) Train the teacher to predict residuals via equation (6b), and evaluate $\hat{G}(f^k) \in \mathbb{R}^m$ via equation (6c).
(2) Perform the approximate proximal gradient update:
\[
f^{k+1} = \mathrm{Prox}_\eta\big( f^k(\tilde{x}_1^m) - \eta\, \hat{G}(f^k) \big), \quad \text{where } f^k(\tilde{x}_1^m) \equiv \big(f^k(\tilde{x}_1), \ldots, f^k(\tilde{x}_m)\big). \tag{8}
\]

We analyze the convergence of these updates, including the choice of stepsize, in Section 3.4; in particular, see Theorem 3.

Connection to boosting: The RaT updates are related to boosting procedures [FHT00; Fri01], because both rely on regression to estimate the residuals of the current fit. However, the goals of RaT and boosting are fundamentally different: the RaT procedure seeks to approximate the fixed point $f^\dagger$, the best penalized fit within the student class, whereas boosting is a stage-wise procedure for generating a sequence of additive fits. In a boosting algorithm, the effective function class grows with the number of iterations, so that its statistical behavior depends critically on controlling this growth, typically via early stopping (e.g., [BY03; ZY05; RWY14; WYW19]). In contrast, at each round, the RaT procedure (8) applies a proximal update that projects each iterate back into the fixed student class $\mathcal{F}$. Consequently, all iterates remain within $\mathcal{F}$, so that the function complexity remains fixed. Moreover, the RaT estimate $\hat{f}_{\mathrm{RaT}}$ itself can be characterized as a fixed point (7), independent of any algorithm used to obtain it. Overall, RaT is not a sequential procedure for additive modeling, but instead a teacher-aided way to approximate a proximal fixed point over the fixed class $\mathcal{F}$. As our analysis shows, its statistical performance is governed by the accuracy of the teacher-induced gradient combined with the complexity of the student class.

3 Main results

Having set up the RaT procedure, we are now ready to state our main results.
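Before turning to the statements, we record a minimal end-to-end sketch of the proximal RaT iteration (8) for the least-squares loss. The model choices here (a linear student with a ridge penalty, and a kernel ridge smoother as the teacher) and all names and parameter values are our illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def gauss_kernel(a, b, bw=0.2):
    # Gaussian kernel matrix between 1-d covariate vectors a and b
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bw ** 2))

def rat_least_squares(x_src, y_src, x_tgt, Phi_src, Phi_tgt,
                      eta=1.0, lam=1e-3, ridge=1e-2, n_iters=50):
    """Picard iteration (8) for the least-squares loss.

    Student: linear in features Phi with Pen(f_theta) = (lam/2)||theta||^2,
    so the proximal update (3) reduces to a ridge solve.  Teacher: a kernel
    ridge smoother fit to the student's source residuals f(x_i) - y_i
    (step (6b)), evaluated at the target covariates (step (6c))."""
    n = len(x_src)
    K_ss = gauss_kernel(x_src, x_src)
    K_ts = gauss_kernel(x_tgt, x_src)
    # For a fixed design, kernel ridge regression is a linear smoother T
    # mapping residuals on the source to gradient estimates on the target.
    T = K_ts @ np.linalg.solve(K_ss + ridge * n * np.eye(n), np.eye(n))
    k = Phi_tgt.shape[1]
    # Precompute the ridge proximal map Prox_eta in theta-coordinates.
    prox_mat = np.linalg.solve(Phi_tgt.T @ Phi_tgt + eta * lam * np.eye(k),
                               Phi_tgt.T)
    theta = np.zeros(k)
    for _ in range(n_iters):
        resid_src = Phi_src @ theta - y_src   # empirical residuals (6a)
        g_hat = T @ resid_src                 # teacher gradient estimate (6c)
        u = Phi_tgt @ theta - eta * g_hat     # proximal-gradient argument
        theta = prox_mat @ u                  # proximal update (8)
    return theta
```

On a simple noiseless linear example, the iteration drives the student's coefficients close to the true regression coefficients, as the fixed point analysis of Section 3 suggests.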
We begin in Section 3.1 by discussing the statistical properties of the RaT estimate (7), before comparing and contrasting it with the student soft-matching (SM) procedure in Section 3.2. Section 3.3 analyzes the SM and RaT methods in detail for kernel-based student–teacher pairs, and reveals a significant performance gap. Finally, in Section 3.4, we give computational guarantees on the proximal RaT algorithm.

3.1 Risk bounds relative to oracle

In this section, we derive bounds on the excess risk and the estimation error of any RaT fixed point $\hat{f}_{\mathrm{RaT}}$, as defined by the condition (7), relative to the oracle minimizer $f^\dagger$, as defined by equation (2). In stating these results (and throughout the paper), for a real-valued function $f: \mathcal{X} \to \mathbb{R}$ on the covariate space, and a vector $u \in \mathbb{R}^m$, we adopt the shorthand notation
\[
\langle f, u \rangle_m := \frac{1}{m} \sum_{j=1}^{m} f(\tilde{x}_j)\, u_j \quad \text{and} \quad \|f\|_m^2 := \frac{1}{m} \sum_{j=1}^{m} f^2(\tilde{x}_j).
\]

Theorem 1. Suppose that the oracle risk function $\bar{R}(f) = \frac{1}{m}\big\{ \bar{L}_m(f) + \mathrm{Pen}(f) \big\}$ is convex in the fitted values, and let $f^\dagger$ be a minimizer. Then for any RaT fixed point $\hat{f}_{\mathrm{RaT}}$:

(a) The RaT excess risk is bounded as
\[
\bar{R}(\hat{f}_{\mathrm{RaT}}) - \bar{R}(f^\dagger) \le \big\langle \hat{f}_{\mathrm{RaT}} - f^\dagger,\; \nabla \bar{L}_m(\hat{f}_{\mathrm{RaT}}) - \hat{G}(\hat{f}_{\mathrm{RaT}}) \big\rangle_m. \tag{9a}
\]

(b) When $\bar{L}_m$ is $\mu$-strongly convex, any RaT estimate satisfies the bound
\[
\|\hat{f}_{\mathrm{RaT}} - f^\dagger\|_m^2 \le \frac{1}{\mu} \big\langle \hat{f}_{\mathrm{RaT}} - f^\dagger,\; \nabla \bar{L}_m(\hat{f}_{\mathrm{RaT}}) - \hat{G}(\hat{f}_{\mathrm{RaT}}) \big\rangle_m. \tag{9b}
\]

See Section 5.1 for the proof of this claim.

Recall that both the oracle estimate $f^\dagger$ and the RaT fixed point $\hat{f}_{\mathrm{RaT}}$ are defined by fixed point relations, namely equations (5) and (7), respectively. With this context, Theorem 1 can be understood as a stability result for proximal fixed points: it shows that the statistical accuracy of any RaT solution can be controlled via the accuracy of the teacher-induced gradient operator at that solution.
In particular, it reduces the statistical analysis of RaT to bounding the gradient estimation error, without requiring analysis of any iterative procedure used to compute the fixed point. Notably, the bound depends only on the gradient error evaluated at the fixed point, rather than uniformly over the function class; this fact is critical for obtaining sharp rates in later sections.

Specialization to least-squares: As an important special case, Theorem 1 has implications for the least-squares loss $\ell(f(x), y) = \tfrac{1}{2}\big(f(x) - y\big)^2$. It is $\mu$-strongly convex with $\mu = 1$, so that the bound (9b) is in force. In this case, the smoothed functional gradient at $f(x)$ is given by $f(x) - \mathbb{E}[Y \mid X = x]$, so that the oracle gradient has components
\[
\big[\nabla \bar{L}_m(\hat{f}_{\mathrm{RaT}})\big]_j = \hat{f}_{\mathrm{RaT}}(\tilde{x}_j) - f^*(\tilde{x}_j), \quad \text{where } f^*(x) = \mathbb{E}[Y \mid X = x].
\]
The teacher $\hat{G}(\hat{f}_{\mathrm{RaT}})$ is trained via regression of the source residuals $\{e_i\}_{i=1}^{n}$ onto the source covariates $\{x_i\}_{i=1}^{n}$. For the least-squares loss, the source residuals are given by $e_i = \hat{f}_{\mathrm{RaT}}(x_i) - y_i$ for $i = 1, \ldots, n$.

A very crude form of consistency can be obtained by applying the Cauchy–Schwarz inequality to the bound (9b): since $\langle \hat{f}_{\mathrm{RaT}} - f^\dagger, u \rangle_m \le \|\hat{f}_{\mathrm{RaT}} - f^\dagger\|_m \|u\|_m$ for any $u$, dividing both sides of (9b) by $\|\hat{f}_{\mathrm{RaT}} - f^\dagger\|_m$ and squaring (with $\mu = 1$) yields
\[
\|\hat{f}_{\mathrm{RaT}} - f^\dagger\|_m^2 \le \big\| \nabla \bar{L}_m(\hat{f}_{\mathrm{RaT}}) - \hat{G}(\hat{f}_{\mathrm{RaT}}) \big\|_m^2.
\]
Thus, we see that the procedure will be consistent if we can ensure that the teacher-based surrogate gradient $\hat{G}(\hat{f}_{\mathrm{RaT}})$ is uniformly consistent as an estimate of the true gradient $\nabla \bar{L}_m(\hat{f}_{\mathrm{RaT}})$. (To be clear, this Cauchy–Schwarz argument is very loose in general, and our later analysis provides far sharper guarantees.)

Bounds for approximate fixed points: Our theory can also be extended to give guarantees for a function that satisfies the RaT fixed point relation in an approximate sense.
Define the RaT operator
\[
\hat{\mathcal{H}}_\eta(f) := \mathrm{Prox}_\eta\big( f(\tilde{x}_1^m) - \eta\, \hat{G}(f) \big), \tag{10a}
\]
along with the proximal operator defect
\[
\hat{D}_\eta(f) := \hat{\mathcal{H}}_\eta(f) - f, \tag{10b}
\]
which measures the deficiency of a function $f$ as a RaT fixed point (cf. equation (7)). For any function $f^k$, define the update $f^{k+1} := \hat{\mathcal{H}}_\eta(f^k)$. We then have the following extension of Theorem 1, bounding the excess risk of $f^{k+1}$. Under the assumptions of part (a), we have
\[
\bar{R}(f^{k+1}) - \bar{R}(f^\dagger) \le \Big\langle f^{k+1} - f^\dagger,\; \nabla \bar{L}_m(f^{k+1}) - \hat{G}(f^k) - \tfrac{1}{\eta}\, \hat{D}_\eta(f^k) \Big\rangle_m, \tag{11a}
\]
whereas under the assumptions of part (b), we have
\[
\|f^{k+1} - f^\dagger\|_m^2 \le \frac{1}{\mu} \Big\langle f^{k+1} - f^\dagger,\; \nabla \bar{L}_m(f^{k+1}) - \hat{G}(f^k) - \tfrac{1}{\eta}\, \hat{D}_\eta(f^k) \Big\rangle_m. \tag{11b}
\]
We establish these more general claims as part of the proof of Theorem 1 in Section 5.1. When $f^k = \hat{f}_{\mathrm{RaT}}$, then we have $f^k = f^{k+1}$ and $\hat{D}_\eta(f^k) = 0$, so that with this particular choice, the bounds (11a) and (11b) imply the bounds (9a) and (9b), respectively.

Relaxed convexity requirements: Let us clarify how the convexity assumptions can be weakened. First, assuming that $\bar{L}_m$ is convex in the fitted values $f(x)$ is relatively mild; it holds for many standard losses (e.g., least-squares, logistic regression, log loss, etc.). In assuming that $\bar{R}$ is convex, we are requiring, in addition, that the penalty function and the set of fitted values $f(x)$ obtained as $f$ varies over the student class $\mathcal{F}$ are convex. This latter requirement is more stringent. From the proof, however, the bound (9a) only requires convexity along the line joining $\hat{f}_{\mathrm{RaT}}$ and $f^\dagger$, meaning a directional and localized form of convexity. Meanwhile, the bound (9b) exploits a non-expansivity property of the proximal operator $\mathrm{Prox}_\eta$, holding locally between $\hat{f}_{\mathrm{RaT}}$ and $f^\dagger$; see equation (39) for the precise definition.
Extensions to nonconvex settings may be possible under prox-regularity assumptions, which ensure local single-valuedness and Lipschitz continuity of the proximal mapping [RP96; RW98]. Last, we have many empirical results based on potentially non-convex teachers, including various types of neural nets (see Figure 1 and Figure 5 for some examples). These numerical studies suggest that the RaT estimator can be well-behaved in a broader setting than the theoretical set-up given here.

3.2 Some comparisons with student soft-matching

As noted in the introduction, in the standard approach to student–teacher estimation, the student is trained to directly approximate the predicted outputs of the teacher. Here we describe this direct matching approach more precisely, and then prove some guarantees for it that reveal key differences compared to the RaT estimate. We begin by describing the SM approach for a least-squares regression problem with real-valued responses $y \in \mathbb{R}$, where it is simplest to describe. We then give the extension to classification problems involving logits.

Student soft-matching (SM) for regression: For a scalar regression problem, the SM approach uses the teacher to construct a pseudo-response $\hat{y}_j \in \mathbb{R}$ for each target covariate $\tilde{x}_j$. These pseudo-responses are used to construct the synthetic dataset $\{(\tilde{x}_j, \hat{y}_j)\}_{j=1}^{m}$, and the empirical objective
\[
L_m^{\mathrm{SM}}(f) := \sum_{j=1}^{m} \ell\big(f(\tilde{x}_j), \hat{y}_j\big), \tag{12a}
\]
which is meant to act as a surrogate to the smoothed target risk $\bar{L}_m$ from equation (1a). Given our goal of estimating the oracle estimand $f^\dagger$ from equation (2), a natural SM estimator, and the one that we analyze, is given by
\[
\hat{f}_{\mathrm{SM}} := \arg\min_{f \in \mathcal{F}} \big\{ L_m^{\mathrm{SM}}(f) + \mathrm{Pen}(f) \big\}. \tag{12b}
\]

SM for classification: Now let us describe the extension of this approach to a $K$-ary classification problem.
Suppose that we use the standard "one-hot" encoding of the $K$ classes, meaning that class labels are vectors $y \in \mathbb{R}^K$, taking values in the set $\mathcal{L} = \{e_1, \ldots, e_K\}$ of standard basis vectors.¹ In this case, for each target sample $j = 1, \ldots, m$, the "hard" matching approach uses the teacher to construct synthetic one-hot label vectors $\hat{y}_j \in \mathcal{L}$, and then trains the student to match these labels. The "soft" matching approach, which has proven to be better behaved in practice, is to use the teacher to predict either the logits or the probabilities of the class labels. In the latter case, the vector $\hat{y}_j$ belongs to the $K$-dimensional probability simplex, and the student is trained to minimize the Kullback–Leibler divergence to these predictions. This leads to an instance of the SM empirical loss (12a), where

    $\ell\big( f(\tilde{x}_j), \hat{y}_j \big) := \sum_{a=1}^K \hat{y}_{ja} \log\Big( \frac{\hat{y}_{ja}}{f_a(\tilde{x}_j)} \Big)$    (13)

defines the KL divergence between the teacher probabilities and the student predicted probabilities $f(\tilde{x}_j) \in \mathbb{R}^K$.

¹ In particular, the vector $e_\ell \in \mathbb{R}^K$ is zero in all positions except the $\ell$th, where it takes the value one.

3.2.1 A general bound for the SM estimate

We now state a guarantee for the SM estimate $\hat{f}_{\mathrm{SM}}$ defined in equation (12b). It involves the SM gradient vector $\nabla L^{\mathrm{SM}}_m(f) \in \mathbb{R}^m$, with entries given by

    $[\nabla L^{\mathrm{SM}}_m(f)]_j = \frac{\partial}{\partial z} \ell(z, \hat{y}_j) \big|_{z = f(\tilde{x}_j)}$  for $j = 1, \ldots, m$.    (14)

More precisely, we have

Proposition 1. Under the conditions of Theorem 1, the SM estimate $\hat{f}_{\mathrm{SM}}$ based on teacher predictions $\hat{y}$ satisfies the bounds

    $\bar{R}(\hat{f}_{\mathrm{SM}}) - \bar{R}(f^\dagger) \le \big\langle \hat{f}_{\mathrm{SM}} - f^\dagger,\; \nabla \bar{L}_m(\hat{f}_{\mathrm{SM}}) - \nabla L^{\mathrm{SM}}_m(\hat{f}_{\mathrm{SM}}) \big\rangle_m$, and    (15a)

    $\|\hat{f}_{\mathrm{SM}} - f^\dagger\|_m^2 \le \tfrac{1}{\mu} \big\langle \hat{f}_{\mathrm{SM}} - f^\dagger,\; \nabla \bar{L}_m(\hat{f}_{\mathrm{SM}}) - \nabla L^{\mathrm{SM}}_m(\hat{f}_{\mathrm{SM}}) \big\rangle_m$.    (15b)

This claim follows by arguments similar to those used to prove Theorem 1; see Section B.1 for details.
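As a concrete illustration of the soft-matching loss (13), here is a minimal NumPy sketch; the function name and toy probabilities are our own, not the authors' code:

```python
import numpy as np

def kl_soft_matching_loss(student_probs, teacher_probs, eps=1e-12):
    """Per-sample soft-matching loss of equation (13): the KL divergence
    sum_a y_hat[a] * log(y_hat[a] / f_a(x_tilde)) between teacher and
    student class probabilities."""
    p = np.clip(teacher_probs, eps, 1.0)  # teacher pseudo-labels y_hat_j
    q = np.clip(student_probs, eps, 1.0)  # student predictions f(x_tilde_j)
    return np.sum(p * np.log(p / q), axis=-1)

# Toy check with K = 3 classes: the loss vanishes when the student matches
# the teacher, and is strictly positive otherwise.
teacher = np.array([[0.7, 0.2, 0.1]])
loss_match = kl_soft_matching_loss(teacher, teacher)
loss_off = kl_soft_matching_loss(np.array([[0.1, 0.2, 0.7]]), teacher)
```

The clipping by `eps` is a standard numerical safeguard against zero probabilities.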
Observe that the SM bound (15b) and the RaT bound (9b) have a parallel structure, but use two different estimates of the true gradient $\nabla \bar{L}_m(f)$, namely $\nabla L^{\mathrm{SM}}_m(f)$ and $\hat{G}(f)$, respectively. To gain intuition, it is useful to compute these two quantities for the least-squares loss given by $\ell(f(x_i), y_i) = \tfrac{1}{2}\big( f(x_i) - y_i \big)^2$, for which the residual or functional gradient is $f(x_i) - y_i$. For both methods, we can think of the teacher as a function $T: \mathbb{R}^n \to \mathbb{R}^m$ that maps an $n$-vector from the source domain to an $m$-vector in the target domain. Given a vector $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ of source responses, the SM method uses the teacher to compute a vector $\hat{y} = T(y)$ of pseudo-responses. Consequently, the bounds of Proposition 1 will apply in terms of the SM gradient $\nabla L^{\mathrm{SM}}_m(f) \in \mathbb{R}^m$ with elements

    $[\nabla L^{\mathrm{SM}}_m(f)]_j = f(\tilde{x}_j) - [T(y)]_j$.    (16a)

On the other hand, the RaT estimate uses the teacher in a rather different way. With the least-squares loss, the residual associated with the pair $(x_i, y_i)$ is given by $f(x_i) - y_i$. Thus, the RaT procedure uses the teacher to compute the gradient estimate $\hat{G}(f) \in \mathbb{R}^m$, with entries given by

    $[\hat{G}(f)]_j = \big[ T\big( f(x_1^n) - y \big) \big]_j$,  where $f(x_1^n) = (f(x_1), \ldots, f(x_n))$.    (16b)

These two expressions make clear the difference: the SM procedure applies the teacher to the responses $y \in \mathbb{R}^n$, whereas the RaT procedure applies the teacher to the residual vector $f(x_1^n) - y$.

3.2.2 Explicit bias–variance decomposition

This difference in how the teacher is applied, viz. equations (16a) versus (16b), has important consequences for the biases of the resulting estimates. Let us now state a result that reveals these differences more clearly.
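Before turning to that result, the contrast between (16a) and (16b) can be made concrete with a small NumPy sketch; the teacher here is a random response-linear stand-in, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 40

# A response-linear teacher represented by a fixed matrix T in R^{m x n}
# (a random smoother here; in the paper T could be, e.g., a KRR smoother).
T = rng.normal(size=(m, n)) / n

y = rng.normal(size=n)        # source responses y_1, ..., y_n
f_src = rng.normal(size=n)    # student fitted values f(x_1^n) at source points
f_tgt = rng.normal(size=m)    # student fitted values f(x_tilde_1^m) at target points

# SM gradient (16a): apply the teacher to the responses y.
grad_sm = f_tgt - T @ y

# RaT gradient estimate (16b): apply the teacher to the source residuals.
grad_rat = T @ (f_src - y)
```

The two constructions differ only in what the teacher is applied to: the responses for SM, versus the student's source residuals for RaT.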
For a regression problem with responses $Y$, we can write $\mathbb{E}[Y \mid X = x] = f^\dagger(x) + g^*(x)$, where $f^\dagger$ is the student oracle estimand (5), and the term $g^*$, when non-zero, captures mis-specification in the student model. With this decomposition, we can write each source observation $y_i \in \mathbb{R}$ in the form

    $y_i = f^\dagger(x_i) + g^*(x_i) + w_i$,  for $i = 1, \ldots, n$,    (17)

where each $w_i$ is a conditionally zero-mean noise variable.

To decompose the teacher into bias and variance components, for any function $h$, we define $\bar{T}(h) := \mathbb{E}_w T(h + w)$, where the expectation is taken over the random noise vector $w = (w_1, \ldots, w_n)$. We then define two bias terms

    $B^2_{\mathrm{SM}}(f^\dagger + g^*) := \big\| \bar{T}\big(f^\dagger(x_1^n) + g^*(x_1^n)\big) - (f^\dagger + g^*) \big\|_m^2$, and    (18a)
    $B^2_{\mathrm{RaT}}(g^*) := \big\| \bar{T}\big(\Delta(x_1^n) - g^*(x_1^n)\big) - (\Delta - g^*) \big\|_m^2$,  where $\Delta := \hat{f}_{\mathrm{RaT}} - f^\dagger$.    (18b)

In addition, we define the zero-mean random noise vectors

    $v_{\mathrm{SM}} := T\big( f^\dagger(x_1^n) + g^*(x_1^n) + w \big) - \bar{T}(f^\dagger + g^*)$, and    (19a)
    $v_{\mathrm{RaT}} := T\big( \Delta(x_1^n) - g^*(x_1^n) - w \big) - \bar{T}(\Delta - g^*)$.    (19b)

Given these definitions, we have the following guarantee:

Corollary 1. For the least-squares loss, the SM estimate satisfies the bound

    $\|\hat{f}_{\mathrm{SM}} - f^\dagger\|_m^2 \le 2 \big\langle \hat{f}_{\mathrm{SM}} - f^\dagger, v_{\mathrm{SM}} \big\rangle_m + B^2_{\mathrm{SM}}(f^\dagger + g^*)$,    (20a)

whereas the RaT estimate satisfies the bound

    $\|\hat{f}_{\mathrm{RaT}} - f^\dagger\|_m^2 \le 2 \big\langle \hat{f}_{\mathrm{RaT}} - f^\dagger, v_{\mathrm{RaT}} \big\rangle_m + B^2_{\mathrm{RaT}}(g^*)$.    (20b)

See Section B.2 for the proof. Corollary 1 highlights how the SM estimator involves a bias component that depends on the full function $f^* = f^\dagger + g^*$, whereas the bias in the RaT bound involves only the mis-specification component $g^*$. This difference has important implications for the (in)consistency of the SM procedure. Notably, in Section 3.3, we exhibit broad classes of problems for which the RaT estimate is consistent, whereas the SM estimate remains inconsistent.
The various terms in Corollary 1 take a simpler form when the teacher procedure is response-linear, meaning that its behavior can be represented by a matrix $T \in \mathbb{R}^{m \times n}$ that depends only on the covariates $\{x_i\}_{i=1}^n$ and $\{\tilde{x}_j\}_{j=1}^m$. Various non-parametric smoothers, among them kernel ridge regression (KRR) [Gu02; Wah90; Wai19], Nadaraya–Watson smoothing [Wat64; Nad64], and local polynomial regression methods [FG96], have this response-linear form. As a concrete example, consider the teacher that performs kernel ridge regression using a pre-specified kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and a given regularization parameter $\lambda > 0$. In this case, the teacher's mapping from source to target space is described by the $m \times n$ matrix

    $T_\lambda := K_{mn} \big( K_{nn} + n\lambda I_n \big)^{-1}$,    (21)

where the $n \times n$ source kernel matrix $K_{nn}$ has $(i, j)$th entry $K(x_i, x_j)$, whereas the $m \times n$ target–source kernel matrix $K_{mn}$ has $(j, i)$th entry $K(\tilde{x}_j, x_i)$. For any response-linear teacher, with KRR being one example, a straightforward calculation shows that we have the equivalence

    $v_{\mathrm{SM}} \equiv v_{\mathrm{RaT}} = T(w) \in \mathbb{R}^m$.    (22a)

Consequently, for an estimate $\hat{f} \in \{\hat{f}_{\mathrm{SM}}, \hat{f}_{\mathrm{RaT}}\}$, we have the upper bound

    $\|\hat{f} - f^\dagger\|_m^2 \le \underbrace{2 \big\langle \hat{f} - f^\dagger, T(w) \big\rangle_m}_{\text{stochastic error}} + B^2$,    (22b)

where the squared-bias term $B^2$, given by $B^2_{\mathrm{SM}}(f^\dagger + g^*)$ and $B^2_{\mathrm{RaT}}(g^*)$ respectively, determines the main difference between the two estimators. In Section D, we discuss the form that the stochastic error term in equation (22b) takes for different response-linear teachers.

3.3 Exact MSEs and SM–RaT separation

In our analysis thus far, we have studied the gap between an estimate $\hat{f}$, either the RaT estimate $\hat{f}_{\mathrm{RaT}}$ or the SM estimate $\hat{f}_{\mathrm{SM}}$, and the oracle student estimand $f^\dagger$.
In practice, one actually has a family of such oracle estimands, say of the form

    $(f^\dagger)_\gamma = \arg\min_{f \in \mathcal{F}} \big\{ \bar{L}_m(f) + \gamma \mathrm{Pen}(f) \big\}$  for some regularization parameter $\gamma > 0$.    (23)

Note that $(f^\dagger)_\gamma$ is, at least in general, different from the ground-truth function $f^*$, due to the effect of the student regularization term $\gamma \mathrm{Pen}(f)$; the difference $(f^\dagger)_\gamma - f^*$ is the approximation error. In practice, given an estimator $\hat{f}$, one adjusts the choice of $\gamma$ so as to trade off the approximation error against the statistical error. In this section, we study this issue when the student class consists of a family of kernel ridge regression (KRR) estimators; here the student regularization parameter $\gamma > 0$ corresponds to the weight on the squared Hilbert norm that serves as the regularizer.

Our main result is to establish that a biased teacher leads to a fundamental separation between the performance of the RaT and SM estimators. More precisely, we exhibit a simple class of problems for which the RaT procedure, with a suitable choice of regularization parameter $\gamma$, provides a consistent estimate of $f^*$ as the source sample size $n$ grows. This consistency is guaranteed even when the teacher introduces a constant level of bias $\lambda > 0$, independent of the sample size $n$. On the other hand, the SM procedure, even when the student regularization parameter $\gamma > 0$ is chosen arbitrarily, cannot correct the teacher bias. Underlying this result are exact expressions for the mean-squared errors of the RaT and SM estimates using student KRR estimators (see Proposition 2 in Section 3.3.1). We then use these expressions to establish our separation result (Theorem 2) in Section 3.3.2.

Set-up: We begin with the problem set-up studied in this section. We make $n$ i.i.d. source observations of the form

    $y_i = f^*(x_i) + w_i$,  for $i = 1, \ldots, n$,    (24)

where the noise terms $w_i$ are zero-mean with $\mathrm{var}(w_i) = \sigma^2$.
Our goal is to estimate the unknown function $f^*$ via the teacher–student interaction, and we use the least-squares loss for this regression problem. We assume that the student class is defined by a kernel function $\tilde{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, along with the associated reproducing kernel Hilbert space $\mathcal{H}$. For a regularization parameter $\gamma > 0$, we consider the $\gamma$-regularized ensemble of student proximal updates with $\mathrm{Pen}(f) = \tfrac{\gamma m}{2} \|f\|_{\mathcal{H}}^2$. Thus, we have a family $\{\hat{f}^\gamma_{\mathrm{RaT}}\}_{\gamma > 0}$ of RaT estimates, indexed by the choice of student regularization parameter $\gamma > 0$. Similarly, we have an ensemble $\{\hat{f}^\gamma_{\mathrm{SM}}\}_{\gamma > 0}$ of SM estimates. Note that the oracle student function $(f^\dagger)_\gamma$, defined via equation (23) with this penalty, is not equal to $f^*$, due to the bias introduced by the penalty.

Suppose that the original regression problem is well-specified, meaning that we can write the unknown regression function $f^*$ in the form $f^*(\cdot) = \sum_{j=1}^m \theta^*_j \tilde{K}(\cdot, \tilde{x}_j)$ for a weight vector $\theta^* \in \mathbb{R}^m$. In this setting, if we were to make full observations of labeled target pairs $\{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^m$, it would be possible to estimate $f^*$ consistently by making a suitable choice of the regularization parameter. Our goal is to explore whether or not such consistency is possible for the RaT and SM estimators given only labeled source observations.

3.3.1 Exact expressions

We begin by deriving exact expressions for the RaT and SM fixed estimates, along with their mean-squared errors (MSEs). Our analysis applies to an arbitrary response-linear teacher, as represented by a matrix $T \in \mathbb{R}^{m \times n}$. By construction, each of the RaT and SM solutions corresponds to a function of the form

    $f_\theta(\cdot) := \sum_{j=1}^m \theta_j \tilde{K}(\cdot, \tilde{x}_j)$  for some weight vector $\theta \in \mathbb{R}^m$.    (25)

We use $\hat{\theta}_{\mathrm{RaT}}$ and $\hat{\theta}_{\mathrm{SM}}$ to denote the corresponding weight vectors, so that $\hat{f}_{\mathrm{RaT}} \equiv f_{\hat{\theta}_{\mathrm{RaT}}}$ and $\hat{f}_{\mathrm{SM}} \equiv f_{\hat{\theta}_{\mathrm{SM}}}$.
RaT and SM fixed points: In this case, for any teacher matrix $T \in \mathbb{R}^{m \times n}$, it is possible to give explicit expressions for the RaT and SM weight vectors $\hat{\theta}_{\mathrm{RaT}}$ and $\hat{\theta}_{\mathrm{SM}}$. More specifically, define the target kernel matrix $\tilde{K}_{mm} \in \mathbb{R}^{m \times m}$ with $(j, j')$th entry $\tilde{K}(\tilde{x}_j, \tilde{x}_{j'})$. We also define the source–target matrix $\tilde{K}_{nm} \in \mathbb{R}^{n \times m}$ with $(i, j)$th entry $\tilde{K}(x_i, \tilde{x}_j)$. In terms of these matrices, it can be shown that the unique SM fixed estimate is given by

    $\hat{\theta}_{\mathrm{SM}} = \big( \tilde{K}_{mm} + m\gamma I_m \big)^{-1} T(y)$.    (26a)

This follows in a straightforward way, since the SM procedure simply applies the student regression method, in this case the $\gamma$-regularized KRR procedure, to the teacher's predictions $T(y) \in \mathbb{R}^m$. Starting from the definition (7) of the RaT fixed point $\hat{f}_{\mathrm{RaT}}$, it can be shown that it takes the form $\hat{f}_{\mathrm{RaT}} \equiv f_{\hat{\theta}_{\mathrm{RaT}}}$, where

    $\hat{\theta}_{\mathrm{RaT}} := \big( A + m\gamma I_m \big)^{-1} T(y)$,  where $A := T \tilde{K}_{nm} \in \mathbb{R}^{m \times m}$.    (26b)

See Section B.3 for the derivation of these relations. For any estimate $\hat{f} \equiv f_{\hat{\theta}}$, its target-based MSE, relative to the true function $f^*$, can be expressed as

    $\mathrm{MSE}(\hat{f}) := \mathbb{E}_w \|\hat{f} - f^*\|_m^2 = \mathbb{E}_w \Big[ \tfrac{1}{m} \sum_{j=1}^m \big( \hat{f}(\tilde{x}_j) - f^*(\tilde{x}_j) \big)^2 \Big]$.    (27)

We use $\mathrm{MSE}(\hat{f}_{\mathrm{SM}})$ and $\mathrm{MSE}(\hat{f}_{\mathrm{RaT}})$ to denote this mean-squared error for the SM and RaT estimators, respectively.

Proposition 2 (Exact MSEs for SM/RaT). Under the previous assumptions, we have

    $\mathrm{MSE}(\hat{f}_{\mathrm{RaT}}) = \tfrac{1}{m} \big\| m\gamma \tilde{K}_{mm} (A + m\gamma I)^{-1} \theta^* \big\|_2^2 + \tfrac{\sigma^2}{m} \big|\big|\big| \tilde{K}_{mm} (A + m\gamma I)^{-1} T \big|\big|\big|_F^2$,    (28a)

where the matrix $A$ was previously defined (26b), as well as

    $\mathrm{MSE}(\hat{f}_{\mathrm{SM}}) = \tfrac{1}{m} \big\| \underbrace{\tilde{K}_{mm} (I - B_\gamma^{-1} A) \theta^*}_{\text{SM bias vector}} \big\|_2^2 + \tfrac{\sigma^2}{m} \big|\big|\big| \tilde{K}_{mm} B_\gamma^{-1} T \big|\big|\big|_F^2$,    (28b)

where $B_\gamma := \tilde{K}_{mm} + m\gamma I$.
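Before unpacking the bias term, the closed-form fixed points (26a) and (26b) are easy to compute in a few lines. The following NumPy sketch uses a Laplace-kernel KRR teacher of the form (21); all names and parameter values are illustrative stand-ins, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, gamma, lam = 30, 20, 0.05, 0.1

def laplace_kernel(A, B, nu=1.0):
    """Laplace kernel exp(-|a - b| / nu) on univariate inputs (illustrative)."""
    return np.exp(-np.abs(A[:, None] - B[None, :]) / nu)

x_src = rng.normal(size=n)
x_tgt = rng.normal(size=m)
y = rng.normal(size=n)                      # source responses

K_mm = laplace_kernel(x_tgt, x_tgt)         # target kernel matrix K~_mm
K_nm = laplace_kernel(x_src, x_tgt)         # source-target kernel matrix K~_nm
# Response-linear KRR teacher T = K_mn (K_nn + n*lam*I_n)^{-1}, cf. (21).
T = laplace_kernel(x_tgt, x_src) @ np.linalg.inv(
        laplace_kernel(x_src, x_src) + n * lam * np.eye(n))

# SM fixed point (26a): student KRR applied to the teacher predictions T(y).
theta_sm = np.linalg.solve(K_mm + m * gamma * np.eye(m), T @ y)

# RaT fixed point (26b): A = T K~_nm replaces K~_mm on the left-hand side.
A = T @ K_nm
theta_rat = np.linalg.solve(A + m * gamma * np.eye(m), T @ y)
```

Note that both estimates share the right-hand side $T(y)$; they differ only in which matrix is inverted.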
The SM bias vector further decomposes into distribution-shift and regularization components:

    $\tilde{K}_{mm} (I - B_\gamma^{-1} A) \theta^* = \underbrace{(\tilde{K}_{mm} - A) \theta^*}_{\text{shift bias}} + \underbrace{m\gamma B_\gamma^{-1} A \theta^*}_{\text{ridge bias}}$.    (28c)

We prove these results in Section B.3.

3.3.2 Teacher-induced separation between RaT and SM

Proposition 2 suggests that there might be a fundamental gap between the performance of the RaT and SM estimators. This intuition turns out to be correct, and in this section, we give a result that formalizes this performance gap in a particular setting. Recall that Proposition 2 applies to a student model class based on a kernel function $\tilde{K}$. Given this set-up, one natural model of a biased teacher is one that performs kernel ridge regression using the kernel function $\tilde{K}$ and the source data, along with a teacher regularization parameter $\lambda > 0$. This leads to a family of $\lambda$-biased teacher matrices, given by

    $T_\lambda = K_{mn} \big( K_{nn} + \lambda n I_n \big)^{-1}$,    (29)

where $K_{mn}$ has $(j, i)$th entry $\tilde{K}(\tilde{x}_j, x_i)$, and $K_{nn}$ has $(i, i')$th entry $\tilde{K}(x_i, x_{i'})$. The third relevant kernel matrix is the student kernel matrix $\tilde{K}_{mm}$ previously defined for the target covariates $\{\tilde{x}_j\}_{j=1}^m$, which has $(j, j')$th entry $\tilde{K}(\tilde{x}_j, \tilde{x}_{j'})$.

In order to vary the richness of the student/teacher classes, we consider kernel matrices whose eigenspectra decay in a polynomial manner. More precisely, the matrix $K$ exhibits $\alpha$-eigendecay if its eigenvalues satisfy the relation² $\sigma_j(K) \asymp j^{-2\alpha}$. With this notion, we study the following set-up:

• The bias of the teacher model is captured by the parameter $\lambda > 0$. The richness of the kernel class relative to the source covariates is defined by the eigenspectrum of the matrix $K_{nn}$, assumed to have $\alpha$-polynomial decay for some $\alpha > 1/2$.
• Similarly, the student model has bias parameter $\gamma > 0$, and the spectrum of the target kernel matrix $\tilde{K}_{mm}$ has $\beta$-polynomial decay for some $\beta \in \big( 1/2,\, 1/2 + \alpha \big)$.

² More precisely, there are universal constants $0 < c_0 \le c_1 < \infty$ such that $c_0 j^{-2\alpha} \le \sigma_j(K) \le c_1 j^{-2\alpha}$.

Given a fixed teacher bias parameter $\lambda$, our goal is to understand the differences between SM and RaT when the student bias parameter $\gamma$ is freely adjustable. In the following statement, the quantities $c_0, c_1, c_2 > 0$ are all universal constants independent of $(n, R, \sigma^2)$.

Theorem 2 (Separation result for SM/RaT, with a matching lower bound). Consider a $\lambda$-biased teacher for some $\lambda > 0$, and a source sample size $n \ge \lambda^{-1 - \frac{1}{2\alpha}}$. Then there is a Hilbert space $\mathcal{H}$ with kernel $\tilde{K}$ and source–target covariates for which $(K_{nn}, \tilde{K}_{mm})$ exhibit $(\alpha, \beta)$-eigendecay such that, for any radius $R \ge 1$, we have:

    RaT upper:  $\sup_{\|f^*\|_{\mathcal{H}} \le R} \mathbb{E}_w \|\hat{f}^\gamma_{\mathrm{RaT}} - f^*\|_m^2 \le c_0 R^2 \Big( \tfrac{\sigma^2}{n R^2} \Big)^{\frac{2\beta}{2\alpha+1}}$  with $\gamma = \tfrac{1}{\lambda} \Big( \tfrac{\sigma^2}{R^2 n} \Big)^{\frac{2(\alpha+\beta)}{2\alpha+1}}$,    (30a)

    SM lower:  $\sup_{\|f^*\|_{\mathcal{H}} \le R} \inf_{\gamma > 0} \mathbb{E}_w \|\hat{f}^\gamma_{\mathrm{SM}} - f^*\|_m^2 \ge c_1 R^2 > 0$.    (30b)

Moreover, for source–target pairs $m = n \ge \max\big\{ \lambda^{-1 - \frac{1}{2\alpha}},\, (R^2/\sigma^2)^{\frac{1}{2\alpha}} \big\}$, any estimator $\tilde{f}$ based on the labeled source sample $\{(x_i, y_i)\}_{i=1}^n$ and the unlabeled target covariates $\{\tilde{x}_j\}_{j=1}^n$ has MSE lower bounded as

    Minimax lower:  $\sup_{\|f^*\|_{\mathcal{H}} \le R} \mathbb{E}_w \|\tilde{f} - f^*\|_m^2 \ge c_2 \min\Big\{ R^2 \Big( \tfrac{\sigma^2}{n R^2} \Big)^{\frac{2\beta}{2\alpha+1}},\, R^2 \Big\}$.    (30c)

See Section 5.2 for the proof. At a high level, the main take-away is that the RaT procedure, with a properly adjusted level of student regularization, achieves the minimax-optimal rate for estimating the true regression function $f^*$. In contrast, the SM estimate suffers from a constant level of bias, preventing consistency.
Beyond this performance gap, the RaT consistency result exhibits three distinct regimes that are worthy of further comment:

No covariate shift: First of all, the setting of "no covariate shift" can be captured by setting $\alpha = \beta$. In this case, viewing $(\sigma, R)$ as constants, the RaT estimate achieves estimation error that scales as $(1/n)^{\frac{2\alpha}{2\alpha+1}}$. Note that this is the standard minimax rate for a kernel that exhibits $\alpha$-polynomial decay (e.g., [Wai19]). As a special case, if we use spline kernels, then the RKHS corresponds to a Sobolev space with smoothness $\alpha > 1/2$.

Setting $\alpha \neq \beta$ induces covariate shift, and interestingly, its effect can be either malign or benign. (It is far more common to view covariate shift as harmful than helpful.) For the sake of these comparisons, suppose that we have the same number of target samples as source samples (i.e., $m = n$). In this case, if we were given $m$ labeled target samples, then training the student directly would yield a rate of $(1/n)^{\frac{2\beta}{2\beta+1}}$. Let us compare this guarantee to the minimax-optimal guarantee of $(1/n)^{\frac{2\beta}{2\alpha+1}}$ that can be achieved when only the source data has labels.

Malign covariate shift: First, if $\alpha > \beta$, then we have $\frac{2\beta}{2\beta+1} > \frac{2\beta}{2\alpha+1}$, so that the idealized estimator based on labeled target samples exhibits faster convergence. This is an example of harmful covariate shift, in that it would have been more helpful to obtain labeled samples from the target covariate distribution (instead of the source).

Benign covariate shift: The opposite conclusion holds if $\alpha < \beta$: here it turns out to be beneficial to observe labeled source data instead of target data. Intuitively, when $\alpha < \beta$, the source covariate distribution $P_X$ places more mass on the kernel spectrum with larger eigenvalues, and it is these eigenfunctions that are most relevant in estimation.
(Regularization in kernel ridge regression amounts to down-weighting the impact of directions associated with smaller eigenvalues.)

3.4 Computational guarantees

In the previous section, we analyzed a particular student–teacher pair for which the RaT estimate can be computed in closed form (viz. equation (26b)). In general, computing the RaT estimate requires iterating between the student–teacher steps, and the simplest such scheme is the Picard iteration described previously in Section 2.2. Given a stepsize $\eta > 0$, it begins with an initial student function $f^0 \in \mathcal{F}$, and then generates the sequence

    $f^{k+1} = \mathrm{Prox}_\eta\big( f^k(\tilde{x}_1^m) - \eta \hat{G}(f^k) \big)$  for $k = 0, 1, 2, \ldots$,    (31)

using the student proximal update $\mathrm{Prox}_\eta: \mathbb{R}^m \to \mathcal{F}$ from equation (3), and the gradient estimator $\hat{G}$ from equation (6c). Figure 2 illustrates some representative convergence behavior that we observe in practice for this algorithm. Here we measure performance in terms of the operator defect

    $\hat{D}_\eta(f) := \mathrm{Prox}_\eta\big( f(\tilde{x}_1^m) - \eta \hat{G}(f) \big) - f$,    (32)

which measures how close $f \in \mathcal{F}$ is to being a RaT fixed point. Panels (a) and (b) correspond to the stepsizes $\eta = 0.10$ and $\eta = 0.20$, respectively, and we show results for training a 10-class logistic classifier as the student, using three different types of neural network teachers. The teachers are labeled with pairs $(h_1, h_2)$, corresponding to the number of units in the two hidden layers of the architecture. For this particular problem, at one extreme, the choice $(h_1, h_2) = (16, 16)$, shown in light purple, leads to a teacher neural net with sufficient flexibility to yield a relatively accurate gradient approximation, whereas the choice $(h_1, h_2) = (4, 4)$, shown in red, leads to a teacher that is relatively impoverished.
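For intuition, the Picard iteration (31) can be sketched for a simplified linear-feature student and a random response-linear teacher; this is our own toy set-up, not the paper's experiment, with the Prox step realized as a ridge fit of the student to the shifted fitted values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, d = 40, 30, 3
eta, gamma = 0.1, 1.0

Phi_src = rng.normal(size=(n, d))   # student features at source covariates
Phi_tgt = rng.normal(size=(m, d))   # student features at target covariates
T = rng.normal(size=(m, n)) / n     # response-linear teacher (stand-in)
y = rng.normal(size=n)              # source responses

def prox(v):
    """Student proximal update: ridge fit of the linear student to targets v."""
    lhs = Phi_tgt.T @ Phi_tgt / (eta * m) + gamma * np.eye(d)
    rhs = Phi_tgt.T @ v / (eta * m)
    return np.linalg.solve(lhs, rhs)

w = np.zeros(d)
defects = []
for k in range(50):
    G_hat = T @ (Phi_src @ w - y)              # teacher-based gradient estimate
    w_next = prox(Phi_tgt @ w - eta * G_hat)   # Picard update (31)
    # Operator defect (32), measured in the empirical norm on the target points.
    defects.append(np.linalg.norm(Phi_tgt @ (w_next - w)) / np.sqrt(m))
    w = w_next
```

In this linear instance the composite map is a contraction, so the defect sequence shrinks geometrically, mirroring the behavior in Figure 2.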
The theory to be developed will give sufficient conditions under which (a) we are guaranteed to obtain convergence of this type, and (b) the role of the teacher-provided gradient approximation can be characterized. Let us now turn to some theoretical results. As previously discussed, the update (31) can be understood as mimicking a proximal gradient update, using the exact gradient $\nabla \bar{L}_m(f)$ for the oracle objective function (2). From known results on proximal methods, the convergence properties of this method depend on the co-coercivity and monotonicity properties of the gradient operator [BC17; Bec17; Nes18]. Accordingly, it is natural that the convergence properties of the update (31) should depend on analogous properties of the approximate gradient estimator $\hat{G}$ provided by the teacher. We introduce these properties next.

Figure 2. Convergence behavior of the Picard iteration (31) when using a neural network teacher to construct the gradient estimate $\hat{G}$, for a multinomial classifier over $K = 10$ classes (see Section 4.2 for more details). Each plot shows the operator defect norm $\|\hat{D}_\eta(f^k)\|$ versus the iteration number $k$. The three different curves correspond to three-layer neural-net teachers with three different architectures: hidden units $(h_1, h_2)$ are marked in the labels. Solid lines correspond to the mean defect over $T = 100$ random trials, with the shaded areas showing 95% confidence intervals. Panels (a) and (b) use stepsizes $\eta = 0.10$ and $\eta = 0.20$, respectively.
Structural properties of the gradient estimator: Given a parameter $\hat{\beta} > 0$, we say that $\hat{G}$ is $(\hat{\beta}, \varepsilon)$-approximately co-coercive at $\tilde{f}$ if

    $\big\langle f - \tilde{f},\, \hat{G}(f) - \hat{G}(\tilde{f}) \big\rangle_m \ge \tfrac{1}{\hat{\beta}} \Big\{ \|\hat{G}(f) - \hat{G}(\tilde{f})\|_m^2 - \varepsilon^2 \Big\}$  for all $f \in \mathcal{F}$.    (33a)

For $\varepsilon = 0$, this definition corresponds to the standard definition of co-coercivity for an operator, so we have a relaxation of the standard condition. Similarly, for a parameter $\hat{\mu} > 0$, we say that $\hat{G}$ is $(\hat{\mu}, \varepsilon)$-approximately monotone at $\tilde{f}$ if

    $\big\langle f - \tilde{f},\, \hat{G}(f) - \hat{G}(\tilde{f}) \big\rangle_m \ge \hat{\mu} \Big\{ \|f - \tilde{f}\|_m^2 - \varepsilon^2 \Big\}$  for all $f \in \mathcal{F}$.    (33b)

In terms of these conditions, we have:

Theorem 3 (Computational guarantees for RaT). Suppose that the gradient estimator $\hat{G}$ is $(\hat{\beta}, \varepsilon)$-approximately co-coercive (33a) at $\hat{f}$. Then after $K$ iterations with stepsize $\eta = 1/\hat{\beta}$, we have

    $\min_{k = 0, \ldots, K} \|\hat{D}_\eta(f^k)\|_m^2 \le \tfrac{2}{K+1} \|f^0 - \hat{f}\|_m^2 + \tfrac{4\varepsilon^2}{\hat{\beta}^2}$,    (34a)

where $\hat{f}$ is any RaT fixed point. If in addition $\hat{G}$ is $(\hat{\mu}, \varepsilon)$-approximately monotone (33b) at $\hat{f}$, then after $K$ iterations with stepsize $\eta = \hat{\mu}/\hat{\beta}^2$, we have

    $\|\hat{D}_\eta(f^K)\|_m^2 \le 2 \Big( 1 - \tfrac{\hat{\mu}^2}{\hat{\beta}^2} \Big)^K \|f^0 - \hat{f}\|_m^2 + 8\varepsilon^2$.    (34b)

See Section 5.3 for the proof. Observe that the guarantee (34b) ensures geometric convergence (up to the radius $8\varepsilon^2$). However, it requires both the co-coercivity and monotonicity conditions to hold. On the other hand, the guarantee (34a) requires only co-coercivity, but yields a slower $1/K$ convergence rate. Both rates have analogues in standard gradient-based optimization: the $1/K$ rate holds for a function with Lipschitz gradient, otherwise known as a smooth function, whereas the geometric rate holds for a strongly convex and smooth function. Let us describe how it is possible to show that the co-coercivity (33a) and monotonicity (33b) conditions hold for various problems.
To give the intuition, recall that $\hat{G}(f)$ acts as an approximation to the gradient $\nabla \bar{L}_m(f)$ of the smoothed target risk (1a). When $\bar{L}_m$ is a $\mu$-strongly convex function, its gradient $\nabla \bar{L}_m$ satisfies the strong monotonicity condition

    $\big\langle f - \tilde{f},\, \nabla \bar{L}_m(f) - \nabla \bar{L}_m(\tilde{f}) \big\rangle_m \ge \mu \|f - \tilde{f}\|_m^2$.    (35)

As long as $\hat{G}(f)$ approximates $\nabla \bar{L}_m(f)$, we can expect that $\hat{G}$ satisfies the approximate strong monotonicity condition (33b) with a suitable $(\hat{\mu}, \varepsilon)$. Similar comments apply when $\nabla \bar{L}_m$ is a $\beta$-smooth function, in which case the co-coercivity condition is the relevant one. We leave an in-depth study for future work.

4 Numerical results

In this section, we describe the results of various numerical studies that complement our theoretical analysis.

4.1 Separation between SM and RaT procedures

In Theorem 2, we stated and proved a result that reveals a fundamental separation between the SM and RaT estimators: while the RaT procedure is consistent when the student model is appropriately tuned, the SM estimate is inconsistent, even when the same student tuning is available. In this section, we describe a suite of numerical simulations that further explore this separation result. These experiments serve three purposes: (i) to verify the statistical error bounds in a setting that closely matches the theorem; (ii) to demonstrate that the qualitative behavior persists for more general kernel models, even when the theorem assumptions do not strictly hold; and (iii) to test whether the same phenomenon appears for other student–teacher interactions, such as those based on neural networks.

4.1.1 Verification of the theoretical rate

We first construct a synthetic class of problems that approximately matches the construction used in the proof of Theorem 2.
Specifically, in a univariate setting, we consider the target distribution $Q_X = N(0, 1)$, and the family of source distributions $P_X = N(0, \sigma_P^2)$ for an adjustable variance $\sigma_P^2 > 0$. We construct a kernel using a finite-dimensional Hermite feature map; by computing the empirical kernel matrices, we can study their eigendecay, and we observe approximate polynomial eigendecay with exponents $(\alpha, \beta)$. Figure 3 shows that the RaT estimator follows the predicted rate $n^{-2\beta/(2\alpha+1)}$, while the SM estimator exhibits a non-vanishing error due to the shift bias from equation (30b). The actual values of $(\alpha, \beta)$, which determine the observed rate, depend on the interaction between the source and target eigenspectra.

Figure 3. Mean-squared errors of RaT and SM under Gaussian covariate shift using a Hermite feature model, target distribution $Q = N(0, 1)$, and source distributions $P = N(0, \sigma_P^2)$ with $\sigma_P \in \{0.9, 1.0, 1.1\}$. Solid curves show the median MSE with interquartile bands over repetitions, and dashed lines indicate the predicted scaling laws. Consistent with Theorem 2, RaT achieves the rate $n^{-2\beta/(2\alpha+1)}$, whereas SM exhibits a non-vanishing error floor due to distribution shift. (Panel values: $\sigma_P = 0.9$ with $(\alpha, \beta) \approx (0.88, 0.78)$ and slope $0.57$; $\sigma_P = 1.0$ with $(\alpha, \beta) \approx (0.87, 0.86)$ and slope $0.63$; $\sigma_P = 1.1$ with $(\alpha, \beta) \approx (0.86, 0.95)$ and slope $0.70$.)
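A construction of this flavor can be sketched as follows; the normalization and geometric damping of the Hermite features are our own illustrative choices, made so that the empirical kernel spectrum decays:

```python
import math
import numpy as np

def hermite_features(x, num_feats=8, damp=0.5):
    """Probabilists' Hermite polynomials He_0..He_{num_feats-1} via the recurrence
    He_{k+1}(x) = x He_k(x) - k He_{k-1}(x), normalized by sqrt(k!) and damped
    geometrically so that the induced kernel spectrum decays."""
    H = [np.ones_like(x), x.copy()]
    for k in range(1, num_feats - 1):
        H.append(x * H[k] - k * H[k - 1])
    feats = np.stack(H[:num_feats], axis=1)
    scales = np.array([damp**k / math.sqrt(math.factorial(k))
                       for k in range(num_feats)])
    return feats * scales

rng = np.random.default_rng(6)
x = rng.normal(size=200)                  # covariates drawn from N(0, 1)
Phi = hermite_features(x)
K = Phi @ Phi.T / len(x)                  # empirical kernel matrix
eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
eigs = eigs[eigs > 1e-10]

# Fit a polynomial-decay exponent: sigma_j ~ j^{-2*alpha} corresponds to a
# line of slope -2*alpha in log-eigenvalue versus log-index coordinates.
j = np.arange(1, len(eigs) + 1)
slope = np.polyfit(np.log(j), np.log(eigs), 1)[0]
alpha_hat = -slope / 2
```

The same log-log fit, applied to the source and target kernel matrices separately, yields the empirical exponents $(\alpha, \beta)$ reported in Figure 3.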
In the setting of benign covariate shift with $\alpha < \beta$, as shown in the right-most panel of Figure 3, we see that the RaT estimator can be faster than the naive MSE rates determined solely by $\alpha$ or $\beta$, i.e., $n^{-2\beta/(2\alpha+1)} < \min\{ n^{-2\alpha/(2\alpha+1)}, n^{-2\beta/(2\beta+1)} \}$.

4.1.2 Separation for more general kernels

Next we consider a more generic kernel regression setting with target distribution $Q = \mathrm{Unif}[0, 1]$, and source distribution $P = \mathrm{Beta}(a, 1)$. We use kernel ridge regression with Laplace kernels, i.e., $K(x, y) = \exp(-\|x - y\|_2/\nu_s)$ and $\tilde{K}(x, y) = \exp(-\|x - y\|_2/\nu_t)$ with $0 < \nu_t < \nu_s$. Although the exact eigendecay assumptions of Theorem 2 no longer hold, the qualitative separation between SM and RaT is still observed. Figure 4 shows that across all configurations, the SM estimator exhibits a persistent error floor, whereas the RaT estimator continues to improve as the source sample size increases. This behavior is consistent with the theoretical analysis: SM suffers from a shift-induced bias, while RaT progressively corrects this bias through iterative feedback from the teacher model.

4.1.3 Neural-network estimators

Finally, we set up an experiment to examine whether similar phenomena appear when using non-kernel pairs of student–teachers. Concretely, consider a regression task in which both the teacher and student are one-hidden-layer ReLU networks with $k$ hidden units, corresponding to functions of the form

    $h_{a,b,w}(x) = \sum_{j=1}^k w_j \varphi\big( \langle a_j, x \rangle + b_j \big)$  for $(a_j, b_j) \in \mathbb{R}^d \times \mathbb{R}$, and $w \in \mathbb{R}^k$,    (36)

where $\varphi(t) := \max(0, t)$ is the ReLU function. We studied a teacher class $\mathcal{G}$ based on $k = 2$ hidden units; this very small choice induces a strong bias, analogous to the role played by the explicit penalty $\lambda$ in a kernel procedure. We construct the student class $\mathcal{F}$ with the number of hidden units $\tilde{k}$ as a tuning parameter to be adjusted; it plays the role of $\gamma$ from our earlier experiments.

Figure 4. Mean-squared errors of the RaT and SM estimators for Laplace kernel ridge regression, using target distribution $Q = \mathrm{Beta}(1, 1)$ and source distributions $P = \mathrm{Beta}(a, 1)$ with $a \in \{0.2, 1.0, 2.0\}$. The teacher and student use Laplace kernels with differing bandwidths, so that the teacher is mis-specified relative to the student. Solid curves show the median MSE with interquartile bands over repeated trials. Each curve is normalized by its value at the smallest sample size. The dashed line represents a least-squares linear fit on a log–log scale to the RaT curve. (Fitted slopes: $0.57$, $0.47$, and $0.42$ for $a = 0.2$, $1.0$, and $2.0$, respectively.)

Figure 5. Mean-squared errors (MSE) of RaT and SM estimators for neural network students with a regression-stump teacher, and Beta covariate shift. (a) Target test MSE versus source sample size on a log–log scale. Solid lines show the median over repeated runs, and shaded regions indicate the interquartile range. The dashed line is a least-squares linear fit in log–log coordinates to the RaT median curve (fitted slope $\approx 0.84$). (b) Function estimates for a single run, with the ground-truth function (black dashed), source samples (gray points), SM teacher (dark red dotted), and final function estimates (solid blue for RaT, solid red for SM); single-run values $\mathrm{MSE}_{\mathrm{SM}} \approx 0.1728$ and $\mathrm{MSE}_{\mathrm{RaT}} \approx 0.0011$.
Faint blue curves correspond to intermediate RaT iterates. The reported MSE values are computed on held-out target data.

As seen in Figure 5, we observe the same qualitative behavior in this neural net setting: the SM estimator exhibits a substantially higher error floor, while the RaT estimator continues to improve as the sample size increases. These results suggest that the separation mechanism identified in Proposition 2 is not limited to kernel methods but reflects a more general phenomenon of iterative bias correction.

4.2 Covariate shift in ImageNette

We now turn to experiments in a real data setting that further elucidate the differences between the SM and RaT approaches. These experiments made use of the ImageNette dataset [How19], a subset of the full ImageNet dataset restricted to $K = 10$ classes that are relatively easy to distinguish. After pre-processing, each covariate $x$ corresponds to a tensor with dimensions $224 \times 224 \times 3$, representing an image with $224 \times 224$ pixels and 3 color bands. The responses $y \in \mathbb{R}^{10}$ are all standard-basis vectors, with $y = e_j$ meaning that the class label is equal to $j$. There are $N = 13394$ samples in total, and in the experiments reported here, we used random splits into 70% training and 30% validation.

Beginning with a ResNet18 neural network [He+16] trained on the full ImageNet dataset, we performed 4 epochs of fine-tuning updates on the ImageNette subset, adjusting the weights and biases of the last two layers. Doing so yields a network that achieves classification accuracy $\approx 98\%$ and cross-entropy loss $\approx 0.03$ on the ImageNette dataset, reflecting the easiness of the original problem. The final layer of the ResNet18 model contains a total of $D = 512$ features; by performing PCA using the source training data, we reduced these features to a total of $d = 40$ principal components, and both student and teacher models operate in this 40-dimensional space.
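The feature-reduction step can be sketched in a few lines of numpy. In this sketch, `train_feats` and `test_feats` are hypothetical random stand-ins for the extracted last-layer ResNet18 features (the exact pre-processing pipeline used in the experiments may differ):

```python
import numpy as np

def pca_reduce(train_feats, test_feats, d=40):
    """Project features onto the top-d principal components of the
    training features (both sets centered with training statistics)."""
    mu = train_feats.mean(axis=0)
    Xc = train_feats - mu
    # Right singular vectors of the centered data give the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                         # (D, d) projection matrix
    return Xc @ W, (test_feats - mu) @ W

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 512))     # stand-ins for D = 512 features
test = rng.normal(size=(200, 512))
Ztr, Zte = pca_reduce(train, test, d=40)
```

Both student and teacher would then operate on the resulting 40-dimensional representations.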
We remark that this PCA step has a very minor effect on accuracy, for instance increasing cross-entropy loss from $\approx 0.03$ to $\approx 0.07$. In order to introduce covariate shift, we then adapted the image corruption procedures from the ImageNet-C dataset [HD19]; here the "C" corresponds to the various types of corruption (15 in total) that can be applied. The top panels of Figure 6 and Figure 7, respectively, show five levels of the "Pixelate" and "Elastic Blur" corruptions. By doing so, we obtained, for a given corruption (e.g., "Pixelate") and level, a new set of images to act as the target set.

We explored the behavior of various teachers for this problem, also acting on the extracted ResNet18 features. Here we report some representative results for two different types of teachers, specifically designed to induce different levels of teacher bias:

• a three-layer ReLU neural network with hidden-unit dimensions $(h_1, h_2) = (4, 32)$, and
• a three-layer ReLU neural network with hidden-unit dimensions $(h_1, h_2) = (16, 32)$.

We trained each teacher model using the ADAM algorithm with its default settings, and a weight decay parameter 0.01. In all cases, the student was a $K = 10$ class multinomial classifier operating on the extracted ResNet18 features, and we trained the student model using ADAM with these same settings. For the SM model, the student was trained to fit (via cross-entropy loss) the teacher's predicted probabilities, whereas for the RaT procedure, the student performed the proximal updates, using the teacher's predicted logits. The number $h_1$ of hidden units in the first layer controls the amount of teacher bias: strong ($h_1 = 4$) or medium ($h_1 = 16$). With only $h_1 = 4$ hidden units in the first layer, the teacher is forced to perform a drastic dimensionality reduction of the reduced last layer (from dimension 40 down to 4 non-linear features).
This structural constraint embeds bias into the teacher, and we expect that it becomes more severe as the degree of covariate shift (as measured by the level of corruption) increases.

Figure 6 (graphics not shown). Top row: a sample image from the "French horn" class, and its corrupted versions using the "pixelate" corruption. (a) Plots of the difference in cross-entropy (CE) loss between the SM and RaT procedures versus the corruption level. Two types of neural nets are shown: three-layer architectures with $(h_1, h_2) = (4, 32)$ hidden units, or $(h_1, h_2) = (16, 32)$ hidden units. Solid lines correspond to the mean CE difference over $T = 100$ random splits into train and validation, and the shaded areas correspond to 95% confidence intervals based on these trials. (b) Similar plots for the raw CE scores of the two methods, shown here for the $(4, 32)$-teacher.

Figure 6 and Figure 7 show some representative results, for the "pixelate" and "elastic-blur" corruptions respectively. In each plot, panel (a) plots the difference in cross-entropy loss, measured on the validation set of the target dataset, between the SM and RaT procedures, versus the amount of covariate shift (as measured by the level of corruption applied). The horizontal black line corresponds to the zero-line, so that above this line, the RaT procedure exhibits superior performance. There are two curves, corresponding to the two types of teachers, and the shaded area corresponds to a 95% confidence interval, constructed using $T = 100$ trials run on random splits of the dataset into training and validation.
Panel (b) plots the raw values of the cross-entropy (CE) loss for both SM and RaT procedures, again versus the corruption level and with 95% CIs shown. As can be seen from both plots, when the corruption level is relatively low, the SM procedure yields slightly lower CE loss than the RaT procedure. We note that for low levels of corruption, the problem is relatively easy, since the uncorrupted classifier has cross-entropy loss $\approx 0.07$. As the amount of corruption increases, the RaT procedure starts to outperform. In Figure 6, we see out-performance for both the $(4, 32)$-teacher (heavily biased) and the $(16, 32)$-teacher, whereas in Figure 7, the out-performance is much stronger for the more heavily biased teacher. These observations are qualitatively consistent with our general theory.

Figure 7 (graphics not shown). Top row: a sample image from the "Gas pump" class, and its corrupted versions using the "elastic-blur" corruption. (a) Plots of the difference in cross-entropy (CE) loss between the SM and RaT procedures versus the corruption level. Two types of neural nets are shown: three-layer architectures with $(h_1, h_2) = (4, 32)$ hidden units, or $(h_1, h_2) = (16, 32)$ hidden units. Solid lines correspond to the mean CE difference over $T = 100$ random splits into train and validation, and the shaded areas correspond to 95% confidence intervals based on these trials. (b) Similar plots for the raw CE scores of the two methods, shown here for the $(4, 32)$-teacher.

5 Proofs

In this section, we collect the proofs of our three theorems.
We focus on the main steps, with some more technical details provided in the appendices. In addition, the proofs of Proposition 1, Corollary 1, and Proposition 2 are all deferred to Section B in the supplement.

5.1 Proof of Theorem 1

We now turn to the proof of Theorem 1. It suffices to prove the more general bounds (11a) and (11b), since as noted in the discussion following Theorem 1, they imply the bounds claimed in the theorem statement.

Recall the penalty function $\mathrm{Pen}: \mathcal{F} \to \mathbb{R}$ that defines the student oracle function $f^\dagger$, as in equation (5). It is convenient to have a version that acts on vectors $u \in \mathbb{R}^m$, so we define
$$ \mathrm{Pen}_m(u) := \min_{\{ f \in \mathcal{F} \,\mid\, f(\widetilde{x}_1^m) = u \}} \mathrm{Pen}(f), \tag{37} $$
and we introduce the shorthand $\mathrm{Pen}_m(g) \equiv \mathrm{Pen}_m(g(\widetilde{x}_1^m))$. With a slight abuse of notation, we identify a function $f \in \mathcal{F}$ with its fitted-value vector $f(\widetilde{x}_1^m) \in \mathbb{R}^m$ when taking Euclidean inner products and norms.

Equipped with this notation, we are now ready to prove each of the two bounds (11a) and (11b) in turn. Recall that they apply to an arbitrary function $f_k \in \mathcal{F}$, and the one-step update $f_{k+1} = \widehat{H}_\eta(f_k)$. By definition of the defect operator, we then have $\widehat{D}_\eta(f_k) = f_{k+1} - f_k$.

Proof of the bound (11a): By assumption, the function $\bar{R}(f) = \frac{1}{m}\big\{ \bar{\mathcal{L}}_m(f) + \mathrm{Pen}(f) \big\}$ is convex in the fitted values. Therefore, by the subgradient inequality applied at the point $f_{k+1}$, we have
$$ \bar{R}(f^\dagger) - \bar{R}(f_{k+1}) \geq \frac{1}{m} \big\langle f^\dagger - f_{k+1}, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) + z_{k+1} \big\rangle, \tag{38} $$
valid for any choice of subgradient $z_{k+1} \in \partial \mathrm{Pen}_m(f_{k+1})$.

We now construct a particular subgradient. By definition of the proximal update, the fitted vector $f_{k+1}(\widetilde{x}_1^m)$ minimizes the objective
$$ u \mapsto \frac{1}{2\eta} \big\| u - \big( f_k(\widetilde{x}_1^m) - \eta \widehat{G}(f_k) \big) \big\|_2^2 + \mathrm{Pen}_m(u). $$
Consequently, the zero-subgradient optimality condition ensures that there exists some vector $z_{k+1} \in \partial \mathrm{Pen}_m(f_{k+1})$ such that
$$ 0 = \frac{1}{\eta} \Big\{ f_{k+1} - \big( f_k - \eta \widehat{G}(f_k) \big) \Big\} + z_{k+1} = \frac{1}{\eta} \big( f_{k+1} - f_k \big) + \widehat{G}(f_k) + z_{k+1}. $$
Since $\widehat{D}_\eta(f_k) = f_{k+1} - f_k$, we see that $z_{k+1} := -\widehat{G}(f_k) - \frac{1}{\eta} \widehat{D}_\eta(f_k)$ belongs to the sub-differential $\partial \mathrm{Pen}_m(f_{k+1})$. Substituting this choice into our lower bound (38), we find that
$$ \bar{R}(f^\dagger) - \bar{R}(f_{k+1}) \geq \frac{1}{m} \Big\langle f^\dagger - f_{k+1}, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \widehat{G}(f_k) - \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle. $$
Re-arranging both sides yields
$$ \bar{R}(f_{k+1}) - \bar{R}(f^\dagger) \leq \Big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \widehat{G}(f_k) - \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle_m, $$
which proves the claimed bound (11a).

Proof of the estimation bound (11b): Our proof is based on the fact that the proximal operator $\mathrm{Prox}_\eta: \mathbb{R}^m \to \mathbb{R}^m$ is firmly non-expansive [PB14; BC17], so that for any vectors $u, v \in \mathbb{R}^m$, we have
$$ \| \mathrm{Prox}_\eta(u) - \mathrm{Prox}_\eta(v) \|_2^2 \leq \big\langle u - v, \; \mathrm{Prox}_\eta(u) - \mathrm{Prox}_\eta(v) \big\rangle. \tag{39} $$
Since $f^\dagger$ is the oracle minimizer, it satisfies the exact fixed point relation $f^\dagger = \mathrm{Prox}_\eta\big( f^\dagger(\widetilde{x}_1^m) - \eta \nabla \bar{\mathcal{L}}_m(f^\dagger) \big)$, and by definition, we have $f_{k+1} = \mathrm{Prox}_\eta\big( f_k(\widetilde{x}_1^m) - \eta \widehat{G}(f_k) \big)$.

We now apply firm non-expansivity (39) with the choices $u = f_k(\widetilde{x}_1^m) - \eta \widehat{G}(f_k)$ and $v = f^\dagger(\widetilde{x}_1^m) - \eta \nabla \bar{\mathcal{L}}_m(f^\dagger)$. Doing so yields the inequality
$$ \| f_{k+1} - f^\dagger \|_2^2 \leq \Big\langle \big( f_k - \eta \widehat{G}(f_k) \big) - \big( f^\dagger - \eta \nabla \bar{\mathcal{L}}_m(f^\dagger) \big), \; f_{k+1} - f^\dagger \Big\rangle. $$
Using the identity $f_k = f_{k+1} - \widehat{D}_\eta(f_k)$, we can rewrite the right-hand side as
$$ \Big\langle f_{k+1} - f^\dagger, \; f_{k+1} - f^\dagger - \widehat{D}_\eta(f_k) - \eta \big( \widehat{G}(f_k) - \nabla \bar{\mathcal{L}}_m(f^\dagger) \big) \Big\rangle. $$
Cancelling the common term $\| f_{k+1} - f^\dagger \|_2^2$ from both sides and dividing by $\eta > 0$ yields
$$ 0 \leq \Big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f^\dagger) - \widehat{G}(f_k) - \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle, $$
or equivalently,
$$ \Big\langle f_{k+1} - f^\dagger, \; \widehat{G}(f_k) + \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle \leq \big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f^\dagger) \big\rangle. $$
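As a quick numerical sanity check of the firm non-expansiveness property (39), consider the illustrative special case of a ridge penalty $\mathrm{Pen}_m(u) = \frac{\lambda}{2} \| u \|_2^2$ (an assumption of this sketch only, not the general setting), whose proximal map has the closed form $u / (1 + \eta \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(1)
eta, lam = 0.5, 2.0

def prox(u):
    """Proximal map of Pen(u) = (lam/2)*||u||^2 with stepsize eta,
    i.e., argmin_x (1/(2*eta))*||x - u||^2 + (lam/2)*||x||^2."""
    return u / (1.0 + eta * lam)

# Check the firm non-expansiveness inequality (39) on random pairs.
for _ in range(1000):
    u, v = rng.normal(size=8), rng.normal(size=8)
    lhs = np.sum((prox(u) - prox(v)) ** 2)
    rhs = np.dot(u - v, prox(u) - prox(v))
    assert lhs <= rhs + 1e-12
```

Any proximal map would satisfy the same inequality; the ridge case is used here only because its closed form makes the check transparent.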
Adding and subtracting $\nabla \bar{\mathcal{L}}_m(f_{k+1})$ and re-arranging then yields
$$ \big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \nabla \bar{\mathcal{L}}_m(f^\dagger) \big\rangle \leq \Big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \widehat{G}(f_k) - \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle. \tag{40} $$
Since $\bar{\mathcal{L}}_m$ is $\mu$-strongly convex, its gradient $\nabla \bar{\mathcal{L}}_m$ is $\mu$-strongly monotone, so that the left-hand side is lower bounded as
$$ \big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \nabla \bar{\mathcal{L}}_m(f^\dagger) \big\rangle \geq \mu \, \| f_{k+1} - f^\dagger \|_2^2. $$
Combining this inequality with (40), and then rescaling both sides by $1/m$ to convert to the empirical norm $\| \cdot \|_m$ and inner product $\langle \cdot, \cdot \rangle_m$, we find that
$$ \| f_{k+1} - f^\dagger \|_m^2 \leq \frac{1}{\mu} \Big\langle f_{k+1} - f^\dagger, \; \nabla \bar{\mathcal{L}}_m(f_{k+1}) - \widehat{G}(f_k) - \tfrac{1}{\eta} \widehat{D}_\eta(f_k) \Big\rangle_m, $$
as claimed in equation (11b).

5.2 Proof of Theorem 2

The theorem asserts the existence of kernel and source-target pairs with certain properties. We begin by describing the specific class of kernels and source-target pairs that underlies our construction.

A simple feature-based RKHS. In order to make the spectral behavior of the matrices in equations (26a)-(26b) explicit, we consider feature-linear kernels of the form $K(x, y) = \widetilde{K}(x, y) = \phi(x)^\top \phi(y)$, where $\phi: x \mapsto \phi(x) \in \mathbb{R}^D$ is some feature map. The associated reproducing kernel Hilbert space (RKHS) consists of feature-linear functions of the form $f_v(x) = v^\top \phi(x)$ for some weight vector $v \in \mathbb{R}^D$, and we have $\| f_v \|_{\mathcal{H}} = \| v \|_2$. We assume that the true function $f^* \in \mathcal{H}$, say of the form $f^*(x) = (v^*)^\top \phi(x)$ for some $v^* \in \mathbb{R}^D$, and that its Hilbert norm is bounded as $\| f^* \|_{\mathcal{H}} = \| v^* \|_2 \leq R$.

Let $\Phi \in \mathbb{R}^{n \times D}$ be the source feature matrix with $\phi(x_i) \in \mathbb{R}^D$ as its $i$-th row, and define the target feature matrix $\widetilde{\Phi} \in \mathbb{R}^{m \times D}$ in an analogous manner. These feature matrices induce the $D$-dimensional empirical second-moment matrices
$$ \Sigma := \frac{1}{n} \Phi^\top \Phi \in \mathbb{R}^{D \times D} \quad \text{and} \quad \widetilde{\Sigma} := \frac{1}{m} \widetilde{\Phi}^\top \widetilde{\Phi} \in \mathbb{R}^{D \times D}. $$
In our analysis, we assume that $\Sigma$ and $\widetilde{\Sigma}$ commute and therefore admit a common orthonormal eigenbasis. Let $\{ \sigma_k \}_{k=1}^D$ and $\{ \widetilde{\sigma}_k \}_{k=1}^D$ denote their eigenvalues in this shared basis, and suppose that they satisfy the polynomial eigendecay conditions $\sigma_k \asymp k^{-2\alpha}$ and $\widetilde{\sigma}_k \asymp k^{-2\beta}$. Given this set-up, there is an orthogonal matrix $U \in \mathbb{R}^{D \times D}$ such that
$$ \Sigma = U \operatorname{diag}(\sigma_1, \ldots, \sigma_D) U^\top, \quad \widetilde{\Sigma} = U \operatorname{diag}(\widetilde{\sigma}_1, \ldots, \widetilde{\sigma}_D) U^\top, \quad \text{with } \sigma_k \asymp k^{-2\alpha}, \; \widetilde{\sigma}_k \asymp k^{-2\beta}. \tag{41a} $$
The kernel matrices can be written in terms of the feature matrices as
$$ K_{nn} = \Phi \Phi^\top \in \mathbb{R}^{n \times n}, \quad \widetilde{K}_{mn} = \widetilde{\Phi} \Phi^\top \in \mathbb{R}^{m \times n}, \quad \widetilde{K}_{mm} = \widetilde{\Phi} \widetilde{\Phi}^\top \in \mathbb{R}^{m \times m}. \tag{41b} $$
Consequently, the eigendecay assumptions on $\Sigma$ and $\widetilde{\Sigma}$ imply that the matrices $(K_{nn}, \widetilde{K}_{mm})$ exhibit $(\alpha, \beta)$-eigendecay, which ensures that the spectral conditions required in the theorem are satisfied.

5.2.1 Proof of the RaT bound (30a)

Let us first restate the MSE decomposition (28a) using our new representation, and the shorthand $\Sigma_\lambda = \Sigma + \lambda I_D$. The RaT estimate has mean-squared error $\mathbb{E} \| \widehat{f}^\gamma_{\mathrm{RaT}} - f^* \|_m^2 = B^2_{\mathrm{RaT}} + V_{\mathrm{RaT}}$. In the feature-based representation, these bias and variance terms are given by
$$ B^2_{\mathrm{RaT}}(f^*) := \big\| \gamma \widetilde{\Sigma}^{1/2} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} v^* \big\|^2, \quad \text{and} \tag{42a} $$
$$ V_{\mathrm{RaT}} := \frac{\sigma^2}{n} \operatorname{Tr}\Big( \widetilde{\Sigma} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-2} \widetilde{\Sigma} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \Big), \tag{42b} $$
where the derivation of these expressions can be found in Section C.1. Here we recall that the true function $f^*$ can be written as $f^*(x) = (v^*)^\top \phi(x)$.

The bulk of our technical work is devoted to establishing upper bounds on these two terms, which we state here. In the following result, we use $(c_0, c_1)$ to denote universal constants.

Lemma 1 (RaT bias and variance bounds). Under the conditions defined above, the RaT bias has the uniform upper bound
$$ \sup_{\| f^* \|_{\mathcal{H}} \leq R} B^2_{\mathrm{RaT}}(f^*) \leq c_0 R^2 (\lambda \gamma)^{\frac{\beta}{\alpha + \beta}}. \tag{43a} $$
Moreover, its variance satisfies the upper bound
$$ V_{\mathrm{RaT}} \leq c_1 \frac{\sigma^2}{n} (\lambda \gamma)^{-\frac{2\alpha - 2\beta + 1}{2(\alpha + \beta)}}. \tag{43b} $$

See Section 5.2.2 and Section 5.2.3 for the proofs of these two bounds, respectively. Taking them as given, let us complete the proof of the RaT upper bound in Theorem 2.

Balancing the terms. Combining the bias and variance bounds proved above yields
$$ \mathrm{MSE}(\widehat{f}^\gamma_{\mathrm{RaT}}) := \mathbb{E} \| \widehat{f}^\gamma_{\mathrm{RaT}} - f^* \|_m^2 \leq c_0 R^2 (\lambda \gamma)^{\frac{\beta}{\alpha + \beta}} + c_1 \frac{\sigma^2}{n} (\lambda \gamma)^{-\frac{2\alpha - 2\beta + 1}{2(\alpha + \beta)}}. \tag{44} $$
The first term is increasing in $\gamma$, whereas the second is decreasing in $\gamma$, so an order-optimal choice of $\gamma$ is obtained by balancing these two contributions, i.e., by choosing $\gamma$ so that
$$ R^2 (\lambda \gamma)^{\frac{\beta}{\alpha + \beta}} = \frac{\sigma^2}{n} (\lambda \gamma)^{-\frac{2\alpha - 2\beta + 1}{2(\alpha + \beta)}}. $$
Following some algebra, we arrive at the choice
$$ \gamma = \frac{c_1'}{\lambda} \Big( \frac{\sigma^2}{R^2 n} \Big)^{\frac{2(\alpha + \beta)}{2\alpha + 1}} $$
for some universal constant $c_1'$. Substituting this choice back into the bound (44) yields
$$ \mathrm{MSE}(\widehat{f}^\gamma_{\mathrm{RaT}}) \leq c R^{2 - \frac{4\beta}{2\alpha + 1}} \sigma^{\frac{4\beta}{2\alpha + 1}} n^{-\frac{2\beta}{2\alpha + 1}} = c R^2 \Big( \frac{\sigma^2}{n R^2} \Big)^{\frac{2\beta}{2\alpha + 1}}, $$
as claimed. Finally, the standing assumption $\gamma \leq \lambda^{\beta/\alpha}$ is satisfied whenever $\big( \frac{\sigma^2}{R^2 n} \big)^{\frac{2(\alpha + \beta)}{2\alpha + 1}} \leq \lambda^{1 + \beta/\alpha}$, or equivalently, whenever $n \geq c_2 \lambda^{-(2\alpha + 1)/(2\alpha)}$ for a sufficiently large constant $c_2$.

It remains to prove the two claims given in Lemma 1.

5.2.2 Proof of the bound (43a)

By the definition (42a) of the RaT bias term, we have
$$ \sup_{\| f^* \|_{\mathcal{H}} \leq R} B^2_{\mathrm{RaT}}(f^*) \overset{(i)}{=} \sup_{\| v^* \|_2 \leq R} \big\| \gamma \widetilde{\Sigma}^{1/2} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} v^* \big\|_2^2 \overset{(ii)}{\leq} R^2 \, \big|\big|\big| \gamma \widetilde{\Sigma}^{1/2} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \big|\big|\big|^2, $$
where step (i) follows since $f^*(x) = (v^*)^\top \phi(x)$ for some vector $v^* \in \mathbb{R}^D$, along with the equivalence $\| f^* \|_{\mathcal{H}} = \| v^* \|_2$; and step (ii) follows from the variational definition of the operator norm $||| \cdot |||$ (i.e., the maximum singular value).
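As an aside, the exactness of the balancing step behind the choice of $\gamma$ can be checked numerically: with $\lambda \gamma = (\sigma^2/(R^2 n))^{2(\alpha+\beta)/(2\alpha+1)}$, the two terms of the bound (44) agree (modulo the constants $c_0, c_1$), and both match the claimed rate $R^2 (\sigma^2/(n R^2))^{2\beta/(2\alpha+1)}$. The specific numbers below are arbitrary illustrative choices:

```python
import numpy as np

alpha, beta = 1.0, 0.8                 # illustrative eigendecay exponents
R, sigma, n, lam = 1.5, 0.7, 10_000, 1e-2
s = sigma**2 / (R**2 * n)

# Balanced choice: lambda * gamma = s^{2(alpha+beta)/(2alpha+1)}.
gamma = (1.0 / lam) * s ** (2 * (alpha + beta) / (2 * alpha + 1))
bias_term = R**2 * (lam * gamma) ** (beta / (alpha + beta))
var_term = (sigma**2 / n) * (lam * gamma) ** (-(2*alpha - 2*beta + 1) / (2 * (alpha + beta)))
rate = R**2 * s ** (2 * beta / (2 * alpha + 1))

assert np.isclose(bias_term, var_term)   # the two terms balance exactly
assert np.isclose(bias_term, rate)       # and equal the claimed rate
```

The same identity holds for any exponents $\alpha > 0$ and $\beta \in (1/2, 1/2 + \alpha)$; only the constants $c_0, c_1$ are ignored in this check.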
Since the matrices $\Sigma$ and $\widetilde{\Sigma}$ commute, they admit a common orthonormal eigenbasis given by the columns of $U$ in the representation (41a). In this basis, we have
$$ \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} = U \operatorname{diag}(a_1, \ldots, a_D) U^\top, \quad \text{where } a_k := \frac{\widetilde{\sigma}_k \sigma_k}{\sigma_k + \lambda} \geq 0. $$
Putting together the pieces, we find that
$$ \sup_{\| f^* \|_{\mathcal{H}} \leq R} B^2_{\mathrm{RaT}}(f^*) \leq R^2 \, \big|\big|\big| \gamma \widetilde{\Sigma}^{1/2} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \big|\big|\big|^2 \leq R^2 \max_{k = 1, \ldots, D} \frac{\gamma^2 \widetilde{\sigma}_k}{(a_k + \gamma)^2}. $$
Moreover, since $a_k \geq 0$, we have the upper bound
$$ \frac{\gamma^2 \widetilde{\sigma}_k}{(a_k + \gamma)^2} = \frac{\gamma \widetilde{\sigma}_k}{a_k + \gamma} \cdot \frac{\gamma}{a_k + \gamma} \leq \frac{\gamma \widetilde{\sigma}_k}{a_k + \gamma}, $$
from which it follows that
$$ \sup_{\| f^* \|_{\mathcal{H}} \leq R} B^2_{\mathrm{RaT}}(f^*) \leq R^2 \max_{k = 1, \ldots, D} \frac{\gamma \widetilde{\sigma}_k}{a_k + \gamma}. $$
In order to complete the proof of the bound (43a), it suffices to show that, under the assumed eigendecay conditions on the eigensequences $\sigma_k$ and $\widetilde{\sigma}_k$, there is a constant $c_0 \in (0, \infty)$, independent of $\lambda$ and $\gamma$, such that
$$ \max_{k = 1, \ldots, D} \frac{\gamma \widetilde{\sigma}_k}{a_k + \gamma} \leq c_0 (\lambda \gamma)^{\frac{\beta}{\alpha + \beta}}. \tag{45} $$
We prove this final technical claim in Section C.2.

5.2.3 Proof of the bound (43b)

We now prove the variance bound (43b). By definition (42b) of the variance term, we have
$$ V_{\mathrm{RaT}} := \frac{\sigma^2}{n} \operatorname{Tr}\Big( \widetilde{\Sigma} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-2} \widetilde{\Sigma} \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} \Big). $$
Since $\Sigma$ and $\widetilde{\Sigma}$ commute, they admit a common orthonormal eigenbasis; hence there exists an orthogonal matrix $U$ which diagonalizes both $\Sigma$ and $\widetilde{\Sigma}$ simultaneously (cf. equation (41a)). It follows that
$$ \Sigma_\lambda^{-1} = (\Sigma + \lambda I_D)^{-1} = U \operatorname{diag}\big( (\sigma_1 + \lambda)^{-1}, \ldots, (\sigma_D + \lambda)^{-1} \big) U^\top, $$
and therefore
$$ \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} = U \operatorname{diag}\Big( \frac{\widetilde{\sigma}_1 \sigma_1}{\sigma_1 + \lambda}, \ldots, \frac{\widetilde{\sigma}_D \sigma_D}{\sigma_D + \lambda} \Big) U^\top. $$
Recalling the notation $a_k := \frac{\widetilde{\sigma}_k \sigma_k}{\sigma_k + \lambda}$, we obtain
$$ \big( \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-1} + \gamma I_D \big)^{-1} = U \operatorname{diag}\big( (a_1 + \gamma)^{-1}, \ldots, (a_D + \gamma)^{-1} \big) U^\top. \tag{46} $$
Similarly, we have
$$ \widetilde{\Sigma} \Sigma \Sigma_\lambda^{-2} = U \operatorname{diag}\Big( \frac{\widetilde{\sigma}_1 \sigma_1}{(\sigma_1 + \lambda)^2}, \ldots, \frac{\widetilde{\sigma}_D \sigma_D}{(\sigma_D + \lambda)^2} \Big) U^\top. $$
Substituting these diagonal representations into the trace expression for $V_{\mathrm{RaT}}$ and using the invariance of the trace under conjugation, we obtain
$$ V_{\mathrm{RaT}} = \frac{\sigma^2}{n} \operatorname{Tr}\Big( U \operatorname{diag}\Big( \frac{\widetilde{\sigma}_k^3 \sigma_k}{(\sigma_k + \lambda)^2} \frac{1}{(a_k + \gamma)^2} \Big) U^\top \Big) = \frac{\sigma^2}{n} \sum_{k=1}^{D} \frac{\widetilde{\sigma}_k^3 \sigma_k}{(\sigma_k + \lambda)^2} \frac{1}{(a_k + \gamma)^2} = \frac{\sigma^2}{n} \sum_{k=1}^{D} \frac{a_k^2}{(a_k + \gamma)^2} \frac{\widetilde{\sigma}_k}{\sigma_k}, $$
where the last equation follows from the definition of $a_k$.

In order to analyze this sum, we define the integer cutoff $k^*_\gamma := \lceil (\lambda \gamma)^{-1/(2(\alpha + \beta))} \rceil$. Given the scaling $\sigma_k \asymp k^{-2\alpha}$, we have $\sigma_k \asymp \lambda$ for $k_\lambda \asymp \lambda^{-1/(2\alpha)}$. The assumption $\gamma \leq \lambda^{\beta/\alpha}$ implies that
$$ k^*_\gamma = (\lambda \gamma)^{-1/(2(\alpha + \beta))} \geq \lambda^{-1/(2\alpha)} \asymp k_\lambda. $$
Therefore, for all $k \geq k^*_\gamma$, we have $\sigma_k \lesssim \lambda$, and hence
$$ a_k = \frac{\widetilde{\sigma}_k \sigma_k}{\sigma_k + \lambda} \asymp \frac{\widetilde{\sigma}_k \sigma_k}{\lambda} \asymp \lambda^{-1} k^{-2(\alpha + \beta)}. $$
We now split the sum at $k^*_\gamma$. Since $\frac{a_k^2}{(a_k + \gamma)^2} \leq 1$ for all $k = 1, \ldots, D$, the contribution from the terms $k = 1, \ldots, k^*_\gamma$ can be bounded as
$$ \sum_{k=1}^{k^*_\gamma} \frac{a_k^2}{(a_k + \gamma)^2} \frac{\widetilde{\sigma}_k}{\sigma_k} \leq \sum_{k=1}^{k^*_\gamma} \frac{\widetilde{\sigma}_k}{\sigma_k}. $$
Using $\sigma_k \asymp k^{-2\alpha}$ and $\widetilde{\sigma}_k \asymp k^{-2\beta}$, we obtain $\frac{\widetilde{\sigma}_k}{\sigma_k} \asymp k^{2\alpha - 2\beta}$, and therefore
$$ \sum_{k=1}^{k^*_\gamma} \frac{\widetilde{\sigma}_k}{\sigma_k} \leq C \int_1^{k^*_\gamma} k^{2\alpha - 2\beta} \, dk \leq C (k^*_\gamma)^{2\alpha - 2\beta + 1}, $$
where we use $2\alpha - 2\beta + 1 > 0$, which follows from $\beta \in (1/2, \, 1/2 + \alpha)$.

For the tail $k > k^*_\gamma$, we use the bound $\frac{a_k^2}{(a_k + \gamma)^2} \leq \frac{a_k^2}{\gamma^2}$, so that
$$ \sum_{k > k^*_\gamma} \frac{a_k^2}{(a_k + \gamma)^2} \frac{\widetilde{\sigma}_k}{\sigma_k} \leq \frac{1}{\gamma^2} \sum_{k > k^*_\gamma} a_k^2 \frac{\widetilde{\sigma}_k}{\sigma_k}. $$
For $k > k^*_\gamma$ we have $a_k \asymp \lambda^{-1} k^{-2(\alpha + \beta)}$, whence
$$ a_k^2 \frac{\widetilde{\sigma}_k}{\sigma_k} \asymp \lambda^{-2} k^{-4(\alpha + \beta)} k^{2\alpha - 2\beta} = \lambda^{-2} k^{-2\alpha - 6\beta}. $$
Putting together the pieces, it follows that
$$ \sum_{k > k^*_\gamma} a_k^2 \frac{\widetilde{\sigma}_k}{\sigma_k} \leq C \lambda^{-2} \int_{k^*_\gamma}^{\infty} k^{-2\alpha - 6\beta} \, dk \leq C \lambda^{-2} (k^*_\gamma)^{-2\alpha - 6\beta + 1}, $$
where the integral is finite since $2\alpha + 6\beta > 1$. Combining the two parts, we obtain the upper bound
$$ V_{\mathrm{RaT}} \leq C \frac{\sigma^2}{n} \Big( (k^*_\gamma)^{2\alpha - 2\beta + 1} + \frac{1}{\lambda^2 \gamma^2} (k^*_\gamma)^{-2\alpha - 6\beta + 1} \Big). $$
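The exponent bookkeeping in this bound can be verified numerically: after substituting $k^*_\gamma = (\lambda \gamma)^{-1/(2(\alpha + \beta))}$, both bracketed terms carry the same power of $\lambda \gamma$. The exponent pairs below are arbitrary values satisfying $\beta \in (1/2, 1/2 + \alpha)$:

```python
import numpy as np

for alpha, beta in [(1.0, 0.8), (2.0, 1.3), (0.5, 0.6)]:
    # Powers of (lambda * gamma) after substituting
    # k_star = (lambda * gamma)^(-1 / (2 * (alpha + beta))).
    e_head = -(2*alpha - 2*beta + 1) / (2 * (alpha + beta))        # (k*)^{2a-2b+1}
    e_tail = -2 + (2*alpha + 6*beta - 1) / (2 * (alpha + beta))    # (k*)^{-2a-6b+1}/(lam*gam)^2
    assert np.isclose(e_head, e_tail)
```

A short calculation shows why: $-2 + (2\alpha + 6\beta - 1)/(2(\alpha + \beta)) = (-2\alpha + 2\beta - 1)/(2(\alpha + \beta))$, which matches the head exponent identically.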
Substituting $k^*_\gamma = (\lambda \gamma)^{-1/(2(\alpha + \beta))}$ shows that the two terms are of the same order, from which the claimed bound (43b) follows.

5.2.4 Proof of the SM lower bound (30b)

We now prove the lower bound (30b) on the SM squared bias, as stated in Theorem 2. We can formulate the bias term of SM in equation (28c) as
$$ B^2_{\mathrm{SM}}(f^*) = \big\| \lambda \widetilde{\Sigma}^{1/2} \Sigma_\lambda^{-1} v^* + \gamma \widetilde{\Sigma}^{1/2} \widetilde{\Sigma}_\gamma^{-1} \Sigma \Sigma_\lambda^{-1} v^* \big\|^2, \tag{47} $$
which is derived in Section C.1.1.

Observe that both matrices $A := \lambda \widetilde{\Sigma}^{1/2} \Sigma_\lambda^{-1}$ and $B := \gamma \widetilde{\Sigma}^{1/2} \widetilde{\Sigma}_\gamma^{-1} \Sigma \Sigma_\lambda^{-1}$ are positive semi-definite, symmetric, and commutative by assumption. Consequently, for any vector $u$, we have $u^\top A B u \geq 0$, and hence
$$ B^2_{\mathrm{SM}}(f^*) = \| (A + B) v^* \|_2^2 \geq \| A v^* \|_2^2 + \| B v^* \|_2^2 = \underbrace{\big\| \lambda \widetilde{\Sigma}^{1/2} \Sigma_\lambda^{-1} v^* \big\|^2}_{\text{shift bias}} + \underbrace{\big\| \gamma \widetilde{\Sigma}^{1/2} \widetilde{\Sigma}_\gamma^{-1} \Sigma \Sigma_\lambda^{-1} v^* \big\|^2}_{\text{ridge bias}}. $$
We now take the supremum over $\| f^* \|_{\mathcal{H}} \leq R$, or equivalently over $\| v^* \|_2 \leq R$ in the representation $f^*(x) = (v^*)^\top \phi(x)$. In this way, we find that
$$ \sup_{\| f^* \|_{\mathcal{H}} \leq R} B^2_{\mathrm{SM}}(f^*) \geq \sup_{\| v^* \|_2 \leq R} \big\| \lambda \widetilde{\Sigma}^{1/2} \Sigma_\lambda^{-1} v^* \big\|^2 = R^2 \underbrace{\big|\big|\big| \lambda \widetilde{\Sigma}^{1/2} \Sigma_\lambda^{-1} \big|\big|\big|^2}_{:= \, c_1} = c_1 R^2. $$
Note that for any fixed $\lambda > 0$, the quantity $c_1 > 0$ is an absolute constant, independent of $(n, R, \sigma, \gamma)$, which completes the proof.

5.2.5 Proof of the minimax-optimal lower bound (30c)

We prove this lower bound via an Assouad construction, using a particular diagonal feature model. Let $\{ e_j \}_{j=1}^n$ denote the standard basis of $\mathbb{R}^n$, and define a feature map $\phi: \mathcal{X} \to \mathbb{R}^n$ such that
$$ \phi(x_k) = \sqrt{n} \, k^{-\alpha} e_k, \quad \text{and} \quad \phi(\widetilde{x}_k) = \sqrt{n} \, k^{-\beta} e_k. $$
This feature map defines a finite-dimensional Hilbert space $\mathcal{H}$ of functions $f_v(x) := \langle v, \phi(x) \rangle$ with norm $\| f_v \|_{\mathcal{H}} = \| v \|_2$, and kernel function $K(x, x') = \langle \phi(x), \phi(x') \rangle$.
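The diagonal feature model can be instantiated directly; the following numpy sketch (which takes $m = n$ for simplicity, an assumption of this check only) confirms that the induced source and target second-moment matrices are diagonal with entries $k^{-2\alpha}$ and $k^{-2\beta}$:

```python
import numpy as np

n, alpha, beta = 50, 1.0, 0.8
k = np.arange(1, n + 1)

# Rows of Phi are phi(x_k) = sqrt(n) * k^{-alpha} * e_k; similarly for
# the target features with exponent beta (taking m = n in this sketch).
Phi = np.diag(np.sqrt(n) * k ** (-alpha))
Phi_t = np.diag(np.sqrt(n) * k ** (-beta))

Sigma = Phi.T @ Phi / n        # second-moment matrix under uniform P
Sigma_t = Phi_t.T @ Phi_t / n  # second-moment matrix under uniform Q

assert np.allclose(Sigma, np.diag(k ** (-2 * alpha)))
assert np.allclose(Sigma_t, np.diag(k ** (-2 * beta)))
```

The $\sqrt{n}$ scaling in the features is exactly what cancels the $1/n$ averaging in the empirical second moment, leaving the polynomial eigendecay.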
By construction, with $Q$ uniform over $\{ \widetilde{x}_j \}_{j=1}^m$ and $P$ uniform over $\{ x_i \}_{i=1}^n$, the associated covariance operators are diagonal with eigenvalues $k^{-2\alpha}$ and $k^{-2\beta}$, so that the kernel matrices $(K_{nn}, \widetilde{K}_{mm})$ exhibit $(\alpha, \beta)$-eigendecay.

Consider the true function $f^*(x) = f_{v^*}(x)$. Given our diagonal construction, the source observations take the form $Y_k \sim \mathcal{N}\big( \sqrt{n} \, k^{-\alpha} v^*_k, \, \sigma^2 \big)$. For any estimator $\widehat{f}$, define $\widehat{v}_k = \frac{k^\beta}{\sqrt{n}} \widehat{f}(\widetilde{x}_k)$. Then a direct substitution yields $\| \widehat{f} - f^* \|_m^2 = \sum_{k=1}^n k^{-2\beta} (\widehat{v}_k - v^*_k)^2$, where we have used the orthogonality of the feature vectors. Thus, it suffices to lower bound
$$ L(\widehat{v}, v^*) := \sum_{k=1}^n k^{-2\beta} (\widehat{v}_k - v^*_k)^2, \tag{48} $$
so that our problem has been reduced to estimating the vector $v^*$ in a weighted Euclidean norm.

Lower bound via hypercube construction: We now construct a set of functions indexed by a Boolean vector $\tau \in \{-1, +1\}^M$ for an integer $M \in [1, n]$. We make a specific choice of $M$ later in the argument; in particular, see equation (50b). Fixing $\eta = 1/4$, for each Boolean vector $\tau \in \{-1, +1\}^M$, we define the vector $v^\tau \in \mathbb{R}^n$ with components
$$ v^\tau_k := \begin{cases} \tau_k \delta_k & \text{for } k = 1, \ldots, M, \\ 0 & \text{otherwise,} \end{cases} $$
where $\delta_k = \eta \sigma \frac{k^\alpha}{\sqrt{n}}$. We require $M \leq n$, and $M$ to be small enough so that
$$ \| v^\tau \|_2^2 = \sum_{k=1}^M \delta_k^2 = \frac{\eta^2 \sigma^2}{n} \sum_{k=1}^M k^{2\alpha} \leq R^2. \tag{49} $$
This latter property ensures that each vector $v^\tau$ indexes a function within our class. Letting $\mathbb{P}_\tau$ denote the law of $Y$ under $v^\tau$, we have $Y \sim \mathcal{N}(\mu^\tau, \sigma^2 I)$, where $\mu^\tau_k := \sqrt{n} \, k^{-\alpha} v^\tau_k$.

Our proof is based on two auxiliary claims. First, for any valid $M$, the worst-case estimation error over this family satisfies the lower bound
$$ \sup_{\tau \in \{-1, +1\}^M} \mathbb{E}_\tau \big[ L(\widehat{v}, v^\tau) \big] \geq c_{\alpha, \beta} \frac{\sigma^2}{n} M^{2\alpha - 2\beta + 1}, \tag{50a} $$
where $c_{\alpha, \beta} > 0$ is a constant depending only on the exponents $(\alpha, \beta)$.
Second, the choice
$$ M := \Big\lceil c_* \Big( \frac{n R^2}{\sigma^2} \Big)^{\frac{1}{2\alpha + 1}} \Big\rceil \quad \text{for a sufficiently small } c_* > 0 \tag{50b} $$
is valid, meaning that $M \leq n$, and the bound (49) holds.

Observe that the lower bound (30c) follows from these two claims. In particular, we substitute the choice (50b) into inequality (50a). Ignoring the constants, this yields a lower bound that scales as
$$ \frac{\sigma^2}{n} M^{2\alpha - 2\beta + 1} = \frac{\sigma^2}{n} \Big( \frac{n R^2}{\sigma^2} \Big)^{\frac{2\alpha - 2\beta + 1}{2\alpha + 1}} = R^2 \frac{\sigma^2}{n R^2} \Big( \frac{n R^2}{\sigma^2} \Big)^{\frac{2\alpha - 2\beta + 1}{2\alpha + 1}} = R^2 \Big( \frac{\sigma^2}{n R^2} \Big)^{\frac{2\beta}{2\alpha + 1}}, $$
as claimed in equation (30c). It remains to prove the two auxiliary results (50a) and (50b).

Proof of the lower bound (50a): Given any estimate $\widehat{v}$ of the vector $v^\tau$, define the signs $\widehat{\tau}_k = \mathrm{sign}(\widehat{v}_k)$. If $\widehat{\tau}_k \neq \tau_k$, then $| \widehat{v}_k - v^\tau_k | \geq \delta_k$, whence $(\widehat{v}_k - v^\tau_k)^2 \geq \delta_k^2 \mathbf{1}\{ \widehat{\tau}_k \neq \tau_k \}$. Summing over the indices $k = 1, \ldots, M$ and taking expectations yields the lower bound
$$ \mathbb{E}_\tau \big[ L(\widehat{v}, v^\tau) \big] \geq \sum_{k=1}^M k^{-2\beta} \delta_k^2 \, \mathbb{P}_\tau \big[ \widehat{\tau}_k \neq \tau_k \big]. $$
In order to apply Assouad's lemma [Tsy09], we need to verify its testing conditions, which involve making single flips to the Boolean vector. For each $\tau \in \{-1, +1\}^M$ and any $k \in [M]$, let $\tau^{(k)}$ denote the vector obtained by flipping the $k$-th coordinate of $\tau$. Under our construction, the Kullback-Leibler divergence can be upper bounded as
$$ \mathrm{KL}\big( \mathbb{P}_\tau \, \| \, \mathbb{P}_{\tau^{(k)}} \big) = \frac{1}{2\sigma^2} \big( 2 \sqrt{n} \, k^{-\alpha} \delta_k \big)^2 = 2\eta^2. $$
Combining with Pinsker's inequality yields the TV upper bound $\mathrm{TV}\big( \mathbb{P}_\tau, \mathbb{P}_{\tau^{(k)}} \big) \leq \eta = 1/4$. Applying Assouad's lemma with weights $\Delta_k = k^{-2\beta} \delta_k^2$ then guarantees that
$$ \sup_{\tau \in \{-1, +1\}^M} \mathbb{E}_\tau \, L(\widehat{v}, v^\tau) \geq \frac{1 - \eta}{2} \sum_{k=1}^M k^{-2\beta} \delta_k^2 = \frac{3}{8} \sum_{k=1}^M k^{-2\beta} \delta_k^2, \tag{51a} $$
where the last equality uses our choice $\eta = 1/4$. Recalling our choice $\delta_k^2 = \frac{\eta^2 \sigma^2}{n} k^{2\alpha}$ with $\eta = 1/4$, we have
$$ \sum_{k=1}^M k^{-2\beta} \delta_k^2 = \frac{\eta^2 \sigma^2}{n} \sum_{k=1}^M k^{2\alpha - 2\beta} = \frac{1}{16} \frac{\sigma^2}{n} \sum_{k=1}^M k^{2\alpha - 2\beta}. \tag{51b} $$
Now our assumption that $2\beta < 2\alpha + 1$ implies that $2\alpha - 2\beta > -1$, so that
$$ \sum_{k=1}^M k^{2\alpha - 2\beta} \geq c_{\alpha, \beta} M^{2\alpha - 2\beta + 1} $$
for some constant $c_{\alpha, \beta} > 0$. Combining with the lower bounds (51a) and (51b) yields the claim (50a).

Proof of validity of $M$: We now need to prove that, for a sufficiently small constant $c_* > 0$, the choice (50b) of $M$ is valid, meaning that $M \leq n$, and the bound (49) holds. First, given our assumption $\frac{R^2}{\sigma^2} \leq n^{2\alpha}$, we have $\big( \frac{n R^2}{\sigma^2} \big)^{\frac{1}{2\alpha + 1}} \leq n$, so that $M \leq c \, c_* \, n \leq n$ as long as $c_*$ is sufficiently small. As for the condition (49), we can write
$$ \| v^\tau \|_2^2 = \sum_{k=1}^M \delta_k^2 = \frac{\eta^2 \sigma^2}{n} \sum_{k=1}^M k^{2\alpha} \leq \frac{\eta^2 \sigma^2}{n} \Big( 1 + \int_1^M x^{2\alpha} \, dx \Big) \leq c_\alpha \eta^2 \frac{\sigma^2}{n} M^{2\alpha + 1} $$
for a constant $c_\alpha > 0$. Substituting our choice (50b) yields the upper bound
$$ \| v^\tau \|_2^2 \leq c_\alpha \eta^2 \sigma^2 \cdot \frac{1}{n} \Big( c_* \Big( \frac{n R^2}{\sigma^2} \Big)^{\frac{1}{2\alpha + 1}} \Big)^{2\alpha + 1} = c_\alpha \eta^2 c_*^{2\alpha + 1} R^2. $$
Since $\eta = 1/4$ and $c_\alpha$ is universal, choosing $c_*$ sufficiently small ensures that $\| v^\tau \|_2^2 \leq R^2$, as claimed in (49).

5.3 Proof of Theorem 3

We prove each of the two claims in turn.

5.3.1 Proof of Theorem 3(a)

Introduce the shorthand $\widehat{H}_\eta(f) = \mathrm{Prox}_\eta\big( f - \eta \widehat{G}(f) \big)$, so that the algorithm generates the sequence $f_{k+1} = \widehat{H}_\eta(f_k)$. Our proof hinges on the following claim: with the stepsize $\eta = 1/\widehat{\beta}$, the update operator $\widehat{H}_\eta$ satisfies the inequality
$$ \| \widehat{H}_\eta(f) - \widehat{H}_\eta(\widehat{f}) \|_2^2 \leq \| f - \widehat{f} \|_2^2 - \frac{1}{2} \big\| \big( f - \widehat{H}_\eta(f) \big) - \big( \widehat{f} - \widehat{H}_\eta(\widehat{f}) \big) \big\|_2^2 + \frac{2 m \varepsilon^2}{\widehat{\beta}^2}, \tag{52} $$
where $\widehat{f}$ is any RaT fixed point. Taking this inequality as given, let us complete the proof of the claim (34a). We apply the bound (52) with $f = f_k$, so that $\widehat{H}_\eta(f) = f_{k+1}$ by definition of the Picard iteration.
Doing so, using the fact that $\widehat{f} = \widehat{H}_\eta(\widehat{f})$, and re-arranging yields
$$ \frac{1}{2} \| \widehat{D}_\eta(f_k) \|_2^2 = \frac{1}{2} \| f_k - \widehat{H}_\eta(f_k) \|_2^2 \leq \| f_k - \widehat{f} \|_2^2 - \| f_{k+1} - \widehat{f} \|_2^2 + \frac{2 m \varepsilon^2}{\widehat{\beta}^2}. $$
Summing this recursion and using the telescoping property, we find that
$$ \frac{1}{2(K + 1)} \sum_{k=0}^{K} \| \widehat{D}_\eta(f_k) \|_2^2 \leq \frac{\| f_0 - \widehat{f} \|_2^2}{K + 1} + \frac{2 m \varepsilon^2}{\widehat{\beta}^2}. $$
Noting that the minimum is less than the average, and rescaling both sides by $1/m$ to convert to the norm $\| \cdot \|_m$, the claim follows. It remains to prove our auxiliary claim.

Proof of the bound (52): In the standard Euclidean inner product and norm, the $(\widehat{\beta}, \varepsilon)$-co-coercivity condition takes the form
$$ \big\langle f - \widehat{f}, \; \widehat{G}(f) - \widehat{G}(\widehat{f}) \big\rangle \geq \frac{1}{\widehat{\beta}} \Big( \| \widehat{G}(f) - \widehat{G}(\widehat{f}) \|_2^2 - m \varepsilon^2 \Big). \tag{53} $$
Introduce the shorthand $\widehat{S}(f) := f - \eta \widehat{G}(f)$. We then have
$$ \| \widehat{S}(f) - \widehat{S}(\widehat{f}) \|_2^2 = \| (f - \widehat{f}) - \eta \big( \widehat{G}(f) - \widehat{G}(\widehat{f}) \big) \|_2^2 = \| f - \widehat{f} \|_2^2 + \eta^2 \| \widehat{G}(f) - \widehat{G}(\widehat{f}) \|_2^2 - 2\eta \big\langle f - \widehat{f}, \; \widehat{G}(f) - \widehat{G}(\widehat{f}) \big\rangle $$
$$ \overset{(i)}{\leq} \| f - \widehat{f} \|_2^2 - \Big( \frac{2}{\eta \widehat{\beta}} - 1 \Big) \eta^2 \| \widehat{G}(f) - \widehat{G}(\widehat{f}) \|_2^2 + \frac{2 \eta m \varepsilon^2}{\widehat{\beta}} \overset{(ii)}{=} \| f - \widehat{f} \|_2^2 - \Big( \frac{2}{\eta \widehat{\beta}} - 1 \Big) \| (I - \widehat{S})(f) - (I - \widehat{S})(\widehat{f}) \|_2^2 + \frac{2 \eta m \varepsilon^2}{\widehat{\beta}}, $$
where step (i) uses the $(\widehat{\beta}, \varepsilon)$-co-coercivity condition, and step (ii) uses the fact that $I - \widehat{S} = \eta \widehat{G}$ by definition. Setting $\eta = 1/\widehat{\beta}$ yields
$$ \| \widehat{S}(f) - \widehat{S}(\widehat{f}) \|_2^2 \leq \| f - \widehat{f} \|_2^2 - \| (I - \widehat{S})(f) - (I - \widehat{S})(\widehat{f}) \|_2^2 + \frac{2 m \varepsilon^2}{\widehat{\beta}^2}. \tag{54} $$
Next we use the bound (54) to prove inequality (52). Since the proximal operator $\mathrm{Prox}_\eta$ is firmly non-expansive [PB14; BC17], we have
$$ \| \mathrm{Prox}_\eta(u) - \mathrm{Prox}_\eta(v) \|_2^2 \leq \| u - v \|_2^2 - \| (I - \mathrm{Prox}_\eta)(u) - (I - \mathrm{Prox}_\eta)(v) \|_2^2. $$
We apply this inequality with $u = \widehat{S}(f)$ and $v = \widehat{S}(\widehat{f})$.
These choices ensure that $\mathrm{Prox}_\eta(u) = \widehat{H}_\eta(f)$ and $\mathrm{Prox}_\eta(v) = \widehat{H}_\eta(\widehat{f})$, so that we obtain
$$ \| \widehat{H}_\eta(f) - \widehat{H}_\eta(\widehat{f}) \|_2^2 \leq \| \widehat{S}(f) - \widehat{S}(\widehat{f}) \|_2^2 - \| (I - \mathrm{Prox}_\eta)(u) - (I - \mathrm{Prox}_\eta)(v) \|_2^2. $$
Combined with our earlier inequality (54), we have
$$ \| \widehat{H}_\eta(f) - \widehat{H}_\eta(\widehat{f}) \|_2^2 \leq \| f - \widehat{f} \|_2^2 + \frac{2 m \varepsilon^2}{\widehat{\beta}^2} - D, $$
where
$$ D := \big\| \big( f - \widehat{S}(f) \big) - \big( \widehat{f} - \widehat{S}(\widehat{f}) \big) \big\|_2^2 + \big\| \big( \widehat{S}(f) - \widehat{H}_\eta(f) \big) - \big( \widehat{S}(\widehat{f}) - \widehat{H}_\eta(\widehat{f}) \big) \big\|_2^2. $$
In order to complete the proof, it suffices to show that $D \geq \frac{1}{2} \| ( f - \widehat{H}_\eta(f) ) - ( \widehat{f} - \widehat{H}_\eta(\widehat{f}) ) \|_2^2$. In terms of the shorthand $\Delta_1 = ( f - \widehat{S}(f) ) - ( \widehat{f} - \widehat{S}(\widehat{f}) )$ and $\Delta_2 = ( \widehat{S}(f) - \widehat{H}_\eta(f) ) - ( \widehat{S}(\widehat{f}) - \widehat{H}_\eta(\widehat{f}) )$, this inequality is equivalent to showing that
$$ \| \Delta_1 \|_2^2 + \| \Delta_2 \|_2^2 \geq \frac{1}{2} \| \Delta_1 + \Delta_2 \|_2^2, $$
which follows since $\langle \Delta_1, \Delta_2 \rangle \leq \frac{1}{2} \| \Delta_1 \|_2^2 + \frac{1}{2} \| \Delta_2 \|_2^2$ by Young's inequality.

5.3.2 Proof of Theorem 3(b)

We claim that with the stepsize $\eta = \widehat{\mu}/\widehat{\beta}^2$, the operator $\widehat{H}_\eta$ is approximately contractive around any fixed point $\widehat{f}$: in particular, we have
$$ \| \widehat{H}_\eta(f) - \widehat{H}_\eta(\widehat{f}) \|_m^2 \leq \gamma \| f - \widehat{f} \|_m^2 + 4 (1 - \gamma) \varepsilon^2, \tag{55a} $$
where $\gamma := 1 - \widehat{\mu}^2/\widehat{\beta}^2$. We return to prove this property shortly. Taking it as given, let us prove the bound (34b). Applying the inequality (55a) repeatedly and unwinding the recursion yields
$$ \| f_k - \widehat{f} \|_m^2 \leq \gamma^k \| f_0 - \widehat{f} \|_m^2 + \frac{4 (1 - \gamma) \varepsilon^2}{1 - \gamma} = \gamma^k \| f_0 - \widehat{f} \|_m^2 + 4 \varepsilon^2. \tag{55b} $$
Noting that $\| \widehat{D}_\eta(f_K) \|_m = \| f_K - \widehat{H}_\eta(f_K) \|_m = \| f_K - f_{K+1} \|_m$, we can write
$$ \| \widehat{D}_\eta(f_K) \|_m^2 \overset{(i)}{\leq} 2 \Big( \| f_K - \widehat{f} \|_m^2 + \| f_{K+1} - \widehat{f} \|_m^2 \Big) \overset{(ii)}{\leq} 2 \Big( 1 - \frac{\widehat{\mu}^2}{\widehat{\beta}^2} \Big)^K \| f_0 - \widehat{f} \|_m^2 + 8 \varepsilon^2, $$
as claimed in equation (34b). Here step (i) follows from the triangle inequality, whereas step (ii) follows by applying inequality (55b).
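To illustrate part (b), the geometric convergence of the Picard iteration can be observed in a small numpy sketch. All specifics here are illustrative assumptions rather than the estimators from the main text: a quadratic empirical loss with curvature between $\widehat{\mu} = 0.5$ and $\widehat{\beta} = 2$, a gradient oracle $\widehat{G}$ perturbed by a fixed vector (playing the role of teacher error), and a ridge penalty whose proximal map is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 20
# Quadratic loss L(f) = 0.5 * (f - y)^T A (f - y), with eigenvalues
# of A in [mu_hat, beta_hat] = [0.5, 2.0].
V, _ = np.linalg.qr(rng.normal(size=(m, m)))
A = V @ np.diag(rng.uniform(0.5, 2.0, size=m)) @ V.T
y = rng.normal(size=m)
e = 0.1 * rng.normal(size=m)            # fixed "teacher" perturbation
lam = 1.0
eta = 0.5 / 2.0 ** 2                    # stepsize mu_hat / beta_hat^2

G = lambda f: A @ (f - y) + e           # perturbed gradient oracle
prox = lambda u: u / (1.0 + eta * lam)  # prox of (lam/2)*||.||^2

# Locate the fixed point by running the iteration to convergence.
f = np.zeros(m)
for _ in range(500):
    f = prox(f - eta * G(f))
f_fix = f.copy()

# Restart from zero and record distances to the fixed point.
f = np.zeros(m)
errs = [np.linalg.norm(f - f_fix)]
for _ in range(20):
    f = prox(f - eta * G(f))
    errs.append(np.linalg.norm(f - f_fix))

# The update is an affine contraction here, so the distance to the
# fixed point decays geometrically.
assert all(errs[i + 1] < errs[i] for i in range(10))
```

In this quadratic special case the update is the affine map $f \mapsto (I - \eta A)(f)/(1 + \eta \lambda) + \text{const}$, whose spectral norm is strictly below one for the chosen stepsize; the general theorem replaces this exact contraction with the approximate contraction (55a).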
Proof of the bound (55a): We first derive a consequence of the approximate co-coercivity condition (53). By re-arranging it, we can write
\[
\|\hat G(f) - \hat G(\hat f)\|_2^2 \;\le\; \hat\beta\big\langle f - \hat f,\; \hat G(f) - \hat G(\hat f)\big\rangle + m\varepsilon^2 \;\overset{(*)}{\le}\; \frac{\hat\beta^2}{2}\|f - \hat f\|_2^2 + \frac{1}{2}\|\hat G(f) - \hat G(\hat f)\|_2^2 + m\varepsilon^2,
\]
where inequality $(*)$ follows from the Cauchy–Schwarz inequality. Re-arranging yields
\[
\|\hat G(f) - \hat G(\hat f)\|_2^2 \;\le\; \hat\beta^2\|f - \hat f\|_2^2 + 2m\varepsilon^2. \tag{56}
\]
We first show that the operator $\hat S_\eta(f) = f - \eta\hat G(f)$ is $\gamma$-contractive in the given sense (55a). We write
\begin{align*}
\|\hat S_\eta(f) - \hat S_\eta(\hat f)\|_2^2 &= \|(f-\hat f) - \eta(\hat G(f) - \hat G(\hat f))\|_2^2 \\
&= \|f-\hat f\|_2^2 + \eta^2\|\hat G(f) - \hat G(\hat f)\|_2^2 - 2\eta\big\langle f-\hat f,\; \hat G(f)-\hat G(\hat f)\big\rangle \\
&\le \|f-\hat f\|_2^2 + \eta^2\hat\beta^2\big(\|f-\hat f\|_2^2 + 2m\varepsilon^2\big) - 2\eta\hat\mu\big(\|f-\hat f\|_2^2 - m\varepsilon^2\big),
\end{align*}
where the last step makes use of inequality (56), along with the $(\hat\mu, \varepsilon)$-approximate monotonicity condition (33b). Re-arranging terms yields
\[
\|\hat S_\eta(f) - \hat S_\eta(\hat f)\|_2^2 \;\le\; \big(1 + \eta^2\hat\beta^2 - 2\eta\hat\mu\big)\|f-\hat f\|_2^2 + 2\big(\eta^2\hat\beta^2 + \eta\hat\mu\big)m\varepsilon^2,
\]
and after setting $\eta = \hat\mu/\hat\beta^2$, we find that
\[
\|\hat S_\eta(f) - \hat S_\eta(\hat f)\|_2^2 \;\le\; \Big(1 - \frac{\hat\mu^2}{\hat\beta^2}\Big)\|f-\hat f\|_2^2 + \frac{4\hat\mu^2}{\hat\beta^2}\,m\varepsilon^2 = \gamma\|f-\hat f\|_2^2 + 4(1-\gamma)m\varepsilon^2,
\]
where we have made use of the definition $\gamma = 1 - (\hat\mu/\hat\beta)^2$. Since the proximal operator is non-expansive, we have
\[
\|\hat H_\eta(f) - \hat H_\eta(\hat f)\|_2^2 = \|\mathrm{Prox}_\eta(\hat S_\eta(f)) - \mathrm{Prox}_\eta(\hat S_\eta(\hat f))\|_2^2 \;\le\; \|\hat S_\eta(f) - \hat S_\eta(\hat f)\|_2^2 \;\le\; \gamma\|f-\hat f\|_2^2 + 4(1-\gamma)m\varepsilon^2.
\]
Rescaling both sides by $1/m$ to convert to the $\|\cdot\|_m$-norm completes the proof of the bound (55a).
6 Discussion

We studied statistical estimation in a student–teacher setting, where the predictions of a black-box predictive method (the teacher) are used to guide training of a second model (the student), which might be simpler, more interpretable, or better adapted to some form of covariate shift. We introduced a precise notion of a student oracle estimand $f^\dagger$, defined via the fixed point of a proximal update applied to a target population objective. This perspective naturally leads to the residual-as-teacher (RaT) estimator, which mimics this proximal update by using the teacher to estimate the student's residuals. We proved various guarantees for this estimator, including statistical bounds (Theorem 1, Proposition 2 and Corollary 1), as well as computational bounds on an iterative algorithm used to compute the RaT fixed point (Theorem 3). In addition, we also provided theory for, and comparisons with, the standard student soft-matching (SM) approach to this problem. This theory reveals that the two methods differ fundamentally in how they handle bias or mis-specification present in the teacher (cf. Corollary 1 and Proposition 2). Moreover, for kernel-based student–teacher pairs, we established a separation result (Theorem 2): when the student regularization parameter $\gamma$ is appropriately tuned, the RaT estimator achieves the minimax-optimal rate, whereas the SM estimator is inconsistent for any choice of student tuning parameter.

Our work leaves open a number of interesting questions. First, we have shown that RaT achieves minimax-optimal rates for a specific class of student–teacher problems. Whether RaT remains minimax-optimal beyond this class, particularly for more general student–teacher pairs and non-kernel settings, is an important open question.
Another important direction is to better understand the statistical properties of teacher-based gradient estimation under covariate shift, including the distinction between benign and malign covariate shift. We gave a precise characterization in one setting (Theorem 2), and suspect that qualitatively similar results hold more generally. Finally, a more general understanding of the interaction between the student and teacher function classes would be valuable. Our results show that the ability of RaT to mitigate bias depends on this interaction, but a systematic characterization remains open.

Acknowledgements

This work was partially funded by National Science Foundation Grant NSF DMS-2311072; Office of Naval Research ONR Grant N00014026-1-2116, and the Ford Professorship to MJW. KY was supported by the Takenaka Scholarship Foundation.

References

[Ara+20] E. Arazo et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning". In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE. 2020, pp. 1–8.

[Bas+20] R. Basri et al. "The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies". In: Advances in Neural Information Processing Systems (NeurIPS) (2020).

[BC14] J. Ba and R. Caruana. "Do Deep Nets Really Need to be Deep?" In: Advances in Neural Information Processing Systems 27 (NeurIPS). 2014. url: https://papers.nips.cc/paper_files/paper/2014/file/ea8fcd92d59581717c0b1c6866a3b2f8-Paper.pdf.

[BC17] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd ed. Springer, 2017.

[BCNM06] C. Bucila, R. Caruana, and A. Niculescu-Mizil. "Model Compression". In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2006. url: https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf.
[Bec17] A. Beck. First-Order Methods in Optimization. SIAM, 2017.

[Bre01] L. Breiman. "Random Forests". In: Machine Learning 45.1 (2001), pp. 5–32.

[Bre+84] L. Breiman et al. Classification and Regression Trees. Wadsworth, 1984.

[Bre98] L. Breiman. "Arcing classifier (with discussion and a rejoinder by the author)". In: The Annals of Statistics 26.3 (1998), pp. 801–849.

[BS96] L. Breiman and N. Shang. Born Again Trees. Tech. rep. University of California, Berkeley, 1996. url: https://www.stat.berkeley.edu/~breiman/BAtrees.pdf.

[BY03] P. Bühlmann and B. Yu. "Boosting with the L2 Loss: Regression and Classification". In: Journal of the American Statistical Association 98.462 (2003), pp. 324–339.

[CG16] T. Chen and C. Guestrin. "XGBoost: A Scalable Tree Boosting System". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2016, pp. 785–794.

[CS96] M. W. Craven and J. W. Shavlik. "Extracting Tree-Structured Representations of Trained Networks". In: Advances in Neural Information Processing Systems (NeurIPS). 1996.

[Dee25] DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". In: (2025). arXiv: 2501.12948 [cs.CL].

[FG96] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. CRC Press, 1996.

[FH17] N. Frosst and G. Hinton. "Distilling a Neural Network Into a Soft Decision Tree". In: arXiv preprint arXiv:1711.09784 (2017). url: https://arxiv.org/pdf/1711.09784.pdf.

[FHT00] J. Friedman, T. Hastie, and R. Tibshirani. "Additive Logistic Regression: A Statistical View of Boosting". In: The Annals of Statistics 28.2 (2000), pp. 337–407.

[Fri01] J. H. Friedman. "Greedy function approximation: A gradient boosting machine". In: Annals of Statistics (2001), pp. 1189–1232.

[Fur+18] T. Furlanello et al. "Born-Again Neural Networks".
In: Proceedings of Machine Learning Research. Vol. 80. 2018, pp. 1607–1616. url: https://proceedings.mlr.press/v80/furlanello18a/furlanello18a.pdf.

[Gee00] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[Gu02] C. Gu. Smoothing Spline ANOVA Models. Springer Series in Statistics. New York, NY: Springer, 2002.

[HD19] D. Hendrycks and T. Dietterich. "Benchmarking Neural Network Robustness to Common Corruptions and Perturbations". In: ICLR. 2019.

[He+16] K. He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778.

[How19] J. Howard. Imagenette: A smaller subset of 10 easily classified classes from ImageNet. 2019. url: https://github.com/fastai/imagenette.

[HVD15] G. Hinton, O. Vinyals, and J. Dean. "Distilling the Knowledge in a Neural Network". In: arXiv preprint arXiv:1503.02531 (2015). url: https://arxiv.org/pdf/1503.02531.pdf.

[Nad64] E. A. Nadaraya. "On estimating regression". In: Theory of Probability & Its Applications 9.1 (1964), pp. 141–142.

[Nes18] Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.

[PB14] N. Parikh and S. Boyd. "Proximal Algorithms". In: Foundations and Trends in Optimization 1.3 (2014), pp. 127–239. doi: 10.1561/2400000003.

[Rah+19] N. Rahaman et al. "On the Spectral Bias of Neural Networks". In: International Conference on Machine Learning (ICML). PMLR. 2019, pp. 5301–5310.

[RP96] R. T. Rockafellar and R. A. Poliquin. "Prox-regular functions in variational analysis". In: Transactions of the American Mathematical Society 348.5 (1996), pp. 1805–1838.

[RW98] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Vol. 317. Grundlehren der mathematischen Wissenschaften. Springer, 1998.

[RWY14] G. Raskutti, M. J. Wainwright, and B. Yu.
"Early stopping and non-parametric regression: An optimal data-dependent stopping rule". In: Journal of Machine Learning Research 15 (2014), pp. 335–366.

[Tel16] M. Telgarsky. "Benefits of Depth in Neural Networks". In: Proceedings of the 29th Annual Conference on Learning Theory. Ed. by V. Feldman, A. Rakhlin, and O. Shamir. Vol. 49. Proceedings of Machine Learning Research. PMLR, 2016, pp. 1517–1539.

[Tsy09] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. New York: Springer, 2009. isbn: 978-0-387-79051-0.

[TV17] A. Tarvainen and H. Valpola. "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results". In: Advances in Neural Information Processing Systems (NeurIPS) (2017). arXiv: 1703.01780 [cs.NE].

[VSR20] T. Vidal, M. Schiffer, and S. Ropke. "Born-Again Tree Ensembles". In: Proceedings of Machine Learning Research. Vol. 119. 2020, pp. 9740–9750. url: https://proceedings.mlr.press/v119/vidal20a/vidal20a.pdf.

[VW96] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. New York, NY: Springer-Verlag, 1996.

[Wah90] G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA: SIAM, 1990.

[Wai19] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge, UK: Cambridge University Press, 2019.

[Wat64] G. S. Watson. "Smooth regression analysis". In: Sankhyā: The Indian Journal of Statistics, Series A 26.4 (1964), pp. 359–372.

[WYW19] Y. Wei, F. Yang, and M. J. Wainwright. "Early stopping for kernel boosting algorithms: A general analysis with localized complexities". In: IEEE Trans. Info. Theory 65.10 (2019), pp. 6685–6703.

[Xie+20] Q. Xie et al. "Self-Training with Noisy Student Improves ImageNet Classification". In: CVPR. 2020.
url: https://openaccess.thecvf.com/content_CVPR_2020/papers/Xie_Self-Training_With_Noisy_Student_Improves_ImageNet_Classification_CVPR_2020_paper.pdf.

[XZX20] Z.-Q. J. Xu, Y. Zhang, and Y. Xiao. "Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks". In: Communications in Computational Physics 28.5 (2020), pp. 1746–1767. doi: 10.4208/cicp.OA-2020-0088.

[ZY05] T. Zhang and B. Yu. "Boosting with Early Stopping: Convergence and Consistency". In: The Annals of Statistics 33.4 (2005), pp. 1538–1579.

A Proximal operators and fixed points

In this appendix, we collect together various properties of proximal updates, pertinent to both the oracle function $f^\dagger$ and the RaT estimate $\hat f_{\mathrm{RaT}}$. As in the proof of Theorem 1, we define the function
\[
\mathrm{Pen}_m(u) := \min_{\{f \in \mathcal{F} \,\mid\, f(\tilde x_1^m) = u\}} \mathrm{Pen}(f).
\]

A.1 Proximal fixed point for $f^\dagger$

Let us clarify why $f^\dagger$ satisfies the proximal fixed point equation. Consider a function $\tilde f$ that satisfies the proximal fixed point relation
\[
\tilde f = \mathrm{Prox}_\eta\big( \tilde f(\tilde x_1^m) - \eta\nabla\bar L_m(\tilde f) \big).
\]
Letting $\tilde z \in \partial\mathrm{Pen}_m(\tilde f(\tilde x_1^m))$, the optimality conditions associated with the proximal update imply that $\nabla\bar L_m(\tilde f) + \tilde z = 0$. But this is exactly the zero sub-gradient condition that defines the student oracle estimand $f^\dagger$.

A.2 RaT fixed points

Recall that we have defined the RaT estimator via the fixed point equation
\[
\hat f_{\mathrm{RaT}} = \mathrm{Prox}_\eta\big( \hat f_{\mathrm{RaT}}(\tilde x_1^m) - \eta\,\hat G(\hat f_{\mathrm{RaT}}) \big) \tag{57}
\]
for some stepsize $\eta > 0$. In order for this definition to be meaningful, it should be the case that the set of fixed points is independent of the choice of stepsize. Here we establish this basic fact. Consider any function $\hat f$ that satisfies the fixed point relation (57) for some stepsize $\eta > 0$.
Expanding the definition of the proximal update, we have
\[
\hat f \in \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{2\eta}\|f(\tilde x_1^m) - u\|_2^2 + \mathrm{Pen}_m(f(\tilde x_1^m)) \Big\}, \quad\text{where } u = \hat f(\tilde x_1^m) - \eta\hat G(\hat f).
\]
The optimality conditions for this optimization problem imply that there exists some $\hat z \in \partial\mathrm{Pen}_m(\hat f(\tilde x_1^m))$ such that
\[
0 = \frac{1}{\eta}\big( \hat f(\tilde x_1^m) - u \big) + \hat z = \hat G(\hat f) + \hat z.
\]
Observe that the right-hand side is independent of $\eta$, so that the set of fixed points does not depend on $\eta$, as claimed.

B Additional proofs

In this appendix, we collect together various proofs of auxiliary results, including the proof of Proposition 1 in Section B.1; the proof of Corollary 1 in Section B.2; and the proof of Proposition 2 in Section B.3.

B.1 Proof of Proposition 1

The claim of this proposition follows by arguments analogous to those used to analyze the RaT estimator in the proof of Theorem 1. In particular, beginning with the definition (12b) of the SM estimator, it follows that it satisfies a proximal fixed point equation of the form
\[
\hat f_{\mathrm{SM}} = \mathrm{Prox}_\eta\big( \hat f_{\mathrm{SM}}(\tilde x_1^m) - \eta\nabla L^{\mathrm{SM}}_m(\hat f_{\mathrm{SM}}) \big). \tag{58}
\]
Consequently, it can be viewed as a variant of the RaT estimator, in which the approximate gradient $\hat G(f)$ used in RaT is replaced by $\nabla L^{\mathrm{SM}}_m(f)$. The arguments used in the proof of Theorem 1 do not exploit any specific properties of the gradient approximation used, as long as it satisfies a proximal fixed point equation of the form (58). Consequently, we can recapitulate these same arguments, mutatis mutandis, in order to derive the analogous guarantees for the SM estimator.

B.2 Proof of Corollary 1

We divide our proof into two parts, corresponding to each of the two claims (20a) and (20b). Recall that the least-squares loss is strongly convex with $\mu = 1$, so that the bounds (15b) and (9b) are in force for the SM and RaT estimators, respectively.
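The stepsize-independence of the fixed-point set can also be illustrated numerically in the kernel/ridge instantiation analyzed in Appendix B.3, where the proximal map is linear. The sketch below uses synthetic stand-ins (the kernel matrices, the response-linear teacher matrix, and $\gamma$ are all arbitrary choices) and checks that the closed-form fixed point satisfies the $\eta$-indexed proximal update for several stepsizes.

```python
import numpy as np

# Illustrative check (synthetic instance, not the paper's experiments): the
# closed-form RaT fixed point theta_hat = (T K_nm + m*gamma*I)^{-1} T y
# should be a fixed point of the proximal-gradient update for EVERY eta.
rng = np.random.default_rng(0)
n, m, gamma = 20, 6, 0.1
X, Xt = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))
kern = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)       # Gaussian kernel
K_mm, K_nm = kern(Xt, Xt), kern(X, Xt)
T = rng.normal(size=(m, n)) / n                          # response-linear teacher
y = rng.normal(size=n)
theta_hat = np.linalg.solve(T @ K_nm + m * gamma * np.eye(m), T @ y)

for eta in (0.05, 0.3, 1.0):
    # Prox of the ridge penalty is (K_mm + m*gamma*eta*I)^{-1}, applied to
    # K_mm theta - eta * T (K_nm theta - y).
    u = K_mm @ theta_hat - eta * T @ (K_nm @ theta_hat - y)
    step = np.linalg.solve(K_mm + m * gamma * eta * np.eye(m), u)
    assert np.allclose(step, theta_hat, atol=1e-8)       # fixed for each eta
```

Algebraically, the update equation reduces to $(T\tilde K_{nm} + m\gamma I)\theta = T y$ after cancellation, which contains no $\eta$, so the same $\hat\theta$ is fixed for every stepsize.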
Moreover, for the least-squares loss, the oracle gradient $\nabla\bar L_m(f) \in \mathbb{R}^m$ has entries
\[
\big[\nabla\bar L_m(f)\big]_j = f(\tilde x_j) - \mathbb{E}[Y \mid \tilde x_j] = f(\tilde x_j) - \big( f^\dagger(\tilde x_j) + g^*(\tilde x_j) \big), \tag{59}
\]
where we have inserted the decomposition $\mathbb{E}[Y \mid x] = f^\dagger(x) + g^*(x)$.

Proof of the SM bound (20a): The least-squares loss is strongly convex with $\mu = 1$, so that the bound (15b) is in force. The SM gradient $\nabla L^{\mathrm{SM}}_m(f) \in \mathbb{R}^m$ takes the form $[\nabla L^{\mathrm{SM}}_m(f)]_j = f(\tilde x_j) - \hat y_j$ for $j = 1, \ldots, m$, where $\hat y := T(y) \in \mathbb{R}^m$ are the pseudo-responses computed by the teacher. Substituting these choices into the bound (15b) yields
\[
\|\hat f_{\mathrm{SM}} - f^\dagger\|_m^2 \;\le\; \Big\langle \hat f_{\mathrm{SM}} - f^\dagger,\; T(y) - \big( f^\dagger(\tilde x_1^m) + g^*(\tilde x_1^m) \big) \Big\rangle_m. \tag{60}
\]
We now decompose the difference $T(y) - \big(f^\dagger(\tilde x_1^m) + g^*(\tilde x_1^m)\big)$ as the sum
\[
\underbrace{T\big( f^\dagger(x_1^n) + g^*(x_1^n) + w \big) - T(f^\dagger + g^*)}_{\equiv\, v_{\mathrm{SM}}} \;+\; \Big\{ T(f^\dagger + g^*) - \big( f^\dagger(\tilde x_1^m) + g^*(\tilde x_1^m) \big) \Big\}.
\]
Combined with Young's inequality, inequality (60) then implies that
\[
\|\hat f_{\mathrm{SM}} - f^\dagger\|_m^2 \;\le\; \big\langle \hat f_{\mathrm{SM}} - f^\dagger,\; v_{\mathrm{SM}} \big\rangle_m + \tfrac{1}{2}\|\hat f_{\mathrm{SM}} - f^\dagger\|_m^2 + \tfrac{1}{2}B^2_{\mathrm{SM}}(f^\dagger + g^*),
\]
where $B^2_{\mathrm{SM}}(f^\dagger + g^*) := \|T(f^\dagger(x_1^n) + g^*(x_1^n)) - (f^\dagger + g^*)\|_m^2$. Re-arranging terms completes the proof.

Proof of the RaT bound (20b): In this case, the bound (9b) is in force, and again using the gradient representation (59), we arrive at the upper bound
\[
\|\hat f_{\mathrm{RaT}} - f^\dagger\|_m^2 \;\le\; \Big\langle \hat f_{\mathrm{RaT}} - f^\dagger,\; \hat f_{\mathrm{RaT}}(\tilde x_1^m) - f^*(\tilde x_1^m) - \hat G(\hat f_{\mathrm{RaT}}) \Big\rangle_m.
\]
Equivalently, in terms of the shorthand $\Delta = \hat f_{\mathrm{RaT}} - f^\dagger$ and the representation $\hat G(\hat f_{\mathrm{RaT}}) = T(\hat f_{\mathrm{RaT}}(x_1^n) - y)$, we have
\[
\|\Delta\|_m^2 \;\le\; \Big\langle \Delta,\; T\big(\hat f_{\mathrm{RaT}}(x_1^n) - y\big) + \Delta(\tilde x_1^m) - g^*(\tilde x_1^m) \Big\rangle_m. \tag{61}
\]
Now observing that $T(\hat f_{\mathrm{RaT}}(x_1^n) - y) = T(\Delta(x_1^n) - g^*(x_1^n) - w)$, we decompose the difference $T(\hat f_{\mathrm{RaT}}(x_1^n) - y) + \Delta(\tilde x_1^m) - g^*(\tilde x_1^m)$ into the sum
\[
\underbrace{T\big( \hat f_{\mathrm{RaT}}(x_1^n) - y \big) - T\big(\Delta(x_1^n) - g^*(x_1^n)\big)}_{\equiv\, v_{\mathrm{RaT}}} \;+\; \Big\{ T\big(\Delta(x_1^n) - g^*(x_1^n)\big) - \big( \Delta(\tilde x_1^m) - g^*(\tilde x_1^m) \big) \Big\}.
\]
Substituting this decomposition into the upper bound (61) and applying Young's inequality yields
\[
\|\Delta\|_m^2 \;\le\; \langle \Delta, v_{\mathrm{RaT}} \rangle_m + \tfrac{1}{2}\|\Delta\|_m^2 + \tfrac{1}{2}B^2_{\mathrm{RaT}}(g^*),
\]
where $B^2_{\mathrm{RaT}}(g^*) := \|T(\Delta(x_1^n) - g^*(x_1^n)) - (\Delta - g^*)\|_m^2$. Re-arranging terms yields the claim.

B.3 Proof of Proposition 2

We first derive the claimed forms of the SM and RaT weight vectors $\hat\theta_{\mathrm{SM}}$ and $\hat\theta_{\mathrm{RaT}}$, as given in equations (26a) and (26b), respectively. Recall that functions in the student class can be written as $f_\theta(\cdot) = \sum_{j=1}^m \theta_j \tilde K(\cdot, \tilde x_j)$ for some weight vector $\theta \in \mathbb{R}^m$.

Proof of relation (26a): By definition, the PL estimate in this case performs $\gamma$-regularized KRR on the teacher's output vector $T(y) \in \mathbb{R}^m$. By standard results on kernel ridge regression, the resulting estimate is given by the student function $f_{\hat\theta_{\mathrm{SM}}}$, where the weight vector takes the claimed form $\hat\theta_{\mathrm{SM}} = \big( \tilde K_{mm} + m\gamma I \big)^{-1} T(y)$.

Proof of relation (26b): Recall that in this section, the student penalty takes the form $\mathrm{Pen}(f) = \frac{m\gamma}{2}\|f\|_{\mathcal{H}}^2$. Consequently, when the proximal operator $\mathrm{Prox}_\eta$ is applied to a vector $u \in \mathbb{R}^m$, it returns the student function $f_v$, where the coefficient vector $v \in \mathbb{R}^m$ is given by
\[
v := \big( \tilde K_{mm} + m\gamma\eta\, I_m \big)^{-1} u.
\]
Now let $\hat f_{\mathrm{RaT}} \equiv f_{\hat\theta_{\mathrm{RaT}}}$ be a RaT fixed point. For the least-squares loss, the source residual vector is $\hat f_{\mathrm{RaT}}(x_1^n) - y = \tilde K_{nm}\hat\theta_{\mathrm{RaT}} - y$, and hence the teacher-estimated gradient is $\hat G(\hat f_{\mathrm{RaT}}) = T\big( \tilde K_{nm}\hat\theta_{\mathrm{RaT}} - y \big)$.
Substituting these relations into the fixed point equation (7), we find that
\[
\hat\theta_{\mathrm{RaT}} = \big( \tilde K_{mm} + m\gamma\eta I_m \big)^{-1}\Big( \tilde K_{mm}\hat\theta_{\mathrm{RaT}} - \eta\, T\big( \tilde K_{nm}\hat\theta_{\mathrm{RaT}} - y \big) \Big). \tag{62}
\]
Using the response-linearity of the teacher, this becomes
\[
\hat\theta_{\mathrm{RaT}} = \big( \tilde K_{mm} + m\gamma\eta I_m \big)^{-1}\big( \tilde K_{mm}\hat\theta_{\mathrm{RaT}} - \eta A\hat\theta_{\mathrm{RaT}} + \eta T(y) \big),
\]
where we recall the definition $A := T\tilde K_{nm}$. Multiplying both sides by $\tilde K_{mm} + m\gamma\eta I_m$ yields
\[
\big( \tilde K_{mm} + m\gamma\eta I_m \big)\hat\theta_{\mathrm{RaT}} = \tilde K_{mm}\hat\theta_{\mathrm{RaT}} - \eta A\hat\theta_{\mathrm{RaT}} + \eta T(y).
\]
Cancelling the common term $\tilde K_{mm}\hat\theta_{\mathrm{RaT}}$ from both sides, we obtain $m\gamma\eta\,\hat\theta_{\mathrm{RaT}} = -\eta A\hat\theta_{\mathrm{RaT}} + \eta T(y)$. Finally, dividing both sides by $\eta > 0$ and re-arranging yields
\[
\big( A + m\gamma I_m \big)\hat\theta_{\mathrm{RaT}} = T(y).
\]
Assuming that $A + m\gamma I_m$ is invertible, we conclude that $\hat\theta_{\mathrm{RaT}} = \big( A + m\gamma I_m \big)^{-1} T(y)$, as claimed in equation (26b).

We now turn to the proofs of the exact MSEs given in Proposition 2.

Proof of the MSE relation (28b): By definition of the observation model, we have the relation $y = \tilde K_{nm}\theta^* + w$. Recalling our shorthand notation $A = T\tilde K_{nm}$ and $B_\gamma = \tilde K_{mm} + m\gamma I_m$, we can write
\[
\hat\theta_{\mathrm{SM}} - \theta^* = B_\gamma^{-1}T(y) - \theta^* = \big( B_\gamma^{-1}A - I_m \big)\theta^* + B_\gamma^{-1}T w.
\]
Recall that $\|\hat f_{\mathrm{SM}} - f^*\|_m^2 = (1/m)\|\tilde K_{mm}(\hat\theta_{\mathrm{SM}} - \theta^*)\|_2^2$. We thus have
\[
\mathbb{E}_w\|\hat f_{\mathrm{SM}} - f^*\|_m^2 = \frac{1}{m}\big\|\tilde K_{mm}\big( B_\gamma^{-1}A - I_m \big)\theta^*\big\|_2^2 + \frac{\sigma^2}{m}\,|||\tilde K_{mm}B_\gamma^{-1}T|||_F^2,
\]
using the fact that $\mathrm{cov}(w) = \sigma^2 I_n$.

Proof of the MSE relation (28a): Similarly, we can write
\[
\hat\theta_{\mathrm{RaT}} - \theta^* = \big( (A + m\gamma I)^{-1}A - I \big)\theta^* + (A + m\gamma I)^{-1}T(w).
\]
Using the identity $(A + m\gamma I)^{-1}A = I - m\gamma(A + m\gamma I)^{-1}$, we have
\[
\hat\theta_{\mathrm{RaT}} - \theta^* = -m\gamma(A + m\gamma I)^{-1}\theta^* + (A + m\gamma I)^{-1}T(w).
\]
We again use the relation $\|\hat f_{\mathrm{RaT}} - f^*\|_m^2 = (1/m)\|\tilde K_{mm}(\hat\theta_{\mathrm{RaT}} - \theta^*)\|_2^2$. Computing the squared bias and variance terms as before yields the claim.
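The exact bias-variance decomposition above can be sanity-checked by Monte Carlo on a small synthetic instance (an illustration under arbitrary choices of kernel, teacher matrix, and noise level; none of these match the paper's experiments):

```python
import numpy as np

# Monte-Carlo check of the exact SM MSE: E||f_SM - f*||_m^2 should equal the
# closed-form bias^2 + variance expression derived above.
rng = np.random.default_rng(0)
n, m, gamma, sigma, N = 40, 5, 0.2, 0.5, 100_000
X, Xt = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))
kern = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
K_mm, K_nm = kern(Xt, Xt), kern(X, Xt)
T = rng.normal(size=(m, n)) / n            # generic response-linear teacher
theta_star = rng.normal(size=m)
A = T @ K_nm
Binv = np.linalg.inv(K_mm + m * gamma * np.eye(m))

# Closed-form bias and variance terms.
bias2 = np.sum((K_mm @ (Binv @ A - np.eye(m)) @ theta_star) ** 2) / m
var = sigma**2 / m * np.sum((K_mm @ Binv @ T) ** 2)   # squared Frobenius norm
exact = bias2 + var

# Monte Carlo over noise draws y = K_nm theta* + w, w ~ N(0, sigma^2 I_n).
W = sigma * rng.normal(size=(N, n))
Y = (K_nm @ theta_star)[None] + W
Theta = Y @ (Binv @ T).T                   # row k = theta_SM for draw k
errs = np.sum(((Theta - theta_star) @ K_mm) ** 2, axis=1) / m
assert np.isclose(errs.mean(), exact, rtol=0.05)
```

The same template, with `Binv @ T` replaced by $(A + m\gamma I)^{-1}T$, checks the RaT relation (28a).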
C Auxiliary results for Theorem 2

This appendix is devoted to various auxiliary results associated with the separation result given in Theorem 2.

C.1 Proof of equations (42a) and (42b)

Under the finite-dimensional feature representation introduced above, the Gram matrices admit the forms
\[
K_{nn} = \Phi\Phi^\top, \qquad \tilde K_{mn} = \tilde\Phi\Phi^\top, \qquad \tilde K_{mm} = \tilde\Phi\tilde\Phi^\top.
\]
Substituting these expressions into the definitions of the operators $A$ and $T_\lambda T_\lambda^\top$ allows us to express them in terms of the feature matrices. A key step in these calculations is the push-through identity
\[
\big( \Phi\Phi^\top + n\lambda I_n \big)^{-1}\Phi = \Phi\big( \Phi^\top\Phi + n\lambda I_D \big)^{-1}. \tag{63}
\]
We start from the definition $T_\lambda = K_{mn}(K_{nn} + n\lambda I_n)^{-1}$. Substituting the feature representations and applying the push-through identity yields
\[
T_\lambda = \tilde\Phi\Phi^\top\big( \Phi\Phi^\top + n\lambda I_n \big)^{-1} = \tilde\Phi\big( \Phi^\top\Phi + n\lambda I_D \big)^{-1}\Phi^\top.
\]
Using $\Phi^\top\Phi = n\Sigma$, we obtain the compact representation
\[
T_\lambda = \frac{1}{n}\tilde\Phi(\Sigma + \lambda I_D)^{-1}\Phi^\top = \frac{1}{n}\tilde\Phi\Sigma_\lambda^{-1}\Phi^\top, \tag{64}
\]
where we define the shorthand $\Sigma_\lambda = \Sigma + \lambda I_D$. Similar calculations applied to the operator $A$ and to $\tilde K_{mm}(A + m\gamma I_m)^{-1}$ yield the representations
\[
A = \tilde\Phi\Sigma\Sigma_\lambda^{-1}\tilde\Phi^\top, \qquad \tilde K_{mm}(A + m\gamma I_m)^{-1} = \frac{1}{m}\tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Phi^\top.
\]
Finally, substituting the above expressions into the definitions of the bias and variance terms yields the desired representations in equation (28a). For the bias term, we have
\begin{align*}
B^2_{\mathrm{RaT}}(g^*) &:= \frac{1}{m}\big\| m\gamma\,\tilde K_{mm}(A + m\gamma I_m)^{-1}\theta^* \big\|_2^2 = \frac{1}{m}\Big\| \gamma\,\tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Phi^\top\theta^* \Big\|_2^2 \\
&= \frac{\gamma^2}{m}(\theta^*)^\top\tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-\top}\big( \tilde\Phi^\top\tilde\Phi \big)\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Phi^\top\theta^* \\
&= \gamma^2(\tilde\Phi^\top\theta^*)^\top\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-\top}\tilde\Sigma\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}(\tilde\Phi^\top\theta^*),
\end{align*}
where we used $\tilde\Phi^\top\tilde\Phi = m\tilde\Sigma$. Defining $v^* = \tilde\Phi^\top\theta^*$, this becomes
\[
B^2_{\mathrm{RaT}}(g^*) = \gamma^2(v^*)^\top\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-\top}\tilde\Sigma\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}v^* = \gamma^2\Big\| \tilde\Sigma^{1/2}\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}v^* \Big\|_2^2.
\]
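The push-through identity (63), which drives all of these reductions, is easy to verify numerically (a generic random feature matrix; dimensions and $\lambda$ are arbitrary):

```python
import numpy as np

# Numeric check of the push-through identity (63):
#   (Phi Phi^T + n*lam*I_n)^{-1} Phi = Phi (Phi^T Phi + n*lam*I_D)^{-1}.
rng = np.random.default_rng(0)
n, D, lam = 12, 4, 0.3
Phi = rng.normal(size=(n, D))
lhs = np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), Phi)
rhs = Phi @ np.linalg.inv(Phi.T @ Phi + n * lam * np.eye(D))
assert np.allclose(lhs, rhs)
```

The identity converts an $n \times n$ inverse into a $D \times D$ inverse, which is what allows the Gram-matrix expressions to be rewritten entirely in terms of the $D$-dimensional covariances $\Sigma$ and $\tilde\Sigma$.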
Similarly, for the variance term, we have
\begin{align*}
V_{\mathrm{RaT}} &:= \frac{\sigma^2}{m}\,|||\tilde K_{mm}(A + m\gamma I_m)^{-1}T_\lambda|||_F^2 = \frac{\sigma^2}{m}\,\Big|\Big|\Big| \frac{1}{nm}\tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Phi^\top\tilde\Phi\Sigma_\lambda^{-1}\Phi^\top \Big|\Big|\Big|_F^2 \\
&= \frac{\sigma^2}{mn^2}\,\Big|\Big|\Big| \tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Sigma\Sigma_\lambda^{-1}\Phi^\top \Big|\Big|\Big|_F^2,
\end{align*}
where we used $\tilde\Phi^\top\tilde\Phi = m\tilde\Sigma$. Now expanding the Frobenius norm via $|||M|||_F^2 = \mathrm{Tr}(M^\top M)$, and using cyclicity of the trace,
\begin{align*}
V_{\mathrm{RaT}} &= \frac{\sigma^2}{mn^2}\,\mathrm{Tr}\Big( \Phi^\top\Phi\,\Sigma_\lambda^{-1}\tilde\Sigma\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-\top}\tilde\Phi^\top\tilde\Phi\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Sigma\Sigma_\lambda^{-1} \Big) \\
&= \frac{\sigma^2}{n}\,\mathrm{Tr}\Big( \Sigma\Sigma_\lambda^{-1}\tilde\Sigma\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-\top}\tilde\Sigma\big( \tilde\Sigma\Sigma\Sigma_\lambda^{-1} + \gamma I_D \big)^{-1}\tilde\Sigma\Sigma_\lambda^{-1} \Big),
\end{align*}
where in the second line we again used $\tilde\Phi^\top\tilde\Phi = m\tilde\Sigma$ and $\Phi^\top\Phi = n\Sigma$.

C.1.1 Proof of equation (47)

Starting from (28c), we decompose the bias vector as
\[
B^2_{\mathrm{SM}}(f^*) = \frac{1}{m}\Big\| (\tilde K_{mm} - A)\theta^* + m\gamma(\tilde K_{mm} + m\gamma I_m)^{-1}A\theta^* \Big\|_2^2.
\]
Using the feature representations, this becomes
\begin{align*}
B^2_{\mathrm{SM}}(f^*) &= \frac{1}{m}\Big\| \tilde\Phi(I_D - \Sigma\Sigma_\lambda^{-1})\tilde\Phi^\top\theta^* + m\gamma(\tilde\Phi\tilde\Phi^\top + m\gamma I_m)^{-1}\tilde\Phi\Sigma\Sigma_\lambda^{-1}\tilde\Phi^\top\theta^* \Big\|_2^2 \\
&= \frac{1}{m}\Big\| \tilde\Phi\big( \lambda\Sigma_\lambda^{-1} + \gamma\tilde\Sigma_\gamma^{-1}\Sigma\Sigma_\lambda^{-1} \big)v^* \Big\|_2^2 \\
&= (v^*)^\top\big( \lambda\Sigma_\lambda^{-1} + \gamma\tilde\Sigma_\gamma^{-1}\Sigma\Sigma_\lambda^{-1} \big)^\top\tilde\Sigma\big( \lambda\Sigma_\lambda^{-1} + \gamma\tilde\Sigma_\gamma^{-1}\Sigma\Sigma_\lambda^{-1} \big)v^* \tag{65} \\
&= \Big\| \tilde\Sigma^{1/2}\big( \lambda\Sigma_\lambda^{-1} + \gamma\tilde\Sigma_\gamma^{-1}\Sigma\Sigma_\lambda^{-1} \big)v^* \Big\|_2^2,
\end{align*}
where we used the identities $(\tilde\Phi\tilde\Phi^\top + m\gamma I)^{-1}\tilde\Phi = \tilde\Phi(\tilde\Phi^\top\tilde\Phi + m\gamma I)^{-1}$ and $I_D - \Sigma\Sigma_\lambda^{-1} = \lambda\Sigma_\lambda^{-1}$, and wrote $v^* = \tilde\Phi^\top\theta^*$ and $\tilde\Phi^\top\tilde\Phi = m\tilde\Sigma$ (here $\tilde\Sigma_\gamma := \tilde\Sigma + \gamma I_D$). This concludes the proof of the bias representation (47).

C.2 Proof of the bound (45)

Fix constants $c_{\alpha,-}, c_{\alpha,+}, c_{\beta,-}, c_{\beta,+} \in (0,\infty)$ such that
\[
c_{\alpha,-}k^{-2\alpha} \le \sigma_k \le c_{\alpha,+}k^{-2\alpha}, \qquad c_{\beta,-}k^{-2\beta} \le \tilde\sigma_k \le c_{\beta,+}k^{-2\beta}
\]
for all $k = 1, \ldots, D$. We now derive a uniform lower bound on $a_k$. If $\sigma_k \ge \lambda$, then $\sigma_k + \lambda \le 2\sigma_k$, and hence
\[
a_k = \frac{\tilde\sigma_k\sigma_k}{\sigma_k + \lambda} \ge \frac{1}{2}\tilde\sigma_k.
\]
If instead $\sigma_k < \lambda$, then $\sigma_k + \lambda \le 2\lambda$, and hence $a_k \ge \frac{\tilde\sigma_k\sigma_k}{2\lambda}$. Therefore, for every $k = 1, \ldots, D$, we have
\[
a_k \ge \frac{1}{2}\min\Big\{ \tilde\sigma_k,\; \frac{\tilde\sigma_k\sigma_k}{\lambda} \Big\}.
\]
Substituting the eigendecay bounds gives
\[
a_k \ge \frac{1}{2}\min\Big\{ c_{\beta,-}k^{-2\beta},\; \frac{c_{\alpha,-}c_{\beta,-}}{\lambda}k^{-2(\alpha+\beta)} \Big\}.
\]
Consequently, we have
\[
\frac{\gamma\tilde\sigma_k}{a_k + \gamma} \;\le\; \min\Big\{ \tilde\sigma_k,\; \frac{\gamma\tilde\sigma_k}{a_k} \Big\} \;\le\; \min\bigg\{ c_{\beta,+}k^{-2\beta},\; \frac{2c_{\beta,+}\gamma k^{-2\beta}}{\min\big\{ c_{\beta,-}k^{-2\beta},\, (c_{\alpha,-}c_{\beta,-}/\lambda)k^{-2(\alpha+\beta)} \big\}} \bigg\}.
\]
Using $\min\{u, v\} \le u$ and $\min\{u, v\} \le v$, this implies
\[
\frac{\gamma\tilde\sigma_k}{a_k + \gamma} \;\le\; \min\Big\{ c_{\beta,+}k^{-2\beta},\; \frac{2c_{\beta,+}}{c_{\beta,-}}\gamma,\; \frac{2c_{\beta,+}}{c_{\alpha,-}c_{\beta,-}}\lambda\gamma k^{2\alpha} \Big\}.
\]
Since the middle term is dominated by the maximum of the first and third terms at the balancing scale, it suffices to bound
\[
\frac{\gamma\tilde\sigma_k}{a_k + \gamma} \;\le\; \min\Big\{ c_{\beta,+}k^{-2\beta},\; \frac{2c_{\beta,+}}{c_{\alpha,-}c_{\beta,-}}\lambda\gamma k^{2\alpha} \Big\}.
\]
Define the functions $A_1(k) := c_{\beta,+}k^{-2\beta}$ and $A_2(k) := \frac{2c_{\beta,+}}{c_{\alpha,-}c_{\beta,-}}\lambda\gamma k^{2\alpha}$. Observe that $A_1$ is decreasing whereas $A_2$ is increasing in $k$, so that the maximum of $\min\{A_1(k), A_2(k)\}$ is attained when the two terms are of comparable size. Choose $k^* > 0$ to satisfy $A_1(k^*) = A_2(k^*)$, so that
\[
(k^*)^{2(\alpha+\beta)} = \frac{c_{\alpha,-}c_{\beta,-}}{2}\cdot\frac{1}{\lambda\gamma}.
\]
Substituting this relation yields
\[
A_1(k^*) = A_2(k^*) = c_{\beta,+}\Big( \frac{2}{c_{\alpha,-}c_{\beta,-}} \Big)^{\frac{\beta}{\alpha+\beta}}(\lambda\gamma)^{\frac{\beta}{\alpha+\beta}}.
\]
Putting together the pieces yields the claim (45).

D Analysis for response-linear teachers

In this section, we discuss techniques and typical scalings of the stochastic error in the bound (22b). In particular, it involves the stochastic error term $\langle \hat f - f^\dagger, T(w) \rangle_m$, with a term of this form shared by the RaT and SM estimates. For understanding, it is convenient to introduce the rescaled noise variables $v_j := \sqrt{n/m}\,[T(w)]_j$ for $j = 1, \ldots, m$. We clarify the motivation for the $\sqrt{n/m}$ rescaling in a moment. With this definition, for any function $f \in \mathcal{F}$, we have the equivalence
\[
\big\langle f - f^\dagger,\; T(w) \big\rangle_m = \frac{1}{\sqrt{n}}\sum_{j=1}^m v_j\,\frac{ f(\tilde x_j) - f^\dagger(\tilde x_j) }{\sqrt{m}}. \tag{66}
\]
This term can be recognized as a variant of the standard noise complexity involved in the analysis of $M$-estimators. It involves the standard rescaling $1/\sqrt{n}$ in the source sample size $n$, but differs in that the summation is over the $m$ target samples. Nonetheless, we can bound it using standard techniques in empirical process theory. The novel and interesting ingredient is the structure of the noise vector $v \in \mathbb{R}^m$, which has been transformed by the teacher from the original vector $w \in \mathbb{R}^n$.

For a general student function class, it is standard to use discretization arguments, via metric entropy or VC dimension, to reduce the problem to studying a finite maximum. Thus, to gain intuition, it is natural to study the behavior of the student noise term (66) when the student function $f$ ranges over a finite class $\{f_1, \ldots, f_N\}$, say with each function uniformly bounded as $\|f_\ell\|_\infty \le 1$. If the original noise variables are i.i.d. $w_i \sim \mathcal{N}(0, \sigma^2)$, the new noise vector $v \in \mathbb{R}^m$ is Gaussian with covariance $\sigma^2 C$, where
\[
C := \frac{n}{m}TT^\top \in \mathbb{R}^{m\times m}. \tag{67a}
\]
Introduce the shorthand $\Delta_\ell := \frac{1}{\sqrt{m}}\big( f_\ell(\tilde x_1^m) - f^\dagger(\tilde x_1^m) \big) \in \mathbb{R}^m$, and observe that $\|\Delta_\ell\|_2 \le \sqrt{2}$ by construction. With this notation, we have the bound
\[
\mathbb{E}\Big[ \big\langle \hat f - f^\dagger,\; T(w) \big\rangle_m \Big] \;\overset{(i)}{\le}\; \max_{\ell = 1, \ldots, N}\sqrt{\langle \Delta_\ell, C\Delta_\ell \rangle}\;\sqrt{\frac{2\sigma^2\log N}{n}} \;\overset{(ii)}{\le}\; 2\sqrt{\frac{|||C|||\,\sigma^2\log N}{n}}, \tag{67b}
\]
where $|||C|||$ is the maximum eigenvalue of $C$. To be clear, we note that both bounds (i) and (ii) are quite crude; moreover, the bound could be further improved by localization of the empirical process. However, we omit these technical refinements, since our main goal is to understand the student–teacher interaction. Thus, we are left to understand the structure of the covariance matrix $C$, which depends entirely on the choice of rescaled teacher $\sqrt{n/m}\,T$.
To gain intuition, we discuss two standard choices.

Figure 8. Plots of the operator norm $|||C|||$ from equation (67a) for two different classes of response-linear teachers, target distribution $Q$ uniform on $[-1, 1]$, and source distributions $P$ from a Beta distribution (shifted to $[-1, 1]$) with parameter $\alpha \in [0.25, 2.0]$. The setting $\alpha = 1$ corresponds to no covariate shift. (a) Nadaraya–Watson smoother with bandwidth $h$: plots of $|||C|||$ versus covariate shift $\alpha$ and inverse bandwidth $1/h$. (b) Kernel ridge regression (KRR) teacher using a Gaussian kernel and regularization parameter $\lambda$: plots of $|||C|||$ versus covariate shift $\alpha$ and inverse regularization $1/\lambda$.

Kernel ridge teacher: Recall the form (21) of the kernel ridge regression (KRR) teacher with regularization parameter $\lambda > 0$. Note that
\[
\sqrt{\frac{n}{m}}\,T_\lambda = \frac{K_{mn}}{\sqrt{mn}}\Big( \frac{K_{nn}}{n} + \lambda I \Big)^{-1}.
\]
The operator norm $|||C|||$ corresponds to the squared maximum singular value of this matrix, and our choice of scaling ensures that it is an order-one quantity in terms of the two sample sizes $(n, m)$. In particular, the $n$-dimensional matrix $K_{nn}/n$ is an empirical approximation to the kernel integral operator, so that by standard results, its spectrum will converge to that of the kernel operator. Similar comments apply to the rescaled target-source matrix $K_{mn}/\sqrt{mn}$, which has dimension $m \times n$, and approximates the target-source cross-moment operator.
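A minimal sketch of this computation for a KRR teacher (synthetic one-dimensional data with arbitrary bandwidth and regularization; one could sweep the source distribution to mimic the covariate-shift experiment of Figure 8):

```python
import numpy as np

# Operator norm |||C||| of the rescaled noise covariance C = (n/m) T T^T for
# a kernel ridge regression teacher T = K_mn (K_nn + n*lam*I)^{-1}.
rng = np.random.default_rng(0)
n, m, lam = 200, 40, 1e-2
x = rng.uniform(-1, 1, size=n)                  # source samples
xt = np.linspace(-1, 1, m)                      # target samples
kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None]) ** 2 / 0.1)
K_nn, K_mn = kern(x, x), kern(xt, x)
T = K_mn @ np.linalg.inv(K_nn + n * lam * np.eye(n))
C = (n / m) * T @ T.T
op_norm = np.linalg.eigvalsh(C).max()           # C is symmetric PSD

# Equivalently: squared top singular value of the rescaled teacher sqrt(n/m)*T.
s = np.linalg.svd(np.sqrt(n / m) * T, compute_uv=False)
assert np.isclose(op_norm, s[0] ** 2)
```

Repeating this computation over a grid of $(\alpha, 1/\lambda)$ values, with source samples drawn from the shifted Beta distribution, would reproduce a plot of the kind shown in panel (b).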
NW-smoothing teacher: The Nadaraya–Watson smoother [Wat64; Nad64] is another standard example. Let $\phi$ be a base-kernel function that integrates to one, a standard example being the Gaussian density function. For a bandwidth parameter $h > 0$, the associated teacher matrix has $(j, i)$-th entry given by
\[
[T_h]_{ji} = \frac{\phi_h(\tilde x_j - x_i)}{\sum_{\ell=1}^n \phi_h(\tilde x_j - x_\ell)}, \quad\text{where } \phi_h(t) := \frac{1}{h}\phi(t/h). \tag{68}
\]
Thus, the NW teacher matrix $T_h$ is row-stochastic, and typically acts like a local averaging operator.
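The construction (68) can be written down directly; the short sketch below (with a Gaussian density as the base kernel and arbitrary sample sizes and bandwidth) also verifies the row-stochastic property:

```python
import numpy as np

# Nadaraya-Watson teacher matrix (68): each row of T_h holds the normalized
# kernel weights phi_h(xt_j - x_i), so every row sums to one.
rng = np.random.default_rng(0)
n, m, h = 50, 8, 0.2
x = rng.uniform(-1, 1, size=n)                           # source samples
xt = rng.uniform(-1, 1, size=m)                          # target samples
phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi) # Gaussian density
W = phi((xt[:, None] - x[None]) / h) / h                 # phi_h(xt_j - x_i)
T_h = W / W.sum(axis=1, keepdims=True)
assert np.allclose(T_h.sum(axis=1), 1.0)                 # row-stochastic
```

Because each row is a probability vector supported near $\tilde x_j$, applying $T_h$ to a response vector computes a local average of the source responses, exactly the "local averaging operator" behavior described above.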
