A Researcher's Guide to Empirical Risk Minimization

Lars van der Laan
Department of Statistics, University of Washington

February 26, 2026

Abstract

This guide develops high-probability regret bounds for empirical risk minimization (ERM). The presentation is modular: we state broadly applicable guarantees under high-level conditions and give tools for verifying them for specific losses and function classes. We emphasize that many ERM rate derivations can be organized around a three-step recipe (a basic inequality, a uniform local concentration bound, and a fixed-point argument), which yields regret bounds in terms of a critical radius, defined via localized Rademacher complexity, under a mild Bernstein-type variance–risk condition. To make these bounds concrete, we upper bound the critical radius using local maximal inequalities and metric-entropy integrals, recovering familiar rates for VC-subgraph, Sobolev/Hölder, and bounded-variation classes. We also review ERM with nuisance components, including weighted ERM and Neyman-orthogonal losses, as they arise in causal inference, missing data, and domain adaptation. Following Foster and Syrgkanis (2023), we highlight that these problems often admit regret-transfer bounds linking regret under an estimated loss to population regret under the target loss. These bounds typically decompose regret into (i) statistical error under the estimated (optimized) loss and (ii) approximation error due to nuisance estimation. Under sample splitting or cross-fitting, the first term can be controlled using standard fixed-loss ERM regret bounds, while the second term depends only on nuisance-estimation accuracy. We also treat the in-sample regime, where nuisances and the ERM are fit on the same data, deriving regret bounds and giving sufficient conditions for fast rates.

Table of contents

1 Introduction
2 Preliminaries
  2.1 Problem setup and notation
  2.2 High-probability guarantees (PAC bounds)
  2.3 From regret to $L_2(P)$ error: curvature and strong convexity
3 Proof blueprint for regret bounds
  3.1 Step 1: the basic inequality (deterministic regret bound)
  3.2 Warm-up: turning the basic inequality into rates
  3.3 A three-step template for ERM rates
4 Regret via localized Rademacher complexity
  4.1 A general high-probability regret theorem
  4.2 From entropy to critical radii: bounding localized complexity
  4.3 Illustrative example
5 ERM with nuisance components
  5.1 Weighted ERM and regret transfer
  5.2 Orthogonal losses and nuisance-robust learning
  5.3 In-sample nuisance estimation: ERM without sample splitting
6 Conclusion
A Regularized ERM
  A.1 Enforcing strong convexity via Tikhonov regularization (Ridge)
B Uniform local concentration inequalities for empirical processes
  B.1 Notation
  B.2 Uniform local concentration for empirical processes
  B.3 Uniform local concentration for Lipschitz transformations
  B.4 Local maximal inequalities via metric entropy
  B.5 Entropy preservation results for star-shaped hulls
C Uniform local concentration for empirical inner products
  C.1 A general bound
  C.2 Local maximal inequality via sup-norm metric entropy
  C.3 A specialized bound for star-shaped classes
D Proofs of main results
  D.1 Proofs for Section 4.1
  D.2 Proofs for Section 4.2
  D.3 Proofs for Section 5
  D.4 Proof for Section 5.3

1 Introduction

Empirical risk minimization (ERM) is a central tool of modern statistics and machine learning. In a typical problem, we choose $\hat f_n$ by minimizing the empirical risk $P_n \ell(\cdot, f)$ over a function class $\mathcal F$, and seek guarantees on the regret (or excess risk)
$$P \ell(\cdot, \hat f_n) - P \ell(\cdot, f_0),$$
where $f_0 \in \arg\min_{f \in \mathcal F} P \ell(\cdot, f)$ is a population risk minimizer. While this principle is simple, deriving sharp rates in new settings can be technical. Proofs typically rely on tools from empirical process theory, such as uniform local maximal inequalities and concentration bounds (Van Der Vaart and Wellner, 1996; Geer, 2000; Giné and Koltchinskii, 2006; Wainwright, 2019). To avoid re-deriving regret bounds for each loss and function class, a common goal is to express them in terms of standard complexity measures, such as covering numbers and entropy integrals, for which a well-developed calculus is available.

This guide presents high-probability regret analysis for ERM. We take a modular approach, isolating proof patterns and complexity bounds that are broadly applicable. The main message is that many ERM rate derivations can be organized around a three-step recipe (Geer, 2000; Abbeel and Ng, 2004; Bartlett et al., 2005; Koltchinskii, 2011; Wainwright, 2019): (i) a deterministic basic inequality; (ii) a uniform local concentration bound for the empirical-process term; and (iii) a fixed-point argument that turns the local bound into a regret rate. Under a mild Bernstein-type variance condition on the loss, we give a general theorem that bounds ERM regret in terms of the critical radius (equivalently, the localized Rademacher complexity) of the loss-difference class $\mathcal F_\ell := \{\ell(\cdot, f) - \ell(\cdot, f_0) : f \in \mathcal F\}$. This critical-radius viewpoint separates the statistical task of controlling the local complexity of $\mathcal F_\ell$ from the algebraic task of solving the associated fixed-point inequality to obtain a regret rate.

To carry out step (ii) and compute critical radii in practice, we develop maximal inequalities that upper bound the local complexity of $\mathcal F_\ell$, and hence its critical radius, in terms of metric-entropy integrals (Van Der Vaart and Wellner, 2011; Lei et al., 2016; Wainwright, 2019).
These results reduce many rate calculations to bounding covering numbers for $\mathcal F_\ell$. Covering numbers are well understood for many classical function classes and behave stably under basic operations, such as forming star-shaped hulls or applying Lipschitz transformations, which streamlines these calculations. This route recovers familiar critical radii and regret rates for VC-subgraph classes, Sobolev/Hölder classes, and bounded-variation classes.

A second focus of this guide is ERM with nuisance components. Many modern estimators minimize empirical risks of the form $P_n \ell(\cdot, f; \hat g)$, where $\hat g$ is itself estimated from data (e.g., via inverse-probability weighting or pseudo-outcome regression). Following Foster and Syrgkanis (2023), nuisance components often do not require a new regret analysis: one can apply a standard regret bound to the ERM under the estimated loss and then use a regret-transfer inequality to control the error from using $\hat g$ in place of $g$, i.e., the discrepancy between the population risks (and minimizers) under $g$ and $\hat g$. Building on Foster and Syrgkanis (2023), we develop regret-transfer bounds for weighted ERM with estimated weights. Under sample splitting or cross-fitting, these bounds combine with fixed-loss regret results to yield high-probability guarantees for nuisance-dependent ERM. We also treat the in-sample setting, where the nuisances are estimated on the same data used to compute $\hat f_n$, which requires additional uniform concentration arguments. For suitably smooth optimization classes (e.g., Hölder or Sobolev classes), we show that oracle rates are attainable under Donsker-type conditions on the nuisance class.

Scope and intent. These notes are not intended as a comprehensive survey of ERM, nor do they aim to provide the sharpest possible regret bounds. Instead, they collect proof patterns and complexity bounds that we have found useful for deriving regret rates across problems. Our goal is to sit between two complementary literatures: the generality of localized Rademacher complexity arguments (Bartlett and Mendelson, 2002; Bousquet et al., 2003; Bartlett et al., 2005; Wainwright, 2019) and the practical convenience of uniform-entropy and maximal-inequality arguments (Van Der Vaart and Wellner, 1996, 2011). Similarly, Section 5 is meant as a complement to the orthogonal statistical learning framework of Foster and Syrgkanis (2023). For introductory background on ERM, helpful lecture-note treatments include Bartlett (2013), Guntuboyina (2018), and Sen (2018), while book-length introductions include Shalev-Shwartz and Ben-David (2014), Mohri et al. (2018), Wainwright (2019), and Koltchinskii (2011). For empirical-process treatments of ERM and M-estimation, see Van Der Vaart and Wellner (1996) and Geer (2000).

Organization. Section 2 introduces the ERM setup and the high-probability language used throughout, and recalls how regret bounds translate into error bounds under curvature. Section 3 presents a high-level blueprint and intuition for regret proofs. Section 4.1 develops a general regret theorem in terms of localized Rademacher complexity and critical radii, while Section 4.2 provides entropy-based bounds that make critical-radius calculations concrete for common classes.
Finally, Section 5 extends the framework to nuisance-dependent ERM, including weighted ERM, orthogonal losses, and in-sample nuisance estimation.

2 Preliminaries

2.1 Problem setup and notation

We observe i.i.d. data $Z_1, \dots, Z_n \in \mathcal Z$ drawn from an unknown distribution $P$, and write $Z \sim P$ for an independent draw. The empirical distribution of the sample is $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{Z_i}$, where $\delta_z$ denotes a point mass at $z \in \mathcal Z$. For any measurable function $g : \mathcal Z \to \mathbb R$, we use the shorthand
$$P g := E\{g(Z)\}, \qquad P_n g := \frac{1}{n}\sum_{i=1}^n g(Z_i).$$
We write $\|g\| := \|g\|_{L_2(P)} := \{P g^2\}^{1/2}$ and $\|g\|_\infty := \operatorname{ess\,sup}_{z \in \mathcal Z}|g(z)|$, and use $\|g\|_n := \|g\|_{L_2(P_n)}$ as shorthand for the empirical $L_2$ norm. For simplicity, we let $\mathcal F$ denote a class of real-valued functions on $\mathcal Z$. If $Z = (X, Y)$ and the functions in $\mathcal F$ depend only on $X$, we adopt the convention that for $z = (x, y)$, $f(z) := f(x)$ for all $f \in \mathcal F$. Throughout, $\lesssim$ and $\gtrsim$ denote inequalities that hold up to universal constants, unless stated otherwise. We denote the minimum and maximum operators by $x \wedge y := \min\{x, y\}$ and $x \vee y := \max\{x, y\}$.

Our goal is to learn a (constrained) population risk minimizer
$$f_0 \in \arg\min_{f \in \mathcal F} R(f), \qquad R(f) := E\{\ell(Z, f)\} = P\ell(\cdot, f),$$
where $\mathcal F$ is a class of candidate predictors and $R(f)$ is the population risk induced by a loss function $\ell(z, f) \in \mathbb R$. The loss function $\ell$ specifies the notion of predictive error being optimized, while $\mathcal F$ may encode structural assumptions such as sparsity, smoothness, or low dimensionality. In practice, $f_0$ is often viewed as an approximation to a global minimizer $f_0^\star \in \arg\min_{f \in L_2(P)} R(f)$ (i.e., the minimizer over $\mathcal F = L_2(P)$).

Empirical risk minimization (ERM) is a basic principle for approximating population risk minimizers such as $f_0$. The idea is simple: replace the unknown distribution $P$ in $R(f) = P\ell(\cdot, f)$ with its empirical counterpart $P_n$, and then minimize the resulting objective. Concretely, an empirical risk minimizer is any solution
$$\hat f_n \in \arg\min_{f \in \mathcal F} R_n(f), \qquad R_n(f) := P_n\ell(\cdot, f) = \frac{1}{n}\sum_{i=1}^n \ell(Z_i, f),$$
where $R_n(f)$ is the empirical risk, computed over the sample $\{Z_i\}_{i=1}^n$.

Examples of loss functions and their purposes in machine learning can be found, for example, in Hastie et al. (2009), James et al. (2013), Shalev-Shwartz and Ben-David (2014), Wainwright (2019), and Foster and Syrgkanis (2023). For instance, in regression with $Z = (X, Y)$, a common choice is the squared loss $\ell\{(x, y), f\} := \{y - f(x)\}^2$. In this case, the population risk is globally minimized over $L_2(P_X)$ by the regression function $f_0^\star(x) = E[Y \mid X = x]$. The function class $\mathcal F$ specifies a working model for $f_0^\star$ and need not contain $f_0^\star$ exactly. For example, one may assume that $f_0^\star$ is contained in, or well approximated by, the linear class $\mathcal F := \{x \mapsto x^\top\beta : \beta \in \mathbb R^p\}$ (or a constrained variant such as $\mathcal F := \{x \mapsto x^\top\beta : \|\beta\|_1 \le B\}$ in high-dimensional settings).

In binary classification with $Y \in \{0, 1\}$, one often models $f(x) \in (0, 1)$ as a class probability and uses the cross-entropy (negative log-likelihood) loss
$$\ell\{(x, y), f\} := -y\log f(x) - (1 - y)\log\{1 - f(x)\},$$
for which a global minimizer is $f_0^\star(x) = P(Y = 1 \mid X = x)$. Different parameterizations of $P(Y = 1 \mid X = x)$ lead to different but equivalent loss representations. For example, if $\mathcal F$ parameterizes the log-odds (the logit), so that $f(x) \in \mathbb R$ and $p_f(x) := \{1 + \exp(-f(x))\}^{-1}$ is the induced conditional probability, then the cross-entropy loss can be written as the logistic regression loss
$$\ell\{(x, y), f\} := \log\big[1 + \exp\{f(x)\}\big] - y f(x).$$
The benefit of this parameterization is that $f$ is unconstrained, so $\mathcal F$ can be taken to be a convex set (or a linear space), which often simplifies optimization, for example via gradient-based methods. The squared and logistic losses are special cases of a broad class of losses derived from (working or quasi-)log-likelihoods for exponential family models, which also includes the Poisson (log-linear) loss and many others (Hastie et al., 2009).
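To fix ideas, here is a minimal computational sketch (ours, not from the original text) of constrained ERM under both losses over an $\ell_1$-constrained linear class; the simulated data, the radius `B`, and helper names such as `fit_erm` are illustrative assumptions.

```python
# Minimal sketch (not from the guide) of constrained ERM over the linear class
# F = {x -> x @ beta : ||beta||_1 <= B}, for the squared and logistic losses.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, B = 500, 5, 10.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0, 0.0, 0.25])
y_reg = X @ beta_true + rng.normal(size=n)                 # regression responses
y_cls = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))  # binary labels

def squared_risk(beta):            # R_n(f) = P_n {y - f(x)}^2
    return np.mean((y_reg - X @ beta) ** 2)

def logistic_risk(beta):           # R_n(f) = P_n [log(1 + e^{f(x)}) - y f(x)]
    fx = X @ beta
    return np.mean(np.logaddexp(0.0, fx) - y_cls * fx)

def fit_erm(empirical_risk):
    # ERM: minimize the empirical risk subject to the l1 constraint.
    cons = [{"type": "ineq", "fun": lambda b: B - np.abs(b).sum()}]
    return minimize(empirical_risk, x0=np.zeros(p), constraints=cons).x

print("squared-loss ERM:", np.round(fit_erm(squared_risk), 2))
print("logistic ERM    :", np.round(fit_erm(logistic_risk), 2))
```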
2.2 High-probability guarantees (PAC bounds)

A central goal of this guide is to understand how quickly $\hat f_n$ approaches $f_0$ as $n$ grows. We derive convergence rates under general conditions, focusing on two accuracy measures: the regret (or excess risk) $R(\hat f_n) - R(f_0)$ and the $L_2(P)$ estimation error $\|\hat f_n - f_0\|$.

To state finite-sample guarantees, we will often use PAC-style (probably approximately correct) bounds (Vapnik, 1999; Geer, 2000; Shalev-Shwartz and Ben-David, 2014; Mohri et al., 2018; Wainwright, 2019). Concretely, a PAC bound specifies a function $\varepsilon(n, \eta)$ of the sample size $n$ and a failure probability $\eta \in (0, 1)$ such that, for every $n \in \mathbb N$ and every $\eta \in (0, 1)$,
$$R(\hat f_n) - R(f_0) \le \varepsilon(n, \eta) \quad \text{with probability at least } 1 - \eta.$$
The parameter $\eta$ is user-chosen: taking $\eta$ smaller makes the event more likely, at the cost of a larger (more conservative) bound.

For example, consider linear prediction with squared loss, where $\mathcal F := \{x \mapsto x^\top\beta : \|\beta\|_1 \le B\}$ and $f_0(x) = x^\top\beta_0$ for an $s$-sparse vector $\beta_0 \in \mathbb R^p$ (i.e., $\|\beta_0\|_0 = s$). Let $\hat f_n$ be the corresponding $\ell_1$-constrained ERM (equivalently, the lasso). Under standard design conditions, one has the high-probability regret bound (Wainwright, 2019)
$$R(\hat f_n) - R(f_0) \lesssim \sigma^2\frac{s\log(p/\eta)}{n} \quad \text{with probability at least } 1 - \eta, \tag{1}$$
where $\sigma^2$ is a noise level (e.g., the variance of the regression errors) and $p$ is the ambient dimension.

PAC-style bounds provide more information than $O_p$ (big-Oh in probability) statements because they make the dependence on the failure probability and sample size explicit. The statement $R(\hat f_n) - R(f_0) = O_p(\varepsilon_n)$ means that, for every $\eta \in (0, 1)$, there exist constants $C_\eta < \infty$ and $N_\eta$ such that, for all $n \ge N_\eta$, $R(\hat f_n) - R(f_0) \le C_\eta\varepsilon_n$ with probability at least $1 - \eta$. The constant $C_\eta$ is left unspecified and may depend poorly on $\eta$. Moreover, this bound need not hold until the sample size exceeds $N_\eta$, which may be arbitrarily large. In contrast, a PAC bound gives an explicit finite-sample guarantee by providing an $\eta$-dependent error function $\varepsilon(n, \eta)$ (and, if needed, an explicit $N_\eta$). For example, the high-dimensional linear regression bound in (1) yields the $O_p$ statement $R(\hat f_n) - R(f_0) = O_p(\sigma^2 s\log p/n)$. It also shows how the guarantee degrades as $\eta$ decreases: $\eta$ enters only through $\log(1/\eta)$, which grows logarithmically as $\eta \downarrow 0$.

Another advantage of PAC-style bounds over $O_p$ bounds is that they facilitate simultaneous control of many events.
For example, suppose we wish to control the errors of a collection of ERMs $\{\hat f_{n,k} : k \in [K(n)]\}$, where $[K(n)] := \{1, \dots, K(n)\}$ and the number of estimators $K(n)$ grows with $n$. An $O_p$ statement for each $k$ is inherently pointwise: knowing $R(\hat f_{n,k}) - R(f_{0,k}) = O_p(\varepsilon_{n,k})$ for every fixed $k$ does not, by itself, imply that all bounds hold simultaneously when $K(n) \to \infty$. In contrast, if we have PAC bounds
$$\Pr\big\{R(\hat f_{n,k}) - R(f_{0,k}) \le \varepsilon_k(n, \eta)\big\} \ge 1 - \eta$$
for each $k$, then a union bound yields
$$\Pr\big\{\forall k \in [K(n)] : R(\hat f_{n,k}) - R(f_{0,k}) \le \varepsilon_k\big(n, \eta/K(n)\big)\big\} \ge 1 - \eta.$$
Thus, PAC bounds let us tune $\eta$ to control the probability of any failure across a growing family of estimators, typically at the cost of an additional $\log K(n)$ term.

In regression, a common approach is to partition the covariate space $\mathcal X$ into $K(n)$ disjoint bins and perform empirical risk minimization within each bin (Wasserman, 2006; Takezawa, 2005). For example, if $\mathcal F$ consists of constant predictors and we use the squared error loss, then $\hat f_{n,k}$ is the constant function equal to the empirical mean of the $\approx n/K(n)$ responses in bin $k$. In this case, the maximum squared error over $K(n)$ independent empirical means, each based on $n/K(n)$ observations, is typically of order $K(n)\log K(n)/n$, a fact that can be used to derive regret bounds for histogram regression (Wasserman, 2006). Similar needs for uniform control arise beyond classical regression, for instance in sequential or iterative regression procedures in reinforcement learning, such as fitted value iteration and fitted $Q$-iteration (Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2007; van der Laan et al., 2025a; van der Laan and Kallus, 2025a,b). In these settings, regret is incurred at each iteration, and one must control the cumulative regret with high probability; see, for example, Lemma 3 and Theorem 2 of van der Laan and Kallus (2025a).
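The $K(n)\log K(n)/n$ scaling of the maximum over bins is easy to check by simulation. The following sketch is ours, not from the text; it assumes $K$ independent bin means, each estimating zero from $n/K$ standard normal draws.

```python
# Simulation sketch of the uniform-control phenomenon for histogram regression:
# the maximum squared error over K bin means scales like K * log(K) / n.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
for K in [10, 100, 1000]:
    m = n // K                                   # observations per bin
    # each bin mean estimates 0 from m standard-normal draws
    bin_means = rng.normal(size=(K, m)).mean(axis=1)
    max_sq_err = np.max(bin_means ** 2)
    print(f"K={K:5d}  max squared error {max_sq_err:.2e}  "
          f"K*log(K)/n {K * np.log(K) / n:.2e}")
```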
2.3 From regret to $L_2(P)$ error: curvature and strong convexity

It is common to derive high-probability bounds on the regret $R(\hat f_n) - R(f_0)$. In many problems, the regret is locally quadratic in $f - f_0$, so $\{R(\hat f_n) - R(f_0)\}^{1/2}$ behaves like a notion of distance between $\hat f_n$ and $f_0$. Nonetheless, we are often more interested in the $L_2(P)$ estimation error $\|\hat f_n - f_0\|$, which is typically more interpretable and is often needed for downstream analyses. In this section, we show how to translate regret bounds into $L_2(P)$ error bounds.

A standard sufficient condition is a local quadratic-growth (curvature) inequality: there exists $\kappa > 0$ such that, for all $f \in \mathcal F$,
$$R(f) - R(f_0) \ge \kappa\|f - f_0\|^2.$$
When this holds, any high-probability bound of the form $R(\hat f_n) - R(f_0) \lesssim \varepsilon$ immediately yields $\|\hat f_n - f_0\| \lesssim \kappa^{-1/2}\sqrt{\varepsilon}$ at the same confidence level. This condition is natural when $R$ is twice differentiable with positive curvature near $f_0$, since a second-order Taylor expansion around $f_0$ implies quadratic growth in $\|f - f_0\|$ locally.

We now formalize this intuition. A function class $\mathcal F$ is convex if, for any $f, g \in \mathcal F$ and any $t \in [0, 1]$, the convex combination $tf + (1 - t)g$ also lies in $\mathcal F$. We say that the risk function $R : \mathcal F \to \mathbb R$ is strongly convex at $f_0$ if there exists a curvature constant $\kappa \in (0, \infty)$ such that, for all $f \in \mathcal F$,
$$R(f) - R(f_0) - \dot R_{f_0}(f - f_0) \ge \kappa\|f - f_0\|^2,$$
where $\dot R_{f_0}(h) := \frac{d}{dt}R(f_0 + th)\big|_{t=0}$ denotes the directional derivative of $R$ at $f_0$ in direction $h$ (Equation 14.42 of Wainwright, 2019).

Lemma 1. Suppose $\mathcal F$ is convex and $R : \mathcal F \to \mathbb R$ is strongly convex around $f_0$ with curvature constant $\kappa \in (0, \infty)$. Then, for all $f \in \mathcal F$,
$$R(f) - R(f_0) \ge \kappa\|f - f_0\|^2.$$

Proof. By strong convexity of $R$, we have $R(f) - R(f_0) \ge \dot R_{f_0}(f - f_0) + \kappa\|f - f_0\|^2$. Since $\mathcal F$ is convex, the line segment $f_0 + t(f - f_0)$ lies in $\mathcal F$ for all $t \in [0, 1]$. Because $f_0$ minimizes $R$ over $\mathcal F$, we have
$$R\{f_0 + t(f - f_0)\} - R(f_0) \ge 0 \quad \text{for all } t \in [0, 1].$$
Dividing by $t > 0$ and taking the limit as $t \downarrow 0$ yields
$$0 \le \lim_{t \downarrow 0}\frac{R\{f_0 + t(f - f_0)\} - R(f_0)}{t} = \dot R_{f_0}(f - f_0).$$
Thus $\dot R_{f_0}(f - f_0) \ge 0$, and dropping this term from the strong convexity bound gives $R(f) - R(f_0) \ge \kappa\|f - f_0\|^2$, as claimed. □

We illustrate Lemma 1 by verifying the curvature condition explicitly for the least-squares loss using properties of projections onto convex sets.

Example 1 (Strong convexity for least squares via projection). Let $\ell(z, f) := \{y - f(x)\}^2$ and $R(f) := E\{(Y - f(X))^2\}$. Let $\mathcal F \subset L_2(P_X)$ be convex, and let $f_0 \in \arg\min_{f \in \mathcal F} R(f)$. Write $f^\star(x) := E(Y \mid X = x)$ for the (unconstrained) risk minimizer. Conditioning on $X$ yields
$$R(f) = E\big[E\{(Y - f(X))^2 \mid X\}\big] = E\big[\operatorname{Var}(Y \mid X) + \{f^\star(X) - f(X)\}^2\big].$$
Therefore, for any $f \in \mathcal F$,
$$R(f) - R(f_0) = E\big[\{f^\star(X) - f(X)\}^2 - \{f^\star(X) - f_0(X)\}^2\big] = \|f - f_0\|^2 - 2E\{(f - f_0)(f^\star - f_0)\}.$$
Since $f_0$ is the least-squares projection of $f^\star$ onto the convex set $\mathcal F$, the first-order optimality condition gives $E[(f - f_0)(f^\star - f_0)] \le 0$ for all $f \in \mathcal F$. Hence
$$R(f) - R(f_0) \ge \|f - f_0\|^2, \qquad f \in \mathcal F.$$
If $\mathcal F$ is a linear subspace, then $E[(f - f_0)(f^\star - f_0)] = 0$, and the inequality becomes an equality: $R(f) - R(f_0) = \|f - f_0\|^2$.

In the remainder of the guide, we focus primarily on bounding the regret, since the lemma above lets us translate regret bounds directly into $L_2(P)$ error bounds under mild curvature conditions.
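As a quick numeric sanity check of Example 1 (our illustration, assuming a Gaussian design and a linear working class), the projection identity $R(f) - R(f_0) = \|f - f_0\|^2$ for linear subspaces can be verified by simulation:

```python
# Numeric check of Example 1 (a sketch under a Gaussian design assumption):
# for a linear subspace F, R(f) - R(f0) equals ||f - f0||^2 exactly.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000                                  # large sample standing in for P
X = rng.normal(size=(n, 3))
f_star = np.sin(X[:, 0]) + X[:, 1] ** 2      # true regression function f*(X)
Y = f_star + rng.normal(size=n)

# f0 = projection of f* onto the linear span of X (least squares)
beta0, *_ = np.linalg.lstsq(X, f_star, rcond=None)
f0 = X @ beta0

beta = beta0 + np.array([0.3, -0.2, 0.1])    # some other element f of F
f = X @ beta

risk = lambda pred: np.mean((Y - pred) ** 2)
print("R(f) - R(f0):", risk(f) - risk(f0))
print("||f - f0||^2:", np.mean((f - f0) ** 2))   # agree up to Monte Carlo error
```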
3 Proof blueprint for regret bounds

3.1 Step 1: the basic inequality (deterministic regret bound)

The starting point of any ERM analysis is a deterministic upper bound on the regret $R(\hat f_n) - R(f_0)$. The following theorem establishes the so-called "basic inequality," which provides such a bound.

Theorem 1 (Basic inequality for constrained ERM). For any empirical risk minimizer $\hat f_n \in \arg\min_{f \in \mathcal F} R_n(f)$ and any population risk minimizer $f_0 \in \arg\min_{f \in \mathcal F} R(f)$,
$$R(\hat f_n) - R(f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}.$$

Proof. Since $\hat f_n$ minimizes $R_n$, we have $R_n(\hat f_n) \le R_n(f_0)$, i.e., $0 \le R_n(f_0) - R_n(\hat f_n)$. Add and subtract $R(f_0)$ and $R(\hat f_n)$ to obtain
$$\begin{aligned}
R(\hat f_n) - R(f_0) &= R(\hat f_n) - R_n(\hat f_n) + R_n(\hat f_n) - R_n(f_0) + R_n(f_0) - R(f_0) \\
&\le R(\hat f_n) - R_n(\hat f_n) + R_n(f_0) - R(f_0) \\
&= \{R_n(f_0) - R(f_0)\} - \{R_n(\hat f_n) - R(\hat f_n)\} \\
&= (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}. \qquad \square
\end{aligned}$$

The basic inequality in Theorem 1 reduces the regret analysis to controlling the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$. The remainder of the argument is to bound this term with high probability and then translate the resulting bound into a rate for $R(\hat f_n) - R(f_0)$. We begin with a warm-up that builds intuition for the behavior of this term and previews the main proof techniques used in ERM analyses.

3.2 Warm-up: turning the basic inequality into rates

This section illustrates, at a heuristic level, how the basic inequality in Theorem 1 can be converted into a rate for the regret $R(\hat f_n) - R(f_0)$. The key term is the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$: once this quantity is controlled (in probability or with high probability), the bound translates directly into a regret rate.

It is helpful to separate the sources of randomness in the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$. There are two sources: (i) $P_n$ averages over an i.i.d. sample, and (ii) the function being averaged is random because it depends on the data through the ERM $\hat f_n$. Any ERM analysis must account for both effects. The first source alone can be handled using standard concentration inequalities for i.i.d. averages of the form $(P_n - P)h$ with $h$ fixed. The second source is the main complication: because $\hat f_n$ is data dependent, this fixed-function viewpoint is no longer sufficient, and one needs uniform control of the empirical process $\{(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} : f \in \mathcal F\}$ over the class $\mathcal F$. Indeed, $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$ is obtained by evaluating this process at the data-dependent index $f = \hat f_n$. We therefore begin with a warm-up in the fixed-function setting, isolating source (i), to illustrate how regret bounds are derived. Handling source (ii) requires more sophisticated uniform (typically local) concentration tools, but once such bounds are available, the overall ERM analysis follows the same proof template with only minor modifications.

As a thought experiment, suppose we have a deterministic sequence $\{f_n\}_{n=1}^\infty \subset \mathcal F$ that happens to satisfy the basic inequality
$$R(f_n) - R(f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f_n)\}. \tag{2}$$
This is generally not possible in practice, since (2) is typically obtained only for a data-dependent choice, such as $f_n = \hat f_n$. In the deterministic case, however, the fluctuation term reduces to a sample mean of independent, mean-zero random variables. Indeed, letting
$$X_{n,i} := \ell(Z_i, f_0) - \ell(Z_i, f_n) - P\{\ell(\cdot, f_0) - \ell(\cdot, f_n)\},$$
we have $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f_n)\} = n^{-1}\sum_{i=1}^n X_{n,i}$. Under mild tail conditions, a central limit theorem for triangular arrays suggests that, for large $n$, this quantity behaves like a normal random variable with mean $0$ and standard deviation $\sigma_{f_n}/\sqrt n$, where $\sigma_{f_n}^2 := \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f_n)\}$. In particular, $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f_n)\} = O_p(\sigma_{f_n}/\sqrt n)$. Combining this with (2) yields $R(f_n) - R(f_0) = O_p(\sigma_{f_n}/\sqrt n)$, and hence the slow rate $R(f_n) - R(f_0) = O_p(n^{-1/2})$ provided that $\sigma_{f_n} = O(1)$.
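The $\sigma_{f_n}/\sqrt n$ scale is easy to see in simulation. The sketch below (our illustration; constant predictors and a standard normal $Z$ are assumptions) draws many samples and compares the empirical spread of the fluctuation to the CLT prediction:

```python
# Monte Carlo sketch: for a FIXED f, the fluctuation (P_n - P){l(.,f0) - l(.,f)}
# behaves like a mean-zero normal with standard deviation sigma_f / sqrt(n).
import numpy as np

rng = np.random.default_rng(3)

def loss_diff(z, a_f0=0.0, a_f=0.5):
    # squared-loss difference l(z,f0) - l(z,f) for constant predictors a_f0, a_f
    return (z - a_f0) ** 2 - (z - a_f) ** 2

z_pop = rng.normal(size=10**6)               # large sample standing in for P
mu, sigma_f = loss_diff(z_pop).mean(), loss_diff(z_pop).std()

n, reps = 400, 5000
fluct = np.array([loss_diff(rng.normal(size=n)).mean() - mu for _ in range(reps)])
print("empirical sd of fluctuation   :", fluct.std())
print("CLT scale sigma_f / sqrt(n)   :", sigma_f / np.sqrt(n))
```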
As the following lemma shows, one can in fact obtain an exponential-tail (PAC-style) bound for the empirical-process fluctuation, and hence for the regret.

Lemma 2 (Bernstein bound for a fixed loss difference). Fix $n$ and let $f \in \mathcal F$ be deterministic. Define $\sigma_f^2 := \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f)\}$ and $M_f := \|\ell(\cdot, f_0) - \ell(\cdot, f)\|_\infty$. Then there exists a universal constant $C > 0$ such that, for all $\eta \in (0, 1)$,
$$\big|(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\}\big| \le C\left(\sigma_f\sqrt{\frac{\log(2/\eta)}{n}} + M_f\frac{\log(2/\eta)}{n}\right) \quad \text{with probability at least } 1 - \eta.$$

If $\sigma_{f_n} \vee M_{f_n} \lesssim 1$, then Lemma 2 combined with (2) yields the PAC-style guarantee that, whenever $n \gtrsim \log(1/\eta)$,
$$R(f_n) - R(f_0) \lesssim \sqrt{\frac{\log(1/\eta)}{n}} \quad \text{with probability at least } 1 - \eta. \tag{3}$$
This slow rate is often loose; for example, by Section 2.3, it typically implies only $\|f_n - f_0\| = O_p(n^{-1/4})$. The looseness stems from the fact that (3) ignores the local variance term in Bernstein's inequality, $\sigma_{f_n}^2 = \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f_n)\}$, which can shrink when $f_n$ is close to $f_0$. Treating $\sigma_{f_n}$ as a constant forces the leading term to scale like $n^{-1/2}$, regardless of how close $f_n$ is to $f_0$.

We now show how exploiting the localization of $f_n$ around $f_0$ can yield substantially faster behavior in our deterministic thought experiment, namely $R(f_n) - R(f_0) = O_p(n^{-1})$ and $\|f_n - f_0\| = O_p(n^{-1/2})$. To formalize this localization, a standard approach is to impose a Bernstein condition, which relates the variance of the loss difference $\ell(\cdot, f) - \ell(\cdot, f_0)$ to the corresponding regret. Specifically, assume there exists $c_{\mathrm{Bern}} > 0$ such that, for all $f \in \mathcal F$,
$$\operatorname{Var}\{\ell(Z, f) - \ell(Z, f_0)\} \le c_{\mathrm{Bern}}\{R(f) - R(f_0)\}. \tag{4}$$
In other words, small regret implies small variance of the loss difference $\ell(Z, f) - \ell(Z, f_0)$. This condition is mild and holds in many standard settings, for example when $\mathcal F$ is convex and the risk is strongly convex. In particular, it follows by combining a Lipschitz-type variance bound with local quadratic curvature of the risk, as in Section 2.3. (For convex risks over convex classes, strong convexity, and hence the Bernstein condition, can also be enforced via Tikhonov regularization; see Appendix A.1.)

Lemma 3 (Sufficient conditions for the Bernstein condition). Suppose that, for all $f \in \mathcal F$,
$$\operatorname{Var}\{\ell(Z, f) - \ell(Z, f_0)\} \le L\|f - f_0\|^2 \quad \text{and} \quad \kappa\|f - f_0\|^2 \le R(f) - R(f_0).$$
Then, for all $f \in \mathcal F$, $\operatorname{Var}\{\ell(Z, f) - \ell(Z, f_0)\} \le c_{\mathrm{Bern}}\{R(f) - R(f_0)\}$ with $c_{\mathrm{Bern}} := L\kappa^{-1}$.

We now illustrate how the Bernstein condition can yield faster rates. Under the Bernstein condition, the variance $\sigma_{f_n}^2 := \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f_n)\}$ shrinks as the regret shrinks. Since Lemma 2 concentrates $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f_n)\}$ at the scale $\sigma_{f_n}/\sqrt n$, combining it with the basic inequality yields
$$R(f_n) - R(f_0) = O_p\left(\frac{\sigma_{f_n}}{\sqrt n}\right). \tag{5}$$
A crude bound treats $\sigma_{f_n}$ as constant and gives the slow rate $R(f_n) - R(f_0) = O_p(n^{-1/2})$. The Bernstein condition improves this by linking variance to regret: $\sigma_{f_n}^2 \lesssim R(f_n) - R(f_0)$, so (5) becomes
$$R(f_n) - R(f_0) = O_p\left(\frac{\{R(f_n) - R(f_0)\}^{1/2}}{\sqrt n}\right).$$
Rearranging yields $\{R(f_n) - R(f_0)\}^{1/2} = O_p(n^{-1/2})$, and therefore $R(f_n) - R(f_0) = O_p(n^{-1})$.
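The Bernstein condition (4) can be checked numerically in simple cases. The sketch below is ours; it assumes bounded responses with $|Y|, |f| \le B$ and a class of constant shifts around the population minimizer, so Lemma 3 applies with $\kappa = 1$ and $L = 16B^2$, giving $c_{\mathrm{Bern}} \le 16B^2$.

```python
# Numeric sketch of the Bernstein condition (4) for the squared loss:
# Var{l(Z,f) - l(Z,f0)} <= c_Bern * {R(f) - R(f0)}, with c_Bern <= 16 B^2.
import numpy as np

rng = np.random.default_rng(4)
n, B = 10**6, 1.0
X = rng.uniform(-1, 1, size=n)
Y = np.clip(0.5 * X + 0.3 * rng.normal(size=n), -B, B)

f0 = 0.5 * X        # population minimizer over F = {x -> 0.5x + c} (by symmetry)
for shift in [0.01, 0.1, 0.5]:
    f = f0 + shift                      # another element of F
    d = (Y - f0) ** 2 - (Y - f) ** 2    # l(Z, f0) - l(Z, f)
    var, regret = d.var(), -d.mean()    # regret = R(f) - R(f0)
    print(f"shift={shift:4.2f}  Var={var:.4f} <= 16*B^2*regret={16 * B**2 * regret:.4f}")
```

Note how both sides shrink together as the shift (and hence the regret) shrinks; this is exactly the localization that upgrades the slow rate to the fast one.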
From deterministic to random functions via uniform local concentration inequalities. Much of the intuition from the deterministic thought experiment carries over to ERM, where the random minimizer $\hat f_n$ satisfies the basic inequality. The main complication is that Lemma 2 controls the empirical-process fluctuation only for a fixed (deterministic) function, so it does not apply directly to the data-dependent choice $\hat f_n$. Extending the argument therefore requires high-probability bounds that hold uniformly over $f \in \mathcal F$. As we will see, such uniformity can be costly: outside low-dimensional parametric models, it typically rules out the optimistic $O_p(n^{-1})$ regret rate, and slower rates are the norm.

A simple, though loose, way to handle the randomness in $\hat f_n$ is to upper bound the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$ by the global supremum
$$\sup_{f \in \mathcal F}\,(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\}.$$
This removes the data dependence of $\hat f_n$, at the cost of controlling the worst-case value of the empirical process. When the class $\mathcal F$ is not too large (e.g., when it is Donsker), one can often apply a uniform LLN/CLT to show that the supremum is $O_p(n^{-1/2})$, and hence $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\} = O_p(n^{-1/2})$ (Van Der Vaart and Wellner, 1996). This recovers the "slow rate" $R(\hat f_n) - R(f_0) = O_p(n^{-1/2})$ that we argued heuristically when treating $\hat f_n$ as deterministic. However, just as in the deterministic case, this approach does not exploit the localization of $\hat f_n$ around $f_0$, since it replaces $\hat f_n$ with a supremum over all $f \in \mathcal F$.

The next section addresses this limitation using uniform local concentration inequalities. Unlike the global supremum bound, these results control $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\}$ simultaneously over $f \in \mathcal F$, but with a high-probability bound that adapts to $f$ through the local scale parameter $\sigma_f = [\operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f)\}]^{1/2}$. For intuition, one can derive these bounds by controlling self-normalized, ratio-type empirical process suprema, as in Giné and Koltchinskii (2006). Concretely, one seeks a bound of the form
$$\sup_{f \in \mathcal F}\frac{(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\}}{\sigma_f \vee \delta_{n,\eta}} \lesssim \delta_{n,\eta} \quad \text{with probability at least } 1 - \eta,$$
where $\delta_{n,\eta}$ is a deterministic complexity term (often characterized via a critical radius) that may decay more slowly than $n^{-1/2}$ in nonparametric settings. Such a bound implies that, with probability at least $1 - \eta$,
$$(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\} \lesssim (\sigma_{\hat f_n} \vee \delta_{n,\eta})\,\delta_{n,\eta} \lesssim \sigma_{\hat f_n}\delta_{n,\eta} + \delta_{n,\eta}^2. \tag{6}$$
Consequently, arguing as in (5), one can combine this inequality with the basic inequality to obtain the regret bound
$$R(\hat f_n) - R(f_0) = O_p\big(\sigma_{\hat f_n}\delta_{n,\eta} + \delta_{n,\eta}^2\big).$$
If, in addition, a Bernstein-type variance–risk condition holds so that $\sigma_{\hat f_n}^2 \lesssim R(\hat f_n) - R(f_0)$, then similar algebra to the deterministic case yields $R(\hat f_n) - R(f_0) = O_p(\delta_{n,\eta}^2)$.
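In low-dimensional parametric models the fast rate survives the passage to random minimizers. The following Monte Carlo sketch is ours; it assumes a well-specified Gaussian linear model with identity design covariance, so the least-squares regret equals $\|\hat\beta - \beta_0\|^2$ and decays like $p/n$ rather than $n^{-1/2}$:

```python
# Monte Carlo sketch (assumed Gaussian linear model): for least squares over a
# p-dimensional linear class, the regret R(beta_hat) - R(beta0) decays like p/n.
import numpy as np

rng = np.random.default_rng(5)
p, beta0 = 5, np.arange(1.0, 6.0)

def mean_regret(n, reps=200):
    out = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta0 + rng.normal(size=n)
        bhat = np.linalg.lstsq(X, y, rcond=None)[0]
        # with E[XX'] = I, the regret equals ||bhat - beta0||^2
        out.append(np.sum((bhat - beta0) ** 2))
    return np.mean(out)

for n in [100, 400, 1600]:
    print(f"n={n:5d}  mean regret {mean_regret(n):.4f}   p/n = {p/n:.4f}")
```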
3.3 A three-step template for ERM rates

In this section, we outline a high-level template for deriving ERM rates. Many existing ERM analyses follow this blueprint, and later sections instantiate each step in concrete settings. The presentation here is intentionally schematic; readers who prefer to start with a fully specified result may skip ahead to Section 4.1.

The template consists of three steps. First, derive a deterministic basic inequality, as we already did in Theorem 1. Second, obtain a uniform concentration bound for the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$ appearing in the basic inequality. To obtain fast rates, this bound should ideally reflect the localization of $\hat f_n$ around $f_0$, as illustrated in the previous section. Third, combine the basic inequality with the high-probability bound to obtain regret bounds for the ERM.

Step 1. Derive a deterministic inequality. Theorem 1 establishes the basic inequality
$$R(\hat f_n) - R(f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}. \tag{7}$$
This is the key inequality for constrained ERM. In other settings, different basic inequalities are needed; for example, in regularized ERM (e.g., ridge- or lasso-type penalties) the optimization problem includes a penalty term, and the corresponding basic inequality features additional terms involving the regularizer (Wainwright, 2019). Likewise, when the loss involves estimated nuisance components, further empirical-process and remainder terms may appear (see Section 5.3 for one such inequality). In general, the goal is to upper bound the regret by a finite sum of empirical-process-type remainders (to be controlled via high-probability bounds) and population quantities depending on $\hat f_n$ and $f_0$ (e.g., terms that can be bounded by a fractional power of the regret, $\{R(\hat f_n) - R(f_0)\}^\gamma$).

Step 2. Control randomness through a high-probability bound. The next step is to control the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$. Concretely, we assume a Bernstein-type uniform local concentration bound on the empirical process $\{(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} : f \in \mathcal F\}$ of the following form:

E1) (Uniform local concentration bound.) There exists a function $\varepsilon(n, \sigma, \eta)$, nondecreasing in $\sigma$, such that for all $f \in \mathcal F$ with $\sigma_f^2 := \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f)\} < \infty$ and all $\eta \in (0, 1)$,
$$(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le \varepsilon(n, \sigma_f, \eta) \quad \text{with probability at least } 1 - \eta.$$

This bound ensures that, uniformly over $f \in \mathcal F$, we can control the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\}$, and that this control is local around $f_0$ in the sense that it adapts to the standard deviation parameter $\sigma_f$. Such uniform concentration bounds are widely available (e.g., Theorem 8 of Bousquet et al., 2003; Theorem 3.3 of Bartlett et al., 2005; Theorem 2.1 of Giné and Koltchinskii, 2006; Theorem 14.20(b) of Wainwright, 2019; Appendix K of Foster and Syrgkanis, 2023). For example, in Section 4.1, Theorem 2 shows that $\varepsilon(n, \sigma_f, \eta)$ can be taken proportional to $\delta_{n,\eta}\sigma_f + \delta_{n,\eta}^2$, where $\delta_{n,\eta} \asymp \delta_n + \sqrt{\log(1/\eta)/n}$ and $\delta_n$ is a critical radius determined by the complexity of the loss-difference class $\mathcal F_\ell := \{\ell(\cdot, f_0) - \ell(\cdot, f) : f \in \mathcal F\}$.

Since the bound in E1 holds for all $f \in \mathcal F$, it also holds for the (random) choice $f = \hat f_n$. In particular,
$$(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\} \le \varepsilon(n, \sigma_{\hat f_n}, \eta) \quad \text{with probability at least } 1 - \eta. \tag{8}$$

Step 3. Bound regret and extract a rate. Denote the regret by $\hat d_n^2 := R(\hat f_n) - R(f_0)$. Combining the basic inequality in (7) and our high-probability bound in (8), we find that
$$\hat d_n^2 = R(\hat f_n) - R(f_0) \le \varepsilon(n, \sigma_{\hat f_n}, \eta) \quad \text{with probability at least } 1 - \eta.$$
Next, as in the previous section, we exploit the localization of $\hat f_n$ around $f_0$ by assuming the following Bernstein condition: there exists $c_{\mathrm{Bern}} > 0$ such that, for all $f \in \mathcal F$,
$$\operatorname{Var}\{\ell(Z, f) - \ell(Z, f_0)\} \le c_{\mathrm{Bern}}\{R(f) - R(f_0)\}.$$
We have $\sigma_{\hat f_n}^2 \le c_{\mathrm{Bern}}\hat d_n^2$, and hence $\varepsilon(n, \sigma_{\hat f_n}, \eta) \le \varepsilon(n, c_{\mathrm{Bern}}\hat d_n, \eta)$, since $\sigma \mapsto \varepsilon(n, \sigma, \eta)$ is nondecreasing. It then follows that
$$\hat d_n^2 \le \varepsilon(n, c_{\mathrm{Bern}}\hat d_n, \eta) \quad \text{with probability at least } 1 - \eta. \tag{9}$$
This is a fixed-point (self-bounding) inequality: $\hat d_n$ appears on both sides, and the right-hand side typically grows sub-quadratically in $\hat d_n$. Consequently, the inequality forces $\hat d_n$ to be small: if $\hat d_n$ were too large, the left-hand side would dominate the right-hand side, contradicting (9). We obtain a high-probability bound for $\hat d_n$ by comparing it to the largest deterministic $\delta(n, \eta)$ that satisfies the same inequality. Define
$$\delta(n, \eta) := \sup\big\{\delta \ge 0 : \delta^2 \le \varepsilon(n, c_{\mathrm{Bern}}\delta, \eta)\big\}.$$
On any event where $\hat d_n^2 \le \varepsilon(n, c_{\mathrm{Bern}}\hat d_n, \eta)$ holds, we have $\hat d_n \le \delta(n, \eta)$, and hence $\hat d_n^2 \le \delta^2(n, \eta)$. Therefore, by (9),
$$R(\hat f_n) - R(f_0) \le \delta^2(n, \eta) \quad \text{with probability at least } 1 - \eta.$$

In practice, the worst-case rate $\delta(n, \eta)$ is rarely computed exactly. Instead, one typically manipulates the fixed-point inequality in (9) to obtain a closed-form algebraic upper bound on $\hat d_n$. For this purpose, Young's inequality is often useful (Hardy et al., 1952).

Lemma 4 (Young's inequality). Let $x, y \ge 0$ and let $p, q > 1$ satisfy $1/p + 1/q = 1$. Then
$$xy \le \frac{x^p}{p} + \frac{y^q}{q}.$$
In particular, for any $\lambda > 0$, $xy \le \frac{\lambda}{2}x^2 + \frac{1}{2\lambda}y^2$.

To illustrate Lemma 4, suppose we have shown that $\varepsilon(n, C\hat d_n, \eta) \le c\,\delta_{n,\eta}\hat d_n + \delta_{n,\eta}^2$ for some $\delta_{n,\eta} > 0$ and $c > 0$, as would follow from (6). Then (9) becomes
$$\hat d_n^2 \le \delta_{n,\eta}^2 + c\,\delta_{n,\eta}\hat d_n.$$
Applying Young's inequality with $x = \hat d_n$, $y = c\,\delta_{n,\eta}$, and $\lambda = 1$ yields $c\,\delta_{n,\eta}\hat d_n \le \frac{1}{2}\hat d_n^2 + \frac{1}{2}c^2\delta_{n,\eta}^2$. Substituting and rearranging give $\hat d_n^2 \le (2 + c^2)\delta_{n,\eta}^2$, i.e., $\hat d_n \lesssim \delta_{n,\eta}$ with probability at least $1 - \eta$.

Remark. When the risk is strongly convex and satisfies the curvature bound $\kappa\|f - f_0\|^2 \le R(f) - R(f_0)$, it is often simplest to analyze the $L_2$ error directly (Wainwright, 2019). Indeed, combining this curvature bound with the basic inequality of Theorem 1 yields
$$\kappa\|\hat f_n - f_0\|^2 \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}.$$
On the event in Condition E1 (which has probability at least $1 - \eta$), $\kappa\|\hat f_n - f_0\|^2 \le \varepsilon(n, \sigma_{\hat f_n}, \eta)$. If moreover $\ell$ is pointwise Lipschitz, so that $|\ell(Z, f_0) - \ell(Z, f)| \le L|f_0(Z) - f(Z)|$ for all $f \in \mathcal F$, then $\sigma_{\hat f_n} \le L\|\hat f_n - f_0\|$, and hence
$$\kappa\|\hat f_n - f_0\|^2 \le \varepsilon\big(n, L\|\hat f_n - f_0\|, \eta\big).$$
This is a fixed-point inequality for $\|\hat f_n - f_0\|$ (rather than for the regret), from which rates follow by the same algebra as in Step 3. Along this route, the Bernstein condition is not invoked explicitly; it is implied by the curvature bound together with the Lipschitz control of $\sigma_{\hat f_n}$ via Lemma 3.
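Step 3 can also be carried out numerically. The sketch below (ours, with illustrative constants) computes $\delta(n, \eta) = \sup\{\delta \ge 0 : \delta^2 \le \varepsilon(n, c_{\mathrm{Bern}}\delta, \eta)\}$ on a grid for the localized form $\varepsilon(n, \sigma, \eta) = \delta_{n,\eta}\sigma + \delta_{n,\eta}^2$ from (6), and compares it to the closed form obtained by solving the quadratic:

```python
# Numerical sketch of Step 3: solve the fixed-point inequality
# d^2 <= eps(n, c*d, eta) for the largest admissible d, using the localized
# form eps(n, sigma, eta) = delta * sigma + delta^2 from (6). Values are illustrative.
import numpy as np

def largest_fixed_point(eps, grid=np.linspace(0, 1, 200001)):
    # delta(n, eta) = sup{ d >= 0 : d^2 <= eps(d) }
    ok = grid ** 2 <= eps(grid)
    return grid[ok].max()

c_bern, delta_n = 2.0, 0.05                  # assumed constants
eps = lambda d: delta_n * (c_bern * d) + delta_n ** 2

d_star = largest_fixed_point(eps)
print("fixed point d* =", d_star)            # solves d^2 = c*delta*d + delta^2
# closed form: d* = (c*delta + sqrt((c^2 + 4) * delta^2)) / 2
print("closed form    =", (c_bern * delta_n + np.sqrt((c_bern**2 + 4) * delta_n**2)) / 2)
```

As the closed form shows, the solution is a constant multiple of $\delta_{n,\eta}$, matching the conclusion $\hat d_n \lesssim \delta_{n,\eta}$ from the Young's-inequality argument.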
4 Regret via localized Rademacher complexity

4.1 A general high-probability regret theorem

In this section, we present a general regret theorem for ERM, which is the fulcrum of this paper. Our bound uses the proof recipe of Section 3.3 to obtain a PAC-style guarantee in terms of a function-class complexity measure. The key technical step is to control the empirical-process fluctuation $(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat f_n)\}$, since this term upper bounds the regret $R(\hat f_n) - R(f_0)$ of the empirical risk minimizer $\hat f_n$. The magnitude of this fluctuation is governed by the size of the loss-difference class $\mathcal F_\ell := \{\ell(\cdot, f_0) - \ell(\cdot, f) : f \in \mathcal F\}$. Intuitively, richer classes give ERM more opportunities to fit random noise, so the empirical risk $P_n\ell(\cdot, \hat f_n)$ can be much smaller than the population risk $P\ell(\cdot, \hat f_n)$ at the data-selected $\hat f_n$, unless the class is suitably controlled. Consequently, regret bounds are typically expressed in terms of complexity measures for $\mathcal F_\ell$, which can often be related back to those of $\mathcal F$.

We quantify the complexity of a function class $\mathcal G$ through its Rademacher complexity (Bartlett and Mendelson, 2002; Bartlett et al., 2005; Wainwright, 2019). For a radius $\delta \in (0, \infty)$, the localized Rademacher complexity of $\mathcal G$ is
$$\mathcal R_n(\mathcal G, \delta) := E\left[\sup_{f \in \mathcal G,\ \|f\| \le \delta}\frac{1}{n}\sum_{i=1}^n \epsilon_i f(Z_i)\right],$$
where $\epsilon_1, \dots, \epsilon_n \in \{-1, 1\}$ are i.i.d. Rademacher random variables, independent of $Z_1, \dots, Z_n$. Define the star hull of $\mathcal G$ by $\operatorname{star}(\mathcal G) := \{th : h \in \mathcal G, t \in [0, 1]\}$. Any convex class containing $0$ is star-shaped, and hence equals its star hull. A useful feature of $\operatorname{star}(\mathcal G)$, which we use repeatedly in our proofs, is that it ensures the scaling property: for all $t \in [0, 1]$,
$$\mathcal R_n\big(\operatorname{star}(\mathcal G), t\delta\big) \ge t\,\mathcal R_n\big(\operatorname{star}(\mathcal G), \delta\big),$$
and hence $\delta \mapsto \mathcal R_n(\operatorname{star}(\mathcal G), \delta)/\delta$ is nonincreasing (Lemma 10).

A key characteristic of a function class $\mathcal G$ is its critical radius, which governs local concentration and, consequently, the regret of ERM. The critical radius $\delta_n(\mathcal G)$ is defined as the smallest $\delta > 0$ such that $\mathcal R_n(\mathcal G, \delta) \le \delta^2$, that is,
$$\delta_n(\mathcal G) := \inf\big\{\delta > 0 : \mathcal R_n(\mathcal G, \delta) \le \delta^2\big\}.$$
Larger classes typically have larger critical radii. Intuitively, $\delta_n(\mathcal G)$ marks the scale above which the empirical $L_2(P_n)$ norm and the population $L_2(P)$ norm are comparable on $\mathcal G$ (see Theorem 14.1 of Wainwright, 2019). Roughly, when $g \in \mathcal G$ satisfies $\|g\| \asymp \delta_n(\mathcal G)$, its empirical norm $\|g\|_n$ is typically of the same order, namely $\delta_n(\mathcal G)$. Below this scale, sampling noise can dominate: $\|g\|_n$ may be much larger than $\|g\|$ purely by chance. For this reason, critical radii enter uniform high-probability bounds for empirical processes (see Theorem 10 in Appendix B). Critical radii are known for many function classes; common examples are given in Table 1. In the next section, we develop general tools for upper bounding the critical radius via metric entropy bounds on localized Rademacher complexity.

Table 1: Examples of critical radii for common function classes.

| Function class $\mathcal G$ | Critical radius $\delta_n$ | Reference |
|---|---|---|
| $s$-sparse linear predictors in $\mathbb R^p$ | $\sqrt{s\log(ep/s)/n}$ | Wainwright (2019) |
| VC-subgraph of dimension $V$ | $\sqrt{V\log(n/V)/n}$ | Van Der Vaart and Wellner (1996) |
| Hölder/Sobolev with smoothness $s$ in dimension $d$ | $n^{-s/(2s+d)}$ | Nickl and Pötscher (2007) |
| Bounded Hardy–Krause variation | $n^{-1/3}(\log n)^{2(d-1)/3}$ | Bibaut and van der Laan (2019) |
| RKHS with eigenvalue decay $\sigma_j \asymp j^{-2\alpha}$, $\alpha > 1/2$ | $n^{-\alpha/(2\alpha+1)}$ | Mendelson and Neeman (2010) |

To bound the regret, we first establish a uniform high-probability bound of the form E1 controlling the local concentration of the empirical process $\{(P_n - P)g : g \in \mathcal F_\ell\}$. The resulting bound is governed by the critical radius of the centered loss-difference class
$$\bar{\mathcal F}_\ell := \big\{\ell(\cdot, f_0) - \ell(\cdot, f) - P\{\ell(\cdot, f_0) - \ell(\cdot, f)\} : f \in \mathcal F\big\}.$$
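Localized Rademacher complexities and critical radii can be estimated by simulation in simple cases. The following sketch is ours; it assumes the one-dimensional linear class $\mathcal G = \{x \mapsto bx : |b| \le 1\}$ with $x \sim N(0,1)$, which is star-shaped and satisfies $\|f\| = |b|$, so the localized supremum has the closed form used below:

```python
# Simulation sketch of localized Rademacher complexity and the critical radius
# for the star-shaped class G = {x -> b*x : |b| <= 1}, x ~ N(0,1), ||f|| = |b|.
# Here R_n(G, delta) = min(1, delta) * E| n^{-1} sum_i eps_i x_i |.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 2000

def localized_rademacher(delta):
    vals = []
    for _ in range(reps):
        x = rng.normal(size=n)
        eps = rng.choice([-1.0, 1.0], size=n)
        # sup over |b| <= min(1, delta) of b * mean(eps * x)
        vals.append(min(1.0, delta) * abs(np.mean(eps * x)))
    return np.mean(vals)

deltas = np.array([0.02, 0.05, 0.1, 0.3])
Rn = np.array([localized_rademacher(d) for d in deltas])
print("delta:", deltas)
print("R_n  :", np.round(Rn, 5))
# critical radius: smallest delta with R_n(G, delta) <= delta^2; for this
# parametric class it is ~ sqrt(2/(pi*n)), i.e. the n^{-1/2} rate.
print("deltas satisfying the critical inequality:", deltas[Rn <= deltas ** 2])
```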
The bound and its proof follow Bartlett et al. (2005), leveraging that the map $\delta \mapsto \mathcal R_n(\operatorname{star}(\mathcal G), \delta)/\delta$ is nonincreasing. Alternative local concentration inequalities include Theorem 14.20 of Wainwright (2019) (see also Appendix K of Foster and Syrgkanis, 2023).

Theorem 2 (Uniform local concentration inequality). Let $\delta_n > 0$ satisfy the critical radius condition $\mathcal R_n(\operatorname{star}(\bar{\mathcal F}_\ell), \delta_n) \le \delta_n^2$. Define $B := \sup_{f \in \mathcal F}\|\ell(\cdot, f_0) - \ell(\cdot, f)\|_\infty$ and $\sigma_f^2 := \operatorname{Var}\{\ell(Z, f_0) - \ell(Z, f)\}$ for each $f \in \mathcal F$. For all $\eta \in (0, 1)$, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal F$,
$$(P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le C\left[\sigma_f\delta_n + \delta_n^2 + \sigma_f(B \vee 1)\sqrt{\frac{\log(1/\eta)}{n}} + (B \vee 1)^2\frac{\log(1/\eta)}{n}\right].$$
Alternatively, if $\delta_n > 0$ satisfies the critical radius condition $\mathcal R_n(\operatorname{star}(\mathcal F_\ell), \delta_n) \le \delta_n^2$ for the uncentered class, then the same bound holds with $\sigma_f$ replaced by $\|\ell(\cdot, f_0) - \ell(\cdot, f)\|$.

Our main result is the following PAC-style regret bound. Its proof follows the template of Section 3.3, using that Condition E1 holds with $\varepsilon(n, \sigma_f, \eta)$ given by the right-hand side of Theorem 2. We assume the following high-level Bernstein condition; sufficient conditions are provided in Lemma 3.

A1) (Bernstein variance–risk bound.) There exists $c_{\mathrm{Bern}} \in [1, \infty)$ such that, for all $f \in \mathcal F$,
$$\operatorname{Var}\{\ell(Z, f) - \ell(Z, f_0)\} \le c_{\mathrm{Bern}}\{R(f) - R(f_0)\}.$$

Theorem 3 (Regret bound for ERM). Assume A1. Let $\delta_n > 0$ satisfy the critical radius condition $\mathcal R_n(\operatorname{star}(\bar{\mathcal F}_\ell), \delta_n) \lesssim \delta_n^2$, and suppose $\sup_{f \in \mathcal F}\|\ell(\cdot, f)\|_\infty \le M$ for some $M \in [1, \infty)$. Then, for all $\eta \in (0, 1)$, there exists a universal constant $C \in (0, \infty)$ such that, with probability at least $1 - \eta$,
$$R(\hat f_n) - R(f_0) \le C\,c_{\mathrm{Bern}}\left[\delta_n^2 + M^2\frac{\log(1/\eta)}{n}\right].$$
Furthermore, if the right-hand side is at most $1$, then $\Pr\{R(\hat f_n) - R(f_0) \le 1\} \ge 1 - \eta$. In this case, possibly with a different universal constant $C$, the same conclusion holds with probability at least $1 - \eta$ when $\delta_n > 0$ is chosen to satisfy $\mathcal R_n(\operatorname{star}(\mathcal F_\ell), \delta_n) \lesssim \delta_n^2$ for the uncentered class.

When working with the $L_2$ error and a stronger Bernstein condition, applying the second part of Theorem 2 yields a slightly cleaner bound than Theorem 3, in which the critical radius may be taken with respect to the uncentered loss-difference class $\operatorname{star}(\mathcal F_\ell)$ at any sample size.

A2) (Bernstein-type MSE–risk bound.) There exist $L \in [1, \infty)$ and $\kappa \in (0, \infty)$ such that, for all $f \in \mathcal F$,
$$\kappa\|f - f_0\|^2 \le R(f) - R(f_0), \qquad \|\ell(\cdot, f) - \ell(\cdot, f_0)\| \le L\|f - f_0\|.$$

Theorem 4 ($L_2$ error bound for ERM). Assume A2. Let $\delta_n > 0$ satisfy the critical radius condition $\mathcal R_n(\operatorname{star}(\mathcal F_\ell), \delta_n) \lesssim \delta_n^2$, and suppose $\sup_{f \in \mathcal F}\|\ell(\cdot, f)\|_\infty \le M$ for some $M \in [1, \infty)$. Then, for all $\eta \in (0, 1)$, there exists a universal constant $C \in (0, \infty)$ such that, with probability at least $1 - \eta$,
$$\|\hat f_n - f_0\|^2 \le C\,\kappa^{-1}L^2\left[\delta_n^2 + M^2\frac{\log(1/\eta)}{n}\right].$$

Theorem 3 reduces bounding the regret $R(\hat f_n) - R(f_0)$ to controlling the localized Rademacher complexity, and the corresponding critical radius, of the star hull of the centered loss-difference class $\bar{\mathcal F}_\ell$. As a result, the regret converges to zero at rate $O_p(\delta_n^2)$, up to a rapidly vanishing and typically negligible $O_p(n^{-1})$ term. For strongly convex losses over convex classes, Lemma 1 further yields the $L_2$ error bound $\|\hat f_n - f_0\| = O_p(\delta_n) + O_p(n^{-1/2})$.
Moreover, if $R(\hat f_n) - R(f_0) = o_p(1)$, the second part of Theorem 3 allows one to work with the uncentered loss-difference class: it suffices to choose $\delta_n$ so that $\mathcal R_n(\operatorname{star}(\mathcal F_\ell), \delta_n) \lesssim \delta_n^2$. With this choice, the same high-probability bound holds for $n$ large enough, yielding $R(\hat f_n) - R(f_0) = O_p(\delta_n^2) + O_p(n^{-1})$. When working directly with the $L_2$ error, Theorem 4 shows that $\|\hat f_n - f_0\| = O_p(\delta_n) + O_p(n^{-1/2})$ under the same choice of $\delta_n$, with the corresponding high-probability bound valid at any sample size.

A general strategy for selecting $\delta_n$ to satisfy the critical radius condition is as follows: (i) construct a nondecreasing envelope $\phi_n : [0, \infty) \to [0, \infty)$ such that, for all $\delta$ in the relevant range,
$$\mathcal R_n\big(\operatorname{star}(\mathcal F_\ell), \delta\big) \lesssim \frac{1}{\sqrt n}\phi_n(\delta); \tag{10}$$
and (ii) choose $\delta_n$ such that $\phi_n(\delta_n) \lesssim \sqrt n\,\delta_n^2$. Then $\delta_n$ is, up to universal constants, a valid upper bound on the critical radius of $\operatorname{star}(\mathcal F_\ell)$, and therefore satisfies the critical radius condition in Theorem 3. A convenient way to construct such envelopes $\phi_n$ is via metric-entropy integrals, which we develop next.

4.2 From entropy to critical radii: bounding localized complexity

This section develops upper bounds on localized Rademacher complexities in terms of metric entropies of function classes. These bounds are useful because they reduce the task of deriving ERM rates to controlling covering numbers for $\mathcal F_\ell$, which are well understood for many common classes. In particular, they yield workable envelopes $\phi_n(\delta)$ satisfying (10), thereby reducing critical-radius calculations to bounding suitable metric-entropy integrals (Lei et al., 2016; Wainwright, 2019).

Let $\mathcal G$ be a class of measurable functions. Let $N(\varepsilon, \mathcal G, L_2(Q))$ denote the $\varepsilon$-covering number of $\mathcal G$ with respect to the $L_2(Q)$ metric $d_2(f, g) := \|f - g\|_{L_2(Q)}$. That is, $N(\varepsilon, \mathcal G, L_2(Q))$ is the smallest number of $d_2$-balls of radius $\varepsilon$ needed to cover $\mathcal G$; for background, see Chapter 2 of Sen (2018), Chapter 2 of Van Der Vaart and Wellner (1996), and Chapter 5 of Wainwright (2019). Define the uniform $L_2$-entropy integral
$$J_2(\delta, \mathcal G) := \sup_Q\int_0^\delta\sqrt{\log N(\varepsilon, \mathcal G, L_2(Q))}\,d\varepsilon,$$
where the supremum is over all discrete distributions $Q$ supported on $\operatorname{supp}(P)$. We also define the $L_\infty$-entropy integral
$$J_\infty(\delta, \mathcal G) := \int_0^\delta\sqrt{\log N_\infty(\varepsilon, \mathcal G)}\,d\varepsilon,$$
where $N_\infty(\varepsilon, \mathcal G)$ is the $\varepsilon$-covering number with respect to the $P$-essential supremum metric. Note that $J_2(\delta, \mathcal G) \le J_\infty(\delta, \mathcal G)$. Entropy integrals enjoy a useful monotonicity: $\delta \mapsto J_2(\delta, \mathcal G)$ is nondecreasing, while $\delta \mapsto J_2(\delta, \mathcal G)/\delta$ is nonincreasing. Here the lower limit of integration is $0$, so the integral may diverge; for example, it diverges for Hölder classes when $d > 2s$. While we do not pursue this here, a standard workaround is to use a truncated entropy integral that integrates from $\delta^2/2$ to $\delta$, as in Corollary 14.3 of Wainwright (2019).

A key advantage of entropy integrals is that they behave well under Lipschitz transformations of function classes. This allows us to control the complexity of composite classes by reducing the task to entropy bounds for simpler, "primitive" classes. In particular, the following Lipschitz preservation bound is often useful; it is a special case of Theorem 2.10.20 in Van Der Vaart and Wellner (1996).
Theorem 5 (Lipschitz preservation for uniform entropy). Let $\mathcal F_1, \dots, \mathcal F_k$ be classes of measurable functions on $\mathcal Z$, and let $\varphi : \mathcal Z \times \mathbb R^k \to \mathbb R$ satisfy, for some $L_1, \dots, L_k \ge 0$,
$$|\varphi(z, x) - \varphi(z, y)| \le \sum_{j=1}^k L_j|x_j - y_j|, \qquad z \in \mathcal Z,\ x, y \in I,$$
on a set $I \subseteq \mathbb R^k$ containing $\{(f_1(z), \dots, f_k(z)) : f_j \in \mathcal F_j, z \in \mathcal Z\}$. Define
$$\varphi(\mathcal F_1, \dots, \mathcal F_k) := \big\{z \mapsto \varphi\big(z, f_1(z), \dots, f_k(z)\big) : f_j \in \mathcal F_j\big\}.$$
Then, for every $\delta > 0$, up to universal constants,
$$J_2\big(\delta, \varphi(\mathcal F_1, \dots, \mathcal F_k)\big) \lesssim \sum_{j=1}^k J_2\left(\frac{\delta}{L_j}, \mathcal F_j\right).$$
If $|\varphi(z, x) - \varphi(z, y)| \le L\|x - y\|_2$ for all $z \in \mathcal Z$ and all $x, y \in I$, then $J_2(\delta, \varphi(\mathcal F_1, \dots, \mathcal F_k)) \lesssim \sum_{j=1}^k J_2(\delta/L, \mathcal F_j)$.

Uniform entropy bounds are available for many function classes (Table 2). For example, for $d$-variate Sobolev or Hölder classes with smoothness exponent $s > d/2$, one typically has $\log N_\infty(\varepsilon, \mathcal G) \lesssim \varepsilon^{-\alpha}$ with $\alpha := d/s$, and therefore $J_\infty(\delta, \mathcal G) \lesssim \delta^{1-\alpha/2}$ (Van Der Vaart and Wellner, 1996; Nickl and Pötscher, 2007). Classes of bounded Hardy–Krause variation satisfy $J_2(\delta, \mathcal G) \lesssim \delta^{1/2}\{\log(1/\delta)\}^{d-1}$ (Proposition 2 of Bibaut and van der Laan, 2019). VC-subgraph classes with VC dimension $V$ satisfy the uniform entropy bound $J_2(\delta, \mathcal G) \lesssim \delta\sqrt{V\log(1/\delta)}$ (Theorem 2.6.7 of Van Der Vaart and Wellner, 1996). In particular, if $\mathcal G$ is a linear space of real-valued functions of (finite) dimension $p$, then its VC-subgraph dimension satisfies $V \le p + 2$ (see, e.g., Lemma 2.6.15 of Van Der Vaart and Wellner, 1996).

Table 2: Examples of uniform entropy bounds for common function classes.

| Function class $\mathcal G$ | Entropy bound |
|---|---|
| Hölder/Sobolev (smoothness $s > d/2$ in dimension $d$) | $J_\infty(\delta, \mathcal G) \lesssim \delta^{1-d/(2s)}$ |
| Monotone functions on $[0, 1]$ | $J_2(\delta, \mathcal G) \lesssim \delta^{1/2}$ |
| Bounded Hardy–Krause variation (dimension $d$) | $J_2(\delta, \mathcal G) \lesssim \delta^{1/2}\{\log(1/\delta)\}^{d-1}$ |
| $s$-sparse linear predictors in $\mathbb R^p$ (support size $\le s$) | $J_2(\delta, \mathcal G) \lesssim \delta\sqrt{s\log(ep/(s\delta))}$ |
| VC-subgraph (VC dimension $V$) | $J_2(\delta, \mathcal G) \lesssim \delta\sqrt{V\log(1/\delta)}$ |

We now present our local maximal inequalities for Rademacher complexities. Our first result bounds the localized Rademacher complexity of $\mathcal G$ in terms of the uniform entropy integral and is adapted from Theorem 2.1 of Van Der Vaart and Wellner (2011). It shows that, up to universal constants, a solution to the critical inequality $\mathcal R_n(\mathcal G, \delta) \le \delta^2$ can be obtained by solving the entropy-based inequality $J_2(\delta, \mathcal G) \le \sqrt n\,\delta^2$. In what follows, our bounds are stated in terms of the localized class $\mathcal G(\delta) := \{f \in \mathcal G : \|f\| \le \delta\}$. The same conclusion holds if $\mathcal G(\delta)$ is replaced by $\mathcal G$, since $J_2(\delta, \mathcal G(\delta)) \le J_2(\delta, \mathcal G)$.

Theorem 6 (Local maximal inequality under uniform $L_2$-entropy). Let $\delta > 0$. Define $\delta_\infty := \sup_{f \in \mathcal G(\delta)}\|f\|_\infty$. Then there exists a universal constant $C \in (0, \infty)$ such that
$$\mathcal R_n(\mathcal G, \delta) \le \frac{C}{\sqrt n}\,J_2\big(\delta, \mathcal G(\delta)\big)\left[1 + \frac{\delta_\infty}{\sqrt n\,\delta^2}\,J_2\big(\delta, \mathcal G(\delta)\big)\right].$$

Theorem 6 yields an envelope bound of the form $\mathcal R_n(\mathcal G, \delta) \lesssim n^{-1/2}\phi_n(\delta)$, where
$$\phi_n(\delta) := J_2(\delta, \mathcal G)\left[1 + \frac{\delta_\infty}{\sqrt n\,\delta^2}J_2(\delta, \mathcal G)\right].$$
If $\delta_n$ satisfies $\sqrt n\,\delta_n^2 \gtrsim J_2(\delta_n, \mathcal G)$, then the bracketed factor is $O(1)$ at $\delta = \delta_n$, so $\phi_n(\delta_n) \lesssim J_2(\delta_n, \mathcal G) \lesssim \sqrt n\,\delta_n^2$. Consequently, $\delta_n$ is, up to constants, a solution to the critical inequality $\mathcal R_n(\mathcal G, \delta_n) \lesssim \delta_n^2$ and therefore upper bounds the critical radius of $\mathcal G$.
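For polynomial entropy, the entropy-based critical inequality can be solved numerically. The sketch below is ours; it assumes the bound $\log N(\varepsilon, \mathcal G, L_2(Q)) \le \varepsilon^{-\alpha}$ holds with constant one, integrates it in closed form, and solves $J_2(\delta) = \sqrt n\,\delta^2$ with a root finder, recovering the rate $n^{-1/(2+\alpha)}$:

```python
# Sketch: turning an entropy bound into a critical radius. For polynomial
# uniform entropy log N(eps) <= eps^{-alpha} with alpha in (0, 2), the entropy
# integral is J2(delta) = delta^{1 - alpha/2} / (1 - alpha/2); solving
# J2(delta) = sqrt(n) * delta^2 recovers delta_n ~ n^{-1/(2+alpha)}.
import numpy as np
from scipy.optimize import brentq

def critical_radius(n, alpha):
    J2 = lambda d: d ** (1 - alpha / 2) / (1 - alpha / 2)
    return brentq(lambda d: J2(d) - np.sqrt(n) * d ** 2, 1e-12, 1e6)

for alpha in [0.5, 1.0]:            # e.g. Holder smoothness s = d / alpha
    for n in [10**3, 10**4, 10**5]:
        d_n = critical_radius(n, alpha)
        print(f"alpha={alpha}  n={n:6d}  delta_n={d_n:.4f}  "
              f"n^(-1/(2+alpha))={n ** (-1 / (2 + alpha)):.4f}")
```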
The next theorem sharpens Theorem 6 when metric entropy is measured in the supremum norm. It shows that $\mathcal R_n(\mathcal G, \delta) \lesssim n^{-1/2}\phi_n(\delta)$ with $\phi_n(\delta) := J_\infty(\delta \vee n^{-1/2}, \mathcal G)$; in particular, $\mathcal R_n(\mathcal G, \delta)$ is of order $\frac{1}{\sqrt n}J_\infty(\delta, \mathcal G)$ whenever $\delta \gtrsim n^{-1/2}$. In most applications, however, Theorem 6 already suffices to derive ERM rates via Theorem 3. The proof invokes Theorem 2.2 of van de Geer (2014) on concentration of empirical $L_2$ norms under $L_\infty$-entropy.

Theorem 7 (Local maximal inequality under $L_\infty$-entropy). Let $\delta > 0$. Define $\mathcal G(\delta) := \{f \in \mathcal G : \|f\| \le \delta\}$ and $B := 1 + J_\infty(\infty, \mathcal G)$. Then there exists a universal constant $C \in (0, \infty)$ such that
$$\mathcal R_n(\mathcal G, \delta) \le \frac{CB}{\sqrt n}\,J_\infty\big(\delta \vee n^{-1/2}, \mathcal G(\delta)\big).$$

The bounds above on localized Rademacher complexities can be used to select radii satisfying the critical inequality in Theorem 3. In particular, an upper bound on the critical radius of $\operatorname{star}(\bar{\mathcal F}_\ell)$ can be obtained by choosing $\delta_n > 0$ to satisfy the corresponding entropy-based critical inequality,
$$J_2\big(\delta_n, \operatorname{star}(\bar{\mathcal F}_\ell)\big) \lesssim \sqrt n\,\delta_n^2 \quad \text{or} \quad J_\infty\big(\delta_n, \operatorname{star}(\bar{\mathcal F}_\ell)\big) \lesssim \sqrt n\,\delta_n^2.$$
It is sharper to solve the former, since $J_2(\delta, \operatorname{star}(\bar{\mathcal F}_\ell)) \le J_\infty(\delta, \operatorname{star}(\bar{\mathcal F}_\ell))$ for all $\delta > 0$. We summarize this in the following lemma.

Lemma 5 (Critical radii via entropy integrals). Let $\phi : (0, \infty) \to (0, \infty)$ satisfy $J_2(\delta, \operatorname{star}(\bar{\mathcal F}_\ell)) \lesssim \phi(\delta)$ for all $\delta > 0$. If $\delta_n > 0$ satisfies $\phi(\delta_n) \lesssim \sqrt n\,\delta_n^2$, then $\mathcal R_n(\operatorname{star}(\bar{\mathcal F}_\ell), \delta_n) \lesssim \delta_n^2$.

The regret bound in Theorem 3 is stated in terms of the critical radius of $\operatorname{star}(\bar{\mathcal F}_\ell)$, the star hull of the centered loss-difference class. At first glance, taking a star hull may appear to substantially enlarge the class, inflating the critical radius and degrading rates. The following lemma alleviates this concern by showing that centering and star-hull closure increase the relevant entropy integral only mildly. Moreover, when the loss is pointwise Lipschitz, the entropy integral of $\mathcal F_\ell$ can be controlled by that of the original ERM class $\mathcal F$. Together with Lemma 5, this implies that ERM rates can be read off directly from the entropy integrals for $\mathcal F$.

A3) (Lipschitz loss.) There exists a constant $L \in (0, \infty)$ such that, for all $f \in \mathcal F$,
$$|\ell(Z, f) - \ell(Z, f_0)| \le L|f(Z) - f_0(Z)| \quad \text{almost surely}.$$

Lemma 6 (Uniform entropy for star hulls and Lipschitz transformations). Let $\delta > 0$ and assume $M := \sup_{f \in \mathcal F}\|\ell(\cdot, f)\|_\infty < \infty$. Then, up to universal constants,
$$J_2\big(\delta, \operatorname{star}(\bar{\mathcal F}_\ell)\big) \lesssim J_2(\delta, \mathcal F_\ell) + \delta\sqrt{\log\left(1 + \frac{M}{\delta}\right)}.$$
If A3 holds, we may further upper bound $J_2(\delta, \mathcal F_\ell)$ by $J_2(\delta/L, \mathcal F)$ in the display above.

It is useful to note that $\delta\sqrt{\log(1 + 1/\delta)} \lesssim \sqrt n\,\delta^2$ whenever $\delta \gtrsim \{\log(en)/n\}^{1/2}$. Consequently, if $\delta_n$ satisfies $J_2(\delta_n, \mathcal F_\ell) \lesssim \sqrt n\,\delta_n^2$, then, by Lemma 6, the expanded radius $\tilde\delta_n := \delta_n \vee \{\log(en)/n\}^{1/2}$ satisfies the critical inequality $J_2(\tilde\delta_n, \operatorname{star}(\bar{\mathcal F}_\ell)) \lesssim \sqrt n\,\tilde\delta_n^2$, up to constants depending on $M$. Combining Theorem 3 with Lemma 5 and Lemma 6 yields the following corollary.

Corollary 1 (Regret bound under uniform entropy). Assume A1 and A3, and suppose that $\sup_{f \in \mathcal F}\|\ell(\cdot, f)\|_\infty \le M$ for some $M \in [1, \infty)$. Suppose there exists an envelope function $\phi$ such that $J_2(\delta, \mathcal F) \lesssim \phi(\delta)$ for all $\delta > 0$, and let $\delta_n > 0$ satisfy $\phi(\delta_n) \lesssim \sqrt n\,\delta_n^2$. Then, for all $\eta \in (0, 1)$, there exists a universal constant $C \in (0, \infty)$ such that, with probability at least $1 - \eta$,
$$R(\hat f_n) - R(f_0) \le C\,c_{\mathrm{Bern}}\left[L^2\delta_n^2 + \frac{\log(eMn)}{n} + M^2\frac{\log(1/\eta)}{n}\right].$$
The corollary is stated in terms of the entropy-based critical radius $\delta_n$ of the original ERM class $\mathcal F$, rather than the critical radius of $\mathrm{star}(\mathcal F_\ell)$. This simplification comes at a cost: by Lemma 6, it incurs an additional $\log(eMn)/n$ term in the regret bound. This term is typically loose and can often be improved with more refined arguments (see, e.g., Theorem 14.20(b) of Wainwright, 2019, and Appendix K of Foster and Syrgkanis, 2023). In particular, Appendix B.3 gives a local concentration inequality, derived from Wainwright (2019), that may be preferred over Theorem 2 when the loss $\ell$ satisfies A3; it allows $\log(eMn)$ to be replaced by $\log\log(Men)$. Alternatively, if one only seeks the in-probability rate $R(\hat f_n) - R(f_0) = O_p(\delta_n^2)$, it suffices to combine Theorem 3.4.1 of Van Der Vaart and Wellner (1996) with Theorem 6 applied directly to $\mathcal F_\ell$, avoiding star-shaped hulls and the resulting extra logarithmic factors. Nonetheless, the corollary suffices for most nonparametric applications. Outside low-dimensional parametric settings, where $\delta_n^2$ can scale as $n^{-1}$ and the $\log n / n$ term may dominate, it yields the usual sharp rates.

4.3 Illustrative example

In the next example, we illustrate the standard workflow for computing critical radii and applying the regret theorems: we bound the uniform entropy integral of the loss-difference class via Theorem 5 in terms of that of $\mathcal F$, evaluate the resulting integral, and verify the critical-radius condition.

Example 2 (Least-squares regression). Consider the regression setting where $Z = (X, Y)$, with $X$ used to predict $Y$, and let $\mathcal F$ be a convex class. Assume $|Y| \le B$ almost surely and $\sup_{f \in \mathcal F} \|f\|_\infty \le B$. The least-squares loss is $\ell(z, f) := \{y - f(x)\}^2$ with $z = (x, y)$. For least squares, the population risk $R(f) := E\{(Y - f(X))^2\}$ is strongly convex in $f$ and satisfies
\[ R(f) - R(f_0) \ge E\{(f(X) - f_0(X))^2\} = \|f - f_0\|^2. \]
Moreover, uniform boundedness of $Y$ and $\mathcal F$ implies pointwise Lipschitzness: for all $f, f' \in \mathcal F$,
\[ |\ell\{(X, Y), f\} - \ell\{(X, Y), f'\}| = |(Y - f(X))^2 - (Y - f'(X))^2| \le 4B\, |f(X) - f'(X)| \quad \text{a.s.} \]
In particular, Theorem 5 yields $J_2(\delta, \mathcal F_\ell) \lesssim J_2(\delta/(4B), \mathcal F)$. Thus Condition A3 holds and, by Lemma 3, so does Condition A1. We therefore apply Corollary 1.

For illustration, we bound the uniform entropy integral of the loss-difference class explicitly. For least squares,
\[ \mathcal F_\ell := \big\{ z \mapsto f_0(x)^2 - f(x)^2 - 2y\{f_0(x) - f(x)\} : f \in \mathcal F \big\}. \]
This class is contained in $\tilde{\mathcal F}_\ell - \tilde{\mathcal F}_\ell$, where $\tilde{\mathcal F}_\ell := \{z \mapsto f(x)^2 - 2y f(x) : f \in \mathcal F\}$. Hence
\[ J_2(\delta, \mathcal F_\ell) \le J_2(\delta, \tilde{\mathcal F}_\ell - \tilde{\mathcal F}_\ell) \lesssim J_2(\delta, \tilde{\mathcal F}_\ell), \]
where the last step uses the standard difference-class bound $J_2(\delta, \mathcal H - \mathcal H) \lesssim J_2(\delta, \mathcal H)$. Next, the class $\tilde{\mathcal F}_\ell$ is a Lipschitz transformation of $\mathcal F^2 \times \mathcal F$, where $\mathcal F^2 := \{f^2 : f \in \mathcal F\}$, via the map $\varphi : \mathbb R \times \mathbb R^2 \to \mathbb R$ defined by $\varphi(y; u, v) := u - 2yv$. Indeed, for any $|y| \le B$ and any $u, u' \in \mathbb R$, $v, v' \in \mathbb R$,
\[ |\varphi(y; u, v) - \varphi(y; u', v')| = |(u - u') - 2y(v - v')| \le |u - u'| + 2|y|\, |v - v'| \le |u - u'| + 2B |v - v'|. \]
Thus $\varphi(y; \cdot, \cdot)$ is Lipschitz uniformly over $|y| \le B$ with respect to the weighted $\ell_1$ metric $(u, v) \mapsto |u| + 2B|v|$. Consequently, by Theorem 5,
\[ J_2(\delta, \tilde{\mathcal F}_\ell) \lesssim J_2(\delta, \mathcal F^2) + J_2(\delta/(2B), \mathcal F). \]
Moreover, since $t \mapsto t^2$ is $2B$-Lipschitz on $[-B, B]$, $J_2(\delta, \mathcal F^2) \lesssim J_2(\delta/(2B), \mathcal F)$.
Combining the displays yields $J_2(\delta, \mathcal F_\ell) \lesssim J_2(\delta/(2B), \mathcal F)$. To apply Corollary 1, it remains to choose $\delta_n$ satisfying the critical-radius condition $J_2(\delta_n, \mathcal F_\ell) \lesssim \sqrt n\, \delta_n^2$. Suppose the metric entropy of $\mathcal F$ grows polynomially: for some $\alpha \in (0, 2)$,
\[ \sup_Q \log N(\varepsilon, \mathcal F, L_2(Q)) \lesssim \varepsilon^{-\alpha}, \qquad \varepsilon \in (0, 1]. \]
For Hölder or Sobolev classes on $[0,1]^d$ with smoothness $s > d/2$, this holds with $\alpha = d/s$. Then $J_2(\delta, \mathcal F) \lesssim \int_0^\delta \varepsilon^{-\alpha/2}\, d\varepsilon \asymp \delta^{1 - \alpha/2}$, up to constants depending only on $\alpha$. Consequently,
\[ J_2(\delta, \mathcal F_\ell) \lesssim J_2(\delta/B, \mathcal F) \lesssim \delta^{1 - \alpha/2}, \]
up to constants depending on $\alpha$ and $B$. The critical-radius condition holds provided $\delta_n^{1 - \alpha/2} \lesssim \sqrt n\, \delta_n^2$, equivalently $\delta_n^{1 + \alpha/2} \gtrsim n^{-1/2}$. Thus one may take $\delta_n \asymp n^{-\frac{1}{2 + \alpha}}$. Corollary 1, together with strong convexity, yields $\|\hat f_n - f_0\| = O_p(\delta_n) = O_p(n^{-1/(2 + \alpha)})$. Alternatively, one may apply Theorem 4, using that $\delta_n$ satisfies the critical inequality via Lemma 5 with envelope $\phi(\delta) \asymp \delta^{1 - \alpha/2}$.

5 ERM with nuisance components

5.1 Weighted ERM and regret transfer

Weighted empirical risk minimization arises in many settings where the empirical distribution of the observed sample differs from the target distribution of interest, including missing data, causal inference, and domain adaptation. In these problems, the target risk can often be written as a weighted expectation
\[ R(f; w_0) = E_P\{ w_0(Z)\, \ell(Z, f) \}, \]
for a potentially unknown weight function $w_0 : \mathcal Z \to \mathbb R$. A natural estimator of $R(f; w_0)$ is the weighted empirical risk $P_n\{\hat w(\cdot) \ell(\cdot, f)\}$, where $\hat w$ estimates $w_0$. The weighted empirical risk minimizer is defined as any solution
\[ \hat f_n \in \arg\min_{f \in \mathcal F} \sum_{i=1}^n \hat w(Z_i)\, \ell(Z_i, f). \]
This section develops regret bounds for weighted ERM, highlighting how weight estimation modifies the usual ERM guarantees. The following theorem characterizes the excess regret incurred by having to estimate the weight function. In what follows, let $f_0 \in \arg\min_{f \in \mathcal F} R(f; w_0)$ minimize the population risk and let $\hat f_n$ be any random element of $\mathcal F$. We define the regret
\[ \mathrm{Reg}(f; w) := R(f; w) - \inf_{f' \in \mathcal F} R(f'; w), \qquad \text{where } R(f; w) := P\{w(\cdot) \ell(\cdot, f)\}. \]
We also impose a weighted variant of the Bernstein condition.

B1) (Weighted Bernstein bound.) There exists a constant $c \in (0, \infty)$ such that, for all $f \in \mathcal F$,
\[ \mathrm{Var}\big[ w_0(Z)\{\ell(Z, f) - \ell(Z, f_0)\} \big] \le c\, \{ R(f; w_0) - R(f_0; w_0) \}. \]

Theorem 8 (Regret transfer under weight error). Assume B1 and that $\|1 - \hat w / w_0\| < 1/2$. Then
\[ \mathrm{Reg}(\hat f_n; w_0) \le \mathrm{Reg}(\hat f_n; \hat w) + 4c^2 \left\| 1 - \frac{\hat w}{w_0} \right\|^2. \]

The quantity $\mathrm{Reg}(\hat f_n; \hat w)$ is the regret incurred with respect to the estimated weights $\hat w$. When $\hat w$ is trained on data independent of $\{Z_i\}_{i=1}^n$, for example via sample splitting, $\mathrm{Reg}(\hat f_n; \hat w)$ can be bounded by applying Theorem 3 conditional on the weight-training data. Sample splitting is a standard technique for obtaining fast regret bounds: conditional on the weight-training data, the weighted loss $\hat w(\cdot) \ell(\cdot, f)$ is fixed and deterministic, which enables the use of regret-analysis tools for known losses (Foster and Syrgkanis, 2023). In this case, Theorem 8 shows that the additional regret incurred by weight estimation, relative to an oracle procedure that uses the true weights, is of order $\|1 - \hat w / w_0\|^2$, which is the chi-squared divergence between $\hat w$ and $w_0$.
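To make the sample-splitting recipe concrete, here is a minimal Python sketch for a covariate-shift problem. It is illustrative only: the logistic density-ratio estimator, the linear class, and the squared-error loss are assumptions, not part of the theory above. The first fold trains $\hat w$; conditional on that fold, the second-fold weighted loss is a fixed loss, so Theorem 8 combined with a fixed-loss regret bound applies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 2000, 3
# Source sample (labeled) and target sample (unlabeled, shifted covariates).
Xs = rng.normal(size=(n, p))
Ys = Xs @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)
Xt = rng.normal(loc=0.5, size=(n, p))

half = n // 2
# --- fold 1: estimate density-ratio weights w0 = p_target / p_source ---
# via probabilistic classification of source vs. target samples (a standard
# trick; the logistic model is an assumption).
clf = LogisticRegression().fit(
    np.vstack([Xs[:half], Xt[:half]]),
    np.concatenate([np.zeros(half), np.ones(half)]),
)
pt = clf.predict_proba(Xs[half:])[:, 1]
w_hat = pt / (1 - pt)  # odds = density ratio, up to the sampling proportion

# --- fold 2: weighted ERM over a linear class with squared-error loss ---
# Conditional on fold 1, w_hat(.) * loss is a fixed loss, so Theorem 3 applies.
sw = np.sqrt(w_hat)  # reweighting via sqrt(w) absorbs the weights into least squares
theta_hat = np.linalg.lstsq(Xs[half:] * sw[:, None], Ys[half:] * sw, rcond=None)[0]
```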
In the next sections, we study how this dependence on weight-estimation error can be improved using Neyman-orthogonal loss functions, and how in-sample nuisance estimation affects regret rates via $\mathrm{Reg}(\hat f_n; \hat w)$.

5.2 Orthogonal losses and nuisance-robust learning

In many modern learning problems, the loss depends on unknown nuisance components, especially in causal inference and missing-data settings (Rubin and van der Laan, 2006; Van Der Laan and Dudoit, 2003; Künzel et al., 2019; Nie and Wager, 2021; Kennedy, 2023; Foster and Syrgkanis, 2023; Van Der Laan et al., 2023; Yang et al., 2023; van der Laan et al., 2024a). The weighted ERM setup from the previous section is a simple instance, with the weight function playing the role of the nuisance. A general treatment of learning with nuisance components, and the role of orthogonality in obtaining fast rates, is given in Foster and Syrgkanis (2023). Here we summarize the main ideas and refer the reader there for a comprehensive analysis.

Concretely, suppose we are given a family of losses $\{\ell_g(z, f) : g \in \mathcal G, f \in \mathcal F\}$, where $g$ denotes nuisance functions (e.g., regression functions, propensity scores, or density ratios) that are not themselves the primary learning target. Let $g_0$ denote the true nuisance, and define the population risk
\[ R(f; g) := P \ell_g(\cdot, f), \qquad f_0 \in \arg\min_{f \in \mathcal F} R(f; g_0). \]
In practice, one replaces $g_0$ with an estimator $\hat g$ (often fit using flexible regression methods and typically cross-fitted), and then computes an ERM under the plug-in loss,
\[ \hat f_n \in \arg\min_{f \in \mathcal F} P_n \ell_{\hat g}(\cdot, f). \]
A key difficulty is that the plug-in objective can be first-order sensitive to nuisance error: even if $\hat f_n$ nearly minimizes the estimated risk $R(f; \hat g)$, the regret under the true risk,
\[ \mathrm{Reg}(\hat f_n; g_0) := R(\hat f_n; g_0) - R(f_0; g_0), \]
can depend on $\hat g - g_0$ through a leading term of order $\|\hat g - g_0\|^2$ (rather than a higher-order term, such as $\|\hat g - g_0\|^4$). Equivalently, the $L_2$ error $\|\hat f_n - f_0\|$ may depend on the nuisance through the first-order error $\|\hat g - g_0\|$, rather than $\|\hat g - g_0\|^2$. This issue appears, for example, in weighted ERM, where Theorem 8 shows that weight-estimation error enters at first order.

Orthogonal statistical learning (Foster and Syrgkanis, 2023) addresses this issue by working with Neyman-orthogonal losses. Specifically, for all $f' \in \mathcal F$ and $g' \in \mathcal G$,
\[ \partial_g \partial_f R(f_0; g_0)[f' - f_0, g' - g_0] := \frac{d}{dt}\Big|_{t=0} \frac{d}{ds}\Big|_{s=0} R\big( f_0 + s(f' - f_0);\, g_0 + t(g' - g_0) \big) = 0, \]
so the first-order optimality conditions at $(f_0, g_0)$ are locally insensitive to perturbations of the nuisance around $g_0$. In many problems, one can obtain an orthogonal loss from a non-orthogonal criterion by adding a bias-correction term (often derived from an influence-function expansion) that cancels the leading effect of nuisance error; see Appendix D of Foster and Syrgkanis (2023) and Section 2.3 of van der Laan et al. (2024a).
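As a concrete check of the definition (a short calculation, not part of the formal development, assuming the derivative and expectation may be interchanged), consider the weighted loss $\ell_w(z, f) := w(z) \ell(z, f)$ from Section 5.1, with the weight $w$ playing the role of the nuisance. Then
\[ \frac{d}{dt}\Big|_{t=0} \frac{d}{ds}\Big|_{s=0} R\big( f_0 + s(f' - f_0);\, w_0 + t(w' - w_0) \big) = P\Big\{ (w' - w_0)(\cdot)\, \frac{d}{ds}\Big|_{s=0} \ell\big( \cdot, f_0 + s(f' - f_0) \big) \Big\}, \]
which is nonzero for generic directions $w' - w_0$ unless the inner derivative already vanishes pointwise. The plain weighted loss is therefore not Neyman-orthogonal in general, consistent with the first-order dependence on $\|1 - \hat w / w_0\|$ in Theorem 8; orthogonalization adds a correction term that makes this cross-derivative vanish.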
In the framework of Foster and Syrgkanis (2023), one first controls the regret under the estimated loss,
\[ \mathrm{Reg}(\hat f_n; \hat g) := R(\hat f_n; \hat g) - \inf_{f \in \mathcal F} R(f; \hat g), \]
and then relates it to regret under the true loss. In particular, Theorem 1 of Foster and Syrgkanis (2023) gives a regret-transfer bound of the form
\[ \mathrm{Reg}(\hat f_n; g_0) \le C\, \mathrm{Reg}(\hat f_n; \hat g) + \mathrm{Rem}(\hat g, g_0), \]
where $\mathrm{Rem}(\hat g, g_0)$ is a second-order remainder in the nuisance error (e.g., $\mathrm{Rem}(\hat g, g_0) \lesssim \|\hat g - g_0\|^4$ under appropriate conditions), and $C$ is a problem-dependent constant. Thus, if $\hat f_n$ achieves small regret under the plug-in loss and $\hat g$ converges sufficiently quickly, then $\mathrm{Reg}(\hat f_n; g_0)$ can be controlled without a first-order nuisance penalty. This decomposition is useful because it separates (i) statistical/optimization error for the main learner under the estimated loss from (ii) higher-order error due to nuisance estimation.

To bound the plug-in regret $\mathrm{Reg}(\hat f_n; \hat g)$, one can reduce to a fixed-loss analysis via sample splitting: split the data, estimate $\hat g$ on the first part, and compute $\hat f_n$ on the second using $\ell_{\hat g}$, so that, conditional on the nuisance-training sample (and hence on $\hat g$), standard high-probability ERM regret bounds for deterministic losses (e.g., Theorem 3) apply. To avoid the inefficiency of using only part of the sample at each stage, one can instead use $K$-fold cross-fitting (van der Laan et al., 2011; Chernozhukov et al., 2018): partition $\{1, \dots, n\}$ into folds $I_1, \dots, I_K$, fit $\hat g^{(-k)}$ on $I_k^c$, evaluate $\ell_{\hat g^{(-k)}}$ on $I_k$, and minimize the resulting cross-fitted risk
\[ P_n^{\mathrm{cf}} \ell_{\hat g}(\cdot, f) := \frac{1}{K} \sum_{k=1}^K \frac{1}{|I_k|} \sum_{i \in I_k} \ell_{\hat g^{(-k)}}(Z_i, f), \qquad \hat f_n \in \arg\min_{f \in \mathcal F} P_n^{\mathrm{cf}} \ell_{\hat g}(\cdot, f). \]
Conditional on $\{\hat g^{(-k)}\}_{k=1}^K$, each observation is evaluated under a loss fit on independent data, so the same conditional-independence argument as in sample splitting typically yields comparable ERM regret bounds while using all observations for both nuisance estimation and learning. A code sketch of this construction is given below.
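The following Python sketch mirrors the display above for a generic loss family. It is illustrative only: `fit_nuisance` and `loss` are hypothetical stand-ins for whatever nuisance learner and loss the application specifies.

```python
import numpy as np

def crossfit_risk(Z, fit_nuisance, loss, K=5, seed=0):
    """Return f -> P_n^cf loss_{g_hat}(., f), the K-fold cross-fitted empirical risk.

    Z            : array of n observations
    fit_nuisance : function mapping a subsample of Z to a nuisance fit g_hat (hypothetical)
    loss         : function (g_hat, Z_fold, f) -> per-observation losses (hypothetical)
    """
    n = len(Z)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    # Fit one nuisance per fold, each on the complementary K-1 folds.
    g_hats = [fit_nuisance(Z[np.setdiff1d(np.arange(n), idx)]) for idx in folds]

    def risk(f):
        # Average of per-fold empirical risks, each under its out-of-fold nuisance.
        return np.mean([np.mean(loss(g, Z[idx], f)) for g, idx in zip(g_hats, folds)])

    return risk

# The ERM step then minimizes `risk` over the class F with any optimizer,
# e.g. scipy.optimize.minimize for a parametric class.
```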
5.3 In-sample nuisance estimation: ERM without sample splitting

The previous two sections showed that, in empirical risk minimization with nuisance components, the regret under the true nuisance, $\mathrm{Reg}(\hat f_n; g_0)$, splits into two terms: a nuisance-estimation term and an in-sample regret term $\mathrm{Reg}(\hat f_n; \hat g)$ that arises because the loss depends on the data through $\hat g$. With sample splitting, $\hat g$ is estimated on an independent subsample, so standard regret bounds for fixed losses apply directly. Sample splitting, however, is data-inefficient, and while cross-fitting can recover statistical efficiency, it may be computationally expensive. Moreover, in-sample nuisance estimation is intrinsic to certain calibration (van der Laan et al., 2024b,c) and debiasing procedures (van der Laan et al., 2024a). For example, in the Efficient Plug-in (EP) learning framework of van der Laan et al. (2024a), a plug-in alternative to orthogonal learning, initial cross-fitted nuisance estimators are calibrated post hoc via an in-sample, sieve-based adjustment designed to remove first-order bias in the nuisance components.

In this section, we bound $\mathrm{Reg}(\hat f_n; \hat g)$ when the nuisance is estimated in-sample. The main message is that, for suitably smooth optimization classes $\mathcal F$, ERM without sample splitting can achieve the same rate as with sample splitting, provided the nuisance class $\mathcal G$ satisfies a Donsker-type condition. That said, we generally recommend sample splitting in practice, particularly when using highly adaptive machine learning methods (e.g., neural networks or boosted trees), for which such Donsker-type conditions are unlikely to hold.

Let $\mathcal F$ be a class of candidate functions and $\mathcal G$ a nuisance class. For $g \in \mathcal G$ and $f \in \mathcal F$, let $\ell_g(\cdot, f)$ denote the loss evaluated at nuisance value $g$, and define the population risk $R(f; g) := P \ell_g(\cdot, f)$. Denote the empirical and population risk minimizers under the estimated nuisance $\hat g$ by
\[ \hat f_n := \arg\min_{f \in \mathcal F} P_n \ell_{\hat g}(\cdot, f), \qquad \hat f_0 := \arg\min_{f \in \mathcal F} R(f; \hat g). \]
Our goal is to bound the regret $\mathrm{Reg}(\hat f_n; \hat g) := R(\hat f_n; \hat g) - R(\hat f_0; \hat g)$ with respect to the estimated nuisance $\hat g$. We make no sample-splitting or cross-fitting assumptions on $\hat g$. We impose the following mild conditions on the loss.

B2) (Strong convexity.) The class $\mathcal F$ is convex, and there exists $\kappa \in (0, \infty)$ such that $R(f; \hat g) - R(\hat f_0; \hat g) \ge \kappa \|f - \hat f_0\|^2$ for all $f \in \mathcal F$.

B3) (Product-structured loss.) Assume that the following hold:

(i) (Product structure.) There exist functions $m_1, r_1 : \mathcal Z \times \mathcal F \to \mathbb R$ and $m_2, r_2 : \mathcal Z \times \mathcal G \to \mathbb R$ such that, for all $z \in \mathcal Z$,
\[ \ell_g(z, f) = m_1(z, f)\, m_2(z, g) + r_1(z, f) + r_2(z, g). \]

(ii) (Pointwise Lipschitz.) There exists $L \in (0, \infty)$ such that, for all $z \in \mathcal Z$,
\[ |r_1(z, f_1) - r_1(z, f_2)| \le L |f_1(z) - f_2(z)|, \qquad |m_1(z, f_1) - m_1(z, f_2)| \le L |f_1(z) - f_2(z)|, \]
\[ |m_2(z, g_1) - m_2(z, g_2)| \le L |g_1(z) - g_2(z)|. \]

Condition B2 holds under the sufficient conditions of Lemma 1. Condition B3(ii) ensures that $f$ and $g$ enter the loss $\ell_g(\cdot, f)$ in a pointwise Lipschitz manner. The product structure in Condition B3(i) is satisfied by many losses. For example, it holds for pseudo-outcome regression losses (Rubin and van der Laan, 2006; Yang et al., 2023), such as the DR-learner (van der Laan, 2013; Luedtke and van der Laan, 2016; Kennedy, 2023), which take the form $\ell_g(z, f) = \{g(z) - f(z)\}^2$ for a pseudo-outcome function $g$. In this case, one may set $m_1(z, f) := f(z)$, $m_2(z, g) := -2g(z)$, $r_1(z, f) := f(z)^2$, and $r_2(z, g) := g(z)^2$. The condition also holds for the weighted loss in Section 5.1, $\ell_g(z, f) := g(z) \ell(z, f)$, with $m_1(z, f) := \ell(z, f)$ and $m_2(z, g) := g(z)$. More generally, our analysis extends to losses that can be written as finite sums of terms with this product structure, possibly involving multiple nuisance components, as in the general loss classes studied by van der Laan et al. (2024a).

Our analysis of ERM with in-sample nuisance estimation follows the template of Section 3.3, but uses a refined basic inequality. In particular, starting from (7), we further decompose
\[ R(\hat f_n; \hat g) - R(\hat f_0; \hat g) \le (P_n - P)\{\ell_{\hat g}(\cdot, \hat f_0) - \ell_{\hat g}(\cdot, \hat f_n)\} = (P_n - P)\{\ell_{g_0}(\cdot, \hat f_0) - \ell_{g_0}(\cdot, \hat f_n)\} \]
\[ \qquad + (P_n - P)\big[ \{\ell_{\hat g}(\cdot, \hat f_0) - \ell_{\hat g}(\cdot, \hat f_n)\} - \{\ell_{g_0}(\cdot, \hat f_0) - \ell_{g_0}(\cdot, \hat f_n)\} \big], \]
where we may take $g_0 \in \mathcal G$ to be the true nuisance, or more generally any $L_2$ limit of $\hat g$, i.e., $\|\hat g - g_0\| = o_p(1)$, assuming such a limit exists. The first term is handled exactly as in Section 3.3, since it involves the fixed loss $\ell_{g_0}$.
The second term captures the data-dependence of $\ell_{\hat g}$ and is the main difficulty in our analysis, since $\ell_{\hat g}$ is random when $\hat g$ is estimated in-sample. Nonetheless, the term is localized in both $\hat f_n$ and $\hat g$, and we control it using specialized maximal inequalities that exploit this double localization and its difference structure. In particular, Condition B3(i) shows that it reduces to bounding an empirical inner product of the form
\[ (P_n - P)\big[ \{ m_1(\cdot, \hat f_n) - m_1(\cdot, \hat f_0) \}\{ m_2(\cdot, \hat g) - m_2(\cdot, g_0) \} \big], \]
for which the required concentration tools are developed in Appendix C.

With in-sample nuisance estimation, the regret typically depends on the local complexity of a loss-difference difference class,
\[ \big\{ \{\ell_g(\cdot, f) - \ell_g(\cdot, f')\} - \{\ell_{g_0}(\cdot, f) - \ell_{g_0}(\cdot, f')\} : f, f' \in \mathcal F,\; g \in \mathcal G \big\}, \]
rather than only on the complexity of $\mathcal F$ under the fixed nuisance $g_0$. When $\mathcal G$ is substantially more complex than $\mathcal F$, oracle rates for ERM under $\ell_{g_0}$ may be unattainable. In the worst case, the regret can scale with the square of the larger of the critical radii associated with the classes used to learn $\hat f_n$ and $\hat g$; in particular, if $\hat g$ is an ERM over $\mathcal G$, then $\mathrm{Reg}(\hat f_n; \hat g)$ can be no smaller than the regret of $\hat g$, regardless of whether the loss is orthogonal. Nonetheless, under suitable smoothness assumptions on $\mathcal F$, namely that the supremum norm $\|\cdot\|_\infty$ is controlled by a fractional power of the $L_2(P)$ norm over the pairwise difference class $\mathcal F - \mathcal F$, the sensitivity of ERM to nuisance complexity can be mitigated. Under additional conditions, this yields oracle rates even without sample splitting.

Our main result relies on the following additional conditions.

B4) (Sup-norm relation.) There exist $c_\infty \in (0, \infty)$ and $\beta \ge 1/2$ such that $\|f - f'\|_\infty \le c_\infty \|f - f'\|^{1 - \frac{1}{2\beta}}$ for all $f, f' \in \mathcal F$.

B5) (PAC bound for the nuisance.) For all $\eta \in (0, 1)$, $\|\hat g - g_0\|_{\mathcal G} \le \varepsilon_{\mathrm{nuis}}(n, \eta)$ with probability at least $1 - \eta/2$.

Condition B4 always holds with $\beta = 1/2$ and $c_\infty := 2 \sup_{f \in \mathcal F} \|f\|_\infty$, but for smoothness classes it often holds with $\beta > 1/2$. For finite-dimensional linear models of dimension $p$, one may take $c_\infty \asymp \sqrt p$ and $\beta = \infty$. It holds for reproducing kernel Hilbert spaces with eigenvalue decay $\lambda_j \asymp j^{-2\beta}$ (Mendelson and Neeman, 2010, Lemma 5.1). The embedding also holds for signed convex hulls of suitable bases (van de Geer, 2014, Lemma 2), and for $d$-variate Hölder and Sobolev classes of order $s > d/2$ on bounded domains, where $\beta = s/d$ (see Bibaut et al., 2021, Lemma 4, and Adams and Fournier, 2003; Triebel, 2006). Condition B5 requires that the nuisance estimator satisfy a high-level PAC guarantee; for instance, this follows from Theorem 3 when the nuisance is itself obtained via empirical risk minimization over $\mathcal G$, with $\varepsilon_{\mathrm{nuis}}(n, \eta)$ scaling with the critical radius of $\mathcal G$.

The following theorem characterizes the $L_2(P)$ error of $\hat f_n$ in terms of the critical radius $\delta_{n, \mathcal F}$ of the optimization class $\mathcal F$, as well as the localized Rademacher complexity $\mathcal R_n(\mathcal G, \varepsilon_{\mathrm{nuis}}(n, \eta))$ of the class $\mathcal G$ used to learn $\hat g$. Its proof leverages a local concentration bound for empirical inner products of the form $\{(P_n - P)(fg) : f \in \mathcal F, g \in \mathcal G\}$ (Theorem 12 in Appendix C). We denote by $\mathcal F - \mathcal F$ the pairwise difference class $\{f - f' : f, f' \in \mathcal F\}$, and similarly for $\mathcal G - \mathcal G$.

Theorem 9 ($L_2$-error bounds without sample splitting). Assume Conditions B2–B4.
Let $\delta_{n, \mathcal F} > 0$ and $\delta_{n, \mathcal G} > 0$ satisfy the critical inequalities
\[ \delta_{n, \mathcal F}^2 \gtrsim \mathcal R_n(\mathcal F - \mathcal F, \delta_{n, \mathcal F}), \qquad \delta_{n, \mathcal G}^2 \gtrsim \mathcal R_n(\mathrm{star}(\mathcal G - \mathcal G), \delta_{n, \mathcal G}). \]
Fix $\eta \in (0, 1/2)$ and assume further that
\[ \delta_{n, \mathcal F} \gtrsim M \sqrt{\frac{\log(1/\eta) + \log\log(eMn)}{n}}, \qquad \delta_{n, \mathcal G} \gtrsim \sqrt{\frac{\log(1/\eta) + \log\log(eMn)}{n}}, \]
where $M := 1 + \sup_{g \in \mathcal G} \|g\|_\infty + \sup_{f \in \mathcal F} \|f\|_\infty$. Then, with probability at least $1 - \eta$,
\[ \|\hat f_n - \hat f_0\|^2 \lesssim \delta_{n, \mathcal F}^2 + \big\{ \delta_{n, \mathcal G}^2 + \delta_{n, \mathcal G}\, \varepsilon_{\mathrm{nuis}}(n, \eta) \big\}^{4\beta/(2\beta + 1)}, \]
where the implicit constant depends only on $\kappa$, $L$, $c_\infty$, $\beta$, $\sup_{f \in \mathcal F} \|f\|_\infty$, and $\sup_{g \in \mathcal G} \|g\|_\infty$.

For simplicity, suppose that $\hat g$ is obtained by empirical risk minimization over $\mathcal G$. In this case, Theorem 3 typically yields a PAC guarantee $\varepsilon_{\mathrm{nuis}}(n, \eta) \asymp \delta_{n, \mathcal G}$, where $\delta_{n, \mathcal G}$ is the critical radius of $\mathcal G$. Under this specialization, Theorem 9 makes explicit how the critical radii of $\mathcal F$ and $\mathcal G$ jointly control the $L_2(P)$ error of $\hat f_n$. The leading term $\delta_{n, \mathcal F}^2$ matches the oracle rate one would obtain if the nuisance were known. The additional price of estimating the nuisance in-sample appears through $\{\delta_{n, \mathcal G}^2\}^{4\beta/(2\beta + 1)}$. The Hölder exponent $\beta$ in Condition B4 quantifies how effectively $L_2(P)$ localization translates into sup-norm control, and thus governs how strongly the nuisance terms are attenuated.

More generally, the theorem highlights two regimes. In an oracle regime, if $\delta_{n, \mathcal G} = O\big( \delta_{n, \mathcal F}^{(2\beta + 1)/(4\beta)} \big)$, then the nuisance terms are $O(\delta_{n, \mathcal F}^2)$ and the overall rate matches that of an oracle that knows $g_0$; in particular, $\|\hat f_n - \hat f_0\|^2 \lesssim \delta_{n, \mathcal F}^2$ up to constants. By contrast, if $\delta_{n, \mathcal G}$ exceeds $\delta_{n, \mathcal F}^{(2\beta + 1)/(4\beta)}$, then at least one nuisance term dominates $\delta_{n, \mathcal F}^2$, and the rate is governed by the nuisance complexity.

The following corollary shows that the oracle rate can be attained under the mild Donsker-type requirement $\delta_{n, \mathcal G}^2 = O(n^{-1/2})$ (and likewise $\varepsilon_{\mathrm{nuis}}^2(n, \eta) = O(n^{-1/2})$) for function classes satisfying the critical-radius scaling $\delta_{n, \mathcal F} \asymp n^{-\beta/(2\beta + 1)}$.

Corollary 2 (Sufficient condition for the oracle rate). Fix $\eta \in (0, 1/2)$, and assume the setup and conditions of Theorem 9. Suppose, in addition, that $\delta_{n, \mathcal G}^2 \vee \varepsilon_{\mathrm{nuis}}^2(n, \eta) = O(n^{-1/2})$ and $\delta_{n, \mathcal F} \asymp n^{-\beta/(2\beta + 1)}$ for the same $\beta \ge 1/2$ as in Condition B4. Then, with probability at least $1 - \eta$,
\[ \|\hat f_n - \hat f_0\|^2 \lesssim \delta_{n, \mathcal F}^2 \asymp n^{-2\beta/(2\beta + 1)}, \]
where the implicit constant depends only on $\kappa$, $L$, $c_\infty$, $\beta$, $\sup_{f \in \mathcal F} \|f\|_\infty$, and $\sup_{g \in \mathcal G} \|g\|_\infty$.

When $\delta_{n, \mathcal G} \asymp \varepsilon_{\mathrm{nuis}}(n, \eta)$, Corollary 2 yields a simple sufficient condition for oracle performance without sample splitting. In particular, the oracle rate holds if (i) $\delta_{n, \mathcal G}^2 = O(n^{-1/2})$ and (ii) there exists $\beta > 1/2$ such that the critical radius satisfies $\delta_{n, \mathcal F} \asymp n^{-\beta/(2\beta + 1)}$ and the sup-norm bound in Condition B4 holds with this same $\beta$. Condition (i), equivalently $\delta_{n, \mathcal G} = O(n^{-1/4})$, is a Donsker-type requirement on the nuisance class; for example, it holds whenever $\mathcal R_n(\mathrm{star}(\mathcal G - \mathcal G), \infty) = O(n^{-1/2})$, or, more classically, whenever $J_2(\infty, \mathcal G) < \infty$, which is sufficient for $\mathcal G$ to be Donsker (Van Der Vaart and Wellner, 1996).
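To see why (i) and (ii) combine to give the oracle rate, it is worth writing out the exponent arithmetic in Theorem 9 (a direct substitution, under the specialization $\varepsilon_{\mathrm{nuis}}(n, \eta) \asymp \delta_{n, \mathcal G}$):
\[ \big\{ \delta_{n, \mathcal G}^2 + \delta_{n, \mathcal G}\, \varepsilon_{\mathrm{nuis}}(n, \eta) \big\}^{\frac{4\beta}{2\beta + 1}} \asymp \big\{ \delta_{n, \mathcal G}^2 \big\}^{\frac{4\beta}{2\beta + 1}} = O\big( n^{-\frac{1}{2}} \big)^{\frac{4\beta}{2\beta + 1}} = O\big( n^{-\frac{2\beta}{2\beta + 1}} \big) \asymp \delta_{n, \mathcal F}^2, \]
so the nuisance contribution is absorbed into the oracle term $\delta_{n, \mathcal F}^2$ exactly when the Donsker-type bound $\delta_{n, \mathcal G}^2 = O(n^{-1/2})$ holds.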
The scaling in (ii) is, in turn, implied if the local Rademacher complexity and a local $L_2(P)$-to-$L_\infty$ embedding both hold at the same scale, namely
\[ \sqrt n\, \mathcal R_n(\delta, \mathcal F - \mathcal F) \asymp \delta^{1 - \frac{1}{2\beta}} \qquad \text{and} \qquad \sup_{f \in \mathcal F - \mathcal F :\, \|f\| \le \delta} \|f\|_\infty \lesssim \delta^{1 - \frac{1}{2\beta}}. \tag{11} \]
(The first scaling need only hold for $\delta \gtrsim \delta_{n, \mathcal F}$ to ensure that $\delta_{n, \mathcal F} \asymp n^{-\beta/(2\beta + 1)}$.) It turns out that requirement (ii) holds for many standard function classes. For Hölder and Sobolev classes in dimension $d$ with smoothness $s > d/2$, one typically has (11) with $\beta = s/d$ (see Bibaut et al., 2021, Lemma 4, and Adams and Fournier, 2003; Triebel, 2006). The same scaling also holds for a broad class of reproducing kernel Hilbert spaces (RKHS) whose kernel-operator eigenvalues satisfy $\lambda_j \asymp j^{-2\beta}$. In particular, the sup-norm bound in Condition B4 follows from Lemma 5.1 of Mendelson and Neeman (2010). Likewise, the local Rademacher complexity scales as $\sqrt n\, \mathcal R_n(\delta, \mathcal F - \mathcal F) \asymp \delta^{1 - \frac{1}{2\beta}}$, with critical radius $\delta_{n, \mathcal F} \asymp n^{-\beta/(2\beta + 1)}$ (Wainwright, 2019).

The $n^{-1/4}$-rate condition $\varepsilon_{\mathrm{nuis}}(n, \eta) = O(n^{-1/4})$ imposed by (i), together with the scalings in (ii), also appears in the R-learner analyses of Nie and Wager (2021), which study ERM with cross-fitted nuisance estimation over RKHSs. This suggests that, at least for RKHSs, the requirements for oracle rates are similar with and without cross-fitting, provided that $\mathcal G$ satisfies a Donsker-type condition and that $\delta_{n, \mathcal G} \asymp \varepsilon_{\mathrm{nuis}}(n, \eta)$, as is typical when $\hat g$ is obtained via ERM over $\mathcal G$. Finally, our analysis establishes the oracle rate only up to multiplicative constants; a sharper argument, as in Theorem 5 of van der Laan et al. (2024a), also recovers the oracle constant.

6 Conclusion

The goal of this guide is to provide a self-contained template for analyzing empirical risk minimization via high-probability regret bounds. The central message is that many ERM analyses can be organized around a basic inequality, a uniform local concentration bound, and a fixed-point argument, leading to rates governed by localized complexity and a critical radius. We also summarize how these critical radii can be related to entropy integrals in common settings, and how the same template extends to nuisance-dependent losses, including both sample splitting and in-sample nuisance estimation. We view this document as a compact reference for this proof template, with room to incorporate further examples and variations as these notes evolve.

Acknowledgements. This guide reflects my understanding of empirical risk minimization as it developed while learning the topic, beginning in my master's program. I first encountered ERM through Wainwright (2019) in the theoretical statistics course STAT 210B at UC Berkeley. My understanding of ERM, particularly in settings with nuisance components and without sample splitting, was further shaped during my PhD at the University of Washington, in part through a dissertation project on Efficient Plug-in learning (van der Laan et al., 2024a). I thank my PhD advisors, Marco Carone and Alex Luedtke, whose guidance, together with Chapter 3 of Van Der Vaart and Wellner (1996) and Van Der Vaart and Wellner (2011), helped clarify the structure of ERM analyses. I also thank Aurélien Bibaut and Nathan Kallus, whose collaboration introduced me to uniform local concentration inequalities and PAC-style analyses of ERM.

References

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140. Elsevier, 2003.

András Antos, Csaba Szepesvári, and Rémi Munos. Fitted Q-iteration in continuous action-space MDPs. Advances in Neural Information Processing Systems, 20, 2007.
Peter Bartlett. STAT 210B Spring 2013: Theoretical statistics. Course webpage, University of California, Berkeley, 2013. URL https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/. Accessed 2026-02-22.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Aurélien Bibaut, Maya Petersen, Nikos Vlassis, Maria Dimakopoulou, and Mark van der Laan. Sequential causal inference in a single world of connected units. arXiv preprint arXiv:2101.07380, 2021.

Aurélien F. Bibaut and Mark J. van der Laan. Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244, 2019.

Olivier Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathématique, 334(6):495–500, 2002.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Summer School on Machine Learning, pages 169–207. Springer, 2003.

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.

Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.

Sara A. Geer. Empirical Processes in M-Estimation, volume 6. Cambridge University Press, 2000.

Evarist Giné and Vladimir Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.

Aditya Guntuboyina. Spring 2018 Statistics 210B (Theoretical Statistics): All lectures. Lecture notes, University of California, Berkeley, January 2018. URL https://www.stat.berkeley.edu/~aditya/resources/FullNotes210BSpring2018.pdf. Dated January 16, 2018.

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities. Cambridge University Press, 1952.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R, volume 103. Springer, 2013.

Edward H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.

Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer, 2011.
Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.

Yunwen Lei, Lixin Ding, and Yingzhou Bi. Local Rademacher complexity bounds based on covering numbers. Neurocomputing, 218:320–330, 2016.

Alexander Luedtke and Mark van der Laan. Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12:305–332, 2016. doi: 10.1515/ijb-2015-0052.

Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

Shahar Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.

Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

Rémi Munos. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1006. AAAI Press/MIT Press, 2005.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.

Richard Nickl and Benedikt M. Pötscher. Bracketing metric entropy rates and empirical central limit theorems for function classes of Besov- and Sobolev-type. Journal of Theoretical Probability, 20(2):177–199, 2007.

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

Daniel Rubin and Mark J. van der Laan. Doubly robust censoring unbiased transformations. 2006.

Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In COLT, pages 79–93, 2009.

Kunio Takezawa. Introduction to Nonparametric Regression. John Wiley & Sons, 2005.

Andrej Nikolaevič Tihonov and Vasilij Jakovlevič Arsenin. Solutions of Ill-Posed Problems. Winston, 1977.

Hans Triebel. Theory of Function Spaces III. Springer, 2006.

Sara van de Geer. On the uniform convergence of empirical norms and inner products, with application to causal inference. 2014.

Lars van der Laan and Nathan Kallus. Fitted Q-evaluation without Bellman completeness via stationary weighting. arXiv preprint arXiv:2512.23805, 2025a.

Lars van der Laan and Nathan Kallus. Stationary reweighting yields local convergence of soft fitted Q-iteration. arXiv preprint arXiv:2512.23927, 2025b.

Lars Van Der Laan, Ernesto Ulloa-Pérez, Marco Carone, and Alex Luedtke. Causal isotonic calibration for heterogeneous treatment effects. In International Conference on Machine Learning, pages 34831–34854. PMLR, 2023.

Lars van der Laan, Marco Carone, and Alex Luedtke.
Combining T-learning and DR-learning: A framework for oracle-efficient estimation of causal contrasts. arXiv preprint arXiv:2402.01972, 2024a.

Lars van der Laan, Ziming Lin, Marco Carone, and Alex Luedtke. Stabilized inverse probability weighting via isotonic calibration. arXiv preprint arXiv:2411.06342, 2024b.

Lars van der Laan, Alex Luedtke, and Marco Carone. Doubly robust inference via calibration. arXiv preprint arXiv:2411.02771, 2024c.

Lars van der Laan, Nathan Kallus, and Aurélien Bibaut. Inverse reinforcement learning using just classification and a few regressions. arXiv preprint arXiv:2509.21172, 2025a.

Lars van der Laan, Nathan Kallus, and Aurélien Bibaut. Nonparametric instrumental variable inference with many weak instruments. arXiv preprint arXiv:2505.07729, 2025b.

Mark J. van der Laan. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 317, 2013.

Mark J. Van Der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. 2003.

Mark J. van der Laan, Sherri Rose, Wenjing Zheng, and Mark J. van der Laan. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning: Causal Inference for Observational and Experimental Data, pages 459–474, 2011.

Aad Van Der Vaart and Jon A. Wellner. A local maximal inequality under uniform entropy. Electronic Journal of Statistics, 5:192–203, 2011.

Aad W. Van Der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.

Vladimir N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.

Yachong Yang, Arun Kumar Kuchibhotla, and Eric Tchetgen Tchetgen. Forster–Warmuth counterfactual regression: A unified learning approach. arXiv preprint arXiv:2307.16798, 2023.

Haobo Zhang, Yicheng Li, Weihao Lu, and Qian Lin. Optimal rates of kernel ridge regression under source condition in large dimensions. Journal of Machine Learning Research, 26(219):1–63, 2025.

A Regularized ERM

A.1 Enforcing strong convexity via Tikhonov regularization (ridge)

Strong convexity of the risk is desirable because it yields fast ERM rates over convex classes. In our setting, it also ensures that the Bernstein condition in Condition A1 holds, by Lemma 3. While many risks are strongly convex, some are not. Tikhonov (ridge) regularization is a general strategy for enforcing strong convexity by adding a quadratic penalty (Hoerl and Kennard, 1970; Tihonov and Arsenin, 1977). For simplicity, we take the regularization penalty to be the squared $L_2(P)$ norm. For $\lambda \ge 0$, define the Tikhonov-regularized population and empirical risks
\[ R_\lambda(f) := P\ell(\cdot, f) + \frac{\lambda}{2} \|f\|^2, \qquad R_{n, \lambda}(f) := P_n \ell(\cdot, f) + \frac{\lambda}{2} \|f\|_n^2, \]
where $\|f\|^2 := P(f^2)$ and $\|f\|_n^2 := P_n(f^2)$. Denote the regularized population minimizer and a regularized empirical minimizer by
\[ f_{0, \lambda} \in \arg\min_{f \in \mathcal F} R_\lambda(f), \qquad \hat f_{n, \lambda} \in \arg\min_{f \in \mathcal F} R_{n, \lambda}(f). \]
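For a finite-dimensional linear class, the regularized empirical minimizer has a closed form. The following sketch is a minimal illustration, assuming a linear class $f_\theta(x) = x^\top \theta$ and the squared-error loss, neither of which is required by the general development here:

```python
import numpy as np

def ridge_erm(X, Y, lam):
    """Tikhonov-regularized least squares over the linear class f_theta(x) = x @ theta,
    with the empirical-norm penalty (lam/2) * ||f_theta||_n^2 = (lam/2) * mean((X @ theta)**2).

    The first-order condition gives (1 + lam/2) * (X'X/n) theta = X'Y/n, so the
    L2(P_n)-penalized fit shrinks the unpenalized fit multiplicatively.
    """
    n = len(Y)
    G = X.T @ X / n
    return np.linalg.solve((1 + lam / 2) * G, X.T @ Y / n)

# Balancing estimation error against regularization bias suggests lam ~ delta_n
# (see the discussion below), e.g. lam = n**(-1/(2 + alpha)) for polynomial entropy.
```

Note the design contrast: penalizing the function norm $\|f\|_n$ rescales the unpenalized fit, whereas the more familiar coefficient penalty $\lambda \|\theta\|^2$ would instead give $(G + \lambda I)^{-1} X^\top Y / n$.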
Other penalties are possible (e.g., an RKHS norm for kernel classes or an $\ell_2$ penalty on coefficients for linear classes), and our arguments extend to these settings with minor changes. Suppose that the risk $R$ is convex and $\mathcal F$ is convex. Then $\|\cdot\|^2$ is strongly convex on $\mathcal F$, and the penalty makes $R_\lambda$ strongly convex even when $R$ is not. An argument similar to Lemma 1 yields the quadratic growth bound
\[ R_\lambda(f) - R_\lambda(f_{0, \lambda}) \gtrsim \lambda \|f - f_{0, \lambda}\|^2, \qquad f \in \mathcal F. \]
Consequently, the Bernstein condition in Condition A1 holds for the regularized objective (by Lemma 3), with $f_0$ replaced by $f_{0, \lambda}$ and Bernstein constant $c_{\mathrm{Bern}} \asymp \lambda^{-1}$. Next, apply Theorem 3 to the modified loss $\ell_\lambda(z, f) := \ell(z, f) + \frac{\lambda}{2} f(z)^2$. This yields
\[ R_\lambda(\hat f_{n, \lambda}) - R_\lambda(f_{0, \lambda}) = O_p(\lambda^{-1} \delta_n^2), \]
where $\delta_n$ is the critical radius associated with the loss class $\mathcal F_{\ell, \lambda} := \{\ell_\lambda(\cdot, f) : f \in \mathcal F\}$. The quadratic growth bound implies $\|\hat f_{n, \lambda} - f_{0, \lambda}\| = O_p(\lambda^{-1/2} \delta_n)$.

The radius $\delta_n$ can be obtained as in Section 4.2 by combining entropy integral bounds for $\mathcal F_\ell$ and $\mathcal F$. In particular, by Lipschitz preservation for entropy integrals (e.g., Theorem 5),
\[ J_2(\delta, \mathcal F_{\ell, \lambda}) \lesssim J_2(\delta, \mathcal F_\ell) + J_2\Big( \frac{\delta}{\lambda B}, \mathcal F \Big), \qquad B := \sup_{f \in \mathcal F} \|f\|_\infty. \]
Moreover, if $\lambda B \lesssim 1$, the penalty contribution can be absorbed: writing the penalty class as $\lambda B$ times a rescaled class and using the change of variables $J_2(\delta, c\,\mathcal H) = c\, J_2(\delta/c, \mathcal H) \le J_2(\delta, \mathcal H)$ for $c \le 1$ (by monotonicity of $\varepsilon \mapsto N(\varepsilon, \mathcal H, L_2(Q))$), one obtains $J_2(\delta, \mathcal F_{\ell, \lambda}) \lesssim J_2(\delta, \mathcal F_\ell) + J_2(\delta, \mathcal F)$.

It remains to bound the regularization bias $R_\lambda(f_{0, \lambda}) - R(f_0)$, i.e., the discrepancy between the regularized population minimizer and the unregularized minimizer. Indeed, by the triangle inequality,
\[ R_\lambda(\hat f_{n, \lambda}) - R(f_0) \le \{ R_\lambda(f_{0, \lambda}) - R(f_0) \} + O_p(\lambda^{-1} \delta_n^2). \tag{12} \]

Lemma 7 (Regularization bias bound). Let $f_0 \in \arg\min_{f \in \mathcal F} R(f)$. It holds that
\[ R_\lambda(f_{0, \lambda}) - R(f_0) \le \inf_{f \in \mathcal F} \Big[ \{R(f) - R(f_0)\} + \frac{\lambda}{2} \|f\|^2 \Big]. \]
In particular, if $\|f_0\| < \infty$, then $R_\lambda(f_{0, \lambda}) - R(f_0) \le \frac{\lambda}{2} \|f_0\|^2$.

Substituting the bound $R_\lambda(f_{0, \lambda}) - R(f_0) \le \frac{\lambda}{2} \|f_0\|^2$ from Lemma 7 into (12), we obtain
\[ R_\lambda(\hat f_{n, \lambda}) - R(f_0) \lesssim \lambda + O_p(\lambda^{-1} \delta_n^2). \]
Balancing the two terms suggests taking $\lambda \asymp \delta_n$, which yields $R_\lambda(\hat f_{n, \lambda}) - R(f_0) = O_p(\delta_n)$.

Remark (on the choice of penalty). We use an $L_2(P)$ penalty primarily for expositional convenience: it yields strong convexity directly in the same norm used throughout our analysis. More generally, regularization can improve regret rates by trading off estimation error against the regularization bias $R_\lambda(f_{0, \lambda}) - R(f_0)$; when this bias decays rapidly as $\lambda \downarrow 0$, one may take $\lambda$ smaller and obtain a faster overall rate. The limitation of an $L_2(P)$ penalty is that it does not enforce smoothness beyond $L_2$, and therefore does not deliver the classical gains associated with RKHS or Sobolev regularization in ill-posed problems where the risk itself is not strongly convex. Under stronger penalties and additional smoothness assumptions on $f_0$ (e.g., source conditions), the regularization bias can often be controlled more sharply (Engl et al., 1996; Caponnetto and De Vito, 2007; Steinwart and Christmann, 2008; Steinwart et al., 2009; Zhang et al., 2025).

B Uniform local concentration inequalities for empirical processes

B.1 Notation

We introduce the following notation. Let $Z_1, \dots, Z_n \in \mathcal Z$ be independent random variables. Let $\mathcal F$ be a class of measurable functions $f : \mathcal Z \to \mathbb R$.
For any $f \in \mathcal F$, define
\[ \|f\| := \sqrt{ \frac{1}{n} \sum_{i=1}^n E\{f(Z_i)^2\} }. \]
We define the global and localized Rademacher complexities by
\[ \mathcal R_n(\mathcal F) := E\Big[ \sup_{f \in \mathcal F} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(Z_i) \Big], \qquad \mathcal R_n(\mathcal F, \delta) := E\Big[ \sup_{f \in \mathcal F,\, \|f\| \le \delta} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(Z_i) \Big], \]
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, independent of $Z_1, \dots, Z_n$. We define the critical radius of $\mathcal F$ as the smallest $\delta_n \in (0, \infty)$ such that $\mathcal R_n(\mathcal F, \delta_n) \le \delta_n^2$, that is,
\[ \delta_n := \inf\{ \delta > 0 : \mathcal R_n(\mathcal F, \delta) \le \delta^2 \}. \]
Finally, define
\[ P_n f := \frac{1}{n} \sum_{i=1}^n f(Z_i), \qquad P f := \frac{1}{n} \sum_{i=1}^n E\{f(Z_i)\}, \qquad P_n^\epsilon f := \frac{1}{n} \sum_{i=1}^n \epsilon_i f(Z_i). \]
Let $\|\cdot\| := \|\cdot\|_{L_2(P)}$ and $\|\cdot\|_n := \|\cdot\|_{L_2(P_n)}$ denote the population and empirical $L_2$ norms. We say that a function class $\mathcal F$ is star-shaped (with respect to $0$) if, for every $f \in \mathcal F$ and every $\lambda \in [0, 1]$, we have $\lambda f \in \mathcal F$. We define the star hull of $\mathcal F$ as $\mathrm{star}(\mathcal F) := \{tf : f \in \mathcal F, t \in [0, 1]\}$.
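These quantities are population-level expectations, but they are straightforward to approximate by simulation, which can be a useful sanity check when verifying a critical-radius guess. The Python sketch below is illustrative only (the finite-dimensional linear class and Gaussian design are assumptions); it estimates $\mathcal R_n(\mathcal F, \delta)$ by Monte Carlo and scans for the critical radius.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 200, 5, 200
Z = rng.normal(size=(n, p))

def localized_rademacher(delta):
    """Monte Carlo estimate of R_n(F, delta) for the linear class
    F = {z -> z @ theta : ||theta|| <= 1}, intersected with the L2(P) ball of
    radius delta (for standard Gaussian z, ||f_theta||^2 = ||theta||^2, so the
    localization constraint is ||theta|| <= min(1, delta))."""
    r = min(1.0, delta)
    vals = []
    for _ in range(reps):
        eps = rng.choice([-1.0, 1.0], size=n)
        v = Z.T @ eps / n                   # sup over the ball has a closed form:
        vals.append(r * np.linalg.norm(v))  # sup_{||theta|| <= r} theta @ v = r * ||v||
    return np.mean(vals)

# Scan a grid for the smallest delta with R_n(F, delta) <= delta**2.
grid = np.geomspace(1e-3, 1.0, 60)
delta_n = next(d for d in grid if localized_rademacher(d) <= d**2)
print(delta_n)  # should scale like sqrt(p / n) for this p-dimensional class
```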
B.2 Uniform local concentration for empirical processes

The following theorem is our main result and provides a local maximal inequality. Related inequalities appear in Bartlett et al. (2005) and Wainwright (2019); see also Foster and Syrgkanis (2023, Appendix K, Lemmas 11–14) and van der Laan et al. (2025b, Lemma 11). Our proof largely follows the technique outlined in Section 3.1.4 of Bartlett et al. (2005).

Theorem 10 (Uniform local concentration inequality). Let $\mathcal F$ be star-shaped and satisfy $\sup_{f \in \mathcal F} \|f\|_\infty \le M$. Let $\delta_n > 0$ satisfy the critical radius condition $\mathcal R_n(\mathcal F, \delta_n) \le \delta_n^2$. Then there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal F$,
\[ (P_n - P) f \le C\left( \|f\| \delta_n + \delta_n^2 + (M \vee 1) \|f\| \sqrt{\frac{\log(1/\eta)}{n}} + (M^2 \vee 1) \frac{\log(1/\eta)}{n} \right). \]
In particular, if $\delta_n \gtrsim (M \vee 1) \sqrt{\log(1/\eta)/n}$, then, up to universal constants, $(P_n - P) f \lesssim \|f\| \delta_n + \delta_n^2$.

The proof of the theorem above relies on the following technical lemmas. The first lemma, due to Bousquet (2002), provides a finite-sample concentration inequality for the supremum of an empirical process around its expectation. This result is a Bennett/Bernstein-type refinement of Talagrand's concentration inequality, and its explicit dependence on the envelope and the maximal variance is particularly convenient for localization. For a related bound, see Theorem 3.27 of Wainwright (2019).

Lemma 8 (Bousquet's version of Talagrand's inequality). Let $\mathcal G$ be a class of measurable functions $g : \mathcal O \to \mathbb R$ such that $\sup_{g \in \mathcal G} \|g\|_\infty \le M$ and $\sigma^2 := \sup_{g \in \mathcal G} \mathrm{Var}\{g(O)\} < \infty$. Then there exists a universal constant $c > 0$ such that, for all $u \ge 0$, with probability at least $1 - e^{-u}$,
\[ \sup_{g \in \mathcal G} (P_n - P) g \le E\Big[ \sup_{g \in \mathcal G} (P_n - P) g \Big] + c\left( \sqrt{\frac{u \sigma^2}{n}} + \frac{Mu}{n} \right). \]

The next lemma shows that the expected supremum term $E[\sup_{g \in \mathcal G} (P_n - P) g]$ in Bousquet's inequality can be upper bounded by the (unlocalized) Rademacher complexity of the function class. Rademacher symmetrization is a standard technique, used for example to derive Dudley-type entropy bounds in Van Der Vaart and Wellner (1996) and in the study of Rademacher complexities in Bartlett and Mendelson (2002) (see Lemma 2.3.1 of Van Der Vaart and Wellner, 1996, and Proposition 4.11 of Wainwright, 2019).

Lemma 9 (Rademacher symmetrization bound). For any measurable class of functions $\mathcal G$,
\[ E\Big[ \sup_{g \in \mathcal G} (P_n - P) g \Big] \le 2\, \mathcal R_n(\mathcal G). \]

The following lemma restates Lemma 3.4 of Bartlett et al. (2005).

Lemma 10 (Scaling property of localized Rademacher complexity). Let $\mathcal G$ be star-shaped. Then, for all $t \in [0, 1]$, $\mathcal R_n(\mathcal G, t\delta) \ge t\, \mathcal R_n(\mathcal G, \delta)$, and consequently $\delta \mapsto \mathcal R_n(\mathcal G, \delta)/\delta$ is nonincreasing on $(0, \infty)$.

Proof. Fix $\delta > 0$ and $t \in [0, 1]$. If $\|f\| \le \delta$ and $\mathcal G$ is star-shaped, then $tf \in \mathcal G$ and $\|tf\| = t\|f\| \le t\delta$. Hence the collection $\{tf : f \in \mathcal G, \|f\| \le \delta\}$ is contained in $\{h \in \mathcal G : \|h\| \le t\delta\}$, and therefore
\[ \mathcal R_n(\mathcal G, t\delta) = E\Big[ \sup_{h \in \mathcal G,\, \|h\| \le t\delta} P_n^\epsilon h \Big] \ge E\Big[ \sup_{f \in \mathcal G,\, \|f\| \le \delta} P_n^\epsilon(tf) \Big] = t\, E\Big[ \sup_{f \in \mathcal G,\, \|f\| \le \delta} P_n^\epsilon f \Big] = t\, \mathcal R_n(\mathcal G, \delta). \]
For monotonicity, let $0 < \delta_1 \le \delta_2$ and set $t = \delta_1/\delta_2 \in (0, 1]$. Applying the first claim with $\delta = \delta_2$ gives $\mathcal R_n(\mathcal G, \delta_1) \ge (\delta_1/\delta_2)\, \mathcal R_n(\mathcal G, \delta_2)$. Dividing by $\delta_1$ yields $\mathcal R_n(\mathcal G, \delta_1)/\delta_1 \ge \mathcal R_n(\mathcal G, \delta_2)/\delta_2$, so $\delta \mapsto \mathcal R_n(\mathcal G, \delta)/\delta$ is nonincreasing. □

The next lemma bounds the Rademacher complexity of the self-normalized class $\{f/(\|f\| \vee \delta_n) : f \in \mathcal F\}$, which yields uniform high-probability bounds for ratio-type functionals such as $(P - P_n) f / \|f\|$. This self-normalization (via star-shapedness) is a standard device in the local Rademacher complexity literature; see, e.g., Bartlett et al. (2005) and Mendelson (2002).

Lemma 11 (Rademacher complexity for self-normalized classes). Let $\mathcal F$ be star-shaped. Fix $\delta_n > 0$ and define $\mathcal G := \{ g = f/(\|f\| \vee \delta_n) : f \in \mathcal F \}$. Then
\[ \mathcal R_n(\mathcal G) \le \frac{2}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n). \]

Proof. Define
\[ S(\epsilon) := \sup_{g \in \mathcal G} P_n^\epsilon g = \sup_{f \in \mathcal F} \frac{P_n^\epsilon f}{\|f\| \vee \delta_n}. \]
Let $\mathcal F_\le := \{f \in \mathcal F : \|f\| \le \delta_n\}$ and $\mathcal F_> := \{f \in \mathcal F : \|f\| > \delta_n\}$. Then
\[ S(\epsilon) = \max\Big\{ \sup_{f \in \mathcal F_\le} \frac{P_n^\epsilon f}{\delta_n},\; \sup_{f \in \mathcal F_>} \frac{P_n^\epsilon f}{\|f\|} \Big\} \le A(\epsilon) + B(\epsilon), \]
where
\[ A(\epsilon) := \sup_{f \in \mathcal F : \|f\| \le \delta_n} \frac{P_n^\epsilon f}{\delta_n}, \qquad B(\epsilon) := \sup_{f \in \mathcal F : \|f\| > \delta_n} \frac{P_n^\epsilon f}{\|f\|}. \]
We show that $B(\epsilon) \le A(\epsilon)$. Fix any $f \in \mathcal F$ with $\|f\| > \delta_n$, set $\lambda := \delta_n/\|f\| \in (0, 1)$, and define $h := \lambda f$. By star-shapedness, $h \in \mathcal F$ and $\|h\| = \delta_n$. Moreover,
\[ \frac{P_n^\epsilon f}{\|f\|} = \frac{P_n^\epsilon h}{\delta_n} \le \sup_{u \in \mathcal F : \|u\| \le \delta_n} \frac{P_n^\epsilon u}{\delta_n} = A(\epsilon). \]
Taking the supremum over $f \in \mathcal F$ with $\|f\| > \delta_n$ yields $B(\epsilon) \le A(\epsilon)$, and hence $S(\epsilon) \le 2A(\epsilon)$, that is,
\[ \sup_{g \in \mathcal G} P_n^\epsilon g \le \frac{2}{\delta_n} \sup_{f \in \mathcal F : \|f\| \le \delta_n} P_n^\epsilon f. \]
Taking expectations over $\epsilon$ and over the data completes the proof. □

We now provide the proof of our main result.

Proof of Theorem 10. Define the self-normalized class $\mathcal G := \{g = f/(\|f\| \vee \delta_n) : f \in \mathcal F\}$, where, by definition, $\delta_n$ satisfies the critical inequality $\mathcal R_n(\mathcal F, \delta_n) \le \delta_n^2$. Since $\sup_{f \in \mathcal F} \|f\|_\infty \le M$ and, for any $f \in \mathcal F$,
\[ \left\| \frac{f}{\|f\| \vee \delta_n} \right\| = \frac{\|f\|}{\|f\| \vee \delta_n} \le 1, \]
we have $\sup_{g \in \mathcal G} \|g\|_\infty \le M/\delta_n$ and $\sup_{g \in \mathcal G} \|g\| \le 1$. Hence, by Bousquet's inequality in Lemma 8, there exists a universal constant $c > 0$ such that, for all $u \ge 0$, with probability at least $1 - e^{-u}$,
\[ \sup_{g \in \mathcal G} (P_n - P) g \le E\Big[ \sup_{g \in \mathcal G} (P_n - P) g \Big] + c\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right). \]
By the Rademacher symmetrization bound in Lemma 9, $E[\sup_{g \in \mathcal G} (P_n - P) g] \le 2\, \mathcal R_n(\mathcal G)$, and by Lemma 11, $\mathcal R_n(\mathcal G) \le \frac{2}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n)$. Thus, combining the above, there exists a universal constant $C > 0$ such that, for all $u \ge 0$, with probability at least $1 - e^{-u}$,
\[ \sup_{g \in \mathcal G} (P_n - P) g \le \frac{4}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n) + C\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right). \]
Writing out $\mathcal G$, this means that, for all $u \ge 0$, with probability at least $1 - e^{-u}$,
\[ \sup_{f \in \mathcal F} \frac{(P_n - P) f}{\|f\| \vee \delta_n} \le \frac{4}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n) + C\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right). \]
Thus, on the same event, for every $f \in \mathcal F$,
\[ (P_n - P) f \le (\|f\| \vee \delta_n) \left[ \frac{4}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n) + C\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right) \right]. \]
By definition of the critical radius $\delta_n$, we have $\mathcal R_n(\mathcal F, \delta_n) \le \delta_n^2$, and hence $\frac{4}{\delta_n}\, \mathcal R_n(\mathcal F, \delta_n) \le 4\delta_n$. Therefore, there exists a universal constant $C > 0$ such that, for all $u \ge 0$, with probability at least $1 - e^{-u}$, for every $f \in \mathcal F$,
\[ (P_n - P) f \le (\|f\| \vee \delta_n) \left[ 4\delta_n + C\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right) \right]. \]
Equivalently, absorbing constants into a universal $c \in (0, \infty)$ and using $(\|f\| \vee \delta_n)\delta_n = \|f\|\delta_n \vee \delta_n^2$, we may write
\[ (P_n - P) f \le c\left[ (\|f\|\delta_n \vee \delta_n^2) + (\|f\| \vee \delta_n)\left( \sqrt{\frac{u}{n}} + \frac{Mu}{\delta_n n} \right) \right]. \]
Setting $u = \log(1/\eta)$ and using $(a \vee b) \le a + b$, we obtain that there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal F$,
\[ (P_n - P) f \le C\left[ \delta_n \|f\| + \delta_n^2 + (\|f\| + \delta_n) \sqrt{\frac{\log(1/\eta)}{n}} + (\|f\| + \delta_n) \frac{M \log(1/\eta)}{\delta_n n} \right]. \]
Recall that $\eta \in (0, 1)$ and assume that, up to universal constants, $\delta_n \gtrsim M \sqrt{\log(1/\eta)/n}$. Then
\[ \frac{M \log(1/\eta)}{\delta_n n} = \frac{M}{\delta_n} \cdot \frac{\log(1/\eta)}{n} \lesssim \sqrt{\frac{\log(1/\eta)}{n}}, \]
so the term $\{M \log(1/\eta)/(\delta_n n)\} \|f\|$ can be absorbed into $\|f\| \sqrt{\log(1/\eta)/n}$. Moreover,
\[ \delta_n \sqrt{\frac{\log(1/\eta)}{n}} \gtrsim \frac{M \log(1/\eta)}{n}, \]
so the term $M \log(1/\eta)/n$ is absorbed into $\delta_n \sqrt{\log(1/\eta)/n}$. Regrouping yields that there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal F$,
\[ (P_n - P) f \le C\left[ \|f\| \delta_n + \delta_n^2 + (\|f\| + \delta_n) \sqrt{\frac{\log(1/\eta)}{n}} \right]. \]
Therefore,
\[ (P_n - P) f \le C (\|f\| + \delta_n) \left( \sqrt{\frac{\log(1/\eta)}{n}} + \delta_n \right). \]
Moreover, if $\delta_n \gtrsim (M \vee 1) \sqrt{\log(1/\eta)/n}$, then $(P_n - P) f \lesssim \|f\| \delta_n + \delta_n^2$, which gives the second bound. Finally, if $\tilde\delta_n$ satisfies $\mathcal R_n(\mathcal F, \tilde\delta_n) \le \tilde\delta_n^2$, then $\delta_n := \tilde\delta_n + (M \vee 1) \sqrt{\log(1/\eta)/n}$ satisfies $\delta_n \gtrsim (M \vee 1) \sqrt{\log(1/\eta)/n}$. Hence, applying the bound with this choice of $\delta_n$ and expanding squares, we conclude that, on the same event,
\[ (P_n - P) f \lesssim \|f\| \tilde\delta_n + \tilde\delta_n^2 + (M \vee 1) \|f\| \sqrt{\frac{\log(1/\eta)}{n}} + (M^2 \vee 1) \frac{\log(1/\eta)}{n}, \]
which gives the first bound. □

B.3 Uniform local concentration for Lipschitz transformations

The following result is a direct corollary of Theorem 14.20(b) in Wainwright (2019) (see Lemma 14 of Foster and Syrgkanis, 2023, for a vector-valued extension). The main differences are that we state the bound in PAC form and, rather than imposing a lower-bound condition on the critical radius $\delta_n$, we include the additional term $L^2 \frac{\log\log(Men)}{n}$.

Theorem 11 (Uniform local concentration (Lipschitz transform; PAC form)). Fix $f_0 \in \mathcal F$ and assume the difference class $\mathcal F - f_0 := \{f - f_0 : f \in \mathcal F\}$ is star-shaped. Let $\varphi : \mathcal Z \times \mathcal F \to \mathbb R$ be a map with $\sup_{f \in \mathcal F} \|\varphi(\cdot, f) - \varphi(\cdot, f_0)\|_\infty \le M$ for some $M \in [1, \infty)$. Assume $\varphi$ is pointwise Lipschitz: there exists $L \in (0, \infty)$ such that for all $f, f' \in \mathcal F$,
\[ |\varphi(z, f) - \varphi(z, f')| \le L |f(z) - f'(z)|, \qquad z \in \mathcal Z. \]
Let $\delta_n > 0$ satisfy the critical-radius condition
\[ \mathcal R_n(\mathcal F - f_0, \delta_n) \lesssim \delta_n^2. \tag{13} \]
Then there exists a universal constant $C \in (0, \infty)$ such that for every $\eta \in (0, 1)$, with probability at least $1 - \eta$, the following holds simultaneously for all $f \in \mathcal F$:
\[ (P_n - P)\{\varphi(\cdot, f) - \varphi(\cdot, f_0)\} \le C\bigg[ L \delta_n \|f - f_0\| + L^2 \delta_n^2 + L^2 \frac{\log\log(Men)}{n} \tag{14} \]
\[ \qquad\qquad + L \|f - f_0\| \sqrt{\frac{\log(1/\eta)}{n}} + \frac{M \log(1/\eta)}{n} \bigg]. \tag{15} \]
Remark. A similar bound to Theorem 11 can also be obtained by applying Theorem 10 to the star-shaped hull of the transformed class $\mathcal F_\varphi := \{\varphi(\cdot, f) - \varphi(\cdot, f_0) : f \in \mathcal F\}$. This approach avoids the additional $L^2 \frac{\log\log(Men)}{n}$ term, which is an artifact of the peeling argument. The tradeoff is that Theorem 10 would then be stated in terms of the critical radius of $\mathcal F_\varphi$, rather than that of $\mathcal F$. As shown in Lemma 6 in Section 4.2, this distinction is typically mild once we pass to entropy integrals, since star-shaped hulls inflate covering numbers only slightly. By contrast, the result in Theorem 11 leverages the Lipschitz property of $\varphi$ and star-shapedness of $\mathcal F - f_0$ to express the bound directly in terms of the critical radius of $\mathcal F - f_0$. In most settings, the two approaches lead to comparable ERM guarantees.

B.4 Local maximal inequalities via metric entropy

We restate and prove the local maximal inequalities from Section 4.2. Our proofs rely on the following standard lemma, taken from Theorem 5.22 (see also equation (5.48)) of Wainwright (2019). The lemma follows from a chaining argument combined with sub-Gaussian concentration for Rademacher processes.

Lemma 12 (Dudley entropy bound (Dudley, 1967)). Let $\delta > 0$ and define the empirical radius $\hat\delta_n := \sup\{\|f\|_n : f \in \mathcal G, \|f\| \le \delta\}$. Then
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{12}{\sqrt n}\, E\left[ \int_0^{\hat\delta_n} \sqrt{\log N(\varepsilon, \mathcal G, L_2(P_n))}\, d\varepsilon \right]. \]

Theorem (Local maximal inequality under uniform $L_2$-entropy). Let $\delta > 0$. Define $\delta_\infty := \sup_{f \in \mathcal G(\delta)} \|f\|_\infty$. Then there exists a universal constant $C \in (0, \infty)$ such that
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{C}{\sqrt n}\, J_2(\delta, \mathcal G(\delta)) \left[ 1 + \frac{\delta_\infty}{\sqrt n\, \delta^2}\, J_2(\delta, \mathcal G(\delta)) \right]. \]

Proof of Theorem 6. Dudley's bound in Lemma 12 implies that
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{12}{\sqrt n}\, E\big[ J_2(\hat\delta_n, \mathcal G) \big], \qquad \hat\delta_n := \sup\{ \|f\|_n : f \in \mathcal G, \|f\| \le \delta \}. \]
In the proof of Theorem 2.1 of Van Der Vaart and Wellner (2011), it is shown that the map $t \mapsto J_2(\sqrt t, \mathcal G)$ is concave. Hence, by Jensen's inequality,
\[ E\big[ J_2(\hat\delta_n, \mathcal G) \big] \le J_2(\bar\delta_n, \mathcal G), \qquad \bar\delta_n^2 := E\big[ \hat\delta_n^2 \big]. \]
Thus,
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{12}{\sqrt n}\, J_2(\bar\delta_n, \mathcal G). \tag{16} \]
We now proceed by bounding $\bar\delta_n^2$. Observe that
\[ \bar\delta_n^2 = E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} P_n f^2 \Big] \le E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} (P_n - P) f^2 \Big] + \sup_{f \in \mathcal G : \|f\| \le \delta} P f^2 \le E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} (P_n - P) f^2 \Big] + \delta^2. \]
Furthermore, by Lemma 9,
\[ E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} (P_n - P) f^2 \Big] \le 2\, \mathcal R_n(\mathcal G^2, \delta), \]
where $\mathcal G^2 := \{f^2 : f \in \mathcal G\}$. Next, define $\delta_\infty := \sup\{\|f\|_\infty : f \in \mathcal G, \|f\| \le \delta\}$. The map $t \mapsto t^2$ is $2\delta_\infty$-Lipschitz on $[-\delta_\infty, \delta_\infty]$. Hence, by Talagrand's contraction lemma and (16),
\[ \mathcal R_n(\mathcal G^2, \delta) \le 2\delta_\infty\, \mathcal R_n(\mathcal G, \delta) \le \delta_\infty \frac{24}{\sqrt n}\, J_2(\bar\delta_n, \mathcal G). \]
We conclude that
\[ \bar\delta_n^2 \le \delta^2 + \delta_\infty \frac{24}{\sqrt n}\, J_2(\bar\delta_n, \mathcal G), \]
up to a universal constant that we suppress. As in the proof of Theorem 2.1 of Van Der Vaart and Wellner (2011) (see Equation 2.4), we apply Lemma 2.1 of that same work with $r = 1$, $A = \delta$, and $B^2 = 24\delta_\infty/\sqrt n$ to conclude that
\[ J_2(\bar\delta_n, \mathcal G) \le J_2(\delta, \mathcal G) + \frac{24\, \delta_\infty}{\sqrt n\, \delta^2}\, \{J_2(\delta, \mathcal G)\}^2. \tag{17} \]
Combining the above with (16) yields the claim. □

Theorem (Local maximal inequality under $L_\infty$-entropy). Let $\delta > 0$. Define $\mathcal G(\delta) := \{f \in \mathcal G : \|f\| \le \delta\}$ and $B := 1 + J_\infty(\infty, \mathcal G)$. Then there exists a universal constant $C \in (0, \infty)$ such that
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{CB}{\sqrt n}\, J_\infty(\delta \vee n^{-1/2}, \mathcal G(\delta)). \]

Proof of Theorem 7. The proof of Theorem 6 showed that
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{12}{\sqrt n}\, J_2(\bar\delta_n, \mathcal G), \qquad \bar\delta_n^2 := E[\hat\delta_n^2], \]
where $\hat\delta_n := \sup\{\|f\|_n : f \in \mathcal G, \|f\| \le \delta\}$.
Since $\|\cdot\|_{L_2(Q)} \le \|\cdot\|_\infty$ for any distribution $Q$, it follows that $J_2(\delta, \mathcal G) \le J_\infty(\delta, \mathcal G)$. Thus,
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{12}{\sqrt n}\, J_\infty(\bar\delta_n, \mathcal G). \tag{18} \]
We now proceed by bounding $\bar\delta_n^2$. The proof of Theorem 6 further showed that
\[ \bar\delta_n^2 \le E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} (P_n - P) f^2 \Big] + \delta^2. \]
Let $\mathcal G_\delta := \{f \in \mathcal G : \|f\| \le \delta\}$ and set $\delta_\infty := \sup_{f \in \mathcal G_\delta} \|f\|_\infty$. Since $(P_n - P) f^2 = \|f\|_n^2 - \|f\|^2$, we have
\[ \sup_{f \in \mathcal G_\delta} (P_n - P) f^2 \le \sup_{f \in \mathcal G_\delta} \big| \|f\|_n^2 - \|f\|^2 \big|. \]
Therefore, Theorem 2.1 in van de Geer (2014) (applied with $R = \sup_{f \in \mathcal G_\delta} \|f\| \le \delta$ and $K = \delta_\infty$) yields
\[ E\Big[ \sup_{f \in \mathcal G : \|f\| \le \delta} (P_n - P) f^2 \Big] \le \frac{2\delta}{\sqrt n}\, J_\infty(\delta_\infty, \mathcal G) + \frac{4}{n}\, J_\infty^2(\delta_\infty, \mathcal G). \]
Thus,
\[ \bar\delta_n^2 \le \delta^2 + \frac{2B\delta}{\sqrt n} + \frac{4B^2}{n}, \qquad B := J_\infty(\delta_\infty, \mathcal G). \]
We may assume $\delta \ge n^{-1/2}$ (without loss of generality, after replacing $\delta$ by $\delta \vee n^{-1/2}$), so that $\delta/\sqrt n \le \delta^2$ and $B^2/n \le B^2 \delta^2$. Therefore,
\[ \bar\delta_n^2 \le (1 + 2B)\delta^2 + 4B^2 \delta^2 \le (2 + 8B^2)\delta^2. \]
Combining the previous display with (18), we obtain that there exists a universal constant $C \in (0, \infty)$ such that
\[ \mathcal R_n(\mathcal G, \delta) \le \frac{C}{\sqrt n}\, J_\infty\big( (1 + B)\delta, \mathcal G \big) \le \frac{C(1 + B)}{\sqrt n}\, J_\infty(\delta, \mathcal G), \]
where $B$ is upper bounded by $J_\infty(\infty, \mathcal G)$. The claim follows. □

B.5 Entropy preservation results for star-shaped hulls

The following lemma shows that passing from $\mathcal F$ to its star hull $\mathrm{star}(\mathcal F)$ increases the metric entropy (log covering number) only mildly: specifically,
\[ \log N(\varepsilon, \mathrm{star}(\mathcal F), L_q(Q)) \lesssim \log N(\varepsilon, \mathcal F, L_q(Q)) + \log(1/\varepsilon). \]
See Lemma 8 of Foster and Syrgkanis (2023) and Lemma 4.5 of Mendelson (2002) for related bounds.

Lemma 13 (Covering numbers for Lipschitz images of star hulls in $L_q(Q)$). Let $q \in [1, \infty)$ and let $\varphi : \mathbb R \to \mathbb R$ be $L$-Lipschitz in the sense that $|\varphi(f_1(z)) - \varphi(f_2(z))| \le L |f_1(z) - f_2(z)|$ for all $z \in \mathcal Z$, $f_1, f_2 \in \mathcal F$. Assume $M := \sup_{f \in \mathcal F} \|\varphi \circ f\|_\infty < \infty$. Then for every $\varepsilon > 0$ and every probability measure $Q$,
\[ \log N(\varepsilon, \mathrm{star}(\varphi \circ \mathcal F), L_q(Q)) \le \log N\big( \varepsilon/(4L), \mathcal F, L_q(Q) \big) + \log(1 + 4M/\varepsilon). \]

Corollary 3 (Uniform entropy for Lipschitz images of star hulls). Under the assumptions of Lemma 13 with $q = 2$, for all $\delta > 0$,
\[ J_2(\delta, \mathrm{star}(\varphi \circ \mathcal F)) \lesssim J_2(\delta/L, \mathcal F) + \delta \sqrt{\log(1 + M/\delta)}. \]

Proof of Lemma 13 and Corollary 3. Fix $\varepsilon > 0$ and set $\delta := \varepsilon/4$. Let $\{f_1, \dots, f_N\}$ be a $(\delta/L)$-net for $\mathcal F$ in $L_q(Q)$, where $N = N(\delta/L, \mathcal F, L_q(Q))$. For each $f \in \mathcal F$, choose $j(f) \in \{1, \dots, N\}$ such that $\|f - f_{j(f)}\|_{L_q(Q)} \le \delta/L$. By Lipschitz continuity and monotonicity of $L_q$ norms,
\[ \|\varphi \circ f - \varphi \circ f_{j(f)}\|_{L_q(Q)} = \Big( \int |\varphi(f) - \varphi(f_{j(f)})|^q\, dQ \Big)^{1/q} \le \Big( \int (L |f - f_{j(f)}|)^q\, dQ \Big)^{1/q} = L \|f - f_{j(f)}\|_{L_q(Q)} \le \delta, \]
so $\{\varphi \circ f_1, \dots, \varphi \circ f_N\}$ is a $\delta$-net for $\varphi \circ \mathcal F$ in $L_q(Q)$. Next, discretize $t \in [0, 1]$. Let
\[ \mathcal T := \Big\{ 0, \frac{\delta}{M}, \frac{2\delta}{M}, \dots, \Big\lceil \frac{M}{\delta} \Big\rceil \frac{\delta}{M} \Big\} \cap [0, 1], \]
so that $|\mathcal T| \le 1 + \lceil M/\delta \rceil = 1 + \lceil 4M/\varepsilon \rceil$ and for every $t \in [0, 1]$ there exists $t' \in \mathcal T$ with $|t - t'| \le \delta/M$. Fix $h = t(\varphi \circ f) \in \mathrm{star}(\varphi \circ \mathcal F)$ and let $f' := f_{j(f)}$ and $t' \in \mathcal T$ be chosen as above. Then, using $\|\varphi \circ f'\|_{L_q(Q)} \le \|\varphi \circ f'\|_\infty \le M$,
\[ \|t(\varphi \circ f) - t'(\varphi \circ f')\|_{L_q(Q)} \le \|t(\varphi \circ f) - t(\varphi \circ f')\|_{L_q(Q)} + \|t(\varphi \circ f') - t'(\varphi \circ f')\|_{L_q(Q)} \]
\[ \le t \|\varphi \circ f - \varphi \circ f'\|_{L_q(Q)} + |t - t'|\, \|\varphi \circ f'\|_{L_q(Q)} \le \delta + (\delta/M) M = 2\delta = \varepsilon/2. \]
Thus the set $\{t'(\varphi \circ f_j) : t' \in \mathcal T, j \in \{1, \dots, N\}\}$ is an $\varepsilon/2$-net for $\mathrm{star}(\varphi \circ \mathcal F)$ in $L_q(Q)$, and hence also an $\varepsilon$-net, with cardinality at most $N |\mathcal T|$.
C Uniform local concentration for empirical inner products

C.1 A general bound

The following theorem extends Theorem 11 to function classes defined by pointwise products of two classes. Such product classes arise naturally when controlling the concentration of empirical inner products $P_n(fg)$ around $P(fg)$, as in Appendix C and van de Geer (2014).

Let $\mathcal{F}$ and $\mathcal{G}$ be classes of measurable functions on $\mathcal{Z}$, and assume $\|\mathcal{F}\|_\infty := \sup_{f \in \mathcal{F}} \|f\|_\infty < \infty$ and $\|\mathcal{G}\|_\infty := \sup_{g \in \mathcal{G}} \|g\|_\infty < \infty$. Let $\varphi_1, \varphi_2 : \mathbb{R} \to \mathbb{R}$ be Lipschitz on the ranges of $\mathcal{F}$ and $\mathcal{G}$, respectively: there exist $L_1, L_2 < \infty$ such that
\[ |\varphi_1(x) - \varphi_1(x')| \le L_1 |x - x'| \;\; \forall x, x' \in [-\|\mathcal{F}\|_\infty, \|\mathcal{F}\|_\infty], \qquad |\varphi_2(y) - \varphi_2(y')| \le L_2 |y - y'| \;\; \forall y, y' \in [-\|\mathcal{G}\|_\infty, \|\mathcal{G}\|_\infty]. \]
The following theorem establishes a local concentration inequality for the product-increment class
\[ \big\{ (\varphi_1(f) - \varphi_1(f_0))(\varphi_2(g) - \varphi_2(g_0)) : f, f_0 \in \mathcal{F}, \; g, g_0 \in \mathcal{G} \big\}. \]
The key condition below is a Hölder-type coupling between the supremum and $L_2(P)$ norms on $\mathcal{F} - \mathcal{F}$, namely, $\|u\|_\infty \le c_\infty \|u\|^\alpha$ for all $u \in \mathcal{F} - \mathcal{F}$. One may always take $\alpha = 0$ with $c_\infty := 2 \sup_{f \in \mathcal{F}} \|f\|_\infty$.

Theorem 12 (Uniform local concentration for Lipschitz-transformed increment products). Let $\varphi_1$ and $\varphi_2$ be pointwise Lipschitz continuous on the ranges of $\mathcal{F}$ and $\mathcal{G}$, respectively, with Lipschitz constants $L_1$ and $L_2$. Assume there exist $c_\infty \in (0, \infty)$ and $\alpha \in [0, 1]$ such that $\|u\|_\infty \le c_\infty \|u\|^\alpha$ for all $u \in \mathcal{F} - \mathcal{F}$, and that $M := 1 + \|\mathcal{G}\|_\infty \vee \|\mathcal{F}\|_\infty < \infty$. Define the (star-hull) critical radii
\[ \delta_{n,\mathcal{F}} := \inf\big\{ \delta > 0 : R_n(\mathrm{star}(\mathcal{F} - \mathcal{F}), \delta) \le \delta^2 \big\}, \qquad \delta_{n,\mathcal{G}} := \inf\big\{ \delta > 0 : R_n(\mathrm{star}(\mathcal{G} - \mathcal{G}), \delta) \le \delta^2 \big\}. \]
Then for every $\eta \in (0, 1)$, with probability at least $1 - \eta$, the following holds simultaneously for all $f, f_0 \in \mathcal{F}$ and $g, g_0 \in \mathcal{G}$:
\begin{align*} (P_n - P)\{(\varphi_1(f) - \varphi_1(f_0))(\varphi_2(g) - \varphi_2(g_0))\} &\lesssim L_1 L_2 \|\mathcal{G}\|_\infty \, \delta_{n,\mathcal{F}} \big( \|f - f_0\| \vee \delta_{n,\mathcal{F}} \big) + c_\infty L_1 L_2 \|f - f_0\|^\alpha \, \delta_{n,\mathcal{G}} \big( \|g - g_0\| \vee \delta_{n,\mathcal{G}} \big) \\ &\quad + L_1 L_2 \|\mathcal{G}\|_\infty \|f - f_0\| \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} + c_\infty L_1 L_2 \|f - f_0\|^\alpha \|\mathcal{G}\|_\infty \frac{\log\log(eMn) + \log(1/\eta)}{n}, \end{align*}
where the implicit constant is universal.

Our proof of the theorem relies on the following two lemmas. Fix radii $\delta_1, \delta_2 > 0$ and define the localized classes
\[ \mathcal{F}_\varphi(\delta_1) := \{\varphi_1(f) - \varphi_1(f') : f, f' \in \mathcal{F}, \|f - f'\| \le \delta_1\}, \qquad \mathcal{G}_\varphi(\delta_2) := \{\varphi_2(g) - \varphi_2(g') : g, g' \in \mathcal{G}, \|g - g'\| \le \delta_2\}, \]
and the product class $\mathcal{H}(\delta_1, \delta_2) := \{\tilde{f}\tilde{g} : \tilde{f} \in \mathcal{F}_\varphi(\delta_1), \tilde{g} \in \mathcal{G}_\varphi(\delta_2)\}$.

Lemma 14 (Complexity bounds for a localized product class). Assume $\|\mathcal{F}_\varphi(\delta_1)\|_\infty < \infty$ and $\|\mathcal{G}_\varphi(\delta_2)\|_\infty < \infty$. Then, up to universal constants,
\[ R_n(\mathcal{H}(\delta_1, \delta_2)) \lesssim \|\mathcal{G}_\varphi(\delta_2)\|_\infty L_1 R_n(\mathcal{F} - \mathcal{F}, \delta_1) + \|\mathcal{F}_\varphi(\delta_1)\|_\infty L_2 R_n(\mathcal{G} - \mathcal{G}, \delta_2). \]

Combining Lemma 14 with Bousquet's version of Talagrand's concentration inequality yields the following high-probability bound. Our main result then follows from a peeling argument applied to Lemma 15 below.

Lemma 15 (High-probability bound for empirical inner products). Assume that there exist $c_\infty \in (0, \infty)$ and $\alpha \in [0, 1]$ such that $\|f\|_\infty \le c_\infty \|f\|^\alpha$ for all $f \in \mathcal{F} - \mathcal{F}$. Assume $\|\mathcal{G}\|_\infty \vee \|\mathcal{F}\|_\infty < \infty$.
Then for every $\eta \in (0, 1)$, with probability at least $1 - \eta$,
\[ \sup_{h \in \mathcal{H}(\delta_1, \delta_2)} |(P_n - P)h| \lesssim L_1 L_2 \|\mathcal{G}\|_\infty R_n(\mathcal{F} - \mathcal{F}, \delta_1) + c_\infty L_1 L_2 \delta_1^\alpha R_n(\mathcal{G} - \mathcal{G}, \delta_2) + L_1 L_2 \delta_1 \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta)}{n}} + c_\infty L_1 L_2 \delta_1^\alpha \|\mathcal{G}\|_\infty \frac{\log(1/\eta)}{n}. \tag{19} \]

Our proof outline is as follows. Lemma 14 is the crux of the result; the proof of Lemma 15 then follows from standard arguments.

Proof of Lemma 14. Condition on $Z_{1:n}$. Let $S := \mathcal{F}_\varphi(\delta_1) \times \mathcal{G}_\varphi(\delta_2)$ and, for $s = (\tilde{f}, \tilde{g}) \in S$, define $\psi_i(s) := \tilde{f}(Z_i)\tilde{g}(Z_i)$, $i = 1, \ldots, n$. For any $s = (\tilde{f}, \tilde{g})$ and $s' = (\tilde{f}', \tilde{g}')$ in $S$, we have
\[ |\psi_i(s) - \psi_i(s')| = |\tilde{f}(Z_i)\tilde{g}(Z_i) - \tilde{f}'(Z_i)\tilde{g}'(Z_i)| \le |\tilde{g}(Z_i)| \, |\tilde{f}(Z_i) - \tilde{f}'(Z_i)| + |\tilde{f}'(Z_i)| \, |\tilde{g}(Z_i) - \tilde{g}'(Z_i)| \le \|\mathcal{G}_\varphi(\delta_2)\|_\infty |\tilde{f}(Z_i) - \tilde{f}'(Z_i)| + \|\mathcal{F}_\varphi(\delta_1)\|_\infty |\tilde{g}(Z_i) - \tilde{g}'(Z_i)|. \]
Introduce the 2-vector-valued maps
\[ \phi_i(s) := \big( \|\mathcal{G}_\varphi(\delta_2)\|_\infty \tilde{f}(Z_i), \; \|\mathcal{F}_\varphi(\delta_1)\|_\infty \tilde{g}(Z_i) \big) \in \mathbb{R}^2. \]
Then $|\psi_i(s) - \psi_i(s')| \le \sqrt{2} \, \|\phi_i(s) - \phi_i(s')\|_2$ for all $s, s' \in S$, since $|a| + |b| \le \sqrt{2}\sqrt{a^2 + b^2}$; the factor $\sqrt{2}$ can be absorbed into the universal constant. Therefore, by the vector contraction inequality for Rademacher processes (e.g., Theorem 3 of Maurer (2016)),
\[ \mathbb{E}_\epsilon\Big[ \sup_{(\tilde{f},\tilde{g}) \in S} P_n^\epsilon(\tilde{f}\tilde{g}) \Big] \lesssim \|\mathcal{G}_\varphi(\delta_2)\|_\infty \mathbb{E}_{\epsilon^{(1)}}\Big[ \sup_{\tilde{f} \in \mathcal{F}_\varphi(\delta_1)} P_n^{\epsilon^{(1)}} \tilde{f} \Big] + \|\mathcal{F}_\varphi(\delta_1)\|_\infty \mathbb{E}_{\epsilon^{(2)}}\Big[ \sup_{\tilde{g} \in \mathcal{G}_\varphi(\delta_2)} P_n^{\epsilon^{(2)}} \tilde{g} \Big], \]
where $\epsilon^{(1)}$ and $\epsilon^{(2)}$ are independent i.i.d. Rademacher sequences. Taking expectation over $Z_{1:n}$ yields
\[ R_n(\mathcal{H}(\delta_1, \delta_2)) \lesssim \|\mathcal{G}_\varphi(\delta_2)\|_\infty R_n(\mathcal{F}_\varphi(\delta_1)) + \|\mathcal{F}_\varphi(\delta_1)\|_\infty R_n(\mathcal{G}_\varphi(\delta_2)), \]
as claimed. Moreover, by contraction,
\[ R_n(\mathcal{F}_\varphi(\delta_1)) \le L_1 R_n(\mathcal{F} - \mathcal{F}, \delta_1), \qquad R_n(\mathcal{G}_\varphi(\delta_2)) \le L_2 R_n(\mathcal{G} - \mathcal{G}, \delta_2), \]
where $\mathcal{F} - \mathcal{F} := \{f - f' : f, f' \in \mathcal{F}\}$ and $R_n(\mathcal{F} - \mathcal{F}, \delta_1)$ denotes the localized complexity over $\{u \in \mathcal{F} - \mathcal{F} : \|u\| \le \delta_1\}$ (and analogously for $\mathcal{G}$). □

Proof of Lemma 15. Fix $\delta_1, \delta_2 > 0$ and write $\mathcal{H} := \mathcal{H}(\delta_1, \delta_2)$. Set the envelope $M := \sup_{h \in \mathcal{H}} \|h\|_\infty$ and $\sigma^2 := \sup_{h \in \mathcal{H}} \mathrm{Var}\{h(Z)\} \le \sup_{h \in \mathcal{H}} P h^2$.

Step 1: Apply Bousquet's inequality. Apply Lemma 8 to the class $\mathcal{H}$ with $u := \log(1/\eta)$ to obtain that, with probability at least $1 - \eta$,
\[ \sup_{h \in \mathcal{H}} (P_n - P)h \le \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} (P_n - P)h \Big] + c \left( \sqrt{\frac{\log(1/\eta)\,\sigma^2}{n}} + \frac{M \log(1/\eta)}{n} \right), \tag{20} \]
for a universal constant $c > 0$. Applying the same bound to the class $-\mathcal{H}$ and combining the two displays yields
\[ \sup_{h \in \mathcal{H}} |(P_n - P)h| \lesssim \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} |(P_n - P)h| \Big] + \sqrt{\frac{\log(1/\eta)\,\sigma^2}{n}} + \frac{M \log(1/\eta)}{n}. \tag{21} \]

Step 2: Bound the expectation by the Rademacher complexity. By symmetrization, $\mathbb{E}[\sup_{h \in \mathcal{H}} |(P_n - P)h|] \lesssim R_n(\mathcal{H})$. Combining with (21) gives
\[ \sup_{h \in \mathcal{H}} |(P_n - P)h| \lesssim R_n(\mathcal{H}) + \sqrt{\frac{\log(1/\eta)\,\sigma^2}{n}} + \frac{M \log(1/\eta)}{n}. \tag{22} \]

Step 3: Bound $R_n(\mathcal{H})$ using Lemma 14. Lemma 14 yields
\[ R_n(\mathcal{H}) \lesssim \|\mathcal{G}_\varphi(\delta_2)\|_\infty L_1 R_n(\mathcal{F} - \mathcal{F}, \delta_1) + \|\mathcal{F}_\varphi(\delta_1)\|_\infty L_2 R_n(\mathcal{G} - \mathcal{G}, \delta_2). \tag{23} \]
Moreover, since $\sup_{g \in \mathcal{G}} \|g\|_\infty \le \|\mathcal{G}\|_\infty$ and $\varphi_2$ is $L_2$-Lipschitz,
\[ \|\mathcal{G}_\varphi(\delta_2)\|_\infty = \sup_{g, g' \in \mathcal{G},\, \|g - g'\| \le \delta_2} \|\varphi_2(g) - \varphi_2(g')\|_\infty \le L_2 \sup_{g, g' \in \mathcal{G},\, \|g - g'\| \le \delta_2} \|g - g'\|_\infty \le 2 L_2 \|\mathcal{G}\|_\infty. \]
Similarly, by Lipschitzness of $\varphi_1$ and the assumed local embedding on $\mathcal{F} - \mathcal{F}$,
\[ \|\mathcal{F}_\varphi(\delta_1)\|_\infty = \sup_{f, f' \in \mathcal{F},\, \|f - f'\| \le \delta_1} \|\varphi_1(f) - \varphi_1(f')\|_\infty \le L_1 \sup_{u \in \mathcal{F} - \mathcal{F},\, \|u\| \le \delta_1} \|u\|_\infty \le c_\infty L_1 \delta_1^\alpha. \]
Substituting these two bounds into (23) yields
\[ R_n(\mathcal{H}) \lesssim L_1 L_2 \|\mathcal{G}\|_\infty R_n(\mathcal{F} - \mathcal{F}, \delta_1) + c_\infty L_1 L_2 \delta_1^\alpha R_n(\mathcal{G} - \mathcal{G}, \delta_2). \tag{24} \]

Step 4: Bound the variance proxy $\sigma^2$. For $h = \tilde{f}\tilde{g} \in \mathcal{H}$, we have
\[ P h^2 = P \tilde{f}^2 \tilde{g}^2 \le \|\tilde{g}\|_\infty^2 P[\tilde{f}^2] \le \|\mathcal{G}_\varphi(\delta_2)\|_\infty^2 \sup_{\tilde{f} \in \mathcal{F}_\varphi(\delta_1)} \|\tilde{f}\|^2. \]
By Lipschitzness, for any $\tilde{f} = \varphi_1(f) - \varphi_1(f')$ with $\|f - f'\| \le \delta_1$, $\|\tilde{f}\| \le L_1 \|f - f'\| \le L_1 \delta_1$, so $\sup_{\tilde{f} \in \mathcal{F}_\varphi(\delta_1)} \|\tilde{f}\| \le L_1 \delta_1$. Using also $\|\mathcal{G}_\varphi(\delta_2)\|_\infty \le 2 L_2 \|\mathcal{G}\|_\infty$, we obtain $\sigma^2 \le \sup_{h \in \mathcal{H}} P h^2 \lesssim L_1^2 L_2^2 \delta_1^2 \|\mathcal{G}\|_\infty^2$, and therefore
\[ \sqrt{\frac{\log(1/\eta)\,\sigma^2}{n}} \lesssim L_1 L_2 \delta_1 \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta)}{n}}. \tag{25} \]

Step 5: Bound the envelope $M$. For $h = \tilde{f}\tilde{g} \in \mathcal{H}$, $\|h\|_\infty \le \|\tilde{f}\|_\infty \|\tilde{g}\|_\infty \le \|\mathcal{F}_\varphi(\delta_1)\|_\infty \|\mathcal{G}_\varphi(\delta_2)\|_\infty$. Using $\|\mathcal{F}_\varphi(\delta_1)\|_\infty \le c_\infty L_1 \delta_1^\alpha$ and $\|\mathcal{G}_\varphi(\delta_2)\|_\infty \le 2 L_2 \|\mathcal{G}\|_\infty$ gives $M \lesssim c_\infty L_1 L_2 \delta_1^\alpha \|\mathcal{G}\|_\infty$, hence
\[ \frac{M \log(1/\eta)}{n} \lesssim c_\infty L_1 L_2 \delta_1^\alpha \|\mathcal{G}\|_\infty \frac{\log(1/\eta)}{n}. \tag{26} \]

Conclusion. Substituting (24), (25), and (26) into (22) yields (19). □
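To see Lemma 14 in action, the following Monte Carlo sketch estimates the Rademacher complexity of a localized product class $\mathcal{H}(\delta_1, \delta_2)$ on finite surrogate classes and compares it with the two-term bound. Here $\varphi_1 = \varphi_2 = \tanh$ (so $L_1 = L_2 = 1$), the classes are small random collections, and the localization uses the empirical norm as a stand-in for the $L_2(P)$ norm; everything is illustrative and the comparison is only up to the lemma's universal constants.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_mc = 100, 200                       # sample size, Monte Carlo draws
base_F = 0.5 * rng.normal(size=(12, n))  # finite surrogate classes F and G,
base_G = 0.5 * rng.normal(size=(12, n))  # as vectors of values on the sample

def increments(base, phi, delta):
    """Localized increment class {phi(f) - phi(f') : ||f - f'||_n <= delta}."""
    out = []
    for f in base:
        for fp in base:
            if np.sqrt(np.mean((f - fp) ** 2)) <= delta:
                out.append(phi(f) - phi(fp))
    return np.array(out)

def rademacher(cls):
    """Monte Carlo estimate of R_n over a finite class (sup of P_n^eps h)."""
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    return float(np.mean(np.max(cls @ eps.T / n, axis=0)))

d1 = d2 = 0.7
F_phi = increments(base_F, np.tanh, d1)      # F_phi(delta_1), L1 = 1
G_phi = increments(base_G, np.tanh, d2)      # G_phi(delta_2), L2 = 1
H = np.array([f * g for f in F_phi for g in G_phi])

lhs = rademacher(H)
rhs = (np.max(np.abs(G_phi)) * rademacher(increments(base_F, lambda x: x, d1))
       + np.max(np.abs(F_phi)) * rademacher(increments(base_G, lambda x: x, d2)))
print(f"R_n(H) ~ {lhs:.4f}   vs   two-term bound (up to constants) ~ {rhs:.4f}")
```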
Proof of Theorem 12. Our proof combines Lemma 15 with a standard peeling argument, following the proof of Lemma 14 in Foster and Syrgkanis (2023) and the proof of Theorem 14.20 in Wainwright (2019). Set $M := 1 + \|\mathcal{G}\|_\infty \vee \|\mathcal{F}\|_\infty$. Since $\|f - f_0\| \le \|f - f_0\|_\infty \le 2\|\mathcal{F}\|_\infty \le 2M$ and similarly $\|g - g_0\| \le 2\|\mathcal{G}\|_\infty \le 2M$, it suffices to consider radii in $(0, 2M]$.

Step 1 (dyadic peeling and union bound). Let
\[ J := \big\lceil \log_2(2M\sqrt{n}) \big\rceil_+, \qquad \delta_j := 2^j n^{-1/2} \;\; (j = 0, \ldots, J), \qquad \eta_{j,k} := \frac{\eta}{c_0 (j+1)^2 (k+1)^2}, \]
where $c_0 > 0$ is chosen so that $\sum_{j,k=0}^{J} \eta_{j,k} \le \eta$. By Lemma 10, the map $\delta \mapsto R_n(\mathrm{star}(\mathcal{A}), \delta)/\delta$ is nonincreasing. Hence, if $\delta_{n,\mathcal{A}} > 0$ satisfies the critical inequality $R_n(\mathrm{star}(\mathcal{A}), \delta_{n,\mathcal{A}}) \le \delta_{n,\mathcal{A}}^2$, then for all $\delta > 0$,
\[ R_n(\mathrm{star}(\mathcal{A}), \delta) \lesssim \delta_{n,\mathcal{A}} \, (\delta \vee \delta_{n,\mathcal{A}}). \tag{27} \]
Apply Lemma 15 with $(\delta_1, \delta_2) = (\delta_j, \delta_k)$ and confidence level $\eta_{j,k}$, and use (27) with $\mathcal{A} = \mathcal{F} - \mathcal{F}$ and $\mathcal{A} = \mathcal{G} - \mathcal{G}$. A union bound over $(j, k) \in \{0, \ldots, J\}^2$ yields an event $E$ with $\Pr(E) \ge 1 - \eta$ such that, on $E$, for all $0 \le j, k \le J$,
\begin{align*} \sup_{h \in \mathcal{H}(\delta_j, \delta_k)} |(P_n - P)h| &\lesssim L_1 L_2 \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} (\delta_j \vee \delta_{n,\mathcal{F}}) + c_\infty L_1 L_2 \delta_j^\alpha \delta_{n,\mathcal{G}} (\delta_k \vee \delta_{n,\mathcal{G}}) \\ &\quad + L_1 L_2 \delta_j \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta_{j,k})}{n}} + c_\infty L_1 L_2 \delta_j^\alpha \|\mathcal{G}\|_\infty \frac{\log(1/\eta_{j,k})}{n}. \tag{28} \end{align*}

Step 2 (extend off the grid). Fix $\delta_1, \delta_2 \in (0, 2M]$ and choose $j, k$ so that $\delta_{j-1} < \delta_1 \le \delta_j$ and $\delta_{k-1} < \delta_2 \le \delta_k$ (with $\delta_{-1} = 0$). Then $\mathcal{H}(\delta_1, \delta_2) \subseteq \mathcal{H}(\delta_j, \delta_k)$ and $\delta_j \le 2\delta_1$, $\delta_k \le 2\delta_2$, so
\[ \delta_{n,\mathcal{F}}(\delta_j \vee \delta_{n,\mathcal{F}}) \lesssim \delta_{n,\mathcal{F}}(\delta_1 \vee \delta_{n,\mathcal{F}}), \qquad \delta_{n,\mathcal{G}}(\delta_k \vee \delta_{n,\mathcal{G}}) \lesssim \delta_{n,\mathcal{G}}(\delta_2 \vee \delta_{n,\mathcal{G}}). \]
Moreover, $\log(1/\eta_{j,k}) = \log(1/\eta) + O(\log(j+1) + \log(k+1)) \le \log(1/\eta) + O(\log(J+1))$, and $\log(J+1) \lesssim \log\log(eMn)$. Also, since $\alpha \in [0, 1]$ and $\delta_j \le 2\delta_1$, we have $\delta_j^\alpha \le 2^\alpha \delta_1^\alpha \lesssim \delta_1^\alpha$. Substituting these bounds into (28) shows that, on $E$, for all $\delta_1, \delta_2 > 0$,
\begin{align} \sup_{h \in \mathcal{H}(\delta_1, \delta_2)} |(P_n - P)h| &\lesssim L_1 L_2 \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}(\delta_1 \vee \delta_{n,\mathcal{F}}) + c_\infty L_1 L_2 \delta_1^\alpha \delta_{n,\mathcal{G}}(\delta_2 \vee \delta_{n,\mathcal{G}}) + L_1 L_2 \delta_1 \|\mathcal{G}\|_\infty \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} \tag{29} \\ &\quad + c_\infty L_1 L_2 \delta_1^\alpha \|\mathcal{G}\|_\infty \frac{\log\log(eMn) + \log(1/\eta)}{n}. \tag{30} \end{align}

Step 3 (plug in increments). For fixed $f, f_0 \in \mathcal{F}$ and $g, g_0 \in \mathcal{G}$, set $\delta_1 = \|f - f_0\|$ and $\delta_2 = \|g - g_0\|$. Then $(\varphi_1(f) - \varphi_1(f_0))(\varphi_2(g) - \varphi_2(g_0)) \in \mathcal{H}(\delta_1, \delta_2)$, so (29) yields the claim. □
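The bookkeeping in Step 1 is easy to check numerically. The sketch below builds the dyadic grid and the confidence allocation for illustrative values of $n$, $M$, and $\eta$, verifying that $\sum_{j,k} \eta_{j,k} \le \eta$ with $c_0 = (\pi^2/6)^2$, and that the worst-case penalty $\log(1/\eta_{j,k})$ exceeds $\log(1/\eta)$ only by an $O(\log\log n)$-type term, which is exactly the $\log\log(eMn)$ inflation appearing in (29).

```python
import numpy as np

n, M, eta = 10_000, 2.0, 0.05
J = max(int(np.ceil(np.log2(2 * M * np.sqrt(n)))), 0)
deltas = 2.0 ** np.arange(J + 1) / np.sqrt(n)        # delta_j = 2^j n^{-1/2}

# sum_{j,k >= 0} (j+1)^{-2} (k+1)^{-2} = (pi^2/6)^2, so this c0 suffices.
c0 = (np.pi ** 2 / 6) ** 2
jj, kk = np.meshgrid(np.arange(J + 1), np.arange(J + 1))
eta_jk = eta / (c0 * (jj + 1.0) ** 2 * (kk + 1.0) ** 2)

print(f"J = {J}, grid spans [{deltas[0]:.4f}, {deltas[-1]:.2f}]")
print(f"sum eta_jk = {eta_jk.sum():.5f}  (<= eta = {eta})")
penalty = np.log(1.0 / eta_jk).max() - np.log(1.0 / eta)
print(f"extra log penalty = {penalty:.2f}  vs  4 log(J+1) + log(c0) = "
      f"{4 * np.log(J + 1) + np.log(c0):.2f}")
```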
C.2 Local maximal inequality via sup-norm metric entropy

The local maximal inequality for the Rademacher complexity in Lemma 14 can be sharpened if one is willing to work with sup-norm entropy integrals for $\mathcal{F}$. Let $\mathcal{F}$ and $\mathcal{G}$ be two (possibly localized) function classes, and define the product class $\mathcal{H} := \{fg : f \in \mathcal{F}, g \in \mathcal{G}\}$. Write $\|\mathcal{F}\|$ and $\|\mathcal{F}\|_\infty$ for the $L_2(P)$ and $L_\infty(P)$ radii of $\mathcal{F}$, respectively, and define $\|\mathcal{G}\|$ and $\|\mathcal{G}\|_\infty$ analogously. The following maximal inequality is most useful when $\mathcal{F}$ and $\mathcal{G}$ are suitably localized. The setting of the previous section is recovered by taking $\tilde{\mathcal{F}} := \mathcal{F}_\varphi(\delta_1)$ and $\tilde{\mathcal{G}} := \mathcal{G}_\varphi(\delta_2)$. The proof follows by modifying the argument of Theorem 2.1 in Van Der Vaart and Wellner (2011); in place of their step corresponding to Equation (2.2), we argue as in the proof of Theorem 3.1 in van de Geer (2014).

Theorem 13 (Local maximal inequality for inner product classes). Suppose that $C_\mathcal{F} := J_\infty(\|\mathcal{F}\|_\infty, \mathcal{F}) < \infty$ and $\|\mathcal{G}\|_\infty < \infty$. For any $\delta > 0$, it holds that
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \left\{ \|\mathcal{F}\|_\infty J_2(\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}, \mathcal{G}) + (\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}) \, J_\infty\!\left( \frac{\|\mathcal{G}\|_\infty}{\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}} \Big( \|\mathcal{F}\| + \frac{C_\mathcal{F}}{\sqrt{n}} \Big), \mathcal{F} \right) \right\}. \]
Consequently,
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \Big\{ \|\mathcal{F}\|_\infty J_2(\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}, \mathcal{G}) + \|\mathcal{G}\|_\infty J_\infty(\|\mathcal{F}\| \vee n^{-1/2}, \mathcal{F}) \Big\}, \]
where the implicit constant depends only on $C_\mathcal{F}$.

Proof of Theorem 13. Using Dudley's inequality (Lemma 12), we find that
\[ R_n(\mathcal{H}) \le \frac{12}{\sqrt{n}} \, \mathbb{E}\left[ \int_0^{\|\mathcal{H}\|_n} \sqrt{\log N(\varepsilon, \mathcal{H}, L_2(P_n))} \, d\varepsilon \right]. \]
We next relate the covering number $N(\varepsilon, \mathcal{H}, L_2(P_n))$ of $\mathcal{H}$ to appropriate covering numbers of $\mathcal{F}$ and $\mathcal{G}$. For $f_1, f_2 \in \mathcal{F}$ and $g_1, g_2 \in \mathcal{G}$, note the basic identity
\[ f_1 g_1 - f_2 g_2 = f_1(g_1 - g_2) + g_2(f_1 - f_2). \]
Taking the $L_2(P_n)$ norm of both sides and applying the triangle inequality, we find that
\[ \|f_1 g_1 - f_2 g_2\|_n \le \|f_1(g_1 - g_2)\|_n + \|g_2(f_1 - f_2)\|_n \le \|f_1\|_\infty \|g_1 - g_2\|_n + \|g_2\|_n \|f_1 - f_2\|_\infty \le \|\mathcal{F}\|_\infty \|g_1 - g_2\|_n + \|\mathcal{G}\|_n \|f_1 - f_2\|_\infty, \]
where $\|\mathcal{G}\|_n := \sup_{g \in \mathcal{G}} \|g\|_n$. Thus, for every $\varepsilon > 0$,
\[ \log N(\varepsilon, \mathcal{H}, L_2(P_n)) \le \log N\Big( \tfrac{\varepsilon}{2\|\mathcal{F}\|_\infty}, \mathcal{G}, L_2(P_n) \Big) + \log N\Big( \tfrac{\varepsilon}{2\|\mathcal{G}\|_n}, \mathcal{F}, L_\infty \Big) \le \sup_Q \log N\Big( \tfrac{\varepsilon}{2\|\mathcal{F}\|_\infty}, \mathcal{G}, L_2(Q) \Big) + \log N\Big( \tfrac{\varepsilon}{2\|\mathcal{G}\|_n}, \mathcal{F}, L_\infty \Big), \]
where the supremum is taken over all discrete distributions $Q$ supported on the support of $P$. Substituting this covering bound into Dudley's inequality, using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, and making the change of variables $u = \varepsilon/(2\|\mathcal{F}\|_\infty)$ in the first resulting integral and $v = \varepsilon/(2\|\mathcal{G}\|_n)$ in the second, we obtain
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \left\{ \|\mathcal{F}\|_\infty \, \mathbb{E}\Big[ J_2\Big( \tfrac{\|\mathcal{H}\|_n}{2\|\mathcal{F}\|_\infty}, \mathcal{G} \Big) \Big] + \mathbb{E}\Big[ \|\mathcal{G}\|_n \, J_\infty\Big( \tfrac{\|\mathcal{H}\|_n}{2\|\mathcal{G}\|_n}, \mathcal{F} \Big) \Big] \right\}, \]
where $\|\mathcal{H}\|_n := \sup_{h \in \mathcal{H}} \|h\|_n$. Using the bounds $\|\mathcal{H}\|_n \le \|\mathcal{F}\|_\infty \|\mathcal{G}\|_n$ and $\|\mathcal{H}\|_n \le \|\mathcal{F}\|_n \|\mathcal{G}\|_\infty$, and the fact that the entropy integrals are nondecreasing in their first argument, we obtain
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \left\{ \|\mathcal{F}\|_\infty \, \mathbb{E}\big[ J_2(\|\mathcal{G}\|_n, \mathcal{G}) \big] + \mathbb{E}\Big[ \|\mathcal{G}\|_n \, J_\infty\Big( \tfrac{\|\mathcal{F}\|_n \|\mathcal{G}\|_\infty}{\|\mathcal{G}\|_n}, \mathcal{F} \Big) \Big] \right\}. \]
As in the proof of Theorem 2.1 of Van Der Vaart and Wellner (2011), concavity of the maps
\[ (x, y) \mapsto \sqrt{y} \, J_\infty\big( \sqrt{x/y}, \mathcal{F} \big) \quad \text{and} \quad (x, y) \mapsto \sqrt{y} \, J_2\big( \sqrt{x/y}, \mathcal{G} \big), \]
together with Jensen's inequality, yields
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \left\{ \|\mathcal{F}\|_\infty J_2\big( (\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2}, \mathcal{G} \big) + (\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2} \, J_\infty\!\left( \frac{(\mathbb{E}\|\mathcal{F}\|_n^2)^{1/2} \|\mathcal{G}\|_\infty}{(\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2}}, \mathcal{F} \right) \right\}. \]
To complete the proof, we apply Theorems 2.1 and 2.2 of van de Geer (2014), which respectively bound $\mathbb{E}\|\mathcal{F}\|_n^2$ and $\mathbb{E}\|\mathcal{G}\|_n^2$. In particular, these results imply
\[ \mathbb{E}\|\mathcal{F}\|_n^2 \lesssim \|\mathcal{F}\|^2 + \frac{1}{n} J_\infty^2(\|\mathcal{F}\|_\infty, \mathcal{F}); \qquad \mathbb{E}\|\mathcal{G}\|_n^2 \lesssim \|\mathcal{G}\|^2 + \delta_{n,\mathcal{G}}^2, \]
where $\delta_{n,\mathcal{G}}$ is any solution of the critical inequality $J_2(\delta_{n,\mathcal{G}}, \mathcal{G}) \lesssim \sqrt{n}\,\delta_{n,\mathcal{G}}^2$. Hence,
\[ J_2\big( (\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2}, \mathcal{G} \big) \lesssim J_2(\|\mathcal{G}\|, \mathcal{G}) + \sqrt{n}\,\delta_{n,\mathcal{G}}^2, \]
and
\[ (\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2} \, J_\infty\!\left( \frac{(\mathbb{E}\|\mathcal{F}\|_n^2)^{1/2} \|\mathcal{G}\|_\infty}{(\mathbb{E}\|\mathcal{G}\|_n^2)^{1/2}}, \mathcal{F} \right) \lesssim (\|\mathcal{G}\| + \delta_{n,\mathcal{G}}) \, J_\infty\!\left( \frac{\|\mathcal{G}\|_\infty}{\|\mathcal{G}\| + \delta_{n,\mathcal{G}}} \Big( \|\mathcal{F}\| + \frac{1}{\sqrt{n}} J_\infty(\|\mathcal{F}\|_\infty, \mathcal{F}) \Big), \mathcal{F} \right). \]
Combining these bounds yields
\[ R_n(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \left\{ \|\mathcal{F}\|_\infty J_2(\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}, \mathcal{G}) + (\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}) \, J_\infty\!\left( \frac{\|\mathcal{G}\|_\infty}{\|\mathcal{G}\| \vee \delta_{n,\mathcal{G}}} \Big( \|\mathcal{F}\| + \frac{1}{\sqrt{n}} J_\infty(\|\mathcal{F}\|_\infty, \mathcal{F}) \Big), \mathcal{F} \right) \right\}. \quad \square \]

C.3 A specialized bound for star-shaped classes

Let $\mathcal{F}$ and $\mathcal{G}$ be classes of measurable functions. In this section, we study the empirical inner-product process $\{(P_n - P)(fg) : f \in \mathcal{F}, g \in \mathcal{G}\}$. This is a special case of the setup in Appendix C.1, in which $\varphi_1$ and $\varphi_2$ are the identity maps. The following theorem extends Theorem 10 to pointwise product classes of the form $\mathcal{H}$. It is typically most useful when $\mathcal{G}$ is suitably localized. Refined local maximal inequalities and local concentration bounds based on the sup-norm entropy integral $J_\infty(\delta, \mathcal{F})$ are provided in Appendix C.2.

Theorem 14 (Uniform local concentration bound for empirical inner products). Let $\mathcal{F}$ be star-shaped, and let $\mathcal{G}$ satisfy $\sup_{g \in \mathcal{G}} \|g\|_\infty \le \|\mathcal{G}\|_\infty$. Assume that there exist constants $c_\infty > 0$ and $\alpha \in [0, 1]$ such that $\|f\|_\infty \le c_\infty \|f\|^\alpha$ for all $f \in \mathcal{F}$. Let $\delta_{n,\mathcal{F}} > 0$ satisfy the critical radius condition $R_n(\mathcal{F}, \delta_{n,\mathcal{F}}) \le \delta_{n,\mathcal{F}}^2$. For all $\eta \in (0, 1)$, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal{F}$ and $g \in \mathcal{G}$,
\[ (P_n - P)(fg) \le (\|f\| \vee \delta_{n,\mathcal{F}}) \, C \left[ \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c_\infty \delta_{n,\mathcal{F}}^{\alpha - 1} R_n(\mathcal{G}) + \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta)}{n}} + c_\infty \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}^{\alpha - 1} \frac{\log(1/\eta)}{n} \right]. \]
Consequently, if $\delta_{n,\mathcal{F}} > \sqrt{\log(1/\eta)/n}$ and $\{R_n(\mathcal{G})\}^{1/2} > \sqrt{\log(1/\eta)/n}$, then
\[ (P_n - P)(fg) \lesssim (\|f\| \vee \delta_{n,\mathcal{F}}) \Big[ \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c_\infty (1 \vee \|\mathcal{G}\|_\infty) \delta_{n,\mathcal{F}}^{\alpha - 1} R_n(\mathcal{G}) \Big]. \]

Remark. A result similar to Theorem 12 could also be obtained from Theorem 14 by applying it to the star-shaped hulls of the transformed classes $\tilde{\mathcal{F}} := \{\varphi_1(f) - \varphi_1(f_0) : f \in \mathcal{F}\}$ and $\tilde{\mathcal{G}} := \{\varphi_2(g) - \varphi_2(g_0) : g \in \mathcal{G}\}$. A limitation of this approach is that the sup-norm bound condition in Theorem 14 would need to hold for $\tilde{\mathcal{F}}$, which does not, in general, follow from the corresponding condition on $\mathcal{F}$, even when the transformations are Lipschitz. We include the less general result because it leverages the star-shapedness of $\mathcal{F}$ to yield a clean bound and proof, without invoking the peeling argument used in Theorem 12.

The proof of the above theorem relies on the following generalization of Lemma 11, which bounds the Rademacher complexity of self-normalized inner-product classes and is proved afterward.

Lemma 16 (Complexity bounds for a self-normalized product class). Let $\mathcal{F}$ and $\mathcal{G}$ be classes of functions.
Assume $\mathcal{F}$ is star-shaped and that there exist $c \in (0, \infty)$ and $\alpha \in [0, 1]$ such that $\|f\|_\infty \le c\|f\|^\alpha$ for all $f \in \mathcal{F}$. For $\delta_n > 0$, define the normalized class
\[ \mathcal{H} := \left\{ \frac{fg}{\|f\| \vee \delta_n} : f \in \mathcal{F}, g \in \mathcal{G} \right\}. \]
Then, up to universal constants,
\[ R_n(\mathcal{H}) \lesssim \delta_n^{-1} \|\mathcal{G}\|_\infty R_n(\mathcal{F}, \delta_n) + c\,\delta_n^{\alpha - 1} R_n(\mathcal{G}). \]

Proof of Theorem 14. Define the self-normalized product class
\[ \mathcal{H} := \left\{ \frac{fg}{\|f\| \vee \delta_{n,\mathcal{F}}} : f \in \mathcal{F}, g \in \mathcal{G} \right\}, \]
where $\delta_{n,\mathcal{F}}$ satisfies the critical inequality $R_n(\mathcal{F}, \delta_{n,\mathcal{F}}) \le \delta_{n,\mathcal{F}}^2$, and let $\delta_{n,\mathcal{G}}$ be any solution to $R_n(\mathcal{G}) \le \delta_{n,\mathcal{G}}^2$. For any $h = fg/(\|f\| \vee \delta_{n,\mathcal{F}}) \in \mathcal{H}$,
\[ \|h\| \le \frac{\|g\|_\infty \|f\|}{\|f\| \vee \delta_{n,\mathcal{F}}} \le \|\mathcal{G}\|_\infty. \]
Moreover, by the assumption $\|f\|_\infty \le c\|f\|^\alpha$ (with $\alpha \in [0, 1]$),
\[ \|h\|_\infty \le \frac{\|g\|_\infty \|f\|_\infty}{\|f\| \vee \delta_{n,\mathcal{F}}} \le \frac{c \|\mathcal{G}\|_\infty \|f\|^\alpha}{\|f\| \vee \delta_{n,\mathcal{F}}} \le c \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}^{\alpha - 1}. \]
Hence, by Bousquet's inequality in Lemma 8, there exists a universal constant $c_0 > 0$ such that, for all $u \ge 0$, with probability at least $1 - e^{-u}$,
\[ \sup_{h \in \mathcal{H}} (P_n - P)h \le \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} (P_n - P)h \Big] + c_0 \left( \|\mathcal{G}\|_\infty \sqrt{\frac{u}{n}} + c \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}^{\alpha - 1} \frac{u}{n} \right). \]
By the Rademacher symmetrization bound in Lemma 9, $\mathbb{E}[\sup_{h \in \mathcal{H}} (P_n - P)h] \le 2 R_n(\mathcal{H})$. By Lemma 16 with $\delta_n = \delta_{n,\mathcal{F}}$, up to universal constants,
\[ R_n(\mathcal{H}) \lesssim \delta_{n,\mathcal{F}}^{-1} \|\mathcal{G}\|_\infty R_n(\mathcal{F}, \delta_{n,\mathcal{F}}) + c\,\delta_{n,\mathcal{F}}^{\alpha - 1} R_n(\mathcal{G}). \]
Then, by the critical inequalities for $\mathcal{F}$ and $\mathcal{G}$,
\[ R_n(\mathcal{H}) \lesssim \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c\,\delta_{n,\mathcal{F}}^{\alpha - 1} \delta_{n,\mathcal{G}}^2. \]
Combining the above displays and setting $u = \log(1/\eta)$, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$,
\[ \sup_{f \in \mathcal{F},\, g \in \mathcal{G}} \frac{(P_n - P)(fg)}{\|f\| \vee \delta_{n,\mathcal{F}}} \le C \left[ \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c\,\delta_{n,\mathcal{F}}^{\alpha - 1} \delta_{n,\mathcal{G}}^2 + \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta)}{n}} + c \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}^{\alpha - 1} \frac{\log(1/\eta)}{n} \right]. \tag{31} \]
Equivalently, on the same event, for every $f \in \mathcal{F}$ and $g \in \mathcal{G}$,
\[ (P_n - P)(fg) \le (\|f\| \vee \delta_{n,\mathcal{F}}) \, C \left[ \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c\,\delta_{n,\mathcal{F}}^{\alpha - 1} \delta_{n,\mathcal{G}}^2 + \|\mathcal{G}\|_\infty \sqrt{\frac{\log(1/\eta)}{n}} + c \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}}^{\alpha - 1} \frac{\log(1/\eta)}{n} \right]. \tag{32} \]
Now assume that $\delta_{n,\mathcal{F}} > \sqrt{\log(1/\eta)/n}$ and $\delta_{n,\mathcal{G}} > \sqrt{\log(1/\eta)/n}$. Then, combining like terms, for a possibly different $C$,
\[ (P_n - P)(fg) \le (\|f\| \vee \delta_{n,\mathcal{F}}) \, C \Big[ \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} + c (1 \vee \|\mathcal{G}\|_\infty) \delta_{n,\mathcal{F}}^{\alpha - 1} \delta_{n,\mathcal{G}}^2 \Big]. \tag{33} \]
□

We now prove the lemma.

Proof of Lemma 16. Define $d_n(f) := \|f\| \vee \delta_n$. Consider the classes
\[ \mathcal{H} := \{fg/d_n(f) : f \in \mathcal{F}, g \in \mathcal{G}\}, \quad \mathcal{H}_\le := \{fg/\delta_n : f \in \mathcal{F}, g \in \mathcal{G}, \|f\| \le \delta_n\}, \quad \mathcal{H}_> := \{fg/\|f\| : f \in \mathcal{F}, g \in \mathcal{G}, \|f\| > \delta_n\}. \]
We first show that $R_n(\mathcal{H}) \lesssim R_n(\mathcal{H}_\le)$, so that it suffices to bound $R_n(\mathcal{H}_\le)$. Fix any $g \in \mathcal{G}$ and $f \in \mathcal{F}$ with $\|f\| > \delta_n$. Set $\lambda := \delta_n/\|f\| \in (0, 1)$ and define $h := \lambda f$. By star-shapedness, $h \in \mathcal{F}$ and $\|h\| = \delta_n$. Then $d_n(f) = \|f\|$ and
\[ \frac{P_n^\epsilon(fg)}{d_n(f)} = \frac{P_n^\epsilon(fg)}{\|f\|} = \frac{P_n^\epsilon(hg)}{\delta_n} \le \sup_{f \in \mathcal{F} : \|f\| \le \delta_n} \frac{P_n^\epsilon(fg)}{\delta_n}. \]
Taking expectations, it follows that $R_n(\mathcal{H}_>) \le R_n(\mathcal{H}_\le)$. Hence, by sub-additivity of Rademacher complexities,
\[ R_n(\mathcal{H}) \le R_n(\mathcal{H}_\le) + R_n(\mathcal{H}_>) \le 2 R_n(\mathcal{H}_\le). \]
We now bound $R_n(\mathcal{H}_\le)$ conditional on $Z_{1:n}$. Note that $(fg)(z) = \phi\{f(z), g(z)\}$ with $\phi(x, y) = xy$ and $\phi(0, 0) = 0$. Write $\mathcal{F}_n := \{f \in \mathcal{F} : \|f\| \le \delta_n\}$. If $\|\mathcal{G}\|_\infty < \infty$ and $\|\mathcal{F}_n\|_\infty < \infty$, then for all $(x, y), (x', y') \in [-\|\mathcal{F}_n\|_\infty, \|\mathcal{F}_n\|_\infty] \times [-\|\mathcal{G}\|_\infty, \|\mathcal{G}\|_\infty]$,
\[ |\phi(x, y) - \phi(x', y')| = |xy - x'y'| \le \|\mathcal{G}\|_\infty |x - x'| + \|\mathcal{F}_n\|_\infty |y - y'| \le \sqrt{2} \sqrt{\|\mathcal{G}\|_\infty^2 |x - x'|^2 + \|\mathcal{F}_n\|_\infty^2 |y - y'|^2}. \]
Moreover, by the assumed embedding $\|f\|_\infty \le c\|f\|^\alpha$ and the definition of $\mathcal{F}_n$, we have $\|\mathcal{F}_n\|_\infty \le c\,\delta_n^\alpha$. Therefore, $\phi$ is Lipschitz in each coordinate with constants $\|\mathcal{G}\|_\infty$ and $c\,\delta_n^\alpha$. We now apply a vector contraction inequality for Rademacher processes (Theorem 3 of Maurer (2016)). Let $S := \mathcal{F}_n \times \mathcal{G}$ and, for $s = (f, g) \in S$, define
\[ \psi_i(s) := f(Z_i)g(Z_i), \qquad \phi_i(s) := \sqrt{2} \big( \|\mathcal{G}\|_\infty f(Z_i), \; c\,\delta_n^\alpha g(Z_i) \big) \in \mathbb{R}^2. \]
Then for all $s = (f, g), s' = (f', g') \in S$,
\[ |\psi_i(s) - \psi_i(s')| = |f(Z_i)g(Z_i) - f'(Z_i)g'(Z_i)| \le \|\mathcal{G}\|_\infty |f(Z_i) - f'(Z_i)| + \|\mathcal{F}_n\|_\infty |g(Z_i) - g'(Z_i)| \le \sqrt{2} \sqrt{\|\mathcal{G}\|_\infty^2 |f(Z_i) - f'(Z_i)|^2 + \|\mathcal{F}_n\|_\infty^2 |g(Z_i) - g'(Z_i)|^2} \le \|\phi_i(s) - \phi_i(s')\|_2, \]
where we used $\|\mathcal{F}_n\|_\infty \le c\,\delta_n^\alpha$. Thus, the hypotheses of Theorem 3 in Maurer (2016) hold. We conclude that
\[ \mathbb{E}_\epsilon\Big[ \sup_{f \in \mathcal{F}_n, g \in \mathcal{G}} P_n^\epsilon(fg) \Big] \lesssim \|\mathcal{G}\|_\infty \mathbb{E}_{\epsilon^{(1)}}\Big[ \sup_{f \in \mathcal{F}_n} P_n^{\epsilon^{(1)}} f \Big] + c\,\delta_n^\alpha \mathbb{E}_{\epsilon^{(2)}}\Big[ \sup_{g \in \mathcal{G}} P_n^{\epsilon^{(2)}} g \Big], \]
where $\epsilon^{(1)}$ and $\epsilon^{(2)}$ are independent i.i.d. Rademacher sequences. Taking expectations yields
\[ R_n(\mathcal{H}_\le) = \delta_n^{-1} \, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}_n, g \in \mathcal{G}} P_n^\epsilon(fg) \Big] \lesssim \delta_n^{-1} \big\{ \|\mathcal{G}\|_\infty R_n(\mathcal{F}_n) + c\,\delta_n^\alpha R_n(\mathcal{G}) \big\} = \delta_n^{-1} \|\mathcal{G}\|_\infty R_n(\mathcal{F}, \delta_n) + c\,\delta_n^{\alpha - 1} R_n(\mathcal{G}). \]
We conclude that, up to universal constants, $R_n(\mathcal{H}) \lesssim \delta_n^{-1} \|\mathcal{G}\|_\infty R_n(\mathcal{F}, \delta_n) + c\,\delta_n^{\alpha - 1} R_n(\mathcal{G})$. □

D Proofs of main results

D.1 Proofs for Section 4.1

Proof of Theorem 2. The first part of the theorem is a direct application of Theorem 10 with $\mathcal{F} := \bar{\mathcal{F}}_\ell$. The second part follows similarly by taking $\mathcal{F} := \mathcal{F}_\ell$. □

Proof of Theorem 3. We first prove the result assuming that, up to universal constants,
\[ \delta_n \gtrsim (1 \vee M) \sqrt{\frac{\log(1/\eta)}{n}}, \qquad M := \sup_{f \in \mathcal{F}} \|\ell(\cdot, f_0) - \ell(\cdot, f)\|_\infty. \]
By Lemma 1,
\[ R(\hat{f}_n) - R(f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\}. \]
By Theorem 2, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal{F}$, letting $\sigma_f^2 := \mathrm{Var}\{\ell(Z, f_0) - \ell(Z, f)\}$,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le C \big( \sigma_f \delta_n + \delta_n^2 \big). \]
Since the above bound holds uniformly over $f \in \mathcal{F}$, it also holds for the random element $\hat{f}_n \in \mathcal{F}$. Hence, with the same confidence,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\} \le C \big( \hat{\sigma}_n \delta_n + \delta_n^2 \big), \qquad \hat{\sigma}_n := \sigma_{\hat{f}_n}. \]
By the Bernstein condition in A1,
\[ \hat{\sigma}_n^2 = \mathrm{Var}\{\ell(Z, \hat{f}_n) - \ell(Z, f_0)\} \le c_{\mathrm{Bern}} \{R(\hat{f}_n) - R(f_0)\}. \]
Letting $d_n(\hat{f}_n, f_0) := \{R(\hat{f}_n) - R(f_0)\}^{1/2}$ and applying the basic inequality $d_n^2(\hat{f}_n, f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\}$, we obtain that there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$,
\[ d_n^2(\hat{f}_n, f_0) \le C \big[ \sqrt{c_{\mathrm{Bern}}} \, d_n(\hat{f}_n, f_0) \, \delta_n + \delta_n^2 \big]. \tag{34} \]
We now extract a regret rate using Young's inequality. Let $d := d_n(\hat{f}_n, f_0)$. From $d^2 \le C[\sqrt{c_{\mathrm{Bern}}}\, d\, \delta_n + \delta_n^2]$, we have, for any $a \in (0, 1)$,
\[ \sqrt{c_{\mathrm{Bern}}} \, d \, \delta_n \le \frac{a}{2} d^2 + \frac{c_{\mathrm{Bern}}}{2a} \delta_n^2. \]
Taking $a := 1/C$ and substituting yields
\[ d^2 \le \frac{1}{2} d^2 + C(1 + c_{\mathrm{Bern}}) \delta_n^2, \]
where $C$ denotes a possibly enlarged universal constant. Rearranging gives, for a universal constant $C' > 0$, $d^2 \le C'(1 + c_{\mathrm{Bern}})\delta_n^2$. Equivalently, with probability at least $1 - \eta$,
\[ R(\hat{f}_n) - R(f_0) \le C'(1 + c_{\mathrm{Bern}})\delta_n^2. \]
Now, the bound above was derived under the assumption that $\delta_n \gtrsim (1 \vee M)\sqrt{\log(1/\eta)/n}$. If $\delta_n$ does not satisfy this condition, define the enlarged radius
\[ \tilde{\delta}_n := \delta_n \vee c M \sqrt{\frac{\log(1/\eta)}{n}} \]
for a universal constant $c > 0$. Since $\delta_n$ satisfies the critical inequality $R_n(\mathrm{star}(\bar{\mathcal{F}}_\ell), \delta_n) \le \delta_n^2$, any $\tilde{\delta}_n \ge \delta_n$ also satisfies $R_n(\mathrm{star}(\bar{\mathcal{F}}_\ell), \tilde{\delta}_n) \le \tilde{\delta}_n^2$, because $\delta \mapsto R_n(\mathcal{G}, \delta)/\delta$ is nonincreasing by Lemma 10. Repeating the argument yields, with probability at least $1 - \eta$,
\[ R(\hat{f}_n) - R(f_0) \le C'(1 + c_{\mathrm{Bern}})\tilde{\delta}_n^2. \]
In particular, using $(a \vee b)^2 \le a^2 + b^2$, we have $\tilde{\delta}_n^2 \le \delta_n^2 + c^2 M^2 \log(1/\eta)/n$, and hence, with probability at least $1 - \eta$,
\[ R(\hat{f}_n) - R(f_0) \le C'(1 + c_{\mathrm{Bern}})\delta_n^2 + C'(1 + c_{\mathrm{Bern}}) c^2 M^2 \frac{\log(1/\eta)}{n}. \]
The first claim follows by changing the constant $C'$, since $c_{\mathrm{Bern}} \ge 1$.

We now show that, if $n$ is large enough that $R(\hat{f}_n) - R(f_0) \le 1$ with probability at least $1 - \eta/2$, then the same bound holds (up to a universal constant) when $\delta_n > 0$ is instead chosen to satisfy the critical radius condition $R_n(\mathrm{star}(\mathcal{F}_\ell), \delta_n) \le \delta_n^2$ for the uncentered class $\mathcal{F}_\ell$. We note that Theorem 2 can be modified so that $\delta_n$ satisfies the critical radius condition $R_n(\mathrm{star}(\mathcal{F}_\ell), \delta_n) \le \delta_n^2$ for the uncentered loss-difference class. As before, first assume $\delta_n \gtrsim M\sqrt{\log(1/\eta)/n}$. Then, applying Theorem 10 to $\mathcal{F}_\ell$, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta/2$, for every $f \in \mathcal{F}$, letting $\delta_f^2 := \|\ell(\cdot, f) - \ell(\cdot, f_0)\|^2$,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le C \left[ \delta_f \delta_n + \delta_n^2 + \delta_f \sqrt{\frac{\log(1/\eta)}{n}} + M \frac{\log(1/\eta)}{n} \right]. \]
The difference is that $\sigma_f^2$ is replaced by the larger norm $\|\ell(\cdot, f) - \ell(\cdot, f_0)\|^2$. By the elementary decomposition
\[ \|\ell(\cdot, f) - \ell(\cdot, f_0)\|^2 = \mathrm{Var}\{\ell(Z, f) - \ell(Z, f_0)\} + \{R(f) - R(f_0)\}^2, \]
Condition A1 implies
\[ \|\ell(\cdot, f) - \ell(\cdot, f_0)\| \le \sqrt{c_{\mathrm{Bern}} \{R(f) - R(f_0)\}} + \{R(f) - R(f_0)\}. \]
Provided $n$ is large enough that $R(\hat{f}_n) - R(f_0) \le 1$, it follows that
\[ \|\ell(\cdot, \hat{f}_n) - \ell(\cdot, f_0)\| \le (1 + \sqrt{c_{\mathrm{Bern}}}) \sqrt{R(\hat{f}_n) - R(f_0)}. \]
Combining the high-probability bound with the basic inequality, we find that, with probability at least $1 - \eta/2$, letting $\hat{d}_n := \{R(\hat{f}_n) - R(f_0)\}^{1/2}$,
\[ \hat{d}_n^2 \le C \left[ (1 + \sqrt{c_{\mathrm{Bern}}}) \hat{d}_n \delta_n + \delta_n^2 + (1 + \sqrt{c_{\mathrm{Bern}}}) \hat{d}_n \sqrt{\frac{\log(1/\eta)}{n}} + M \frac{\log(1/\eta)}{n} \right]. \]
This is the same inequality as (34), up to the multiplicative factor $(1 + \sqrt{c_{\mathrm{Bern}}})$ on the linear terms. Applying Young's inequality as before absorbs these terms into the left-hand side and yields the same regret bound, with constants depending on $c_{\mathrm{Bern}}$ only through universal factors. Hence, the desired bound holds with probability at least $1 - \eta/2$. If, in addition, $\Pr\{R(\hat{f}_n) - R(f_0) \le 1\} \ge 1 - \eta/2$, then by a union bound the event $\{R(\hat{f}_n) - R(f_0) \le 1\}$ and the desired bound both hold with probability at least $1 - \eta$. □
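The Young-inequality step that turns (34) into a rate is pure algebra and can be sanity-checked numerically: any $d$ satisfying the quadratic inequality $d^2 \le C(\sqrt{c}\,d\,\delta + \delta^2)$ obeys $d^2 \le C'(1 + c)\delta^2$ with $C'$ depending only on $C$. The sketch below solves the quadratic exactly for random illustrative parameters and confirms the ratio stays bounded.

```python
import numpy as np

rng = np.random.default_rng(2)

C = 3.0
for _ in range(5):
    c = rng.uniform(1.0, 50.0)        # Bernstein constant c_Bern >= 1
    delta = rng.uniform(0.01, 1.0)    # critical radius delta_n
    # d^2 - C sqrt(c) delta d - C delta^2 <= 0  iff  d <= positive root:
    a, b = C * np.sqrt(c) * delta, C * delta ** 2
    d_max = (a + np.sqrt(a ** 2 + 4 * b)) / 2.0
    ratio = d_max ** 2 / ((1 + c) * delta ** 2)
    print(f"c = {c:6.2f}, delta = {delta:.3f}:  d_max^2 / ((1+c) delta^2) = {ratio:.3f}")
# The ratio is bounded by 2*C^2 for all c and delta, matching the conclusion
# d^2 <= C'(1 + c_Bern) delta_n^2 of the proof.
```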
Our proof of Theorem 4 uses the second part of Theorem 2, which for clarity we restate as the following corollary.

Corollary 4 (Uniform local concentration inequality). Let $\delta_n > 0$ satisfy the critical radius condition $R_n(\mathrm{star}(\mathcal{F}_\ell), \delta_n) \le \delta_n^2$. Define $M := \sup_{f \in \mathcal{F}} \|\ell(\cdot, f_0) - \ell(\cdot, f)\|_\infty$ and, for each $f \in \mathcal{F}$, $\sigma_f^2 := \mathrm{Var}\{\ell(Z, f_0) - \ell(Z, f)\}$. For all $\eta \in (0, 1)$, there exists a universal constant $C > 0$ such that, with probability at least $1 - \eta$, for every $f \in \mathcal{F}$,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le C \left[ \|\ell(\cdot, f_0) - \ell(\cdot, f)\| \delta_n + \delta_n^2 + \sigma_f (M \vee 1) \sqrt{\frac{\log(1/\eta)}{n}} + (M \vee 1)^2 \frac{\log(1/\eta)}{n} \right]. \]

Proof of Theorem 4. The proof largely follows that of Theorem 3, with minor modifications as described in the remark in Section 3.3. Let $\hat{f}_n \in \arg\min_{f \in \mathcal{F}} P_n \ell(\cdot, f)$. By Theorem 1,
\[ R(\hat{f}_n) - R(f_0) \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\}. \]
By the curvature bound in Condition A1,
\[ \kappa \|\hat{f}_n - f_0\|^2 \le (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\}. \tag{35} \]
Next, apply Corollary 4. Define the enlarged radius
\[ \tilde{\delta}_n := \delta_n \vee c M \sqrt{\frac{\log(1/\eta)}{n}} \]
for a universal constant $c > 0$. Since $\tilde{\delta}_n \ge \delta_n$ and $\delta \mapsto R_n(\mathrm{star}(\mathcal{F}_\ell), \delta)/\delta$ is nonincreasing (Lemma 10), $\tilde{\delta}_n$ also satisfies the critical inequality $R_n(\mathrm{star}(\mathcal{F}_\ell), \tilde{\delta}_n) \lesssim \tilde{\delta}_n^2$. By Corollary 4, there exists a universal constant $C > 0$ and an event $E$ with $\Pr(E) \ge 1 - \eta$ such that, on $E$, uniformly over all $f \in \mathcal{F}$,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, f)\} \le C \big[ \|\ell(\cdot, f_0) - \ell(\cdot, f)\| \tilde{\delta}_n + \tilde{\delta}_n^2 \big]. \tag{36} \]
In particular, on $E$ this bound holds for $f = \hat{f}_n$. Using the $L_2(P)$-Lipschitz bound in Condition A1, $\|\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\| \le L \|\hat{f}_n - f_0\|$, and substituting into (36) yields, on $E$,
\[ (P_n - P)\{\ell(\cdot, f_0) - \ell(\cdot, \hat{f}_n)\} \le C \big[ L \|\hat{f}_n - f_0\| \tilde{\delta}_n + \tilde{\delta}_n^2 \big]. \]
Combining with (35) gives, on $E$,
\[ \kappa \|\hat{f}_n - f_0\|^2 \le C \big[ L \|\hat{f}_n - f_0\| \tilde{\delta}_n + \tilde{\delta}_n^2 \big]. \tag{37} \]
Finally, apply Young's inequality to the linear term. Writing $u := \|\hat{f}_n - f_0\|$,
\[ C L u \tilde{\delta}_n \le \frac{\kappa}{2} u^2 + \frac{C^2 L^2}{2\kappa} \tilde{\delta}_n^2. \]
Substituting into (37) and absorbing $(\kappa/2)u^2$ into the left-hand side gives, for a (possibly larger) universal constant $C > 0$, $u^2 \le C \kappa^{-1} L^2 \tilde{\delta}_n^2$. Using $(a \vee b)^2 \le a^2 + b^2$ and $M \le 2LM$, we have
\[ \tilde{\delta}_n^2 \le \delta_n^2 + c^2 M^2 \frac{\log(1/\eta)}{n} \le \delta_n^2 + C L^2 M^2 \frac{\log(1/\eta)}{n}, \]
after adjusting the universal constant. Hence, with probability at least $1 - \eta$,
\[ \|\hat{f}_n - f_0\|^2 \le C \kappa^{-1} L^2 \left[ \delta_n^2 + M^2 \frac{\log(1/\eta)}{n} \right], \]
which is the desired bound (up to a universal constant). □

D.2 Proofs for Section 4.2

The metric entropy maximal inequalities are proven in Appendix B.4. Here, we prove the remaining results.

Proof of Lemma 6. We first show that
\[ J_2(\delta, \mathrm{star}(\bar{\mathcal{F}}_\ell)) \lesssim J_2(\delta, \mathrm{star}(\mathcal{F}_\ell)) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
By linearity of $P$, centering commutes with scaling, so $\overline{\mathrm{star}(\mathcal{F}_\ell)} = \mathrm{star}(\bar{\mathcal{F}}_\ell)$. Moreover, for any $f \in \mathrm{star}(\mathcal{F}_\ell)$ we have $f - Pf = f + c$ with $c := -Pf$. Since $\sup_{f \in \mathcal{F}} \|\ell(\cdot, f)\|_\infty \le M$, we have $\sup_{h \in \mathcal{F}_\ell} \|h\|_\infty \le 2M$, and hence also $\sup_{f \in \mathrm{star}(\mathcal{F}_\ell)} \|f\|_\infty \le 2M$, which implies $|c| \le 2M$. Therefore,
\[ \mathrm{star}(\bar{\mathcal{F}}_\ell) \subseteq \mathrm{star}(\mathcal{F}_\ell) + [-2M, 2M]. \]
Consequently, for every $\varepsilon > 0$ and every $Q$, using the standard product-cover bound $N(\varepsilon, \mathcal{A} + \mathcal{B}) \le N(\varepsilon/2, \mathcal{A}) N(\varepsilon/2, \mathcal{B})$,
\[ \log N(\varepsilon, \mathrm{star}(\bar{\mathcal{F}}_\ell), L_2(Q)) \le \log N(\varepsilon/2, \mathrm{star}(\mathcal{F}_\ell), L_2(Q)) + \log(1 + 4M/\varepsilon). \]
Integrating this bound yields
\[ J_2(\delta, \mathrm{star}(\bar{\mathcal{F}}_\ell)) \lesssim J_2(\delta, \mathrm{star}(\mathcal{F}_\ell)) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
Finally, a direct application of Corollary 3 yields
\[ J_2(\delta, \mathrm{star}(\mathcal{F}_\ell)) \lesssim J_2(\delta, \mathcal{F}_\ell) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
Combining the previous two displays and absorbing constants gives
\[ J_2(\delta, \mathrm{star}(\bar{\mathcal{F}}_\ell)) = J_2\big(\delta, \overline{\mathrm{star}(\mathcal{F}_\ell)}\big) \lesssim J_2(\delta, \mathcal{F}_\ell) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
If Condition A3 holds, then Theorem 5 implies $J_2(\delta, \mathcal{F}_\ell) \lesssim J_2(\delta/L, \mathcal{F})$, and the same argument yields
\[ J_2(\delta, \mathrm{star}(\bar{\mathcal{F}}_\ell)) \lesssim J_2(\delta/L, \mathcal{F}) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \quad \square \]

D.3 Proofs for Section 5

Proof of Theorem 8. Denote $R(f; w) := P\{w(\cdot)\ell(\cdot, f)\}$. Let $\hat{f}_0 \in \arg\min_{f \in \mathcal{F}} R(f; \hat{w})$. Add and subtract $R(\hat{f}_0; \hat{w})$ and $R(f_0; \hat{w})$ to obtain
\[ R(\hat{f}_0; w_0) - R(f_0; w_0) = \big[ R(\hat{f}_0; w_0) - R(\hat{f}_0; \hat{w}) \big] + \big[ R(\hat{f}_0; \hat{w}) - R(f_0; \hat{w}) \big] + \big[ R(f_0; \hat{w}) - R(f_0; w_0) \big]. \]
By definition, $\hat{f}_0$ minimizes $f \mapsto R(f; \hat{w})$ over $\mathcal{F}$, so $R(\hat{f}_0; \hat{w}) \le R(f_0; \hat{w})$ and hence $R(\hat{f}_0; \hat{w}) - R(f_0; \hat{w}) \le 0$. Therefore,
\begin{align*} R(\hat{f}_0; w_0) - R(f_0; w_0) &\le \big[ R(\hat{f}_0; w_0) - R(\hat{f}_0; \hat{w}) \big] + \big[ R(f_0; \hat{w}) - R(f_0; w_0) \big] \\ &= P\{(w_0 - \hat{w})\ell(\cdot, \hat{f}_0)\} - P\{(w_0 - \hat{w})\ell(\cdot, f_0)\} = P\big\{ (w_0 - \hat{w})\big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \big\} = P\Big\{ \Big(1 - \frac{\hat{w}}{w_0}\Big) w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \Big\}. \end{align*}
We further decompose
\[ P\Big\{ \Big(1 - \frac{\hat{w}}{w_0}\Big) w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \Big\} = P\Big\{ \Big(1 - \frac{\hat{w}}{w_0}\Big) \Big[ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) - P\big\{ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \big\} \Big] \Big\} + P\Big\{ 1 - \frac{\hat{w}}{w_0} \Big\} \, P\big\{ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \big\}. \]
By the Bernstein condition and the Cauchy–Schwarz inequality, the first term satisfies
\begin{align*} P\Big\{ \Big(1 - \frac{\hat{w}}{w_0}\Big) \Big[ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) - P\big\{ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \big\} \Big] \Big\} &\le \Big\| 1 - \frac{\hat{w}}{w_0} \Big\| \, \Big( \mathrm{Var}\big\{ w_0 \big( \ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0) \big) \big\} \Big)^{1/2} \\ &\le c \, \Big\| 1 - \frac{\hat{w}}{w_0} \Big\| \, \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\}^{1/2}. \end{align*}
Next, since $P\{w_0(\ell(\cdot, \hat{f}_0) - \ell(\cdot, f_0))\} = R(\hat{f}_0; w_0) - R(f_0; w_0)$, the second term satisfies
\[ P\Big\{ 1 - \frac{\hat{w}}{w_0} \Big\} \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\} \le \Big| P\Big( 1 - \frac{\hat{w}}{w_0} \Big) \Big| \, \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\} \le \Big\| 1 - \frac{\hat{w}}{w_0} \Big\| \, \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\}. \]
Thus, combining both bounds, we conclude that
\[ R(\hat{f}_0; w_0) - R(f_0; w_0) \le c \, \Big\| 1 - \frac{\hat{w}}{w_0} \Big\| \, \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\}^{1/2} + \Big\| 1 - \frac{\hat{w}}{w_0} \Big\| \, \big\{ R(\hat{f}_0; w_0) - R(f_0; w_0) \big\}. \]
Assuming that $\|1 - \hat{w}/w_0\| < 1/2$, it follows that
\[ R(\hat{f}_0; w_0) - R(f_0; w_0) \le 4 c^2 \Big\| 1 - \frac{\hat{w}}{w_0} \Big\|^2. \quad \square \]

Proof of Theorem 6. By Lemma 6, for all $\delta > 0$,
\[ J_2(\delta, \mathrm{star}(\bar{\mathcal{F}}_\ell)) \lesssim J_2(\delta/L, \mathcal{F}) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
Using $J_2(\delta/L, \mathcal{F}) \lesssim \phi(\delta/L)$, define
\[ \phi_n(\delta) := \phi(\delta/L) + \delta \sqrt{\log\Big(1 + \frac{M}{\delta}\Big)}. \]
Let $c > 0$ denote the implied constant, and define
\[ \delta_{n,0} := \inf\big\{ \delta > 0 : \phi(\delta) \le c\sqrt{n}\,\delta^2 \big\}, \qquad \delta_{n,1} := \sqrt{\frac{\log(eMn)}{n}}. \]
Set $\delta_n := L\delta_{n,0} \vee \delta_{n,1}$. We show that $\phi_n(\delta_n) \lesssim \sqrt{n}\,\delta_n^2$. By Lemma 5, $\delta_n$ then satisfies the critical-radius condition, and the result follows by direct application of Theorem 3.

For the first term, consider two cases. If $\delta_n = L\delta_{n,0}$, then $\delta_n/L = \delta_{n,0}$ and
\[ \phi(\delta_n/L) = \phi(\delta_{n,0}) \le c\sqrt{n}\,\delta_{n,0}^2 \le c\sqrt{n}\,\delta_n^2. \]
If $\delta_n = \delta_{n,1}$, then $\delta_n/L = \delta_{n,1}/L \ge \delta_{n,0}$, so by definition of $\delta_{n,0}$,
\[ \phi(\delta_n/L) = \phi(\delta_{n,1}/L) \le c\sqrt{n}\,(\delta_{n,1}/L)^2 \le c\sqrt{n}\,\delta_n^2. \]
Thus in all cases $\phi(\delta_n/L) \lesssim \sqrt{n}\,\delta_n^2$. For the second term, since $\delta_n \ge \delta_{n,1}$,
\[ \delta_n \sqrt{\log\Big(1 + \frac{M}{\delta_n}\Big)} \le \delta_n \sqrt{\log\Big(1 + \frac{M}{\delta_{n,1}}\Big)}. \]
Moreover, $\delta_{n,1} \ge n^{-1}$ implies $M/\delta_{n,1} \le Mn$, hence $1 + M/\delta_{n,1} \le 1 + Mn \le eMn$ (since $Mn \ge 1$).
Therefore,
\[ \delta_n \sqrt{\log\Big(1 + \frac{M}{\delta_n}\Big)} \le \delta_n \sqrt{\log(eMn)} \le \sqrt{n}\,\delta_n^2, \]
where the last step uses $\sqrt{\log(eMn)} \le \sqrt{n}\,\delta_n$, since $\delta_n \ge \delta_{n,1} = \sqrt{\log(eMn)/n}$. Combining yields $\phi_n(\delta_n) \lesssim \sqrt{n}\,\delta_n^2$, hence $J_2(\delta_n, \mathrm{star}(\bar{\mathcal{F}}_\ell)) \lesssim \sqrt{n}\,\delta_n^2$. Applying Theorem 3 to $\mathrm{star}(\bar{\mathcal{F}}_\ell)$ gives the claimed regret bound (after collecting constants). □

D.4 Proof for Section 5.3

Proof of Theorem 9. In what follows, we use the shorthand $\alpha := 1 - 1/(2\beta)$, so that Condition B4 can be written as the existence of an $\alpha \ge 0$ such that $\|f - f'\|_\infty \le c_\infty \|f - f'\|^\alpha$ for all $f, f' \in \mathcal{F}$. Theorem 1 implies that
\begin{align*} \mathrm{Reg}(\hat{f}_n, \hat{g}) &\le (P_n - P)\big\{ \ell_{\hat{g}}(\cdot, \hat{f}_0) - \ell_{\hat{g}}(\cdot, \hat{f}_n) \big\} \\ &= (P_n - P)\big\{ \ell_{g_0}(\cdot, \hat{f}_0) - \ell_{g_0}(\cdot, \hat{f}_n) \big\} + (P_n - P)\Big[ \big\{ \ell_{\hat{g}}(\cdot, \hat{f}_0) - \ell_{\hat{g}}(\cdot, \hat{f}_n) \big\} - \big\{ \ell_{g_0}(\cdot, \hat{f}_0) - \ell_{g_0}(\cdot, \hat{f}_n) \big\} \Big]. \tag{38} \end{align*}
We proceed by deriving high-probability bounds for each term on the right-hand side, following the strategy used in the proof of Theorem 2.

Step 1: The first term. Define the class $\mathcal{F}_{\ell, g_0} := \{\ell_{g_0}(\cdot, f) - \ell_{g_0}(\cdot, f') : f, f' \in \mathcal{F}\}$. By the Lipschitz conditions in Condition B3, we have
\[ \|\mathcal{F}_{\ell, g_0}\|_\infty := \sup_{h \in \mathcal{F}_{\ell, g_0}} \|h\|_\infty \lesssim L \sup_{f \in \mathcal{F}} \|f\|_\infty. \]
Moreover, for all $z$,
\[ \Big| \big\{ \ell_{g_0}(z, f_1) - \ell_{g_0}(z, f_1') \big\} - \big\{ \ell_{g_0}(z, f_2) - \ell_{g_0}(z, f_2') \big\} \Big| \lesssim L \|\mathcal{G}\|_\infty \big\{ |f_1(z) - f_1'(z)| + |f_2(z) - f_2'(z)| \big\} \lesssim L \|\mathcal{G}\|_\infty \sqrt{|f_1(z) - f_1'(z)|^2 + |f_2(z) - f_2'(z)|^2}. \]
By convexity of $\mathcal{F}$, $\mathrm{star}(\mathcal{F} - \mathcal{F}) = \mathcal{F} - \mathcal{F}$. Hence, a slight modification of the proof of Theorem 11, replacing $\mathcal{F} - \{f_0\}$ with the difference class $\mathcal{F} - \mathcal{F}$, or a direct application of Theorem 12, yields the following: there exists a universal constant $C \in (0, \infty)$ such that, for every $\eta \in (0, 1)$, with probability at least $1 - \eta/4$, the following holds simultaneously for all $f, f' \in \mathcal{F}$:
\[ (P_n - P)\{\ell_{g_0}(\cdot, f) - \ell_{g_0}(\cdot, f')\} \le C \left[ L \delta_{n,\mathcal{F}} \|f - f'\| + L^2 \delta_{n,\mathcal{F}}^2 + L^2 \frac{\log\log(Men)}{n} + L \|f - f'\| \sqrt{\frac{\log(1/\eta)}{n}} + M \frac{\log(1/\eta)}{n} \right], \]
where we used that $\delta_{n,\mathcal{F}}$ satisfies the critical inequality $R_n(\mathrm{star}(\mathcal{F} - \mathcal{F}), \delta_{n,\mathcal{F}}) \le \delta_{n,\mathcal{F}}^2$. By assumption, we have
\[ \delta_{n,\mathcal{F}} \gtrsim (1 \vee M) \sqrt{\frac{\log(1/\eta) + \log\log(Men)}{n}}. \]
Therefore,
\[ \sqrt{\frac{\log(1/\eta)}{n}} \lesssim \delta_{n,\mathcal{F}}, \qquad \frac{\log\log(Men)}{n} \lesssim \delta_{n,\mathcal{F}}^2, \qquad M \frac{\log(1/\eta)}{n} \lesssim \delta_{n,\mathcal{F}}^2, \]
and the explicit remainder terms above are absorbed. Hence, on the same event,
\[ (P_n - P)\{\ell_{g_0}(\cdot, f) - \ell_{g_0}(\cdot, f')\} \lesssim \delta_{n,\mathcal{F}} \|f - f'\| + \delta_{n,\mathcal{F}}^2, \]
where the implicit constant depends only on $L$, $M$, and $\|\mathcal{G}\|_\infty$. Taking $f = \hat{f}_n$ and $f' = \hat{f}_0$, we obtain that, with probability at least $1 - \eta/4$,
\[ (P_n - P)\{\ell_{g_0}(\cdot, \hat{f}_0) - \ell_{g_0}(\cdot, \hat{f}_n)\} \lesssim \|\hat{f}_n - \hat{f}_0\| \, \delta_{n,\mathcal{F}} + \delta_{n,\mathcal{F}}^2, \tag{39} \]
where the implicit constant depends only on $L$, $M$, and $\|\mathcal{G}\|_\infty$.

Step 2: The second term. By Condition B3, write $\ell_g(z, f) = m_1(z, f)m_2(z, g) + r_1(z, f) + r_2(z, g)$. Then
\[ (P_n - P)\Big[ \big\{ \ell_{\hat{g}}(\cdot, \hat{f}_0) - \ell_{\hat{g}}(\cdot, \hat{f}_n) \big\} - \big\{ \ell_{g_0}(\cdot, \hat{f}_0) - \ell_{g_0}(\cdot, \hat{f}_n) \big\} \Big] = (P_n - P)\Big[ \big\{ m_1(\cdot, \hat{f}_0) - m_1(\cdot, \hat{f}_n) \big\} \big\{ m_2(\cdot, \hat{g}) - m_2(\cdot, g_0) \big\} \Big], \]
since the $r_1$ terms cancel between the two brackets and the $r_2$ terms cancel within each bracket. Under Condition B3, we may write $m_1(\cdot, f) = \varphi_1(f)$ and $m_2(\cdot, g) = \varphi_2(g)$ for the pointwise maps $\varphi_1, \varphi_2$ introduced above. Hence the display above equals
\[ (P_n - P)\Big[ \big\{ \varphi_1(\hat{f}_0) - \varphi_1(\hat{f}_n) \big\} \big\{ \varphi_2(\hat{g}) - \varphi_2(g_0) \big\} \Big]. \]
By Condition B5, with probability at least $1 - \eta/2$, $\|\hat{g} - g_0\|_{\mathcal{G}} \le \varepsilon_{\mathrm{nuis}}(n, \eta)$. Apply Theorem 12 with $f = \hat{f}_0$, $f_0 = \hat{f}_n$, $g = \hat{g}$, and $g_0 = g_0$ (with classes $\mathcal{F}$ and $\mathcal{G}$). Since the theorem holds uniformly over $g, g_0 \in \mathcal{G}$, it applies to the random pair $(\hat{g}, g_0)$. Thus, with probability at least $1 - \eta/4$,
\begin{align*} (P_n - P)\big\{ \varphi_1(\hat{f}_0) - \varphi_1(\hat{f}_n) \big\}\big\{ \varphi_2(\hat{g}) - \varphi_2(g_0) \big\} &\lesssim \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} \big( \|\hat{f}_0 - \hat{f}_n\| \vee \delta_{n,\mathcal{F}} \big) + c_\infty \|\hat{f}_0 - \hat{f}_n\|^\alpha \delta_{n,\mathcal{G}} \big( \|\hat{g} - g_0\| \vee \delta_{n,\mathcal{G}} \big) \\ &\quad + \|\mathcal{G}\|_\infty \|\hat{f}_0 - \hat{f}_n\| \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} + c_\infty \|\hat{f}_0 - \hat{f}_n\|^\alpha \|\mathcal{G}\|_\infty \frac{\log\log(eMn) + \log(1/\eta)}{n}, \tag{40} \end{align*}
where $M := 1 + \|\mathcal{G}\|_\infty \vee \|\mathcal{F}\|_\infty$, and the implicit constant equals $C L_1 L_2$ for a universal $C \in (0, \infty)$. Intersecting the event in (40) with the PAC event $\{\|\hat{g} - g_0\|_{\mathcal{G}} \le \varepsilon_{\mathrm{nuis}}(n, \eta)\}$ and taking a union bound yields that, with probability at least $1 - 3\eta/4$,
\begin{align*} (P_n - P)\big\{ \varphi_1(\hat{f}_0) - \varphi_1(\hat{f}_n) \big\}\big\{ \varphi_2(\hat{g}) - \varphi_2(g_0) \big\} &\lesssim \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} \big( \|\hat{f}_0 - \hat{f}_n\| \vee \delta_{n,\mathcal{F}} \big) + c_\infty \|\hat{f}_0 - \hat{f}_n\|^\alpha \delta_{n,\mathcal{G}} \big( \varepsilon_{\mathrm{nuis}}(n, \eta) \vee \delta_{n,\mathcal{G}} \big) \\ &\quad + \|\mathcal{G}\|_\infty \|\hat{f}_0 - \hat{f}_n\| \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} + c_\infty \|\hat{f}_0 - \hat{f}_n\|^\alpha \|\mathcal{G}\|_\infty \frac{\log\log(eMn) + \log(1/\eta)}{n}. \tag{41} \end{align*}

Step 3: Conclude. Let $E_1$ denote the event on which (39) holds with confidence level $\eta/4$, and let $E_2$ denote the event on which (41) holds with confidence level $3\eta/4$. By the preceding steps and a union bound (adjusting constants if needed), we may assume that $\Pr(E_1 \cap E_2) \ge 1 - \eta$. Work on $E_1 \cap E_2$. Combining (38), (39), and (41), and using the strong convexity condition (B2) (so that $\kappa \|\hat{f}_n - \hat{f}_0\|^2 \le \mathrm{Reg}(\hat{f}_n; \hat{g})$), we obtain
\begin{align*} \kappa \|\hat{f}_n - \hat{f}_0\|^2 &\lesssim \|\hat{f}_n - \hat{f}_0\| \delta_{n,\mathcal{F}} + \delta_{n,\mathcal{F}}^2 + \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} \big( \|\hat{f}_n - \hat{f}_0\| \vee \delta_{n,\mathcal{F}} \big) + c_\infty \|\hat{f}_n - \hat{f}_0\|^\alpha \delta_{n,\mathcal{G}} \big( \varepsilon_{\mathrm{nuis}}(n, \eta) \vee \delta_{n,\mathcal{G}} \big) \\ &\quad + \|\mathcal{G}\|_\infty \|\hat{f}_n - \hat{f}_0\| \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} + c_\infty \|\hat{f}_n - \hat{f}_0\|^\alpha \|\mathcal{G}\|_\infty \frac{\log\log(eMn) + \log(1/\eta)}{n}, \tag{42} \end{align*}
where $M := 1 + \|\mathcal{G}\|_\infty \vee \|\mathcal{F}\|_\infty$. Next, invoke the assumed lower bounds on the critical radii (and the definition of the high-probability radii in the theorem statement) to absorb the explicit logarithmic remainder terms. In particular, under the same type of lower bound used in Step 1,
\[ \delta_{n,\mathcal{F}} \gtrsim (1 \vee M) \sqrt{\frac{\log(1/\eta) + \log\log(eMn)}{n}}, \qquad \delta_{n,\mathcal{G}} \gtrsim \sqrt{\frac{\log(1/\eta) + \log\log(eMn)}{n}}, \]
we have, after enlarging the implicit constants,
\[ \sqrt{\frac{\log\log(eMn) + \log(1/\eta)}{n}} \lesssim \delta_{n,\mathcal{F}}, \qquad \frac{\log\log(eMn) + \log(1/\eta)}{n} \lesssim \delta_{n,\mathcal{F}}^2. \]
Substituting these bounds into (42) yields
\[ \kappa \|\hat{f}_n - \hat{f}_0\|^2 \lesssim \delta_{n,\mathcal{F}} \|\hat{f}_n - \hat{f}_0\| + \delta_{n,\mathcal{F}}^2 + \|\mathcal{G}\|_\infty \delta_{n,\mathcal{F}} \big( \|\hat{f}_n - \hat{f}_0\| \vee \delta_{n,\mathcal{F}} \big) + c_\infty \|\hat{f}_n - \hat{f}_0\|^\alpha \delta_{n,\mathcal{G}} \big( \varepsilon_{\mathrm{nuis}}(n, \eta) \vee \delta_{n,\mathcal{G}} \big). \tag{43} \]
We now use the above fixed-point inequality to extract a rate bound. Let $\Delta := \|\hat{f}_n - \hat{f}_0\|$, $\delta_{\mathcal{F}} := \delta_{n,\mathcal{F}}$, and $\delta_{\mathcal{G}} := \delta_{n,\mathcal{G}}$, and define $\Delta_{\mathcal{G}} := \varepsilon_{\mathrm{nuis}}(n, \eta) \vee \delta_{\mathcal{G}}$. From (43) and the bound $\Delta \vee \delta_{\mathcal{F}} \le \Delta + \delta_{\mathcal{F}}$, we obtain
\[ \kappa \Delta^2 \lesssim (1 + \|\mathcal{G}\|_\infty) \delta_{\mathcal{F}} \Delta + \delta_{\mathcal{F}}^2 + c_\infty \Delta^\alpha \delta_{\mathcal{G}} \Delta_{\mathcal{G}}. \]
Apply Young's inequality to the linear term: $\delta_{\mathcal{F}} \Delta \le \frac{\kappa}{4}\Delta^2 + C\delta_{\mathcal{F}}^2$, and absorb $\frac{\kappa}{4}\Delta^2$ into the left-hand side. This yields
\[ \Delta^2 \lesssim \delta_{\mathcal{F}}^2 + \Delta^\alpha \delta_{\mathcal{G}} \Delta_{\mathcal{G}}. \tag{44} \]
If $\alpha \in (0, 1]$, apply Young's inequality with exponents $p = 2/\alpha$ and $q = 2/(2 - \alpha)$ to the product $\Delta^\alpha \cdot (\delta_{\mathcal{G}} \Delta_{\mathcal{G}})$:
\[ \Delta^\alpha (\delta_{\mathcal{G}} \Delta_{\mathcal{G}}) \le \varepsilon \Delta^2 + C_\alpha \varepsilon^{-\alpha/(2-\alpha)} (\delta_{\mathcal{G}} \Delta_{\mathcal{G}})^{2/(2-\alpha)}. \]
Choosing $\varepsilon > 0$ sufficiently small and absorbing $\varepsilon \Delta^2$ into the left-hand side of (44) gives
\[ \Delta^2 \lesssim \delta_{\mathcal{F}}^2 + (\delta_{\mathcal{G}} \Delta_{\mathcal{G}})^{2/(2-\alpha)}, \qquad \alpha \in (0, 1]. \]
The endpoint $\alpha = 0$ follows directly from (44), yielding
\[ \Delta^2 \lesssim \delta_{\mathcal{F}}^2 + \delta_{\mathcal{G}} \Delta_{\mathcal{G}}, \qquad \alpha = 0. \]
The result now follows recalling that $\alpha := 1 - 1/(2\beta)$. □
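As a final sanity check, the fixed-point extraction from (44) can be verified numerically: the largest $\Delta$ satisfying $\Delta^2 \le A + \Delta^\alpha B$ (with $A = \delta_{\mathcal{F}}^2$ and $B = \delta_{\mathcal{G}}\Delta_{\mathcal{G}}$) is bounded by a constant multiple of $A + B^{2/(2-\alpha)}$. The sketch below locates that largest solution by bisection over random illustrative values of $A$ and $B$; the printed ratios stay bounded by a constant depending only on $\alpha$, matching the displayed rates for both $\alpha = 0$ and $\alpha \in (0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(3)

def largest_solution(A, B, alpha):
    """Largest Delta >= 0 with Delta^2 <= A + Delta^alpha * B (bisection)."""
    lo, hi = 0.0, 10.0 + np.sqrt(A) + B  # rhs grows sublinearly, so hi fails
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid ** 2 <= A + mid ** alpha * B:
            lo = mid
        else:
            hi = mid
    return lo

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    worst = 0.0
    for _ in range(200):
        A, B = rng.uniform(1e-4, 1.0, size=2)
        d = largest_solution(A, B, alpha)
        rate = A + B ** (2.0 / (2.0 - alpha))
        worst = max(worst, d ** 2 / rate)
    print(f"alpha = {alpha:4.2f}:  max Delta^2 / (A + B^(2/(2-alpha))) = {worst:.2f}")
```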