Optimal and Structure-Adaptive CATE Estimation with Kernel Ridge Regression

We propose an optimal algorithm for estimating conditional average treatment effects (CATEs) when response functions lie in a reproducing kernel Hilbert space (RKHS). We study settings in which the contrast function is structurally simpler than the nuisance functions: (i) it lies in a lower-complexity RKHS with faster eigenvalue decay, (ii) it satisfies a source condition relative to the nuisance kernel, or (iii) it depends on a known low-dimensional covariate representation. We develop a unified two-stage kernel ridge regression (KRR) method that attains minimax rates governed by the complexity of the contrast function rather than the nuisance class, in terms of both sample size and overlap. We also show that a simple model-selection step over candidate contrast spaces and regularization levels yields an oracle inequality, enabling adaptation to unknown CATE regularity.

Authors: Seok-Jin Kim (Columbia IEOR; seok-jin.kim@columbia.edu)

This version: February 25, 2026

1 Introduction

Estimating treatment effects is a central challenge in causal inference, particularly regarding Conditional Average Treatment Effects (CATEs), which are pivotal for personalized decision-making in domains ranging from precision medicine to economics (Künzel et al., 2019; Kennedy, 2020). While the Average Treatment Effect (ATE) offers a population-level summary, it often obscures critical heterogeneity across individuals. Consequently, CATE estimation has emerged as a fundamental pursuit, as it targets the individualized effects necessary for optimal policy selection.

A salient feature in recent literature is the structural asymmetry between the treatment effect of interest and the nuisance components (e.g., baseline response functions). Empirically, nuisance functions are often highly complex (non-smooth or high-dimensional) even when the CATE itself is smooth, sparse, or constant (Kennedy, 2020; Kennedy et al., 2022; Kato and Imaizumi, 2023). A key theoretical imperative is to exploit this simpler structure to achieve convergence rates faster than those dictated by nuisance learning, and to determine whether such rates are minimax optimal. While prior work has established adaptive rates for Hölder smoothness using doubly robust estimators (Kennedy, 2020; Kennedy et al., 2022), extending these results to general complexity measures remains an open challenge (Cinelli et al., 2025). In this work, we address this gap within the reproducing kernel Hilbert space (RKHS) framework. We establish statistical guarantees that adapt to the lower complexity of the contrast function, even when the nuisance functions reside in a significantly more complex ambient space.

We consider treatment effect estimation given $n$ i.i.d. observational samples $D = \{(x_i, a_i, y_i)\}_{i=1}^n$ with covariates $x_i \in \mathbb{R}^d$, binary treatments $a_i \in \{0,1\}$, and responses $y_i$. Let $f_0^\star$ and $f_1^\star$ denote the nuisance response functions for control and treated outcomes, respectively. Our estimand is the CATE function, defined as $h^\star := f_1^\star - f_0^\star$ (see Section 2 for formal definitions). We propose an efficient methodology for estimating $h^\star$, assuming $f_0^\star, f_1^\star$ lie in a generic RKHS function space $\mathcal{F}$, while $h^\star$ belongs to a strictly "simpler" space $\mathcal{H}$.
Crucially, we achieve optimal learning rates governed solely by the complexity of $\mathcal{H}$ (rendering the complexity of $\mathcal{F}$ negligible) without imposing structural assumptions on the propensity score. We formalize this "simpler structure" via three distinct models:

• Model 1 (Subspace): $\mathcal{H} \subset \mathcal{F}$ is an RKHS exhibiting faster spectral decay (e.g., higher smoothness).
• Model 2 (Source Condition): $h^\star$ satisfies a source condition (Fischer and Steinwart, 2020; Jun et al., 2019) with respect to the kernel of $\mathcal{F}$.
• Model 3 (Low-Dimensional Structure): $h^\star(x) = \tilde h^\star(\tilde x)$ depends on a known low-dimensional representation $\tilde x$, where $\tilde h^\star$ lies in an RKHS $\tilde{\mathcal{H}}$.

While it is established that RKHS nuisance assumptions yield oracle $n^{-1/2}$ rates for the ATE (Mou et al., 2023), analogous guarantees for the CATE in this general RKHS setting have remained elusive. We resolve this gap.

1.1 Contribution

We propose a unified, imputation-based two-stage kernel ridge regression (KRR) algorithm. Our method is minimax optimal across all three models, adapting to the complexity of $\mathcal{H}$ for both the $L_2$-error and pointwise evaluation. Furthermore, we provide a lightweight model-selection result (oracle inequality) that automatically selects among candidate contrast spaces and tunes the regularizer when $\mathcal{H}$ is unknown, while preserving the target fast-rate guarantees.

Theorem (Informal). Under Model 1, 2, or 3, our algorithm achieves learning bounds that are minimax optimal with respect to the complexity of the contrast function $h^\star$ (matching oracle regression rates within $\mathcal{H}$). Crucially, the complexity of the nuisance class $\mathcal{F}$ does not degrade these rates.

This result implies that we achieve the fast convergence rates intrinsic to $\mathcal{H}$, bypassing the slower rates associated with nuisance components, without requiring consistent estimation of the propensity score. Our rates strictly outperform standard Double Machine Learning (DML) rates, which typically depend on the product of nuisance rates; we refer to these as fast rates. To the best of our knowledge, we are the first to extend this adaptivity beyond Hölder spaces to general RKHSs, including Sobolev spaces, mixed Sobolev spaces, and Neural Tangent Kernels (NTKs).

Special Case: Rates for Sobolev Classes. Our framework encompasses Sobolev and mixed Sobolev spaces (Kühn et al., 2015; Suzuki, 2018). Notably, mixed Sobolev spaces relax the standard RKHS condition ($\beta > d/2$). The resulting optimal rates are summarized in Table 1. Moreover, our bounds are optimal in both the sample size $n$ and the overlap $\kappa$ (where the propensity score lies in $[\kappa, 1-\kappa]$).

Table 1: (Squared) $L_2$-error and pointwise estimation error for Sobolev and mixed Sobolev spaces. We achieve these rates without requiring estimation of the propensity score. See Section 2.2 for a discussion of mixed Sobolev spaces.

Space | Condition | $L_2$-Error | Pointwise Error
Sobolev ($\mathcal{H} = H^\gamma$, $\mathcal{F} = H^\beta$) | $\gamma > \beta > d/2$ | $n^{-\frac{2\gamma}{d+2\gamma}}$ | $n^{-\frac{2\gamma-d}{2\gamma}}$
Mixed Sobolev ($\mathcal{H} = H^\gamma_{\mathrm{mix}}$, $\mathcal{F} = H^\beta_{\mathrm{mix}}$) | $\gamma > \beta > 1/2$ | $n^{-\frac{2\gamma}{1+2\gamma}}$ | $n^{-\frac{2\gamma-1}{2\gamma}}$

We summarize our contributions below:

• We derive the first fast rates in the RKHS framework that adapt to the complexity of $\mathcal{H}$ without propensity score estimation, generalizing results beyond Hölder spaces to three distinct RKHS structural models.
• We establish optimality with respect to both the degree of overlap $\kappa$ and the sample size $n$. These bounds coincide with the regression oracle rate in $\mathcal{H}$ based on an effective sample size of $n\kappa$.
• We provide a simple, unified methodology applicable across all three structural scenarios, alongside an oracle-inequality-based model selection step that selects among candidate $\mathcal{H}$ and regularization levels to adapt to unknown contrast complexity.

1.2 Related Work

RKHS in Causal Inference. Wang and Kim (2023) and Mou et al. (2023) studied efficient ATE estimation within the RKHS framework. For the CATE, Nie and Wager (2021) and Foster and Syrgkanis (2023) analyzed KRR-based estimators but required consistent propensity score estimation and sufficiently fast learning bounds for the nuisance components. Singh et al. (2024) proposed KRR-based methods for the CATE, yet it remains unclear whether their rates adapt to the complexity of $h^\star$. Our work differentiates itself by establishing optimal adaptive rates for the CATE under general RKHS settings without relying on propensity score consistency.

Structural Adaptivity. Kennedy (2020), Kennedy et al. (2022), and Gao and Han (2020) established optimal rates for Hölder classes using doubly robust estimators and U-statistics. Recent works have explored specific low-complexity structures: Kato and Imaizumi (2023) investigated settings where the CATE possesses a sparse linear structure, and Kim et al. (2025) analyzed RKHS-valued CATEs where the contrast function has a smaller Hilbert norm than the nuisance. Standard DML/meta-learner approaches focus on orthogonalization and nuisance-rate products in semiparametric settings (Oprescu et al., 2019; Künzel et al., 2019).

1.3 Notation

The constants $c, C, c_1, C_1, \ldots$ may vary from line to line. We use $[n]$ as shorthand for $\{1, 2, \ldots, n\}$. For nonnegative sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we write $a_n \lesssim b_n$ or $a_n = O(b_n)$ if there exists a positive constant $C$ such that $a_n \le C b_n$ for all $n$. We use $\tilde O$ to denote bounds up to polylogarithmic factors. Additionally, we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. For a bounded linear operator $A$, we use $\|A\|_{\mathrm{op}}$ to denote its operator norm. For any $u$ and $v$ in a Hilbert space $H$, their inner and outer products are denoted by $\langle u, v\rangle_H$ and $u \otimes v$, respectively. When the context is clear, we denote the inner product as $u^\top v := \langle u, v\rangle_H$.

2 Problem Setup

2.1 Treatment Regime and CATE

We consider the problem of learning from $n$ i.i.d. observational data points $D = \{(x_i, a_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X}$ denotes the covariates, $a_i \in \{0,1\}$ is a binary treatment indicator, and $y_i \in \mathbb{R}$ is the response. For generic $(x, a, y)$, we denote the marginal distribution of $x$ over the covariate space $\mathcal{X}$ by $\mathbb{P}_x$. We assume $\mathcal{X}$ is a regular domain in $\mathbb{R}^d$. Let $\mathcal{F}$ be an RKHS containing the nuisance functions $f_0^\star, f_1^\star$, with kernel $K_{\mathcal{F}}(\cdot, \cdot)$. Our response model is given by:

(Nuisances)  $\mathbb{E}[y \mid x, a = 0] = f_0^\star(x), \qquad \mathbb{E}[y \mid x, a = 1] = f_1^\star(x).$

Throughout, we assume that the norms $\|f_0^\star\|_{\mathcal{F}}$ and $\|f_1^\star\|_{\mathcal{F}}$ are bounded by a universal constant $M > 0$. Let $\mathcal{H}$ be another function space. Our estimand of interest is the CATE function, defined as:

(CATE)  $h^\star(x) := f_1^\star(x) - f_0^\star(x).$

Our primary focus is the regime where $h^\star \in \mathcal{H}$ and the hypothesis class $\mathcal{H}$ exhibits lower complexity than the ambient space $\mathcal{F}$.
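To make this setup concrete, the following short simulation sketches one possible data-generating process of this form. The specific choices of $f_0^\star$, $h^\star$, propensity score, and noise below are illustrative placeholders (not the designs used in our experiments in Section 7); the point is only that the observed data consist of covariates, a treatment drawn from a propensity bounded away from 0 and 1, and a response generated by the arm-specific nuisance functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5

# Covariates and a propensity score bounded away from 0 and 1 (overlap).
x = rng.uniform(-1.0, 1.0, size=(n, d))
pi = np.clip(0.5 + 0.4 * np.sin(3.0 * x[:, 0]), 0.1, 0.9)
a = rng.binomial(1, pi)                              # binary treatment a_i

# Illustrative response functions: a rough (kinked) baseline, a smooth contrast.
f0 = np.abs(x[:, 0] - 0.3) + np.abs(x[:, 1] + 0.5)   # nuisance f_0*
h_star = 0.5 * x[:, 0] ** 2                          # CATE h* (lower complexity)
f1 = f0 + h_star                                     # nuisance f_1* = f_0* + h*

eps = rng.normal(0.0, 1.0, size=n)                   # sub-Gaussian noise
y = np.where(a == 1, f1, f0) + eps                   # observed response

D = list(zip(x, a, y))                               # dataset D = {(x_i, a_i, y_i)}
```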
In this regime, we aim to derive sharp estimation error bounds that are adaptive to the complexity of $\mathcal{H}$. Specifically, we seek rates that are independent of the complexity of $\mathcal{F}$, matching the minimax optimal rates for a regression problem restricted to $\mathcal{H}$. We evaluate performance using two metrics: the (squared) $L_2(\mathbb{P}_x)$-error and the (squared) pointwise evaluation error at a fixed query point $x_0 \in \mathcal{X}$:
\[
\mathcal{E}_{L_2}(\hat h) := \mathbb{E}_{x \sim \mathbb{P}_x}\big[\hat h(x) - h^\star(x)\big]^2,
\qquad
\mathcal{E}_{x_0}(\hat h) := \big(\hat h(x_0) - h^\star(x_0)\big)^2 .
\]
These are standard measures in nonparametric estimation (Wainwright, 2019; Bühlmann and Van De Geer, 2011).

2.2 Three Models of Structural Simplicity

We delineate three distinct structural scenarios where the CATE is simpler than the nuisance components. In all cases, $\mathcal{H}$ represents a strictly less complex hypothesis class than $\mathcal{F}$. The first model considers the case where $\mathcal{H}$ is an RKHS subspace of $\mathcal{F}$ with faster eigenvalue decay.

Model 1 (Subspace). $\mathcal{H}$ is an RKHS with $\mathcal{H} \subset \mathcal{F}$, and $\mathcal{H}$ exhibits faster spectral decay than $\mathcal{F}$.

Under Model 1, $\mathcal{H}$ is an RKHS; we denote its associated kernel by $K_{\mathcal{H}}$. Examples of Model 1 include:

• Parametric vs. Nonparametric: $\mathcal{F}$ is an infinite-dimensional RKHS (e.g., a Sobolev class), while $\mathcal{H}$ is a finite-dimensional linear subspace (e.g., polynomials).
• Different Sobolev Smoothness: $\mathcal{H}$ corresponds to a smoother function class than $\mathcal{F}$. For instance, $\mathcal{F} = H^\beta(\mathcal{X})$ and $\mathcal{H} = H^\gamma(\mathcal{X})$ with $\gamma \ge \beta > \frac{d}{2}$.
• Mixed Smoothness: We can consider mixed Sobolev spaces (Kühn et al., 2015; Suzuki, 2018) $\mathcal{F} = H^\beta_{\mathrm{mix}}(\mathcal{X})$ and $\mathcal{H} = H^\gamma_{\mathrm{mix}}(\mathcal{X})$ with $\gamma \ge \beta > 1/2$. Notably, mixed Sobolev spaces significantly relax the standard smoothness constraint ($\beta > d/2$) required for isotropic Sobolev classes.

Relaxation via Mixed Smoothness. A standard limitation of isotropic Sobolev spaces $H^\beta(\mathcal{X})$ is the requirement $\beta > d/2$ to guarantee the RKHS property. Mixed Sobolev spaces $H^\beta_{\mathrm{mix}}(\mathcal{X})$ offer a powerful alternative by relaxing this condition to $\beta > 1/2$, independent of the dimension $d$. For example, $H^1_{\mathrm{mix}}$ consists of functions whose mixed first-order derivatives (e.g., the full cross-derivative $\partial_{x_1}\partial_{x_2}\cdots\partial_{x_d} f$) are square-integrable, but it does not require higher-order derivatives in any single coordinate, such as $\partial^2_{x_1} f$. By adopting $H^\beta_{\mathrm{mix}}$, we can model high-dimensional CATE functions that are smooth in coordinate-wise interactions without the stringent isotropic smoothness condition.

Next, we introduce our second model based on spectral source conditions.

Model 2 (Source Condition). We assume $h^\star$ satisfies a source condition with parameter $0 < \nu < 1$ with respect to the kernel of $\mathcal{F}$. Formally, let $T_{\mathcal{F}} : L_2(\mathbb{P}_x) \to L_2(\mathbb{P}_x)$ be the integral operator associated with the kernel of $\mathcal{F}$. The source condition assumption is defined as $h^\star \in \mathrm{Range}\big(T_{\mathcal{F}}^{\frac{1+\nu}{2}}\big)$.

Source conditions, widely utilized in RKHS theory (Singh et al., 2024; Jun et al., 2019; Chen et al., 2024), characterize functions whose spectral coefficients decay rapidly relative to the eigenvalues of the kernel integral operator. This assumption implies that $h^\star$ lies in a fractional power space $[\mathcal{F}]^\nu$, which is strictly smaller and "smoother" than $\mathcal{F}$. This concept is also central to the analysis of overparameterized neural networks via the NTK (Zhang et al., 2025; Ghorbani et al., 2021).
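For intuition, the source condition admits a standard spectral restatement (a reading aid rather than an additional assumption). Writing the Mercer eigendecomposition of $T_{\mathcal{F}}$ with eigenpairs $\{(\mu_j, e_j)\}_{j \ge 1}$,
\[
h^\star \in \mathrm{Range}\big(T_{\mathcal{F}}^{\frac{1+\nu}{2}}\big)
\iff
h^\star = \sum_{j \ge 1} \mu_j^{\frac{1+\nu}{2}} g_j\, e_j
\quad \text{for some } (g_j)_{j \ge 1} \text{ with } \sum_{j \ge 1} g_j^2 < \infty ,
\]
so the $L_2(\mathbb{P}_x)$-coefficients of $h^\star$ satisfy $\langle h^\star, e_j\rangle = \mu_j^{(1+\nu)/2} g_j$; a larger $\nu$ forces faster coefficient decay relative to the spectrum of $K_{\mathcal{F}}$, i.e., a smoother target.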
Finally, we present a model where $h^\star$ depends on a lower-dimensional set of covariates.

Model 3 (Low-Dimensional Structure). We assume there exists a known low-dimensional transformation $\tilde x_i$ of $x_i$ and a function $\tilde h^\star \in \tilde{\mathcal{H}}$, where $\tilde{\mathcal{H}}$ is an RKHS, such that
\[
h^\star(x_i) = \tilde h^\star(\tilde x_i).
\]
Common examples include variable selection (where $\tilde x_i$ is a subset of coordinates) and linear projections (where $\tilde x_i = P x_i$ for a projection $P$). As an illustration, one can consider $\mathcal{F} = H^\beta(\mathcal{X})$ and $\mathcal{H} = H^\beta(\tilde{\mathcal{X}})$, where $\tilde{\mathcal{X}}$ is the space of $\tilde x_i$. However, because $\dim(\tilde x) < \dim(x)$, the statistical complexity of estimating $\tilde h^\star$ is strictly lower. Our goal is to achieve rates adapting to this lower intrinsic dimension. Under Model 3, $\tilde{\mathcal{H}}$ is an RKHS; we denote its associated kernel by $K_{\tilde{\mathcal{H}}}$. We set $\dim(\tilde x_i) = \tilde d < d$.

2.3 Standard Assumptions

We impose the following standard assumptions common to causal inference and nonparametric regression (Künzel et al., 2019; Kennedy, 2020; Wainwright, 2019).

Assumption 1 (Consistency and Unconfoundedness). Let $y_0$ and $y_1$ be the potential outcomes. We observe $y = y_a$ (consistency), and $(y_0, y_1) \perp\!\!\!\perp a \mid x$ (unconfoundedness).

Assumption 2 (Overlap). The propensity score $\pi(x) := \mathbb{P}[a_i = 1 \mid x_i = x]$ satisfies the overlap condition: there exists $\kappa > 0$ such that $\kappa < \pi(x) < 1 - \kappa$ for all $x \in \mathcal{X}$.

The parameter $\kappa$ quantifies the degree of overlap. We explicitly track the dependence of our bounds on $\kappa$, aiming for optimality in terms of the effective sample size $n\kappa$.

Assumption 3 (Sub-Gaussian Noise). Conditioned on $x_i$ and $a_i$, the noise $\varepsilon_i$ is $\sigma$-sub-Gaussian. We assume $\sigma$ is bounded by an absolute constant.

Assumption 4 (Boundedness of Kernels). There exists a universal constant $\xi > 0$ such that $\sup_{x \in \mathcal{X}} K_{\mathcal{F}}(x, x) \le \xi$.

Assumptions 1 to 3 are standard in the causal inference literature (Künzel et al., 2019; Curth and Van der Schaar, 2021; Kennedy et al., 2022; Kennedy, 2020). Assumption 4 is common in the analysis of kernel methods (Wainwright, 2019; Singh et al., 2024; Wang, 2023). One key difference from prior work is how overlap is handled: most analyses assume strong overlap and treat $\kappa$ in Assumption 2 as a fixed constant. By contrast, we allow $\kappa$ to shrink, treat it as a key parameter, and derive non-asymptotic bounds accordingly. In other words, we study the weak-overlap regime.

2.4 Fundamental Limits of CATE Estimation

To contextualize our results, we characterize the fundamental limits of estimating $h^\star$. We establish these limits by benchmarking against minimax lower bounds for standard nonparametric regression over $\mathcal{H}$. We first define these regression baselines.

Definition 1. Let $\mathrm{LB\text{-}L2}(N; \mathcal{H})$ denote the minimax squared $L_2$-error lower bound for regression over $\mathcal{H}$ with $N$ samples. Similarly, let $\mathrm{LB\text{-}PE}(N; \mathcal{H})$ denote the corresponding lower bound for the squared pointwise evaluation error.

For instance, if $\mathcal{H} = H^\gamma(\mathcal{X})$ (Sobolev space), the rates are given by $\mathrm{LB\text{-}L2}(N; \mathcal{H}) \asymp N^{-2\gamma/(d+2\gamma)}$ and $\mathrm{LB\text{-}PE}(N; \mathcal{H}) \asymp N^{-(2\gamma-d)/(2\gamma)}$ (Tuo and Zou, 2024; Wainwright, 2019). Building on these definitions, we present the lower bounds for CATE estimation.
Lemma 1 (Informal: Lower Bounds). The minimax squared $L_2$-error for estimating $h^\star$ is lower-bounded by $\mathrm{LB\text{-}L2}(n\kappa; \mathcal{H})$, up to constant factors. Similarly, the minimax squared pointwise evaluation error at $x_0$ is lower-bounded by $\mathrm{LB\text{-}PE}(n\kappa; \mathcal{H})$, up to constant factors.

Proof. Consider a simplified oracle setting where the baseline function $f_0^\star$ is known exactly and the propensity score is constant, $\pi(x) \equiv \kappa$. In this scenario, estimating $h^\star$ is equivalent to estimating $f_1^\star$ using only the treated subpopulation, which has an expected sample size of $n\kappa$. Consequently, the CATE estimation problem reduces to a standard regression problem over $\mathcal{H}$. Any algorithm achieving a rate faster than the regression lower bound with sample size $n\kappa$ would violate the minimax optimality of regression in $\mathcal{H}$.

Consequently, for Sobolev spaces $\mathcal{H} = H^\gamma(\mathcal{X})$, the $L_2$ lower bound is $(n\kappa)^{-\frac{2\gamma}{d+2\gamma}}$, and the pointwise evaluation lower bound is $(n\kappa)^{-\frac{2\gamma-d}{2\gamma}}$.

3 Methodology: A Unified Approach

In this section, we present a unified meta-algorithm designed to achieve minimax optimality under Models 1 to 3. Our approach explicitly decouples the estimation of nuisance parameters from the estimation of the target estimand (the CATE), allowing the final estimator to adapt solely to the intrinsic complexity of the contrast function $h^\star$.

3.1 Algorithm Structure

We propose a two-stage procedure involving undersmoothed nuisance estimation followed by a switch-imputation based regression. The overall procedure is summarized in Algorithm 1.

1. Nuisance Estimation via Undersmoothed KRR. First, we estimate the conditional mean functions $f_0^\star$ and $f_1^\star$ using the observational data $D$. To ensure that the bias from nuisance estimation does not dominate the CATE estimation error, we employ undersmoothed KRR:
\[
\hat f_a := \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 \mathbf{1}(a_i = a) + \bar\lambda \|f\|_{\mathcal{F}}^2 \Big\}, \qquad a \in \{0, 1\}. \tag{1}
\]
Here, $\bar\lambda$ denotes the regularization parameter for the nuisance stage. Crucially, $\bar\lambda$ must be chosen sufficiently small to mitigate regularization bias. While this may inflate the variance of the nuisance estimates, this variance is effectively controlled in the second-stage regression. For generic bounded kernels, the scaling $\bar\lambda \asymp \log(n)/n$ typically suffices. When $K_{\mathcal{F}}$ is a Sobolev kernel (implying $\mathcal{F} = H^\beta(\mathcal{X})$), any choice within the range $\log n \cdot n^{-\frac{2\beta}{d}} \lesssim \bar\lambda \lesssim \log n / n$ yields the desired guarantees; we provide a rigorous justification in the Appendix.

2. Generating Pseudo-outcomes via Switch-Imputation. We construct pseudo-outcomes $\{m_i\}_{i=1}^n$ that serve as approximately unbiased proxies for the unobserved potential outcomes:
\[
m_i := \begin{cases} y_i - \hat f_0(x_i), & \text{if } a_i = 1, \\ \hat f_1(x_i) - y_i, & \text{if } a_i = 0. \end{cases} \tag{2}
\]
This construction isolates the treatment effect $h^\star$ by centering the response with the estimated baseline function.

3. Regression Oracle for the Final Estimator. Finally, we apply a regression oracle $\mathcal{O}$ to the dataset of pseudo-pairs $\{(x_i, m_i)\}_{i=1}^n$. The oracle returns the estimator $\hat h = \mathcal{O}(\{(x_i, m_i)\}_{i=1}^n)$. The explicit form of $\mathcal{O}$ depends on the structural assumptions imposed on $h^\star$, as detailed in Section 3.2.

Algorithm 1: Optimal CATE Learner with KRR
1: Input: Dataset $D = \{(x_i, a_i, y_i)\}_{i=1}^n$, regression oracle $\mathcal{O}$, nuisance regularizer $\bar\lambda \asymp \frac{\log n}{n}$, main regularizer $\lambda$.
2: Step 1: Nuisance Estimation
3: Compute $\hat f_0, \hat f_1$ via eq. (1) using $D$ and the nuisance regularizer $\bar\lambda$.
4: Step 2: Pseudo-outcome Generation
5: For $i = 1, \ldots, n$, compute $m_i$ via eq. (2) using $\hat f_0, \hat f_1$.
6: Step 3: Target Estimation
7: Return $\hat h = \mathcal{O}(\{(x_i, m_i)\}_{i=1}^n)$ with main regularizer $\lambda$.
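The following is a minimal NumPy sketch of Algorithm 1. The kernel choice (`rbf_kernel`), the helper names, and the regularizer values are illustrative assumptions for exposition rather than the tuned configurations of Section 7; any positive-definite kernels for $\mathcal{F}$ and $\mathcal{H}$ can be substituted.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    # Gaussian kernel matrix; a stand-in for K_F or K_H (illustrative choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def krr_fit(K, targets, ridge):
    # Dual KRR coefficients: solve (K + ridge * I) alpha = targets.
    return np.linalg.solve(K + ridge * np.eye(K.shape[0]), targets)

def cate_two_stage(x, a, y, k_nuis, k_cate, lam_bar, lam):
    """Algorithm 1: undersmoothed nuisance KRR, switch imputation, second-stage KRR."""
    n = len(y)
    x0, y0 = x[a == 0], y[a == 0]
    x1, y1 = x[a == 1], y[a == 1]

    # Step 1: nuisance estimation, eq. (1), with a small (undersmoothed) ridge.
    alpha0 = krr_fit(k_nuis(x0, x0), y0, n * lam_bar)
    alpha1 = krr_fit(k_nuis(x1, x1), y1, n * lam_bar)
    f0_hat = lambda q: k_nuis(q, x0) @ alpha0
    f1_hat = lambda q: k_nuis(q, x1) @ alpha1

    # Step 2: switch-imputation pseudo-outcomes, eq. (2).
    m = np.where(a == 1, y - f0_hat(x), f1_hat(x) - y)

    # Step 3: regression oracle O = KRR on {(x_i, m_i)} with the main regularizer.
    alpha = krr_fit(k_cate(x, x), m, n * lam)
    return lambda q: k_cate(q, x) @ alpha

# Example call (illustrative hyperparameters):
# h_hat = cate_two_stage(x, a, y, rbf_kernel, lambda A, B: rbf_kernel(A, B, 2.0),
#                        lam_bar=np.log(len(y)) / len(y), lam=1e-2)
```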
3.2 Regression Oracle $\mathcal{O}$

We now instantiate the regression oracle $\mathcal{O}$ for the three structural models introduced in Section 2. In all scenarios, the oracle performs a variant of KRR on the pseudo-outcomes, tailored to the specific complexity of $h^\star$.

Oracle for Model 1 (Subspace). Under Model 1, where $h^\star$ resides in a strictly simpler RKHS $\mathcal{H} \subset \mathcal{F}$, the oracle performs KRR directly within $\mathcal{H}$:
\[
\hat h := \arg\min_{h \in \mathcal{H}} \Big\{ \frac{1}{n} \sum_{i=1}^n (m_i - h(x_i))^2 + \lambda \|h\|_{\mathcal{H}}^2 \Big\}.
\]
Here, the regularization parameter $\lambda$ is tuned to the spectral decay of $\mathcal{H}$, independent of the ambient space $\mathcal{F}$.

Oracle for Model 2 (Source Condition). Under Model 2, where $h^\star \in \mathcal{F}$ satisfies a source condition, the oracle performs KRR in the ambient space $\mathcal{F}$:
\[
\hat h := \arg\min_{h \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n (m_i - h(x_i))^2 + \lambda \|h\|_{\mathcal{F}}^2 \Big\}.
\]
Although the optimization is over $\mathcal{F}$, the choice of $\lambda$ exploits the source condition to achieve faster convergence rates.

Oracle for Model 3 (Low-Dimensional Structure). Under Model 3, where $h^\star$ depends only on a low-dimensional feature projection $\tilde x_i$, the oracle performs KRR in the corresponding space $\tilde{\mathcal{H}}$:
\[
\hat h := \arg\min_{\tilde h \in \tilde{\mathcal{H}}} \Big\{ \frac{1}{n} \sum_{i=1}^n (m_i - \tilde h(\tilde x_i))^2 + \lambda \|\tilde h\|_{\tilde{\mathcal{H}}}^2 \Big\}.
\]
This effectively reduces the estimation problem to the intrinsic dimension of $h^\star$. In all three cases, we refer to $\lambda$ as the main regularizer.
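Concretely, the only change for the Model 3 oracle relative to the sketch above is that the second-stage kernel is evaluated on the known representation $\tilde x$; the coordinate-subset projection and length scale below are illustrative assumptions.

```python
# Model 3 oracle: reuse cate_two_stage from the sketch above, but let the
# second-stage kernel see only the known low-dimensional representation
# x_tilde (here the first four coordinates; a linear projection P @ x works
# the same way).
def k_cate_lowdim(A, B):
    return rbf_kernel(A[:, :4], B[:, :4], ell=1.5)

# h_hat = cate_two_stage(x, a, y, rbf_kernel, k_cate_lowdim, lam_bar, lam)
```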
In practice, since the optimal $\mathcal{H}$ and $\lambda$ are unknown, one must select among candidate hypothesis classes $\{\mathcal{H}_k\}$; we address this via the model selection procedure below.

3.3 Model Selection Procedure

In practical applications, the optimal hypothesis space $\mathcal{H}$ and the associated hyperparameters are rarely known a priori. While the nuisance class $\mathcal{F}$ can be validated via standard cross-validation on observed outcomes, selecting the best model for the unobserved CATE $h^\star$ is nontrivial due to the absence of counterfactuals. To address this, we propose a dedicated model selection procedure that allows the learner to adaptively select the best estimator from a collection of candidates. Our procedure employs three disjoint data splits to ensure independence between candidate generation, validation proxy construction, and the final selection step.

1. Candidate Generation ($D_1$). We partition the dataset $D$ into three disjoint subsets $D_1, D_2, D_3$ with sizes $n_1, n_2, n_3$ such that $n_1 + n_2 + n_3 = n$ (e.g., $n_k \asymp n/3$). Using $D_1$, we run Algorithm 1 with various hyperparameter configurations (e.g., different hypothesis classes $\{\mathcal{H}_1, \ldots, \mathcal{H}_K\}$ and regularization parameters) to generate a set of candidate estimators $\mathcal{M} = \{\hat h_1, \ldots, \hat h_L\}$. We denote the configuration library by $\Pi$, where each element is a pair $(\mathcal{H}, \lambda)$ consisting of a hypothesis class and a regularization level.

2. Truncation. To control the variance during the selection phase, we enforce a boundedness constraint. Let $B > 0$ be the truncation level. For each $\hat h_j \in \mathcal{M}$, we define its truncated version $\bar h_j(x) := \min(\max(\hat h_j(x), -B), B)$. Let $\bar{\mathcal{M}} = \{\bar h_1, \ldots, \bar h_L\}$ denote the set of truncated candidates.

3. Proxy Construction ($D_2$). Using $D_2$, we construct a high-quality proxy for the unobserved CATE. We estimate the nuisance functions $\tilde f_0, \tilde f_1$ using KRR with undersmoothed regularization $\tilde\lambda \asymp \log(n)/n$:
\[
\tilde f_a := \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n_2} \sum_{i=1}^{n_2} (y_{2i} - f(x_{2i}))^2 \mathbf{1}(a_{2i} = a) + \tilde\lambda \|f\|_{\mathcal{F}}^2 \Big\}, \qquad a \in \{0, 1\}. \tag{3}
\]
Note that these estimates rely solely on $D_2$ and are thus independent of the candidates generated from $D_1$.

4. Empirical Risk Minimization ($D_3$). Finally, using $D_3$, we evaluate the candidates against the proxy constructed from $D_2$. For each $(x_{3i}, a_{3i}, y_{3i}) \in D_3$, define the proxy labels:
\[
\tilde m_{3i} := \begin{cases} y_{3i} - \tilde f_0(x_{3i}), & \text{if } a_{3i} = 1, \\ \tilde f_1(x_{3i}) - y_{3i}, & \text{if } a_{3i} = 0. \end{cases} \tag{4}
\]
We select the estimator that minimizes the empirical squared error with respect to these proxies:
\[
\hat h_{\mathrm{ms}} := \arg\min_{\bar h \in \bar{\mathcal{M}}} \frac{1}{n_3} \sum_{i=1}^{n_3} \big(\bar h(x_{3i}) - \tilde m_{3i}\big)^2 .
\]

Algorithm 2: CATE Model Selection
1: Input: Data $D$, set of candidate configurations $\Pi$.
2: Partition $D$ into $D_1, D_2, D_3$.
3: Step 1: Train candidates $\mathcal{M} = \{\hat h_1, \ldots, \hat h_L\}$ on $D_1$ by running Algorithm 1 over multiple configurations $(\mathcal{H}, \lambda) \in \Pi$.
4: Step 2: Truncate candidates to $[-B, B]$ to obtain $\bar{\mathcal{M}} = \{\bar h_1, \ldots, \bar h_L\}$.
5: Step 3: Estimate nuisance functions $\tilde f_0, \tilde f_1$ on $D_2$ via eq. (3).
6: Step 4: For $(x_{3i}, a_{3i}, y_{3i}) \in D_3$, compute proxies $\tilde m_{3i}$ using $\tilde f_0, \tilde f_1$ by eq. (4).
7: Return $\bar h_{\mathrm{ms}} = \arg\min_{\bar h \in \bar{\mathcal{M}}} \frac{1}{n_3} \sum_{i=1}^{n_3} (\bar h(x_{3i}) - \tilde m_{3i})^2$.
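A compact sketch of Algorithm 2 follows, reusing `rbf_kernel`, `krr_fit`, and `cate_two_stage` from the earlier sketch; the split sizes, truncation level $B$, and candidate library are illustrative assumptions rather than prescribed settings.

```python
import numpy as np

def cate_model_selection(x, a, y, candidates, lam_bar, B=5.0, seed=0):
    """Algorithm 2: train candidates on D1, build proxy labels via D2, select by ERM on D3.

    `candidates` plays the role of the configuration library Pi: a list of
    (second-stage kernel, main regularizer lambda) pairs.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    d1, d2, d3 = np.array_split(idx, 3)              # three disjoint splits

    # Step 1: candidate estimators trained on D1 via Algorithm 1.
    fits = [cate_two_stage(x[d1], a[d1], y[d1], rbf_kernel, k, lam_bar, lam)
            for (k, lam) in candidates]

    # Step 2: truncate candidate predictions to [-B, B].
    def truncate(h):
        return lambda q: np.clip(h(q), -B, B)
    fits = [truncate(h) for h in fits]

    # Step 3: undersmoothed nuisance proxies on D2, eq. (3).
    x2, a2, y2 = x[d2], a[d2], y[d2]
    n2 = len(y2)
    a0 = krr_fit(rbf_kernel(x2[a2 == 0], x2[a2 == 0]), y2[a2 == 0], n2 * lam_bar)
    a1 = krr_fit(rbf_kernel(x2[a2 == 1], x2[a2 == 1]), y2[a2 == 1], n2 * lam_bar)
    f0 = lambda q: rbf_kernel(q, x2[a2 == 0]) @ a0
    f1 = lambda q: rbf_kernel(q, x2[a2 == 1]) @ a1

    # Step 4: proxy labels on D3, eq. (4), then empirical risk minimization.
    x3, a3, y3 = x[d3], a[d3], y[d3]
    m3 = np.where(a3 == 1, y3 - f0(x3), f1(x3) - y3)
    risks = [np.mean((h(x3) - m3) ** 2) for h in fits]
    return fits[int(np.argmin(risks))]
```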
4 General Theory for the Analysis

In this section, we present a unified theory covering Models 1 to 3. Across all three models, the second stage performs KRR on pseudo-outcomes $m_i$ obtained from first-stage nuisance estimation (regression oracle $\mathcal{O}$); the only difference is the underlying function space. To capture all cases simultaneously, we analyze a generic second-stage KRR problem in an abstract Hilbert space $X$ and derive a high-probability error bound. We begin by formalizing the RKHS representation and then present the generic setup and the main theorem.

4.1 RKHS Formulation and Notation

RKHS Formulation. By standard theory (Aronszajn, 1950), there exists a Hilbert space $F$, along with a feature map $\phi : \mathcal{X} \to F$, satisfying the reproducing property $\langle \phi(x), \phi(x')\rangle_F = K_{\mathcal{F}}(x, x')$. Accordingly, the hypothesis space is defined as
\[
\mathcal{F} = \{ f_\theta(\cdot) = \langle \phi(\cdot), \theta \rangle_F \mid \theta \in F \}.
\]
We denote the Hilbertian element corresponding to $f_\theta$ by $\theta$. In particular, $\theta_0^\star$ and $\theta_1^\star$ denote the Hilbertian elements corresponding to $f_0^\star$ and $f_1^\star$, respectively.

For Model 1, the subspace $\mathcal{H}$ is also an RKHS. In this case, there exists a Hilbert space $H$, along with a feature map $\psi : \mathcal{X} \to H$, satisfying the reproducing property $\langle \psi(x), \psi(x')\rangle_H = K_{\mathcal{H}}(x, x')$. Accordingly, the hypothesis space is defined as
\[
\mathcal{H} = \{ h_\eta(\cdot) = \langle \psi(\cdot), \eta \rangle_H \mid \eta \in H \}.
\]
Similarly, we denote the Hilbertian element corresponding to $h_\eta$ by $\eta$, and we write $\eta^\star$ for the Hilbertian element corresponding to $h^\star$. Under Model 2, we use the same notation $\eta^\star$ for the Hilbertian element corresponding to $h^\star$ in $F$.

For Model 3, the space $\tilde{\mathcal{H}}$ is also an RKHS. In this case, there exists a Hilbert space $\tilde H$, along with a feature map $\tilde\psi : \tilde{\mathcal{X}} \to \tilde H$, satisfying the reproducing property $\langle \tilde\psi(\tilde x), \tilde\psi(\tilde x')\rangle_{\tilde H} = K_{\tilde{\mathcal{H}}}(\tilde x, \tilde x')$. Accordingly, the hypothesis space is defined as
\[
\tilde{\mathcal{H}} = \{ \tilde h_\eta(\cdot) = \langle \tilde\psi(\cdot), \eta \rangle_{\tilde H} \mid \eta \in \tilde H \}.
\]
We denote the Hilbertian element corresponding to $\tilde h_\eta$ by $\eta$, and we write $\eta^\star$ for the Hilbertian element corresponding to $h^\star$ in $\tilde H$.

Design Operators. Consider $N$ generic elements $\{v_1, \ldots, v_N\}$, for some $N > 1$, in a Hilbert space $X$. We define the design operator of $\{v_1, \ldots, v_N\}$ as $V : X \to \mathbb{R}^N$, which, for all $\theta \in X$, satisfies
\[
V\theta = \big(\langle v_1, \theta\rangle, \langle v_2, \theta\rangle, \ldots, \langle v_N, \theta\rangle\big)^\top .
\]
Similarly, we define the adjoint of $V$, denoted $V^\top : \mathbb{R}^N \to X$, as the operator such that for all $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$,
\[
V^\top a = \sum_{i=1}^N a_i v_i \in X .
\]

4.2 Setup

We consider a generic setup for performing KRR on the pseudo-outcomes $m_i$. We fix a Hilbert space $X$ and a design operator $W$ built from i.i.d. features $\{w_i\}_{i=1}^n$ with $w_i \in X$. We provide a generalized framework for the regression oracle $\mathcal{O}$ in Algorithm 1. Across Models 1–3, $\mathcal{O}$ performs KRR on the pseudo-outcome vector $M := (m_1, \ldots, m_n)$. To unify the analysis, we therefore study KRR in a generic Hilbert space $X$.

Regression Oracle $\mathcal{O}$ as KRR on $X$. We abstract the second-stage regression in Algorithm 1 as kernel ridge regression in a generic Hilbert space $X$, which can be $H$, $F$, or $\tilde H$. We write the population second moment as $\Sigma := \mathbb{E}[w_i \otimes w_i]$. This setup specializes to our models as follows:

• Model 1: $X = H$, $w_i = \psi(x_i)$ (feature map of $K_{\mathcal{H}}$).
• Model 2: $X = F$, $w_i = \phi(x_i)$ (feature map of $K_{\mathcal{F}}$).
• Model 3: $X = \tilde H$, $w_i = \tilde\psi(\tilde x_i)$ (feature map of $K_{\tilde{\mathcal{H}}}$).

In all cases, the target function is represented by an element $\eta^\star \in X$ such that $\langle w_i, \eta^\star\rangle_X = h^\star(x_i)$. Recall the pseudo-outcomes $m_i$ from eq. (2) and their vector form $M$. The second-stage estimator from oracle $\mathcal{O}$ can therefore be written as
\[
\hat\eta = (W^\top W + n\lambda I)^{-1} W^\top M . \tag{5}
\]
Here $\hat\eta$ denotes the Hilbertian element corresponding to the function estimator $\hat h$ returned by Algorithm 1.

Evaluation. We evaluate with a (positive) trace-class operator $\Sigma_{\mathrm{ref}}$ on $X$. For any estimator $\hat\eta$, define
\[
\mathcal{E}_{\mathrm{ref}}(\hat\eta) := \|\hat\eta - \eta^\star\|_{\Sigma_{\mathrm{ref}}}^2 = \langle \hat\eta - \eta^\star, \Sigma_{\mathrm{ref}}(\hat\eta - \eta^\star)\rangle_X .
\]
These choices recover familiar error metrics. For instance, in Model 1, $\Sigma$ is the second-moment operator for $K_{\mathcal{H}}$; taking $\Sigma_{\mathrm{ref}} = \Sigma$ yields the $L_2(\mathbb{P}_x)$-error, while $\Sigma_{\mathrm{ref}} = \psi(x_0) \otimes \psi(x_0)$ gives the pointwise error at $x_0$ (and similarly with $\phi(x_0)$ or $\tilde\psi(\tilde x_0)$ under Models 2 and 3). We summarize these cases below:

• $L_2$-error: $\Sigma_{\mathrm{ref}} = \Sigma$.
• Pointwise evaluation at $x_0$: $\Sigma_{\mathrm{ref}} = \psi(x_0) \otimes \psi(x_0)$ (or $\phi(x_0) \otimes \phi(x_0)$ / $\tilde\psi(\tilde x_0) \otimes \tilde\psi(\tilde x_0)$ in Models 2–3), so $\mathcal{E}_{\mathrm{ref}}(\hat\eta) = |\hat h(x_0) - h^\star(x_0)|^2$.

For the operator $\Sigma_{\mathrm{ref}}$ on $X$, define
\[
S_\lambda := \Sigma_{\mathrm{ref}}^{1/2} (\Sigma + \lambda I)^{-1} \Sigma_{\mathrm{ref}}^{1/2} .
\]
This operator captures the effective dimension of the estimator under the evaluation metric induced by $\Sigma_{\mathrm{ref}}$.
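Empirically, $\mathrm{Tr}(S_\lambda)$ (with $\Sigma_{\mathrm{ref}} = \Sigma$) and the pointwise leverage score can be approximated by plugging the empirical second-moment operator in for $\Sigma$, which reduces to Gram-matrix computations. The sketch below records these plug-in formulas; it is illustrative and not part of the analysis.

```python
import numpy as np

def effective_dimension(K, lam):
    # Plug-in for Tr(S_lambda) with Sigma_ref = Sigma: sum_j rho_j / (rho_j + lam),
    # where the rho_j are approximated by the eigenvalues of the empirical second
    # moment, i.e., the eigenvalues of K / n for the Gram matrix K.
    rho = np.clip(np.linalg.eigvalsh(K) / K.shape[0], 0.0, None)
    return float(np.sum(rho / (rho + lam)))

def leverage_score(K, k0, k00, lam):
    # Plug-in for <psi(x0), (Sigma_hat + lam I)^{-1} psi(x0)> via a Woodbury-type
    # identity: (1/lam) * ( K(x0, x0) - k0^T (K + n lam I)^{-1} k0 ),
    # with k0 = (K_H(x_i, x0))_i and k00 = K_H(x0, x0).
    n = K.shape[0]
    sol = np.linalg.solve(K + n * lam * np.eye(n), k0)
    return float((k00 - k0 @ sol) / lam)
```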
Additional notation used only in the appendix proofs is collected in Appendix A.1 (Table 7). Table 2 summarizes the notation used in this section.

Table 2: Notation used in this section.

Symbol | Definition
Spaces and feature maps |
$F$ | Hilbert space associated with the kernel $K_{\mathcal{F}}$ and feature map $\phi : \mathcal{X} \to F$.
$H$ | Hilbert space associated with the kernel $K_{\mathcal{H}}$ (under Model 1).
$\tilde H$ | Hilbert space associated with $K_{\tilde{\mathcal{H}}}$ (under Model 3).
$\phi, \psi, \tilde\psi$ | Feature maps for $K_{\mathcal{F}}, K_{\mathcal{H}}, K_{\tilde{\mathcal{H}}}$, respectively.
Operators and vectors |
$\theta_0^\star, \theta_1^\star$ | Hilbertian elements corresponding to the nuisance functions $f_0^\star, f_1^\star$ in $F$.
$\eta^\star$ | Hilbertian element corresponding to the target contrast function $h^\star$ in $X$.
$w_i$ | Feature vector in $X$ for the second-stage regression.
$W$ | Design operator of $\{w_i\}_{i=1}^n$.
$\Sigma$ | Population second moment $\mathbb{E}[w_i \otimes w_i]$.
$\Sigma_{\mathrm{ref}}$ | Evaluation operator defining the loss.
$S_\lambda$ | Sandwich operator $\Sigma_{\mathrm{ref}}^{1/2}(\Sigma + \lambda I)^{-1}\Sigma_{\mathrm{ref}}^{1/2}$.
$m_i$ | Pseudo-outcome defined by switch imputation in eq. (2).
$M$ | Pseudo-outcome vector $(m_i)_{i \in [n]}$.

4.3 Upper Bound Analysis

Theorem 1 (General error bound). Suppose we run Algorithm 1 with main regularizer $\lambda$ and $\mathcal{O}$ as KRR in the Hilbert space $X$, under the assumptions in Section 2.3. Let $\hat h$ be the estimator produced by the algorithm and let $\hat\eta$ be the corresponding Hilbertian element in $X$ (defined in eq. (5)). Then, with probability at least $1 - n^{-10}$, we have
\[
\mathcal{E}_{\mathrm{ref}}(\hat\eta) \;\lesssim\; \frac{1}{n\kappa}\,\mathrm{Tr}(S_\lambda) \;+\; \lambda^2\,\|S_\lambda\|_{\mathrm{op}}\,\big\|(\Sigma + \lambda I)^{-1/2}\eta^\star\big\|_X^2 \;+\; \frac{1}{n\kappa}\,\|S_\lambda\|_{\mathrm{op}} .
\]

Discussion. The bound matches the standard KRR error bound in $X$, up to an effective noise inflation by a factor $1/\kappa$ due to treatment imbalance. In particular, the rate is governed entirely by the operators $\Sigma$ and $\Sigma_{\mathrm{ref}}$ on $X$; the nuisance space does not affect the bound beyond this effective noise inflation. Consequently, once we plug in the appropriate spectral decay and trace bounds, Theorem 1 directly yields the $L_2$-error and pointwise rates for each of Models 1–3.

Instantiations. For the $L_2$-error, $S_\lambda = \Sigma^{1/2}(\Sigma + \lambda I)^{-1}\Sigma^{1/2}$ and $\mathrm{Tr}(S_\lambda) = \sum_j \frac{\rho_j}{\rho_j + \lambda}$, where $\{\rho_j\}$ are the eigenvalues of $\Sigma$. For point evaluation, $S_\lambda$ is rank-one with $\mathrm{Tr}(S_\lambda) = \|S_\lambda\|_{\mathrm{op}} = \langle \psi(x_0), (\Sigma + \lambda I)^{-1}\psi(x_0)\rangle$ (or with $\phi$ / $\tilde\psi$ under Models 2–3), i.e., the leverage score. In the Sobolev RKHS $\mathcal{H} = H^\gamma(\mathcal{X})$, Lemma 4 yields the bound $\mathrm{Tr}(S_\lambda) \lesssim \lambda^{-d/(2\gamma)}$ under Model 1.

5 Upper Bounds for the $L_2$-Error

In this section, we present minimax-optimal upper bounds on the $L_2(\mathbb{P}_x)$-error of the estimator $\hat h$ produced by our meta-algorithm (Algorithm 1). These results are direct corollaries of Theorem 1 once we plug in the appropriate spectral decay and trace bounds. We provide separate guarantees for the three structural models defined in Section 2.

5.1 Results under Model 1 (Subspace)

We define the kernel integral operator $T_{\mathcal{H}} : L_2(\mathbb{P}_x) \to L_2(\mathbb{P}_x)$ associated with the kernel $K_{\mathcal{H}}$ and the marginal distribution $\mathbb{P}_x$ (Fischer and Steinwart, 2020; Zhang et al., 2023). Concretely,
\[
(T_{\mathcal{H}} f)(x) = \int_{\mathcal{X}} K_{\mathcal{H}}(x, x')\, f(x')\, \mathrm{d}\mathbb{P}_x(x') .
\]
Let $\rho_{\mathcal{H},1} \ge \rho_{\mathcal{H},2} \ge \cdots > 0$ denote the eigenvalues of $T_{\mathcal{H}}$. The decay rate of these eigenvalues characterizes the complexity of the hypothesis space $\mathcal{H}$. For instance, for the Sobolev space $H^\beta(\mathcal{X})$, it is well established that $\rho_{\mathcal{H},j} \asymp j^{-2\beta/d}$ (Zhang et al., 2023). We also assume the kernel is bounded, i.e., $\sup_{x \in \mathcal{X}} K_{\mathcal{H}}(x, x) = O(1)$.

Corollary 1 ($L_2$-Error Bounds for Model 1). Suppose the assumptions in Section 2.3 and Model 1 hold. Then, with probability at least $1 - n^{-10}$, the estimator $\hat h$ from Algorithm 1 with oracle $\mathcal{O}$ for Model 1 and main regularizer $\lambda$ satisfies the following bounds:

1. Polynomial Decay: If $\rho_{\mathcal{H},j} \lesssim j^{-2\ell_{\mathcal{H}}}$ for some $\ell_{\mathcal{H}} > 1/2$, setting $\lambda \asymp (n\kappa)^{-\frac{2\ell_{\mathcal{H}}}{2\ell_{\mathcal{H}}+1}}$ yields $\mathcal{E}_{L_2}(\hat h) \lesssim (n\kappa)^{-\frac{2\ell_{\mathcal{H}}}{2\ell_{\mathcal{H}}+1}}$.
2. Exponential Decay: If $\rho_{\mathcal{H},j} \lesssim \exp(-cj)$ for some $c > 0$, setting $\lambda \asymp (n\kappa)^{-1}$ yields $\mathcal{E}_{L_2}(\hat h) \lesssim \frac{1}{n\kappa}$.
3. Finite Rank: If $\mathrm{rank}(T_{\mathcal{H}}) \le D$, setting $\lambda \asymp (n\kappa)^{-1}$ yields $\mathcal{E}_{L_2}(\hat h) \lesssim \frac{D}{n\kappa}$.

The notation $\lesssim$ omits absolute constants and polylogarithmic factors.

Discussion. We defer the proof to Appendix B.1. Corollary 1 shows that our estimator adapts to the spectral complexity of the contrast space $\mathcal{H}$, effectively decoupling the rate from the (potentially slower) decay of the nuisance space $\mathcal{F}$. Comparing these rates with the lower bounds in Lemma 1, we conclude that the estimator is minimax optimal with respect to both the sample size $n$ and the overlap parameter $\kappa$. Table 1 is a direct corollary of this result.
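For the polynomial-decay case, the rate in Corollary 1 can be read off from Theorem 1 by a standard bias–variance balance. A sketch of the calculation (with $\Sigma_{\mathrm{ref}} = \Sigma$, so that $\|S_\lambda\|_{\mathrm{op}} \le 1$, and with absolute constants suppressed) is
\[
\mathcal{E}_{L_2}(\hat h)
\;\lesssim\; \frac{1}{n\kappa}\sum_{j \ge 1}\frac{\rho_{\mathcal{H},j}}{\rho_{\mathcal{H},j} + \lambda}
\;+\; \lambda^2\,\big\|(\Sigma + \lambda I)^{-1/2}\eta^\star\big\|_X^2 \;+\; \frac{1}{n\kappa}
\;\lesssim\; \frac{\lambda^{-1/(2\ell_{\mathcal{H}})}}{n\kappa} \;+\; \lambda\,\|\eta^\star\|_X^2 \;+\; \frac{1}{n\kappa},
\]
where the trace bound uses $\rho_{\mathcal{H},j} \lesssim j^{-2\ell_{\mathcal{H}}}$ and the bias bound uses $\lambda^2(\rho + \lambda)^{-1} \le \lambda$ spectrally. Balancing $\lambda^{-1/(2\ell_{\mathcal{H}})}/(n\kappa) \asymp \lambda$ gives $\lambda \asymp (n\kappa)^{-2\ell_{\mathcal{H}}/(2\ell_{\mathcal{H}}+1)}$ and hence $\mathcal{E}_{L_2}(\hat h) \lesssim (n\kappa)^{-2\ell_{\mathcal{H}}/(2\ell_{\mathcal{H}}+1)}$, matching case 1 above.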
5.2 Results under Model 2 (Source Condition)

We define the integral operator $T_{\mathcal{F}} : L_2(\mathbb{P}_x) \to L_2(\mathbb{P}_x)$ associated with the kernel $K_{\mathcal{F}}$ and the marginal distribution $\mathbb{P}_x$. Let $\rho_{\mathcal{F},1} \ge \rho_{\mathcal{F},2} \ge \cdots > 0$ denote the eigenvalues of $T_{\mathcal{F}}$.

Corollary 2 ($L_2$-Error Bounds for Model 2). Suppose the assumptions in Section 2.3 hold and $\rho_{\mathcal{F},j} \lesssim j^{-2\ell_{\mathcal{F}}}$ for some $\ell_{\mathcal{F}} > 1/2$. Under Model 2, running Algorithm 1 with oracle $\mathcal{O}$ for Model 2 and main regularizer $\lambda \asymp (n\kappa)^{-\frac{2\ell_{\mathcal{F}}}{1 + 2\ell_{\mathcal{F}}(1+\nu)}}$ yields
\[
\mathcal{E}_{L_2}(\hat h) \lesssim (n\kappa)^{-\frac{2\ell_{\mathcal{F}}(1+\nu)}{1 + 2\ell_{\mathcal{F}}(1+\nu)}},
\]
with probability at least $1 - n^{-10}$. The notation $\lesssim$ omits absolute constants and polylogarithmic factors.

Discussion. We defer the proof to Appendix B.2. Crucially, our result holds for general RKHSs, covering cases such as NTKs where such adaptivity is nontrivial. For the specific case of the NTK on the sphere $\mathbb{S}^{d-1}$, the eigenvalue decay is characterized by $\ell_{\mathcal{F}} = \frac{d+1}{2d}$ (Li et al., 2024). Substituting this into our theorem, we achieve the adaptive rate
\[
\mathcal{E}_{L_2}(\hat h) \lesssim (n\kappa)^{-\frac{(d+1)(1+\nu)}{d + (d+1)(1+\nu)}} .
\]
This is substantially faster than the nuisance NTK learning bound $n^{-\frac{d+1}{2d+1}}$.

5.3 Results under Model 3 (Low-Dimensional Structure)

Finally, we consider the case where $h^\star$ depends on a low-dimensional projection of the covariates. Let $\tilde T$ be the integral operator associated with the kernel on the lower-dimensional space $\tilde{\mathcal{H}}$, defined with respect to the distribution of $\tilde x_i$, with eigenvalues $\tilde\rho_1 \ge \tilde\rho_2 \ge \cdots$.

Corollary 3 ($L_2$-Error Bounds for Model 3). Suppose the assumptions in Section 2.3 and Model 3 hold, and $\tilde\rho_j \lesssim j^{-2\tilde\ell}$ for some $\tilde\ell > 1/2$. Running Algorithm 1 with oracle $\mathcal{O}$ for Model 3 and main regularizer $\lambda \asymp (n\kappa)^{-\frac{2\tilde\ell}{1+2\tilde\ell}}$ yields
\[
\mathcal{E}_{L_2}(\hat h) \lesssim (n\kappa)^{-\frac{2\tilde\ell}{1+2\tilde\ell}}
\]
with probability at least $1 - n^{-10}$. The notation $\lesssim$ omits absolute constants and polylogarithmic factors.

Discussion. We defer the proof to Appendix B.3. For instance, if $\tilde{\mathcal{H}} = H^\beta(\tilde{\mathcal{X}})$ with $\dim(\tilde{\mathcal{X}}) = \tilde d$, the decay rate is $\tilde\ell = \beta/\tilde d$. Consequently, the rate simplifies to $(n\kappa)^{-\frac{2\beta}{\tilde d + 2\beta}}$. This rate depends only on the intrinsic dimension $\tilde d$, effectively treating the problem as if the data were generated in the lower-dimensional space. The complexity of the nuisance functions $f_0^\star, f_1^\star$ (which may depend on the full dimension $d$) does not degrade the convergence rate of $\hat h$.

5.4 Adaptivity Guarantees

We proposed a model selection procedure in Algorithm 2. We now state its adaptivity guarantee and the corresponding oracle inequality. To facilitate the theoretical analysis, we introduce a standard boundedness assumption on the contrast function.

Assumption 5. There exists a known constant $B^\star > 0$ such that $\sup_{x \in \mathcal{X}} |h^\star(x)| \le B^\star$.

This assumption is mild and widely adopted in the classification and regression literature.
Additionally, we focus on the nonparametric regime where the oracle rate satisfies
\[
R^\star := \min_{\bar h \in \bar{\mathcal{M}}} \mathcal{E}_{L_2}(\bar h) \gg \frac{1}{n} .
\]
Equivalently, we focus on the nonparametric regime where $nR^\star \to \infty$. The following theorem establishes that our procedure selects a model that performs as well as the best candidate in the library, up to a negligible error term.

Theorem 2 (Oracle Inequality). Suppose Assumptions 1 to 5 hold. Run Algorithm 2 with $B > B^\star$, and let $L = |\mathcal{M}|$ be polynomial in $n$. Then, with probability at least $1 - n^{-10}$, the selected (truncated) estimator $\bar h_{\mathrm{ms}}$ satisfies
\[
\mathcal{E}_{L_2}(\bar h_{\mathrm{ms}}) \lesssim \min_{\bar h \in \bar{\mathcal{M}}} \mathcal{E}_{L_2}(\bar h) + \frac{\log n}{n\kappa} .
\]
Here, the notation $\lesssim$ omits absolute constants and polylogarithmic factors.

This result confirms that we do not pay a price for the complexity of the nuisance class $\mathcal{F}$; the selection overhead is governed solely by the sample size and the overlap. Note that the overhead term is of parametric order (on the $1/n$ scale), and is therefore mild.

5.5 Connection to Other Work

Closest to our work is the recent study by Kim et al. (2025), which also investigates CATE estimation using a two-stage KRR approach. However, there is a fundamental difference in the problem setup and the nature of the adaptivity. Kim et al. (2025) focus on the "transfer learning" regime where the CATE $h^\star$ resides in the same function space as the nuisance functions ($\mathcal{H} = \mathcal{F}$) but possesses a significantly smaller RKHS norm ($\|h^\star\|_{\mathcal{F}} \ll \|f_0^\star\|_{\mathcal{F}}, \|f_1^\star\|_{\mathcal{F}}$). Consequently, their convergence rate with respect to the sample size $n$ remains governed by the complexity of the ambient space $\mathcal{F}$. In our setting, the $n$-rate itself improves because it is controlled by the lower-complexity space $\mathcal{H}$. Furthermore, our results are fully consistent with the fundamental limits established for Hölder classes studied in Kennedy et al. (2022).

6 Upper Bounds for Point Evaluation

In this section, we establish minimax optimal pointwise error bounds for our estimator.

6.1 Setup and Assumptions

We primarily focus on the setting where $\mathcal{H}$ is a Sobolev space for the purpose of the pointwise evaluation analysis. These results extend naturally to mixed Sobolev spaces. Throughout this section, we assume the density of $x_i$ is bounded above and below by positive constants.

Point Evaluation under Models 1 and 2. We assume that the CATE function $h^\star$ resides in a Sobolev space $\mathcal{H} = H^\gamma(\mathcal{X})$ with smoothness index $\gamma > d/2$. The nuisance functions are assumed to lie in a potentially rougher space $\mathcal{F}$. Note that under this setting, Model 2 (source condition on $\mathcal{F}$) is subsumed by Model 1, as a source condition of order $\nu$ relative to a Sobolev kernel of smoothness $\beta$ implies membership in $H^{\beta(1+\nu)}$. Thus, we analyze these cases jointly assuming $h^\star \in H^\gamma(\mathcal{X})$. Consequently, we restrict our analysis to Model 1 and assume the algorithm operates under Model 1.

Point Evaluation under Model 3. We also focus on Sobolev classes and, for simplicity, consider the case where $\mathcal{F} = H^\beta(\mathcal{X})$ and $\mathcal{H} = H^\beta(\tilde{\mathcal{X}})$ for some $\beta > \frac{d}{2}$. Recall that $\dim(\tilde{\mathcal{X}}) = \tilde d$.

6.2 Main Results

We now present the pointwise evaluation error bounds, which follow from Theorem 1 by taking $\Sigma_{\mathrm{ref}} = \psi(x_0) \otimes \psi(x_0)$ (or $\tilde\psi(\tilde x_0) \otimes \tilde\psi(\tilde x_0)$ under Model 3) and invoking the Sobolev embedding.
Corollary 4 (Point Evaluation: Models 1 & 2). Suppose $\mathcal{H} = H^\gamma(\mathcal{X})$ with $\gamma > d/2$. Under the assumptions of Section 2.3 and Model 1, the estimator $\hat h$ produced by Algorithm 1 with main regularizer $\lambda \asymp \frac{1}{n\kappa}$ satisfies the following pointwise error bound for any fixed $x_0 \in \mathcal{X}$, with probability at least $1 - n^{-10}$:
\[
\mathcal{E}_{x_0}(\hat h) = |\hat h(x_0) - h^\star(x_0)|^2 \lesssim (n\kappa)^{-\frac{2\gamma - d}{2\gamma}} . \tag{6}
\]

Discussion. The rate in eq. (6) matches the minimax optimal rate for pointwise estimation of a function in $H^\gamma(\mathcal{X})$ given $n\kappa$ samples (Tuo and Zou, 2024). Crucially, this bound is independent of the smoothness of the nuisance functions $f_0^\star, f_1^\star$, provided they lie in the ambient RKHS $\mathcal{F}$. The same conclusion holds for mixed Sobolev spaces $\mathcal{H} = H^\gamma_{\mathrm{mix}}(\mathcal{X})$, yielding the rate $(n\kappa)^{-\frac{2\gamma - 1}{2\gamma}}$. By the lower bound in Lemma 1, this rate is also optimal. We defer the proof to Appendix B.4. Next, we address the low-dimensional setting.

Corollary 5 (Point Evaluation: Model 3). Under the assumptions of Section 2.3 and Model 3, suppose $\tilde{\mathcal{H}} = H^\beta(\tilde{\mathcal{X}})$ and $\mathcal{F} = H^\beta(\mathcal{X})$ with $\beta > d/2$. Then the estimator $\hat h$ produced by Algorithm 1 with main regularizer $\lambda \asymp \frac{1}{n\kappa}$ satisfies
\[
\mathcal{E}_{x_0}(\hat h) = |\hat h(x_0) - h^\star(x_0)|^2 \lesssim (n\kappa)^{-\frac{2\beta - \tilde d}{2\beta}} \tag{7}
\]
with probability at least $1 - n^{-10}$. The notation $\lesssim$ omits absolute constants and polylogarithmic factors.

Discussion. This result shows that the rate depends on the intrinsic dimension $\tilde d$ rather than the ambient dimension $d$, formally establishing that our method escapes the curse of dimensionality in nuisance estimation. We defer the proof to Appendix B.5.

7 Numerical Experiments

7.1 Synthetic Data

To validate our theory, we conduct simulation studies on synthetic data. We compare our algorithm ("Ours") to two baselines: (i) a Plug-in KRR method, which estimates $\hat f_1$ and $\hat f_0$ separately and computes $\hat h(x) = \hat f_1(x) - \hat f_0(x)$; and (ii) a Doubly Robust (DR) Learner (Kennedy, 2020), where the nuisance functions and propensity scores are estimated via KRR, and the second-stage KRR is performed on the pseudo-outcome.

Implementation Details and Hyperparameter Tuning. For all synthetic experiments, we use $\bar\lambda = 0.01/n$ in our algorithm and set the noise level to $\sigma = 1$. The regularization candidates are $\Lambda = \{\tfrac{1}{n}\,2^{\,j-1} \mid j = 1, 2, \ldots, 10\}$. In the implementation, these values are passed directly as ridge penalties in KRR. In multivariate settings, kernel length scales are chosen separately for full-feature and subset-feature kernels. For Matérn kernels, $\nu$ controls smoothness and $\ell$ is the length-scale parameter; for RBF kernels, $\ell$ is the length scale. We choose $\ell$ by a median heuristic: $\ell$ is set so that the kernel correlation at the median pairwise distance equals $0.5$, i.e., $k(r_{\mathrm{med}}; \ell) = 0.5$ (a short sketch of this rule follows the method descriptions below).

• Ours: We use the three-split structure ($D_1, D_2, D_3$) in Algorithm 2. In the univariate case, nuisance KRR uses the unanchored Sobolev kernel of order 1, while the second-stage KRR uses the unanchored Sobolev kernel of order 2, so selection is over $\lambda$ only. In multivariate cases, nuisance KRR uses Matérn $(\nu, \ell) = (1.5, 2.6)$, and the CATE stage selects over: Matérn $(\nu, \ell) = (1.5, 2.6)$ and $(2.5, 2.4)$ on all coordinates; Matérn $(1.5, 1.6)$ and $(2.5, 1.5)$ on the first four coordinates; and RBF kernels with $\ell = 2.1$ (all coordinates) and $\ell = 1.3$ (first four coordinates).
• Plug-in baseline: We use the same nuisance kernel as in our method ($K_{\mathcal{F}}$), select $\lambda$ separately for $\hat f_0$ and $\hat f_1$ by 3-fold CV, and refit on the full training sample.
• DR baseline: We use a 2-way split. On the first half, $\hat f_0, \hat f_1, \hat\pi$ are fit by KRR with 3-fold CV using $K_{\mathcal{F}}$. On the second half, the CATE KRR is tuned by 3-fold CV on the DR pseudo-outcome. In the univariate case, the stage-2 kernel is Sobolev of order 2; in multivariate cases, it is Matérn $(\nu, \ell) = (2.5, 2.4)$ for the dense setting and Matérn $(\nu, \ell) = (2.5, 1.5)$ on the first four coordinates for the sparse setting.
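The median-heuristic length-scale rule referenced above is straightforward to implement; the RBF case below has a closed form, while for Matérn kernels the same equation $k(r_{\mathrm{med}}; \ell) = 0.5$ can be solved numerically in $\ell$. The helper name and the pure-NumPy distance computation are illustrative.

```python
import numpy as np

def median_heuristic_rbf(x):
    # Set ell so that the RBF kernel value at the median pairwise distance is 0.5:
    #   exp(-r_med^2 / (2 ell^2)) = 0.5  =>  ell = r_med / sqrt(2 ln 2).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    r_med = np.median(dist[np.triu_indices(len(x), k=1)])
    return r_med / np.sqrt(2.0 * np.log(2.0))
```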
The performance metric is the MSE computed on a held-out test set of 3,000 points. For the univariate case, we average over 100 runs for each $n \in \{500, 1000\}$; for multivariate cases, we average over 100 runs for each $n \in \{1000, 2000\}$.

Univariate Case. In this setting, covariates $x_i$ are drawn uniformly from $[0, 1]$. We specify the nuisance functions to be in a rougher Sobolev space ($H^1$) while the CATE function lies in a smoother space ($H^2$ or higher). Specifically, we set
\[
f_0^\star(x) = 5\big(|x - 0.4| + |x - 0.8|\big), \qquad h^\star(x) = x^2 .
\]
We also set the propensity score to $\pi(x) = \mathrm{Clip}(\sin(5\|x\|_2), 0.1, 0.9)$. The true CATE $h^\star$ is a smooth quadratic function, while the baseline $f_0^\star$ contains non-differentiable points (kinks), placing it in a lower-order Sobolev space. In this univariate setting, we use unanchored Sobolev kernels (order 1 for the nuisance and order 2 for the CATE), so our model selection varies only the regularization parameter $\lambda$. The results for Gaussian noise with $\sigma = 1$ are reported in Table 3. Our method significantly outperforms the baselines by effectively adapting to the simpler structure of $h^\star$.

Table 3: Comparison of MSE over repeated runs for the Univariate Case. Standard errors (SE of the mean) are in parentheses. For $n = 1000$ and $n = 500$, we use 100 runs.

Method | $n = 1000$ | $n = 500$
Ours | 0.0562 (0.0040) | 0.1127 (0.0078)
Plug-in KRR | 0.0825 (0.0035) | 0.1581 (0.0087)
DR-Learner KRR | 0.0872 (0.0146) | 0.9935 (0.6646)

Multivariate Case (Models 1 & 2). We consider a 10-dimensional setting ($d = 10$) where covariates are uniform on $[-1, 1]^{10}$. The nuisance function involves high-frequency components across all dimensions: $f_0^\star(x) = \frac{2}{d}\sum_{j=1}^d \sin(x_j)$. The CATE is a smoother, linear function involving all features:
\[
h^\star(x) = \frac{0.5}{d}\sum_{j=1}^d x_j .
\]
We use the same propensity specification, $\pi(x) = \mathrm{Clip}(\sin(5\|x\|_2), 0.1, 0.9)$, as in the univariate setting. This setup corresponds to a scenario where the CATE has a simpler spectral decay (Model 1) or satisfies a source condition (Model 2) relative to the nuisance. Intuitively, the nuisance functions $f_0^\star, f_1^\star$ are rough and need not lie in a higher-order Sobolev ball with bounded norm, whereas the CATE is smoother and does lie in such a ball. As shown in Tables 4 and 5, our method achieves the lowest estimation error, demonstrating spectral adaptivity.

Multivariate Sparse Case (Model 3). We examine a low-dimensional structure where the ambient dimension is $d = 10$, but the CATE depends only on the first $p = 4$ covariates. The contrast function is quadratic on this subspace:
\[
h^\star(x) = \frac{0.3}{p}\sum_{j=1}^p x_j^2 .
\]
Our algorithm includes kernel candidates for the second-stage $\mathcal{H}$ that are defined on different subsets of variables (e.g., kernels using only the first 4 features as well as kernels using all 10 coordinates). We select among these candidates during model selection. By choosing the kernel that operates on the relevant subspace, our method effectively escapes the curse of dimensionality associated with the ambient dimension $d = 10$, yielding superior performance compared to baselines that regress on the full feature set. In these multivariate experiments, nuisance KRR uses Matérn $(\nu, \ell) = (1.5, 2.6)$ for all methods, while our CATE stage uses the kernel dictionary above and the DR stage uses Matérn $\nu = 2.5$ with $\ell = 1.5$ on the first four coordinates. Results are summarized in Tables 4 and 5.

Table 4: Comparison of MSE over 100 runs ($n = 2{,}000$, $\sigma = 1$) for multivariate cases ($d = 10$).

Method | Dense (Model 1/2, $h^\star$ is linear) | Sparse (Model 3, $h^\star$ is quadratic, $p = 4 < d$)
Ours | 0.0101 (0.0009) | 0.0113 (0.0009)
Plug-in KRR | 0.0295 (0.0010) | 0.0307 (0.0010)
DR-Learner KRR | 0.0403 (0.0143) | 0.0214 (0.0036)

Table 5: Comparison of MSE over 100 runs ($n = 1{,}000$, $\sigma = 1$) for multivariate cases ($d = 10$).

Method | Dense (Model 1/2, $h^\star$ is linear) | Sparse (Model 3, $h^\star$ is quadratic, $p = 4 < d$)
Ours | 0.0137 (0.0013) | 0.0149 (0.0014)
Plug-in KRR | 0.0478 (0.0018) | 0.0491 (0.0018)
DR-Learner KRR | 0.0463 (0.0189) | 0.0367 (0.0090)

7.2 Semi-real Data (RealCause)

We evaluate our method on the lalonde_cps dataset from the RealCause benchmark (Neal et al., 2020). This benchmark generates semi-synthetic counterfactuals based on real-world covariate distributions ($d = 9$), allowing ground-truth evaluation (MSE) under realistic covariate distributions with selection bias. We use 100 realizations, each with sample size $n \approx 16{,}000$, and an 80/20 train/test split. Covariates are rescaled to $[0, 1]^d$, and outcomes are rescaled to $[0, 1]$ (with the CATE scaled accordingly) using training-sample statistics. For all methods, nuisance KRR uses Matérn $(\nu, \ell) = (1.5, 1.3)$ and the regularization grid $\Lambda = \{\tfrac{1}{n}\,2^{\,j-1} \mid j = 1, 2, \ldots, 10\}$. For the CATE stage in Ours and DR, we use Matérn $(\nu, \ell) = (2.5, 1.2)$. Kernel length scales are selected using the same median-based heuristic as in the synthetic experiments. Our method applies 3-way cross-fitting over random KFold rotations and averages the three CATE predictors; the plug-in baseline uses 3-fold CV with refitting; and the DR baseline uses 2-way cross-fitting with 3-fold CV in both the nuisance and stage-2 regressions. As shown in Table 6, our method consistently achieves lower error than the baselines.

Table 6: Average MSE on RealCause (lalonde_cps) over 100 dataset realizations. Standard errors (SE of the mean) are in parentheses.

Method | Mean MSE
Ours | 0.01783 (0.00075)
Plug-in KRR | 0.02178 (0.00074)
DR-Learner KRR | 0.02452 (0.00073)

8 Proof Intuition

In this section, we outline the proof strategy for our main results. Our proof relies on a simple observation: second-stage regression using the pseudo-outcomes $\{(x_i, m_i)\}$ reduces to a regression problem $m_i = h^\star(x_i) + \xi_i$, where $\xi_i$ is mean-zero and sub-Gaussian with variance proxy $O(1/\kappa)$, plus a negligible misspecification term of order $\tilde O(1/n)$. Consequently, the learning rate is governed solely by the complexity of $h^\star$.
Let $M^\star := (h^\star(x_1), \ldots, h^\star(x_n))^\top$ denote the vector of true CATE values, and recall the definition of the pseudo-outcome $m_i$. Formally, we show that the vector of pseudo-outcomes $M = (m_1, \ldots, m_n)^\top$ in Algorithm 1 satisfies
\[
M \;\approx\; M^\star \;+\; \underbrace{\text{Mean-Zero Noise}}_{\text{variance proxy } \approx\, \sigma^2/\kappa} \;+\; \underbrace{\text{Bias}}_{\text{MSE } \approx\, n^{-1}} .
\]
In vector notation, let $y$ and $\varepsilon$ denote the response and noise vectors, and let $X, X_0, X_1$ denote the design operators restricted to the full, control, and treated subsamples, respectively. For example, $X_1$ is the design operator of $\{\phi(x_i)\mathbf{1}(a_i = 1)\}_{i=1}^n$. Then we can write
\[
M = (X_0\hat\theta_1 - X_0\theta_0^\star - \varepsilon_0) + (X_1\theta_1^\star + \varepsilon_1 - X_1\hat\theta_0)
= X(\theta_1^\star - \theta_0^\star) + \varepsilon' + X_0(\hat\theta_1 - \theta_1^\star) - X_1(\hat\theta_0 - \theta_0^\star), \tag{8}
\]
where $\varepsilon'$ has entries $\varepsilon_i(-1)^{a_i+1}$, $\theta_0^\star, \theta_1^\star$ are the Hilbertian elements corresponding to $f_0^\star, f_1^\star$, and $\hat\theta_0, \hat\theta_1$ are the Hilbertian elements corresponding to the first-stage KRR estimators $\hat f_0, \hat f_1$. A precise definition appears in Table 7. Substituting the closed-form KRR solution for $\hat\theta_0, \hat\theta_1$, the error $M - M^\star$ decomposes into
\[
M - M^\star = \varepsilon'
+ \underbrace{X_0(X_1^\top X_1 + n\bar\lambda I)^{-1}X_1^\top \varepsilon_1 - X_1(X_0^\top X_0 + n\bar\lambda I)^{-1}X_0^\top \varepsilon_0}_{\text{(I) Propagated Variance}}
+ \underbrace{n\bar\lambda\, X_1(X_0^\top X_0 + n\bar\lambda I)^{-1}\theta_0^\star - n\bar\lambda\, X_0(X_1^\top X_1 + n\bar\lambda I)^{-1}\theta_1^\star}_{\text{(II) Propagated Bias}} .
\]
Invoking the overlap assumption, we utilize the second-moment concentration inequalities $X_0^\top X_0 \preceq \frac{c}{\kappa}(X_1^\top X_1 + \bar\lambda I)$ and $X_1^\top X_1 \preceq \frac{c}{\kappa}(X_0^\top X_0 + \bar\lambda I)$, which hold with high probability for some constant $c > 1$. We analyze these two terms conditional on the covariates $\{x_i\}$, on the high-probability event where the above concentrations hold.

Analysis of Term (II): Bias via Undersmoothing. The bias term represents the regularization error from the nuisance estimation. Using the second-moment concentration and the positivity condition, we derive
\[
\big\| n\bar\lambda\, X_0(X_1^\top X_1 + n\bar\lambda I)^{-1}\theta_1^\star \big\|_2
\le \sqrt{n\bar\lambda}\,\big\| X_0(X_1^\top X_1 + n\bar\lambda I)^{-\frac{1}{2}} \big\|_{\mathrm{op}} \|\theta_1^\star\|
\lesssim \sqrt{n\bar\lambda} \cdot \frac{1}{\sqrt{\kappa}} \cdot \|\theta_1^\star\|
= \tilde O\big(1/\sqrt{\kappa}\big).
\]
By choosing the undersmoothed parameter $\bar\lambda \asymp \log n / n$, this term becomes $\tilde O(\tfrac{1}{\sqrt{\kappa}})$ in the $\ell_2$-norm. Since this is a length-$n$ vector, its contribution to the MSE is $\tilde O(1/(n\kappa))$, which is negligible compared to the nonparametric estimation error of $h^\star$.

Analysis of Term (I): Self-Normalized Noise. The variance term consists of the intrinsic noise $\varepsilon'$ and the noise propagated from nuisance estimation. Conditional on the covariates $\{x_i\}$, the propagated noise behaves like a sub-Gaussian vector with an inflated variance proxy. Consider the term $v := X_0(X_1^\top X_1 + n\bar\lambda I)^{-1}X_1^\top \varepsilon_1$. On the same high-probability event where the second-moment concentrations hold, this term is sub-Gaussian with a variance proxy bounded by the squared operator norm:
\[
\sigma_v^2 \lesssim \big\| X_0(X_1^\top X_1 + n\bar\lambda I)^{-1}X_1^\top \big\|_{\mathrm{op}}^2 \lesssim 1/\kappa .
\]
This implies that the effective noise level is scaled by $1/\sqrt{\kappa}$. Combined with the small bias term, the pseudo-outcome regression behaves like a clean regression problem with inflated mean-zero noise, yielding the oracle rate associated with $\mathcal{H}$.

9 Discussion
We have established minimax-optimal learning rates for CATE estimation in RKHSs that adapt solely to the structural simplicity of the contrast function (e.g., spectral decay or low-dimensionality). Crucially, our two-stage framework achieves these fast rates without requiring consistent propensity score estimation, effectively decoupling the inference of the treatment effect from the complexity of the nuisance parameters. Important directions for future research include extending these guarantees to misspecified regimes where the nuisance functions lie outside the assumed RKHS, and developing valid uncertainty quantification measures, such as pointwise confidence intervals or uniform confidence bands.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Chen, Y., Liu, F., Suzuki, T., and Cevher, V. (2024). High-dimensional kernel methods under covariate shift: Data-dependent implicit regularization. arXiv preprint arXiv:2406.03171.

Cinelli, C., Feller, A., Imbens, G., Kennedy, E., Magliacane, S., and Zubizarreta, J. (2025). Challenges in statistics: A dozen challenges in causality and causal inference. arXiv preprint arXiv:2508.17099.

Curth, A. and Van der Schaar, M. (2021). Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1810–1818. PMLR.

Fischer, S. and Steinwart, I. (2020). Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(205):1–38.

Foster, D. J. and Syrgkanis, V. (2023). Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908.

Gao, Z. and Han, Y. (2020). Minimax optimal nonparametric estimation of heterogeneous treatment effects. Advances in Neural Information Processing Systems, 33:21751–21762.

Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. (2021). Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054.

Jun, K.-S., Cutkosky, A., and Orabona, F. (2019). Kernel truncated randomized ridge regression: Optimal rates and low noise acceleration. Advances in Neural Information Processing Systems, 32.

Kato, M. and Imaizumi, M. (2023). CATE Lasso: Conditional average treatment effect estimation with high-dimensional linear regression. arXiv preprint arXiv:2310.16819.

Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.

Kennedy, E. H., Balakrishnan, S., Robins, J. M., and Wasserman, L. (2022). Minimax rates for heterogeneous causal effect estimation. arXiv preprint arXiv:2203.00837.

Kim, S.-J., Liu, H., Liu, M., and Wang, K. (2025). Transfer learning of CATE with kernel ridge regression. arXiv preprint arXiv:2502.11331.

Kühn, T., Sickel, W., and Ullrich, T. (2015). Approximation of mixed order Sobolev functions on the d-torus: asymptotics, preasymptotics, and d-dependence. Constructive Approximation, 42(3):353–398.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165.
Ma, C., Pathak, R., and Wainwright, M. J. (2023). Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 51(2):738–761.

Minsker, S. (2017). On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 127:111–119.

Mou, W., Ding, P., Wainwright, M. J., and Bartlett, P. L. (2023). Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency. arXiv preprint arXiv:2301.06240.

Neal, B., Huang, C.-W., and Raghupathi, S. (2020). RealCause: Realistic causal inference benchmarking. arXiv preprint arXiv:2011.15007.

Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319.

Oprescu, M., Syrgkanis, V., and Wu, Z. S. (2019). Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941. PMLR.

Singh, R., Xu, L., and Gretton, A. (2024). Kernel methods for causal functions: dose, heterogeneous and incremental response curves. Biometrika, 111(2):497–516.

Suzuki, T. (2018). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033.

Tuo, R. and Zou, L. (2024). Asymptotic theory for linear functionals of kernel ridge regression. arXiv preprint arXiv:2403.04248.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.

Wang, H. and Kim, J. K. (2023). Statistical inference using regularized M-estimation in the reproducing kernel Hilbert space for handling missing data. Annals of the Institute of Statistical Mathematics, 75(6):911–929.

Wang, K. (2023). Pseudo-labeling for kernel ridge regression under covariate shift. arXiv preprint arXiv:2302.10160.

Zhang, H., Li, Y., Lu, W., and Lin, Q. (2023). On the optimality of misspecified kernel ridge regression. In International Conference on Machine Learning, pages 41331–41353. PMLR.

Zhang, H., Li, Y., Lu, W., and Lin, Q. (2025). Optimal rates of kernel ridge regression under source condition in large dimensions. Journal of Machine Learning Research, 26(219):1–63.

Supplementary Materials

Contents

A Proof of Theorem 1 (General Theory)
  A.1 Notation
  A.2 Good Event
  A.3 Proof
B Proofs of the Main Corollaries from General Theory
  B.1 Proof of Corollary 1 (Model 1, L2-error)
  B.2 Proof of Corollary 2 (Model 2, L2-error)
  B.3 Proof of Corollary 3 (Model 3, L2-error)
  B.4 Proof of Corollary 4 (Models 1–2, point evaluation)
  B.5 Proof of Corollary 5 (Model 3, point evaluation)
C Proof of Theorem 2 (Oracle Inequality)
  C.1 Oracle Inequality for Empirical L2-Error
  C.2 Oracle Inequality for L2-Error
  C.3 Proof of Lemma 2 (Oracle Inequality for Empirical L2-Error)
D Technical Lemmas

A Proof of Theorem 1 (General Theory)

A.1 Notation

We collect additional notation used in the appendix.

Table 7: Common notation used in the appendix proofs.

Symbol       Definition
Second-moment operators
  Σ_F        E[ϕ(x_i)ϕ(x_i)^⊤].
  Σ_{F,0}    E[ϕ(x_i)ϕ(x_i)^⊤ 1(a_i = 0)].
  Σ_{F,1}    E[ϕ(x_i)ϕ(x_i)^⊤ 1(a_i = 1)].
RKHS elements
  θ⋆_0       Hilbertian element corresponding to f⋆_0 in F.
  θ⋆_1       Hilbertian element corresponding to f⋆_1 in F.
Design operators and vectors
  X          design operator of {ϕ(x_i)}_{i=1}^n.
  X_1        design operator of {ϕ(x_i) 1(a_i = 1)}_{i=1}^n.
  X_0        design operator of {ϕ(x_i) 1(a_i = 0)}_{i=1}^n.
  Y_1        vector of {y_i 1(a_i = 1)}_{i=1}^n.
  Y_0        vector of {y_i 1(a_i = 0)}_{i=1}^n.
  ε_1        vector of {ε_i 1(a_i = 1)}_{i=1}^n.
  ε_0        vector of {ε_i 1(a_i = 0)}_{i=1}^n.
  m_i        pseudo-outcome defined by switch imputation in eq. (2).
  M          pseudo-outcome vector (m_i)_{i∈[n]} with decomposition M = X(θ⋆_1 − θ⋆_0) + ε′ + X_0(θ̂_1 − θ⋆_1) − X_1(θ̂_0 − θ⋆_0).
  ε′         vector with entries ε_i(−1)^{a_i+1}, i ∈ [n].

A.2 Good Event

By the overlap (positivity) assumption, we have the dominance relations

    Σ_F ⪯ (1/κ) Σ_{F,0},    Σ_F ⪯ (1/κ) Σ_{F,1}.        (9)

Applying Lemma 5 and Remark 2, for any λ′ ∈ {λ, λ̄} and with probability at least 1 − n^{-11}, there exists an absolute constant c_2 > 1 such that

    (1/c_2)(nΣ_{F,1} + λ′I) ⪯ X_1^⊤X_1 + λ′I ⪯ c_2(nΣ_{F,1} + λ′I),
    (1/c_2)(nΣ_{F,0} + λ′I) ⪯ X_0^⊤X_0 + λ′I ⪯ c_2(nΣ_{F,0} + λ′I),
    (1/c_2)(nΣ_F + λ′I) ⪯ X^⊤X + λ′I ⪯ c_2(nΣ_F + λ′I).        (10)

Moreover, for any λ′ ∈ {λ, λ̄}, with probability at least 1 − n^{-11},

    (1/c)(nΣ + nλ′I) ⪯ W^⊤W + nλ′I ⪯ c(nΣ + nλ′I)        (11)

for some absolute constant c > 1. Let E_0 be the event on which the above concentration inequalities hold. Then P[E_0] ≥ 1 − 3n^{-11}. Combining eq. (9) and eq. (10), on the event E_0 we obtain

    (κ/c)(X_0^⊤X_0 + λ′I) ⪯ X_1^⊤X_1 + λ′I ⪯ (c/κ)(X_0^⊤X_0 + λ′I),
    (κ/c)(X_0^⊤X_0 + λ′I) ⪯ X^⊤X + λ′I ⪯ (c/κ)(X_0^⊤X_0 + λ′I),
    (κ/c)(X_1^⊤X_1 + λ′I) ⪯ X^⊤X + λ′I ⪯ (c/κ)(X_1^⊤X_1 + λ′I)        (12)

for some constant c > 1.

Remark 1 (Nuisance Regularizer Choice for the Sobolev Class). When F = H^β(X), we can choose λ̄ as any value that guarantees the concentrations in eq. (10). By Lemma 6, log n · n^{−2β/d} ≲ λ̄ ≲ log n / n suffices; this range ensures the concentration bounds while preserving the stated upper rates.

A.3 Proof

We work on the event E_0, which controls X_0, X_1, X as defined above.

Error Decomposition. Recall that by eq. (8),

    η̂ = (W^⊤W + nλI)^{-1} W^⊤ [ Wη⋆ + ε′ + X_0(θ̂_1 − θ⋆_1) − X_1(θ̂_0 − θ⋆_0) ].

The estimation error can be written as

    η̂ − η⋆ = (W^⊤W + nλI)^{-1} [ W^⊤X_0(θ̂_1 − θ⋆_1) + W^⊤X_1(θ⋆_0 − θ̂_0) − nλη⋆ ] + (W^⊤W + nλI)^{-1} W^⊤ε′.

Accordingly, the error decomposes as

    ∥Σ_ref^{1/2}(η̂ − η⋆)∥_X ≲ ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_1(θ⋆_0 − θ̂_0)∥_X
                              + ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(θ̂_1 − θ⋆_1)∥_X
                              + ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1} nλη⋆∥_X
                              + ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤ε′∥_X.
We first bound the term

    ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(θ̂_1 − θ⋆_1)∥_X
      = ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1}(X_1^⊤ε_1 − nλ̄θ⋆_1)∥_X
      ≤ ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1}X_1^⊤ε_1∥_X
        + ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1} nλ̄θ⋆_1∥_X.

Bias Term. Under the event E_0, we have

    B_1 := ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1}(nλ̄θ⋆_1)∥_X
         ≲ ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1/2}∥_op · ∥(W^⊤W + nλI)^{-1/2}W^⊤∥_op
             · ∥X_0(X_1^⊤X_1 + nλ̄I)^{-1/2}∥_op · ∥(X_1^⊤X_1 + nλ̄I)^{-1/2}(nλ̄θ⋆_1)∥_F
         ≲(i) ∥Σ_ref^{1/2}(nΣ + nλI)^{-1/2}∥_op · 1 · (1/√κ) · √(nλ̄) ∥θ⋆_1∥_F
         ≲ (1/√n) ∥S_λ∥_op^{1/2} · (1/√κ) · √(nλ̄) ∥θ⋆_1∥_F
         ≲ (1/√κ) ∥S_λ∥_op^{1/2} √λ̄ ∥θ⋆_1∥_F
         ≲ (√(log n) / √(nκ)) ∥S_λ∥_op^{1/2} ∥θ⋆_1∥_F.

In step (i), we used the definition of the good event E_0 (specifically eq. (12) and eq. (11)).

Variance Term. By the Hanson–Wright inequality, on the event E_0 we obtain the following upper bound with probability at least 1 − n^{-11}:

    V_1² := ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1}X_1^⊤ε_1∥_X²
          ≲ log n · Tr( Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_0(X_1^⊤X_1 + nλ̄I)^{-1}X_1^⊤X_1(X_1^⊤X_1 + nλ̄I)^{-1}X_0^⊤W(W^⊤W + nλI)^{-1}Σ_ref^{1/2} )
          ≲ log n · Tr( Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X(X_1^⊤X_1 + nλ̄I)^{-1}X^⊤W(W^⊤W + nλI)^{-1}Σ_ref^{1/2} )
          ≲(i) (log n / κ) · Tr( Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X(X^⊤X + nλ̄I)^{-1}X^⊤W(W^⊤W + nλI)^{-1}Σ_ref^{1/2} )
          ≲ (log n / κ) · Tr( Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤W(W^⊤W + nλI)^{-1}Σ_ref^{1/2} )
          ≲ (log n / κ) · Tr( Σ_ref^{1/2}(W^⊤W + nλI)^{-1}Σ_ref^{1/2} )
          ≲(ii) (log n / (nκ)) · Tr( Σ_ref(Σ + λI)^{-1} ) = (log n / (nκ)) Tr(S_λ).

Here, in step (i) we used eq. (12), and in step (ii) we used eq. (11). The remaining propagated term is bounded similarly:

    ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1}W^⊤X_1(θ⋆_0 − θ̂_0)∥_X² ≲ (log n / (nκ)) Tr(S_λ) + (log n / (nκ)) ∥θ⋆_0∥_F².

Ridge Bias Term. Finally, we bound the ridge bias term:

    ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1} nλη⋆∥_X
      ≲ ∥Σ_ref^{1/2}(W^⊤W + nλI)^{-1/2}∥_op · ∥(W^⊤W + nλI)^{-1/2} nλη⋆∥_X
      ≲(i) ∥Σ_ref^{1/2}(nΣ + nλI)^{-1/2}∥_op · ∥(nΣ + nλI)^{-1/2} nλη⋆∥_X
      ≲ (1/√n) ∥S_λ∥_op^{1/2} · √n λ · ∥(Σ + λI)^{-1/2}η⋆∥_X
      = λ ∥S_λ∥_op^{1/2} ∥(Σ + λI)^{-1/2}η⋆∥_X,

where in step (i) we used eq. (11). Combining the variance and bias bounds above with the decomposition yields the stated bound in Theorem 1.

B Proofs of the Main Corollaries from General Theory

We now show how each main corollary follows from Theorem 1 by specializing Σ_ref, Σ, and X, bounding ∥S_λ∥_op and Tr(S_λ), and choosing λ. Polylogarithmic factors are absorbed into ≲. Throughout this section, w_i, X, Σ, and Σ_ref are as defined in Section 4 and Table 2.

B.1 Proof of Corollary 1 (Model 1, L2-error)

Set Σ_ref = Σ_H and Σ = Σ_H, where Σ_H = E[ψ(x) ⊗ ψ(x)]. Let Σ_H = Σ_{j≥1} ρ_{H,j} u_j ⊗ u_j be the eigendecomposition. Then S_λ = Σ_H^{1/2}(Σ_H + λI)^{-1}Σ_H^{1/2} has eigenvalues ρ_{H,j}/(ρ_{H,j} + λ), so

    ∥S_λ∥_op = max_j ρ_{H,j}/(ρ_{H,j} + λ) ≤ 1,    Tr(S_λ) = Σ_j ρ_{H,j}/(ρ_{H,j} + λ).

For the ridge bias, note that λ² ∥(Σ_H + λI)^{-1/2}η⋆∥_H² ≤ λ ∥η⋆∥_H², so the bias term is O(λ). Plugging these into Theorem 1 yields

    E_{L2}(ĥ) ≲ (1/(nκ)) Σ_j ρ_{H,j}/(ρ_{H,j} + λ) + λ.
If ρ_{H,j} ≲ j^{-2ℓ_H}, the effective-dimension bound Σ_j ρ_{H,j}/(ρ_{H,j} + λ) ≲ λ^{-1/(2ℓ_H)} holds (Zhang et al., 2023; Ma et al., 2023), and choosing λ ≍ (nκ)^{-2ℓ_H/(2ℓ_H + 1)} gives the rate in Corollary 1. The finite-rank case follows from the bound Σ_j ρ_{H,j}/(ρ_{H,j} + λ) ≤ D.

B.2 Proof of Corollary 2 (Model 2, L2-error)

Set Σ_ref = Σ_F and Σ = Σ_F, where Σ_F = E[ϕ(x) ⊗ ϕ(x)]. Write Σ_F = Σ_{j≥1} ρ_{F,j} u_j ⊗ u_j. Then ∥S_λ∥_op ≤ 1 and Tr(S_λ) = Σ_j ρ_{F,j}/(ρ_{F,j} + λ). For the ridge bias, we use the source condition in its spectral form: there exists γ ∈ F such that γ = Σ_F^{−ν/2} η⋆. Since Σ_ref = Σ_F, we can sharpen the generic bound by tracking the Σ_F factor:

    λ² ∥Σ_F^{1/2}(Σ_F + λI)^{-1} η⋆∥² = λ² ∥Σ_F^{(1+ν)/2}(Σ_F + λI)^{-1} γ∥² ≲ λ² ∥(Σ_F + λI)^{-(1−ν)/2} γ∥² ≲ λ^{1+ν}.

Thus the ridge bias term is O(λ^{1+ν}) under the source condition. Therefore,

    E_{L2}(ĥ) ≲ (1/(nκ)) Σ_j ρ_{F,j}/(ρ_{F,j} + λ) + λ^{1+ν}.

If ρ_{F,j} ≲ j^{-2ℓ_F}, then Σ_j ρ_{F,j}/(ρ_{F,j} + λ) ≲ λ^{-1/(2ℓ_F)} (Ma et al., 2023), and choosing λ ≍ (nκ)^{-2ℓ_F/(1 + 2ℓ_F(1+ν))} yields the rate in Corollary 2.

B.3 Proof of Corollary 3 (Model 3, L2-error)

Set Σ_ref = Σ_H̃ and Σ = Σ_H̃, where Σ_H̃ = E[ψ̃(x) ⊗ ψ̃(x)], and write Σ_H̃ = Σ_{j≥1} ρ̃_j ũ_j ⊗ ũ_j. Then ∥S_λ∥_op ≤ 1 and Tr(S_λ) = Σ_j ρ̃_j/(ρ̃_j + λ). As in Model 1, the ridge bias satisfies λ² ∥(Σ_H̃ + λI)^{-1/2} η̃⋆∥² ≤ λ ∥η̃⋆∥_H̃², so it is O(λ) when ∥η̃⋆∥_H̃ is bounded. Hence,

    E_{L2}(ĥ) ≲ (1/(nκ)) Σ_j ρ̃_j/(ρ̃_j + λ) + λ.

If ρ̃_j ≲ j^{-2ℓ̃}, then Σ_j ρ̃_j/(ρ̃_j + λ) ≲ λ^{-1/(2ℓ̃)} (Ma et al., 2023), and choosing λ ≍ (nκ)^{-2ℓ̃/(1 + 2ℓ̃)} gives the rate in Corollary 3.

B.4 Proof of Corollary 4 (Models 1–2, point evaluation)

For a fixed x_0, let Σ_ref = ψ(x_0) ⊗ ψ(x_0) and Σ = Σ_H, where Σ_H = E[ψ(x) ⊗ ψ(x)]; then

    S_λ = [(Σ + λI)^{-1/2} ψ(x_0)] ⊗ [(Σ + λI)^{-1/2} ψ(x_0)]

is rank-one with ∥S_λ∥_op = Tr(S_λ) = Q_λ² := ⟨ψ(x_0), (Σ + λI)^{-1} ψ(x_0)⟩, the standard leverage score. Theorem 1 yields

    E_{x_0}(ĥ) ≲ (1/(nκ) + λ) Q_λ².

When H = H^γ(X) with γ > d/2, Lemma 4 (Sobolev embedding / leverage-score control) gives Q_λ² ≲ λ^{-d/(2γ)} (Tuo and Zou, 2024). Taking λ ≍ (nκ)^{-1} yields E_{x_0}(ĥ) ≲ (nκ)^{-(2γ−d)/(2γ)}, as in Corollary 4.

B.5 Proof of Corollary 5 (Model 3, point evaluation)

Let Σ_ref = ψ̃(x̃_0) ⊗ ψ̃(x̃_0) and Σ = Σ_H̃, where Σ_H̃ = E[ψ̃(x) ⊗ ψ̃(x)], so that

    S_λ = [(Σ_H̃ + λI)^{-1/2} ψ̃(x̃_0)] ⊗ [(Σ_H̃ + λI)^{-1/2} ψ̃(x̃_0)]

and ∥S_λ∥_op = Tr(S_λ) = Q̃_λ² := ⟨ψ̃(x̃_0), (Σ_H̃ + λI)^{-1} ψ̃(x̃_0)⟩. Then

    E_{x_0}(ĥ) ≲ (1/(nκ) + λ) Q̃_λ².

If H̃ = H^β(X̃) with β > d̃/2, the Sobolev embedding argument of Lemma 4 yields Q̃_λ² ≲ λ^{-d̃/(2β)} (Tuo and Zou, 2024). Choosing λ ≍ (nκ)^{-1} gives E_{x_0}(ĥ) ≲ (nκ)^{-(2β−d̃)/(2β)}, matching Corollary 5.
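For intuition on how the eigenvalue decay and the source condition translate into concrete rates in the corollary proofs above, the short Python snippet below evaluates the regularization level and the resulting L2-error scale. It is an illustrative computation, not part of the proofs: the function names and the example parameter values are ours, while the exponent formulas are exactly those derived above (Model 3 follows Model 1 with ℓ̃ in place of ℓ_H).

def model1_error_scale(ell_H, n_kappa):
    # Model 1: contrast eigenvalues rho_j ~ j^{-2*ell_H}.
    # Balancing Tr(S_lam)/(n*kappa) ~ lam^{-1/(2*ell_H)}/(n*kappa) against the bias lam
    # gives lam ~ (n*kappa)^{-2*ell_H/(2*ell_H + 1)}, and the error scales like lam.
    lam = n_kappa ** (-2 * ell_H / (2 * ell_H + 1))
    return lam

def model2_error_scale(ell_F, nu, n_kappa):
    # Model 2: nuisance-kernel decay ell_F plus a source condition of order nu.
    # lam ~ (n*kappa)^{-2*ell_F/(1 + 2*ell_F*(1 + nu))}, and the error scales like lam^{1+nu}.
    lam = n_kappa ** (-2 * ell_F / (1 + 2 * ell_F * (1 + nu)))
    return lam ** (1 + nu)

# Example: n*kappa = 10^4 effective samples; a smooth contrast (ell_H = 2)
# versus a rough nuisance kernel (ell_F = 1) with source order nu = 0.5.
print(model1_error_scale(2.0, 1e4))       # about 6e-4
print(model2_error_scale(1.0, 0.5, 1e4))  # about 1e-3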
C Proof of Theorem 2 (Oracle Inequality)

C.1 Oracle Inequality for Empirical L2-Error

Let h̄_{a⋆} be the candidate in M̄ with the smallest population L2-error. Formally, conditional on D_2, define the empirical L2-error on D_3 by

    E_in(h) := (1/n_3) Σ_{i=1}^{n_3} ( h(x_{3i}) − m̃_{3i} )²,

which is precisely the empirical objective minimized in Algorithm 2. We first state an oracle inequality for the empirical L2-error.

Lemma 2 (Oracle inequality for the empirical L2-error).

    E_in(h̄_ms) ≲ min_{h ∈ M̄} E_in(h) + log n/(nκ) ≲ E_in(h̄_{a⋆}) + log n/(nκ).

We set E_{L2}(h̄_{a⋆}) := R⋆. Next, using Bernstein's inequality, with probability at least 1 − n^{-11} we have

    E_in(h̄_{a⋆}) ≲ R⋆ + √R⋆ · B log n/√n + B² log n/n ≲ R⋆ + log n/(nκ).

Thus, we obtain the relation E_in(h̄_ms) ≤ c_1(R⋆ + log n/(nκ)) for some constant c_1 > 0. Additionally, set r⋆ := (2c_1 + c_2) R⋆ + (2c_1 + c_2) log n/(nκ) for a sufficiently large c_2 > 1.

C.2 Oracle Inequality for L2-Error

For the selected model h̄_ms, define E_{L2}(h̄_ms) := R̄. For any h̄_i ∈ M̄ with E_{L2}(h̄_i) > r⋆, define the event

    F_i := { E_{L2}(h̄_i) > r⋆ } ∩ { E_in(h̄_i) ≤ c_1(R⋆ + log n/(nκ)) }

and set

    F = ∪_{h̄_i ∈ M̄ : E_{L2}(h̄_i) > r⋆} F_i.

For sufficiently large c_2, Bernstein's inequality for bounded variables (conditional on D_1 and D_2) implies that each F_i has probability at most n^{-12}: when E_{L2}(h̄_i) > r⋆, the gap E_{L2}(h̄_i) − E_in(h̄_i) must be at least of order R⋆ + log n/(nκ), while the empirical fluctuation of E_in(h̄_i) is at most of order B√(log n/n) + B² log n/n. Since candidates are truncated to [−B, B], this fluctuation is dominated by the gap. A union bound over the L = poly(n) candidates yields

    P[F] ≤ Σ_i P[F_i] ≤ n^{-11}.

On the complement of F, every candidate with E_{L2}(h̄_i) > r⋆ has empirical risk larger than c_1(R⋆ + log n/(nκ)), whereas h̄_ms satisfies the opposite inequality by the previous display. Therefore, with probability at least 1 − n^{-11}, we obtain R̄ ≤ r⋆. This completes the proof.
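The selection rule analyzed in Sections C.1–C.2 reduces to comparing candidates by their empirical squared error against the pseudo-outcomes on a held-out fold. The snippet below is a simplified, assumed rendering of that model-selection step (the paper's Algorithm 2); the function name and signature are ours. By the argument above, the selected candidate attains the best candidate's population error up to a constant factor and an additive term of order log n/(nκ).

import numpy as np

def select_candidate(candidates, X3, m3):
    # candidates: fitted CATE estimators h(x), e.g. trained with different
    # contrast spaces H and regularization levels lambda on earlier folds.
    # X3, m3: held-out covariates and pseudo-outcomes (fold D_3 above).
    # Returns the candidate minimizing the empirical objective E_in.
    risks = [np.mean((h(X3) - m3) ** 2) for h in candidates]
    return candidates[int(np.argmin(risks))]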
C.3 Proof of Lemma 2 (Oracle Inequality for Empirical L2-Error)

Let M_3 := (m̃_{3i})_{i=1}^{n_3} and M⋆_3 := (h⋆(x_{3i}))_{i=1}^{n_3}, and let θ̃_a denote the Hilbertian element corresponding to f̃_a for a ∈ {0, 1}. Then

    M_3 = (X_{3,0} θ̃_1 − X_{3,0} θ⋆_0 − ε_0) + (X_{3,1} θ⋆_1 + ε_1 − X_{3,1} θ̃_0)
        = X_3(θ⋆_1 − θ⋆_0) + ε′ + X_{3,0}(θ̃_1 − θ⋆_1) − X_{3,1}(θ̃_0 − θ⋆_0).        (13)

Thus,

    M_3 − M⋆_3 = ε′ + X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{2,1}^⊤ε_1 − X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1}X_{2,0}^⊤ε_0        [=: V_ms]
               + nλ̄ X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1}θ⋆_0 − nλ̄ X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}θ⋆_1.                  [=: B_ms]

Good Event. Applying Lemma 5 and Remark 2, for any λ′ ∈ {λ, λ̄} and with probability at least 1 − n^{-11}, the following bounds hold for some absolute constant c_2 > 1:

    (1/c_2)(nΣ_{F,1} + λ′I) ⪯ X_{2,1}^⊤X_{2,1} + λ′I ⪯ c_2(nΣ_{F,1} + λ′I),
    (1/c_2)(nΣ_{F,0} + λ′I) ⪯ X_{2,0}^⊤X_{2,0} + λ′I ⪯ c_2(nΣ_{F,0} + λ′I),
    (1/c_2)(nΣ_F + λ′I) ⪯ X_2^⊤X_2 + λ′I ⪯ c_2(nΣ_F + λ′I),        (14)

    (1/c_2)(nΣ_{F,1} + λ′I) ⪯ X_{3,1}^⊤X_{3,1} + λ′I ⪯ c_2(nΣ_{F,1} + λ′I),
    (1/c_2)(nΣ_{F,0} + λ′I) ⪯ X_{3,0}^⊤X_{3,0} + λ′I ⪯ c_2(nΣ_{F,0} + λ′I),
    (1/c_2)(nΣ_F + λ′I) ⪯ X_3^⊤X_3 + λ′I ⪯ c_2(nΣ_F + λ′I).        (15)

We define E_ms as the event on which the above concentration inequalities hold, so that P[E_ms] ≥ 1 − 6n^{-11}. Combining eqs. (9), (14) and (15), under the event E_ms it follows that

    (κ/c)(X_{2,0}^⊤X_{2,0} + λ′I) ⪯ X_{3,1}^⊤X_{3,1} + λ′I ⪯ (c/κ)(X_{2,0}^⊤X_{2,0} + λ′I),
    (κ/c)(X_{3,0}^⊤X_{3,0} + λ′I) ⪯ X_{2,1}^⊤X_{2,1} + λ′I ⪯ (c/κ)(X_{3,0}^⊤X_{3,0} + λ′I)        (16)

for some constant c > 1.

To apply Lemma 3, it suffices to bound ∥M_3 − E[M_3]∥_{ψ_2} = ∥V_ms∥_{ψ_2}, as well as ∥E[M_3] − M⋆_3∥_2 = ∥B_ms∥_2.

Bounding ∥V_ms∥_{ψ_2}. Observe that

    ∥V_ms∥_{ψ_2} ≲ 1 + ∥X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{2,1}^⊤∥_op + ∥X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1}X_{2,0}^⊤∥_op.

We bound one term; the other is analogous:

    ∥X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{2,1}^⊤∥_op²
      ≲ ∥X_{2,1}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{3,0}^⊤X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{2,1}^⊤∥_op
      ≲(i) (1/κ) ∥X_{2,1}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}X_{2,1}^⊤∥_op ≲ 1/κ,

where step (i) holds by eq. (16). Hence ∥V_ms∥_{ψ_2} ≲ 1/√κ.

Bounding ∥B_ms∥_2. We have

    ∥B_ms∥_2 ≲ ∥nλ̄ X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1}θ⋆_0∥_2 + ∥nλ̄ X_{3,0}(X_{2,1}^⊤X_{2,1} + nλ̄I)^{-1}θ⋆_1∥_2.

We bound the first term as

    ∥nλ̄ X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1}θ⋆_0∥_2
      ≲ nλ̄ · ∥X_{3,1}(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1/2}∥_op · ∥(X_{2,0}^⊤X_{2,0} + nλ̄I)^{-1/2}θ⋆_0∥
      ≲(i) (1/√κ) √(nλ̄) ≲ √(log n / κ),

where step (i) holds by eq. (16). Conditioning on {x_{2i}, a_{2i}} and {x_{3i}, a_{3i}}, we can apply Lemma 3, which yields the desired in-sample oracle inequality.

Lemma 3 (Theorem 5.2 from Wang 2023). Let {z_i}_{i=1}^n be deterministic elements in a set Z, let g⋆ and {g_j}_{j=1}^m be deterministic functions on Z, and let g̃ be a random function on Z. Define

    L(g) = (1/n) Σ_{i=1}^n |g(z_i) − g⋆(z_i)|²

for any function g on Z. Assume that the random vector ỹ = (g̃(z_1), g̃(z_2), . . . , g̃(z_n))^⊤ satisfies ∥ỹ − Eỹ∥_{ψ_2} ≤ V < ∞. Choose any

    ĵ ∈ argmin_{j ∈ [m]} { (1/n) Σ_{i=1}^n |g_j(z_i) − g̃(z_i)|² }.

There exists a universal constant C such that for any δ ∈ (0, 1], with probability at least 1 − δ we have

    L(g_ĵ) ≤ inf_{γ>0} { (1 + γ) min_{j ∈ [m]} L(g_j) + C(1 + γ^{-1}) ( L(Eg̃) + V² log(m/δ)/n ) }.

Consequently,

    E L(g_ĵ) ≤ inf_{γ>0} { (1 + γ) min_{j ∈ [m]} L(g_j) + C(1 + γ^{-1}) ( L(Eg̃) + V² (1 + log m)/n ) }.

D Technical Lemmas

Lemma 4 (Key Inequality for Point Evaluation: Sobolev Space). Let K be a Sobolev kernel on H^β(X) with feature map ϕ(·), and let Σ = ∫_{x∈X} ϕ(x)ϕ(x)^⊤ dx. Suppose X ⊂ R^d. For any x_0 ∈ X and any λ > 0, we have

    ϕ(x_0)^⊤(Σ + λI)^{-1}ϕ(x_0) ≲ λ^{-d/(2β)}.

Hence, ϕ(x_0)ϕ(x_0)^⊤ ≲ λ^{-d/(2β)} (Σ + λI).

Proof. Let Q = ϕ(x_0)^⊤(Σ + λI)^{-1}ϕ(x_0); our goal is to find an upper bound for Q. Let θ ∈ H be defined as θ = (Σ + λI)^{-1}ϕ(x_0). The quantity Q can be written as an inner product in the RKHS H:

    Q = ⟨ϕ(x_0), (Σ + λI)^{-1}ϕ(x_0)⟩_H = ⟨ϕ(x_0), θ⟩_H.

From the definition of θ, we have (Σ + λI)θ = ϕ(x_0). Taking the inner product with θ on both sides gives

    ⟨(Σ + λI)θ, θ⟩_H = ⟨ϕ(x_0), θ⟩_H,

which simplifies to

    ⟨Σθ, θ⟩_H + λ⟨θ, θ⟩_H = Q.

Let g_θ(x) = ⟨ϕ(x), θ⟩_H be the function in the Sobolev space H^β(X) corresponding to θ ∈ H. By the reproducing property, g_θ(x_0) = ⟨ϕ(x_0), θ⟩_H = Q. The norms of g_θ can be related to the expression for Q:

1. The squared H^β norm is ∥g_θ∥²_{H^β(X)} = ∥θ∥²_H = ⟨θ, θ⟩_H.
2. The squared L² norm is

    ∥g_θ∥²_{L²(X)} = ∫_X |g_θ(x)|² dx = ∫_X ⟨ϕ(x), θ⟩²_H dx = ⟨(∫_X ϕ(x)ϕ(x)^⊤ dx) θ, θ⟩_H = ⟨Σθ, θ⟩_H.

Substituting these into the expression for Q, we get

    ∥g_θ∥²_{L²(X)} + λ ∥g_θ∥²_{H^β(X)} = Q.

Since both norm terms are non-negative, we can establish bounds for each:

    ∥g_θ∥²_{L²(X)} ≤ Q  ⟹  ∥g_θ∥_{L²(X)} ≤ √Q,
    λ ∥g_θ∥²_{H^β(X)} ≤ Q  ⟹  ∥g_θ∥_{H^β(X)} ≤ √(Q/λ).

Now apply the interpolation inequality to g_θ. Let q = d/(2β). Then

    Q = g_θ(x_0) ≤ ∥g_θ∥_{L^∞(X)} ≤ C ∥g_θ∥^{1−q}_{L²(X)} ∥g_θ∥^q_{H^β(X)}.

Substituting the bounds we found for the norms,

    Q ≤ C (√Q)^{1−q} (√(Q/λ))^q,

and hence Q^{1/2} ≤ C λ^{−q/2}, which establishes the first inequality:

    ϕ(x_0)^⊤(Σ + λI)^{-1}ϕ(x_0) ≲ λ^{-d/(2β)}.

For the second inequality, let v ∈ H. Then

    ⟨ϕ(x_0)ϕ(x_0)^⊤ v, v⟩_H = |⟨ϕ(x_0), v⟩_H|² = |⟨(Σ + λI)^{-1/2}ϕ(x_0), (Σ + λI)^{1/2}v⟩_H|²
      ≤ ϕ(x_0)^⊤(Σ + λI)^{-1}ϕ(x_0) · ⟨(Σ + λI)v, v⟩_H ≲ λ^{-d/(2β)} ⟨(Σ + λI)v, v⟩_H.

Since this holds for all v, we conclude ϕ(x_0)ϕ(x_0)^⊤ ≲ λ^{-d/(2β)}(Σ + λI).

Lemma 5 (Corollary E.1 from Wang 2023). Let {x_i}_{i=1}^n be i.i.d. random elements in a separable Hilbert space H with Σ := E(x_i ⊗ x_i) trace class. Define Σ̂ = (1/n) Σ_{i=1}^n x_i ⊗ x_i. Choose any constant γ ∈ (0, 1) and define the event

    A = { (1 − γ)(Σ + λI) ⪯ Σ̂ + λI ⪯ (1 + γ)(Σ + λI) }.

1. If ∥x_i∥_H ≤ ξ holds almost surely for some constant ξ, then there exists a constant C ≥ 1 determined by γ such that P(A) ≥ 1 − δ holds so long as δ ∈ (0, 1/14] and λ ≥ Cξ log(n/δ)/n. Hence, with probability at least 1 − δ, we have

    (1/C_ξ)(Σ̂ + log(n/δ) I) ⪯ Σ + log(n/δ) I ⪯ C_ξ(Σ̂ + log(n/δ) I)

for a constant C_ξ depending only on ξ.

Remark 2 (A useful observation). For any two trace-class operators S_1, S_2, suppose that S_1 ⪯ c_1(S_2 + c_2 log n · I) for some constants c_1 > 1, c_2 > 0. Then, for any c_3 > 0 and c = c_1 max(1, c_2/c_3), we have S_1 ⪯ c(S_2 + c_3 log n · I).

The following lemma is closely related to results in Tuo and Zou (2024); we include a self-contained proof for completeness.

Lemma 6 (Concentration of Second Moments: Sobolev). Let K be a Sobolev kernel for H^β(X), where X ⊂ R^d, and let H be the corresponding RKHS with feature map ϕ(·). Let {x_i}_{i=1}^n be i.i.d. with Σ = E(ϕ(x_i) ⊗ ϕ(x_i)) trace class, and define Σ̂ = (1/n) Σ_{i=1}^n ϕ(x_i) ⊗ ϕ(x_i). Fix a failure probability δ ∈ (0, 1). If n is sufficiently large, there exists an absolute constant c > 1 such that with probability at least 1 − δ,

    (1/c)(Σ + log(n/δ) n^{-2β/d} I) ⪯ Σ̂ + log(n/δ) n^{-2β/d} I ⪯ c(Σ + log(n/δ) n^{-2β/d} I).

Proof. To prove the spectral equivalence, it suffices to show that

    ∥(Σ + λI)^{-1/2}(Σ̂ − Σ)(Σ + λI)^{-1/2}∥_op ≤ 1/2

for λ ≍ log(n/δ) n^{-2β/d}. We verify this via the matrix Bernstein inequality (Minsker, 2017) applied to the sum of independent zero-mean random operators Σ_{i=1}^n B_i, where

    B_i := (1/n)(Σ + λI)^{-1/2}(ϕ(x_i)ϕ(x_i)^⊤ − Σ)(Σ + λI)^{-1/2},

so that (Σ + λI)^{-1/2}(Σ̂ − Σ)(Σ + λI)^{-1/2} = Σ_{i=1}^n B_i. We bound the operator norm and variance of B_i.

1. Operator norm bound. Let C_i = (Σ + λI)^{-1/2}ϕ(x_i)ϕ(x_i)^⊤(Σ + λI)^{-1/2}. By Lemma 4, the effective dimension (or leverage score) is bounded as

    N_∞(λ) := sup_{x∈X} ∥(Σ + λI)^{-1/2}ϕ(x)∥² ≲ λ^{-d/(2β)}.
Thus, ∥C_i∥_op = tr(C_i) ≲ λ^{-d/(2β)}. Since ∥B_i∥_op ≲ (1/n) max(∥C_i∥_op, ∥E[C_i]∥_op), we have ∥B_i∥_op ≲ (1/n) λ^{-d/(2β)}.

2. Variance bound. We bound the second moment. Using E[B_i²] ⪯ (1/n²) E[C_i²] (the variance is bounded by the second moment), we have

    E[C_i²] = E[ (Σ + λI)^{-1/2}ϕ(x_i)ϕ(x_i)^⊤(Σ + λI)^{-1}ϕ(x_i)ϕ(x_i)^⊤(Σ + λI)^{-1/2} ]
            = E[ (ϕ(x_i)^⊤(Σ + λI)^{-1}ϕ(x_i)) · (Σ + λI)^{-1/2}ϕ(x_i)ϕ(x_i)^⊤(Σ + λI)^{-1/2} ]
            ⪯ N_∞(λ) · E[ (Σ + λI)^{-1/2}ϕ(x_i)ϕ(x_i)^⊤(Σ + λI)^{-1/2} ]
            = N_∞(λ) · (Σ + λI)^{-1/2} Σ (Σ + λI)^{-1/2},

where the scalar factor ϕ(x_i)^⊤(Σ + λI)^{-1}ϕ(x_i) is bounded by N_∞(λ). Since Σ ⪯ Σ + λI, the operator factor is bounded by I. Therefore, the variance parameter σ² for the Bernstein inequality satisfies

    ∥ Σ_{i=1}^n E[B_i²] ∥_op ≤ n · (1/n²) · N_∞(λ) ≲ (1/n) λ^{-d/(2β)}.

3. Applying the Bernstein inequality. With λ = c_0 log(n/δ) n^{-2β/d}, we have N_∞(λ) ≲ n/(c_0 log(n/δ)). Applying the matrix Bernstein inequality yields, with high probability,

    ∥(Σ + λI)^{-1/2}(Σ̂ − Σ)(Σ + λI)^{-1/2}∥_op ≲ √( N_∞(λ) log(n/δ)/n ) + N_∞(λ) log(n/δ)/n ≲ 1/√c_0 + 1/c_0 ≤ 1/2,

after choosing c_0 sufficiently large. This implies

    −(1/2) I ⪯ (Σ + λI)^{-1/2}(Σ̂ − Σ)(Σ + λI)^{-1/2} ⪯ (1/2) I,

which rearranges to the stated result with c ≈ 2 (specifically, (1/2)(Σ + λI) ⪯ Σ̂ + λI ⪯ (3/2)(Σ + λI)).

Lemma 7 (Lemma E.1 from Wang 2023). Suppose that x ∈ R^d is a zero-mean random vector with ∥x∥_{ψ_2} ≤ 1. There exists a universal constant C > 0 such that for any symmetric and positive semi-definite matrix Σ ∈ R^{d×d},

    P( x^⊤Σx ≤ C Tr(Σ) t ) ≥ 1 − e^{−r(Σ) t},    ∀ t ≥ 1.

Here r(Σ) = Tr(Σ)/∥Σ∥_2 is the effective rank of Σ.
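As a quick numerical sanity check (not part of the proofs), the λ^{-d/(2β)} scaling of the effective dimension used in Lemmas 4 and 6 can be observed on a synthetic spectrum with Sobolev-type decay ρ_j ≍ j^{-2β/d}. The eigenvalue model, truncation level, and parameter values below are illustrative assumptions.

import numpy as np

def effective_dimension(lam, beta, d, J=200_000):
    # Synthetic Sobolev-type spectrum rho_j ~ j^{-2*beta/d}; the effective
    # dimension sum_j rho_j/(rho_j + lam) should scale like lam^{-d/(2*beta)}.
    j = np.arange(1, J + 1)
    rho = j ** (-2.0 * beta / d)
    return np.sum(rho / (rho + lam))

beta, d = 2.0, 1.0
for lam in [1e-2, 1e-3, 1e-4]:
    ratio = effective_dimension(lam, beta, d) / lam ** (-d / (2 * beta))
    print(f"lam={lam:.0e}, N(lam) * lam^(d/(2*beta)) = {ratio:.2f}")  # roughly constant across lam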
