Causal Inference with High-Dimensional Treatments

Patrick Kramer (1), Edward H. Kennedy (1), Isaac M. Opper (2)
(1) Department of Statistics & Data Science, Carnegie Mellon University
(2) Luskin School of Public Affairs, University of California, Los Angeles
pkramer@andrew.cmu.edu, edward@stat.cmu.edu, iopper@rand.org

Abstract

In this work, we consider causal inference in various high-dimensional treatment settings, including for single multi-valued treatments and vector treatments with binary or continuous components, when the number of treatments can be comparable to or even larger than the number of observations. These settings bring unique challenges: first, the treatment effects of interest are a high-dimensional vector rather than a low-dimensional scalar; second, positivity violations are often unavoidable; and third, estimation can be based on a smaller effective sample size. We first discuss fundamental limits of estimating effects here, showing that consistent estimation is impossible without further assumptions. We go on to propose a novel sparse pseudo-outcome regression framework for arbitrary high-dimensional statistical functionals, which includes generic constrained regression estimators and error guarantees. We use the framework to derive new doubly robust estimators for mean potential outcomes of high-dimensional treatments, though it can also be applied to other scenarios. We analyze the proposed estimators under exact and approximate sparsity assumptions, giving finite-sample risk bounds. Finally, we derive minimax lower bounds to characterize optimal rates of convergence and show our risk bounds are unimprovable.

Keywords: causal inference, influence function, lasso, minimax rate, sparsity.

1 Introduction

Many causal inference problems involve a larger number of treatments relative to the number of observations, which we refer to as a high-dimensional treatment. For example, high-dimensional treatments naturally occur in medicine, e.g., for cancer radiation therapy, where the treatment consists of different possible locations and intensities (Nabi et al. [2022]), or when comparing different healthcare providers (Susmann et al. [2024]). Other examples of high-dimensional treatments include genetic variants and environmental exposures (Mitra et al. [2022]), text and image embeddings (Feder et al. [2022]), marketing campaigns (Sharma et al. [2020]), or schools in a dataset containing information about student performance. We can study the effects of such high-dimensional treatments by estimating the mean potential outcomes for each possible treatment level, i.e., the outcome that we would observe if we intervened and applied this particular treatment level to all units. Comparing these mean potential outcomes across different treatment levels enables a comparison of the effectiveness of these treatments, for example, allowing for the identification of the best possible treatment option.

There are unique challenges when estimating the effects of high-dimensional treatments, where the number of treatments $k$ is large relative to the sample size $n$. These include an increasingly complex target parameter, inevitable positivity violations when the treatment levels are discrete, and potentially smaller effective sample sizes due to fewer observations at each treatment level.
For these reasons, consistent estimation and fast convergence rates are impossible without further structure when $k$ is large in proportion to $n$. This naturally suggests a sparsity-based approach to enable efficient estimation, in particular, using estimators that have the flavor of a Lasso or best subset selection estimator [Chatterjee, 2014, Hastie et al., 2020, Rigollet and Hütter, 2023, Tibshirani, 1996, Wainwright, 2019].

1.1 Our Contributions

Our work addresses the fundamental challenges of high-dimensional treatments by proposing a framework for efficient estimation of mean potential outcomes in various high-dimensional treatment settings, in combination with provable theoretical guarantees and optimality results. In particular, this paper presents the following main contributions:

• In Section 3, we illustrate fundamental limits of estimating mean potential outcomes in the high-dimensional treatment regime when no further structure is assumed, which motivates our developed theory for efficient estimation under additional sparsity.

• In Section 4, we propose a novel sparse pseudo-outcome regression framework for estimation of general high-dimensional statistical functionals. We propose efficient estimators and derive fast convergence rates in Theorem 1 under sparsity assumptions. This unified framework and the presented general estimators can be used for a variety of high-dimensional target parameters, including mean potential outcomes for several types of high-dimensional treatments.

• In Sections 5 and 6, we apply the proposed pseudo-outcome regression framework to estimate mean potential outcomes for high-dimensional single multi-valued treatments and vector treatments, respectively, under a sparse model. We present efficient doubly robust estimators and provide error guarantees in Theorems 2 and 3 under sparsity.

• In Section 7, we prove minimax optimality of our proposed estimators for estimating mean potential outcomes in the single multi-valued treatment regime. More specifically, we present the minimax risk under a sparse and structure-agnostic model.

• Section 8 concludes with a discussion of the presented results and provides possible directions for future work.

1.2 Related Work

Efficient estimation of mean potential outcomes has been extensively studied in the literature for high-dimensional covariates (Belloni et al. [2014], Farrell [2015], D'Amour et al. [2017], Athey et al. [2018], Bradic et al. [2019], Smucler et al. [2019], Wang and Shah [2020], Jiang et al. [2022], Chernozhukov et al. [2022], Liu et al. [2023], Zeng et al. [2024]), as well as for high-dimensional outcomes (Du et al. [2025]) and continuous treatments (see Díaz and van der Laan [2013], Kennedy et al. [2017], Bonvini and Kennedy [2022], Schindl et al. [2024] and references therein). In comparison, high-dimensional treatments have received less attention.

Some recent work on high-dimensional treatments has focused on dimension reduction using parametric models [Andreu et al., 2024, Goplerud et al., 2022, Lin et al., 2025, Nabi et al., 2022]. The idea is to find a lower-dimensional representation of the high-dimensional treatment that preserves the causal relationship. Another approach in the literature is a plug-in regression estimator based on a deep neural network, as proposed by Sharma et al. [2020].
While these approaches address high-dimensionality and are validated with experiments, we aim to contribute an alternative sparsity-based approach that maintains the full treatment structure, achieves provable theoretical guarantees, and is even minimax optimal.

Susmann et al. [2024] study treatment-group-specific effects, including both direct and indirect standardization parameters, motivated by the application of provider profiling. They allow for multiple treatment levels and derive doubly robust efficient estimators for the target parameters in this setting. In another recent paper, Xiang et al. [2025] consider the estimation of mean potential outcomes in various treatment regimes that allow for multiple treatment levels, including single multi-valued and vector treatments. They propose doubly robust estimators and show asymptotic normality. In both papers, the estimators are analyzed under the regime in which the number of treatments $k$ is treated as fixed and $n \to \infty$. However, such error guarantees are unrealistic for high-dimensional treatments where $k$ is large relative to $n$. In our work, we first demonstrate that accounting for the dependence of error bounds on $k$ is crucial in this setting, by showing that estimating mean potential outcomes exhibits a slow $k/n$ rate in mean squared error. Ignoring this $k$-dependence obscures actual error rates and can severely overstate accuracy. In our paper, the number of treatments $k$ and the sample size $n$ can be any two numbers; in particular, we require no specific asymptotics, and $k$ need not be proportional to or scale with $n$. We present finite-sample error guarantees for our proposed estimators, which depend on $n$ and $k$ and are valid for any such values, revealing how a large $k$ worsens error guarantees.

Finally, we note that our proposed pseudo-outcome regression framework is similar in spirit to the procedure suggested by Foster and Syrgkanis [2019], who present an empirical risk minimization framework for general loss functions that involve nuisances. Their presented results are general and work for a wide range of loss functions. However, in our high-dimensional treatment setting, it is helpful to exploit the specific treatment structure more explicitly, for several reasons. First, strong positivity, which would be required to obtain sufficiently fast rates from Foster and Syrgkanis [2019], is necessarily violated in the high-dimensional treatment case. Second, by exploiting additional structure, we obtain a doubly robust nuisance error term, rather than an $L_4$-error that is not doubly robust; double robustness is an essential feature of our proposed estimators. Third, the error rates presented here are derived using a simpler proof technique which may be of independent interest. In particular, although we analyze a constrained version of our regression estimators, our proofs can be straightforwardly adapted to penalized versions (whereas the proof of Example 3 in Foster and Syrgkanis [2019] relies on bounding the critical radius of the function class of high-dimensional linear predictors with a sparsity constraint).

2 Setup & Notation

We consider an independent and identically distributed (iid) sample $(Z_1, \ldots, Z_n)$ where $Z = (X, A, Y) \sim \mathbb{P}$ for covariates $X \in \mathbb{R}^d$, treatment $A \in \mathcal{A}$, and outcome $Y \in \mathbb{R}$.
The set $\mathcal{A}$ is a generic set for now and is defined explicitly once we consider the specific treatment settings introduced in Section 3, such as single multi-valued treatments or vector treatments. Let

$$\pi_a(x) = \mathbb{P}(A = a \mid X = x), \qquad \mu_a(x) = \mathbb{E}(Y \mid X = x, A = a)$$

denote the treatment probabilities and outcome regression, respectively. Throughout we assume the positivity condition that $\pi_a(x) > 0$ for all $a \in \mathcal{A}$ (see Section 3.2 for discussion). Further, let $\widehat{\varpi}_a = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(A_i = a)$ be the empirical proportions of the treatment, and, for discrete $\mathcal{A}$, let $\varpi_a = \mathbb{P}(A = a)$ be the treatment probabilities. Our goal is to study estimation of the treatment-specific mean

$$\int \mu_a(x)\, d\mathbb{P}(x) \tag{1}$$

across many treatments $a \in \mathcal{A}$, in particular, when the treatment is high-dimensional (which will be made rigorous in Section 3). Letting $Y^a$ denote the potential outcome under treatment $A = a$, the quantity (1) also equals the counterfactual mean $\mathbb{E}(Y^a)$ if we make the additional causal assumptions of consistency ($Y = Y^A$) and no unmeasured confounding ($A \perp\!\!\!\perp Y^a \mid X$). However, all our results apply to the statistical quantity defined in (1), regardless of whether the causal assumptions hold.

We use $\|v\|_1 = \sum_j |v_j|$ and $\|v\|_0 = \sum_j \mathbb{1}(v_j \neq 0)$ for the $\ell_1$- and $\ell_0$-norms of vectors $v$. For a function $f$ we let $\|f\|_{\mathbb{P},2}^2 = \int f(z)^2\, d\mathbb{P}(z)$ denote the squared $L_2(\mathbb{P})$ norm. We write sample averages with the empirical distribution shorthand $\mathbb{P}_n(f) = \frac{1}{n}\sum_{i=1}^n f(Z_i)$. When we use sample splitting with separate folds $\mathcal{D}_s$, we denote the empirical measure over $\mathcal{D}_s$ by $\mathbb{P}_n^s$.

3 High-Dimensional Treatments

In this section, we present the various types of high-dimensional treatments considered in this paper. We characterize each of the different treatment settings, highlight their unique challenges, compare them with one another, and finally display the fundamental limits of how well one can possibly estimate mean potential outcomes in a high-dimensional regime.

3.1 Types of High-Dimensional Treatments

In this subsection, we distinguish between two structurally different types of high-dimensional treatments for which we want to estimate mean potential outcomes. We rely on a dichotomy distinguishing single multi-valued and vector treatments, which was also discussed, for example, in Xiang et al. [2025].

Single Multi-Valued Treatments are treatments that can take values from a set of unordered discrete values. More specifically, $A \in \{1, \ldots, k\}$, i.e., $A$ can be one of $k$ possible treatment levels. Note that it does not matter how we denote the elements of the set of possible treatment levels since the treatments are unordered. For example, the set of possible values can be $k$ different medical treatments denoted by $1, \ldots, k$, and if patient $i$ received treatment number $j$, we observe $A_i = j$.

A vector treatment, also called a multiple treatment in Xiang et al. [2025], is a vector of different treatments, i.e., a combination of multiple individual treatments. More specifically, $A = (A_1, \ldots, A_k)$ for individual treatments $A_j$. Depending on which values each individual treatment can take, we further distinguish vector treatments into two categories: binary and continuous vector treatments.
Binary Vector Treatments. We call $A$ a binary vector treatment, or also a binary multiple treatment, if each individual treatment of the treatment combination is binary, i.e., $A = (A_1, \ldots, A_k)$ and $A_1, \ldots, A_k \in \{0, 1\}$. In other words, $A \in \{0, 1\}^k$. Intuitively, this means that for each unit we observe a combination indicating which of the treatments $A_1, \ldots, A_k$ was received by the unit. Note that this can be seen as a special case of single multi-valued treatments that allows for modeling additional structure compared to the single multi-valued treatment case.

Continuous Vector Treatments. We call $A$ a continuous vector treatment, or also a continuous multiple treatment or just a continuous treatment, if $A \in \mathbb{R}^k$ is a continuous random variable. Intuitively, this means that for each unit we observe a combination of treatment doses.

Remark 1. We note a subtlety in our notation. While $A_1, \ldots, A_k$ refer to the treatment components of an observed treatment vector $A$, the term $A_i$ can also refer to the $i$-th of the $n$ observations of the treatment variable. The meaning of the subindex will be clear from the context. Moreover, the $j$-th treatment component of observation $i$ is denoted by $A_{i,j}$. Throughout, we also refer to discrete versus continuous treatments at times. By discrete treatments we refer to either the single multi-valued or binary vector treatment setting, whereas continuous treatments refer to continuous vector treatments.

In this paper, for both single multi-valued and vector treatments, our results allow the number of treatments $k$ and the sample size $n$ to be any two values, enabling the analysis of arbitrary treatment dimensions. In particular, we require no specific asymptotics, and, although covered by our theory, $k$ does not have to scale with $n$. In this work, by high-dimensional treatment, we refer to the setting where the number of treatment levels $k$ can be large relative to $n$ (without relying on any particular asymptotic scaling).

3.2 Positivity Violation

In standard settings with $A$ binary, a common assumption for identification and estimation of mean potential outcomes is strong positivity, i.e., $\mathbb{P}(A = 1 \mid X) \ge \varepsilon > 0$ with probability one for some $\varepsilon$ that is independent of the sample size $n$. This assumption says that each unit has a non-trivial chance of receiving the treatment. It may be untenable for high-dimensional treatments, particularly those that are discrete, where near-violations of positivity are inevitable; that is, strong positivity must necessarily be violated for most treatment levels.

By near-violation of positivity, we refer to the situation where weak positivity $\pi_a(X) > 0$ is satisfied (so that the mean potential outcomes can still be identified), but $\pi_a(X)$ cannot be lower bounded by a strictly positive constant that is independent of $n$ and $k$, i.e., strong positivity is violated. Intuitively, this describes the situation where propensity scores are near zero due to a large number of treatments; however, they are never exactly zero. We note that it is possible to extend our results to address even violations of weak positivity, by handling different target parameters, such as incremental effects [Kennedy, 2019, Schindl et al., 2024].

Note here a fundamental difference between discrete and continuous high-dimensional treatments.
For discrete high-dimensional treatments, where $k = k(n)$ is an unbounded sequence in $n$, near-violation of positivity is unavoidable. However, near-positivity violations need not occur for continuous treatments. Even with large $k$ (the number of individual treatments of the continuous treatment vector), it may be reasonable to assume that the density $\pi_a(x)$ can be bounded away from zero by a constant independent of $n$ and $k$.

Furthermore, there is an important distinction between single multi-valued and vector treatments to be made. For single multi-valued treatments, even if the treatment is uniformly distributed, we only have roughly $n/k$ observations at each level as a consequence of the near-violations of positivity (for instance, with $n = 10{,}000$ observations spread uniformly over $k = 1{,}000$ levels, each level is observed only about 10 times), whereas, for vector treatments, we have an effective sample size of $n$. This implies that, when estimating mean potential outcomes, we can tolerate smaller sample sizes in comparison to the number of treatments for vector treatments than for single multi-valued treatments (for details refer to Sections 5 and 6).

These inevitable near-violations of positivity make estimation of mean potential outcomes challenging. It is possible to show that, without any further structural assumptions, estimation cannot be done with an accuracy better than $k/n$ (in a minimax sense), which is problematic especially when the number of treatments $k$ is large in comparison to the sample size $n$. For details on this and rigorous minimax lower bounds, we refer to Appendix A.1.

4 Pseudo-Outcome Regression on High-Dimensional Treatments

In this section, we propose a general sparse pseudo-outcome regression framework which can be used to obtain fast convergence rates for the estimation of an arbitrary high-dimensional statistical functional $\psi \in \mathbb{R}^k$ associated with level-specific outcomes $f(Z; a)$ for some (fixed) function $f$, where the levels $a$ are given by the possible values of the observed random variable $A$, i.e., $a \in \mathcal{A} = \operatorname{supp}(A)$. We propose constrained Lasso and best subset selection estimators and provide error guarantees under sparsity, showing that efficient estimation is achievable even in high-dimensional regimes.

4.1 Motivation

Although this regression framework applies to arbitrary statistical functionals $\psi$ associated with treatment-level-specific outcomes, it is best motivated by the example of estimating mean potential outcomes. In this case, we relate $\psi$ to these mean potential outcomes for each treatment level, e.g., $\mathbb{E}(Y^a) = \psi_0 + \psi_a$ for a single multi-valued treatment $a \in \{1, \ldots, k\}$, or $\mathbb{E}(Y^a) = \psi_0 + a^T\psi$ for a vector treatment $a = (a_1, \ldots, a_k)$, for some intercept $\psi_0$.

To achieve faster convergence rates beyond the slow $k/n$ rate when estimating mean potential outcomes (see Propositions 1 and 2 in Appendix A.1), we need to assume additional structure. A natural option is to introduce a sparse model for the mean potential outcomes, i.e., to assume that $\psi$ is sparse. Sparsity is a very common assumption in high-dimensional regression when the number of features $d$ is large relative to $n$, e.g., as discussed by Wainwright [2019], Rigollet and Hütter [2023], and many others. For high-dimensional linear regression, without any further assumptions, the minimax risk is $d/n$, where $d$ is the number of features, which is problematic when $d$ is large relative to $n$.
Then, to overcome this slow rate, sparsity of the coefficient vector is assumed, and faster rates that depend on $d$ only logarithmically can be derived. The $k/n$ minimax rate of Propositions 1 and 2 in the high-dimensional treatment regime can be viewed as an analogue of the high-dimensional regression minimax rate $d/n$. Therefore, inspired by high-dimensional regression, we hope for a logarithmic dependence on $k$ once we assume the mean potential outcomes to be sparse.

While we could analyze each treatment setting individually, we propose a general sparse pseudo-outcome regression framework that is more broadly applicable and, in particular, can be applied to the special cases of estimating mean potential outcomes in the single and vector treatment settings. The application to single treatments is the subject of Section 5 and to vector treatments of Section 6. Although these are our primary applications of the proposed framework, it is essential to note that it can be applied to many more high-dimensional target parameters, such as incremental effects.

Our proposed framework can be viewed as a two-stage procedure using sample splitting. In the first stage, on the first fold of the data, we estimate a so-called pseudo-outcome (that represents our estimation target). With the term pseudo-outcome, we refer to a function of the observed data that acts as a replacement for the actual (unobserved) outcome and, in particular, is equal to it in expectation. Resorting to a pseudo-outcome is necessary since the actual outcome is often unobserved, e.g., potential outcomes. In the example of estimating $\mathbb{E}(Y^a)$, a suitable pseudo-outcome would be the uncentered efficient influence function of this parameter. Often, such a pseudo-outcome depends on unknown nuisance parameters, such as regression functions or the propensity score. Consequently, we must use an estimated version of the pseudo-outcome, such as the estimated uncentered influence function obtained by plugging in estimates of the nuisance parameters. Throughout this section, as the choice of the pseudo-outcome depends on the estimation target, we take the estimated pseudo-outcomes as given and are agnostic about how they are obtained (for an explicit construction refer to Sections 5 and 6). In the second stage of our procedure, we use these estimated pseudo-outcomes in a sparse linear least squares regression on the second fold of the data.

Two-stage procedures with estimated pseudo-outcomes using sample splitting, which allow for structure-agnostic estimation, have been studied in the literature before. Our proposed framework is of a similar spirit to the two-stage procedures proposed by Kennedy [2023] for CATE estimation, where estimated pseudo-outcomes are regressed on covariates, or Foster and Syrgkanis [2019], who present a general empirical risk minimization framework where the loss function can depend on nuisance parameters; both these frameworks allow for structure-agnostic estimation in both stages. Furthermore, in previous lines of work, including Ai and Chen [2003] and Rubin and van der Laan [2005], regression with general pseudo-outcomes has been studied, however, without the use of sample splitting, requiring restrictions on the estimation procedures in the first or second stage.
In our framework, we are agnostic about the first-stage procedure, but explicitly set the second-stage procedure to be a constrained linear least squares regression. In the following subsection, we introduce the sparse pseudo-outcome regression in its most general form, allowing for arbitrary estimated pseudo-outcomes and their related high-dimensional statistical functional $\psi$.

4.2 Proposed Estimators

Let $\psi \in \mathbb{R}^k$ denote any high-dimensional statistical functional throughout this section, which is our estimation target. Further, let $\mathcal{A} = \operatorname{supp}(A)$.

Let $\widehat{\varphi}_1(Z)$ and $\widehat{\varphi}_{2,a}(Z)$ be generic estimated pseudo-outcomes for each $a \in \mathcal{A}$. These could, for instance, be obtained by constructing pseudo-outcomes $\varphi_1(Z; \eta)$ and $\varphi_{2,a}(Z; \eta)$ (e.g., based on the uncentered efficient influence function of the target parameter) that depend on nuisance parameters $\eta$, and defining the estimated pseudo-outcomes as the plug-in estimates $\widehat{\varphi}_1(Z) = \varphi_1(Z; \widehat{\eta})$ and $\widehat{\varphi}_{2,a}(Z) = \varphi_{2,a}(Z; \widehat{\eta})$ for a given estimate $\widehat{\eta}$ of the nuisance parameters. We use sample splitting and assume that the estimation of the functions $\widehat{\varphi}_1$ and $\widehat{\varphi}_{2,a}$ is done on a separate independent fold $\mathcal{D}_1$ of size $n$; in particular, independent of our sample $(Z_1, \ldots, Z_n)$, which we denote by $\mathcal{D}_2$.

Furthermore, let $V_a \in \mathbb{R}^k$ denote deterministic vectors for all $a \in \mathcal{A}$. We regress the estimated pseudo-outcomes on these predictors. For now, we let $V_a$ be generic predictors and do not specify them explicitly. The explicit form of $V_a$ is determined by the model assumption that connects $\psi$ to the level-specific outcome.

Given the previous notation and setup, we now present our proposed estimators.

Definition 1 (Best subset selection and Lasso estimator). Define the empirical risk

$$\widehat{R}(\beta) = \mathbb{P}_n^2\left\{-2\int_{\mathcal{A}} \widehat{\varphi}_a(Z)\, V_a^T\beta\, d\mathbb{P}_n^1(a) + (V_A^T\beta)^2\right\}. \tag{2}$$

Then, define the Lasso estimator

$$\widehat{\psi}_{\mathrm{lasso}} = \underset{\|\beta\|_1 \le s}{\arg\min}\ \widehat{R}(\beta) \tag{3}$$

and the best subset selection estimator

$$\widehat{\psi}_{\mathrm{subset}} = \underset{\|\beta\|_0 \le s}{\arg\min}\ \widehat{R}(\beta) \tag{4}$$

as the minimizers of this empirical risk subject to an $\ell_1$- and $\ell_0$-constraint, respectively.

By imposing an $\ell_0$- or $\ell_1$-constraint on the coefficient vector in this regression, we aim to leverage sparsity of the vector $\psi$ and achieve fast convergence rates even when $k$ is large in proportion to $n$. Note that we could switch to a penalized (instead of constrained) regression in the second stage while maintaining the same theoretical guarantees. For a discussion of constrained versus penalized Lasso and best subset selection approaches, we refer to Wainwright [2019] or Hastie et al. [2020].
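Computing these estimators is straightforward, since the empirical risk (2) is a quadratic form in $\beta$: writing $b = \mathbb{P}_n^2\{\int_{\mathcal{A}} \widehat{\varphi}_a(Z)\, V_a\, d\mathbb{P}_n^1(a)\}$ and $Q = \mathbb{P}_n^2(V_A V_A^T)$, we have $\widehat{R}(\beta) = -2b^T\beta + \beta^T Q\beta$. The following is a minimal sketch of the $\ell_1$-constrained (Lasso) estimator (3) in Python, assuming $b$ and $Q$ have already been assembled as arrays; the function names and the projected-gradient solver are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

def project_l1_ball(v, s):
    """Euclidean projection of v onto {beta : ||beta||_1 <= s}
    (sorting-based algorithm of Duchi et al., 2008)."""
    u = np.abs(v)
    if u.sum() <= s:
        return v
    w = np.sort(u)[::-1]              # |v| sorted in decreasing order
    css = np.cumsum(w)
    idx = np.arange(1, len(w) + 1)
    rho = np.nonzero(w - (css - s) / idx > 0)[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)
    return np.sign(v) * np.maximum(u - theta, 0.0)

def constrained_lasso(b, Q, s, n_iter=5000):
    """Lasso estimator (3): minimize R(beta) = -2 b^T beta + beta^T Q beta
    over the L1 ball ||beta||_1 <= s, via projected gradient descent.
    b : length-k array, P_n^2 { int phi_hat_a(Z) V_a dP_n^1(a) }.
    Q : (k, k) array,  P_n^2 ( V_A V_A^T )."""
    beta = np.zeros(len(b))
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q).max() + 1e-12)  # 1/Lipschitz
    for _ in range(n_iter):
        grad = 2.0 * (Q @ beta - b)   # gradient of the quadratic risk
        beta = project_l1_ball(beta - step * grad, s)
    return beta
```

The best subset selection estimator (4) can be approximated analogously by replacing the projection with hard thresholding to the $s$ largest coordinates in absolute value (iterative hard thresholding); exact $\ell_0$-constrained minimization is combinatorial in general.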
In the following subsection, we analyze the proposed estimators under sparsity, providing error guarantees that validate the proposed pseudo-outcome regression framework.

4.3 Master Theorem

In this section, we provide an error guarantee for our proposed estimators under sparsity. So far, we have put no restrictions on how the given pseudo-outcomes are obtained. For $\widehat{\psi}_{\mathrm{lasso}}$ and $\widehat{\psi}_{\mathrm{subset}}$ to be accurate estimators, we require that the pseudo-outcomes are appropriately related to the estimation target $\psi$. Therefore, we state an error guarantee for our estimators that depends on individual error rates characterizing the quality of our chosen pseudo-outcomes.

Such an error guarantee can be derived under both exact sparsity, i.e., $\|\psi\|_0 \le s$, and approximate sparsity, i.e., $\|\psi\|_1 \le s$, for some sparsity constraint $s$. In the following, we provide this error guarantee under exact sparsity. For details on the analysis under approximate sparsity, we refer to Appendix A.3.1.

Exact sparsity $\|\psi\|_0 \le s$ means that at most $s$ entries of the vector $\psi$ are non-zero. The following theorem shows how this sparsity is leveraged by the proposed estimators.

Theorem 1 (Error rate under exact sparsity). Let $\widehat{\psi} \in \{\widehat{\psi}_{\mathrm{lasso}}, \widehat{\psi}_{\mathrm{subset}}\}$ be either the Lasso estimator defined in (3) or the best subset selection estimator defined in (4) minimizing the empirical risk in (2). Suppose exact sparsity, i.e., $\|\psi\|_0 \le s$ for some sparsity constraint $s$. Moreover, assume for all $j, l \in \{1, \ldots, k\}$ and $a \in \mathcal{A}$ that

(i) (Almost sure boundedness) $\left|\int_{\mathcal{A}} \widehat{\varphi}_a(Z)\, V_{a,j}\, d\mathbb{P}_n^1(a)\right|,\ |V_A^T\psi\, V_{A,j}|,\ |V_{A,j} V_{A,l}| \le C_1 r_1(k)$ almost surely,

(ii) (Second moment boundedness) $\mathbb{P}\left[\left\{\int_{\mathcal{A}} \widehat{\varphi}_a(Z)\, V_{a,j}\, d\mathbb{P}_n^1(a)\right\}^2\right],\ \left\{\int_{\mathcal{A}} V_a^T\psi\, V_{a,j}\, d\mathbb{P}_n^1(a)\right\}^2 \le C_1 \widehat{r}_1(k)$ and $\mathbb{P}\left\{(V_A^T\psi\, V_{A,j})^2\right\},\ \mathbb{P}(V_{A,j}^2 V_{A,l}^2) \le C_1 r_1(k)$ almost surely, where $\widehat{r}_1(k)$ can depend on $\mathcal{D}_1$ and $\mathbb{E}_{\mathcal{D}_1}\{\widehat{r}_1(k)^p\} \le R\, r_1(k)^p$ for $p = 1, 2$,

(iii) (Nuisance estimation error rate) $\left|\mathbb{P}\left\{\widehat{\varphi}_a(Z) - V_a^T\psi\right\}\right| \le C_2\, r_{2,a}(n, k)$,

(iv) (Restricted eigenvalue condition) $C_3 \|v\|_2^2 \le v^T\Sigma v$ for $\Sigma = \mathbb{E}(V_A V_A^T)$, and

(v) (Sample covariance condition) $\sup_{\|v\|_2 = 1,\, \|v\|_1 \le 2\sqrt{s}} \left|v^T\widehat{\Sigma}v - v^T\Sigma v\right| \le C_3/2$ with probability at least $1 - \alpha$ for $\alpha \le C_4 \frac{\log k}{n}$, where $\widehat{\Sigma} = \frac{1}{n}\mathbb{A}^T\mathbb{A}$ and $\mathbb{A} = (V_{A_1}, \ldots, V_{A_n})^T$,

for rates $r_1 = r_1(k) \gtrsim 1$ and $r_2 = r_2(n, k)$, and constants $R, C_1, C_2, C_3, C_4 > 0$ that are independent of $n$ and $k$. Then,

$$\mathbb{E}\left[\left\{V_A^T(\widehat{\psi} - \psi)\right\}^2\right] \lesssim \frac{s\, r_1(k)\log k}{n} + s\, \mathbb{E}\left[\max_{j \in \{1,\ldots,k\}}\left\{\int_{\mathcal{A}} r_{2,a}(n, k)\, |V_{a,j}|\, d\mathbb{P}_n^1(a)\right\}^2\right]$$

for $\max\{s^4/r_1(k)^2,\, r_1(k)\}\log k \lesssim n$.

The preceding theorem states an error rate that consists of two parts: the oracle rate $\frac{s\, r_1(k)\log k}{n}$ and the error from estimating the pseudo-outcomes, given by the second summand. Intuitively, the oracle rate is the error that we would have if we had access to the oracle outcomes and could use those in the regression. Most importantly, this part of the rate depends only on $\log k$ instead of $k$ by leveraging the sparsity constraint on the vector $\psi$, which is desirable especially when $k$ is large in proportion to $n$. The pseudo-outcome estimation error rate is the price we pay for not observing the actual pseudo-outcomes and having to estimate them. Specifically, this rate corresponds to the additional error that arises from the distance of $\widehat{\varphi}_1(Z)$ and $\widehat{\varphi}_{2,a}(Z)$ to the outcome $V_a^T\psi$. If these estimated pseudo-outcomes are defined as plug-in estimates $\varphi_1(Z; \widehat{\eta})$ and $\varphi_{2,a}(Z; \widehat{\eta})$, then this error term corresponds to the error arising from nuisance estimation.

In this paper, the main application of the previous theorem is the estimation of the functionals $\int \mu_a(x)\, d\mathbb{P}(x)$ (which equal $\mathbb{E}(Y^a)$ under the usual identification assumptions) for different high-dimensional treatment settings. In Section 5, we apply the theorem to the setting of a single treatment that can take potentially many discrete unordered values.
In Section 6, we consider its application to the setting of vector treatments. Error rates of our proposed estimators therein can then simply be obtained by verifying the conditions of Theorem 1. Finally, the estimators proposed above can be shown to yield minimax optimal error rates when tailored to the setting of estimating mean potential outcomes in the single multi-valued treatment regime; Section 7 presents these minimax results.

We proceed with remarks that discuss the assumptions of the previous theorem in more detail and provide guidance on the choice of a suitable pseudo-outcome and the tuning parameter $s$ in (3) and (4).

Remark 2 (Discussion of the assumptions). Assumptions (i) and (ii) are mild since the rate $r_1(k)$ can be of any order. Since $r_1(k)$ contributes to the oracle part of the overall rate, one should aim to verify these assumptions with $r_1(k)$ as small as possible. In the same way, Assumption (iii) is mild. When applied in the single or vector treatment setting later on, the rate $r_{2,a}(n, k)$ corresponds to a second-order nuisance penalty; more specifically, it will be the product of the errors from estimating the regression function and the propensity score. The restricted eigenvalue condition in Assumption (iv) and sample covariance condition in Assumption (v) are standard assumptions in the high-dimensional regression literature to obtain so-called fast Lasso rates in the random design, i.e., a dependence on $\frac{\log k}{n}$ instead of $\sqrt{\frac{\log k}{n}}$. When we apply the above theorem to single multi-valued and vector treatments, respectively, we show that these two assumptions are satisfied.

Remark 3 (Exact versus approximate sparsity). The above theorem assumes exact sparsity of the functional $\psi$. Depending on the application, assuming exact sparsity might appear unreasonable. Instead, one might resort to approximate sparsity, where we assume that $\|\psi\|_1 \le s$. Also under approximate sparsity, we are able to derive an error guarantee of the above flavor; however, there are some important distinctions to be made. First, in this setting, our analysis technique only provides an error guarantee for the Lasso estimator $\widehat{\psi}_{\mathrm{lasso}}$, not for the best subset selection estimator. Second, by assuming only approximate sparsity, we have to pay the price of obtaining a slower convergence rate. Third, the fast rate in Theorem 1 is obtained by using exact sparsity in combination with the restricted eigenvalue condition and sample covariance condition; when we only assume approximate sparsity, we no longer need to assume the restricted eigenvalue and sample covariance conditions. For details we refer to Appendix A.3.1.

Remark 4 (Error metric). The previous theorem weights the error according to the marginal distribution of the treatment variable. However, it is possible to state the same error bound for the error weighted according to the empirical proportions of the treatment, i.e.,

$$\mathbb{E}\left(\mathbb{P}_n\left[\left\{V_A^T(\widehat{\psi} - \psi)\right\}^2\right]\right) \lesssim \frac{s\, r_1(k)\log k}{n} + s\, \mathbb{E}\left[\max_{j \in \{1,\ldots,k\}}\left\{\int_{\mathcal{A}} r_{2,a}(n, k)\, |V_{a,j}|\, d\mathbb{P}_n^1(a)\right\}^2\right].$$

This is shown along the way in the proof of the theorem. We wish to point out that this bound on the in-sample error no longer requires the assumptions $|V_{A,j} V_{A,l}| \le C_1 r_1(k)$ and $\mathbb{P}(V_{A,j}^2 V_{A,l}^2) \le C_1 r_1(k)$.
Moreover, we then only need to assume $n \gtrsim r_1(k)\log k$ instead of $n \gtrsim \max\{s^4/r_1(k)^2,\, r_1(k)\}\log k$.

Remark 5 (Weighing the treatments). In the empirical risk (2) used for the estimator in the above theorem, we weigh the treatments using the empirical measure $\mathbb{P}_n^1(a)$ of the treatment variable $A$ on the sample $\mathcal{D}_1$. We note that the estimated risk can instead use a general measure $\widehat{\mathbb{P}}$ to weigh the treatments, as long as Assumptions (i) and (ii) can be verified when replacing the measure $\mathbb{P}_n^1(a)$ with $\widehat{\mathbb{P}}(a)$, and $\{(\widehat{\mathbb{P}} - \mathbb{P})(V_A^T\psi\, V_{A,j})\}^2$ behaves similarly to an empirical process term in the sense that its maximum over $j \in \{1, \ldots, k\}$ can be bounded in expectation by $\frac{r_1(k)\log k}{n}$.

Remark 6 (Choosing a suitable pseudo-outcome). The pseudo-outcomes $\widehat{\varphi}_1(Z)$ and $\widehat{\varphi}_{2,a}(Z)$ can often be chosen based on the (uncentered) efficient influence function of the target parameter. For instance, suppose that the estimation target (oracle outcome) is $\mathbb{E}\{\mu_a(X)\}$; then the uncentered efficient influence function $\varphi_a(Z; \pi, \mu)$ (depending on the propensity score and regression function) equals the outcome in expectation. Using estimators $\widehat{\pi}$ and $\widehat{\mu}$ of the nuisance parameters, we obtain an estimated efficient influence function $\varphi_a(Z; \widehat{\pi}, \widehat{\mu})$. We can now set $\widehat{\varphi}_a(Z) = \varphi_a(Z; \widehat{\mu}, \widehat{\pi})$. Using efficient influence function-based pseudo-outcomes in our proposed framework leads to a nuisance estimation error rate $r_{2,a}(n, k)$ in Assumption (iii) that is a second-order term consisting of squares or products of the nuisance errors.

Remark 7 (Choice of the constraint parameter). The theorem uses the fact that the constraint tuning parameter $s$ in the Lasso estimator (3) and the best subset selection estimator (4) is chosen as the true sparsity, that is, $s = \|\psi\|_0$ for best subset selection and $s = \|\psi\|_1$ for the Lasso. However, the sparsity of the vector $\psi$ is often unknown, which makes this choice infeasible in practice. Instead, the constraint tuning parameter in the Lasso and best subset selection estimator can be chosen by cross-validation. We note without further details that it is possible to state a cross-validation procedure based on minimizing an estimated risk and provide theoretical guarantees in the form of oracle inequalities, demonstrating the validity of the procedure.
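To make Remark 7 concrete, here is one simple cross-validation scheme of the kind alluded to there; the paper does not spell out its procedure, so the fold-wise risk components and the reuse of the hypothetical `constrained_lasso` solver sketched above are our own illustrative choices. Since the estimated risk is determined by the pair $(b, Q)$, one can compute these components on each fold of $\mathcal{D}_2$ separately, fit on all but one fold, and evaluate the estimated risk on the held-out fold.

```python
import numpy as np

def cv_choose_s(b_folds, Q_folds, s_grid):
    """V-fold cross-validation for the L1 budget s of the Lasso estimator (3).
    b_folds[v], Q_folds[v]: linear/quadratic risk components on fold v of D_2.
    Returns the s in s_grid minimizing the average held-out estimated risk."""
    V = len(b_folds)
    avg_risk = np.zeros(len(s_grid))
    for j, s in enumerate(s_grid):
        for v in range(V):
            # fit on all folds except v
            b_tr = np.mean([b_folds[u] for u in range(V) if u != v], axis=0)
            Q_tr = np.mean([Q_folds[u] for u in range(V) if u != v], axis=0)
            beta = constrained_lasso(b_tr, Q_tr, s)
            # held-out estimated risk: -2 b_v^T beta + beta^T Q_v beta
            avg_risk[j] += (-2.0 * b_folds[v] @ beta
                            + beta @ Q_folds[v] @ beta) / V
    return s_grid[int(np.argmin(avg_risk))]
```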
5 Single Multi-Valued Treatments

In this section, we apply the general pseudo-outcome regression to the estimation of mean potential outcomes for single multi-valued treatments. We state the proposed Lasso and best subset selection estimators of Section 4 for this specific use case, show double robustness of these estimators, and provide error guarantees under sparsity. Finally, this reveals that faster convergence rates, in particular, rates faster than the minimax rate in Proposition 1, are achievable with additional structure.

5.1 Setup & Model Assumption

Throughout this section, we assume that the treatment variable $A$ is a single multi-valued treatment, i.e., it takes values in the set $\mathcal{A} = \{1, \ldots, k\}$, where the number of possible treatment levels $k$ is potentially large. Our goal is to estimate the mean potential outcomes $\mathbb{E}(Y^a)$, $a = 1, \ldots, k$. We have already seen that, without any further assumptions, we are stuck with a slow convergence rate of $k/n$ in MSE (see Proposition 1 for details). To achieve faster convergence rates, we utilize the Lasso and best subset selection estimators from Section 4 under sparsity of the mean potential outcomes.

To leverage the pseudo-outcome regression framework, we connect our oracle outcomes $\mathbb{E}(Y^a)$ with a sparse statistical functional $\psi$. We assume the sparse model

$$\mathbb{E}(Y^a) = \int_{\mathbb{R}^d} \mu_a(x)\, d\mathbb{P}(x) = \psi_0 + \psi_a = \psi_0 + \sum_{j=1}^k \psi_j \mathbb{1}(a = j), \quad \text{where } \|\psi\| \le s, \tag{5}$$

for some intercept $\psi_0$ that is assumed to be known for simplicity, and $\|\cdot\|$ either the $\ell_0$- or $\ell_1$-norm. Note that we require typical identification assumptions for the first equality to hold; even without identification assumptions, the following theory still applies to estimating $\int \mu_a(x)\, d\mathbb{P}(x)$. Intuitively, the model assumption in (5) says that the mean potential outcomes do not deviate too much from some intercept value $\psi_0$. Our goal is to estimate the functional $\psi \colon \mathcal{P} \to \mathbb{R}^k$ using our pseudo-outcome regression framework, leveraging the introduced sparsity to obtain fast convergence rates. In the next subsection, we propose our estimators.

5.2 Proposed Estimators

To estimate $\psi$ in (5), we leverage the pseudo-outcome regression framework, namely the Lasso estimator in (3) and the best subset selection estimator in (4), for a specific choice of $\widehat{\varphi}_a$ in the empirical risk (2). For nuisance parameters $\eta = (\mu, \pi)$, define the pseudo-outcomes

$$\varphi_a(Z; \eta) = \frac{\mathbb{1}(A = a)}{\pi_a(X)}\{Y - \mu_a(X)\} + \mu_a(X) - \psi_0$$

for all $a \in \mathcal{A}$. Note that this pseudo-outcome is the difference of the uncentered efficient influence function of $\mathbb{E}\{\mu_a(X)\}$ and the intercept $\psi_0$, so $\mathbb{E}\{\varphi_a(Z; \eta)\} = \mathbb{E}\{\mu_a(X)\} - \psi_0 = \psi_a$. Let $\widehat{\eta} = (\widehat{\mu}, \widehat{\pi})$ be a given estimator of the nuisance parameters, constructed on the fold $\mathcal{D}_1$ of size $n$ that is independent of our sample $\mathcal{D}_2 = (Z_1, \ldots, Z_n)$. Then, set

$$\widehat{\varphi}_a(Z) = \sqrt{k}\, \varphi_a(Z; \widehat{\eta}) = \sqrt{k}\, \varphi_a(Z; \widehat{\mu}, \widehat{\pi}) \tag{6}$$

to be the (scaled) plug-in estimate of the pseudo-outcome.

It is important to highlight three aspects. First, using the efficient influence function as pseudo-outcome comes with the benefit that our estimated pseudo-outcome is close to the oracle outcome up to the estimation error from the nuisance parameters, as $\mathbb{E}\{\varphi_a(Z; \eta)\} = \psi_a$. Second, the error from nuisance estimation will turn out to be doubly robust due to the definition of the efficient influence function via the von Mises expansion, yielding a second-order remainder term. More specifically, we show that the nuisance penalty has a second-order dependence on

$$\delta_n(a) = \left\|\frac{\pi_a}{\widehat{\pi}_a} - 1\right\| \quad \text{and} \quad \epsilon_n(a) = \|\mu_a - \widehat{\mu}_a\|.$$

Finally, scaling the pseudo-outcomes by $\sqrt{k}$ is useful for verifying a restricted eigenvalue condition.

Next, we set the features of the regression to be the scaled vector of indicators $V_a = \sqrt{k}\,(\mathbb{1}(a = 1), \ldots, \mathbb{1}(a = k))^T$. Plugging these expressions into (2), the Lasso and best subset selection estimators minimize the empirical risk

$$\widehat{R}(\beta) = \mathbb{P}_n^2\left[-2k\sum_{a=1}^k\left\{\frac{\mathbb{1}(A = a)}{\widehat{\pi}_a(X)}\big(Y - \widehat{\mu}_a(X)\big) + \widehat{\mu}_a(X) - \psi_0\right\}\beta_a\, \widehat{\varpi}_a + k\beta_A^2\right] \tag{7}$$

subject to an $\ell_1$- or $\ell_0$-constraint, respectively, where $\widehat{\varpi}_j = \mathbb{P}_n^1\{\mathbb{1}(A = j)\}$ denotes the empirical proportion of treatment $j$ on $\mathcal{D}_1$. Since we scaled both pseudo-outcomes and predictors by $\sqrt{k}$, it becomes apparent that this does not change our estimators, as minimizing (7) is equivalent to minimizing this risk divided by $k$.
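To illustrate the construction, the following sketch assembles the risk (7), divided by $k$, into the pair $(b, Q)$ consumed by the hypothetical `constrained_lasso` solver sketched in Section 4; since $V_a = \sqrt{k}\, e_a$, the quadratic term is diagonal in the empirical proportions on $\mathcal{D}_2$. The nuisance interfaces `mu_hat` and `pi_hat` (returning $n \times k$ arrays of $\widehat{\mu}_a(X_i)$ and $\widehat{\pi}_a(X_i)$, fitted on $\mathcal{D}_1$) are assumed placeholders, and treatment levels are 0-indexed for convenience.

```python
import numpy as np

def dr_pseudo_outcomes(A2, X2, Y2, mu_hat, pi_hat, psi0):
    """Plug-in doubly robust pseudo-outcomes phi_a(Z_i; eta_hat) as an
    (n2, k) array, for treatment levels a = 0, ..., k-1 (0-indexed)."""
    mu = mu_hat(X2)   # (n2, k): outcome regression mu_hat_a(X_i), fit on D_1
    pi = pi_hat(X2)   # (n2, k): propensity scores pi_hat_a(X_i), fit on D_1
    k = mu.shape[1]
    ind = (A2[:, None] == np.arange(k)).astype(float)   # 1(A_i = a)
    return ind / pi * (Y2[:, None] - mu) + mu - psi0

def single_treatment_lasso(A1, A2, X2, Y2, mu_hat, pi_hat, psi0, s):
    """Lasso estimator of psi for a single multi-valued treatment:
    minimizes the empirical risk (7), divided by k, over ||beta||_1 <= s."""
    phi = dr_pseudo_outcomes(A2, X2, Y2, mu_hat, pi_hat, psi0)
    k = phi.shape[1]
    w1 = np.bincount(A1, minlength=k) / len(A1)  # proportions on D_1
    w2 = np.bincount(A2, minlength=k) / len(A2)  # proportions on D_2
    b = w1 * phi.mean(axis=0)        # linear part of the risk
    Q = np.diag(w2)                  # quadratic part: P_n^2 (beta_A^2)
    return constrained_lasso(b, Q, s)   # solver sketched in Section 4
```

Because $Q$ is diagonal here, the constrained problem essentially decouples across treatment levels up to the shared $\ell_1$ budget, which is the thresholding heuristic made explicit in Remark 8 below.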
Remark 8 (Connection to Gaussian sequence model). Since the Lasso and best subset selection estimators are obtained through a constrained regression on indicators via the above risk function, they are heuristically very similar to soft- and hard-thresholding estimators in the Gaussian sequence model, where $Y_a \sim N\left(\psi_0 + \psi_a, \frac{1}{N_a}\right)$ with $N_a \sim \mathrm{Bin}(n, \mathbb{P}(A = a))$. However, this analogy is limited: First, the sample size $N_a$ is random; specifically, we only have around $n/k$ observations at each level instead of $n$. Second, we do not observe $Y_a$ directly; instead, it has to be estimated using the covariates. Lastly, we do not assume Gaussianity of the outcome, nor are our (estimated) pseudo-outcomes sub-Gaussian. In Appendix A.2, we discuss an alternative risk function for which this connection can be made rigorous.

5.3 Error Guarantee

In this section, we present the error rate of our proposed Lasso and best subset selection estimators under exact sparsity, i.e., when $\|\psi\|_0 \le s$. We demonstrate that a fast Lasso rate can be achieved and that fast convergence rates are possible, even when the number of treatments is large relative to the sample size. For an analysis under approximate sparsity, we refer to Appendix A.3.2.

Theorem 2. Let $\widehat{\psi} \in \{\widehat{\psi}_{\mathrm{lasso}}, \widehat{\psi}_{\mathrm{subset}}\}$ be either the Lasso estimator defined in (3) or the best subset selection estimator defined in (4) minimizing the empirical risk in (7) for $\widehat{\varphi}_a$ as in (6). Assume the model in (5) with exact sparsity, i.e., $\|\psi\|_0 \le s$ for some sparsity constraint $s$. Moreover, assume that for all $a \in \{1, \ldots, k\}$

(i) (Nearly uniform treatment probabilities) $\frac{c}{k} \le \varpi_a = \mathbb{P}(A = a) \le \frac{C}{k}$,

(ii) (Estimated propensity score close to empirical weights) $|\widehat{\varpi}_a / \widehat{\pi}_a(X)| \le B$ almost surely, and

(iii) (Boundedness) $|\widehat{\mu}_a(X)|, |Y|, |\psi_0| \le B$ almost surely

for constants $c, C, B > 0$ that are independent of $k$ and $n$. Then,

$$\mathbb{E}\left\{\sum_{a=1}^k \varpi_a(\widehat{\psi}_a - \psi_a)^2\right\} \lesssim \frac{s\log k}{n} + \frac{s}{k}(\delta_n\epsilon_n)^2$$

for $\varpi_a = \mathbb{P}(A = a)$, $k \ge 2$, and $n \ge \max\{\gamma k, s^4/k^2\}\log k$ for a large enough constant $\gamma$, when assuming that $\delta_n(a) \le \delta_n$ and $\epsilon_n(a) \le \epsilon_n$ for all $a \in \{1, \ldots, k\}$.

The error guarantee given in the theorem consists of two parts: the oracle rate $\frac{s\log k}{n}$ and the nuisance estimation error $\frac{s}{k}(\delta_n\epsilon_n)^2$. The oracle error rate can be viewed as the estimation error that we would suffer even if we used the unobserved potential outcomes $Y^a$ in the regression. The nuisance estimation error arises from the need to estimate both the regression function and the propensity score, i.e., it is the price we pay for not observing the potential outcomes.

Notably, the oracle rate $\frac{s\log k}{n}$ reveals that we can significantly improve on the slow $k/n$ convergence rate stated in Proposition 1 when assuming sparsity of the vector $\psi$: our error now depends on $k$ only logarithmically instead of linearly. This shows that consistent estimation is possible even in the high-dimensional treatment regime.

The nuisance estimation error term contains a second-order product $(\delta_n\epsilon_n)^2$ of the regression function and propensity score estimation errors. For this reason, we refer to our estimator as doubly robust. When assuming parametric models for the nuisance estimation, it is enough to correctly specify one of the two nuisance models for consistent estimation.
Even when flexible nonparametric machine learning tools are used to estimate the nuisance parameters, it is sufficient for the product of the rates to be "fast enough". In particular, we achieve the oracle rate $\frac{s\log k}{n}$ whenever $\delta_n\epsilon_n \lesssim \sqrt{\frac{k\log k}{n}}$. Hence, the individual nuisance error rates can be slower as long as the product of the two satisfies the rate requirement; for instance, this is the case when $\delta_n, \epsilon_n \asymp \left(\frac{k\log k}{n}\right)^{1/4}$. Note that the requirement $\delta_n\epsilon_n \lesssim \sqrt{\frac{k\log k}{n}}$ to achieve the oracle error rate corresponds to $\delta_n\epsilon_n \lesssim 1/\sqrt{n}$ in the classical doubly robust estimation setting, up to a $\log k$ inflation factor. The difference arises from the fact that our effective sample size is $n/k$ instead of $n$.

In the following remarks, we comment on the assumptions of Theorem 2 and provide an alternative upper bound for the error that is slightly more informative.

Remark 9 (Exact sparsity assumption and restricted eigenvalues). Assuming exact sparsity $\|\psi\|_0 \le s$ allows us to derive a fast Lasso rate, i.e., an oracle rate of the order $\frac{s\log k}{n}$ instead of the slower rate $s\sqrt{\frac{\log k}{n}}$. Note that we do not need to assume restricted eigenvalues of the design matrix to achieve this fast Lasso rate, as the condition is automatically satisfied due to the specific form of the regression features and the rescaling of the estimated pseudo-outcomes and features. Note that it is possible to derive a slow Lasso rate when we assume only approximate sparsity instead of exact sparsity; details can be found in Appendix A.3.2.

Remark 10 (Assumptions on treatment probabilities). Assuming that the treatment probabilities $\varpi_a$ are nearly uniform implies that each treatment has a roughly equal chance of being applied. Essentially, it rules out the possibility that some treatments are too rare for meaningful estimation at these treatment levels. We further assume that the estimated propensity score $\widehat{\pi}_a(X)$ is close to the empirical weights $\widehat{\varpi}_a$. This is a mild assumption since $\widehat{\pi}_a(X)$ is also constructed on $\mathcal{D}_1$ and can therefore be thresholded from below to satisfy Assumption (ii) (potentially at the expense of a larger nuisance penalty $\delta_n(a)$).

Remark 11 (Boundedness assumptions). Boundedness of the outcome $Y$ and intercept $\psi_0$ is natural in many practical applications, for instance, when the outcome is a test score or some health outcome. Moreover, the boundedness assumption on the estimated regression function $\widehat{\mu}$ is mild, as we can threshold the estimator to satisfy this condition.

Remark 12 (Ultra-high-dimensional regime). The theorem states the rate for $k\log k \lesssim n$. When $k \gtrsim n$, the same rate can be derived, but now the nuisance estimation term is dominated by the oracle rate, as $(\delta_n\epsilon_n)^2 \lesssim 1 \lesssim \frac{k\log k}{n}$. However, in that regime, the oracle rate is not optimal and can be beaten by the trivial estimator that always estimates zero as the mean potential outcome. This trivial estimator has a rate of order $s/k$, which will turn out to be the minimax optimal rate in this regime. Consequently, meaningful estimation is not possible when $k \gtrsim n$, even when assuming sparsity. Such a phenomenon also occurs in the standard high-dimensional regression scenario whenever $\log d > N$, which is called the ultra-high-dimensional setting (Verzelen [2012]), where $d$ is the number of features and $N$ is the sample size.
Analogously, we can view the case $k \gtrsim n$ as the ultra-high-dimensional regime for estimating $k$ mean potential outcomes. The parallel can be drawn by noting that the effective sample size is $n/k$ instead of $n$ in this high-dimensional treatment setting, so $\log d > N$ reduces to $k\log k > n$ when setting $N = n/k$ and $d = k$.

Remark 13 (Intercept). In (5), we assumed the intercept $\psi_0$ to be known. For many practical applications, this may not hold, and the intercept must be estimated. However, for many choices of the intercept, this estimation task is much easier in comparison to estimating the high-dimensional vector $\psi$, and the error rate corresponding to the estimation of $\psi_0$ is negligible. For instance, a reasonable choice of the intercept could be the average of all mean potential outcomes, i.e., $\psi_0 = \frac{1}{k}\sum_{a=1}^k \mathbb{E}(Y^a)$. An estimator $\widehat{\psi}_0$ could be the average of the estimated uncentered efficient influence functions of the mean potential outcomes. The estimated pseudo-outcomes in (6) can then be defined using the estimate $\widehat{\psi}_0$ instead of the true $\psi_0$. In the proof of Theorem 2, we then additionally need to consider the estimation error arising from $\widehat{\psi}_0$, which, however, turns out to yield a negligible doubly robust rate with a straightforward calculation. Consequently, the same error guarantee as in Theorem 2 can be achieved when the intercept, chosen as the average mean potential outcome, has to be estimated. A similar behavior is expected for other choices of the intercept, such as the median, the mean potential outcome at one specific level, or general linear combinations of the target parameter.

Remark 14 (Alternative upper bound). Instead of the upper bound stated in the theorem, it is possible to derive the following upper bound with a slightly more informative nuisance error term:

$$\mathbb{E}\left\{\sum_{a=1}^k \varpi_a(\widehat{\psi}_a - \psi_a)^2\right\} \lesssim \frac{s\log k}{n} + sk\, \mathbb{E}\left[\max_{a \in \{1,\ldots,k\}}\{\widehat{\varpi}_a\, \delta_n(a)\, \epsilon_n(a)\}^2\right].$$

This result is shown along the way in the proof of Theorem 2. This upper bound is more informative since it states the nuisance error in terms of the estimation error of the regression function and propensity score at each specific treatment level $a \in \{1, \ldots, k\}$. Hence, we do not rely on upper bounding $\delta_n(a)$, $\epsilon_n(a)$ by $\delta_n$, $\epsilon_n$ across all treatment levels. Most importantly, this nuisance error term reveals that $\delta_n(a)$ and $\epsilon_n(a)$ are weighted by the empirical proportions $\widehat{\varpi}_a$ of the treatment variable on $\mathcal{D}_1$. Hence, if only very few observations are available at a treatment level for nuisance estimation, then the term allows for greater nuisance estimation errors at this level. In particular, if a treatment level is entirely unobserved, then $\widehat{\varpi}_a = 0$ and nuisance estimation at this level is allowed to be arbitrarily inaccurate without increasing the overall error.

6 Vector Treatments

In this section, we apply the general pseudo-outcome regression to estimate mean potential outcomes for binary or continuous vector treatments. We present the proposed Lasso and best subset selection estimators of Section 4 for this specific scenario and provide error guarantees under sparsity. These results demonstrate that the proposed estimators are doubly robust and can achieve fast convergence rates in the high-dimensional treatment regime, provided that additional sparsity is assumed.
6.1 Setup & Model Assumption

Throughout this section, we assume that the treatment variable $A$ is a vector treatment, i.e., a vector $A = (A_1, \ldots, A_k)$ consisting of $k$ individual treatments, where the number of treatments $k$ is potentially large. Our goal is to efficiently estimate the mean potential outcomes $\mathbb{E}(Y^a)$, i.e., the outcome if a unit received treatment combination $a = (a_1, \ldots, a_k)$. To achieve fast convergence rates in this high-dimensional treatment regime, we leverage our general sparse pseudo-outcome regression framework of Section 4.

To leverage the pseudo-outcome regression framework, we want to connect our oracle outcomes $\mathbb{E}(Y^a)$ with a sparse statistical functional $\psi$. In this vector treatment setting, we do this by assuming a linear marginal structural model of the form

$$\mathbb{E}\left(Y^{(a_1,\ldots,a_k)}\right) = \int_{\mathbb{R}^d} \mathbb{E}(Y \mid X = x, A_1 = a_1, \ldots, A_k = a_k)\, d\mathbb{P}(x) = \psi_0 + \sum_{j=1}^k \psi_j a_j, \quad \text{where } \|\psi\| \le s, \tag{8}$$

for some intercept $\psi_0$ that is assumed to be known for simplicity, and $\|\cdot\|$ either the $\ell_0$- or $\ell_1$-norm. Note that we require typical identification assumptions for the first equality to hold; even without identification assumptions, the following theory still applies to estimating $\int \mu_a(x)\, d\mathbb{P}(x)$. Intuitively, the model assumption in (8) says that only a few treatments of every treatment combination have a nontrivial effect on the outcome, and also that the mean potential outcomes do not deviate too much from some intercept value $\psi_0$. Our goal is to estimate the statistical functional $\psi \colon \mathcal{P} \to \mathbb{R}^k$ using our pseudo-outcome regression framework, leveraging the introduced sparsity to achieve fast convergence rates.

6.2 Proposed Estimators

To estimate $\psi$ in (8), we leverage the pseudo-outcome regression framework, more specifically the Lasso estimator defined in (3) and the best subset selection estimator defined in (4), for a specific choice of $\widehat{\varphi}_a$ in the empirical risk (2) that is minimized. For nuisance parameters $\eta = (\mu, \pi)$, define the pseudo-outcomes

$$\varphi_a(Z; \eta) = \frac{\mathbb{1}(A = a)}{\pi_a(X)}\{Y - \mu_a(X)\} + \mu_a(X) - \psi_0$$

for all $a \in \mathcal{A}$. Note that this is the difference of the uncentered efficient influence function for $\mathbb{E}\{\mu_a(X)\}$ and the intercept, so $\mathbb{E}\{\varphi_a(Z; \eta)\} = \mathbb{E}\{\mu_a(X)\} - \psi_0 = a^T\psi$. Then, set

$$\widehat{\varphi}_a(Z) = \varphi_a(Z; \widehat{\eta}) = \varphi_a(Z; \widehat{\mu}, \widehat{\pi}) \tag{9}$$

to be the plug-in estimate of the pseudo-outcome based on estimators $\widehat{\mu}, \widehat{\pi}$ that are constructed on a separate sample $\mathcal{D}_1$ that is independent of our sample $\mathcal{D}_2 = (Z_1, \ldots, Z_n)$.

It is important to highlight two aspects. First, using the efficient influence function as pseudo-outcome comes with the benefit that our estimated pseudo-outcome is close to the oracle outcome up to the estimation error from the nuisance parameters, as $\mathbb{E}\{\varphi_a(Z; \eta)\} = a^T\psi$. Second, the error produced by estimating the nuisance parameters will be doubly robust due to the definition of the efficient influence function via the von Mises expansion, yielding a second-order remainder term. More specifically, we show that the nuisance penalty has a second-order dependence on

$$\delta_n(a) = \left\|\frac{\pi_a}{\widehat{\pi}_a} - 1\right\| \quad \text{and} \quad \epsilon_n(a) = \|\mu_a - \widehat{\mu}_a\|.$$

Next, we set the features $V_a$ to be the treatment combinations themselves, i.e., $V_a = a$.
Plugging these expressions into (2), our proposed Lasso and best subset selection estimators in (3) and (4) then minimize the empirical risk

$$\widehat{R}(\beta) = \mathbb{P}_n^2\left[-2\sum_{a \in \mathcal{A}}\left\{\frac{\mathbb{1}(A = a)}{\widehat{\pi}_a(X)}\big(Y - \widehat{\mu}_a(X)\big) + \widehat{\mu}_a(X) - \psi_0\right\}a^T\beta\, \widehat{\varpi}_a + (A^T\beta)^2\right] \tag{10}$$

subject to an $\ell_1$- or $\ell_0$-constraint, respectively, where $\widehat{\varpi}_a = \mathbb{P}_n^1\{\mathbb{1}(A = a)\}$ denotes the empirical proportion of treatment combination $a$ on $\mathcal{D}_1$.

6.2.1 Error Guarantee

In this section, we present the error guarantee for our proposed Lasso and best subset selection estimators for vector treatments under exact sparsity, i.e., when $\|\psi\|_0 \le s$. We show that a fast Lasso rate can be achieved and that fast rates are possible even when the vector treatment is high-dimensional. For an analysis under approximate sparsity, we refer to Appendix A.3.3.

Theorem 3. Let $\widehat{\psi} \in \{\widehat{\psi}_{\mathrm{lasso}}, \widehat{\psi}_{\mathrm{subset}}\}$ be either the Lasso estimator defined in (3) or the best subset selection estimator defined in (4) minimizing the empirical risk in (10) for $\widehat{\varphi}_a$ as in (9). Assume the model defined in (8) and suppose exact sparsity, i.e., $\|\psi\|_0 \le s$ for some sparsity constraint $s$. Moreover, assume that for all $a \in \mathcal{A}$

(i) (Estimated propensity score close to empirical weights) $|\widehat{\varpi}_a / \widehat{\pi}_a(X)| \le B$,

(ii) (Restricted eigenvalues) $v^T\mathbb{E}(AA^T)v \ge D\|v\|_2^2$ for all $v \neq 0$,

(iii) (Sample covariance condition) $\sup_{\|v\|_2 = 1,\, \|v\|_1 \le 2\sqrt{s}}\left|v^T\left\{\mathbb{E}(AA^T) - \frac{1}{n}\mathbb{A}^T\mathbb{A}\right\}v\right| \le \frac{D}{2}$ with probability at least $1 - \alpha$ for $\alpha \le C_\alpha\frac{\log k}{n}$, where $\mathbb{A} = (A_1, \ldots, A_n)^T$, and

(iv) (Boundedness) $|Y|, |\widehat{\mu}|, |\psi_0|, \|A\|_\infty \le B$ almost surely

for constants $B, C_\alpha, D > 0$ that are independent of $k$ and $n$. Then,

$$\mathbb{E}\left[\left\{A^T(\widehat{\psi} - \psi)\right\}^2\right] \lesssim \frac{s\log k}{n} + s(\delta_n\epsilon_n)^2$$

for $s^4\log k \lesssim n$, when $\delta_n(a) \le \delta_n$ and $\epsilon_n(a) \le \epsilon_n$ for all $a \in \mathcal{A}$.

The error guarantee given by the theorem consists of two parts. First, the oracle rate $\frac{s\log k}{n}$ that one would obtain when using $Y^a$ as the outcome in the regression. Second, the nuisance estimation error that arises from estimating the propensity score and regression function, which is the price we pay for not observing the potential outcomes and having to estimate them. To achieve the oracle rate $\frac{s\log k}{n}$, we require the nuisance estimation error to be of a smaller or equal order. Since the nuisance estimation error is a second-order term, i.e., a product of the propensity score and regression estimation errors, each nuisance can be estimated at a slower rate as long as the product of the errors is fast enough. Importantly, when the oracle rate is achieved, the prediction error of the mean potential outcomes depends on $k$ only logarithmically, showing that consistent estimation is possible even when $k$ is large in proportion to $n$.

Remark 15 (Alternative upper bound). It is possible to show a more precise upper bound of the form

$$\mathbb{E}\left[\left\{A^T(\widehat{\psi} - \psi)\right\}^2\right] \lesssim \frac{s\log k}{n} + s\, \mathbb{E}\left[\left\{\sum_{a \in \mathcal{A}}\widehat{\varpi}_a\, \delta_n(a)\, \epsilon_n(a)\right\}^2\right].$$

This upper bound is established during the proof of the theorem. It shows that the nuisance error at a certain level is weighted by its empirical proportion on $\mathcal{D}_1$ (the fold used for nuisance estimation). This particularly shows that the overall error does not suffer from treatment levels that are potentially unseen (i.e., $\widehat{\varpi}_a = 0$), as nuisance estimation can be arbitrarily inaccurate at such levels.
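For concreteness, a corresponding sketch for binary vector treatments assembles the risk (10) into the $(b, Q)$ pair for the hypothetical `constrained_lasso` solver of Section 4, looping over the treatment combinations observed in $\mathcal{D}_1$; the nuisance interfaces `mu_hat(X, a)` and `pi_hat(X, a)` (fitted on $\mathcal{D}_1$, evaluated at a fixed combination $a$) are again assumed placeholders.

```python
import numpy as np

def vector_treatment_lasso(A1, A2, X2, Y2, mu_hat, pi_hat, psi0, s):
    """Lasso estimator of psi under the linear model (8) for binary vector
    treatments: minimizes the empirical risk (10) over ||beta||_1 <= s.
    A1: (n1, k) treatment vectors from D_1 (defines the weights varpi_hat_a);
    A2, X2, Y2: data from D_2."""
    combos, counts = np.unique(A1, axis=0, return_counts=True)
    w1 = counts / len(A1)            # empirical proportions varpi_hat_a on D_1
    n2, k = A2.shape
    b = np.zeros(k)
    for a, w in zip(combos, w1):
        mu = mu_hat(X2, a)           # length-n2 array mu_hat_a(X_i)
        ind = (A2 == a).all(axis=1)  # 1(A_i = a), comparing whole vectors
        phi = ind / pi_hat(X2, a) * (Y2 - mu) + mu - psi0
        b += w * phi.mean() * a      # accumulates int phi_a(Z) a dP_n^1(a)
    Q = (A2.T @ A2) / n2             # quadratic part P_n^2 (A A^T)
    return constrained_lasso(b, Q, s)   # solver sketched in Section 4
```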
Remark 16 (Sparsity assumption). For many applications, exact sparsity is a reasonable assumption. However, in some cases this assumption might be too strong, and one must resort to approximate sparsity, i.e., $\|\psi\|_1 \le s$. It is possible to analyze the proposed Lasso estimator under approximate sparsity, which is discussed in more detail in Appendix A.3.3.

Remark 17 (Restricted eigenvalues and sample covariance condition). The restricted eigenvalue assumption and sample covariance condition are satisfied, e.g., when $A = (A_1, \dots, A_k) \in \{0,1\}^k$ where the components are independent and $P(A_j = 1) = 1/2$ for all $j$, as long as $n \gtrsim s^3 \log k$. (For this design, $E(AA^{\mathsf{T}}) = \frac{1}{4} I_k + \frac{1}{4}\mathbf{1}\mathbf{1}^{\mathsf{T}}$ has smallest eigenvalue $1/4$, so the restricted eigenvalue condition holds with $D = 1/4$.) For details on the verification of the sample covariance condition, we refer to Appendix A.4.

Remark 18 (Intercept). Although we assume the intercept to be known, it is possible to carry out the analysis throughout this section when $\psi_0$ is chosen as some functional, e.g., the mean or median of the mean potential outcomes, and has to be estimated. This can be done similarly to the approach discussed for single multi-valued treatments in Remark 13.

7 Minimax Lower Bounds

In this section, we derive minimax lower bounds for estimating mean potential outcomes for single multi-valued treatments under sparsity. More specifically, we state an oracle minimax lower bound as well as a minimax lower bound for the nuisance estimation error in the structure-agnostic minimax framework, both under exact sparsity. Finally, we establish optimality of our proposed Lasso and best subset selection estimators.

7.1 Motivation & Setup

Minimax rates are generally defined as $R_n = \inf_{\hat\psi} \sup_{P \in \mathcal{P}} E_P\{\ell(\hat\psi - \psi)\}$ for some statistical model $\mathcal{P}$ and loss function $\ell$. Intuitively, the minimax rate describes the best possible worst-case error of any estimator. Studying minimax rates is of crucial interest for several reasons: First, it helps answer the question of whether a particular estimator can be improved or is already optimal. If a derived minimax lower bound and the error rate of a particular estimator match, then this estimator is optimal, i.e., no other estimator can achieve a better worst-case error over the model. In case a lower bound on the minimax risk is smaller than the error rate of a particular estimator, two scenarios are possible: either the minimax lower bound can be improved and the estimator is in fact optimal, or the lower bound is tight and a better estimator can be found. Secondly, minimax rates reveal the fundamental limits of an estimation problem and determine what estimation error one can hope for. Finally, minimax rates allow for a comparison between different estimation problems in terms of their difficulty.

It is worth noting that when the goal is to prove the optimality of a given estimator, it is appropriate to make stronger assumptions in the model for the minimax result than in the model used to derive the upper bound for the error of the given estimator. Intuitively, this can be explained by the fact that additional assumptions only make estimation easier, so every minimax lower bound for a smaller model $\mathcal{P}'$ is also a minimax lower bound for every larger model $\mathcal{P}'' \supseteq \mathcal{P}'$.
In this paper, our primary motivation for studying minimax rates is to verify the optimality of our proposed Lasso and best subset selection estimators for estimating mean potential outcomes in the exactly-sparse single multi-valued treatment regime. Throughout this section, we assume the setup of Section 5 as given, where $A$ represents a single multi-valued treatment.

7.2 Minimax Rate under Exact Sparsity

In the following, we state the minimax rate for estimating mean potential outcomes for high-dimensional single multi-valued treatments under exact sparsity. Subsequently, we conclude optimality of the oracle rate achieved in Section 5.

Theorem 4. Let $s \ge 9$ and $k \ge 8s$. Further, let $\mathcal{P}$ denote the set of all distributions for which: (i) $\frac{C'}{k} \le \pi_a(X) \le \frac{C''}{k}$, (ii) $Y$ is binary, (iii) $(A, Y) \perp\!\!\!\perp X$, and (iv) the vector $\psi = (\psi_1, \dots, \psi_k)$ of the functionals $\psi_a = \int \mu_a(x)\, dP(x)$ is sparse in the sense that at most $s$ entries are different from $1/2$. Then,
\[
\inf_{\hat\psi} \sup_{P \in \mathcal{P}} E\left\{ \sum_{a=1}^k \varpi_a(P) \big(\hat\psi_a - \psi_a(P)\big)^2 \right\} \ge C \cdot \min\left\{ \frac{s \log(k/s)}{n}, \frac{s}{k} \right\}
\]
for $\varpi_a(P) = P(A = a)$ and $C$ a constant depending only on $C'$ and $C''$.

Most importantly, note that the model assumptions made in the above theorem are strictly stronger than the ones used in the upper bound results for our Lasso and best subset selection estimators in Theorem 2: Our minimax model assumes nearly uniform propensity scores in (i), which implies nearly uniform treatment probabilities. Moreover, Assumption (ii) implies the required boundedness of the outcome in Theorem 2. Assumption (iii) further shrinks the model. Assumption (iv) implies that the deviations from the intercept are exactly sparse, i.e., $\|\psi - \psi_0\|_0 \le s$ for $\psi_0 = 1/2$. Given that the above model in the minimax result is contained in the model in the upper bound result of Theorem 2, we can conclude optimality of the oracle rate achieved by our estimators $\hat\psi \in \{\hat\psi_{\mathrm{lasso}}, \hat\psi_{\mathrm{subset}}\}$.

It is worth mentioning that there is a slight mismatch between the term $\log(k/s)$ in the minimax rate and $\log k$ in the upper bound. However, in the regime where $k/s \asymp k^\gamma$ for some $\gamma > 0$, the rate $\log(k/s)$ is equivalent to $\log k$ (up to constants), in which case we can also write the minimax lower bound as
\[
\inf_{\hat\psi} \sup_{P \in \mathcal{P}} E_P\left\{ \sum_{a=1}^k \varpi_a(P) \big(\hat\psi_a - \psi_a(P)\big)^2 \right\} \gtrsim \min\left\{ \frac{s \log k}{n}, \frac{s}{k} \right\},
\]
which now exactly matches the upper bound result in Theorem 2.

The main idea to prove the theorem is to construct distributions that are close, but for which the target parameter $\psi$ is separated as much as possible. Then, intuitively, no estimator can perform better than this separation, since the distributions are statistically indistinguishable. The classical Le Cam's method constructs two such distributions, which suffice to derive minimax lower bounds for nonparametric regression at a point (Tsybakov [2009]). However, for our purposes, this method would give a lower bound that is not tight enough, so we use multiple hypotheses. More specifically, we employ Fano's method over a pruned and sparsified hypercube, as described in Tsybakov [2009].

The derived minimax risk can be viewed as an analogue of the minimax risk for high-dimensional linear regression, which is of the order $s \log(d)/n$ (e.g., refer to Raskutti et al. [2011]), where our derived rate additionally reflects the effective sample size of $n/k$ in the high-dimensional single multi-valued treatment regime (since the error is weighted by the treatment probabilities $\varpi_a \approx 1/k$).
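To see when each branch of the lower bound in Theorem 4 is active, the following small calculation (ours, added for clarity) compares the two terms:
\[
\frac{s \log(k/s)}{n} \le \frac{s}{k} \iff n \ge k \log(k/s),
\]
so the sample-size-dependent branch governs the rate once $n \gtrsim k \log(k/s)$, while for smaller samples the bound saturates at the dimension-driven level $s/k$.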
The above results prove the optimality of the oracle rate of our proposed estimators, and therefore the optimality of the estimators if they achieve this oracle rate, i.e., if the nuisance estimation error is of a smaller or equal order. However, they do not reveal whether the nuisance penalty in Theorem 2 is minimax optimal, which is covered in the following subsection.

7.3 Minimax Nuisance Error Dependence under Exact Sparsity

In the following, we show the optimality of the nuisance error dependence of our proposed Lasso and best subset selection estimators of Section 5. This is done by stating the minimax rate in the structure-agnostic minimax framework.

In the classical minimax framework, one typically assumes the relevant distribution is smooth or otherwise structured to derive minimax optimal rates. For instance, standard one-step estimators are known to be sub-optimal in many smoothness classes [Bickel and Ritov, 1988, Birgé and Massart, 1995]. However, this approach does not allow for a structure-agnostic estimation of the nuisance components, as optimal estimators carefully leverage the assumed structure of the model.

In the structure-agnostic minimax framework, introduced by Balakrishnan et al. [2023], we take nuisance estimators as given (in our case $\hat\pi$ and $\hat\mu$), and they are treated as black-box estimators. More specifically, these estimators can be obtained using any estimation procedure applied to a separate fold of the data, which is independent of the data used to run the regression. We further assume that those estimators achieve given error rates $\delta_n$ and $\epsilon_n$, i.e.,
\[
\max_{a \in \{1,\dots,k\}} \left\| \frac{\pi_a}{\hat\pi_a} - 1 \right\| \lesssim \delta_n \quad \text{and} \quad \max_{a \in \{1,\dots,k\}} \|\hat\mu_a - \mu_a\| \lesssim \epsilon_n.
\]
Importantly, these error rates are treated as unknown. When taking the structure-agnostic viewpoint, we aim to determine the optimal error for estimating our statistical functional $\psi$ when we have access to nuisance parameter estimates that achieve the above error guarantees. Therefore, we define the model of all distributions such that this error guarantee of the nuisance parameter estimators is satisfied, and derive the minimax risk under this model. Intuitively, we investigate optimality locally around the given nuisance parameter estimators. The following theorem states the minimax risk for this structure-agnostic model.

Theorem 5. Suppose that we have two pilot estimators $\hat\pi$ and $\hat\mu$ that achieve the error rates
\[
\max_{a \in \{1,\dots,k\}} \left\| \frac{\pi_a}{\hat\pi_a} - 1 \right\| \le C_\pi \delta_n \quad \text{and} \quad \max_{a \in \{1,\dots,k\}} \|\hat\mu_a - \mu_a\| \le C_\mu \epsilon_n \tag{11}
\]
for some $\delta_n, \epsilon_n = o(1)$ and constants $C_\pi, C_\mu > 0$ such that
\[
\varepsilon \le \hat\mu_a \le 1 - \varepsilon, \qquad \frac{C_1}{k} \le \hat\pi_{a'}(X) \le \frac{C_2}{k}, \qquad \hat\pi_k(X) \ge \varepsilon
\]
almost surely for all $a \in \{1, \dots, k\}$ and $a' \in \{1, \dots, k-1\}$, for some $\varepsilon \in (0, 1/2)$ and $C_1, C_2 > 0$ independent of $n$ and $k$, where $\hat\pi_k(x) = 1 - \sum_{j=1}^{k-1} \hat\pi_j(x)$. Let $\mathcal{P} = \mathcal{P}(\delta_n, \epsilon_n)$ denote the model where $Y \in \{0,1\}$ is binary, $X$ is uniformly distributed on $[0,1]^d$, $\frac{C_1'}{k} \le \pi_{a'}(X) \le \frac{C_2'}{k}$ and $\pi_k(X) \ge \varepsilon'$ almost surely for all $a' \in \{1, \dots, k-1\}$ for constants $C_1', C_2', \varepsilon' > 0$ independent of $n$ and $k$, the pilot estimators achieve the error rates in (11), and the vector $\psi$ with entries $\psi_a = \int \mu_a(x)\, dP(x)$, $a = 1, \dots, k$, is $s$-sparse, i.e., only $s$ entries are different from the value $\theta$, where $\psi_k = \theta$.
Then, the minimax rate is lower bounded as
\[
\inf_{\hat\psi} \sup_{P \in \mathcal{P}} E_P\left\{ \sum_{a=1}^k \varpi_a(P) \big(\hat\psi_a - \psi_a(P)\big)^2 \right\} \gtrsim \frac{s}{k} (\delta_n \epsilon_n)^2
\]
where $\varpi_a(P) = P(A = a)$.

The previous theorem shows that, among all distributions for which the pilot estimators achieve the rates $\delta_n$ and $\epsilon_n$, no estimator can do better than $\frac{s}{k}(\delta_n \epsilon_n)^2$. Consequently, the nuisance error dependence of our Lasso and best subset selection estimators in Theorem 2 is optimal and cannot be improved. In combination with the minimax risk in the sparse regime shown in Theorem 4, this allows us to claim optimality of the Lasso and best subset selection estimators.

Note that the assumptions in the above theorem are stronger than the ones used in the upper bound result in Theorem 2. Hence, the model in the above minimax result is strictly smaller, allowing us to conclude optimality of our estimators. The nearly uniform treatment probabilities assumption of Theorem 2 is satisfied when the propensity scores are nearly uniform, where assuming $\pi_k(X) \ge \varepsilon$ on the last treatment level can be viewed as a stronger assumption since, intuitively, it should make estimation for treatment level $k$ easier. Since $Y$ is chosen to be binary, the outcome boundedness assumption is also satisfied. Moreover, the above theorem assumes a slightly stronger version of exact sparsity, as it additionally requires $\psi_k = \theta$. Finally, we can assume that the data used to construct the pilot estimators $\hat\pi$ and $\hat\mu$ satisfies $P_n^1\{\mathbb{1}(A = a)\} \asymp 1/k$ for all treatments $a \in \{1, \dots, k\}$ (since the above result treats $D_1$ as given and fixed); then, the assumption of nearly uniform estimated propensity scores implies Assumption (ii) of Theorem 2.

The overall strategy to prove the theorem is to construct distributions that are close, but such that the target parameter $\psi$ is maximally separated under these distributions. Intuitively, when the distributions cannot be distinguished statistically, but the target parameter is separated, no estimator can achieve a smaller error than this functional separation. Our construction is similar to that of Jin and Syrgkanis [2025], but repeated across many treatment levels while ensuring the assumed sparsity is respected. These distributions are constructed using the method of fuzzy hypotheses [Birgé and Massart, 1995, Ibragimov et al., 1987, Ingster et al., 2003, Jin and Syrgkanis, 2025, Kennedy et al., 2024, Nemirovski, 2000, Robins et al., 2009, Tsybakov, 2009]. More specifically, this entails constructing two sets of distributions, along with a prior over these distributions. This prior can be viewed as a mixture of the distributions, providing us with a pair of such mixtures. Then, the main idea of the proof is to verify the assumptions of the following lemma, which is adapted from Theorem 2.15 in Tsybakov [2009].

Lemma 1 (Tsybakov [2009]). Let $P_\lambda$ and $Q_\lambda$ denote distributions in $\mathcal{P}$ indexed by a vector $\lambda = (\lambda_1, \dots, \lambda_k)$, with $n$-fold products denoted by $P_\lambda^n$ and $Q_\lambda^n$, respectively. Let $\varpi$ denote a prior distribution over $\lambda$.
If
\[
H^2\left( \int P_\lambda^n\, d\varpi(\lambda), \int Q_\lambda^n\, d\varpi(\lambda) \right) \le \alpha < 2 \quad \text{and} \quad d\big(\theta(P_\lambda), \theta(Q_{\lambda'})\big) \ge \Delta > 0
\]
for a semi-distance $d$, functional $\theta\colon \mathcal{P} \to \mathbb{R}^k$, and all $\lambda, \lambda'$, then
\[
\inf_{\hat\theta} \sup_{P \in \mathcal{P}} E_P\, \ell\left\{ d\big(\hat\theta, \theta(P)\big) \right\} \ge \ell(\Delta/2) \left( \frac{1 - \sqrt{\alpha(1 - \alpha/4)}}{2} \right)
\]
for any monotonic non-negative loss function $\ell$.

This lemma illustrates the intuition of the proof. On the one hand, we aim to verify that the constructed pair of mixture distributions is close in Hellinger distance, i.e., the mixtures are statistically indistinguishable. On the other hand, we must prove that the statistical functional is still separated by the distance $\Delta$. Then, the minimax rate is given by the expression in the lemma, which directly depends on this functional separation $\Delta$. In summary, we want the two mixture distributions to be close; however, the statistical functional must be maximally separated, providing us with a tight minimax lower bound.

8 Discussion

In this work, we studied optimal estimation of mean potential outcomes for high-dimensional treatments. We proposed estimators that achieve fast convergence rates and allow for consistent estimation under sparsity assumptions even when the number of treatments is large relative to the sample size. Furthermore, we proved minimax optimality of the proposed estimators in the sparse, structure-agnostic regime for single multi-valued treatments.

We also shed light on the differences between various high-dimensional treatment settings, including single multi-valued treatments, as well as binary and continuous vector treatments. We identified fundamental challenges inherent in these treatment settings: First, our target parameter is a high-dimensional vector, making estimation increasingly difficult for many treatments. Second, for discrete high-dimensional treatments, we additionally encounter inevitable positivity violations. Finally, for single multi-valued treatments, we also face the challenge that the effective sample size is $n/k$ instead of $n$, as we only have approximately $n/k$ observations at each treatment level, even when the treatment is perfectly uniform. Despite these three fundamental challenges, our proposed estimators achieve fast convergence rates under sparsity.

We unified the estimation procedure across all treatment settings into a single general sparse pseudo-outcome regression framework, which can be applied to estimate mean potential outcomes as a special case. This proposed pseudo-outcome regression framework is applicable in general beyond our work here. It can be used to estimate arbitrary sparse high-dimensional statistical functionals by regressing an appropriately constructed estimated pseudo-outcome on features subject to a sparsity constraint, enabling efficient estimation with fast convergence rates.

In future work, our framework could be applied to other high-dimensional target parameters, e.g., the indirectly standardized excess risk or outcome ratio [Susmann et al., 2024]. This can be achieved by formulating a sparse model for the excess risk or outcome ratio, and utilizing the efficient influence functions of these parameters to construct suitable pseudo-outcomes. Finally, the assumptions of Theorem 1 can be verified, confirming that our framework yields a doubly robust efficient estimator in this high-dimensional indirect standardization setting.
For future work, our results have numerous other possible extensions: First, our paper proves the optimality of the proposed estimators for single multi-valued treatments under exact sparsity. Similarly, one could verify optimality under approximate sparsity and in the vector treatment setting. Second, one could investigate whether our proposed pseudo-outcome regression framework can be extended to sparse generalized linear models, thereby enabling the modeling of nonlinear relationships in the marginal structural model used in the vector treatment setting. Finally, valid inference and hypothesis testing could be of future interest.

Acknowledgements

EHK was supported by NSF CAREER Award 2047444.

References

C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

O. C. Andreu, A. Vlontzos, M. O'Riordan, and C. M. Gilligan-Lee. Contrastive representations of high-dimensional, structured treatments. arXiv preprint arXiv:2411.19245, 2024.

S. Athey, G. W. Imbens, and S. Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4):597–623, 2018.

S. Balakrishnan, E. H. Kennedy, and L. Wasserman. The fundamental limits of structure-agnostic functional estimation. arXiv preprint arXiv:2305.04116, 2023.

A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650, 2014.

P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.

L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.

M. Bonvini and E. H. Kennedy. Fast convergence rates for dose-response estimation. arXiv preprint arXiv:2207.11825, 2022.

J. Bradic, S. Wager, and Y. Zhu. Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744, 2019.

S. Chatterjee. Assumptionless consistency of the lasso. arXiv preprint arXiv:1303.5817, 2014.

V. Chernozhukov, W. K. Newey, and R. Singh. Debiased machine learning of global and local parameters using regularized Riesz representers. The Econometrics Journal, 25(3):576–601, 2022.

A. D'Amour, P. Ding, A. Feller, L. Lei, and J. Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.

I. Díaz and M. J. van der Laan. Targeted data adaptive estimation of the causal dose-response curve. Journal of Causal Inference, 1(2):171–192, 2013.

J.-H. Du, Z. Zeng, E. H. Kennedy, L. Wasserman, and K. Roeder. Causal inference for genomic data with multiple heterogeneous outcomes. Journal of the American Statistical Association, 2025.

M. H. Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.

A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood-Doughty, J. Eisenstein, J. Grimmer, R. Reichart, M. E. Roberts, B. M. Stewart, V. Veitch, and D. Yang. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158, 2022.
D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.

M. Goplerud, K. Imai, and N. E. Pashley. Estimating heterogeneous causal effects of high-dimensional treatments: Application to conjoint analysis. arXiv preprint arXiv:2201.01357, 2022.

T. Hastie, R. Tibshirani, and R. Tibshirani. Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons. Statistical Science, 35(4), 2020.

I. A. Ibragimov, A. S. Nemirovskii, and R. Khas'minskii. Some problems on nonparametric estimation in Gaussian white noise. Theory of Probability & Its Applications, 31(3):391–406, 1987.

Y. Ingster, J. I. Ingster, and I. Suslina. Nonparametric Goodness-of-Fit Testing under Gaussian Models, volume 169. Springer Science & Business Media, 2003.

K. Jiang, R. Mukherjee, S. Sen, and P. Sur. A new central limit theorem for the augmented IPW estimator: Variance inflation, cross-fit covariance and beyond. arXiv preprint arXiv:2205.10198, 2022.

J. Jin and V. Syrgkanis. Structure-agnostic optimality of doubly robust learning for treatment effect estimation. arXiv preprint arXiv:2402.14264, 2025.

E. H. Kennedy. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, 114(526):645–656, 2019.

E. H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.

E. H. Kennedy, Z. Ma, M. D. McHugh, and D. S. Small. Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B, 79(4):1229–1245, 2017.

E. H. Kennedy, S. Balakrishnan, J. M. Robins, and L. Wasserman. Minimax rates for heterogeneous causal effect estimation. The Annals of Statistics, 52(2):793–816, 2024.

S. Lin, H. Lan, and V. Syrgkanis. Learning treatment representations for downstream instrumental variable regression. arXiv preprint arXiv:2506.02200, 2025.

L. Liu, X. Wang, and Y. Wang. Root-n consistent semiparametric learning with high-dimensional nuisance functions under minimal sparsity. arXiv preprint arXiv:2305.04174, 2023.

N. Mitra, J. Roy, and D. Small. The future of causal inference. American Journal of Epidemiology, 191(10):1671–1676, 2022.

R. Nabi, T. McNutt, and I. Shpitser. Semiparametric causal sufficient dimension reduction of multidimensional treatments. Uncertainty in Artificial Intelligence, pages 1445–1455, 2022.

A. Nemirovski. Topics in non-parametric statistics. Ecole d'Eté de Probabilités de Saint-Flour, 28:85, 2000.

G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

P. Rigollet and J.-C. Hütter. High-dimensional statistics. arXiv preprint arXiv:2310.19244, 2023.

J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009.

D. B. Rubin and M. J. van der Laan. A general imputation methodology for nonparametric regression with censored data. UC Berkeley Division of Biostatistics Working Paper Series, 194, 2005.

K. Schindl, S. Shen, and E. H. Kennedy. Incremental effects for continuous exposures. arXiv preprint arXiv:2409.11967, 2024.
A. Sharma, G. Gupta, R. Prasad, A. Chatterjee, L. Vig, and G. Shroff. Hi-CI: Deep causal inference in high dimensions. Proceedings of the 2020 KDD Workshop on Causal Discovery, pages 39–61, 2020.

E. Smucler, A. Rotnitzky, and J. M. Robins. A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737, 2019.

H. Susmann, Y. Li, M. A. McAdams-DeMarco, I. Díaz, and W. Wu. Doubly robust nonparametric efficient estimation for provider evaluation. arXiv preprint arXiv:2410.19073, 2024.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

A. B. Tsybakov. Optimal rates of aggregation. Learning Theory and Kernel Machines, pages 303–313, 2003.

A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

N. Verzelen. Minimax risks for sparse regressions: Ultrahigh dimensional phenomenons. Electronic Journal of Statistics, 6, 2012.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Y. Wang and R. D. Shah. Debiased inverse propensity score weighting for estimation of average treatment effects with high-dimensional confounders. arXiv preprint arXiv:2011.08661, 2020.

Q. Xiang, Y. Yuan, D. Song, U. J. Wudil, M. H. Aliyu, C. W. Wester, and B. E. Shepherd. Double machine learning to estimate the effects of multiple treatments and their interactions. arXiv preprint arXiv:2505.12617, 2025.

Z. Zeng, S. Balakrishnan, Y. Han, and E. H. Kennedy. Causal inference with high-dimensional discrete covariates. arXiv preprint arXiv:2405.00118, 2024.

A Additional Results

A.1 No Free Lunch

In this subsection, we study the fundamental limits on how well mean potential outcomes can possibly be estimated for discrete high-dimensional treatments. Specifically, we provide two minimax lower bounds, for single multi-valued treatments and vector treatments, respectively, under no further structure.

First, we present the minimax lower bound in the single multi-valued treatment setting for estimating mean potential outcomes in terms of mean squared error when assuming no additional structure. Intuitively, the minimax risk is the error rate that the best estimator can achieve (in the worst case). Therefore, a lower bound on the minimax risk indicates that no estimator can perform better than the given risk in this minimax sense, i.e., every estimator must achieve the same or a worse error rate. Minimax rates have crucial practical and theoretical implications: On the one hand, they provide a benchmark for the best possible estimation. On the other hand, they precisely determine the statistical difficulty of a certain estimation problem.

Proposition 1 (Minimax risk in dense regime for single multi-valued treatments). Let $A \in \{1, \dots, k\}$ be a single multi-valued treatment and define $\psi(P) = \big(E_P\{\mu_1(X)\}, \dots, E_P\{\mu_k(X)\}\big) \in \mathbb{R}^k$. Let $\mathcal{P}$ denote the set of distributions for which $\pi_a(x) \ge C'/k$ for a constant $C'$, $Y$ is binary, and $(A, Y) \perp\!\!\!\perp X$.
If $k \ge 32$, then we have
\[
\inf_{\hat\psi} \sup_{P \in \mathcal{P}} E_P\left( \sum_{a=1}^k \varpi_a(P) \big(\hat\psi_a - \psi_a(P)\big)^2 \right) \ge C \min\left\{ \frac{k}{n}, 1 \right\}
\]
for $\varpi_a(P) = P(A = a)$ and $C > 0$ a universal constant.

Essentially, this statement shows that we suffer from very slow convergence rates in this high-dimensional treatment regime, and efficient estimation is hopeless for large $k$. For instance, if $k \asymp n$, then the minimax rate is lower bounded by a rate of constant order; hence, meaningful estimation is impossible, as even with increasingly more data our estimates of the mean potential outcomes would still not improve. This illustrates the challenge of reliable estimation in this high-dimensional regime.

Intuitively, the slow convergence rate stated above can be explained as follows: We have inevitable near-violations of positivity (refer to Section 3.2). That is, even when assuming that each treatment is received with nearly the same probability, we have only around $n/k$ observations per treatment level; thus, the effective sample size for estimating $E(Y^a)$ is $n/k$ instead of $n$. This explains a lower bound that scales with $k$. Furthermore, note that the stated minimax lower bound is valid even when we assume the propensity scores to be roughly "perfectly" distributed, in the sense that the propensity score is nearly uniform, i.e., $C'/k \le \pi_a(X) \le C''/k$ for all treatments $a$ with probability one, for some constants $C', C''$ that do not depend on $n$ and $k$. This means that the slow convergence rate is a result of the high dimensionality of the treatment, rather than an "unfavorable" distribution of the treatments.

Next, we state the minimax lower bound for estimating mean potential outcomes for binary vector treatments, i.e., $A = (A_1, \dots, A_k) \in \{0,1\}^k$, under a linear marginal structural model.

Proposition 2 (Minimax risk in dense regime for binary vector treatments). Let $A \in \{0,1\}^k$ be a binary vector treatment and define $\psi$ via the linear marginal structural model
\[
E_X\big( E(Y \mid X, A_1 = a_1, \dots, A_k = a_k) \big) = \psi_0 + \sum_{j=1}^k a_j \psi_j = \psi_0 + a^{\mathsf{T}}\psi. \tag{12}
\]
Let $\mathcal{P}$ denote the set of distributions for which $E(Y \mid A_1 = a_1, \dots, A_k = a_k)$ is a linear combination of $a_1, \dots, a_k$ and $(A, Y) \perp\!\!\!\perp X$. Then we have
\[
\inf_{\hat\psi} \sup_{P \in \mathcal{P}} E_P\left[ \left\{ A^{\mathsf{T}} \big(\hat\psi - \psi(P)\big) \right\}^2 \right] \ge C \min\left\{ \frac{k}{n}, k \right\}
\]
for $C > 0$ a universal constant.

The above minimax risk indicates that we incur slow error rates, and consistent estimation is impossible when $k$ is large relative to $n$, i.e., when the treatment vector is high-dimensional. Note that under the model $\mathcal{P}$ defined in Proposition 2, the estimation of $\psi$ reduces to a classical high-dimensional linear regression problem of $Y$ on the high-dimensional features $A = (A_1, \dots, A_k)$. Consequently, Proposition 2 can be verified by using classical minimax results for high-dimensional linear regression in the dense regime when no further structure is assumed, such as Theorem 3 of Tsybakov [2003].

In summary, Proposition 1 and Proposition 2 demonstrate that consistent effect estimation for discrete high-dimensional treatments is hopeless without any further assumptions. Consequently, we must impose additional structural assumptions to achieve faster convergence rates.
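The $k/n$ scaling in Proposition 1 is easy to see in simulation. The following small sketch (ours, purely illustrative, not part of the formal results) uses uniform treatment assignment and per-level sample means; the $\varpi_a$-weighted squared error then stays roughly constant when $k$ grows proportionally to $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_risk(n, k):
    """Monte Carlo illustration of the k/n scaling: with uniform treatments
    (pi_a = 1/k), the pi_a-weighted squared error of per-level sample means
    behaves like k/n, so it does not vanish when k grows in proportion to n."""
    psi = rng.uniform(0.3, 0.7, size=k)   # arbitrary true mean outcomes
    A = rng.integers(0, k, size=n)
    Y = rng.binomial(1, psi[A])
    est = np.full(k, 0.5)                 # fallback guess for unseen levels
    for a in range(k):
        mask = A == a
        if mask.any():
            est[a] = Y[mask].mean()
    return np.mean((est - psi) ** 2)      # equals sum_a (1/k)(est_a - psi_a)^2

for n in [200, 800, 3200]:
    print(n, weighted_risk(n, k=n // 2))  # risk stays roughly constant
```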
A.2 Alternative Risk Function

In this section, we wish to discuss another risk function that appears natural, namely
\[
\hat R'(\beta) = -2\, P_n^2\left\{ \hat\varphi_1(Z)\, V_A^{\mathsf{T}}\beta + \int_{\mathcal{A}} \hat\varphi_{2,a}(Z)\, V_a^{\mathsf{T}}\beta\, d\nu(a) \right\} + \int_{\mathcal{A}} (V_a^{\mathsf{T}}\beta)^2\, d\nu(a),
\]
where we assume the setup and notation of Section 4, $\nu$ denotes an arbitrary measure on the support $\mathcal{A}$ of the treatment variable $A$, and $\hat\varphi_1$ and $\hat\varphi_{2,a}$ are estimated pseudo-outcomes. The main difference from the risk function used in (2) is that the treatments are weighted by a fixed measure rather than by the empirical weights of the treatment variable.

We point out the caveat of this for the example of single multi-valued treatments. Assuming the setup and notation of Section 5, the alternative risk turns into
\[
\hat R'(\beta) = P_n^2\left\{ \sum_{a=1}^k w(a)\, \big(\hat\varphi_a(Z) - \beta_a\big)^2 \right\} \tag{13}
\]
by setting $\hat\varphi_1(Z) = 0$, $\hat\varphi_{2,a}(Z) = \hat\varphi_a(Z)$ (here without scaling by $\sqrt{k}$), $V_a = (\mathbb{1}(a = 1), \dots, \mathbb{1}(a = k))$, and $\nu$ to be a sum of Dirac measures weighted by some fixed weight function $w$. For simplicity, we consider the case where $w = 1$: Using a similar analysis as in the proofs of Theorems 1 and 2, we can show that the Lasso and best subset selection estimators under this alternative risk function yield an error guarantee where the nuisance penalty equals
\[
s \max_{a \in \{1,\dots,k\}} \left\| \frac{\pi_a}{\hat\pi_a} - 1 \right\| \|\mu_a - \hat\mu_a\|.
\]
If we have unobserved treatment levels, then accurate estimation of the nuisance parameters $\pi_a$ and $\mu_a$ is impossible, and all we can do is guess. However, this is problematic because the nuisance penalty incurs the worst-case nuisance estimation error, as it is the maximum over all treatment levels. Our risk function in (2) avoids this by weighting the treatments according to the empirical weights on the data used for nuisance estimation (see Remark 14 for details), yielding a nuisance penalty that depends only on observed treatment levels.

A.2.1 Connection to the Gaussian Sequence Model for Single Multi-Valued Treatments

Consider single multi-valued treatments in the setting of Section 5. Under the alternative risk outlined in (13), we want to mention that it is possible to establish a connection to the Gaussian sequence model rigorously. In particular, we can show the equivalence of the Lasso estimator minimizing the alternative risk in (13) and the soft-thresholding estimator, which is given by $\hat\psi_S = (\hat\psi_{S,1}, \dots, \hat\psi_{S,k})$ where
\[
\hat\psi_{S,a} = \hat\psi_{S,a}(\lambda) = \mathrm{sign}\big( P_n\{\hat\varphi_a(Z)\} \big) \big( |P_n\{\hat\varphi_a(Z)\}| - \lambda \big)_+.
\]
This can be explained as follows: The constrained version of the Lasso, as defined in (3), minimizing the risk in (13), is equivalent to the penalized version of the Lasso. Next, we note that the soft-thresholding estimator can also be written in the penalized form
\[
\hat\psi_S = \arg\min_\psi \frac{1}{2} \sum_{a=1}^k \big( P_n\{\hat\varphi_a(Z)\} - \psi_a \big)^2 + \lambda \|\psi\|_1.
\]
Then, the equivalence follows by setting the weights $w$ in (13) to be $1/2$, and showing with simple transformations that the minimizer of $\sum_{a=1}^k ( P_n\{\hat\varphi_a(Z)\} - \psi_a )^2$ is the minimizer of the expression $P_n\{ \sum_{a=1}^k (\hat\varphi_a(Z) - \psi_a)^2 \}$.
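As a quick illustration of the equivalence above, the following minimal sketch (ours) applies the soft-thresholding estimator to per-level averages of estimated pseudo-outcomes; the threshold `lam` and the example values are hypothetical.

```python
import numpy as np

def soft_threshold(phi_bar, lam):
    """Soft-thresholding estimator: psi_{S,a} = sign(m_a)(|m_a| - lam)_+,
    where m_a = P_n{phi_a-hat(Z)} is the mean pseudo-outcome at level a."""
    return np.sign(phi_bar) * np.maximum(np.abs(phi_bar) - lam, 0.0)

# Example usage with hypothetical per-level pseudo-outcome means:
phi_bar = np.array([0.02, -0.01, 0.45, 0.00, -0.38])
print(soft_threshold(phi_bar, lam=0.05))  # small entries are zeroed out
```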
A.3 Approximate Sparsity

Depending on the application, assuming exact sparsity of the statistical functional $\psi$ might appear unreasonable. Instead, one might resort to approximate sparsity, where we do not have to assume that the majority of the entries of $\psi$ are exactly zero. For approximate sparsity, we assume that $\|\psi\|_1 \le s$.

A.3.1 Master Theorem under Approximate Sparsity

In this subsection, we present an error guarantee for our general Lasso estimator of Section 4 under approximate sparsity. Therefore, assume the setup of Section 4.

Theorem 6 (Error rate under approximate sparsity). Let $\hat\psi = \hat\psi_{\mathrm{lasso}}$ be the Lasso estimator defined in (3), minimizing the empirical risk in (2). Suppose approximate sparsity, i.e., $\|\psi\|_1 \le s$ for some sparsity constraint $s$. Moreover, assume for all $j, l \in \{1, \dots, k\}$ and $a \in \mathcal{A}$ that

(i) (Almost sure boundedness) $\left| \int_{\mathcal{A}} \hat\varphi_a(Z) V_{a,j}\, dP_n^1(a) \right|, |V_A^{\mathsf{T}}\psi\, V_{A,j}| \le C_1 r_1(k)$ almost surely,

(ii) (Second moment boundedness) $P\left\{ \left( \int_{\mathcal{A}} \hat\varphi_a(Z) V_{a,j}\, dP_n^1(a) \right)^2 \right\}, \left( \int_{\mathcal{A}} V_a^{\mathsf{T}}\psi\, V_{a,j}\, dP_n^1(a) \right)^2 \le C_1 \hat r_1(k)$ and $P\{ (V_A^{\mathsf{T}}\psi\, V_{A,j})^2 \} \le C_1 r_1(k)$ almost surely, where $\hat r_1(k)$ can depend on $D_1$ and $E_{D_1}\{\hat r_1(k)^j\} \le R\, r_1(k)^j$ for $j = \tfrac{1}{2}, 1$, and

(iii) (Nuisance estimation error rate) $\left| P\{ \hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi \} \right| \le C_2 r_{2,a}(n, k)$

for rates $r_1 = r_1(k) \gtrsim 1$, $r_2 = r_2(n, k)$, and constants $R, C_1, C_2 > 0$ that are independent of $n$ and $k$. Then,
\[
E\left[ P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] \lesssim s \sqrt{\frac{r_1(k) \log k}{n}} + s\, E\left\{ \max_{j \in \{1,\dots,k\}} \int_{\mathcal{A}} r_{2,a}(n, k) |V_{a,j}|\, dP_n^1(a) \right\}
\]
for $r_1(k) \log k \lesssim n$.

First, completely analogously to the error rate under exact sparsity, the given error rate consists of an oracle rate and a nuisance estimation error rate. Second, for assuming only approximate sparsity, we pay the price of slower convergence rates. The rate obtained under approximate sparsity, containing a term of order $\sqrt{\log(k)/n}$, is also called the slow Lasso rate. In contrast, the rate obtained under exact sparsity, containing a term of the faster order $\log(k)/n$, is called the fast Lasso rate. This follows a well-known pattern in the high-dimensional regression literature; e.g., refer to Raskutti et al. [2011]. Third, although the assumptions are largely identical to those in Theorem 1, we do not assume restricted eigenvalues or the sample covariance condition here. The reason is that these assumptions, in combination with exact sparsity, allowed us to achieve the fast Lasso rate. As we no longer assume exact sparsity, the restricted eigenvalue condition and sample covariance condition are consequently also no longer necessary.

Remark 19 (In-sample vs. out-of-sample error). It is possible to state the above error bound when the prediction error is weighted by the marginal distribution of the treatments instead of its empirical proportions, at the cost of an $s^2$-dependence instead of an $s$-dependence in the oracle rate. That is,
\[
E\left[ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right] \lesssim s^2 \sqrt{\frac{r_1(k) \log k}{n}} + s\, E\left\{ \max_{j \in \{1,\dots,k\}} \int_{\mathcal{A}} r_{2,a}(n, k) |V_{a,j}|\, dP_n^1(a) \right\},
\]
where we need to additionally assume that $|V_{A,j} V_{A,l}|, P\{V_{A,j}^2 V_{A,l}^2\} \le C_1 r_1(k)$. This can be verified analogously to the proof of Theorem 1, where the error is decomposed into the expected in-sample error and an empirical process term. The expected in-sample error can be bounded as stated by the above theorem, and the empirical process term can be analyzed using the calculation in (22) and the additionally assumed almost-sure and second-moment boundedness.

Remark 20 (Analysis of Lasso vs. best subset selection).
Note that the above theorem for the approximately sparse case only provides an error guarantee for the Lasso estimator, not for the best subset selection estimator. This can be explained by the fact that the proof relies on bounding $\|\hat\psi - \psi\|_1 \le 2s$. However, $\|\hat\psi_{\mathrm{subset}}\|_1$ does not have to be bounded by the order $s$: best subset selection constrains the $L_0$-norm, where the $L_0$-constraint tuning parameter has to be respected by the true $\psi$, i.e., we choose the true sparsity $\|\psi\|_0$ as tuning parameter so that the basic Lasso inequality $\hat R(\hat\psi_{\mathrm{subset}}) \le \hat R(\psi)$ can be applied in the proof. However, since we assume only approximate sparsity, we might have $\|\psi\|_0 = k$. Hence, in this case, the best subset selection estimator does not incorporate a constraint in the least squares regression, and, in particular, $\|\hat\psi_{\mathrm{subset}}\|_1$ does not have to be bounded by $s$.

A.3.2 Estimating Mean Potential Outcomes for Single Multi-Valued Treatments under Approximate Sparsity

In this section, we provide an error guarantee for the proposed Lasso estimator of mean potential outcomes for single multi-valued treatments under approximate sparsity, i.e., when assuming $\|\psi\|_1 \le s$. Therefore, assume the setup of Section 5. We show that a slow Lasso rate can be achieved and that fast convergence rates are possible even when there are many treatments in proportion to the sample size.

For some practical applications, exact sparsity may not appear to be a realistic assumption, as it only allows $s$ mean potential outcomes to differ from some intercept value, while all other mean potential outcomes must be exactly equal to it. For that reason, some applications must resort to approximate sparsity. Although this assumption also requires that the mean potential outcomes do not deviate too much from the intercept, it imposes no constraint on the number of mean potential outcomes that may differ from the intercept. For example, all mean potential outcomes would be allowed to differ from the intercept as long as these deviations are small enough.

Theorem 7. Let $\hat\psi = \hat\psi_{\mathrm{lasso}}$ be the Lasso estimator defined in (3) minimizing the empirical risk in (7) for $\hat\varphi_a$ as in (6). Assume the model in (5) with approximate sparsity, i.e., $\|\psi\|_1 \le s$ for some sparsity constraint $s$. Moreover, assume that for all $a \in \{1, \dots, k\}$

(i) (Nearly uniform treatment probabilities) $\frac{c}{k} \le \varpi_a = P(A = a) \le \frac{C}{k}$,

(ii) (Estimated propensity score close to empirical weights) $|\hat\varpi_a / \hat\pi_a(X)| \le B$ almost surely, and

(iii) (Boundedness) $|\hat\mu_a(X)|, |Y|, |\psi_0| \le B$ almost surely

for constants $c, C, B > 0$ that are independent of $k$ and $n$. Then,
\[
E\left( \sum_{a=1}^k w(a) (\hat\psi_a - \psi_a)^2 \right) \lesssim s \sqrt{\frac{\log k}{kn}} + \frac{s}{k} \delta_n \epsilon_n
\]
for $w(a) = P_n^2\{\mathbb{1}(A = a)\}$ and $n \gtrsim k \log k$, when $\delta_n(a) \le \delta_n$ and $\epsilon_n(a) \le \epsilon_n$ for all $a \in \{1, \dots, k\}$.

The price we pay for assuming approximate rather than exact sparsity is that the rate obtained above is slower. For the oracle part of the rate, we now obtain $\sqrt{\frac{\log k}{kn}}$ instead of $\frac{\log k}{n}$. Similarly, the nuisance penalty now contains $\delta_n \epsilon_n$ instead of $(\delta_n \epsilon_n)^2$. The requirement on the nuisance error to achieve the oracle rate is the same as for exact sparsity, as we need $\delta_n \epsilon_n \lesssim \sqrt{\frac{k \log k}{n}}$, which is, for instance, satisfied when $\delta_n, \epsilon_n \asymp \left( \frac{k \log k}{n} \right)^{1/4}$.
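To make the bookkeeping explicit, the following short calculation (ours, added for clarity) verifies that fourth-root nuisance rates suffice:
\[
\delta_n \epsilon_n \asymp \left( \frac{k \log k}{n} \right)^{1/4} \left( \frac{k \log k}{n} \right)^{1/4} = \sqrt{\frac{k \log k}{n}}, \qquad \text{so} \qquad \frac{s}{k}\, \delta_n \epsilon_n \asymp \frac{s}{k} \sqrt{\frac{k \log k}{n}} = s \sqrt{\frac{\log k}{kn}},
\]
i.e., the nuisance penalty is then of exactly the same order as the oracle part of the rate.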
For a discussion of the assumptions, we refer to the remarks belonging to Theorem 2.

Remark 21. Similar to the corresponding result under exact sparsity in Theorem 2, it is possible to derive the more informative upper bound
\[
E\left( \sum_{a=1}^k w(a)(\hat\psi_a - \psi_a)^2 \right) \lesssim s \sqrt{\frac{\log k}{kn}} + s\, E\left[ \max_{a \in \{1,\dots,k\}} \{\hat\varpi_a \delta_n(a) \epsilon_n(a)\} \right],
\]
which reveals that the nuisance penalty is weighted by the empirical proportion at each treatment level. This result is shown as part of the above theorem's proof. Due to Remark 19 on the master theorem under approximate sparsity (which is used to prove this result), it is also possible to state the above upper bound for the out-of-sample error instead of the expected in-sample error, at the cost of an additional $s^2$-dependence, i.e.,
\[
E\left( \sum_{a=1}^k \varpi_a (\hat\psi_a - \psi_a)^2 \right) \lesssim s^2 \sqrt{\frac{\log k}{kn}} + s\, E\left[ \max_{a \in \{1,\dots,k\}} \{\hat\varpi_a \delta_n(a) \epsilon_n(a)\} \right] \lesssim s^2 \sqrt{\frac{\log k}{kn}} + \frac{s}{k} \delta_n \epsilon_n
\]
for $\varpi_a = P(A = a)$.

A.3.3 Estimating Mean Potential Outcomes for Vector Treatments under Approximate Sparsity

In this section, we provide an error guarantee for the Lasso estimator for vector treatments under approximate sparsity, i.e., when assuming $\|\psi\|_1 \le s$. Therefore, assume the setup of Section 6. We demonstrate that a slow Lasso rate can be achieved, and that fast convergence rates are possible even when the dimension of the vector treatment is large relative to the sample size.

Theorem 8. Let $\hat\psi = \hat\psi_{\mathrm{lasso}}$ be the Lasso estimator defined in (3) minimizing the empirical risk in (10) for $\hat\varphi_a$ as in (9). Assume the model defined in (8) and suppose approximate sparsity, i.e., $\|\psi\|_1 \le s$ for some sparsity constraint $s$. Moreover, assume that for all $a \in \{1, \dots, k\}$

(i) (Estimated propensity score close to empirical weights) $|\hat\varpi_a / \hat\pi_a(X)| \le B$, and

(ii) (Boundedness) $|Y|, |\hat\mu|, |\psi_0|, \|A\|_\infty \le B$ almost surely

for a constant $B > 0$ that is independent of $k$ and $n$. Then,
\[
E\left[ P_n^2\left\{ \big( A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] \lesssim s \sqrt{\frac{\log k}{n}} + s \delta_n \epsilon_n
\]
for $\log k \lesssim n$.

The obtained rate again consists of two parts, analogously to Theorem 3. The price we pay for only assuming approximate sparsity is that we obtain a slow Lasso rate. For further discussion of the assumptions and implications, we refer to the remarks of Theorem 3.

Remark 22. Similar to the result under exact sparsity in Theorem 3, it is possible to show the more informative upper bound where the nuisance errors are weighted individually at each treatment level, i.e.,
\[
E\left[ P_n^2\left\{ \big( A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] \lesssim s \sqrt{\frac{\log k}{n}} + s\, E\left( \sum_{a \in \mathcal{A}} \hat\varpi_a \delta_n(a) \epsilon_n(a) \right).
\]
This result is proved as part of the above theorem's proof. Moreover, based on Remark 19, we can state the above bound for the out-of-sample error at the cost of an additional $s^2$-dependence, i.e.,
\[
E\left[ \big( A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right] \lesssim s^2 \sqrt{\frac{\log k}{n}} + s\, E\left( \sum_{a \in \mathcal{A}} \hat\varpi_a \delta_n(a) \epsilon_n(a) \right) \lesssim s^2 \sqrt{\frac{\log k}{n}} + s \delta_n \epsilon_n.
\]

A.4 Verification of the Sample Covariance Condition for Vector Treatments

In this subsection, we provide a detailed example of when the sample covariance condition of Theorem 3 for vector treatments is satisfied: Assume $A = (A_1, \dots, A_k) \in \{0,1\}^k$ with independent components and $P(A_j = 1) = 1/2$ for all $j$.
We want to verify that the sample covariance condition is satisfied in this special case, i.e., we want to show that
\[
\sup_{v \in V} |v^{\mathsf{T}}\Delta v| = \sup_{v \in V} \left| v^{\mathsf{T}}\left( \frac{1}{n}\sum_{i=1}^n A_i A_i^{\mathsf{T}} - E(AA^{\mathsf{T}}) \right) v \right| > \frac{D}{2}
\]
with probability less than $\alpha \le \frac{\log k}{n}$, where $V = \{v \in \mathbb{R}^k \mid \|v\|_2 = 1, \|v\|_1 \le 2\sqrt{s}\}$ and $\Delta = \frac{1}{n}\mathbb{A}^{\mathsf{T}}\mathbb{A} - E(AA^{\mathsf{T}})$.

It can be verified with a straightforward calculation that $|v^{\mathsf{T}}\Delta v| \le \|\Delta\|_\infty \|v\|_1^2$. For $v \in V$ we have $\|v\|_1 \le 2\sqrt{s}$, so $\sup_{v \in V}|v^{\mathsf{T}}\Delta v| \le 4s\|\Delta\|_\infty$. Hence, it suffices to bound $\|\Delta\|_\infty$. We start by bounding the entries of $\Delta$. We have
\[
\Delta_{lm} = \frac{1}{n}\sum_{i=1}^n \{A_{i,l}A_{i,m} - E(A_{i,l}A_{i,m})\},
\]
which is a sum of independent mean-zero random variables. Using the facts that $|A_{i,l}A_{i,m} - E(A_{i,l}A_{i,m})| \le 2B^2$ and $P\{(A_{i,l}A_{i,m} - E(A_{i,l}A_{i,m}))^2\} \le 4B^4$ under the assumption $\|A\|_\infty \le B$, we can apply Bernstein's inequality to obtain
\[
P(|\Delta_{lm}| \ge t) \le 2\exp\left( -\frac{nt^2}{8B^4 + \frac{4}{3}B^2 t} \right).
\]
By the union bound, we obtain
\[
P(\|\Delta\|_\infty \ge t) \le k^2 \max_{l,m} P(|\Delta_{lm}| \ge t) \le 2k^2 \exp\left( -\frac{nt^2}{8B^4 + \frac{4}{3}B^2 t} \right).
\]
So,
\[
P\left( \sup_{v \in V} |v^{\mathsf{T}}\Delta v| \ge \frac{D}{2} \right) \le P\left( \|\Delta\|_\infty \ge \frac{D}{8s} \right) \le 2k^2 \exp\left( -\frac{n(D/8s)^2}{8B^4 + \frac{D}{6s}B^2} \right).
\]
We need the right-hand side to be bounded by $\frac{\log k}{n}$, i.e., we require
\[
2k^2 \exp\left( -\frac{n(D/8s)^2}{8B^4 + \frac{D}{6s}B^2} \right) \le \frac{\log k}{n},
\]
which is satisfied for $n \gtrsim s^2 \log k$.
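The concentration argument above is easy to check numerically. The following small sketch (ours, purely illustrative) estimates $\|\Delta\|_\infty$ for the independent Bernoulli$(1/2)$ design and compares it to the $\sqrt{\log(k)/n}$ scale that drives the $n \gtrsim s^2 \log k$ requirement; the chosen values of $n$ and $k$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_deviation(n, k):
    """Entrywise sup-norm of Delta = (1/n) A^T A - E(A A^T) for a design with
    independent Bernoulli(1/2) components, where E(A A^T) has diagonal 1/2
    and off-diagonal entries 1/4."""
    A = rng.integers(0, 2, size=(n, k)).astype(float)
    Sigma = np.full((k, k), 0.25) + 0.25 * np.eye(k)
    return np.abs(A.T @ A / n - Sigma).max()

k = 100
for n in [500, 2000, 8000]:
    # observed max deviation vs. the sqrt(log k / n) reference scale
    print(n, max_deviation(n, k), np.sqrt(np.log(k) / n))
```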
B Proofs of Upper Bounds

B.1 Proof of Theorem 1

The vast majority of the proof is identical for the Lasso estimator $\hat\psi_{\mathrm{lasso}}$ and the best subset selection estimator $\hat\psi_{\mathrm{subset}}$. In the following, $\hat\psi$ can be replaced by either of the estimators, and the statements hold. In the parts of the proof where it matters which estimator is being used, we will explicitly highlight the difference in the analysis.

We start by decomposing the error into an expected in-sample error and an empirical process term:
\[
E\left[ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right] = E\left[ P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] + E\left[ (P - P_n^2)\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right]. \tag{14}
\]
First, we handle the expected in-sample error and want to bound it by the desired order, i.e., we wish to show that
\[
E\left[ P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] \lesssim \frac{s\, r_1(k) \log k}{n} + s\, E\left( \max_{j \in \{1,\dots,k\}} \left\{ \int_{\mathcal{A}} r_{2,a}(n, k) |V_{a,j}|\, dP_n^1(a) \right\}^2 \right).
\]
Afterwards, we show that the empirical process term can be bounded by a term of the same order.

B.1.1 Basic Lasso Inequality

First, by applying the basic Lasso inequality, we obtain
\[
\begin{aligned}
E\left[ P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right]
&= E\left[ P_n^2\left\{ (V_A^{\mathsf{T}}\hat\psi)^2 - 2(V_A^{\mathsf{T}}\hat\psi)(V_A^{\mathsf{T}}\psi) + (V_A^{\mathsf{T}}\psi)^2 \right\} \right] \\
&= E\bigg[ P_n^2\left\{ -2\int_{\mathcal{A}} \hat\varphi_a(Z)\, V_a^{\mathsf{T}}\hat\psi\, dP_n^1(a) + (V_A^{\mathsf{T}}\hat\psi)^2 \right\} \\
&\qquad + 2 P_n^2\left\{ \int_{\mathcal{A}} \hat\varphi_a(Z)\, V_a^{\mathsf{T}}\hat\psi\, dP_n^1(a) \right\} + P_n^2\left\{ -2(V_A^{\mathsf{T}}\hat\psi)(V_A^{\mathsf{T}}\psi) + (V_A^{\mathsf{T}}\psi)^2 \right\} \bigg] \\
&\le E\bigg[ P_n^2\left\{ -2\int_{\mathcal{A}} \hat\varphi_a(Z)\, V_a^{\mathsf{T}}\psi\, dP_n^1(a) + (V_A^{\mathsf{T}}\psi)^2 \right\} \\
&\qquad + 2 P_n^2\left\{ \int_{\mathcal{A}} \hat\varphi_a(Z)\, V_a^{\mathsf{T}}\hat\psi\, dP_n^1(a) \right\} + P_n^2\left\{ -2(V_A^{\mathsf{T}}\hat\psi)(V_A^{\mathsf{T}}\psi) + (V_A^{\mathsf{T}}\psi)^2 \right\} \bigg]. \tag{15}
\end{aligned}
\]
Further simplifying the right-hand side gives
\[
= E\left[ P_n^2\left\{ 2\int_{\mathcal{A}} \hat\varphi_a(Z)\, V_a^{\mathsf{T}}(\hat\psi - \psi)\, dP_n^1(a) + 2(V_A^{\mathsf{T}}\psi)\, V_A^{\mathsf{T}}(\psi - \hat\psi) \right\} \right]
= E\left[ (P_n^2 - P_n^1)\left\{ 2 V_A^{\mathsf{T}}\psi\, V_A^{\mathsf{T}}(\psi - \hat\psi) \right\} + 2 P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big)\, V_a^{\mathsf{T}}(\hat\psi - \psi)\, dP_n^1(a) \right\} \right].
\]
By writing the scalar products as sums and taking the maximum over all treatments $j$, we obtain
\[
\le 2\, E\left[ \|\hat\psi - \psi\|_1 \max_{j \in \{1,\dots,k\}} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \right]. \tag{16}
\]

B.1.2 Restricted Eigenvalue Assumption

Let $v \in \mathbb{R}^k$ such that $\|v\|_2 = 1$ and $\|v\|_1 \le 2\sqrt{s}$. Then, we obtain
\[
\frac{1}{n}\|\mathbb{A}v\|_2^2 = v^{\mathsf{T}}\left( \frac{1}{n}\mathbb{A}^{\mathsf{T}}\mathbb{A} \right)v = v^{\mathsf{T}}\hat\Sigma v \ge v^{\mathsf{T}}\Sigma v - \frac{C_3}{2} \ge \frac{C_3}{2}
\]
with probability at least $1 - \alpha$, where we used Assumptions (iv) and (v) to obtain the inequalities. Using a rescaling argument, the sample restricted eigenvalue condition holds with high probability for all vectors $v$ such that $\|v\|_1 \le 2\sqrt{s}\|v\|_2$, i.e.,
\[
\frac{1}{n}\|\mathbb{A}v\|_2^2 \ge \frac{C_3}{2}\|v\|_2^2 \tag{17}
\]
with probability at least $1 - \alpha$.

We now leverage the exact sparsity assumption to show that the $L_1$-norm of $\hat\psi - \psi$ can be bounded accordingly in terms of the $L_2$-norm, i.e., $\|\hat\psi - \psi\|_1 \le 2\sqrt{s}\|\hat\psi - \psi\|_2$. If we consider the best subset selection estimator, we know that $\|\hat\psi_{\mathrm{subset}}\|_0 \le s$, and therefore $\|\hat\psi_{\mathrm{subset}} - \psi\|_1 \le 2\sqrt{s}\|\hat\psi_{\mathrm{subset}} - \psi\|_2$. The same inequality can be shown for the Lasso estimator $\hat\psi_{\mathrm{lasso}}$ with a bit of extra work: The facts that $\|\psi\|_0 \le s$ and $\|\hat\psi_{\mathrm{lasso}}\|_1 \le \|\psi\|_1$ imply that $\|\hat\psi_{\mathrm{lasso}} - \psi\|_{1,S^c} \le \|\hat\psi_{\mathrm{lasso}} - \psi\|_{1,S}$, where $S$ is the support of $\psi$; for details, we refer to the proof of Theorem 7.8 of Wainwright [2019]. Using this property that the $L_1$-error off the support is dominated by the $L_1$-error on the support (analogously to its use in Theorem 7.13 of Wainwright [2019]), we obtain
\[
\|\hat\psi_{\mathrm{lasso}} - \psi\|_1 = \|\hat\psi_{\mathrm{lasso}} - \psi\|_{1,S} + \|\hat\psi_{\mathrm{lasso}} - \psi\|_{1,S^c} \le 2\|\hat\psi_{\mathrm{lasso}} - \psi\|_{1,S} \le 2\sqrt{s}\|\hat\psi_{\mathrm{lasso}} - \psi\|_2.
\]
Consequently, we proved that
\[
\|\hat\psi - \psi\|_1 \le 2\sqrt{s}\|\hat\psi - \psi\|_2 \tag{18}
\]
irrespective of which of the two estimators is used.

We can now proceed with the same analysis for both estimators again and use the restricted eigenvalue condition established in (17) for $v = \hat\psi - \psi$ to obtain
\[
\|\hat\psi - \psi\|_2^2 \le \frac{2}{C_3} P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\}
\]
with probability at least $1 - \alpha$. Then, we can again apply the basic Lasso inequality to the right-hand side, so
\[
\|\hat\psi - \psi\|_2^2 \le \frac{4}{C_3}\|\hat\psi - \psi\|_1 \max_{j \in \{1,\dots,k\}} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right|, \tag{19}
\]
and by combining (18) and (19),
\[
\|\hat\psi - \psi\|_1 \le \frac{16 s}{C_3} \cdot \max_{j \in \{1,\dots,k\}} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \tag{20}
\]
with probability at least $1 - \alpha$.

Now, denote by SC the high-probability event that the sample covariance condition in Assumption (v) holds. Conditional on this event, we showed that the previous inequality holds. The next step is to decompose the expectation in (16) into
\[
\begin{aligned}
&2 E\left[ \mathbb{1}_{\mathrm{SC}} \cdot \|\hat\psi - \psi\|_1 \max_{j \in \{1,\dots,k\}} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \right] \\
&\quad + 2 E\left[ \mathbb{1}_{\mathrm{SC}^c} \cdot \|\hat\psi - \psi\|_1 \max_{j \in \{1,\dots,k\}} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \right].
\end{aligned}
\]
We show that the second expectation, outside the event SC, is negligible, i.e., that it suffices to bound the expectation on the event SC.
Outside the event SC, we can bound
\[
2 E\left[ \mathbb{1}_{\mathrm{SC}^c} \cdot \|\hat\psi - \psi\|_1 \max_{j} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \right] \lesssim E\{\mathbb{1}_{\mathrm{SC}^c} \cdot s \cdot r_1(k)\} = s\, r_1(k)\, P(\mathrm{SC}^c) \le s\, r_1(k)\, \alpha \lesssim \frac{s\, r_1(k)\log k}{n}
\]
by using the almost sure boundedness assumption (i) to obtain the first inequality, and the assumption on $\alpha$ to obtain the last inequality. The remainder of the proof will show that this does not contribute to the overall error beyond constants.

It remains to bound the expectation on the event SC: We use (20) to obtain
\[
\begin{aligned}
&2 E\left[ \mathbb{1}_{\mathrm{SC}} \cdot \|\hat\psi - \psi\|_1 \max_{j} \left| P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right| \right] \\
&\quad \le \frac{32 s}{C_3} E\left( \max_{j} \left[ P_n^2\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} + (P_n^2 - P_n^1)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right]^2 \right).
\end{aligned}
\]

B.1.3 Decomposition of the Error

Using the fact that $(a + b)^2 \le 2a^2 + 2b^2$, we decompose the previous line into
\[
\begin{aligned}
&\underbrace{\frac{128 s}{C_3} E\left[ \max_{j} \left( (P_n^2 - P)\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right)^2 \right]}_{T_1}
+ \underbrace{\frac{128 s}{C_3} E\left[ \max_{j} \left( P\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right)^2 \right]}_{T_2} \\
&\quad + \underbrace{\frac{128 s}{C_3} E\left[ \max_{j} \left( (P_n^2 - P)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right)^2 \right]}_{T_3}
+ \underbrace{\frac{128 s}{C_3} E\left[ \max_{j} \left( (P_n^1 - P)\left\{ V_A^{\mathsf{T}}\psi\, V_{A,j} \right\} \right)^2 \right]}_{T_4}.
\end{aligned}
\]
The following steps upper-bound the four terms $T_1$, $T_2$, $T_3$, and $T_4$ separately by the desired error rates. Then, combining the obtained rates proves the theorem.

B.1.4 Bounding the Term $T_1$

We recall a useful lemma given in van der Vaart and Wellner [1996] (refer to Lemma 2.2.10), which will be the main ingredient in obtaining an upper bound of the desired order.

Lemma 2. Let $X_1, \dots, X_k$ be arbitrary random variables that satisfy the tail bound
\[
P(|X_i| \ge x) \le 2\exp\left( -\frac{1}{2}\frac{x^2}{b + ax} \right) \tag{21}
\]
for all $x$ and fixed $a, b > 0$. Then,
\[
\left\| \max_{1 \le i \le k} X_i \right\|_{\Psi_1} \le K\left( a\log(1 + k) + \sqrt{b}\sqrt{\log(1 + k)} \right)
\]
for a universal constant $K$.

It is well known (e.g., refer to Chapter 2.2 of van der Vaart and Wellner [1996]) that $\|X\|_p \le p!\, \|X\|_{\Psi_1}$, i.e., an upper bound on the exponential Orlicz norm also provides an upper bound on the $L_p$-norm up to a constant. In particular, this allows us to reformulate the previous lemma in terms of the $L_2$-norm.

Lemma 3. Under the notation and assumptions of Lemma 2, we have
\[
E\left( \max_{1 \le i \le k} |X_i|^2 \right) \le 4K^2\left( a\log(1 + k) + \sqrt{b}\sqrt{\log(1 + k)} \right)^2.
\]

These results will be used to bound the empirical process term $T_1$ in our decomposition. It remains to justify the assumptions of Lemma 3. In particular, we need to show that the random variables $(P_n^2 - P)\{\int_{\mathcal{A}}(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi) V_{a,j}\, dP_n^1(a)\}$, $j = 1, \dots, k$, satisfy the tail bound in (21). This can be done using Bernstein's inequality, as the random variables are independent (conditionally on $D_1$) and mean-zero. By Assumptions (i) and (ii), we have
\[
\left| \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right| \le 2 C_1 r_1(k) \quad \text{and} \quad P\left\{ \left( \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right)^2 \right\} \le 4 C_1 \hat r_1(k).
\]
By using these almost sure and $L_2$ upper bounds (and scaling them by $1/n$), Bernstein's inequality gives
\[
P\left( \left| (P_n^2 - P)\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right| \ge x \right) \le 2\exp\left( -\frac{1}{2}\frac{x^2}{b + ax} \right)
\]
with $a \asymp \max\{r_1(k), \hat r_1(k)\}/n$ and $b \asymp \max\{r_1(k), \hat r_1(k)\}/n$. Then, Lemma 3 provides us with the error rate of the desired order by plugging in $a$ and $b$, i.e.,
\[
\begin{aligned}
T_1 &= E\left( \max_{j} \left[ (P_n^2 - P)\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right]^2 \right) \\
&= E\left\{ E\left( \max_{j} \left[ (P_n^2 - P)\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right]^2 \,\middle|\, D_1 \right) \right\} \\
&\lesssim E\left( \frac{\max\{r_1(k), \hat r_1(k)\}^2 \log^2 k}{n^2} + \frac{\max\{r_1(k), \hat r_1(k)\} \log k}{n} \right).
\end{aligned}
\]
Since $E[\max\{r_1(k), \hat r_1(k)\}] \le r_1(k) + E\{\hat r_1(k)\} \lesssim r_1(k)$ and $E[\max\{r_1(k), \hat r_1(k)\}^2] \le r_1(k)^2 + E\{\hat r_1(k)^2\} \lesssim r_1(k)^2$ according to Assumption (ii), we obtain
\[
T_1 \lesssim \frac{r_1(k)\log k}{n} + \frac{r_1(k)^2 \log^2 k}{n^2} \lesssim \frac{r_1(k)\log k}{n}
\]
using $r_1(k)\log k \lesssim n$.

B.1.5 Bounding the Term $T_2$

We have
\[
T_2 = E\left[ \max_{j} \left( P\left\{ \int_{\mathcal{A}} \big(\hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi\big) V_{a,j}\, dP_n^1(a) \right\} \right)^2 \right] \le E\left[ \max_{j} \left( \int_{\mathcal{A}} \left| P\left\{ \hat\varphi_a(Z) - V_a^{\mathsf{T}}\psi \right\} \right| |V_{a,j}|\, dP_n^1(a) \right)^2 \right] \le C_2^2\, E\left\{ \max_{j} \left( \int_{\mathcal{A}} r_{2,a}(n, k)|V_{a,j}|\, dP_n^1(a) \right)^2 \right\}.
\]
This is the desired second part of the rate stated in the theorem.

B.1.6 Bounding the Terms $T_3$ and $T_4$

$V_A^{\mathsf{T}}\psi\, V_{A,j}$ is almost surely and second-moment bounded by $r_1(k)$ up to constants by Assumptions (i) and (ii). Hence, with an analogous analysis as for $T_1$ (applying Bernstein's inequality and using Lemma 3), we obtain the rate
\[
T_3, T_4 \lesssim \frac{r_1(k)\log k}{n}.
\]

B.1.7 Bounding the Empirical Process Term

The previous sections combined bound the expected in-sample error by the desired rate. However, it remains to bound the empirical process term in the decomposition (14), which is done in the following: Recall that SC is the event that the sample covariance condition of Assumption (v) holds. Then,
\[
E\left[ (P - P_n^2)\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \right] = E\left[ (P - P_n^2)\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \mathbb{1}_{\mathrm{SC}} \right] + E\left[ (P - P_n^2)\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} \mathbb{1}_{\mathrm{SC}^c} \right].
\]
We first examine the term where the sample covariance condition holds. We have
\[
(P - P_n^2)\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\} = (P - P_n^2)\left\{ (\hat\psi - \psi)^{\mathsf{T}} V_A V_A^{\mathsf{T}} (\hat\psi - \psi) \right\} = (\hat\psi - \psi)^{\mathsf{T}} (P - P_n^2)\left\{ V_A V_A^{\mathsf{T}} \right\} (\hat\psi - \psi).
\]
Recalling the definitions $\Sigma = P(V_A V_A^{\mathsf{T}})$ and $\hat\Sigma = \frac{1}{n}\mathbb{A}^{\mathsf{T}}\mathbb{A}$, the previous line equals $(\hat\psi - \psi)^{\mathsf{T}}(\Sigma - \hat\Sigma)(\hat\psi - \psi)$. Due to the sample covariance condition (and rescaling), we obtain
\[
(\hat\psi - \psi)^{\mathsf{T}}(\Sigma - \hat\Sigma)(\hat\psi - \psi) \le \frac{C_3}{2}\|\hat\psi - \psi\|_2^2.
\]
On the event that the sample covariance condition holds, we already proved the sample restricted eigenvalue condition in (17), from which we obtain
\[
\|\hat\psi - \psi\|_2^2 \le \frac{2}{C_3} P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\}.
\]
Therefore,
\[
(\hat\psi - \psi)^{\mathsf{T}}(\Sigma - \hat\Sigma)(\hat\psi - \psi) \le P_n^2\left\{ \big( V_A^{\mathsf{T}}(\hat\psi - \psi) \big)^2 \right\}.
\]
The expectation of the right-hand side is the expected in-sample error and can be bounded, as above, by the desired rate. Finally, we need to bound the term where the sample covariance condition is violated.
We use that
$$
(P-P_{2n})\!\left\{\big(V_A^T(\widehat\psi-\psi)\big)^2\right\}
=(P-P_{2n})\!\left\{\Big(\sum_{j=1}^kV_{A,j}(\widehat\psi-\psi)_j\Big)^{\!2}\right\}
=\sum_{j,l}(P-P_{2n})\!\left\{V_{A,j}V_{A,l}(\widehat\psi-\psi)_j(\widehat\psi-\psi)_l\right\}
$$
$$
=\sum_{j,l}(\widehat\psi-\psi)_j(\widehat\psi-\psi)_l\,(P-P_{2n})\{V_{A,j}V_{A,l}\}
\le\|\widehat\psi-\psi\|_1^2\max_{j,l}\big|(P-P_{2n})\{V_{A,j}V_{A,l}\}\big|
\le s^2\max_{j,l}\big|(P-P_{2n})\{V_{A,j}V_{A,l}\}\big|. \tag{22}
$$
Therefore,
$$
E\!\left[(P-P_{2n})\!\left\{\big(V_A^T(\widehat\psi-\psi)\big)^2\right\}\mathbb{1}_{\mathrm{SC}^c}\right]
\le s^2\,E\!\left[\max_{j,l}\big|(P-P_{2n})\{V_{A,j}V_{A,l}\}\big|\,\mathbb{1}_{\mathrm{SC}^c}\right].
$$
We can use Hölder's inequality to bound the previous line by
$$
\le s^2\,E\!\left[\max_{j,l}\big\{(P-P_{2n})\{V_{A,j}V_{A,l}\}\big\}^4\right]^{1/4}\big(1-P(\mathrm{SC})\big)^{1-1/4}. \tag{23}
$$
Now, by Assumptions (i) and (ii), $|V_{A,j}V_{A,l}|\lesssim r_1(k)$ and $P(V_{A,j}^2V_{A,l}^2)\lesssim r_1(k)$, which allows us to conclude
$$
E\!\left[\max_{j,l}\big\{(P-P_{2n})\{V_{A,j}V_{A,l}\}\big\}^4\right]^{1/4}\lesssim\sqrt{\frac{r_1(k)\log k}{n}}
$$
by using Bernstein's inequality, Lemma 2, and the fact that $\|X\|_{L_4}\le 4!\,\|X\|_{\Psi_1}$. By assumption, we have $1-P(\mathrm{SC})\lesssim\frac{\log k}{n}$. Hence, up to constants, we can further upper bound (23) by
$$
s^2\sqrt{\frac{r_1(k)\log k}{n}}\left(\frac{\log k}{n}\right)^{3/4}
=\frac{s\,r_1(k)\log k}{n}\cdot\frac{s}{\sqrt{r_1(k)}}\left(\frac{\log k}{n}\right)^{1/4}.
$$
Since $n\gtrsim s^4\,r_1(k)^2\log k$ by assumption, the second factor is bounded, so we can upper bound the previous line by the desired rate $\lesssim\frac{s\,r_1(k)\log k}{n}$. This concludes the proof.

B.2 Proof of Theorem 2

We want to derive the rate using Theorem 1. In the following sections, we verify the necessary assumptions.

B.2.1 Almost Sure Boundedness

We have
$$
\left|\sum_{a=1}^k\widehat\varphi_a(Z)V_{a,j}\,\widehat\varpi_a\right|
\le k\,\widehat\varpi_j\left[\frac{\mathbb{1}(A=j)}{\widehat\pi_j(X)}\,|Y-\widehat\mu_j(X)|+|\widehat\mu_j(X)|+|\psi_0|\right]
\le\frac{k\,\widehat\varpi_j}{\widehat\pi_j(X)}\,\mathbb{1}(A=j)\,2B+k\,\widehat\varpi_j\,2B\lesssim k
$$
by using Assumption (ii),
$$
\big|V_A^T\psi\,V_{A,j}\big|\le k\,|\psi_A|\,\mathbb{1}(A=j)\lesssim k
$$
by using the outcome boundedness assumptions in (iii), and
$$
|V_{A,j}V_{A,l}|=k\,\mathbb{1}(A=j)\,\mathbb{1}(A=l)\le k.
$$
Therefore, Assumption (i) of Theorem 1 is satisfied with $r_1(k)=k$.

B.2.2 Second Moment Boundedness

We have
$$
P\!\left\{\Big(\sum_{a=1}^k\widehat\varphi_a(Z)V_{a,j}\,\widehat\varpi_a\Big)^{\!2}\right\}
\le P\!\left\{\left(\frac{k\,\widehat\varpi_j}{\widehat\pi_j(X)}\,\mathbb{1}(A=j)\,2B+k\,\widehat\varpi_j\,2B\right)^{\!2}\right\}
\le 2\,P\!\left\{k^2\,\mathbb{1}(A=j)\,4B^4+k^2\,\widehat\varpi_j^2\,4B^2\right\}
\le 8B^4k^2\varpi_j+8B^2k^2\widehat\varpi_j^2
\lesssim k+k^2\max_{j\in\{1,\dots,k\}}\widehat\varpi_j^2,
$$
$$
\Big(\sum_{a=1}^kV_a^T\psi\,V_{a,j}\,\widehat\varpi_a\Big)^{\!2}=(k\,\psi_j\,\widehat\varpi_j)^2\lesssim k^2\max_{j\in\{1,\dots,k\}}\widehat\varpi_j^2,
$$
as well as
$$
P\!\left\{(V_A^T\psi\,V_{A,j})^2\right\}=P\!\left\{k^2\psi_j^2\,\mathbb{1}(A=j)\right\}=k^2\psi_j^2\,\varpi_j\lesssim k
\quad\text{and}\quad
P\!\left\{V_{A,j}^2V_{A,l}^2\right\}=k^2P\{\mathbb{1}(A=j)\,\mathbb{1}(A=l)\}\le k^2E\{\mathbb{1}(A=j)\}=k^2\varpi_j\lesssim k.
$$
Therefore, we can set $r_1(k)=k$ and $\widehat r_1(k)=k+k^2\max_{j\in\{1,\dots,k\}}\widehat\varpi_j^2$, as
$$
E\!\left(\max_{j}\widehat\varpi_j^2\right)\le\sum_{j=1}^kE(\widehat\varpi_j^2)=\sum_{j=1}^k\frac{1}{n^2}E\{\mathrm{Bin}(n,\varpi_j)^2\}
\le\frac{1}{n^2}\sum_{j=1}^k\left(n(n-1)\frac{C^2}{k^2}+n\frac Ck\right)\lesssim\frac1k
$$
and
$$
E\!\left(\max_{j}\widehat\varpi_j^4\right)\le\frac{k}{n^4}E\{\mathrm{Bin}(n,C/k)^4\}\lesssim\frac{k}{n^4}\cdot\frac{n^4}{k^4}\lesssim\frac{1}{k^2}
$$
imply $E(\widehat r_1(k)^j)\lesssim r_1(k)^j$ for $j=1,2$.

B.2.3 Nuisance Estimation Error Rate

We calculate that
$$
P\!\left(\widehat\varphi_a(Z)-V_a^T\psi\right)
=\sqrt k\,P\!\left[\frac{\mathbb{1}(A=a)}{\widehat\pi_a(X)}\{E(Y\mid X,A)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
=\sqrt k\,P\!\left[\frac{\mathbb{1}(A=a)}{\widehat\pi_a(X)}\{\mu_a(X)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
$$
$$
=\sqrt k\,P\!\left[\frac{E(\mathbb{1}(A=a)\mid X)}{\widehat\pi_a(X)}\{\mu_a(X)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
=\sqrt k\,P\!\left[\frac{\pi_a(X)}{\widehat\pi_a(X)}\{\mu_a(X)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
$$
$$
=\sqrt k\,P\!\left[(\mu_a(X)-\widehat\mu_a(X))\left(\frac{\pi_a(X)}{\widehat\pi_a(X)}-1\right)\right]
\le\sqrt k\,\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|\,\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|
=r_{2,a}(n,k),
$$
where the last inequality is Cauchy-Schwarz.
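The chain of equalities above is the standard doubly robust bias calculation: after iterating expectations, the first-order bias of the estimated influence function is a product of the propensity error and the outcome-regression error. The following sketch (illustrative only; the discrete covariate, the particular nuisance functions, and the perturbation shapes are assumptions, not taken from the paper) evaluates this bias exactly on a grid and compares it with the Cauchy-Schwarz product bound.

```python
import numpy as np

# Discrete covariate X, uniform on m points; a single treatment level a (illustrative setup).
m = 1000
x = np.arange(m)
pX = np.full(m, 1.0 / m)

pi = 0.10 + 0.05 * np.sin(x / 50.0)   # true propensity pi_a(x)
mu = 0.50 + 0.30 * np.cos(x / 30.0)   # true outcome regression mu_a(x)

for delta, eps in [(0.1, 0.1), (0.01, 0.1), (0.1, 0.01), (0.01, 0.01)]:
    pi_hat = pi * (1.0 + delta * np.sin(x / 7.0))   # propensity error of size ~delta
    mu_hat = mu + eps * np.cos(x / 11.0)            # outcome error of size ~eps
    # Exact first-order bias: P[ (pi/pi_hat)(mu - mu_hat) + mu_hat - mu ]
    bias = np.sum(pX * ((pi / pi_hat) * (mu - mu_hat) + mu_hat - mu))
    # Cauchy-Schwarz bound: ||pi/pi_hat - 1|| * ||mu - mu_hat|| in L2(P)
    bound = np.sqrt(np.sum(pX * (pi / pi_hat - 1) ** 2)) * np.sqrt(np.sum(pX * (mu - mu_hat) ** 2))
    print(f"delta={delta:5.2f} eps={eps:5.2f}  |bias|={abs(bias):.2e}  bound={bound:.2e}")
```

The printed bias shrinks proportionally to the product of the two error sizes and never exceeds the bound, which is exactly the second-order behavior the display exploits.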
Hence, Assumption (iii) of Theorem 1 is satisfied with
$$
r_{2,a}(n,k)=\sqrt k\,\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|\,\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|.
$$

B.2.4 Restricted Eigenvalue Condition

We have that
$$
v^TE(V_AV_A^T)v=k\cdot v^T\mathrm{diag}\big(E(\mathbb{1}(A=j))\big)v=k\cdot v^T\mathrm{diag}\big(P(A=j)\big)v=\sum_{j=1}^kk\,\varpi_j\,v_j^2\ge c\,\|v\|_2^2,
$$
where the last inequality follows from the assumption that $k\varpi_j\ge c$ for all $j\in\{1,\dots,k\}$. Hence, this proves Assumption (iv) of Theorem 1 with $C_3=c$.

B.2.5 Sample Covariance Condition

We have
$$
\mathbf A=\sqrt k\begin{pmatrix}\mathbb{1}(A_1=1)&\cdots&\mathbb{1}(A_1=k)\\ \vdots&\ddots&\vdots\\ \mathbb{1}(A_n=1)&\cdots&\mathbb{1}(A_n=k)\end{pmatrix},
$$
so
$$
v^T\widehat\Sigma v=v^T\Big(\frac1n\mathbf A^T\mathbf A\Big)v=k\cdot v^T\mathrm{diag}\Big(\frac{N_j}{n}\Big)v
\qquad\text{for }N_j=\sum_{i=1}^n\mathbb{1}(A_i=j).
$$
Further, we have $v^T\Sigma v=k\cdot v^T\mathrm{diag}(P(A=j))v$. Hence, we must show that
$$
\sup_{v\in S_1}\left|v^T\Big(\mathrm{diag}\Big(\frac{N_j}{n}\Big)-\mathrm{diag}\big(P(A=j)\big)\Big)v\right|\le\frac{c}{2k}
$$
with high probability. Since the matrix $\mathrm{diag}(N_j/n)-\mathrm{diag}(P(A=j))$ is symmetric, the left-hand side equals the operator 2-norm of this matrix, hence it suffices to bound the operator norm by $c/(2k)$. Since we have a diagonal matrix, it further suffices to bound the largest diagonal entry with high probability. Consequently, we need to show that
$$
\max_{j\in\{1,\dots,k\}}\big|N_j-nP(A=j)\big|\le\frac{cn}{2k} \tag{24}
$$
with probability at least $1-\alpha$ for $\alpha\le\frac{\log k}{n}$. For a fixed $j$, we can use the assumption $\pi_j\le C/k$ to obtain
$$
P\!\left(|N_j-nP(A=j)|\ge\frac{cn}{2k}\right)\le P\!\left(|N_j-nP(A=j)|\ge\frac{c}{2C}\,nP(A=j)\right).
$$
We have that $N_j$ is a sum of iid Bernoulli random variables with mean $\mu=nP(A=j)$. Hence, we can apply the multiplicative Chernoff bound to obtain, for $\delta:=\frac{c}{2C}$,
$$
P\!\left(|N_j-nP(A=j)|\ge\frac{c}{2C}\,nP(A=j)\right)=P(|N_j-\mu|\ge\delta\mu)\le2\exp\!\left(-\frac{\delta^2\mu}{3}\right)=2\exp\!\left(-\frac{c^2}{12C^2}\,n\varpi_j\right).
$$
Now using $\varpi_j\ge c/k$, we obtain
$$
P\!\left(|N_j-nP(A=j)|\ge\frac{c}{2C}\,nP(A=j)\right)\le2\exp\!\left(-\frac{c^3}{12C^2}\,\frac nk\right).
$$
Using the union bound, we immediately obtain
$$
P\!\left(\exists\,j\in\{1,\dots,k\}:\ |N_j-nP(A=j)|\ge\frac{c}{2C}\,nP(A=j)\right)\le2k\exp\!\left(-\frac{c^3}{12C^2}\,\frac nk\right).
$$
To satisfy (24) with probability at least $1-\alpha$, we need to bound the previous line by $\alpha$, i.e., we need
$$
2k\exp\!\left(-\frac{c^3}{12C^2}\,\frac nk\right)\le\alpha.
$$
Setting $\alpha=\frac{\log k}{n}$ and substituting $n=\gamma\,k\log k$, the previous condition is equivalent to
$$
k^{\,2-\frac{c^3\gamma}{12C^2}}\le\frac{1}{2\gamma}.
$$
Taking the logarithm on both sides, we obtain
$$
\left(2-\frac{c^3\gamma}{12C^2}\right)\log k\le-\log(2\gamma),
$$
which is equivalent to
$$
\gamma\ge\frac{12C^2}{c^3}\left(2+\frac{\log(2\gamma)}{\log k}\right).
$$
Using that $k\ge2$, the above is satisfied whenever we have
$$
\gamma\ge\frac{12C^2}{c^3}\left(2+\frac{\log(2\gamma)}{\log2}\right).
$$
This condition holds for a large enough constant $\gamma$.
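Condition (24) is straightforward to probe by simulation. The sketch below (illustrative only; uniform $\pi_j=1/k$ and the values $c=C=1$, $\gamma=20$ are assumptions chosen for concreteness) draws multinomial treatment counts with $n\approx\gamma k\log k$ and estimates how often the maximal deviation exceeds $cn/(2k)$.

```python
import numpy as np

rng = np.random.default_rng(1)

k, gamma, c = 200, 20, 1.0
n = int(gamma * k * np.log(k))
pi = np.full(k, 1.0 / k)           # uniform propensities, so pi_j = C/k with C = 1

failures, reps = 0, 500
for _ in range(reps):
    N = rng.multinomial(n, pi)     # counts N_j at each treatment level
    if np.max(np.abs(N - n * pi)) > c * n / (2 * k):
        failures += 1

print(f"empirical failure rate of (24): {failures / reps:.4f}")
print(f"target alpha = log(k)/n       : {np.log(k) / n:.2e}")
```

With $n\gtrsim k\log k$, the deviation threshold $cn/(2k)$ is several standard deviations of each count, so failures are essentially never observed, in line with the exponential Chernoff bound.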
B.2.6 Final Rate

Plugging the obtained rates into the error guarantee of Theorem 1, we obtain
$$
E\!\left[\big\{V_A^T(\widehat\psi-\psi)\big\}^2\right]
\lesssim\frac{sk\log k}{n}+sk^2\,E\!\left(\max_{a\in\{1,\dots,k\}}\widehat\varpi_a^2\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|^2\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|^2\right).
$$
The left-hand side equals
$$
E\!\left(k\sum_{a=1}^k\mathbb{1}(A=a)(\widehat\psi_a-\psi_a)^2\right)=k\cdot E\!\left(\sum_{a=1}^k\varpi_a(\widehat\psi_a-\psi_a)^2\right),
$$
so after scaling by $1/k$ we obtain
$$
E\!\left(\sum_{a=1}^k\varpi_a(\widehat\psi_a-\psi_a)^2\right)
\lesssim\frac{s\log k}{n}+sk\,E\!\left(\max_{a}\widehat\varpi_a^2\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|^2\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|^2\right).
$$
It remains to show that
$$
sk\,E\!\left(\max_{a}\widehat\varpi_a^2\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|^2\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|^2\right)\le\frac sk(\delta_n\epsilon_n)^2.
$$
Using that $\delta_n(a)\epsilon_n(a)\le\delta_n\epsilon_n$, we obtain
$$
sk\,E\!\left(\max_{a\in\{1,\dots,k\}}\{\widehat\varpi_a\,\delta_n(a)\,\epsilon_n(a)\}^2\right)\le s(\delta_n\epsilon_n)^2\,k\,E\!\left(\max_{a\in\{1,\dots,k\}}\widehat\varpi_a^2\right). \tag{25}
$$
We now compute that
$$
E\!\left(\max_{a\in\{1,\dots,k\}}\widehat\varpi_a^2\right)=\frac{1}{n^2}E\!\left(\max_{a\in\{1,\dots,k\}}N_a^2\right),
\qquad N_a\sim\mathrm{Bin}(n,\varpi_a).
$$
To compute the expectation on the right-hand side, we note that this is the balls-into-bins problem, where we want to calculate the expected squared maximum number of balls across all bins. If we can show that $E(\max_aN_a^2)\lesssim n^2/k^2$, we overall obtain the desired bound. We first calculate that
$$
E\!\left(\max_aN_a^2\right)=\int_{t>0}P\!\left(\max_aN_a^2>t\right)dt=2\int_{s>0}s\,P\!\left(\max_aN_a>s\right)ds,
$$
where we substituted $s=\sqrt t$. Now define $T=\alpha\frac nk$ for some constant $\alpha$ that is independent of $n$ and $k$. Then we can split the integral into two terms:
$$
\int_{s>0}s\,P\!\left(\max_aN_a>s\right)ds=\int_0^Ts\,P\!\left(\max_aN_a>s\right)ds+\int_T^\infty s\,P\!\left(\max_aN_a>s\right)ds.
$$
We first upper bound the first integral as follows:
$$
\int_0^Ts\,P\!\left(\max_aN_a>s\right)ds\le\int_0^Ts\,ds=\frac12T^2=\frac{\alpha^2}{2}\,\frac{n^2}{k^2}.
$$
This upper bound is of the desired order. Hence, it remains to show that the tail integral can be bounded by an equal or smaller order. By using the union bound, we obtain
$$
P\!\left(\max_aN_a>s\right)\le\sum_{a=1}^kP(N_a>s)\le k\,P(N>s)
$$
for some random variable $N\sim\mathrm{Bin}(n,C_2/k)$. Hence, we obtain
$$
\int_T^\infty s\,P\!\left(\max_aN_a>s\right)ds\le k\int_T^\infty s\,P(N>s)\,ds.
$$
We want to apply the multiplicative Chernoff bound
$$
P\big(N>(1+\delta)\mu\big)\le\left(\frac{\exp(\delta)}{(1+\delta)^{1+\delta}}\right)^{\!\mu}
$$
for a sum $N$ of independent Bernoulli random variables with mean $E(N)=\mu$. Set $\delta:=\frac s\mu-1$. Take $\alpha>C_2$; then $s>T=\alpha\frac nk>C_2\frac nk=E(N)=\mu$, so $\delta>0$. Now,
$$
P(N>s)=P\big(N>(1+\delta)\mu\big)\le\left(\frac{\exp(\delta)}{(1+\delta)^{1+\delta}}\right)^{\!\mu}=\frac{\exp(s-\mu)}{(s/\mu)^s}=\exp\!\left(s-\mu-s\log\frac s\mu\right).
$$
Since $\mu\ge0$ and $\log e=1$, we can upper bound the previous line by $\exp\{-s\log(\frac{s}{\mu e})\}$. Take $\alpha\ge e^2C_2$; then
$$
\log\frac{s}{\mu e}\ge\log\frac{T}{\mu e}=\log\frac{\alpha}{C_2e}\ge1,
$$
which implies that $P(N>s)\le\exp(-s)$. Consequently,
$$
\int_T^\infty s\,P\!\left(\max_aN_a>s\right)ds\le k\int_T^\infty s\exp(-s)\,ds=k\big[-(s+1)\exp(-s)\big]_T^\infty=k\,(T+1)\exp(-T).
$$
Finally, we show that this term is of equal or smaller order than $\mu^2$ by calculating that $n=\gamma k\log k$ gives $T=\alpha\gamma\log k$, so
$$
k\,(T+1)\exp(-T)=k\,(T+1)\,k^{-\alpha\gamma}\le2T\,k^{1-\alpha\gamma}.
$$
The right-hand side is of order at most $\mu^2$ if $1-\alpha\gamma<0$, i.e., we choose $\alpha>1/\gamma$. Hence, for $\alpha\ge\max\{e^2C_2,1/\gamma\}$, the tail integral is negligible. In summary, this proves (25), and we can conclude that
$$
E\!\left(\sum_{a=1}^k\varpi_a(\widehat\psi_a-\psi_a)^2\right)\lesssim\frac{s\log k}{n}+\frac sk(\delta_n\epsilon_n)^2.
$$
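The balls-into-bins bound $E(\max_aN_a^2)\lesssim n^2/k^2$ drives the final rate; it sharpens the crude union bound from Section B.2.2, which only gave $E(\max_a\widehat\varpi_a^2)\lesssim1/k$ rather than $1/k^2$. A small simulation (illustrative, with uniform bin probabilities and $n=\gamma k\log k$) checks the claimed scaling:

```python
import numpy as np

rng = np.random.default_rng(2)

def max_sq_count(k, gamma=5, reps=300):
    """Monte Carlo estimate of E[max_a N_a^2] for multinomial counts with n = gamma*k*log(k)."""
    n = int(gamma * k * np.log(k))
    N = rng.multinomial(n, np.full(k, 1.0 / k), size=reps)
    return n, np.mean(np.max(N, axis=1).astype(float) ** 2)

for k in [50, 200, 800]:
    n, m2 = max_sq_count(k)
    print(f"k={k:4d}  E[max N^2] / (n/k)^2 = {m2 / (n / k) ** 2:.2f}")
```

The ratio remains bounded as $k$ grows, matching the $n^2/k^2$ order used in (25).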
B.3 Proof of Theorem 3

We prove the statement by applying the general error rate guarantee given in Theorem 1 and verifying the necessary assumptions therein.

B.3.1 Almost Sure Boundedness

We have
$$
\left|\sum_{a\in\mathcal A}\widehat\varphi_a(Z)V_{a,j}\,\widehat\varpi_a\right|
\le\sum_{a\in\mathcal A}\left(\frac{\mathbb{1}(A=a)}{\widehat\pi_a(X)}\,2B^2+2B^2\right)\widehat\varpi_a
\le2B^3\sum_{a\in\mathcal A}\mathbb{1}(A=a)+2B^2\sum_{a\in\mathcal A}\widehat\varpi_a\lesssim1,
$$
as well as $|V_A^T\psi\,V_{A,j}|=|A^T\psi\,A_j|\lesssim1$ and $|V_{A,j}V_{A,l}|=|A_jA_l|\lesssim1$ by using the boundedness assumptions. Therefore, Assumption (i) of Theorem 1 is satisfied with $r_1(k)=1$.

B.3.2 Second Moment Boundedness

The second-moment boundedness assumptions follow immediately from the almost sure boundedness proved above. Hence, Assumption (ii) of Theorem 1 is satisfied with $\widehat r_1(k)=r_1(k)=1$.

B.3.3 Nuisance Estimation Error Rate

We have
$$
P\!\left(\widehat\varphi_a(Z)-V_a^T\psi\right)=P\!\left(\widehat\varphi_a(Z)-a^T\psi\right)
=P\!\left[\frac{\mathbb{1}(A=a)}{\widehat\pi_a(X)}\{E(Y\mid X,A)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
$$
$$
=P\!\left[\frac{\pi_a(X)}{\widehat\pi_a(X)}\{\mu_a(X)-\widehat\mu_a(X)\}+\widehat\mu_a(X)-\mu_a(X)\right]
=P\!\left[(\mu_a(X)-\widehat\mu_a(X))\left(\frac{\pi_a(X)}{\widehat\pi_a(X)}-1\right)\right]
\le\left\|\frac{\pi_a}{\widehat\pi_a}-1\right\|\,\|\mu_a-\widehat\mu_a\|=r_{2,a}(n,k),
$$
where the intermediate steps (iterating the expectation of $\mathbb{1}(A=a)$ given $X$) are as in Section B.2.3.

B.3.4 Restricted Eigenvalue Condition

By assumption, we have $v^TE(V_AV_A^T)v=v^TE(AA^T)v\ge D\|v\|_2^2$, so Assumption (iv) of Theorem 1 is satisfied with $C_3=D$.

B.3.5 Sample Covariance Condition

By assumption, we have
$$
\sup_{\|v\|_2=1,\ \|v\|_1\le2\sqrt s}\big|v^T\Sigma v-v^T\widehat\Sigma v\big|
=\sup_{\|v\|_2=1,\ \|v\|_1\le2\sqrt s}\left|v^T\Big(E(AA^T)-\frac1n\mathbf A^T\mathbf A\Big)v\right|\le\frac D2
$$
with high probability.

B.3.6 Final Rate

Plugging the obtained rates into Theorem 1, we obtain
$$
E\!\left[P_{2n}\big\{(A^T(\widehat\psi-\psi))^2\big\}\right]
\lesssim\frac{s\log k}{n}+s\,E\!\left(\max_{j\in\{1,\dots,k\}}\left\{\int_{\mathcal A}r_{2,a}(n,k)\,|V_{a,j}|\,dP_{1n}(a)\right\}^2\right)
\lesssim\frac{s\log k}{n}+s\,E\!\left[\left(\sum_{a\in\mathcal A}\widehat\varpi_a\left\|\frac{\pi_a}{\widehat\pi_a}-1\right\|\,\|\mu_a-\widehat\mu_a\|\right)^{\!2}\right].
$$
Finally, we can simplify the upper bound by using that $\delta_n(a)\le\delta_n$, $\epsilon_n(a)\le\epsilon_n$, and the fact that $\sum_{a\in\mathcal A}\widehat\varpi_a=1$, which gives
$$
E\!\left[P_{2n}\big\{(A^T(\widehat\psi-\psi))^2\big\}\right]\lesssim\frac{s\log k}{n}+s\,(\delta_n\epsilon_n)^2.
$$
B.4 Proof of Theorem 6

The proof follows the same structure as the derivation of the upper bound for the expected in-sample error in the proof of Theorem 1. Therefore, we only highlight the necessary adjustments here. The basic Lasso inequality can be applied in the same way, giving us
$$
E\!\left[P_{2n}\!\left(2\int_{\mathcal A}\widehat\varphi_a(Z)V_a^T(\widehat\psi-\psi)\,dP_{1n}(a)+2(V_A^T\psi)V_A^T(\psi-\widehat\psi)\right)\right]
\le2\,E\!\left[\|\widehat\psi-\psi\|_1\max_{j}\left|P_{2n}\!\left\{\int_{\mathcal A}(\widehat\varphi_a(Z)-V_a^T\psi)V_{a,j}\,dP_{1n}(a)\right\}+(P_{2n}-P_{1n})(V_A^T\psi\,V_{A,j})\right|\right].
$$
The proof of Theorem 1 then further upper-bounds $\|\widehat\psi-\psi\|_1$ using the exact sparsity, restricted eigenvalue, and sample covariance conditions to achieve the fast Lasso rate. However, these assumptions are not available to us here, so we simply use the approximate sparsity assumption to bound $\|\widehat\psi-\psi\|_1\le2s$ instead. Consequently,
$$
E\!\left[P_{2n}\!\left(2\int_{\mathcal A}\widehat\varphi_a(Z)V_a^T(\widehat\psi-\psi)\,dP_{1n}(a)+2(V_A^T\psi)V_A^T(\psi-\widehat\psi)\right)\right]
\le4s\,E\!\left[\max_{j}\left|P_{2n}\!\left\{\int_{\mathcal A}(\widehat\varphi_a(Z)-V_a^T\psi)V_{a,j}\,dP_{1n}(a)\right\}+(P_{2n}-P_{1n})(V_A^T\psi\,V_{A,j})\right|\right].
$$
The term on the right-hand side is analogous to the one obtained in the proof of Theorem 1, the only difference being that it does not appear squared. We apply the same decomposition into the empirical process terms $T_1$, $T_3$, $T_4$ and the nuisance estimation error term $T_2$; that is, the right-hand side equals
$$
4s\,\underbrace{E\!\left[\max_j\left|(P_{2n}-P)\!\left(\int_{\mathcal A}(\widehat\varphi_a(Z)-V_a^T\psi)V_{a,j}\,dP_{1n}(a)\right)\right|\right]}_{T_1}
+4s\,\underbrace{E\!\left[\max_j\left|P\!\left(\int_{\mathcal A}(\widehat\varphi_a(Z)-V_a^T\psi)V_{a,j}\,dP_{1n}(a)\right)\right|\right]}_{T_2}
$$
$$
+4s\,\underbrace{E\!\left[\max_j\big|(P_{2n}-P)(V_A^T\psi\,V_{A,j})\big|\right]}_{T_3}
+4s\,\underbrace{E\!\left[\max_j\big|(P_{1n}-P)(V_A^T\psi\,V_{A,j})\big|\right]}_{T_4}.
$$
Note that, also here, the terms are not of quadratic order, as they are in the proof of Theorem 1. Nonetheless, we can use the same tools to analyze them. We obtain
$$
T_2\le C_2\,E\!\left(\max_j\int_{\mathcal A}r_{2,a}(n,k)\,|V_{a,j}|\,dP_{1n}(a)\right)
$$
with completely analogous calculations. For the term $T_1$, we have to formulate Lemma 3 in terms of the $L_1$-norm instead of the $L_2$-norm. This can be done by using Lemma 2 and the fact that the Orlicz norm bounds the $L_1$-norm from above, $\|X\|_1\le\|X\|_{\Psi_1}$. This implies the following lemma.

Lemma 4. Under the notation and assumptions of Lemma 2, we have
$$
E\!\left(\max_{1\le i\le k}|X_i|\right)\le K\left(a\log(1+k)+\sqrt b\,\sqrt{\log(1+k)}\right).
$$

Using Bernstein's inequality in combination with Assumptions (i) and (ii) to verify the assumptions of this lemma, we obtain
$$
T_1\lesssim\frac{r_1(k)\log k}{n}+\sqrt{\frac{r_1(k)\log k}{n}}.
$$
The same analysis applies to the terms $T_3$ and $T_4$. This proves Theorem 6.

B.5 Proof of Theorem 7

We want to derive the rate using Theorem 6. To verify the assumptions, we can reuse the computations in the proof of Theorem 2. Assumption (i) of Theorem 6 is satisfied with $r_1(k)=k$ by the calculation in Section B.2.1. Assumption (ii) is satisfied with $r_1(k)=k$ using the calculations in Section B.2.2 and additionally verifying that $E(\max_{j\in\{1,\dots,k\}}\widehat\varpi_j)\lesssim\frac{1}{\sqrt k}$. Assumption (iii) can be verified with $r_{2,a}(n,k)=\sqrt k\,\|\pi_a/\widehat\pi_a-1\|\,\|\mu_a-\widehat\mu_a\|$ according to Section B.2.3. Using Theorem 6 shows
$$
E\!\left(\sum_{a=1}^kw(a)(\widehat\psi_a-\psi_a)^2\right)\lesssim s\sqrt{\frac{k\log k}{n}}+s\,E\!\left(\max_{a\in\{1,\dots,k\}}\{\widehat\varpi_a\,\delta_n(a)\,\epsilon_n(a)\}\right).
$$
Finally, we simplify the nuisance penalty by bounding $\delta_n(a)\le\delta_n$ and $\epsilon_n(a)\le\epsilon_n$, and using the fact that
$$
E\!\left(\max_{a\in\{1,\dots,k\}}\widehat\varpi_a\right)\le\sqrt{E\!\left(\max_{a\in\{1,\dots,k\}}\widehat\varpi_a^2\right)}\lesssim\frac1k
$$
due to the already shown identity $E(\max_{a}\widehat\varpi_a^2)\asymp1/k^2$ in Appendix B.2.6. This proves that
$$
E\!\left(\sum_{a=1}^kw(a)(\widehat\psi_a-\psi_a)^2\right)\lesssim s\sqrt{\frac{k\log k}{n}}+\frac sk\,\delta_n\epsilon_n.
$$

B.6 Proof of Theorem 8

We prove the result by verifying the conditions of Theorem 6. In the proof of Theorem 3, we already verified that Assumptions (i) and (ii) hold with $r_1(k)=1$, and Assumption (iii) holds with $r_{2,a}(n,k)=\|\pi_a/\widehat\pi_a-1\|\,\|\mu_a-\widehat\mu_a\|$. Applying Theorem 6 now gives
$$
E\!\left[P_{2n}\big\{(A^T(\widehat\psi-\psi))^2\big\}\right]\lesssim s\sqrt{\frac{\log k}{n}}+s\,E\!\left(\sum_{a\in\mathcal A}\widehat\varpi_a\left\|\frac{\pi_a(\cdot)}{\widehat\pi_a(\cdot)}-1\right\|\,\|\mu_a(\cdot)-\widehat\mu_a(\cdot)\|\right).
$$
Using the bounds $\delta_n(a)\le\delta_n$ and $\epsilon_n(a)\le\epsilon_n$, and the fact that $\sum_{a\in\mathcal A}\widehat\varpi_a=1$, we can conclude the claim of the theorem.
C Proofs of Lower Bounds

C.1 Proof of Proposition 1

To derive this result, we use Fano's method. The following lemma gives a version based on the $\chi^2$ divergence, adapted from Theorem 2.6 of Tsybakov [2009].

Lemma 5. Let $\{P_0,P_1,\dots,P_M\}\subseteq\mathcal P$ with $M\ge2$. Assume $\frac1M\sum_{j=1}^M\chi^2(P_j,P_0)\le\alpha M$ for $0<\alpha<1/2$, and $d(\theta_j,\theta_{j'})\ge\Delta>0$ for all $j\ne j'$. Then
$$
\inf_{\widehat\theta}\sup_{P\in\mathcal P}E_P\!\left[\ell\big\{d(\widehat\theta,\theta(P))\big\}\right]\ge\frac{\ell(\Delta/2)}{2}\left(1-\alpha-\frac1M\right)
$$
for any monotonic non-negative loss function $\ell$.

C.1.1 Construction

Consider the $2^k$ distributions $P_\omega$ where $Y\mid A=a\sim\mathrm{Bern}(1/2+\omega_a\epsilon)$, i.e.,
$$
p_\omega(z)=\big\{y(1/2+\omega_a\epsilon)+(1-y)(1/2-\omega_a\epsilon)\big\}\,\pi_a,
$$
where $\omega=(\omega_1,\dots,\omega_k)\in\{0,1\}^k$, $\pi_a\ge C'/k$, and $\epsilon$ is specified later. Note that $\mu_a(x)=1/2+\omega_a\epsilon$. These are valid distributions as long as $\epsilon\le1/2$. Let $P_j=P_{\omega^{(j)}}$, $j=1,\dots,M$, denote at least $2^{k/8}$ distributions with probability mass functions $p_{\omega^{(j)}}$ for which any $\omega^{(j)}\ne\omega^{(j')}$ differ on at least $k/8$ indices. Existence is guaranteed by the following Varshamov-Gilbert lemma (Tsybakov [2009], Lemma 2.9).

Lemma 6 (Varshamov-Gilbert). Assume $k\ge8$. Then there exists a subset $\{\omega^{(0)},\dots,\omega^{(M)}\}$ of $\Omega=\{0,1\}^k$ such that $M\ge2^{k/8}$, $\omega^{(0)}=(0,\dots,0)$, and $\|\omega^{(j)}-\omega^{(j')}\|_0\ge k/8$ for all $j\ne j'$.

C.1.2 Functional Separation

Note for any $\omega,\omega'$ differing at index $a$ we have $|\psi_a(P_\omega)-\psi_a(P_{\omega'})|=\epsilon$. Now since any $\omega,\omega'$ must differ on at least $k/8$ indices, we have
$$
\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2\ge\frac18\,\epsilon^2. \tag{26}
$$
Note we also have, for any $\omega$, that $\|\psi(P_\omega)\|_2^2=\sum_j(1/2+\omega_j\epsilon)^2\le k(1/2+\epsilon)^2\le2k(1/4+\epsilon^2)$.

C.1.3 Distributional Distance

For a single observation, the $\chi^2$ divergence satisfies
$$
\chi^2(P_\omega,P_0)=\sum_{a,y}\frac{p_\omega(z)^2}{p_0(z)}-1
\le\sum_{a,y}2\big\{y(1/2+\epsilon)+(1-y)(1/2-\epsilon)\big\}^2\pi_a-1
=2\big\{(1/2+\epsilon)^2+(1/2-\epsilon)^2\big\}-1=4\epsilon^2.
$$
So for the product measures $P_0^{\otimes n}=\prod_{i=1}^nP_0$ and $P_\omega^{\otimes n}=\prod_{i=1}^nP_\omega$ we have
$$
\log\!\left(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\right)=\sum_{i=1}^n\log\!\left(1+\chi^2(P_\omega,P_0)\right)\le n\log(1+4\epsilon^2)\le4n\epsilon^2
$$
using the tensorization properties of $\chi^2$ and $\log(1+x)\le x$. If we take
$$
\epsilon^2=\frac{\log(M/4+1)}{4n}\le\frac{k\log2}{4n},
$$
then $\log(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n}))\le\log(M/4+1)$, which implies $\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\le M/4$, satisfying the distance condition in Lemma 5 with $\alpha=1/4$. Note that the condition $\epsilon\le1/2$, needed for $p_\omega$ to be a valid probability mass function, is satisfied if $k\le n/\log2$. When $k>n/\log2$, we can take $\epsilon$ to be a constant, e.g., $\epsilon=1/4$, and the same arguments below show a constant minimax lower bound.

C.1.4 Final Result

Plugging the above choice of $\epsilon$ into the functional separation (26) yields
$$
\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2\ge\frac{\epsilon^2}{8}=\frac{\log(M/4+1)}{32\,n},
$$
and note that $\log(M/4+1)\ge\log(2^{k/8}/4)=(k/8-2)\log2\ge k\log2/16$ as long as $k\ge32$. Lemma 5 with $\alpha=1/4$, $\Delta=\sqrt{C_1k/n}$ where $C_1=\frac{\log2}{512}$, and $d(x,y)=\sqrt{\frac1k\sum_{a=1}^k(x_a-y_a)^2}$ then yields
$$
\inf_{\widehat\psi}\sup_{P\in\mathcal P}E_P\!\left(\sum_{a=1}^k\varpi_a(P)\big\{\widehat\psi_a-\psi_a(P)\big\}^2\right)\ge\frac{C'C_1}{8}\left(1-\frac14-\frac{1}{2^{k/8}}\right)\frac kn\ge C_2\,\frac kn
$$
for $C_2=11C'C_1/128$, using the fact that $1/2^{k/8}\le1/16$ when $k\ge32$, and $\varpi_a(P)\ge\frac{C'}k$ uniformly across all $a\in\{1,\dots,k\}$ and $P\in\mathcal P$. This proves the proposition.
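Lemma 6 is purely existential, but a packing of the required size is easy to exhibit in practice. The sketch below (illustrative; it uses rejection sampling of random binary vectors rather than the lemma's combinatorial argument, and $k=64$ is an arbitrary choice) collects vectors with pairwise Hamming distance at least $k/8$ and compares the count with the $2^{k/8}$ guarantee.

```python
import numpy as np

rng = np.random.default_rng(3)

k = 64
min_dist = k // 8          # required pairwise Hamming distance
target = 2 ** (k // 8)     # Varshamov-Gilbert guarantees a packing at least this large

packing, attempts = [], 0
while len(packing) < target and attempts < 200000:
    v = rng.integers(0, 2, size=k)
    attempts += 1
    if all(np.sum(v != u) >= min_dist for u in packing):
        packing.append(v)

print(f"found {len(packing)} vectors with pairwise Hamming distance >= {min_dist} "
      f"(guarantee: {target}) after {attempts} attempts")
```

Random binary vectors are pairwise far apart with overwhelming probability, which is exactly the phenomenon the probabilistic proof of the lemma exploits.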
C.2 Proof of Theorem 4

To derive this result, we use Fano's method. Recall Lemma 5, which gives a version of Fano's method based on the $\chi^2$ distance, adapted from Theorem 2.6 of Tsybakov [2009]. Consequently, the main task will be to construct distributions $P_j$, $j=0,1,\dots,M$, that satisfy the conditions of this lemma. In particular, we want to ensure that the difference of the functional $\psi$ under two different distributions $P_j$ and $P_{j'}$ is large, while the distributions are still close enough, i.e., the $\chi^2$ distance of the two distributions can be bounded as required by the lemma.

C.2.1 Construction

Consider the distributions $P_\omega$ where $Y\mid A=a\sim\mathrm{Bern}(1/2+\omega_a\epsilon)$, i.e.,
$$
p_\omega(z)=\{y(1/2+\omega_a\epsilon)+(1-y)(1/2-\omega_a\epsilon)\}\,\pi_a,
$$
where $\epsilon$ will be specified later and $\omega=(\omega_1,\dots,\omega_k)\in\Omega$ with
$$
\Omega=\big\{\omega=(\omega_1,\dots,\omega_k)\in\{0,1\}^k:\ \|\omega\|_0=s\big\}
$$
the set of all $s$-sparse binary vectors, and $\pi_a$ satisfying Assumption (i). These are valid distributions as long as $\epsilon\le1/2$. Moreover, note that $\mu_a(x)=1/2+\omega_a\epsilon$. Let $P_0$ denote the distribution corresponding to the choice $\omega_a=0$ for all $a$. Let $P_j=P_{\omega^{(j)}}$, $j=1,\dots,M$, denote at least $(1+\frac{k}{2s})^{s/8}$ distributions with densities $p_{\omega^{(j)}}$ for which any $s$-sparse vectors $\omega^{(j)}\ne\omega^{(j')}$ differ on at least $s/2$ indices. The existence is guaranteed by the following sparse version of the Varshamov-Gilbert lemma (see Lemma 4.14 of Rigollet and Hütter [2023] for a proof).

Lemma 7 (Sparse Version of Varshamov-Gilbert). Let $1\le s\le k/8$. Then there exists a subset $\{\omega^{(1)},\dots,\omega^{(M)}\}\subseteq\Omega$ of $s$-sparse binary vectors such that $\|\omega^{(j)}-\omega^{(j')}\|_0\ge\frac s2$ for all $j\ne j'$, and $\log(M)\ge\frac s8\log\!\left(1+\frac{k}{2s}\right)$.

Note that the distributions $P_j$, $j=1,\dots,M$, respect the model: if $\omega_a^{(j)}=0$ then $\psi_a(P_j)=\psi_a(P_0)=1/2$. We can only have $\psi_a(P_j)\ne1/2$ whenever $\omega_a^{(j)}=1$, which by construction is the case for at most $s$ of the treatments $a\in\{1,\dots,k\}$. Hence, at most $s$ entries of $\psi(P_j)$ differ from $1/2$.

C.2.2 Functional Separation

Note for any $\omega,\omega'$ differing at index $a$ we have $|\psi_a(P_\omega)-\psi_a(P_{\omega'})|=|\omega_a\epsilon-\omega'_a\epsilon|=\epsilon$. Now since any $\omega,\omega'$ must differ on at least $s/2$ indices, we conclude
$$
\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2\ge\frac{s}{2k}\,\epsilon^2. \tag{27}
$$

C.2.3 Distributional Distance

For the density of a single observation, the $\chi^2$ distance satisfies
$$
\chi^2(P_\omega,P_0)=\sum_{a=1}^k\sum_{y=0}^1\frac{p_\omega(z)^2}{p_0(z)}-1
\le\sum_{a=1}^k\sum_{y=0}^12\{y(1/2+\omega_a\epsilon)+(1-y)(1/2-\omega_a\epsilon)\}^2\pi_a-1
$$
$$
=2\sum_{a=1}^k\big\{(1/2+\omega_a\epsilon)^2+(1/2-\omega_a\epsilon)^2\big\}\pi_a-1
=2\sum_{a=1}^k\big(1/2+2\omega_a\epsilon^2\big)\pi_a-1
=\sum_{a=1}^k\pi_a+4\epsilon^2\sum_{a=1}^k\pi_a\omega_a-1\le\frac{4C''\epsilon^2s}{k}.
$$
So for the product measures $P_0^{\otimes n}=\prod_{i=1}^nP_0$ and $P_\omega^{\otimes n}=\prod_{i=1}^nP_\omega$, we have
$$
\log\!\left(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\right)=\sum_{i=1}^n\log\{1+\chi^2(P_\omega,P_0)\}\le n\log\!\left(1+\frac{4C''\epsilon^2s}{k}\right)\le\frac{4C''\epsilon^2sn}{k},
$$
using the tensorization property of the $\chi^2$ distance in the first step, and $\log(1+x)\le x$ in the last. We proceed with a case distinction now.

Case $k\log(k/s)\le\frac{C''}2n$: In this case, we take
$$
\epsilon^2=\frac{k\log(M/4+1)}{4C''sn};
$$
then $\log(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n}))\le\log(M/4+1)$, which implies $\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\le\frac M4$, satisfying the distance condition in Lemma 5 with $\alpha=1/4$.
Note that this is a valid choice for $\epsilon$, as
$$
\epsilon=\sqrt{\frac{k\log(M/4+1)}{4C''ns}}
\le\sqrt{\frac{k\log\big\{\binom ks/4+1\big\}}{4C''ns}}
\le\sqrt{\frac{k\log\big\{(ek/s)^s/4+1\big\}}{4C''ns}}
\le\sqrt{\frac{ks\log(ek/s)}{4C''ns}}
\le\sqrt{\frac{k\log(k/s)}{2C''n}}\le\frac12,
$$
where we used the assumption $k\log(k/s)\le\frac{C''}2n$ as well as $k\ge8s$.

Case $k\log(k/s)>\frac{C''}2n$: In this case, we choose $\epsilon=1/34$. Then we have
$$
\log\!\left(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\right)\le\frac{4C''\epsilon^2sn}{k}=\frac{C''}{289}\,\frac{ns}{k}\le\frac{s}{144}\,\log(k/s).
$$
By the property of the sparse Varshamov-Gilbert construction, we have
$$
\log(M/4+1)\ge\log(M)-\log4\ge\frac s8\log\!\left(1+\frac{k}{2s}\right)-\log4\ge\left(\frac s8-1\right)\log\!\left(1+\frac{k}{2s}\right)
\ge\frac{s}{72}\log\!\left(\frac{k}{2s}\right)=\frac{s}{72}\left\{\log\!\left(\frac ks\right)-\log2\right\}\ge\frac{s}{144}\log\!\left(\frac ks\right)
$$
as long as $s\ge9$ and $k\ge8s$. Therefore, $\log(1+\chi^2(P_\omega^{\otimes n},P_0^{\otimes n}))\le\log(M/4+1)$, hence $\chi^2(P_\omega^{\otimes n},P_0^{\otimes n})\le\frac M4$. Finally, note that this choice of $\epsilon$ leads to a valid distribution $p_\omega$, as $\epsilon\le1/2$.

C.2.4 Final Minimax Lower Bound

Since we used different choices of $\epsilon$ for the cases $k\log(k/s)\le C''n/2$ and $k\log(k/s)>C''n/2$, we also make this case distinction here and derive the minimax lower bounds separately.

Case $k\log(k/s)\le\frac{C''}2n$: Plugging the above choice of $\epsilon$ into the functional separation (27) yields
$$
\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2\ge\frac{s}{2k}\cdot\frac{k\log(M/4+1)}{4C''sn}=\frac{\log(M/4+1)}{8C''n}\ge\frac{s\log(k/s)}{1152\,C''n}\ge C_1\,\frac{s\log(k/s)}{n}=:\Delta_1^2,
$$
as long as $s\ge9$ and $k\ge8s$, where $C_1=\frac{1}{1152\,C''}$.

Case $k\log(k/s)>\frac{C''}2n$: In this case, plugging $\epsilon=1/34$ into the functional separation gives
$$
\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2\ge\frac{s}{2k}\cdot\frac{1}{1156}=C_2\,\frac sk=:\Delta_2^2,
$$
where $C_2=\frac{1}{2312}$.

Now, considering both cases combined, we obtain
$$
\sqrt{\frac1k\sum_{a=1}^k\{\psi_a(P_\omega)-\psi_a(P_{\omega'})\}^2}\ge\min(\Delta_1,\Delta_2)\ge\sqrt{\min(C_1,C_2)\,\min\!\left(\frac{s\log(k/s)}{n},\frac sk\right)}=:\Delta.
$$
Applying Lemma 5 with this choice of $\Delta$, $\alpha=1/4$, and $d(x,y)=\sqrt{\frac1k\sum_{a=1}^k(x_a-y_a)^2}$ gives
$$
\inf_{\widehat\psi}\sup_{P\in\mathcal P}E\!\left\{\sum_{a=1}^k\varpi_a(P)(\widehat\psi_a-\psi_a)^2\right\}
\ge C'\left(1-\frac14-\frac1M\right)\frac{(\Delta/2)^2}{2}
\ge\frac{C'\min(C_1,C_2)}{8}\left(\frac34-\Big(1+\frac{k}{2s}\Big)^{-s/8}\right)\min\!\left(\frac{s\log(k/s)}{n},\frac sk\right)
$$
$$
\ge\frac{C'\min(C_1,C_2)}{8}\left(\frac34-5^{-9/8}\right)\min\!\left(\frac{s\log(k/s)}{n},\frac sk\right)
=C_3\,\min\!\left(\frac{s\log(k/s)}{n},\frac sk\right)
$$
for $C_3=\frac{C'\min(C_1,C_2)}{8}\big(\frac34-5^{-9/8}\big)$, where we used $M\ge(1+\frac{k}{2s})^{s/8}$, $k/s\ge8$, $s\ge9$, and the fact that $\varpi_a(P)\ge\frac{C'}k$ uniformly across all $a\in\{1,\dots,k\}$ and $P\in\mathcal P$. This proves the theorem.
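For this construction the $\chi^2$ computation in C.2.3 in fact holds with equality before the final propensity bound: summing $p_\omega^2/p_0$ over $(a,y)$ gives $\chi^2(P_\omega,P_0)=4\epsilon^2\sum_a\pi_a\omega_a$, which is then bounded by $4C''\epsilon^2s/k$. The following quick check (illustrative, with randomly drawn propensities) confirms the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(4)

k, s, eps = 40, 5, 0.1
pi = rng.dirichlet(np.ones(k))             # arbitrary propensities summing to one
omega = np.zeros(k)
omega[rng.choice(k, s, replace=False)] = 1.0  # an s-sparse perturbation direction

def chi2(p, q):
    return np.sum(p ** 2 / q) - 1.0

# Joint pmf over (a, y): p(a, y) = pi_a * Bern(y; 1/2 + omega_a * eps)
mu0, mu_w = np.full(k, 0.5), 0.5 + omega * eps
p0 = np.concatenate([pi * (1 - mu0), pi * mu0])
pw = np.concatenate([pi * (1 - mu_w), pi * mu_w])

print(f"chi2 direct     : {chi2(pw, p0):.10f}")
print(f"4 eps^2 <pi, w> : {4 * eps ** 2 * np.dot(pi, omega):.10f}")
```

The two printed values agree to machine precision, so the $s/k$ factor in the distance bound comes entirely from the sparsity of $\omega$ and the near-uniformity of $\pi$.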
C.3 Proof of Theorem 5

Without loss of generality, we can assume $\delta_n,\epsilon_n\le1$ (the sequences converge to zero and are therefore bounded, and the constants $C_\pi$, $C_\mu$ can be adjusted accordingly). Also, note that the density of a distribution in our model can be written as $p(z)=f(x)\,\pi_a(x)\,\mu_a(x)$. To prove the theorem, we use Le Cam's method with fuzzy hypotheses. The following lemma presents a version based on the Hellinger distance, adapted from Theorem 2.15 of Tsybakov [2009].

Lemma 8 (Tsybakov [2009]). Let $P_\lambda$ and $Q_\lambda$ denote distributions in $\mathcal P$ indexed by a vector $\lambda=(\lambda_1,\dots,\lambda_k)$, with $n$-fold products denoted by $P_\lambda^n$ and $Q_\lambda^n$, respectively. Let $\varpi$ denote a prior distribution over $\lambda$. If
$$
H^2\!\left(\int P_\lambda^n\,d\varpi(\lambda),\int Q_\lambda^n\,d\varpi(\lambda)\right)\le\alpha<2
\quad\text{and}\quad
d\big(\theta(P_\lambda),\theta(Q_{\lambda'})\big)\ge\Delta>0
$$
for a semi-distance $d$, functional $\theta:\mathcal P\mapsto\mathbb R^k$, and all $\lambda,\lambda'$, then
$$
\inf_{\widehat\theta}\sup_{P\in\mathcal P}E_P\,\ell\big\{d(\widehat\theta,\theta(P))\big\}\ge\ell(\Delta/2)\left(\frac{1-\sqrt{\alpha(1-\alpha/4)}}{2}\right)
$$
for any monotonic non-negative loss function $\ell$.

Consequently, the main task will be to construct distributions $P_\lambda$ and $Q_\lambda$ that satisfy the conditions of the previous lemma. In particular, we must ensure that the difference in the functionals $\psi$ under these distributions is maximally large while ensuring that the distributions are sufficiently close in Hellinger distance. This can be achieved with a construction similar to the one in Jin and Syrgkanis [2025].

C.3.1 Construction

Let $S=\{a\in\{1,\dots,k\}\ |\ \psi_a\ne\theta\}$, where $k\notin S$. First, we define the null distribution $P_\lambda$. Let the density of $P_\lambda$ be
$$
p_\lambda(z)=\begin{cases}\widehat\pi_a(x)\,\widehat\mu_a(x)^y\{1-\widehat\mu_a(x)\}^{1-y}&\text{if }a\in S,\\[2pt]\widehat\pi_a(x)\,\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}&\text{if }a\notin S,\end{cases}
$$
for $x\in[0,1]$, where $\widetilde\mu_a$ is a function satisfying $\|\widehat\mu_a-\widetilde\mu_a\|\le C_\mu\epsilon_n$ and $\int\widetilde\mu_a(x)\,dP(x)=\theta$ for all $a\notin S$. Then the parameter under the null distribution is
$$
\psi_a(P_\lambda)=\begin{cases}\int\widehat\mu_a(x)\,dx&\text{if }a\in S,\\ \theta&\text{if }a\notin S.\end{cases}
$$
Next, we define the alternative distribution $Q_\lambda$. This must be done separately for the cases $\delta_n\lesssim\epsilon_n$ and $\epsilon_n\lesssim\delta_n$.

Case $\delta_n\lesssim\epsilon_n$: Assume that $\delta_n\le D\epsilon_n$ for all $n\in\mathbb N$ for some constant $D>0$. Let $\lambda=(\lambda_1,\dots,\lambda_m)$ with $\lambda_j\in\{1,-1\}$ for all $j$. We define the density of the alternative distribution by setting, for $a\in S$,
$$
\pi_{a,\lambda}(x)=\widehat\pi_a(x)\left\{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)\right\},
\qquad
\mu_{a,\lambda}(x)=\frac{\widehat\mu_a(x)+\beta\sum_{j=1}^m\lambda_jB_j(x)}{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)},
$$
where
$$
\alpha=\frac{\varepsilon^2C_\pi}{\max\{1,2C_\pi,2DC_\pi/C_\mu\}}\,\delta_n,\qquad\beta=\frac{C_\mu}{4}\,\epsilon_n,
$$
and
$$
B_j(x)=\mathbb{1}(x\in b_{2j})-\mathbb{1}(x\in b_{2j-1}),\qquad j=1,\dots,m,
$$
for $2m$ cubes $b_j\subseteq\mathbb R^d$ such that $b_j\cap b_{j'}=\emptyset$, $[0,1]^d=\bigcup_{j=1}^{2m}b_j$, and
$$
\int B_j(x)\,dx=0,\qquad\int B_j(x)^2\,dx=2\,\mathrm{vol}(b_1)=\frac1m.
$$
Note that this also implies $\int B_j(x)^3\,dx=\int B_j(x)\,dx=0$. When $a\notin S$ and $a\ne k$, we set $\pi_{a,\lambda}(x)=\widehat\pi_a(x)$ and $\mu_{a,\lambda}(x)=\widetilde\mu_a(x)$. When $a=k$, we define
$$
\pi_{a,\lambda}(x)=1-\sum_{j=1}^{k-1}\pi_{j,\lambda}(x),\qquad\mu_{a,\lambda}(x)=\widetilde\mu_a(x).
$$
The constructed distribution lies in the model because, for $a\in S$,
$$
\left\|\frac{\pi_{a,\lambda}}{\widehat\pi_a}-1\right\|^2=\int\frac{\alpha^2}{\widehat\mu_a(x)^2}\sum_{j=1}^mB_j(x)^2\,dx\le\frac{\alpha^2}{\varepsilon^2}\sum_{j=1}^m\int B_j(x)^2\,dx=\frac{\alpha^2}{\varepsilon^2}\cdot m\cdot\frac1m=\frac{\alpha^2}{\varepsilon^2}\le(C_\pi\delta_n)^2
$$
and
$$
\|\widehat\mu_a-\mu_{a,\lambda}\|^2
=\int\left(\widehat\mu_a(x)-\frac{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)}{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)}\right)^{\!2}dx
=\int\left(\frac{-\alpha\sum_j\lambda_jB_j(x)-\beta\sum_j\lambda_jB_j(x)}{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)}\right)^{\!2}dx
$$
$$
=(\alpha+\beta)^2\int\frac{\sum_{j=1}^mB_j(x)^2}{\big(1-\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\big)^2}\,dx\le4(\alpha+\beta)^2\le16\beta^2\le(C_\mu\epsilon_n)^2,
$$
where the last inequalities use the fact that
$$
\frac{\alpha}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)\le\frac\alpha\varepsilon\le\frac12
$$
due to our choice of $\alpha$. Moreover, the propensity scores are almost uniform, as
$$
\frac{3C_1}{4k}\le\widehat\pi_a(X)\left(1-\frac\alpha\varepsilon\right)\le\pi_{a,\lambda}(X)\le\widehat\pi_a(X)\left(1+\frac\alpha\varepsilon\right)\le\frac{5C_2}{4k}.
$$
For $a\notin S$, $a\ne k$, the required rate conditions are immediately satisfied, and the propensity scores are nearly uniform by assumption.
For $a=k$, $\|\widehat\mu_a-\mu_{a,\lambda}\|^2\le(C_\mu\epsilon_n)^2$ and
$$
\left\|\frac{\pi_{k,\lambda}}{\widehat\pi_k}-1\right\|^2
=\left\|\frac{1-\sum_{t\notin S,t\ne k}\widehat\pi_t(x)-\sum_{t\in S}\widehat\pi_t(x)\big\{1-\frac{\alpha}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)\big\}}{1-\sum_{t=1}^{k-1}\widehat\pi_t(x)}-1\right\|^2
=\left\|\frac{\sum_{t\in S}\widehat\pi_t(x)\,\frac{\alpha}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)}{\widehat\pi_k(x)}\right\|^2
$$
$$
\le\frac{1}{\varepsilon^2}\int\Big(\sum_{j=1}^m\lambda_jB_j(x)\Big)^2\Big(\sum_{t\in S}\widehat\pi_t(x)\,\frac{\alpha}{\widehat\mu_t(x)}\Big)^2dx
\le\frac{1}{\varepsilon^2}\int\sum_{j=1}^mB_j(x)^2\,\frac{\alpha^2}{\varepsilon^2}\,dx=\frac{\alpha^2}{\varepsilon^4}\le(C_\pi\delta_n)^2.
$$
In addition, we have
$$
\pi_{k,\lambda}(x)=\widehat\pi_k(x)+\sum_{t\in S}\widehat\pi_t(x)\,\frac{\alpha}{\widehat\mu_t(x)}\sum_{j=1}^m\lambda_jB_j(x)\ge\varepsilon-\frac\alpha\varepsilon\ge\frac\varepsilon2=\varepsilon'
$$
for almost every $x$.

Case $\epsilon_n\lesssim\delta_n$: Suppose that $\epsilon_n\le D\delta_n$ for some constant $D>0$ and for all $n\in\mathbb N$. In this case, we define the alternative, for $a\in S$, as
$$
\mu_{a,\lambda}(x)=\frac{\widehat\mu_a(x)}{1+\frac{\beta}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)-\alpha\beta},
\qquad
\pi_{a,\lambda}(x)=\left\{1+\frac{\beta}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)-\alpha\beta\right\}\left\{\widehat\pi_a(x)+\alpha\,\widehat\pi_a(x)\widehat\mu_a(x)\sum_{j=1}^m\lambda_jB_j(x)\right\}.
$$
We use analogous definitions for $a\notin S$ as in the previous case $\delta_n\lesssim\epsilon_n$. Now choose
$$
\alpha=\frac{\varepsilon C_\pi}{4\max\{1,C_\pi\}}\,\delta_n,
\qquad
\beta=\frac{\varepsilon^2C_\mu}{4\max\{1,C_\mu,DC_\mu/C_\pi,DC_\mu\}}\,\epsilon_n.
$$
Then, for $a\in S$,
$$
\|\widehat\mu_a-\mu_{a,\lambda}\|^2\le4\int\Big(\beta\sum_{j=1}^m\lambda_jB_j(x)-\alpha\beta\,\widehat\mu_a(x)\Big)^2dx\le8(\beta^2+\alpha^2\beta^2)\le16\beta^2\le(C_\mu\epsilon_n)^2
$$
and
$$
\left\|\frac{\pi_{a,\lambda}}{\widehat\pi_a}-1\right\|^2
=\left\|\Big\{1+\frac{\beta}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)-\alpha\beta\Big\}\Big\{1+\alpha\widehat\mu_a(x)\sum_j\lambda_jB_j(x)\Big\}-1\right\|^2
\le\frac{(\beta+\alpha\beta+\alpha)^2}{\varepsilon^2}\le(C_\pi\delta_n)^2.
$$
Note that the propensity scores are nearly uniform under this alternative distribution since, for $a\in S$,
$$
\frac{C_1}{4k}\le\widehat\pi_a(X)\Big(1-\frac\beta\varepsilon-\alpha\beta\Big)(1-\alpha)\le\pi_{a,\lambda}(X)\le\widehat\pi_a(X)\Big(1+\frac\beta\varepsilon+\alpha\beta\Big)(1+\alpha)\le\frac{9C_2}{4k}.
$$
When $a=k$, we obtain
$$
\left\|\frac{\pi_{k,\lambda}}{\widehat\pi_k}-1\right\|^2
=\left\|\frac{\sum_{t\in S}\Big[\big\{\frac{\beta}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)-\alpha\beta\big\}\big\{\widehat\pi_t(x)+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\big\}+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big]}{\widehat\pi_k(x)}\right\|^2
$$
$$
\le\frac{1}{\varepsilon^2}\int\left[\sum_{t\in S}\Big\{\widehat\pi_t(x)\Big(\frac\beta\varepsilon+\alpha^2\beta(1-\varepsilon)\Big)+2\widehat\pi_t(x)\alpha\beta+\widehat\pi_t(x)\alpha(1-\varepsilon)\Big\}\right]^2dx
=\frac{1}{\varepsilon^2}\int\Big(\frac\beta\varepsilon+\alpha^2\beta+2\alpha\beta+\alpha\Big)^2\Big(\sum_{t\in S}\widehat\pi_t(x)\Big)^2dx
$$
$$
\le\Big(\frac{\beta}{\varepsilon^2}+\frac\beta\varepsilon+\frac{2\beta}\varepsilon+\frac\alpha\varepsilon\Big)^2\le(C_\pi\delta_n)^2.
$$
Additionally,
$$
\pi_{k,\lambda}(x)=\widehat\pi_k(x)-\sum_{t\in S}\Big[\Big\{\frac{\beta}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)-\alpha\beta\Big\}\Big\{\widehat\pi_t(x)+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big\}+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big]
\ge\varepsilon-\Big(\frac\beta\varepsilon+\alpha^2\beta+2\alpha\beta+\alpha\Big)\ge\varepsilon-\frac34\varepsilon=\frac\varepsilon4=\varepsilon'.
$$
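Before turning to the separation, one can check numerically that the perturbation in the case $\delta_n\lesssim\epsilon_n$ indeed stays inside the assumed nuisance neighborhoods. The sketch below is illustrative only: it assumes $d=1$, a single $a\in S$, constant $\widehat\pi_a\equiv1/k$ and $\widehat\mu_a\equiv1/2$, and concrete values for $\varepsilon$, $C_\pi$, $C_\mu$, $D$, $\delta_n$, $\epsilon_n$; it builds the bumps $B_j$ on an even grid of $[0,1]$ and verifies the two $L_2$ bounds.

```python
import numpy as np

rng = np.random.default_rng(5)

# Grid on [0,1] and 2m cells of equal width, so that int B_j^2 dx = 1/m.
m, G = 50, 100000
x = (np.arange(G) + 0.5) / G
edges = np.linspace(0, 1, 2 * m + 1)
cell = np.minimum(np.digitize(x, edges) - 1, 2 * m - 1)  # which cube b_j contains x

lam = rng.choice([-1.0, 1.0], size=m)
Bsum = np.zeros(G)  # sum_j lambda_j B_j(x): -lam_j on b_{2j-1}, +lam_j on b_{2j}
for j in range(m):
    Bsum[cell == 2 * j] = -lam[j]
    Bsum[cell == 2 * j + 1] = lam[j]

# Illustrative constants (assumptions for this sketch, not the theorem's values).
k, eps_lb, C_pi, C_mu, D = 20, 0.1, 1.0, 1.0, 1.0
delta_n, eps_n = 0.02, 0.05  # delta_n <= D * eps_n
alpha = eps_lb ** 2 * C_pi * delta_n / max(1.0, 2 * C_pi, 2 * D * C_pi / C_mu)
beta = C_mu * eps_n / 4

pi_hat = np.full(G, 1.0 / k)
mu_hat = np.full(G, 0.5)
pi_pert = pi_hat * (1 - alpha / mu_hat * Bsum)
mu_pert = (mu_hat + beta * Bsum) / (1 - alpha / mu_hat * Bsum)

l2 = lambda f: np.sqrt(np.mean(f ** 2))  # L2([0,1]) norm on the grid
print(f"||pi/pi_hat - 1|| = {l2(pi_pert / pi_hat - 1):.5f}  (bound C_pi*delta_n = {C_pi * delta_n:.5f})")
print(f"||mu_hat - mu||   = {l2(mu_hat - mu_pert):.5f}  (bound C_mu*eps_n   = {C_mu * eps_n:.5f})")
print(f"mu_pert in (0,1): {bool((mu_pert > 0).all() and (mu_pert < 1).all())}")
```

Both norms land well inside the stated bounds, and the perturbed regression stays a valid Bernoulli mean, which is what membership in the model requires.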
C.3.2 Functional Separation

Case $\delta_n\lesssim\epsilon_n$: Under $Q_\lambda$, for $a\in S$, the functional is
$$
\int\frac{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)}{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)}\,dP(x)
=\int\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}\sum_{\kappa=0}^\infty\Big\{\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa\,dx,
$$
by using the fact that $\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)<1$. Considering the first three summands of the series separately (and using that $(\sum_j\lambda_jB_j)^2=\sum_jB_j^2$ by the disjoint supports), the previous line equals
$$
\int\widehat\mu_a(x)+(\alpha+\beta)\sum_j\lambda_jB_j(x)+\frac{\alpha\beta+\alpha^2}{\widehat\mu_a(x)}\sum_jB_j(x)^2+\frac{\beta\alpha^2}{\widehat\mu_a(x)^2}\sum_j\lambda_jB_j(x)^3
+\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}\sum_{\kappa=3}^\infty\Big\{\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa\,dx.
$$
Therefore, the difference in the functionals under $P_\lambda$ and $Q_\lambda$ is
$$
\int\frac{\alpha\beta+\alpha^2}{\widehat\mu_a(x)}\sum_jB_j(x)^2+\frac{\beta\alpha^2}{\widehat\mu_a(x)^2}\sum_j\lambda_jB_j(x)^3
+\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}\sum_{\kappa=3}^\infty\Big\{\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa\,dx
\ge\int\frac{\alpha\beta+\alpha^2}{\widehat\mu_a(x)}-\frac{\beta\alpha^2}{\widehat\mu_a(x)^2}
+\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}\sum_{\kappa=3}^\infty\Big\{\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa\,dx.
$$
Since $\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\le1/2$, we can use the formula $\sum_{\kappa=3}^\infty t^\kappa=\frac{t^3}{1-t}$. Hence, the previous line equals
$$
\int\frac{\alpha\beta+\alpha^2}{\widehat\mu_a(x)}-\frac{\beta\alpha^2}{\widehat\mu_a(x)^2}
+\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}\frac{\big\{\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\big\}^3}{1-\frac{\alpha}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)}\,dx
\ge\int\frac{\alpha\beta+\alpha^2}{\widehat\mu_a(x)}-\frac{\beta\alpha^2}{\widehat\mu_a(x)^2}-\frac{2\alpha^3}{\widehat\mu_a(x)^2}\,dx
\ge\int\frac{(\alpha\beta+\alpha^2)\varepsilon-(\alpha\beta+\alpha^2)\alpha-\alpha^3}{\widehat\mu_a(x)^2}\,dx.
$$
Now, since $\alpha\le\varepsilon/2$ and $\alpha^3\le(\varepsilon/2)\alpha^2$, we can bound the functional separation below by
$$
\ge\int\frac{\alpha\beta\,(\varepsilon/2)}{\widehat\mu_a(x)^2}\,dx\ge\frac\varepsilon2\,\alpha\beta
$$
for $a\in S$. For $a\notin S$, the functional equals $\theta$ under both the null and alternative distributions; hence the functional separation of $\psi_a$ is zero in that case.

Case $\epsilon_n\lesssim\delta_n$: In this case, for $a\in S$, the difference of the functionals equals
$$
\int\left(\frac{\widehat\mu_a(x)}{1+\frac{\beta}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)-\alpha\beta}-\widehat\mu_a(x)\right)dx
=\int\widehat\mu_a(x)\sum_{\kappa=0}^\infty\beta^\kappa\Big\{\alpha-\frac{1}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa-\widehat\mu_a(x)\,dx
$$
$$
=\int\widehat\mu_a(x)\,\beta\Big\{\alpha-\frac{1}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}
+\widehat\mu_a(x)\sum_{\kappa=2}^\infty\beta^\kappa\Big\{\alpha-\frac{1}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\Big\}^\kappa\,dx
$$
$$
\ge\varepsilon\alpha\beta+\int\widehat\mu_a(x)\,\frac{\beta^3\big\{\alpha-\frac{1}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\big\}^3}{1-\beta\big\{\alpha-\frac{1}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)\big\}}\,dx
\ge\varepsilon\alpha\beta-\frac{2\beta^3}{\varepsilon}\ge\varepsilon\alpha\beta-\frac{2\beta}{\varepsilon}\,\alpha\beta\ge\varepsilon\alpha\beta-\frac\varepsilon2\,\alpha\beta=\frac\varepsilon2\,\alpha\beta,
$$
where the series expansion in the second line and the inequality in the fourth step rely on
$$
\beta\left|\alpha-\frac{1}{\widehat\mu_a(x)}\sum_{j=1}^m\lambda_jB_j(x)\right|\le\alpha\beta+\frac\beta\varepsilon\le\frac14+\frac14=\frac12.
$$
C.3.3 Distributional Distance

Next, we verify that the constructed distributions $P_\lambda$ and $Q_\lambda$ are close. More specifically, our goal is to bound the Hellinger distance by $1/2$. This is done using the following lemma from Robins et al. [2009], which allows one to bound the Hellinger distance between $n$-fold products of mixtures using single-observation densities.

Lemma 9. Let $P_\lambda$ and $Q_\lambda$ denote distributions indexed by a vector $\lambda=(\lambda_1,\dots,\lambda_m)$, and let $\mathcal Z=\bigcup_{j=1}^m\mathcal Z_j$ denote a partition of the sample space. Assume

1. $P_\lambda(\mathcal Z_j)=Q_\lambda(\mathcal Z_j)=p_j$ for all $\lambda$, and
2. the conditional distributions $\mathbb{1}_{\mathcal Z_j}dP_\lambda/p_j$ and $\mathbb{1}_{\mathcal Z_j}dQ_\lambda/p_j$ (given an observation is in $\mathcal Z_j$) do not depend on $\lambda_l$ for $l\ne j$.

For a prior distribution $\varpi$ over $\lambda$, let $\bar p=\int p_\lambda\,d\varpi(\lambda)$ and $\bar q=\int q_\lambda\,d\varpi(\lambda)$, and define
$$
\delta_1=\max_{j\in\{1,\dots,m\}}\sup_\lambda\int_{\mathcal Z_j}\frac{(p_\lambda-\bar p)^2}{p_\lambda\,p_j}\,d\nu,\qquad
\delta_2=\max_{j\in\{1,\dots,m\}}\sup_\lambda\int_{\mathcal Z_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda\,p_j}\,d\nu,\qquad
\delta_3=\max_{j\in\{1,\dots,m\}}\sup_\lambda\int_{\mathcal Z_j}\frac{(\bar q-\bar p)^2}{p_\lambda\,p_j}\,d\nu
$$
for a dominating measure $\nu$. If $\bar p/p_\lambda\le b<\infty$ and $np_j\max(1,\delta_1,\delta_2)\le b$ for all $j$, then
$$
H^2\!\left(\int P_\lambda^{\otimes n}\,d\varpi(\lambda),\int Q_\lambda^{\otimes n}\,d\varpi(\lambda)\right)\le Cn\left\{n\Big(\max_{j\in\{1,\dots,m\}}p_j\Big)\big(\delta_1\delta_2+\delta_2^2\big)+\delta_3\right\}
$$
for a constant $C$ that depends on $b$.

Case $\delta_n\lesssim\epsilon_n$: We want to verify the assumptions of the previous lemma. We first calculate the densities $p_\lambda,q_\lambda$ and $\bar p,\bar q$. We have
$$
p_\lambda(z)=\mathbb{1}(a\in S)\,\widehat\pi_a(x)\widehat\mu_a(x)^y\{1-\widehat\mu_a(x)\}^{1-y}
+\mathbb{1}(a\notin S,a\ne k)\,\widehat\pi_a(x)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}
+\mathbb{1}(a=k)\Big\{1-\sum_{t=1}^{k-1}\widehat\pi_t(x)\Big\}\widetilde\mu_k(x)^y\{1-\widetilde\mu_k(x)\}^{1-y}.
$$
In particular, $p_\lambda$ does not depend on $\lambda$, hence $\bar p=\int p_\lambda\,d\varpi(\lambda)=p_\lambda$. Further, multiplying out the perturbations for $a\in S$, we have
$$
q_\lambda(z)=\mathbb{1}(a\in S)\,\widehat\pi_a(x)\Big\{\widehat\mu_a(x)+\beta\sum_j\lambda_jB_j(x)\Big\}^y\Big\{1-\widehat\mu_a(x)-\Big(\beta+\frac{\alpha}{\widehat\mu_a(x)}\Big)\sum_j\lambda_jB_j(x)\Big\}^{1-y}
$$
$$
+\mathbb{1}(a\notin S,a\ne k)\,\widehat\pi_a(x)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}
+\mathbb{1}(a=k)\Big[1-\sum_{t\in S}\widehat\pi_t(x)\Big\{1-\frac{\alpha}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)\Big\}-\sum_{t\notin S,t\ne k}\widehat\pi_t(x)\Big]\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}.
$$
Choose the uniform prior $\varpi(\lambda)=\frac{1}{2^m}\prod_j\mathbb{1}(\lambda_j\in\{1,-1\})$. Then
$$
\bar q(z)=\int q_\lambda(z)\,d\varpi(\lambda)
=\mathbb{1}(a\in S)\,\widehat\pi_a(x)\widehat\mu_a(x)^y\{1-\widehat\mu_a(x)\}^{1-y}
+\mathbb{1}(a\notin S,a\ne k)\,\widehat\pi_a(x)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}
+\mathbb{1}(a=k)\Big\{1-\sum_{t=1}^{k-1}\widehat\pi_t(x)\Big\}\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y},
$$
by using the fact that $q_\lambda$ is linear in $\sum_j\lambda_jB_j$ and $\int f(z)\sum_j\lambda_jB_j(x)\,d\varpi(\lambda)=0$ due to the choice of prior. Now, let the partition of the lemma be given by
$$
\mathcal Z_j=\{0,1\}\times\{1,\dots,k\}\times\big(b_{2j-1}\cup b_{2j}\big).
$$
Then we have $p_j=P_\lambda(\mathcal Z_j)=Q_\lambda(\mathcal Z_j)=2\,\mathrm{vol}(b_1)=\frac1m$ by the assumption on the size of the cubes, i.e., $\int\mathbb{1}(x\in b_j)\,dx=\frac{1}{2m}$, and the fact that the marginal distribution of $X$ under both $P_\lambda$ and $Q_\lambda$ is uniform. Hence, Assumption 1 of Lemma 9 is satisfied. Assumption 2 is also satisfied since $\sum_j\lambda_jB_j(x)$ depends only on $\lambda_s$ whenever $x\in\mathcal Z_s$, because $x\in\mathcal Z_s$ implies $B_j(x)=0$ for all $j\ne s$.
We can now characterize the constants $\delta_1$, $\delta_2$, and $\delta_3$ in Lemma 9. Since $p_\lambda=\bar p$, we have $\delta_1=0$. Moreover, $\bar q=\bar p$, so also $\delta_3=0$. It remains to determine $\delta_2$. We have
$$
\delta_2=\max_{j}\sup_\lambda\int_{\mathcal Z_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda\,p_j}\,d\nu
=m\,\max_{j}\sup_\lambda\int_{\mathcal Z_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda}\,d\nu.
$$
Now, note that
$$
p_\lambda(z)\ge\mathbb{1}(a\in S)\,\varepsilon\,\widehat\pi_a(x)+\mathbb{1}(a=k)\,\widehat\pi_k(x)\widetilde\mu_k(x)^y\{1-\widetilde\mu_k(x)\}^{1-y}.
$$
Then
$$
\frac{(q_\lambda(z)-p_\lambda(z))^2}{p_\lambda(z)}
\le\mathbb{1}(a\in S)\,\frac{\widehat\pi_a(x)}{\varepsilon}\Big\{\beta^2\sum_jB_j(x)^2\Big\}^y\Big\{\Big(\beta+\frac{\alpha}{\widehat\mu_a(x)}\Big)^2\sum_jB_j(x)^2\Big\}^{1-y}
+\mathbb{1}(a=k)\Big\{\sum_{t\in S}\frac{\widehat\pi_t(x)}{\sqrt{\widehat\pi_k(x)}}\,\frac{\alpha}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)\Big\}^2\widetilde\mu_k(x)^y\{1-\widetilde\mu_k(x)\}^{1-y}
$$
$$
\le\mathbb{1}(a\in S)\,\frac1\varepsilon\,\beta^{2y}\Big\{2\beta^2+\frac{2\alpha^2}{\widehat\mu_a(x)^2}\Big\}^{1-y}\widehat\pi_a(x)\sum_jB_j(x)^2
+\mathbb{1}(a=k)\,(1-\varepsilon)\,\frac{\alpha^2}{\varepsilon^2}\,\frac1\varepsilon\sum_jB_j(x)^2
$$
$$
\le\mathbb{1}(a\in S)\,\frac{2}{\varepsilon^3}(\beta^2+\alpha^2)\,\widehat\pi_a(x)\sum_jB_j(x)^2
+\mathbb{1}(a=k)\,\frac{1-\varepsilon}{\varepsilon^3}\,\alpha^2\sum_jB_j(x)^2
\le\frac{2}{\varepsilon^3}(\alpha^2+\beta^2)\Big\{\sum_jB_j(x)^2\Big\}\big\{\mathbb{1}(a\in S)\,\widehat\pi_a(x)+\mathbb{1}(a=k)\big\}.
$$
Since the previous line does not depend on $\lambda$, when choosing the dominating measure $\nu$ as the Lebesgue measure, we obtain
$$
\delta_2\le m\,\max_j\int_{\mathcal Z_j}\frac{(q_\lambda-p_\lambda)^2}{p_\lambda}\,d\nu
\le m\,\max_j\int_{b_{2j-1}\cup b_{2j}}\sum_{a=1}^k\sum_{y=0}^1\frac{2}{\varepsilon^3}(\alpha^2+\beta^2)\sum_{r=1}^mB_r(x)^2\big\{\mathbb{1}(a\in S)\,\widehat\pi_a(x)+\mathbb{1}(a=k)\big\}\,dx
$$
$$
\le m\,\max_j\int_{b_{2j-1}\cup b_{2j}}\frac{8}{\varepsilon^3}(\alpha^2+\beta^2)\sum_{r=1}^mB_r(x)^2\,dx
=\frac{8}{\varepsilon^3}(\alpha^2+\beta^2)\cdot m\cdot\frac1m
=\frac{8}{\varepsilon^3}(\alpha^2+\beta^2).
$$
Lastly, choose
$$
m=\max\left\{n,\ 2Cn^2\Big(\frac{8}{\varepsilon^3}\Big)^2(\alpha^2+\beta^2)^2\right\},
$$
where $C$ is the constant given by Lemma 9 when applied with $b=\max\{1,\frac{8}{\varepsilon^3}(C_\mu^2+C_\pi^2)\}$. Now we have $\bar p/p_\lambda=1$ and
$$
np_j\max(1,\delta_1,\delta_2)\le\frac nm\max\Big\{1,\frac{8}{\varepsilon^3}(C_\mu^2+C_\pi^2)\Big\}\le\max\Big\{1,\frac{8}{\varepsilon^3}(C_\mu^2+C_\pi^2)\Big\}.
$$
Therefore, we can apply Lemma 9 with $b$ defined as above and obtain
$$
H^2\!\left(\int P_\lambda^{\otimes n}\,d\varpi(\lambda),\int Q_\lambda^{\otimes n}\,d\varpi(\lambda)\right)\le\frac{Cn^2}{m}\,\frac{64}{\varepsilon^6}(\alpha^2+\beta^2)^2\le\frac12.
$$

Case $\epsilon_n\lesssim\delta_n$: First compute that
$$
q_\lambda(z)=\mathbb{1}(a\in S)\Big\{\widehat\pi_a(x)+\alpha\widehat\pi_a(x)\widehat\mu_a(x)\sum_j\lambda_jB_j(x)\Big\}\widehat\mu_a(x)^y\Big\{1+\frac{\beta}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)-\alpha\beta-\widehat\mu_a(x)\Big\}^{1-y}
$$
$$
+\mathbb{1}(a\notin S,a\ne k)\,\widehat\pi_a(x)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}
+\mathbb{1}(a=k)\Big(1-\sum_{t=1}^{k-1}\widehat\pi_t(x)-\sum_{t\in S}\Big[\Big\{\frac{\beta}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)-\alpha\beta\Big\}\Big\{\widehat\pi_t(x)+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big\}+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big]\Big)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}.
$$
Therefore,
$$
\bar q_\lambda(z)=\mathbb{1}(a\in S)\,\widehat\pi_a(x)\widehat\mu_a(x)^y\{1-\widehat\mu_a(x)\}^{1-y}
+\mathbb{1}(a\notin S,a\ne k)\,\widehat\pi_a(x)\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}
+\mathbb{1}(a=k)\Big\{1-\sum_{t=1}^{k-1}\widehat\pi_t(x)\Big\}\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y},
$$
using $\sum_jB_j(x)^2=1$ for almost every $x$, so that the $\alpha\beta$ terms cancel upon averaging over the prior.
Hence, we again obtain $\delta_1=0$ and $\delta_3=0$. It remains to determine $\delta_2$. Using again that
$$
p_\lambda(z)\ge\mathbb{1}(a\in S)\,\varepsilon\,\widehat\pi_a(x)+\mathbb{1}(a=k)\,\widehat\pi_k(x)\widetilde\mu_k(x)^y\{1-\widetilde\mu_k(x)\}^{1-y},
$$
we obtain
$$
\frac{(q_\lambda(z)-p_\lambda(z))^2}{p_\lambda(z)}
\le\mathbb{1}(a\in S)\,\frac{\widehat\pi_a(x)}{\varepsilon}\left[\Big\{1+\alpha\widehat\mu_a(x)\sum_j\lambda_jB_j(x)\Big\}\widehat\mu_a(x)^y\Big\{1+\frac{\beta}{\widehat\mu_a(x)}\sum_j\lambda_jB_j(x)-\alpha\beta-\widehat\mu_a(x)\Big\}^{1-y}-\widehat\mu_a(x)^y\{1-\widehat\mu_a(x)\}^{1-y}\right]^2
$$
$$
+\mathbb{1}(a=k)\,\frac{\widetilde\mu_a(x)^y\{1-\widetilde\mu_a(x)\}^{1-y}}{\widehat\pi_k(x)}\left(\sum_{t\in S}\Big[\Big\{\frac{\beta}{\widehat\mu_t(x)}\sum_j\lambda_jB_j(x)-\alpha\beta\Big\}\Big\{\widehat\pi_t(x)+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big\}+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\sum_j\lambda_jB_j(x)\Big]\right)^2
$$
$$
\le\mathbb{1}(a\in S)\,\frac{\widehat\pi_a(x)}{\varepsilon}\left[\sum_j\lambda_jB_j(x)\Big\{\frac{\beta}{\widehat\mu_a(x)}+\alpha\widehat\mu_a(x)-\alpha^2\beta\,\widehat\mu_a(x)-\alpha\widehat\mu_a(x)^2\Big\}\right]^2
+\mathbb{1}(a=k)\,\frac{1-\varepsilon}{\varepsilon}\left[\sum_j\lambda_jB_j(x)\sum_{t\in S}\Big\{\beta\,\frac{\widehat\pi_t(x)}{\widehat\mu_t(x)}-\alpha^2\beta\,\widehat\pi_t(x)\widehat\mu_t(x)+\alpha\widehat\pi_t(x)\widehat\mu_t(x)\Big\}\right]^2
$$
$$
\le\mathbb{1}(a\in S)\,\frac1\varepsilon\Big(\frac\beta\varepsilon+\alpha\Big)^2\widehat\pi_a(x)\sum_jB_j(x)^2
+\mathbb{1}(a=k)\,(1-\varepsilon)\,\frac1\varepsilon\Big(\frac\beta\varepsilon+\alpha\Big)^2\sum_jB_j(x)^2
\le\frac{2}{\varepsilon^3}(\alpha^2+\beta^2)\Big\{\sum_jB_j(x)^2\Big\}\big\{\mathbb{1}(a\in S)\,\widehat\pi_a(x)+\mathbb{1}(a=k)\big\}.
$$
Consequently, $\delta_2$ satisfies the same bound as in the case $\delta_n\lesssim\epsilon_n$, which means that we can choose the same $m$ to bound the Hellinger distance.

C.3.4 Final Minimax Lower Bound

From Section C.3.2, we know that
$$
\sqrt{\sum_{a=1}^k\frac1k\{\psi_a(P_\lambda)-\psi_a(Q_{\lambda'})\}^2}
\ge\sqrt{\frac sk}\cdot\frac\varepsilon2\times
\begin{cases}
\left(\dfrac{\varepsilon^2C_\pi}{\max\{1,2C_\pi,2DC_\pi/C_\mu\}}\right)\left(\dfrac{C_\mu}{4}\right)\delta_n\epsilon_n&\text{for }\delta_n\lesssim\epsilon_n,\\[10pt]
\left(\dfrac{\varepsilon C_\pi}{4\max\{1,C_\pi\}}\right)\left(\dfrac{\varepsilon^2C_\mu}{4\max\{1,C_\mu,DC_\mu/C_\pi,DC_\mu\}}\right)\delta_n\epsilon_n&\text{for }\epsilon_n\lesssim\delta_n,
\end{cases}
\quad\ge\sqrt{C'}\,\sqrt{\frac sk}\,\delta_n\epsilon_n=:\Delta
$$
for
$$
C'=\frac{\varepsilon^2}{4}\min\left\{\left(\frac{\varepsilon^2C_\pi}{\max\{1,2C_\pi,2DC_\pi/C_\mu\}}\right)^2\left(\frac{C_\mu}{4}\right)^2;\ \left(\frac{\varepsilon C_\pi}{4\max\{1,C_\pi\}}\right)^2\left(\frac{\varepsilon^2C_\mu}{4\max\{1,C_\mu,DC_\mu/C_\pi,DC_\mu\}}\right)^2\right\},
$$
and from Section C.3.3 that
$$
H^2\!\left(\int P_\lambda^{\otimes n}\,d\varpi(\lambda),\int Q_\lambda^{\otimes n}\,d\varpi(\lambda)\right)\le\frac12=:\alpha.
$$
We can now apply Lemma 8 with $d(x,y)=\sqrt{\frac1k\sum_{a=1}^k(x_a-y_a)^2}$ for this $\Delta$ and $\alpha$ and any $P\in\mathcal P$, which gives
$$
\inf_{\widehat\psi}\sup_{P\in\mathcal P}E_P\!\left\{\sum_{a=1}^k\varpi_a(P)\big(\widehat\psi_a-\psi_a(P)\big)^2\right\}
\ge\min\{C_1',\varepsilon'\}\Big(\frac\Delta2\Big)^2\left(\frac{1-\sqrt{\alpha(1-\alpha/4)}}{2}\right)
\ge\min\{C_1',\varepsilon'\}\,C'\,\frac{4-\sqrt7}{8}\cdot\frac sk(\delta_n\epsilon_n)^2
$$
by using the fact that $\varpi_a(P)\ge\min\{C_1',\varepsilon'\}/k$ uniformly across all $a\in\{1,\dots,k\}$ and $P\in\mathcal P$. This proves Theorem 5.
