Nonparametric Estimation of Mediation Effects with A General Treatment

Lukang Huang†, School of Statistics & Data Science, Nankai University
Wei Huang‡, School of Mathematics and Statistics, University of Melbourne
Oliver Linton§, Faculty of Economics, University of Cambridge
Zheng Zhang¶, Center for Applied Statistics, Institute of Statistics & Big Data, Renmin University of China

Abstract

To investigate causal mechanisms, causal mediation analysis decomposes the total treatment effect into the natural direct and indirect effects. This paper examines the estimation of the direct and indirect effects in a general treatment effect model, where the treatment can be binary, multi-valued, continuous, or a mixture. We propose generalized weighting estimators with weights estimated by solving an expanding set of equations. Under some sufficient conditions, we show that the proposed estimators are consistent and asymptotically normal. Specifically, when the treatment is discrete, the proposed estimators attain the semiparametric efficiency bounds. Meanwhile, when the treatment is continuous, the convergence rates of the proposed estimators are slower than $N^{-1/2}$; however, they are still more efficient than those constructed from the true weighting function. A simulation study reveals that our estimators exhibit a satisfactory finite-sample performance, while an application shows their practical value.

Keywords: Covariate balancing; Direct and indirect effects; General treatment; Semiparametric efficiency.

* The authors are alphabetically ordered.
† E-mail: lkhuang@nankai.edu.cn
‡ E-mail: wei.huang@unimelb.edu.au
§ E-mail: obl20@cam.ac.uk
¶ Corresponding author.
E-mail: zhengzhang@ruc.edu.cn

1 Introduction

One essential goal of program evaluation and scientific research is to understand why and how a treatment variable affects the potential outcomes of interest, going beyond the estimation of the average treatment effects (ATE). For example, when assessing the effect of participating in an academic and vocational institute on criminal behavior, investigators might want to separate the program's direct effects on criminal behavior, such as integrity or discipline, from its effects relayed through employment chances, which may indirectly affect criminal behavior. In this regard, causal mediation analysis plays an important role by decomposing the total treatment effect into the natural direct effect and the indirect effects mediated through an intermediate variable, called the mediator (Robins & Greenland 1992, Pearl 2001). Such an approach has been widely used in a number of disciplines in the medical and social sciences (see, e.g., Baron & Kenny 1986, Imai et al. 2011, VanderWeele 2015).

A fundamental problem of treatment effect analysis is that, although a randomized trial is the gold standard for identifying treatment effects, it is often unavailable or even unethical. More usually, in practice, treatments are randomly assigned to the individuals based on some features that also relate to the mediator and outcome of interest, thus causing the confounding issue. Such features are called confounders. The literature on causal mediation analysis, which can identify the direct and indirect effects from data with confounding, has grown rapidly over the last decade, producing abundant approaches and extensions.
An important nonparametric identifiability condition called the sequential ignorability assumption, which establishes a minimum set of assumptions required to identify the direct and indirect treatment effects regardless of the statistical models used, is commonly imposed (Imai, Keele & Tingley 2010, Imai, Keele & Yamamoto 2010, Hsu et al. 2018). Most of the approaches focus on the binary treatment where an individual either receives or does not receive the treatment (see, e.g., Imai, Keele & Yamamoto 2010, Joffe et al. 2007, VanderWeele 2009, VanderWeele & Vansteelandt 2010, Huber 2014, Chan, Imai, Yam & Zhang 2016, Hsu et al. 2018, Liu et al. 2021). For example, based on the sequential ignorability assumption, Hsu et al. (2018) proposed nonparametric estimation of the binary natural direct and indirect effects by weighting observations using the inverse of estimated propensity scores.

While binary treatment effects have been extensively studied, the estimation of continuous treatment effects has also drawn considerable attention (see Hirano & Imbens (2004), Imai & van Dyk (2004), Galvao & Wang (2015), Kennedy et al. (2017), Fong et al. (2018), Dong et al. (2019), Colangelo & Lee (2020), Ai et al. (2021), Huang et al. (2022) among others). However, all of the above studies focus on the total treatment effect rather than analyzing the causal mediation effects. In recent work, Huber et al. (2020) studied the identification and estimation of the natural direct and indirect effects when the treatment variable is continuous. Their estimators are constructed from weighting observations using the estimated marginal density of the treatment and the inverse of two estimated conditional densities, namely the conditional density of the treatment given the confounder and that given both the confounder and the mediator. Singh et al.
(2021) identified the natural direct and indirect effects by the g-formula and estimated them using a Reproducing Kernel Hilbert Space approach. They showed that their estimators are consistent and provided the rate of convergence, which achieves the minimax rate.

In many applications, the treatment variable can be complex, being a mixture of discrete and continuous elements, or even multidimensional, comprising both discrete and continuous components. For example, when evaluating the impact of an academic and vocational training program on criminal behavior, the causal effect might depend not only on participation in the program but also on the duration of this participation. Additionally, it could depend on multiple factors such as the duration and number of training courses undertaken.

The contribution of this work is three-fold. First, we propose a general framework for estimating the direct and indirect effects, which unifies the binary, multi-valued, and continuous treatments as well as the mixture of discrete and continuous treatments. It can also be easily adapted to multidimensional treatment. Specifically, we integrate the sequential ignorability assumption for discrete treatments from Hsu et al. (2018) with that for continuous treatments in Huber et al. (2020), and use a sieve method to estimate the treatment effect parameters. This synthesis results in a more general form of identifying and estimating direct and indirect effects, enabling nonparametric estimation for discrete, continuous, mixed and multidimensional treatments.

Second, we demonstrate that the moment balancing method for estimating the stabilized weights, originally proposed for unconditional continuous treatment effects by Ai et al. (2021), can improve the mediation analysis in the existing framework. Specifically, Huber et al.
's (2020) mediation identification requires estimation of a weight that consists of the marginal density of the treatment and the inverse of two conditional densities, namely the conditional density of the treatment given the confounder and that given both the confounder and the mediator. Huber et al. (2020) estimate this weight using a ratio of kernel density estimators. However, it is known that inverse probability weighting is sensitive to estimated densities in the denominator, potentially leading to extreme weights and unstable results (see, e.g., Fong et al. 2018, Ai et al. 2021). By reformulating the weight estimation, we apply the moment balancing idea to estimate the weight as a whole, significantly reducing the likelihood of extreme weights and enhancing the accuracy of the results.

Third, we verify that the proposed estimators of the direct and indirect effects are consistent and asymptotically normal after appropriate normalization. Specifically, the proposed estimators attain the semiparametric efficiency bounds when the treatment is discrete. Meanwhile, when the treatment is continuous, the convergence rates of the proposed nonparametric estimators are slower than $N^{-1/2}$. However, we show that our proposed estimators achieve an asymptotic variance smaller than or equal to that of the estimators constructed from the true weighting function, provided that the tuning parameters are selected to ensure equivalent convergence rates. Equality is attained only when the conditional mean of the observed outcome given the observed treatment, mediators and confounders is always zero.

The remainder of the paper is structured as follows. Section 2 establishes the basic framework and Section 3 presents the estimation procedure. In Section 4, we derive the asymptotic properties of the proposed estimators and their efficiency results.
Section 5 constructs consistent estimators for the asymptotic variances based on plug-in approaches. In Section 6, we propose data-driven approaches for selecting the tuning parameters. Subsequently, Section 7 reports the results of a simulation study and Section 8 applies the proposed estimation method to analyze the effect of Job Corps, an educational and vocational training program, on criminal activity. Finally, Section 9 presents the study's conclusions.

2 Basic Framework

Let $T$ denote the observed treatment variable with support $\mathcal{T} \subset \mathbb{R}$ (see footnote 1), where $\mathcal{T}$ is either a discrete set, a continuum, or a mixture of discrete and continuum subsets. Let $f_T(t)$ denote a probability distribution function of $T$ at point $t$; that is, it stands for the probability density if $T$ is continuous at $t$ and the probability mass if $T$ is discrete at $t$. Under the standard framework of causal inference, let $M(t) \in \mathcal{M} \subset \mathbb{R}^s$, for some positive integer $s$, denote a potential mediating variable that represents the value of the mediator if the treatment variable is equal to $t \in \mathcal{T}$. Similarly, let $Y(t, m) \in \mathbb{R}$ denote the potential outcome if one receives treatment $t$ and mediator $m$. The observed mediator and response are denoted by $M := M(T)$ and $Y := Y\{T, M(T)\}$, respectively. Define $\mu(t, t') := E[Y\{t, M(t')\}]$ for any $(t, t') \in \mathcal{T} \times \mathcal{T}$; then $\mu(t, t) - \mu(t', t')$ is the average treatment effect (ATE) caused by the change in the treatment level from $t'$ to $t$. In particular, if $\mathcal{T} = \{0, 1\}$, then $\mu(1, 1) - \mu(0, 0)$ is the binary ATE studied by Hahn (1998), Hirano et al.
(2003), Chan, Yam & Zhang (2016), Hsu & Lai (2018), and Ai, Huang & Zhang (2022) among others; if $\mathcal{T} = \{0, 1, \ldots, J\}$ for some $J \in \mathbb{N}$, then $\mu(t, t) - \mu(t', t')$ is the multi-valued ATE studied by Cattaneo (2010) and Lee (2018) among others; and if the treatment variable is continuous, then $\mu(t, t) - \mu(t', t')$ is the continuous ATE studied by Imai & van Dyk (2004), Hirano & Imbens (2004), Lee & Lemieux (2010), Colangelo & Lee (2020), and Ai et al. (2021) among others.

A primary goal of causal mediation analysis is to decompose the ATE into the average natural indirect effect and the average natural direct effect: for any $t \neq t' \in \mathcal{T}$,
\[
\mu(t, t) - \mu(t', t') = \{\mu(t, t) - \mu(t', t)\} + \{\mu(t', t) - \mu(t', t')\}, \tag{2.1}
\]
where $\mu(t, t) - \mu(t', t)$ is the average natural direct effect representing the average difference if the treatment variable changes from $t'$ to $t$ while the mediator is held constant at $M(t)$; $\mu(t', t) - \mu(t', t')$ is the average natural indirect effect representing the average difference if the mediator value changes from $M(t')$ to $M(t)$ while holding the treatment variable constant at $t'$. Therefore, this decomposition enables researchers to quantitatively explore the extent to which the treatment and mediator contribute to the treatment effect. Note from (2.1) that the ATE, the average natural indirect effect, and the average natural direct effect all depend only on $\mu : \mathcal{T} \times \mathcal{T} \mapsto \mathbb{R}$. Our goal is then reduced to estimating $\mu$ on $\mathcal{T} \times \mathcal{T}$.

Footnote 1: Our causal mediation framework and the proposed estimation methods can also be adapted to the multiple dimensions of treatment. However, this article focuses on the univariate treatment variable for simplicity of presentation.

Due to the confounding issue, the potential outcome $Y\{t, M(t')\}$ and the potential mediator $M(t)$ are not observed for all $(t, t') \in \mathcal{T} \times \mathcal{T}$.
To address this identification problem, most studies impose a selection-on-observables condition. Specifically, let $X \in \mathcal{X} \subset \mathbb{R}^r$ be a vector of observed covariates, for some positive integer $r$. We maintain the following sequential ignorability assumption imposed on the treatment and mediator assignment (Imai, Keele & Yamamoto 2010, Huber 2014, Hsu et al. 2018, Huber et al. 2020).

Assumption 1 (Sequential Ignorability). (i) $\{Y(t', m), M(t)\} \perp T \mid X = x$ for all $(t, t', m, x) \in \mathcal{T} \times \mathcal{T} \times \mathcal{M} \times \mathcal{X}$; (ii) $Y(t', m) \perp M(t) \mid (T = t, X = x)$ for all $(t, t', m, x) \in \mathcal{T} \times \mathcal{T} \times \mathcal{M} \times \mathcal{X}$, where the conditional probability functions (in the same sense as $f_T$) satisfy $f_{T|X}(t|x) > 0$ and $f_{T|M,X}(t|m, x) > 0$ for all $(t, m, x) \in \mathcal{T} \times \mathcal{M} \times \mathcal{X}$.

When $M$ is multidimensional, Assumption 1 needs to hold for each element in $M$. Under Assumption 1, we show in Appendix A that $\mu(t, t')$ can be identified as follows:
\[
\mu(t, t') = E\left[ \frac{\pi_{M,X}(T, M, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot \pi_X(T + \delta, X)\, Y \,\middle|\, T = t \right], \tag{2.2}
\]
where, for $(t, t') \in \mathcal{T} \times \mathcal{T}$,
\[
\delta := t' - t, \qquad \pi_Z(t, Z) := \frac{f_T(t)}{f_{T|Z}(t|Z)} \quad \text{for } Z \in \{X, (M, X)\}.
\]
In the particular case of $\delta = 0$, $\mu(t, t) = E[\pi_X(T, X) Y \mid T = t]$ is the dose-response function studied in Ai et al. (2021).

Let $\{T_i, M_i, X_i, Y_i\}_{i=1}^N$ denote an independent and identically distributed (i.i.d.) sample of observations drawn from the joint distribution of $(T, M, X, Y)$. If $\pi_X(t, X)$ and $\pi_{M,X}(t, M, X)$ were known, $\mu(t, t')$ could be estimated by the nonparametric series regression (Newey 1994, 1997):
\[
\widetilde{\mu}(t, t') := \left[ \sum_{i=1}^N \frac{\pi_{M,X}(T_i, M_i, X_i)}{\pi_{M,X}(T_i + \delta, M_i, X_i)} \cdot \pi_X(T_i + \delta, X_i)\, Y_i\, u_{K_0}(T_i)^\top \right] \left[ \sum_{i=1}^N u_{K_0}(T_i) u_{K_0}(T_i)^\top \right]^{-1} u_{K_0}(t), \tag{2.3}
\]
where $u_{K_0}(T) = (u_{K_0,1}(T), \ldots, u_{K_0,K_0}(T))^\top$ is a vector of prespecified basis functions with dimension $K_0 \in \mathbb{N}$.
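To make the algebra of (2.3) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the combined weight $\pi_{M,X}(T_i, M_i, X_i)\,\pi_X(T_i + \delta, X_i)/\pi_{M,X}(T_i + \delta, M_i, X_i)$ has already been evaluated and stored in a hypothetical array `w`, and it uses a polynomial basis purely for illustration; the names `power_basis` and `oracle_mu` are ours.

```python
import numpy as np

def power_basis(t, K0):
    """Polynomial sieve basis u_{K0}(t) = (1, t, ..., t^(K0-1)); an illustrative choice."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.vander(t, N=K0, increasing=True)  # shape (len(t), K0)

def oracle_mu(t, T, Y, w, K0=4):
    """Series regression of the weighted outcome w * Y on u_{K0}(T), evaluated at t.

    w[i] is a hypothetical, assumed-known combined weight for observation i,
    playing the role of the ratio of stabilized weights in (2.3).
    """
    U = power_basis(T, K0)                          # N x K0 design matrix
    coef = np.linalg.solve(U.T @ U, U.T @ (w * Y))  # [sum u u^T]^{-1} sum u w Y
    return power_basis(t, K0) @ coef                # u_{K0}(t)^T coef

# Toy check: with unit weights and an outcome that is exactly linear in T,
# the series regression reproduces the regression line.
T = np.linspace(0.0, 1.0, 50)
Y = 1.0 + 2.0 * T
w = np.ones_like(T)
print(oracle_mu([0.25, 0.5], T, Y, w, K0=3))
```

Because $\sum_i u_{K_0}(T_i) u_{K_0}(T_i)^\top$ is symmetric, solving the normal equations and then multiplying by $u_{K_0}(t)$ gives the same number as the row-vector form written in (2.3).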
However, both $\pi_X(T, X)$ and $\pi_{M,X}(T, M, X)$ are unknown in practice and need to be replaced by some estimates.

Remark 1. When $T$ is a continuous treatment variable, Huber et al. (2020) identify $\mu(t, t')$ as
\[
\mu(t, t') = E[Y(t, M(t'))] = \lim_{h \to 0} E\left[ \frac{K_h(T - t)}{f_{T|M,X}(t|M, X)} \cdot \frac{f_{T|M,X}(t'|M, X)}{f_{T|X}(t'|X)} \cdot Y \right], \tag{2.4}
\]
where $K_h(x) = K(x/h)/h$ with $K(\cdot)$ a univariate kernel function. Our proposed identification differs in two significant ways. First, (2.2) does not depend on the kernel weighting, $K_h(T - t)$, allowing integration with any nonparametric regression method for discrete, continuous or mixture $T$. Indeed, our proposed sieve estimator in (2.3) accommodates all the scenarios. Second, we address the selection bias using three stabilized weights, $\pi_{M,X}(T, M, X)$, $\pi_{M,X}(T + \delta, M, X)$ and $\pi_X(T + \delta, X)$, rather than the conditional densities used in (2.4). This approach, leveraging the moment balancing idea from Ai et al. (2021) (see Section 3 for details), ensures stability, especially when the densities, $f_{T|M,X}(t|M, X)$ and $f_{T|X}(t'|X)$, are close to zero. However, Ai et al. (2021) do not consider mediation and only require one stabilized weight, $\pi_X(T, X)$, to identify their causal parameter.

3 Estimation

In this section, we introduce a two-step nonparametric estimation of $\mu(t, t')$ across all pairs $(t, t') \in \mathcal{T} \times \mathcal{T}$ based on the identification (2.2). In the first step, shown in Section 3.1, we propose a unified framework for estimating the weighting functions, $\pi_X$ and $\pi_{M,X}$. Subsequently, in the second step shown in Section 3.2, we regress the estimated $\{\pi_{M,X}(T, M, X) \cdot \pi_X(T + \delta, X) \cdot Y\} / \pi_{M,X}(T + \delta, M, X)$ against $T$ to obtain our final estimator of $\mu(t, t')$.
For this estimation, we employ a sieve method suitable for discrete, continuous, or a mixture of discrete and continuous treatment types.

3.1 Unified Framework for Estimating the Weighting Functions

This section proposes the estimators for $\pi_X$ and $\pi_{M,X}$. A naive estimator of $\pi_X$ can be the ratio of some estimated $f_T$ and $f_{T|X}$. However, such a ratio estimator is very sensitive to small values of the estimated $f_{T|X}$ and can lead to undesirable results. To mitigate this problem, Ai et al. (2021) propose estimating $\pi_X$ as a whole. We note that, using the definition of $\pi_Z$ in (2.2), their approach can be applied to estimate $\pi_X$ and $\pi_{M,X}$ in the same framework. Note that for $Z \in \{X, (M, X)\}$,
\[
E[\pi_Z(T, Z) u(T) v(Z)] = E[u(T)] \cdot E[v(Z)] \tag{3.1}
\]
holds for any suitable functions $u(T)$ and $v(Z)$. Using Ai et al. (2021, Theorem 2), one can show that (3.1) identifies $\pi_Z$. Equation (3.1) suggests a possible way of estimating $\pi_Z$; however, it implies an infinite number of equations, which is impossible to solve using a finite sample of observations. To overcome this difficulty, Ai et al. (2021) approximate the infinite-dimensional function space using a sequence of finite-dimensional sieve spaces. Specifically, let $u_{k_1}(T) = (u_{k_1,1}(T), \ldots, u_{k_1,k_1}(T))^\top$ and $v_{k_Z}(Z) = (v_{k_Z,1}(Z), \ldots, v_{k_Z,k_Z}(Z))^\top$ be specified bases with dimensions $k_1 \in \mathbb{N}$ and $k_Z \in \mathbb{N}$, respectively, and let $K_Z := k_1 \cdot k_Z$. The functions $u_{k_1}(T)$ and $v_{k_Z}(Z)$ are approximation sieves that can approximate any suitable functions $u(T)$ and $v(Z)$ arbitrarily well (see Chen (2007) for discussions on the sieve approximation). Then (3.1) implies that
\[
E\left[ \pi_Z(T, Z) u_{k_1}(T) v_{k_Z}(Z)^\top \right] = E[u_{k_1}(T)] \cdot E[v_{k_Z}(Z)]^\top. \tag{3.2}
\]
Following Ai et al. (2021), $\pi_Z(T_i, Z_i)$, for i = 1, . . .
, N, can be estimated by the $\widehat{\pi}_i$'s, which maximize an entropy subject to the sample analog of (3.2):
\[
\{\widehat{\pi}_i\}_{i=1}^N = \arg\max \left\{ -N^{-1} \sum_{i=1}^N \pi_i \log \pi_i \right\}
\quad \text{subject to} \quad
\frac{1}{N} \sum_{i=1}^N \pi_i u_{k_1}(T_i) v_{k_Z}(Z_i)^\top = \left\{ \frac{1}{N} \sum_{i=1}^N u_{k_1}(T_i) \right\} \left\{ \frac{1}{N} \sum_{j=1}^N v_{k_Z}(Z_j)^\top \right\}. \tag{3.3}
\]
Note that by including a constant of one in the sieve bases $u_{k_1}(T)$ and $v_{k_Z}(Z)$, (3.3) implies that $N^{-1} \sum_{i=1}^N \widehat{\pi}_i = 1$. Moreover,
\[
\max \left\{ -N^{-1} \sum_{i=1}^N \pi_i \log \pi_i \right\} = -\min \left\{ \sum_{i=1}^N (N^{-1} \pi_i) \cdot \log\left( \frac{N^{-1} \pi_i}{N^{-1}} \right) \right\},
\]
that is, the entropy maximization problem is equivalent to the minimization of the Kullback-Leibler divergence between $\{N^{-1} \pi_i\}_{i=1}^N$ and the uniform empirical distribution $\{N^{-1}\}_{i=1}^N$. Such an entropy serves as a reasonable metric in the sense that $\{N^{-1} \pi_i\}_{i=1}^N$ can be treated as a discrete probability distribution. This comes from the observation that $\pi_Z(T_i, Z_i)$ is positive and $E\{\pi_Z(T_i, Z_i)\} = 1$. Moreover, the formulation in (3.3) also guarantees that the empirical counterparts, the $\pi_i$'s, are all positive and satisfy $\sum_{i=1}^N (N^{-1} \pi_i) = 1$.

To estimate $\pi_Z(T, Z)$, we use the dual solution to (3.3) shown by Ai et al. (2021):
\[
\widehat{\pi}_{K_Z}(T, Z) := \rho'\left\{ u_{k_1}(T)^\top \widehat{\Lambda}_{k_1 \times k_Z} v_{k_Z}(Z) \right\}, \tag{3.4}
\]
where $\rho'(v) = \exp(-v - 1)$ is the first derivative of $\rho(v) = -\exp(-v - 1)$, and $\widehat{\Lambda}_{k_1 \times k_Z}$ is the maximizer of the strictly concave function $\widehat{G}_{k_1 \times k_Z}$ defined by
\[
\widehat{G}_{k_1 \times k_Z}(\Lambda) := \frac{1}{N} \sum_{i=1}^N \rho\left\{ u_{k_1}(T_i)^\top \Lambda v_{k_Z}(Z_i) \right\} - \left\{ \frac{1}{N} \sum_{i=1}^N u_{k_1}(T_i) \right\}^\top \Lambda \left\{ \frac{1}{N} \sum_{j=1}^N v_{k_Z}(Z_j) \right\}. \tag{3.5}
\]
The first order condition of (3.5) implies that $\{\widehat{\pi}_{K_Z}(T_i, Z_i)\}_{i=1}^N$ satisfies the sample analog of (3.2). Such restrictions improve the robustness of the estimation, making extreme weights unlikely.
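As an illustration of how the dual problem (3.4)-(3.5) can be solved numerically, here is a minimal Python sketch under stated assumptions. It is not the authors' code: it swaps in a plain damped Newton iteration with backtracking, it assumes that the first column of each basis matrix is the constant one (so that the starting value gives all weights equal to one), and the function name `fit_weights` and the simulated design are hypothetical.

```python
import numpy as np

def fit_weights(U, V, max_iter=100, tol=1e-9):
    """Maximize the concave dual objective (3.5) by damped Newton steps.

    U is N x k1 with rows u_{k1}(T_i); V is N x kZ with rows v_{kZ}(Z_i).
    Assumption of this sketch: U[:, 0] and V[:, 0] are constant ones.
    Returns (Lambda_hat, pi_hat) with pi_hat[i] = rho'(u_i^T Lambda_hat v_i).
    """
    N = U.shape[0]
    # Row i of W is vec(u_i v_i^T), so that u_i^T Lambda v_i = W[i] @ vec(Lambda).
    W = np.einsum("ik,il->ikl", U, V).reshape(N, -1)
    b = np.outer(U.mean(axis=0), V.mean(axis=0)).ravel()
    lam = np.zeros(W.shape[1])
    lam[0] = -1.0  # rho'(-1) = 1, so every initial weight equals one

    def objective(l):
        with np.errstate(over="ignore"):
            return np.mean(-np.exp(-W @ l - 1.0)) - b @ l

    for _ in range(max_iter):
        pi = np.exp(-W @ lam - 1.0)            # current weights rho'(W lam)
        grad = W.T @ pi / N - b                # sample moment imbalance
        if np.max(np.abs(grad)) < tol:
            break
        neg_hess = (W * pi[:, None]).T @ W / N  # minus the Hessian of the objective
        step = np.linalg.solve(neg_hess, grad)  # Newton ascent direction
        s, g0 = 1.0, objective(lam)
        while objective(lam + s * step) < g0 - 1e-12 and s > 1e-8:
            s *= 0.5                            # backtrack if the objective drops
        lam = lam + s * step
    pi = np.exp(-W @ lam - 1.0)
    return lam.reshape(U.shape[1], V.shape[1]), pi

# Hypothetical confounded design: the treatment T depends on the covariate X.
rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=N)
T = 0.5 * X + rng.normal(size=N)
U = np.column_stack([np.ones(N), T])   # u_{k1}(T) = (1, T)
V = np.column_stack([np.ones(N), X])   # v_{kZ}(X) = (1, X)
Lambda_hat, pi_hat = fit_weights(U, V)
# Largest violation of the sample analog of (3.2) after fitting.
print(np.abs((U * pi_hat[:, None]).T @ V / N - np.outer(U.mean(0), V.mean(0))).max())
```

In this toy design the fitted weights are positive by construction, average to one, and remove the weighted sample covariance between $T$ and $X$, which is the balancing property that the first order condition of (3.5) encodes.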
The concavity of (3.5) enables us to easily obtain the solution via the Gauss-Newton algorithm. To ensure consistent estimation of $\pi_Z(T, Z)$ for continuous $T$ (resp. $Z$), the dimension of the bases, $k_1$ (resp. $k_Z$), must increase as the sample size increases.

3.2 Final Estimator of $\mu(t, t')$

With (2.2) and the estimated weighting functions $\widehat{\pi}_{K_X}$ and $\widehat{\pi}_{K_{M,X}}$, we define the estimator of $\mu(t, t')$ as the sieve regression estimator:
\[
\widehat{\mu}(t, t') := \left[ \sum_{i=1}^N \frac{\widehat{\pi}_{K_{M,X}}(T_i, M_i, X_i)}{\widehat{\pi}_{K_{M,X}}(T_i + \delta, M_i, X_i)} \cdot \widehat{\pi}_{K_X}(T_i + \delta, X_i)\, Y_i\, u_{K_0}(T_i)^\top \right] \cdot \left[ \sum_{i=1}^N u_{K_0}(T_i) u_{K_0}(T_i)^\top \right]^{-1} u_{K_0}(t). \tag{3.6}
\]

Remark 2. When $T$ is a binary treatment taking values in $\mathcal{T} = \{0, 1\}$, $u_{k_1}(T) = u_{K_0}(T) = (\mathbb{1}(T = 0), \mathbb{1}(T = 1))^\top$ with $k_1 = K_0 \equiv 2$. In this case, $\{\mu(t, t') : (t, t') \in \mathcal{T} \times \mathcal{T}\}$ is a finite discrete set and estimable at a rate of $N^{-1/2}$. The semiparametric efficiency bounds and efficient estimation for $\{\mu(1, 1), \mu(0, 0)\}$ are presented in Hahn (1998), Hirano et al. (2003), and Chan, Yam & Zhang (2016). The semiparametric efficiency bounds and efficient estimation for the mediation causal effects $\{\mu(1, 0), \mu(0, 1)\}$ are presented in Tchetgen Tchetgen & Shpitser (2012) and Hsu et al. (2018). Specifically, Hsu et al. (2018) construct the estimator for $\mu(1, 0)$ based on the following representation:
\[
\mu(1, 0) = E[Y\{1, M(0)\}] = E\left[ \frac{\mathbb{1}(T = 1)}{P(T = 1 | M, X)} \cdot \frac{P(T = 0 | M, X)}{P(T = 0 | X)} \cdot Y \right],
\]
with the generalized propensity score functions $P(T = 1 | M, X)$ and $P(T = 1 | X)$ estimated by the nonparametric series logit regression. In the discrete setting, that is, $\mathcal{T} = \{0, 1, \ldots, J\}$ for some $J \in \mathbb{N}$, our proposed estimators can be obtained by taking $u_{k_1}(T) = u_{K_0}(T) = (\mathbb{1}(T = 0), \mathbb{1}(T = 1), \cdots, \mathbb{1}(T = J))^\top$ with $k_1 = K_0 \equiv J + 1$.
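To see what the indicator basis does in the regression step of (3.6), the following hedged Python sketch may help. It assumes the weighted outcomes (estimated weight ratio times $Y_i$) are already available in a hypothetical array `wY`, and that every treatment level appears in the sample; the helper names are ours and not part of the paper.

```python
import numpy as np

def indicator_basis(T, levels):
    """Rows are u(T_i) = (1(T_i = 0), ..., 1(T_i = J)) for the given levels."""
    T = np.asarray(T)
    return (T[:, None] == np.asarray(levels)[None, :]).astype(float)

def sieve_regression(t, T, wY, levels):
    """Regression step of (3.6) with an indicator basis.

    wY[i] is a hypothetical, precomputed weighted outcome for observation i.
    """
    U = indicator_basis(T, levels)                      # N x (J + 1)
    u_t = indicator_basis(np.atleast_1d(t), levels)[0]  # basis evaluated at t
    # [sum_i wY_i u(T_i)^T] [sum_i u(T_i) u(T_i)^T]^{-1} u(t)
    return (wY @ U) @ np.linalg.solve(U.T @ U, u_t)

T = np.array([0, 1, 1, 2, 2, 2])
wY = np.array([1.0, 2.0, 4.0, 3.0, 6.0, 9.0])
print(sieve_regression(1, T, wY, levels=[0, 1, 2]))  # mean of wY over units with T = 1
```

With indicators, $\sum_i u_{K_0}(T_i) u_{K_0}(T_i)^\top$ is diagonal with the level counts on the diagonal, so the regression collapses to the within-level sample mean of the weighted outcome.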
We show in the next section that the proposed estimators are $\sqrt{N}$-asymptotically normal, provided some conditions on $k_Z$ hold, and attain the semiparametric efficiency bounds.

In a special case of a mixture of discrete and continuous treatment variable, where $T = 0$ with positive probability and $T > 0$ is continuous with positive probability, our estimators can be obtained by taking $u_{k_1}(T) = u_{K_0}(T) = (\mathbb{1}(T = 0), u_{K_0,2}(T)\{1 - \mathbb{1}(T = 0)\}, \cdots, u_{K_0,K_0}(T)\{1 - \mathbb{1}(T = 0)\})^\top$. When $T$ is continuous or a mixture, we show in Theorem 2 that our proposed estimator converges to a normal distribution at a rate slower than $N^{-1/2}$, but has an asymptotic variance smaller than or equal to that of $\widetilde{\mu}$ in (2.3), which uses the true $\pi_X$ and $\pi_{M,X}$, provided the convergence rates are the same. Equality holds only when $E(Y | T, M, X) = 0$.

Remark 3. For continuous $T$, Huber et al. (2020) propose a nonparametric weighted kernel (Nadaraya-Watson) type estimator for $\mu(t, t')$ based on (2.4). The conditional density functions $f_{T|M,X}$ and $f_{T|X}$ are estimated through the kernel method. Their estimators are shown to be asymptotically normal. In such a case, using our proposed estimators of the weights $\pi_Z$, a kernel (Nadaraya-Watson) type estimator for $\mu(t, t')$, an alternative to (3.6), can also be constructed:
\[
\widehat{\mu}_h(t, t') = \frac{\sum_{i=1}^N \widehat{R}_{K_{M,X}}(T_i, T_i + \delta, M_i, X_i) \cdot \widehat{\pi}_{K_X}(T_i + \delta, X_i)\, K_h(T_i - t)\, Y_i}{\sum_{i=1}^N \widehat{R}_{K_{M,X}}(T_i, T_i + \delta, M_i, X_i) \cdot \widehat{\pi}_{K_X}(T_i + \delta, X_i)\, K_h(T_i - t)}, \tag{3.7}
\]
where $\widehat{R}_{K_{M,X}}(t, t', m, x) := \widehat{\pi}_{K_{M,X}}(t, m, x) / \widehat{\pi}_{K_{M,X}}(t', m, x)$ for $(t, t', m, x) \in \mathcal{T} \times \mathcal{T} \times \mathcal{M} \times \mathcal{X}$. We establish its asymptotic properties in Appendix C, where we show that it attains the same convergence rate as a standard Nadaraya-Watson estimator and that it achieves a faster convergence rate and smaller asymptotic variance than Huber et al.
(2020), unless Huber et al. (2020) use the same bandwidth for all the estimations of $f_{T|M,X}$, $f_{T|X}$ and $\mu(t, t')$. However, in practice, since these three quantities are conditional on different sets of variables, namely $\{M, X\}$, $X$ and $T$, respectively, using the same bandwidth does not guarantee good finite sample performance. Indeed, Huber et al. (2020) suggest using different bandwidths for these estimations, and thus their $\mu(t, t')$ estimator cannot achieve the optimal convergence rate (see footnote 7 in Huber et al. 2020). Our proposed estimator can achieve a faster rate and smaller variance, because our proposed $\widehat{\pi}_Z$ efficiently incorporates the additional information available in the form of $\pi_Z$ as per (3.1) (see more discussions in Hirano et al. 2003).

Remark 4. In addition to the natural direct effect $E[Y(t, M(t))] - E[Y(t', M(t))]$ for $t \neq t'$ studied in this paper, the controlled direct effect (CDE), defined as $E[Y(t, m)] - E[Y(t', m)]$ for $t \neq t'$ and a fixed value $m$ of the mediator, is also of interest in the literature; see Goetgeluk et al. (2008), VanderWeele (2009), Hong et al. (2015) for examples. The proposed method in this paper can also be adapted to estimate $E[Y(t, m)]$ as well as the CDE; indeed, by Hong et al. (2015, Theorem 2), $E[Y(t, m)]$ can be identified in the following density ratio weighting form:
\[
E[Y(t, m)] = E\left[ \frac{f_{T,M}(T, M)}{f_{T,M|X}(T, M | X)}\, Y \,\middle|\, T = t, M = m \right],
\]
which is similar to the identification of $E[Y(t, M(t'))]$ in Eq. (2.2). Therefore, we can apply the same procedure of estimating $\pi_Z(T, Z)$ and $E[Y(t, M(t'))]$ to obtain the estimators of $f_{T,M}(T, M) / f_{T,M|X}(T, M | X)$ and $E[Y(t, m)]$, respectively.

4 Large Sample Properties

This section studies the asymptotic properties of the proposed estimator $\widehat{\mu}(t, t')$.
The convergence rates for $\widehat{\pi}_{K_Z}(\cdot, Z)$, $Z \in \{X, (M, X)\}$, are implied directly by the results in Ai, Linton & Zhang (2022). We also recall these results in Appendix B. To facilitate the presentation of our main results, we introduce the following notation:
\[
\Phi_{K_0 \times K_0} := E[u_{K_0}(T) u_{K_0}^\top(T)],
\]
\[
\begin{aligned}
d_{K_0,i}(T, M, X, Y; \delta) :={} & \mathrm{IF}_{\pi_X,i}\left( \delta, \frac{\pi_{M,X}(T, M, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right) \\
& + \mathrm{IF}_{\pi_{M,X},i}\left( 0, \frac{\pi_X(T + \delta, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right) \\
& - \mathrm{IF}_{\pi_{M,X},i}\left( \delta, \frac{\pi_{M,X}(T, M, X)\, \pi_X(T + \delta, X)}{\pi_{M,X}^2(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right) \\
& - E\left[ \frac{\pi_{M,X}(T_i, M_i, X_i)\, \pi_X(T_i + \delta, X_i)}{\pi_{M,X}(T_i + \delta, M_i, X_i)} \cdot u_{K_0}(T_i) Y_i \,\middle|\, T_i \right] \\
& + E\left[ \frac{\pi_{M,X}(T_i, M_i, X_i)\, \pi_X(T_i + \delta, X_i)}{\pi_{M,X}(T_i + \delta, M_i, X_i)} \cdot u_{K_0}(T_i) Y_i \right],
\end{aligned} \tag{4.1}
\]
for any $\delta \geq 0$ and $i = 1, \ldots, N$, where the $\mathrm{IF}_{\pi_Z,i}$'s are the i.i.d. mean zero influence functions for $\widehat{\pi}_Z$, $Z \in \{X, (M, X)\}$, defined in (B.2) in Appendix B, and
\[
\begin{aligned}
V_{tt'} :={} & E\left[ \left\{ u_{K_0}^\top(t)\, \Phi_{K_0 \times K_0}^{-1}\, d_{K_0,i}(T, M, X, Y; \delta) \right\}^2 \right] \\
={} & u_{K_0}^\top(t) \cdot \Phi_{K_0 \times K_0}^{-1} \cdot E\left[ d_{K_0,i}(T, M, X, Y; \delta)\, d_{K_0,i}^\top(T, M, X, Y; \delta) \right] \cdot \Phi_{K_0 \times K_0}^{-1} \cdot u_{K_0}(t).
\end{aligned} \tag{4.2}
\]
The following conditions are maintained throughout this article.

Assumption 2. For every fixed $t' \in \mathcal{T}$, there exist a $\gamma^* \in \mathbb{R}^{K_0}$ and a positive constant $\beta > 0$ such that $\sup_{t \in \mathcal{T}} |\mu(t, t') - (\gamma^*)^\top u_{K_0}(t)| = O(K_0^{-\beta})$.

Assumption 3. The eigenvalues of $E[u_{K_0}(T) u_{K_0}^\top(T)]$ and $E[d_{K_0,i}(T, M, X, Y; \delta) \cdot d_{K_0,i}^\top(T, M, X, Y; \delta)]$ are bounded away from zero and infinity uniformly with respect to $K_0 \in \mathbb{N}$.

Assumption 2 requires the sieve approximation error of the function $\mu(\cdot, t')$ to shrink at a polynomial rate. This condition is satisfied for a variety of sieve basis functions.
For example, if $T$ is discrete, then the approximation error is zero for sufficiently large $K_0$; thus, in this case, Assumption 2 is satisfied with $\beta = +\infty$. If $T$ is continuous or a mixture, the polynomial rate $\beta$ depends positively on the smoothness of $\mu(t, t')$ in $t$; indeed, for power series and B-splines, $\beta$ is the smoothness of $\mu(t, t')$ in $t$ (Chen 2007, Section 2.3.1). We show that the convergence rate of the estimated $\mu(t, t')$ is bounded by this polynomial rate. Assumption 3 essentially ensures that the variance of the estimator is non-degenerate. Under these conditions and Assumptions 4–6 presented in Appendix B, we establish the following two theorems, which hold whether the treatment is continuous, discrete or mixed:

Theorem 1. Under Assumptions 1–3 and 4–6 presented in Appendix B, we have

1. (Convergence Rates)
\[
\int_{\mathcal{T}} |\widehat{\mu}(t, t') - \mu(t, t')|^2 \, dF_T(t) = O_p\left( \left\{ K_0^{-2\beta} + \frac{K_0}{N} \right\} + \zeta(K_X)^2 \left\{ \frac{K_X}{N} + K_X^{-2\alpha_X} \right\} + \zeta(K_{M,X})^2 \left\{ \frac{K_{M,X}}{N} + K_{M,X}^{-2\alpha_{M,X}} \right\} \right),
\]
\[
\sup_{t \in \mathcal{T}} |\widehat{\mu}(t, t') - \mu(t, t')| = O_p\left( \zeta_1(K_0) \left\{ \left( K_0^{-\beta} + \frac{K_0}{N} \right) + \zeta(K_X) \left( \sqrt{\frac{K_X}{N}} + K_X^{-\alpha_X} \right) + \zeta(K_{M,X}) \left( \sqrt{\frac{K_{M,X}}{N}} + K_{M,X}^{-\alpha_{M,X}} \right) \right\} \right),
\]
hold for every fixed $t' \in \mathcal{T}$, where $\zeta(K_Z)$, for $Z \in \{X, (M, X)\}$, and $\zeta_1(K_0)$ are defined in Assumption 6 of Appendix B.

2. (Asymptotic Normality) Suppose $\sqrt{N} K_0^{-\beta} \to 0$. Then, for any fixed $t, t' \in \mathcal{T}$,
\[
\sqrt{N} \{\widehat{\mu}(t, t') - \mu(t, t')\} = \frac{1}{\sqrt{N}} \sum_{i=1}^N \phi_{tt'}(Y_i, T_i, M_i, X_i; \delta) + o_p(1),
\]
where $\phi_{tt'}(Y_i, T_i, M_i, X_i; \delta) = u_{K_0}^\top(t)\, \Phi_{K_0 \times K_0}^{-1}\, d_{K_0,i}(T, M, X, Y; \delta)$. Thus, we have
\[
\sqrt{N}\, V_{tt'}^{-1/2} \{\widehat{\mu}(t, t') - \mu(t, t')\} \xrightarrow{d} N(0, 1),
\]
and $V_{tt'} = \text{const} \times \|u_{K_0}(t)\|^2 = O(K_0)$ for some positive constant const.

The proof of Theorem 1 is presented in Section S3 of the supplemental material.
In addition to the convergence rate and asymptotic normality, we also provide the asymptotic linear expansion of $\sqrt{N} \{\widehat{\mu}(t, t') - \mu(t, t')\}$. This can help conduct statistical inference, as we can approximate the limiting distribution of our estimator by adopting the exchangeable bootstrap method (Chernozhukov et al. 2013, Donald & Hsu 2014, Huang et al. 2022). Using Theorem 1, the asymptotic results of the estimated direct and indirect effects can be easily obtained. In particular, note from Remark 2 and Assumption 2 that if the treatment is discrete, $K_0$ is a constant independent of $N$ and $\beta = +\infty$; thus $\widehat{\mu}(t, t') - \mu(t, t')$ is $\sqrt{N}$-asymptotically normal. If the treatment is continuous or a mixture, and $u_{K_0}(t)$ is a sieve series, then the convergence rate of $\widehat{\mu}(t, t') - \mu(t, t')$ is slower than $N^{-1/2}$; see Remark 5 for more discussion on the curse of dimensionality that arises from $M$ and $X$.

The next theorem shows that our proposed estimator of the mean potential outcome, $\widehat{\mu}(t, t')$, is asymptotically more efficient than the oracle estimator $\widetilde{\mu}(t, t')$ in (2.3) that uses the true $\pi_X$ and $\pi_{M,X}$.

Theorem 2. Suppose $\sqrt{N} K_0^{-\beta} \to 0$. Under Assumptions 1–3, for any fixed $t, t' \in \mathcal{T}$,
\[
\sqrt{N}\, \widetilde{V}_{tt'}^{-1/2} \{\widetilde{\mu}(t, t') - \mu(t, t')\} \xrightarrow{d} N(0, 1),
\]
where $\widetilde{V}_{tt'}$ is the asymptotic variance satisfying $\widetilde{V}_{tt'} \geq V_{tt'}$.

It is known that in the estimation of average treatment effects with a binary treatment, the inverse probability weighting (IPW) estimator constructed from a nonparametrically estimated propensity score is more efficient than that constructed by using the true one; see Hahn (1998), Hirano et al. (2003) and Chen et al. (2008). Theorem 2 establishes a similar result for the mediation effects with a general treatment.
The proof of Theorem 2 is presented in sec- tion S4 in the supplemental material, where we also deri ve the asymptotic linear expansion of √ N e V − 1 / 2 tt ′ { e µ ( t, t ′ ) − µ ( t, t ′ ) } and the detailed asymptotic v ariance e V tt ′ . Theorem 2 implies the ef ﬁciency of our estimator o ver the oracle one for discrete, continuous, and mixed treatments. Speciﬁcally , when the treatment v ariable is discrete, we further prove the corollary below that our proposed estimators attain the semiparametric efﬁcienc y bounds. The proof is in section S5 in the supplemental material. Corollary 1. Suppose that the tr eatment variable T is discr ete with values in T = { 0 , 1 , ..., J } , wher e J ≥ 1 is an positive inte ger . Under Assumptions 1 – 3 and 4 – 6 pr esented in Appendix B , we have √ N ( b µ ( t, t ′ ) − µ ( t, t ′ )) d − → N (0 , V tt ′ ) , wher e V tt ′ = E h S 2 µ ( t,t ′ ) i and S µ ( t,t ′ ) = 1 { T = t } f M | T , X ( M | T = t ′ , X ) f T | X ( t | X ) f M | T , X ( M | T = t, X ) { Y − E ( Y | X , M , T = t ) } + 1 { T = t ′ } f T | X ( t ′ | X ) { E ( Y | X , M , T = t ) − η ( t, t ′ , X ) } + η ( t, t ′ , X ) − µ ( t, t ′ ) , wher e η ( t, t ′ , X ) = Z E ( Y | X , M = m , T = t ) f M | T , X ( m | T = t ′ , X ) d m , and S µ ( t,t ′ ) is equivalent to the efﬁcient inﬂuence function in Hsu et al. ( 2018 ). Remark 5. Note fr om Assumptions 5 and 6 in Appendix B that the more continuous or mixed components in the covariate set X or the mediator set M , the harder our estimation of π Z would be, for Z ∈ { X , ( M , X ) } . This is a curse of dimensionality of our fully nonparametric estimation. T o r esolve this issue, it is common to r educe the dimension. A solution is to use an index model (see e.g . F an & Gijbels 1996 , Lewbel & Linton 2007 ). 
that the random treatment assignment $T$ depends on $Z$ through a function known up to a finite number of parameters $\theta$, $s_{Z,\theta} : \mathcal{Z} \mapsto \mathbb{R}$, in the sense that $f_{T|Z}(t \mid z) = f_{T|s_{Z,\theta}}\{t \mid s_{Z,\theta}(z)\}$ for all $(t,z) \in \mathcal{T} \times \mathcal{Z}$, and for $Z \in \{X, (M,X)\}$ that has a large dimension of continuous or mixed components. Then $\pi_Z$ can be identified by
$$E\big[ \pi_Z(T,Z) \, u(T) \, v\{s_{Z,\theta}(Z)\} \big] = E[u(T)] \cdot E\big[ v\{s_{Z,\theta}(Z)\} \big] \qquad (4.3)$$
for all suitable functions $u$ and $v$. However, because the moment equation does not identify $\theta$, we need to first estimate $\theta$. One possible way is to estimate $\theta$ by maximum likelihood (MLE) with a nonparametric estimator $\hat f_{T|s_{Z,\theta}}(t \mid s)$ of $f_{T|s_{Z,\theta}}(t \mid s)$. That is,
$$\hat\theta := \arg\max_{\theta} \sum_{i=1}^{N} \log\big\{ \hat f_{T|s_{Z,\theta}}\big( T_i \mid s_{Z,\theta}(Z_i) \big) \big\}.$$
For example, we can take the kernel density estimator
$$\hat f_{T|s_{Z,\theta}}(t \mid s) = \frac{\sum_{i=1}^{N} K_{h_1}(T_i - t) \, K_{h_2}\{s_{Z,\theta}(Z_i) - s\}}{\sum_{i=1}^{N} K_{h_2}\{s_{Z,\theta}(Z_i) - s\}},$$
where $K_h(x) = K(x/h)/h$ with $K$ a pre-specified kernel function, and $h_1$, $h_2$ are the rule of thumb bandwidths for $T$ and $s_{Z,\theta}(Z)$, respectively. Then, with $\hat\theta$ and the moment equation (4.3), we can estimate $\pi_Z$ in the same way as in Section 3. Since the index model $s_{Z,\theta}$ maps $Z$ to a univariate space, both the nonparametric estimator of $f_{T|s_{Z,\theta}}(t \mid s)$ and the sieve approximation of (4.3) are two-dimensional nonparametric estimators, which alleviates the curse of dimensionality.

5 Variance Estimation

Using the expression (4.2), we propose a plug-in estimator for $V_{tt'}$. For $i = 1, \ldots, N$ and $\delta \geq 0$, let $\hat d_{K_0,i}$ defined in Section B.1 of the Appendix be an estimator of $d_{K_0,i}$ defined in (4.1).
Then, a consistent estimator of $V_{tt'}$ is given by
$$\hat V_{tt'} := u_{K_0}^{\top}(t) \, \hat\Phi_{K_0 \times K_0}^{-1} \cdot \left[ \frac{1}{N} \sum_{i=1}^{N} \hat d_{K_0,i}(T, M, X, Y; \delta) \, \hat d_{K_0,i}^{\top}(T, M, X, Y; \delta) \right] \cdot \hat\Phi_{K_0 \times K_0}^{-1} \, u_{K_0}(t),$$
where $\hat\Phi_{K_0 \times K_0} = N^{-1} \sum_{i=1}^{N} u_{K_0}(T_i) \, u_{K_0}^{\top}(T_i)$. From Proposition 1 and Theorem 1, we have $\sup_{(t,z) \in \mathcal{T} \times \mathcal{Z}} |\hat\pi_{K_Z}(t,z) - \pi_Z(t,z)| = o_p(1)$ and $|\hat\mu(t,t) - \mu(t,t)| \to 0$. With these results, the consistency of $\hat V_{tt'}$ follows from standard arguments in Chen (2007).

6 Selecting the Smoothing Parameters

The proposed estimator $\hat\mu(t,t')$ (resp. $\hat\mu_h(t,t')$) in (3.6) (resp. (3.7)) involves the tuning parameters $k_1$, $k_X$, $k_{M,X}$, and $K_0$ (resp. $h$). If the treatment $T$ is discrete, $k_1$ and $K_0$ can be determined in the way described in Remark 1, and the kernel-type estimator $\hat\mu_h(t,t')$ is not applicable. When the confounders $X$ or the mediators $M$ are discrete, the corresponding sieve bases and parameters $k_X$ and $k_{M,X}$ can also be determined in that way. Thus, in this section, we focus on proposing a data-driven method for choosing the parameters for continuous $T$, $X$ and $M$.

Note that our estimators $\hat\mu(t,t')$ and $\hat\mu_h(t,t')$ for continuous $T$ are nonparametric regression-type estimators. For such estimators, the smoothing parameters are usually selected by minimizing a cross-validation (CV) criterion that approximates the mean squared error (MSE) of the estimator. Although simultaneously selecting all the smoothing parameters using one CV criterion can better approximate the MSE, it is too time-consuming given the number of parameters. Thus, we propose to choose them separately.

First, we propose a method to choose $K_Z$ for $\hat\pi_{K_Z}$, where $Z \in \{X, (M,X)\}$, inspired by the least squares CV idea for choosing the bandwidth of a kernel density estimator (see e.g., Fan & Gijbels 1996).
Notice that a weighted integrated squared error (WISE) of the estimated weighting functions can be written in the following way:
$$\int \{ \hat\pi_{K_Z}(t,z) - \pi_Z(t,z) \}^2 f_{T,Z}(t,z) \, dt \, dz = \int \{ \hat\pi_{K_Z}(t,z) \}^2 f_{T,Z}(t,z) \, dt \, dz - 2 \int \hat\pi_{K_Z}(t,z) \, \pi_Z(t,z) \, f_{T,Z}(t,z) \, dt \, dz + E\big[ \{ \pi_Z(T,Z) \}^2 \big],$$
where, by the definition of $\pi_Z$, we have $\pi_Z(t,z) f_{T,Z}(t,z) = f_T(t) f_Z(z)$. Since the last term does not depend on the tuning parameters, minimizing the WISE w.r.t. the numbers of sieve basis functions $k_1$ and $k_Z$ is equivalent to minimizing
$$\int \{ \hat\pi_{K_Z}(t,z) \}^2 f_{T,Z}(t,z) \, dt \, dz - 2 \int \hat\pi_{K_Z}(t,z) \, f_T(t) f_Z(z) \, dt \, dz.$$
With a finite sample, we can then choose $k_1$ and $k_Z$ by minimizing the following least squares CV criterion, which is a fully accessible empirical analog of the above criterion,
$$CV(k_1, k_Z) = \frac{1}{N} \sum_{i=1}^{N} \{ \hat\pi_{K_Z}(T_i, Z_i) \}^2 - \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \hat\pi_{K_Z}(T_i, Z_j),$$
over a candidate set of $k_1$ and $k_Z$.

To select the parameter $K_0$ in $\hat\mu(t,t')$, we rewrite $\hat\mu(t,t')$ as $\hat\mu_{K_0}(t,t')$ for now. Then, we obtain $K_0$ by using a leave-one-out CV (see Li & Racine (2007, Section 15.2)),
$$\hat K_0 = \arg\min_{K_0} \left[ \frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{\hat\pi_{\hat K_{M,X}}(T_i, M_i, X_i) \, \hat\pi_{\hat K_X}(T_i + \delta, X_i) \, Y_i}{\hat\pi_{\hat K_{M,X}}(T_i + \delta, M_i, X_i)} - \hat\mu_{K_0}^{(-i)}(T_i, T_i + \delta) \right\}^2 \right],$$
where $\hat K_Z = \hat k_1 \cdot \hat k_Z$ for $Z \in \{X, (M,X)\}$, and $\hat\mu_{K_0}^{(-i)}(T_i, T_i + \delta)$ is computed as $\hat\mu_{K_0}(T_i, T_i + \delta)$ but without using $\{T_i, X_i, M_i, Y_i\}$.

For $\hat\mu_h(t,t')$, we need to choose the bandwidth $h$. We can also apply a leave-one-out CV method for choosing $h$:
$$\hat h = \arg\min_{h} \left[ \frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{\hat\pi_{\hat K_{M,X}}(T_i, M_i, X_i) \, \hat\pi_{\hat K_X}(T_i + \delta, X_i) \, Y_i}{\hat\pi_{\hat K_{M,X}}(T_i + \delta, M_i, X_i)} - \hat\mu_{h}^{(-i)}(T_i, T_i + \delta) \right\}^2 \right],$$
where $\hat\mu_h^{(-i)}(T_i, T_i + \delta)$ is computed as the kernel estimator $\hat\mu_h(T_i, T_i + \delta)$ but without using $\{T_i, X_i, M_i, Y_i\}$.
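The least squares CV criterion $CV(k_1, k_Z)$ above can be evaluated directly once the fitted weights are available at all pairs $(T_i, Z_j)$. The following Python sketch illustrates this under the assumption that, for each candidate $(k_1, k_Z)$, the user has already fitted $\hat\pi_{K_Z}$ and stored the $N \times N$ matrix with entries $\hat\pi_{K_Z}(T_i, Z_j)$. The function names and the dictionary layout are hypothetical, and the fitting step itself is not shown.

```python
import numpy as np

def least_squares_cv(P):
    """Least squares CV criterion from a matrix of fitted weights.

    P : (N, N) array with P[i, j] = pi_hat(T_i, Z_j); the diagonal holds
        the weights at the observed pairs (T_i, Z_i).
    """
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    diag = np.diag(P)
    off_diag_sum = P.sum() - diag.sum()   # sum over all pairs with j != i
    return np.mean(diag ** 2) - 2.0 * off_diag_sum / (N * (N - 1))

def select_by_cv(weight_matrices):
    """Pick the candidate (k1, kZ) with the smallest CV value.

    weight_matrices : dict mapping a candidate (k1, kZ) to its (N, N)
        matrix of fitted weights (assumed precomputed by the user).
    """
    scores = {k: least_squares_cv(P) for k, P in weight_matrices.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

As a sanity check on the formula, a weight function identically equal to one (the true $\pi_Z$ when $T$ and $Z$ are independent) gives a criterion value of $1 - 2 = -1$.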
Apart from the CV method, the rule of thumb bandwidth for kernel methods (see e.g. Silverman 1986) is also available, which sacrifices a small amount of accuracy but is less time-consuming. We experimented with the rule of thumb bandwidth in our numerical studies and obtained satisfactory results. Specifically, the rule of thumb bandwidth is $h = C \cdot \mathrm{sd}(T) \cdot N^{-1/5}$, where $\mathrm{sd}(T)$ is the standard deviation of $T$ and $C$ is a constant depending on the kernel function. For example, $C = 2.34$ for second-order Epanechnikov kernels, $C = 3.03$ for fourth-order Epanechnikov kernels, and $C = 1.06$ for the Gaussian kernel. However, Theorem 4 requires an undersmoothing bandwidth such that $N h^5 \to 0$ as $N \to \infty$. Thus, following the suggestion of Huber et al. (2020), we take $h = C \cdot N^{-0.25}$.

7 Monte Carlo Simulation

In this section, we conduct Monte Carlo simulations to evaluate the finite sample performance of our proposed estimators.

7.1 Continuous Treatment

We consider data-generating processes (DGPs) similar to the designs in Huber et al. (2020). The confounder $X$ is drawn from the uniform distribution over $[-1.5, 1.5]$. We consider the continuous treatment variable generated from $T = 0.3X + \epsilon$, where $\epsilon$ is drawn from the uniform distribution over $[-2, 2]$. We further generate $U$ and $V$ from the uniform distribution over $[-2, 2]$ such that $\epsilon$, $U$, and $V$ are independent of each other. The mediator variable is then generated according to $M = 0.3T + 0.3X + V$. The outcomes in each scenario are given as follows:
• Scenario I: $Y = 0.3T + 0.3M + 0.5TM + 0.3X + U$.
• Scenario II: $Y = 0.3T + 0.3M + 0.3X + 0.25T^3 + U$.
• Scenario III: $Y = 0.3T + 0.3M + 0.5TM + 0.3X + 0.25T^3 + U$.
In Scenario I, there is an interaction effect between $T$ and $M$, with the outcome model being linear.
Scenario II considers a nonlinear outcome model; however, there is no interaction effect, implying that the direct and indirect effects are homogeneous, that is, $\mu(t,t) - \mu(t',t) = \mu(t,t') - \mu(t',t')$ and $\mu(t,t) - \mu(t,t') = \mu(t',t) - \mu(t',t')$. In Scenario III, there is an interaction effect, with the outcome model being nonlinear. The true dose-response functions are
• Scenario I: $\mu(t,t') = 0.3t + 0.09t' + 0.15tt'$.
• Scenario II: $\mu(t,t') = 0.3t + 0.09t' + 0.25t^3$.
• Scenario III: $\mu(t,t') = 0.3t + 0.09t' + 0.15tt' + 0.25t^3$.
In all the scenarios, we set the sample size to $N = 500$ and $N = 1000$. The Monte Carlo trials are repeated 500 times. We set $t' = 0$ and let $t$ vary over the interval $[-1.5, 0) \cup (0, 1.5]$. Specifically, we set $t$ equal to the grid points $\mathcal{G}_r := \{-1.5, -1.4, \ldots, -0.1, 0.1, \ldots, 1.4, 1.5\}$.

To evaluate the performance, we compare the proposed estimators, namely the covariate-balancing series (CBS) estimators in (3.6) and the covariate-balancing kernel (CBK) regression estimators in (3.7), with the alternative estimators established in the literature. Specifically, we estimate the weighting functions in both CBS and CBK estimators using the power series. The smoothing parameters are selected by the data-driven method described in Section 6. The competitors considered here are the nonparametric weighting kernel (NWK) and semiparametric weighting kernel (SWK) estimators proposed by Huber et al. (2020) and the linear ordinary least squares (OLS) regression estimators used in the simulation study of Huber et al. (2020, Section 5). In particular, SWK (incorrectly) assumes that the conditional distributions of $T$ given $X$ and $T$ given $(M,X)$ are both normal and estimates the distribution parameters using maximum likelihood.
OLS estimates the direct and indirect treatment effects by linearly regressing the observed mediator on the observed treatment and covariates and linearly regressing the observed outcome on the observed mediator, treatment and covariates, respectively. Thus, for nonlinear models, OLS is biased.

Table 1: $10^3 \times$ ARMSE for the estimated direct and indirect effects under Scenarios I–III, with the smallest value and the smallest nonparametric value in each configuration highlighted in boldface and underline, respectively. Columns (a) ARMSE$\{\hat\mu(t,t) - \hat\mu(0,t)\}$ and (b) ARMSE$\{\hat\mu(t,0) - \hat\mu(0,0)\}$ are the average natural direct effects; columns (c) ARMSE$\{\hat\mu(t,t) - \hat\mu(t,0)\}$ and (d) ARMSE$\{\hat\mu(0,t) - \hat\mu(0,0)\}$ are the average natural indirect effects.

| N | Estimator | (a) I | (a) II | (a) III | (b) I | (b) II | (b) III | (c) I | (c) II | (c) III | (d) I | (d) II | (d) III |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | OLS | 133.70 | 283.79 | 305.15 | 46.14 | 283.79 | 285.59 | 125.65 | 15.96 | 125.69 | 18.17 | 15.96 | 18.39 |
| 500 | SWK | 100.79 | 122.31 | 130.11 | 95.14 | 123.63 | 127.52 | 32.85 | 31.17 | 39.48 | 22.68 | 24.25 | 24.82 |
| 500 | NWK | 73.31 | 149.91 | 155.49 | 97.92 | 153.31 | 169.17 | 118.38 | 102.31 | 119.75 | 35.79 | 34.79 | 34.71 |
| 500 | CBS | 64.08 | 61.16 | 70.24 | 57.16 | 54.25 | 62.44 | 32.09 | 27.00 | 33.99 | 28.02 | 27.81 | 28.65 |
| 500 | CBK | 128.29 | 119.89 | 130.14 | 125.88 | 120.57 | 127.94 | 34.98 | 28.00 | 35.78 | 27.01 | 26.52 | 27.06 |
| 1000 | OLS | 129.72 | 281.73 | 302.32 | 34.78 | 281.73 | 283.04 | 124.79 | 11.05 | 124.84 | 12.48 | 11.05 | 12.85 |
| 1000 | SWK | 86.19 | 87.84 | 94.66 | 84.07 | 90.01 | 93.54 | 23.54 | 22.32 | 28.04 | 15.38 | 15.59 | 16.02 |
| 1000 | NWK | 59.97 | 116.40 | 120.46 | 90.29 | 121.97 | 139.49 | 107.25 | 91.49 | 107.44 | 29.74 | 29.05 | 29.12 |
| 1000 | CBS | 45.72 | 41.48 | 49.49 | 40.53 | 37.62 | 44.34 | 21.75 | 18.44 | 23.22 | 19.18 | 18.76 | 19.70 |
| 1000 | CBK | 107.37 | 100.23 | 108.27 | 105.57 | 100.14 | 106.27 | 24.97 | 20.75 | 25.34 | 20.19 | 20.22 | 20.21 |

For each $t \in \mathcal{G}_r$, we measure the performance of the estimators by the root mean squared errors (RMSE) over the 500 Monte Carlo trials. Table 1 reports the averages of the RMSE (ARMSE) for the estimated direct and indirect effects under Scenarios I to III, where the averages are taken across all the treatment values $t \in \mathcal{G}_r$.
Specifically, for any estimator $\hat\mu(t)$ of a function $\mu(t)$ for $t \in \mathcal{G}_r$,
$$\text{ARMSE}\{\hat\mu(t)\} = |\mathcal{G}_r|^{-1} \sum_{t \in \mathcal{G}_r} \sqrt{ \sum_{j=1}^{500} \{ \hat\mu_j(t) - \mu(t) \}^2 / 500 },$$
where $|\mathcal{G}_r|$ is the number of elements in the set $\mathcal{G}_r$ and $\hat\mu_j(t)$ denotes the estimate calculated from the $j$th Monte Carlo trial.

Overall, our CBS estimator consistently outperforms the nonparametric NWK across all scenarios, and our proposed CBK method surpasses NWK in every instance except one, namely the average natural direct effects under Scenario I. These observations align well with our theories and validate the robust finite sample properties of our methods. Notably, CBS tends to perform better than CBK in most of the cases. The advantage is more obvious for the average natural direct effects, where the models involve non-linear components. This could be attributed to the fact that the tuning parameters for CBS are data-driven, whereas the bandwidth for the kernel function in CBK is determined by a rule-of-thumb approach. Note that the results of SWK and NWK are consistent with those in Huber et al. (2020), where the misspecified SWK is better than NWK in most cases. This may be because the conditional normal distribution assumption can capture the linear relationships of $T$ with $X$ and $(M,X)$ in the simulated models fairly well. However, our proposed CBS performs better than SWK in most of the cases.

In particular, under Scenario I, the direct and indirect effects are heterogeneous. Specifically, $\mu(t,0) - \mu(0,0)$ and $\mu(0,t) - \mu(0,0)$ are reduced to linear models, whereas $\mu(t,t) - \mu(0,t)$ and $\mu(t,t) - \mu(t,0)$ are non-linear. It is thus not surprising that the OLS estimator gives the best estimation for $\mu(t,0) - \mu(0,0)$ and $\mu(0,t) - \mu(0,0)$, but performs worst for the other two. In Scenario II, the direct and indirect effects are homogeneous, and the direct effects models are non-linear.
The OLS estimators for the direct effects thus perform the worst and do not show any consistency. However, they estimate the indirect effects consistently, and perform the best there, since the true indirect effects are linear. Both the proposed CBS and CBK estimators produce smaller ARMSE than NWK. While CBK is comparable to SWK, CBS outperforms SWK. Under Scenario III, our proposed CBS and CBK outperform the nonparametric NWK for all the effects. Moreover, the direct effects and the indirect effect $\mu(t,t) - \mu(t,0)$ are non-linear, and the indirect effect $\mu(0,t) - \mu(0,0)$ is linear. The OLS estimators are not consistent except for $\mu(0,t) - \mu(0,0)$. In those non-linear configurations, our proposed CBS estimator provides the best performance among all the estimators.

Table 2: $10^3 \times$ ARMSE for the estimated direct and indirect effects under the binary treatment model, with the smallest value in each configuration highlighted in boldface. The first two columns are the average natural direct effects; the last two are the average natural indirect effects.

| n | Estimator | ARMSE$\{\hat\mu(1,0) - \hat\mu(0,0)\}$ | ARMSE$\{\hat\mu(1,1) - \hat\mu(0,1)\}$ | ARMSE$\{\hat\mu(1,1) - \hat\mu(1,0)\}$ | ARMSE$\{\hat\mu(0,1) - \hat\mu(0,0)\}$ |
|---|---|---|---|---|---|
| 500 | IPW | 123.59 | 125.53 | 94.83 | 43.72 |
| 500 | CBS | 122.58 | 122.95 | 91.53 | 38.84 |
| 1000 | IPW | 85.95 | 85.16 | 66.80 | 30.22 |
| 1000 | CBS | 81.03 | 80.71 | 66.66 | 28.88 |

7.2 Binary Treatment

We also conduct a simulation to investigate the finite sample performance when the treatment is binary, that is, $T \in \{0, 1\}$. Following Hsu et al. (2018), we generate $T$ from a conditional Bernoulli distribution with a success probability of $\exp(X) / \{1 + \exp(X)\}$. The confounder $X$ is generated from the uniform distribution over $[-1.5, 1.5]$, the mediator variable is generated through $M = 0.3T + 0.3X + V$, and the outcome is generated through $Y = 0.3T + 0.3M + 0.5TM + 0.3X + 0.25T^3 + U$, where $U$ and $V$ are independently drawn from the uniform distribution over $[-2, 2]$.
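The binary-treatment design just described is straightforward to simulate. The Python sketch below is illustrative only and is not the authors' simulation code. It generates one sample from the stated DGP and approximates a mean potential outcome by Monte Carlo, which can be compared with the Scenario III expression $\mu(t,t') = 0.3t + 0.09t' + 0.15tt' + 0.25t^3$ (the outcome equation is the same). The function names, the seed, and the Monte Carlo sample size are assumptions made for the example.

```python
import numpy as np

def mu_true(t, t_prime):
    """Closed-form mean potential outcome for this outcome equation."""
    return 0.3 * t + 0.09 * t_prime + 0.15 * t * t_prime + 0.25 * t ** 3

def simulate_binary(n, rng):
    """Draw one sample (X, T, M, Y) from the binary-treatment design."""
    X = rng.uniform(-1.5, 1.5, n)
    p = np.exp(X) / (1.0 + np.exp(X))     # logistic success probability
    T = rng.binomial(1, p)
    V = rng.uniform(-2.0, 2.0, n)
    U = rng.uniform(-2.0, 2.0, n)
    M = 0.3 * T + 0.3 * X + V
    Y = 0.3 * T + 0.3 * M + 0.5 * T * M + 0.3 * X + 0.25 * T ** 3 + U
    return X, T, M, Y

def mc_potential_mean(t, t_prime, n, rng):
    """Monte Carlo approximation of E[Y(t, M(t'))] under the same DGP."""
    X = rng.uniform(-1.5, 1.5, n)
    V = rng.uniform(-2.0, 2.0, n)
    U = rng.uniform(-2.0, 2.0, n)
    M = 0.3 * t_prime + 0.3 * X + V       # mediator under treatment t'
    Y = 0.3 * t + 0.3 * M + 0.5 * t * M + 0.3 * X + 0.25 * t ** 3 + U
    return Y.mean()

rng = np.random.default_rng(0)            # seed chosen arbitrarily
X, T, M, Y = simulate_binary(500, rng)
approx = mc_potential_mean(1, 1, 200_000, rng)
```

With a large Monte Carlo sample, `approx` should be close to `mu_true(1, 1)`; the same check applies to the other three treatment pairs.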
The population means of the potential outcomes are $\mu(0,0) = 0$, $\mu(0,1) = 0.09$, $\mu(1,0) = 0.55$ and $\mu(1,1) = 0.79$. We compare our proposed CBS estimator with the inverse probability weighting (IPW) estimator introduced by Hsu et al. (2018), where the generalized propensity score functions $P(T = t \mid M, X)$ and $P(T = t \mid X)$ are estimated using the nonparametric series logit regression. Table 2 reports the ARMSE for the estimated direct and indirect effects. The results show that both estimators improve as the sample size increases, and that our CBS estimator performs slightly better in all configurations.

Figure 1: Estimated direct effects $\hat\mu(t,t) - \hat\mu(40,t)$ (top left) and $\hat\mu(t,40) - \hat\mu(40,40)$ (top right), and indirect effects $\hat\mu(t,t) - \hat\mu(t,40)$ (bottom left) and $\hat\mu(40,t) - \hat\mu(40,40)$ (bottom right) for $t \in \{100, 200, \cdots, 2000\}$, with the estimated 95% confidence bands (dashed lines).

8 Application

To evaluate the practical value of our method, we revisit the case study of Job Corps analyzed by Schochet et al. (2008) and Huber et al. (2020). Job Corps is a publicly funded U.S. training program that targets economically disadvantaged youths between the ages of 16 and 24 who are legal U.S. residents. Participants received approximately 1200 hours of vocational training and education, housing, and boarding over an average duration of 8 months. Previous analyses by Huber (2014) and Frölich & Huber (2017) focused on the effect of program participation, as a binary treatment, on health and earnings, respectively. Schochet et al. (2008) find that participation in the Job Corps program increases educational attainment and reduces criminal activity. Huber et al.
(2020) apply a continuous treatment effect analysis to investigate how the total hours spent in either academic or vocational classes during the 12 months affect the participants' criminal activities, mediated through the employment status after the training.

Figure 2: Estimated direct effects $\hat\mu_h(t,t) - \hat\mu_h(40,t)$ (top left) and $\hat\mu_h(t,40) - \hat\mu_h(40,40)$ (top right), and indirect effects $\hat\mu_h(t,t) - \hat\mu_h(t,40)$ (bottom left) and $\hat\mu_h(40,t) - \hat\mu_h(40,40)$ (bottom right) for $t \in \{100, 200, \cdots, 2000\}$, with the estimated 95% confidence bands (dashed lines).

Similar to Huber et al. (2020), our research explores the training program's effects on criminal activity, mediated by post-program employment. Unlike them, however, we first apply our method to assess the direct and indirect effects of attending the program (binary treatment). Subsequently, we use our method to examine how these effects vary with the length of participation (continuous treatment) in the program. Specifically, the mediator variable, denoted by $M$, is the proportion of weeks employed in the second year after the training. The outcome, denoted by $Y$, is the number of times an individual was arrested by the police in the fourth year after the program.
The confounding variables, denoted by $X$, include a rich set of pre-treatment covariates: age, gender, ethnicity, language competency, education, marital status, household size and income, previous receipt of social aid, family background (e.g., parents' education), health and health-related behavior at baseline, the expectations about the Job Corps program, and the interaction with the recruiters.

Table 3: $10 \times$ direct and indirect effects with confidence intervals under the binary treatment. The first two columns are the average natural direct effects; the last two are the average natural indirect effects.

| Estimator | $\hat\mu(1,0) - \hat\mu(0,0)$ | $\hat\mu(1,1) - \hat\mu(0,1)$ | $\hat\mu(1,1) - \hat\mu(1,0)$ | $\hat\mu(0,1) - \hat\mu(0,0)$ |
|---|---|---|---|---|
| CBS | -0.075 (-0.348, 0.184) | -0.082 (-0.336, 0.185) | 0.005 (-0.041, 0.051) | 0.011 (-0.050, 0.072) |

Let $T$ denote the total hours a participant spent in either academic or vocational classes during the 12-month training program. To analyze the binary treatment mediation effect, we set the binary treatment $D = 1$ for all $T > 0$ and $D = 0$ for $T = 0$. The dataset contains 10,775 individuals, whose post-treatment variables $M$ and $Y$ were fully observed in the follow-up surveys after 2 and 4 years, respectively (see Huber et al. 2020 for a detailed description of the dataset and related statistics). The empirical results, summarized in Table 3, indicate that neither the direct nor the indirect effects of this treatment on criminal behavior are significant. This lack of significance suggests that simply attending the training, regardless of the duration, does not have a causal effect on criminal behavior.

Given this finding, it is important to consider not just the presence of the treatment (i.e., attending the training) but also its intensity (how many hours were spent in training). We then analyze the 4,000 individuals, out of the 10,775 in the original sample, who received a positive treatment intensity, that is, $T > 0$.
We apply the proposed method to this dataset to analyze the mediation effects, with the aim of helping policymakers design more efficient intervention programs. The benchmark treatment level is fixed at $t' = 40$, that is, a rather small intensity of 40 hours. We compute the response curve estimators $\hat\mu(t,t')$ and $\hat\mu_h(t,t')$ proposed in (3.6) and (3.7), respectively. Then, we use them to obtain the estimates of the direct and indirect effects for varying $t \in \{100, 200, \ldots, 1900, 2000\}$. The smoothing parameters are determined using the data-driven approach described in Section 6.

Figure 1 reports the estimated direct and indirect effects based on our CBS estimator $\hat\mu(t,t')$, while Figure 2 reports the estimated direct and indirect effects based on our CBK estimator $\hat\mu_h(t,t')$. Figure 1 shows that the direct effects are significantly negative at the 5% level for all $t$ values considered, with the effects becoming larger as the training time increases. Additionally, all estimated indirect effects are insignificant. The results in Figure 2 are similar to those of Huber et al. (2020). They show that small treatment intensities do not reduce the number of arrests. From 800 hours onward, the direct effects become significantly negative, the 95% confidence bands exclude zero, and the effect peaks at around 1400 hours. The estimated indirect effects are mostly insignificant, consistent with the results presented in Figure 1.

In conclusion, the Job Corps program has significant direct effects on the number of arrests in the fourth year. In contrast, for the investigated range of treatment intensities, the indirect effects of program-induced employment changes on arrests are close to zero. Comparing Figure 1 with Figure 2 and the results in Huber et al. (2020), our sieve regression estimator gives smoother results than the kernel-type ones.
The kernel-type estimators seem to suffer from boundary effects, producing unstable results near the boundaries of the support of the treatment values. Recall that in our simulation, with or without interaction and for both linear and nonlinear models, our CBS estimator gives better estimates of the direct effects than CBK. Thus, Figure 1 may provide a more reliable result. However, to confirm these analyses, some specification and monotonicity tests are needed, which can be an interesting future direction.

9 Discussion and Conclusion

This study provides a novel approach for estimating causal mediation effects, which unifies the binary, multi-valued, and continuous treatments and the mixture of discrete and continuous treatments under a sequential ignorability condition. Furthermore, we establish the asymptotic normality of our proposed estimators. In particular, we show that our proposed estimators attain the semiparametric efficiency bounds when the treatment is discrete and are asymptotically more efficient than the existing method when the treatment is continuous. Nevertheless, our study has some limitations that should be addressed by future research. First, an extension to allow for high-dimensional covariates is needed, since the fully nonparametric estimation suffers from the curse of dimensionality. Second, our idea is readily applicable to panel data, which is commonly observed in practice; however, the asymptotic analysis could be more difficult and is a worthwhile topic for future study.

Acknowledgments

The authors sincerely thank the editor Esfandiar Maasoumi and the referees for their constructive suggestions and comments. Wei Huang's research is supported by the Professor Maurice H. Belz Fund of the University of Melbourne.
Zheng Zhang is supported by the fund from the National Natural Science Foundation of Beijing, China [grant number 1222007] and the fund for building world-class universities (disciplines) of Renmin University of China [project number KYGJC2023011].

SUPPLEMENTARY MATERIAL

Supplementary Material for “Nonparametric Estimation of Mediation Effects with A General Treatment”: The supplementary material is only for online publication (pdf file). It contains the assumptions required to derive the asymptotic properties of $\hat\pi_Z$ and detailed discussions on the assumptions, the asymptotic results of $\hat\pi_Z$, and the proofs of Theorems 1, 2 and Corollary 1.

References

Ai, C., Huang, L. & Zhang, Z. (2022), ‘A simple and efficient estimation of average treatment effects in models with unmeasured confounders’, Statistica Sinica 32(3).
Ai, C., Linton, O., Motegi, K. & Zhang, Z. (2021), ‘A unified framework for efficient estimation of general treatment models’, Quantitative Economics 12(3), 779–816.
Ai, C., Linton, O. & Zhang, Z. (2022), ‘Estimation and inference of counterfactual distribution and quantile functions in continuous treatment models’, Journal of Econometrics 228(1), 39–61.
Andrews, D. W. K. (1991), ‘Asymptotic normality of series estimators for nonparametric and semiparametric regression models’, Econometrica 59(2), 307–345.
Baron, R. M. & Kenny, D. A. (1986), ‘The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations’, Journal of Personality and Social Psychology 51(6), 1173–1182.
Cattaneo, M. D. (2010), ‘Efficient semiparametric estimation of multi-valued treatment effects under ignorability’, Journal of Econometrics 155(2), 138–154.
Chan, K. C. G., Yam, S. C. P. & Zhang, Z.
(2016), ‘Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(3), 673–700.
Chan, K., Imai, K., Yam, S. & Zhang, Z. (2016), ‘Efficient nonparametric estimation of causal mediation effects’, arXiv preprint arXiv:1601.03501.
Chen, X. (2007), ‘Large sample sieve estimation of semi-nonparametric models’, Handbook of Econometrics 6(B), 5549–5632.
Chen, X., Hong, H. & Tarozzi, A. (2008), ‘Semiparametric efficiency in GMM models with auxiliary data’, The Annals of Statistics 36(2), 808–843.
Chernozhukov, V., Fernández-Val, I. & Melly, B. (2013), ‘Inference on counterfactual distributions’, Econometrica 81(6), 2205–2268.
Colangelo, K. & Lee, Y.-Y. (2020), ‘Double debiased machine learning nonparametric inference with continuous treatments’, arXiv preprint arXiv:2004.03036.
Donald, S. G. & Hsu, Y.-C. (2014), ‘Estimation and inference for distribution functions and quantile functions in treatment effect models’, Journal of Econometrics 178(3), 383–397.
Dong, Y., Lee, Y.-Y. & Gou, M. (2019), ‘Regression discontinuity designs with a continuous treatment’, Available at SSRN 3167541.
Fan, J. & Gijbels, I. (1996), Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, Vol. 66, CRC Press.
Fong, C., Hazlett, C. & Imai, K. (2018), ‘Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements’, Annals of Applied Statistics 12(1), 156–177.
Frölich, M. & Huber, M. (2017), ‘Direct and indirect treatment effects–causal chains and mediation analysis with instrumental variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(5), 1645–1666.
Galvao, A. F. & Wang, L.
(2015), ‘Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment’, Journal of the American Statistical Association 110(512), 1528–1542.
Goetgeluk, S., Vansteelandt, S. & Goetghebeur, E. (2008), ‘Estimation of controlled direct effects’, Journal of the Royal Statistical Society Series B: Statistical Methodology 70(5), 1049–1066.
Hahn, J. (1998), ‘On the role of the propensity score in efficient semiparametric estimation of average treatment effects’, Econometrica 66(2), 315–331.
Hirano, K. & Imbens, G. W. (2004), The propensity score with continuous treatments, in A. Gelman & X.-L. Meng, eds, ‘Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives’, John Wiley & Sons Ltd., chapter 7, pp. 73–84.
Hirano, K., Imbens, G. W. & Ridder, G. (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71(4), 1161–1189.
Hong, G., Deutsch, J. & Hill, H. D. (2015), ‘Ratio-of-mediator-probability weighting for causal mediation analysis in the presence of treatment-by-mediator interaction’, Journal of Educational and Behavioral Statistics 40(3), 307–340.
Hsu, Y.-C., Huber, M. & Lai, T.-C. (2018), ‘Nonparametric estimation of natural direct and indirect effects based on inverse probability weighting’, Journal of Econometric Methods 8(1).
Hsu, Y.-C. & Lai, T.-C. (2018), ‘Treatment effect models: A brief review’, Jing Ji Lun Wen Cong Kan 46(4), 501–521.
Huang, W., Linton, O. & Zhang, Z. (2022), ‘A unified framework for specification tests of continuous treatment effect models’, Journal of Business & Economic Statistics 40(4), 1817–1830.
Huber, M. (2014), ‘Identifying causal mechanisms (primarily) based on inverse probability weighting’, Journal of Applied Econometrics 29(6), 920–943.
Huber, M., Hsu, Y.-C., Lee, Y.-Y. & Lettry, L.
(2020), ‘Direct and indirect effects of continuous treatments based on generalized propensity score weighting’, Journal of Applied Econometrics 35(7), 814–840.
Imai, K., Keele, L. & Tingley, D. (2010), ‘A general approach to causal mediation analysis’, Psychological Methods 15(4), 309.
Imai, K., Keele, L., Tingley, D. & Yamamoto, T. (2011), ‘Unpacking the black box of causality: Learning about causal mechanisms from experimental and observational studies’, American Political Science Review 105(4), 765–789.
Imai, K., Keele, L. & Yamamoto, T. (2010), ‘Identification, inference, and sensitivity analysis for causal mediation effects’, Statistical Science 25(1), 51–71.
Imai, K. & van Dyk, D. A. (2004), ‘Causal inference with general treatment regimes: Generalizing the propensity score’, Journal of the American Statistical Association 99(467), 854–866.
Joffe, M. M., Small, D., Hsu, C.-Y. et al. (2007), ‘Defining and estimating intervention effects for groups that will develop an auxiliary outcome’, Statistical Science 22(1), 74–97.
Kennedy, E. H., Ma, Z., McHugh, M. D. & Small, D. S. (2017), ‘Non-parametric methods for doubly robust estimation of continuous treatment effects’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(4), 1229–1245.
Lee, D. S. & Lemieux, T. (2010), ‘Regression discontinuity designs in economics’, Journal of Economic Literature 48(2), 281–355.
Lee, Y.-Y. (2018), ‘Efficient propensity score regression estimators of multivalued treatment effects for the treated’, Journal of Econometrics 204(2), 207–222.
Lewbel, A. & Linton, O. (2007), ‘Nonparametric matching and efficient estimators of homothetically separable functions’, Econometrica 75(4), 1209–1227.
Li, Q. & Racine, J. S. (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Liu, Z., Shen, J., Barfield, R., Schwartz, J., Baccarelli, A. A. & Lin, X.
(2021), ‘Large-scale hypothesis testing for causal mediation effects with applications in genome-wide epigenetic studies’, Journal of the American Statistical Association pp. 1–15.
Newey, W. K. (1994), ‘The asymptotic variance of semiparametric estimators’, Econometrica 62(6), 1349–1382.
Newey, W. K. (1997), ‘Convergence rates and asymptotic normality for series estimators’, Journal of Econometrics 79(1), 147–168.
Pearl, J. (2001), ‘Direct and indirect effects’, Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence pp. 411–420.
Robins, J. M. & Greenland, S. (1992), ‘Identifiability and exchangeability for direct and indirect effects’, Epidemiology pp. 143–155.
Schochet, P. Z., Burghardt, J. & McConnell, S. (2008), ‘Does Job Corps work? Impact findings from the National Job Corps Study’, American Economic Review 98(5), 1864–86.
Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall.
Singh, R., Xu, L. & Gretton, A. (2021), ‘Kernel methods for multistage causal inference: Mediation analysis and dynamic treatment effects’, arXiv preprint arXiv:2111.03950.
Tchetgen Tchetgen, E. J. & Shpitser, I. (2012), ‘Semiparametric theory for causal mediation analysis: Efficiency bounds, multiple robustness, and sensitivity analysis’, The Annals of Statistics 40(3), 1816–1845.
VanderWeele, T. (2015), Explanation in Causal Inference: Methods for Mediation and Interaction, Oxford University Press.
VanderWeele, T. J. (2009), ‘Marginal structural models for the estimation of direct and indirect effects’, Epidemiology 20(1), 18–26.
VanderWeele, T. J. & Vansteelandt, S. (2010), ‘Odds ratios for mediation analysis for a dichotomous outcome’, American Journal of Epidemiology 172(12), 1339–1348.

Appendix

A Proof of (2.2)

We rewrite the identification of $\mu(t, t')$ in Huber et al.
(2020) as follows:

$$
\begin{aligned}
\mu(t, t') &= \mathbb{E}\left[ \frac{f_T(t)}{f_{T|M,X}(t \mid M, X)} \cdot \frac{f_{T|M,X}(t' \mid M, X)}{f_{T|X}(t' \mid X)} \, Y \,\middle|\, T = t \right] \\
&= \mathbb{E}\left[ \frac{f_T(t)}{f_{T|M,X}(t \mid M, X)} \cdot \frac{f_{T|M,X}(t' \mid M, X)}{f_T(t')} \cdot \frac{f_T(t')}{f_{T|X}(t' \mid X)} \, Y \,\middle|\, T = t \right] \\
&= \mathbb{E}\left[ \frac{f_T(T)}{f_{T|M,X}(T \mid M, X)} \cdot \frac{f_{T|M,X}(T + \delta \mid M, X)}{f_T(T + \delta)} \cdot \frac{f_T(T + \delta)}{f_{T|X}(T + \delta \mid X)} \, Y \,\middle|\, T = t \right] \\
&= \mathbb{E}\left[ \frac{\pi_{M,X}(T, M, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot \pi_X(T + \delta, X) \, Y \,\middle|\, T = t \right].
\end{aligned}
$$

B Some Preliminary Results

We recall some preliminary results on the convergence rates of $\widehat{\pi}_{K_Z}(t, Z)$ that are directly implied by the results in Ai et al. (2021). We impose the following conditions based on those of Ai et al. (2021). For $Z \in \{X, (M, X)\}$, we assume:

Assumption 4. (i) The support of $Z$, denoted $\mathcal{Z}$, is a compact set. (ii) There exist two positive constants $\eta_1$ and $\eta_2$ such that

$$
0 < \eta_1 \le \pi_Z(t, z) \le \eta_2 < \infty, \quad \forall (t, z) \in \mathcal{T} \times \mathcal{Z}.
$$

Assumption 5. There exist $\Lambda_{k_1 \times k_Z} \in \mathbb{R}^{k_1 \times k_Z}$ and a constant $\alpha_Z > 0$ such that

$$
\sup_{(t, z) \in \mathcal{T} \times \mathcal{Z}} \left| \rho'^{-1}\{\pi_Z(t, z)\} - u_{k_1}(t)^\top \Lambda_{k_1 \times k_Z} v_{k_Z}(z) \right| = O\left(K_Z^{-\alpha_Z}\right),
$$

where $\rho'^{-1}(v) = -\log v - 1$.

Assumption 6. (i) The eigenvalues of $\mathbb{E}[u_{k_1}(T) u_{k_1}(T)^\top]$ and $\mathbb{E}[v_{k_Z}(Z) v_{k_Z}(Z)^\top]$ are bounded away from zero and infinity uniformly in $k_1$ and $k_Z$. (ii) There are sequences of constants $\zeta_1(k_1)$ and $\zeta_Z(k_Z)$ satisfying $\sup_{t \in \mathcal{T}} \|u_{k_1}(t)\| \le \zeta_1(k_1)$ and $\sup_{z \in \mathcal{Z}} \|v_{k_Z}(z)\| \le \zeta_Z(k_Z)$ such that $\sqrt{N} K_Z^{-\alpha_Z} \to 0$ and $\zeta(K_Z) \sqrt{K_Z^2 / N} \to 0$ as $N \to \infty$, where $K_Z = k_1 k_Z$ and $\zeta(K_Z) = \zeta_1(k_1) \zeta_Z(k_Z)$.

Assumption 4 (i) requires the covariates, the treatment variable and the mediator to be bounded. This condition, despite being restrictive, is commonly imposed in the non-parametric regression literature. However, we can replace it with a restriction on the tail distribution of $(M, X, T)$.
For example, Chen, Hong & Tarozzi (2008, Assumption 3) assume that the support of $X$ is the entire Euclidean space, but impose $\int_{\mathbb{R}^r} (1 + |x|^2)^{\omega} f_X(x) \, dx < \infty$ for some $\omega > 0$. Assumption 4 (ii) requires the weighting function to be bounded and bounded away from zero. We can relax Assumption 4 (ii) by allowing $\eta_1$ ($\eta_2$) to go to zero (infinity) slowly as $N \to \infty$. Notice that $u_{k_1}(t)^\top \Lambda v_{k_Z}(z)$ is a linear sieve approximation for $\rho'^{-1}\{\pi_Z(t, z)\}$. Assumption 5 requires the sieve approximation error to shrink to zero at a polynomial rate. A variety of sieve basis functions satisfy this condition. When both $T$ and $Z$ are discrete, $K_Z$ is a finite constant independent of $N$ and $\alpha_Z = +\infty$. When $(T, Z)$ has continuous or mixed components, $K_Z \to \infty$ as $N \to \infty$ and $\alpha_Z$ is positively affected by the smoothness of $\rho'^{-1}(\cdot)$ and negatively affected by the number of continuous and mixed components. Assumption 6 (i) ensures that the sieve estimator is non-degenerate. This condition is common in the sieve regression literature (see Andrews 1991 and Newey 1997). If the approximation error is nonzero, Assumption 6 (ii) imposes a restriction on the growth rate of the smoothing parameters $k_1$ and $k_Z$ to ensure under-smoothing.

Under these conditions, the following are direct results from Ai et al. (2021, Propositions 1 and 2):

Proposition 1. Suppose that Assumptions 4–6 hold. For $Z \in \{X, (M, X)\}$, we have

$$
\sup_{(t, m, x) \in \mathcal{T} \times \mathcal{M} \times \mathcal{X}} \left| \widehat{\pi}_{K_Z}(t, z) - \pi_Z(t, z) \right| = O_p\left( \max\left\{ \zeta(K_Z) K_Z^{-\alpha_Z}, \; \zeta(K_Z) \sqrt{\frac{K_Z}{N}} \right\} \right),
$$

and

$$
\int_{\mathcal{T} \times \mathcal{Z}} \left| \widehat{\pi}_{K_Z}(t, z) - \pi_Z(t, z) \right|^2 dF_{T,Z}(t, z) = O_p\left( \max\left\{ K_Z^{-2\alpha_Z}, \; \frac{K_Z}{N} \right\} \right),
$$

and

$$
\frac{1}{N} \sum_{i=1}^{N} \left| \widehat{\pi}_{K_Z}(T_i, Z_i) - \pi_Z(T_i, Z_i) \right|^2 = O_p\left( \max\left\{ K_Z^{-2\alpha_Z}, \; \frac{K_Z}{N} \right\} \right).
$$

Proposition 2. Assume that Assumptions 4–6 hold.
For any square-integrable random variable $\phi(T, M, X, Y) \in L^2$ and $Z \in \{X, (M, X)\}$, if there exist a $\Gamma_{k_1 \times k_Z} \in \mathbb{R}^{k_1 \times k_Z}$ and a constant $\gamma_Z > 0$ such that

$$
\sup_{(t, z) \in \mathcal{T} \times \mathcal{Z}} \left| \mathbb{E}[\phi(T, M, X, Y) \mid T = t, Z = z] - u_{k_1}(t)^\top \Gamma_{k_1 \times k_Z} v_{k_Z}(z) \right| = O\left(K_Z^{-\gamma_Z}\right),
$$

then we have, for any $\delta \ge 0$,

$$
\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left\{ \widehat{\pi}_{K_Z}(T_i + \delta, Z_i) \, \phi(T_i, M_i, X_i, Y_i) - \mathbb{E}\left[ \pi_Z(T + \delta, Z) \, \phi(T, M, X, Y) \right] \right\} \\
&\quad = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} IF_{\pi_Z, i}\{\delta, \phi(T, M, X, Y)\} + O_p\left( \sqrt{N} K_Z^{-\alpha_Z} \right) + O_p\left( K_Z^{-\gamma_Z} \right) + O_p\left( \zeta(K_Z) \sqrt{\frac{K_Z^2}{N}} \right),
\end{aligned}
\tag{B.1}
$$

where, for $i = 1, \ldots, N$,

$$
\begin{aligned}
IF_{\pi_Z, i}\{\delta, \phi(T, M, X, Y)\} ={}& \pi_Z(T_i + \delta, Z_i) \, \phi(T_i, M_i, X_i, Y_i) \\
& - \pi_Z(T_i, Z_i) \frac{f_{T|Z}(T_i - \delta \mid Z_i)}{f_{T|Z}(T_i \mid Z_i)} \, \mathbb{E}\left[ \phi(T_i - \delta, M_i, X_i, Y_i) \mid T_i, Z_i \right] \\
& + \mathbb{E}\left[ \pi_Z(T_i, Z_i) \frac{f_{T|Z}(T_i - \delta \mid Z_i)}{f_{T|Z}(T_i \mid Z_i)} \, \phi(T_i - \delta, M_i, X_i, Y_i) \,\middle|\, Z_i \right] \\
& - \mathbb{E}\left[ \pi_Z(T_i, Z_i) \frac{f_{T|Z}(T_i - \delta \mid Z_i)}{f_{T|Z}(T_i \mid Z_i)} \, \phi(T_i - \delta, M_i, X_i, Y_i) \right] \\
& + \mathbb{E}\left[ \pi_Z(T_i, Z_i) \frac{f_{T|Z}(T_i - \delta \mid Z_i)}{f_{T|Z}(T_i \mid Z_i)} \, \phi(T_i - \delta, M_i, X_i, Y_i) \,\middle|\, T_i \right] \\
& - \mathbb{E}\left[ \pi_Z(T_i, Z_i) \frac{f_{T|Z}(T_i - \delta \mid Z_i)}{f_{T|Z}(T_i \mid Z_i)} \, \phi(T_i - \delta, M_i, X_i, Y_i) \right],
\end{aligned}
\tag{B.2}
$$

and $\mathbb{E}[IF_{\pi_Z, i}\{\delta, \phi(T, M, X, Y)\}] = 0$.

Using these results, we show the following lemma, which is useful for deriving our main theorems. The proof can be found in Section S2 of the supplemental material.

Lemma 3. Assume that Assumptions 4–6 hold. For any square-integrable random variable $\phi(T, M, X, Y) \in L^2$ and $Z \in \{X, (M, X)\}$, if there exist a $\Gamma_{k_1 \times k_Z} \in \mathbb{R}^{k_1 \times k_Z}$ and a constant $\gamma_Z > 0$ such that
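$$
\sup_{(t, z) \in \mathcal{T} \times \mathcal{Z}} \left| \mathbb{E}[\phi(T, M, X, Y) \mid T = t, Z = z] - u_{k_1}(t)^\top \Gamma_{k_1 \times k_Z} v_{k_Z}(z) \right| = O\left(K_Z^{-\gamma_Z}\right),
$$

then we have, for any $\delta \ge 0$,

$$
\begin{aligned}
& \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left\{ \frac{\widehat{\pi}_{K_{M,X}}(T_i, M_i, X_i)}{\widehat{\pi}_{K_{M,X}}(T_i + \delta, M_i, X_i)} \, \widehat{\pi}_{K_X}(T_i + \delta, X_i) \, \phi(T_i, M_i, X_i, Y_i) - \mathbb{E}\left[ \frac{\pi_{M,X}(T, M, X)}{\pi_{M,X}(T + \delta, M, X)} \, \pi_X(T + \delta, X) \, \phi(T, M, X, Y) \right] \right\} \\
&\quad = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \Bigg[ IF_{\pi_X, i}\left\{ \delta, \frac{\pi_{M,X}(T, M, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot \phi(T, M, X, Y) \right\} \\
&\qquad\quad + IF_{\pi_{M,X}, i}\left\{ 0, \frac{\pi_X(T + \delta, X)}{\pi_{M,X}(T + \delta, M, X)} \cdot \phi(T, M, X, Y) \right\} \\
&\qquad\quad - IF_{\pi_{M,X}, i}\left\{ \delta, \frac{\pi_{M,X}(T, M, X) \, \pi_X(T + \delta, X)}{\pi_{M,X}^2(T + \delta, M, X)} \cdot \phi(T, M, X, Y) \right\} \Bigg] \\
&\qquad + O_p\left( \sqrt{N} K_X^{-\alpha_X} \right) + O_p\left( K_X^{-\gamma_X} \right) + O_p\left( \zeta(K_X) \sqrt{\frac{K_X^2}{N}} \right) \\
&\qquad + O_p\left( \sqrt{N} K_{M,X}^{-\alpha_{M,X}} \right) + O_p\left( K_{M,X}^{-\gamma_{M,X}} \right) + O_p\left( \zeta(K_{M,X}) \sqrt{\frac{K_{M,X}^2}{N}} \right).
\end{aligned}
\tag{B.3}
$$

B.1 Estimating the $d_{K_0, i}$'s

To estimate the asymptotic variance $V_{tt'}$ in Section 5, for $i = 1, \ldots, N$, we estimate $d_{K_0, i}$ by

$$
\begin{aligned}
\widehat{d}_{K_0, i}(T, M, X, Y; \delta) :={}& \widehat{IF}_{\pi_X, i}\left\{ \delta, \frac{\widehat{\pi}_{K_{M,X}}(T, M, X)}{\widehat{\pi}_{K_{M,X}}(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right\} \\
& + \widehat{IF}_{\pi_{M,X}, i}\left\{ 0, \frac{\widehat{\pi}_{K_X}(T + \delta, X)}{\widehat{\pi}_{K_{M,X}}(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right\} \\
& - \widehat{IF}_{\pi_{M,X}, i}\left\{ \delta, \frac{\widehat{\pi}_{K_{M,X}}(T, M, X) \, \widehat{\pi}_{K_X}(T + \delta, X)}{\widehat{\pi}_{K_{M,X}}^2(T + \delta, M, X)} \cdot u_{K_0}(T) Y \right\} \\
& - \widehat{\mathbb{E}}\left[ \frac{\widehat{\pi}_{K_{M,X}}(T_i, M_i, X_i) \, \widehat{\pi}_{K_X}(T_i + \delta, X_i)}{\widehat{\pi}_{K_{M,X}}(T_i + \delta, M_i, X_i)} \cdot u_{K_0}(T_i) Y_i \,\middle|\, T_i \right] \\
& + \frac{1}{N} \sum_{j=1}^{N} \frac{\widehat{\pi}_{K_{M,X}}(T_j, M_j, X_j) \, \widehat{\pi}_{K_X}(T_j + \delta, X_j)}{\widehat{\pi}_{K_{M,X}}(T_j + \delta, M_j, X_j)} \cdot u_{K_0}(T_j) Y_j,
\end{aligned}
$$

where, for $Z \in \{X, (M, X)\}$, $\widehat{IF}_{\pi_Z, i}$ is an estimator of $IF_{\pi_Z, i}$ defined in (B.2) in Appendix B, and $\widehat{\mathbb{E}}$ denotes the least squares regression.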
Specifically, we estimate $\pi_Z$, $f_{T|Z}$, the conditional expectations, and the expectations in (B.2) by $\widehat{\pi}_{K_Z}$, the conditional kernel density estimator, the least squares regression of the estimated response variable on the corresponding sieve basis, and the sample average of the estimated variables, respectively. For example, $\mathbb{E}[\pi_X(T_i, X_i) u_{K_0}(T_i) Y_i \mid T_i, X_i]$ is estimated by the least squares regression of $\widehat{\pi}_{K_X}(T_i, X_i) u_{K_0}(T_i) Y_i$ on a sieve basis $w_{K_{T,X}}(T_i, X_i)$.

C Asymptotics of Kernel Estimators under Continuous Treatments

To derive the asymptotic normality of our kernel regression estimators, the following assumptions are imposed.

Assumption 7. $\mathcal{K}(\cdot)$ is a univariate kernel function symmetric around the origin that satisfies (i) $\int \mathcal{K}(u) \, du = 1$; (ii) $\int u^2 \mathcal{K}(u) \, du = \kappa_{21} \in (0, \infty)$; (iii) $\int \mathcal{K}^2(u) \, du = \kappa_{02} < \infty$; and (iv) $\int |\mathcal{K}(u)|^{2 + \delta} \, du < \infty$, for some $\delta > 0$.

Assumption 8. As $N \to \infty$, $h \to 0$, $Nh \to \infty$ and $Nh^5 \to 0$.

Assumption 8 is common in the kernel regression literature (see Li & Racine 2007).

Theorem 4. Suppose Assumptions 1, 4–8 hold.
Then, we have

$$
\sqrt{Nh} \left\{ \widehat{\mu}_h(t, t') - \mu(t, t') \right\} = \sqrt{\frac{h}{N}} \sum_{i=1}^{N} \psi_{tt'}(Y_i, T_i, M_i, X_i; h) + o_P(1) \xrightarrow{d} N\left(0, V^h_{tt'}\right),
$$

where

$$
\begin{aligned}
\psi_{tt'}(Y_i, T_i, M_i, X_i; h) ={}& \frac{\pi_{M,X}(T_i, M_i, X_i) \, \pi_X(T_i + \delta, X_i)}{\pi_{M,X}(T_i + \delta, M_i, X_i)} \cdot \frac{\mathcal{K}_h(T_i - t)}{p_{t,h}} \left\{ Y_i - \mathbb{E}[Y \mid T_i, M_i, X_i] \right\} \\
& - \frac{f_{T|X}(T_i - \delta \mid X_i) \, \pi_X(T_i, X_i)}{f_{T|X}(T_i \mid X_i)} \cdot \frac{\mathcal{K}_h(T_i - t - \delta)}{p_{t,h}} \, \mathbb{E}\left[ \frac{\pi_{M,X}(T_i - \delta, M_i, X_i) \, Y_i}{\pi_{M,X}(T_i, M_i, X_i)} \,\middle|\, T_i, X_i \right] \\
& + \frac{f_T(T_i - \delta) \, \pi_X(T_i, X_i)}{f_T(T_i)} \cdot \frac{\mathcal{K}_h(T_i - t - \delta)}{p_{t,h}} \, \mathbb{E}[Y_i \mid T_i, M_i, X_i],
\end{aligned}
$$

and

$$
\begin{aligned}
V^h_{tt'} ={}& \lim_{h \to 0} h \cdot \mathbb{E}\left[ \left\{ \psi_{tt'}(Y_i, T_i, M_i, X_i; h) \right\}^2 \right] \\
={}& \frac{\kappa_{02}}{f_T(t)} \, \mathbb{E}\left[ \frac{\pi_{M,X}^2(T_i, M_i, X_i)}{\pi_{M,X}^2(T_i + \delta, M_i, X_i)} \, \pi_X^2(T_i + \delta, X_i) \left\{ Y_i - \mathbb{E}[Y \mid T_i, M_i, X_i] \right\}^2 \,\middle|\, T_i = t \right] \\
& + \frac{\kappa_{02} f_T(t + \delta)}{f_T^2(t)} \, \mathbb{E}\Bigg[ \Bigg( \pi_X(T_i, X_i) \frac{f_{T|X}(T_i - \delta \mid X_i)}{f_{T|X}(T_i \mid X_i)} \, \mathbb{E}\left[ \frac{\pi_{M,X}(T_i - \delta, M_i, X_i) \, Y_i}{\pi_{M,X}(T_i, M_i, X_i)} \,\middle|\, T_i, X_i \right] \\
& \qquad\qquad - \pi_X(T_i, X_i) \frac{f_T(T_i - \delta)}{f_T(T_i)} \, \mathbb{E}[Y_i \mid T_i, M_i, X_i] \Bigg)^2 \,\Bigg|\, T_i = t + \delta \Bigg],
\end{aligned}
$$

with $p_{t,h} = \mathbb{E}[\mathcal{K}_h(T - t)]$ and $\kappa_{ij} = \int u^i \mathcal{K}^j(u) \, du$.

Following Theorem 4, we establish the asymptotics of the estimated direct and indirect effects.

Remark 6. Note that the asymptotic variances of $\widehat{\mu}^{NKW}(t, t)$ and $\widehat{\mu}^{NKW}(t, t')$ introduced by Huber et al. (2020) are

$$
V^{NKW}_{tt} =
\begin{cases}
\kappa_{02} \, \mathbb{E}\left[ \dfrac{\mathrm{Var}[Y \mid T = t, X]}{f_{T|X}(t \mid X)} \right] & \text{if } h = h_1 = h_2 \text{ and } \mathcal{K}_{1,h_1}(\cdot) = \mathcal{K}_{2,h_2}(\cdot); \\[2ex]
\kappa_{02} \, \mathbb{E}\left[ \dfrac{\mathbb{E}\left[ \{Y - \mu(t, t)\}^2 \mid T = t, X \right]}{f_{T|X}(t \mid X)} \right] & \text{if } h = h_2 < h_1,
\end{cases}
$$

and

$$
V^{NKW}_{tt'} =
\begin{cases}
\kappa_{02} \left( \mathbb{E}\left[ \dfrac{\mathrm{Var}(Y \mid T = t, M, X) \, f_{T|M,X}^2(t' \mid M, X)}{f_{T|M,X}(t \mid M, X) \, f_{T|X}^2(t' \mid X)} \right] + \mathbb{E}\left[ \dfrac{\mathrm{Var}\{ g(t, M, X) \mid T = t', X \}}{f_{T|X}(t' \mid X)} \right] \right) & \text{if } h = h_1 = h_2 \text{ and } \mathcal{K}_{1,h_1}(\cdot) = \mathcal{K}_{2,h_2}(\cdot); \\[2ex]
\kappa_{02} \, \mathbb{E}\left[ \dfrac{\mathbb{E}\left[ (Y - \mu(t, t'))^2 \mid T = t, M, X \right] f_{T|M,X}^2(t' \mid M, X)}{f_{T|M,X}(t \mid M, X) \, f_{T|X}^2(t' \mid X)} \right] & \text{if } h = h_2 < h_1,
\end{cases}
$$

where $\mathcal{K}_{1,h_1}$ and $\mathcal{K}_{2,h_2}$ are two prespecified kernel functions used in the estimators of Huber et al. (2020), and $g(t, M, X) = \mathbb{E}[Y \mid T = t, M, X]$. We prove that $V^h_{tt} \le V^{NKW}_{tt}$ and $V^h_{tt'} \le V^{NKW}_{tt'}$ in Section S7 of the Supplementary Material.
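The kernel constants $\kappa_{21}$ and $\kappa_{02}$ of Assumption 7 and the bandwidth rates of Assumption 8 can be checked numerically for a given kernel. The following is a minimal sketch, not the authors' implementation: the Gaussian and Epanechnikov kernels, the midpoint-rule integrator `kappa`, and the bandwidth rule `bandwidth` with $h = c N^{-1/4}$ are all illustrative assumptions that do not appear in the paper.

```python
import math


def gaussian(u):
    """Standard normal density, a symmetric kernel integrating to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kappa(kernel, i, j, lo=-10.0, hi=10.0, n=100_000):
    """Midpoint-rule approximation of kappa_ij = int u^i K(u)^j du.

    The integration range [lo, hi] is an assumption; it must cover the
    region where the kernel has non-negligible mass.
    """
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * step
        total += (u ** i) * (kernel(u) ** j)
    return total * step


def bandwidth(n_obs, c=1.0, rate=0.25):
    """Hypothetical rule h = c * N^(-rate).

    Any rate strictly between 1/5 and 1 gives N*h -> infinity and
    N*h^5 -> 0, as Assumption 8 requires; rate=0.25 is only an example.
    """
    return c * n_obs ** (-rate)


if __name__ == "__main__":
    for name, kern in [("gaussian", gaussian), ("epanechnikov", epanechnikov)]:
        # Columns: integral of K, kappa_21, kappa_02.
        print(name, kappa(kern, 0, 1), kappa(kern, 2, 1), kappa(kern, 0, 2))
    for n_obs in (10**2, 10**4, 10**6):
        h = bandwidth(n_obs)
        # N*h should grow and N*h^5 should shrink as N increases.
        print(n_obs, h, n_obs * h, n_obs * h**5)
```

For the Gaussian kernel the exact values are $\kappa_{21} = 1$ and $\kappa_{02} = 1/(2\sqrt{\pi})$, and for the Epanechnikov kernel they are $\kappa_{21} = 1/5$ and $\kappa_{02} = 3/5$, so the numerical output can be compared against these.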
