Nonparametric Identification and Inference for Counterfactual Distributions with Confounding
Jianle Sun¹ & Kun Zhang¹,²
¹Department of Philosophy, Carnegie Mellon University
²Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence
February 19, 2026

Abstract

We propose nonparametric identification and semiparametric estimation of joint potential outcome distributions in the presence of confounding. First, in settings with observed confounding, we derive tighter, covariate-informed bounds on the joint distribution by leveraging conditional copulas. To overcome the non-differentiability of the bounding min/max operators, we establish the asymptotic properties of both a direct estimator under a polynomial margin condition and a smooth approximation based on the log-sum-exp operator, facilitating valid inference for individual-level effects under the canonical rank-preserving assumption. Second, we tackle the challenge of unmeasured confounding by introducing a causal representation learning framework. By utilizing instrumental variables, we prove the nonparametric identifiability of the latent confounding subspace under injectivity and completeness conditions. We develop a "triple machine learning" estimator that employs a cross-fitting scheme to sequentially handle the learned representation, the nuisance parameters, and the target functional. We characterize the asymptotic distribution with variance inflation induced by representation learning error, and provide conditions for semiparametric efficiency. We also propose a practical VAE-based algorithm for confounding representation learning. Simulations and a real-world analysis validate the effectiveness of the proposed methods. By bridging classical semiparametric theory with modern representation learning, this work provides a robust statistical foundation for distributional and counterfactual inference in complex causal systems.

Keywords: counterfactual inference, conditional copulas, instrumental variable, causal representation learning, semiparametric efficiency, double machine learning

1 Introduction

Causal inference fundamentally aims to predict how individuals or populations respond to competing interventions, thereby concerning the comparison of potential outcomes under alternative treatment regimes. While classical estimands such as the Average Treatment Effect (ATE) focus on mean differences, many scientifically relevant questions, such as the probability of benefit, quantile effects, or distributional shifts, depend on the entire distributions of potential outcomes. However, researchers typically face a dual hurdle in capturing these distributions. First, in the presence of confounding, even the marginal distributions of $Y(1)$ and $Y(0)$ are generally not identifiable, and existing instrumental variable (IV) approaches often rely on restrictive parametric assumptions or focus only on local effects Angrist et al. (1996), Swanson et al. (2018). Second, even when conditional ignorability holds and the marginals are identifiable, the joint distribution of $(Y(1), Y(0))$ remains fundamentally unobservable without additional structural assumptions.

This paper bridges these gaps by providing a unified, principled framework for distributional causal inference. Our first contribution addresses the "missing data" problem of the joint distribution under the assumption of no unmeasured confounding.
We develop tight, covariate-informed Fréchet–Hoeffding (FH) bounds Nelsen (2006) on the joint distribution under no unmeasured confounding. Leveraging conditional copulas, we show that the sharp upper bound admits a clear structural interpretation as conditional rank preservation (or conditional comonotonicity) Nelsen (2006), a canonical assumption underlying individual treatment effect estimation and counterfactual reasoning Xie et al. (2023), Wu et al. (2025). To move from theory to practice, we address the non-smoothness of these bounds via two complementary paths: a direct estimator under a polynomial margin condition and a smooth log-sum-exp approximation. We further establish their asymptotic properties, enabling valid frequentist inference and confidence intervals for rank-preserving structures Levis et al. (2025).

Our second contribution tackles the more daunting scenario where confounding is unmeasured; in this scenario even the nonparametric identification of the marginal distributions is non-trivial. Inspired by recent advances in causal representation learning Kong et al. (2022), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025), Moran & Aragam (2026), we propose a representation-learning-based framework that leverages IVs to recover latent confounding structures. Under suitable completeness and independence assumptions, we show that the confounding subspace is identified up to an invertible transformation. This allows the learned representation to serve as a valid proxy for unobserved confounders, thereby enabling identification of marginal potential outcome distributions in complex settings.

To implement this framework, we introduce a triple machine learning (TML) procedure that extends double machine learning Chernozhukov et al. (2018), Kennedy (2024) by incorporating an additional cross-fitting stage for representation learning. We rigorously characterize the impact of the first-stage representation error on the variance inflation of the asymptotic distribution, identifying regimes in which super-convergence of the representation learner yields semiparametric efficiency.

For practical estimation of confounding representations, we propose an Instrumental Variable Variational Autoencoder (IV-VAE) augmented with a Hilbert-Schmidt Independence Criterion (HSIC) penalty Gretton et al. (2005) to ensure that the recovered latent factors are truly exogenous to the instrument, satisfying the core identifying assumptions. Rather than explicitly modeling instrument-dependent latent factors, we adopt a reduced-form design that conditions the decoder directly on the observed instrument, allowing instrument-induced variation to be absorbed while isolating the latent confounding structure.

The remainder of the paper is organized as follows. Section 2 details the identification of rank-preserving bounds via conditional copulas and presents the asymptotic theory for the proposed estimators. Section 3 introduces the representation learning framework for unmeasured confounding, derives the properties of the triple machine learning estimator, and proposes an effective VAE-based learning approach. Section 4 discusses the synthesis of these methods. Section 5 provides simulation results. Section 6 applies the proposed methods to analyze the demand for cigarettes in the US. Proofs and technical details are deferred to the Appendix.
2 Bounding joint counterfactuals with all confounders observed via conditional copulas

When there is no unobserved confounding, the identification of the counterfactual marginals $F_{Y(a)}$ is straightforward. We briefly review the identification results for the marginal distributions of potential outcomes and then discuss how to leverage conditional copulas of the observed confounding covariates to derive tighter bounds for the joint distribution of the potential outcomes. We are particularly interested in the upper bound (which corresponds to the joint distribution under a conditional rank-preserving assumption), as this provides an important foundation for performing individualized counterfactual inference. However, we must also address the estimation challenges posed by the non-differentiable min/max functions.

2.1 Identification

2.1.1 Identification of counterfactual marginals

We observe $n$ i.i.d. draws $O_i = (Y_i, A_i, X_i) \sim P$, $i = 1, \dots, n$. Here $A \in \{0, 1\}$ is the treatment, $Y \in \mathbb{R}$ is the outcome, and $X \in \mathcal{X} \subset \mathbb{R}^d$ are observed covariates. Let $Y(a)$ denote the potential (counterfactual) outcome under treatment level $a$; its marginal distribution can be identified under the standard assumptions.

Definition 1 (Nuisance functions and distributions). Let $X$ be a vector of covariates (observed confounders).
1. Let $\theta_a(X) := F_{Y(a)|X}(y \mid X) = P(Y(a) \le y \mid X)$ be the conditional CDF of the potential outcome $Y(a)$ given $X$.
2. Let $F_{Y(a)}(y) := E_X[\theta_a(X)]$ be the marginal CDF of $Y(a)$, obtained by integrating the conditional CDF over the distribution of $X$.

Theorem 1 (Identification of counterfactual marginals). Assume the standard causal identification conditions:
(i) Consistency: $Y = Y(A)$.
(ii) Conditional ignorability (no unmeasured confounding): $(Y(1), Y(0)) \perp A \mid X$.
(iii) Positivity: $P(A = a \mid X) > 0$ almost surely for $a = 0, 1$.
Under conditions (i)–(iii), the conditional marginals $F_{Y(a)|X}(y \mid x)$ are identifiable from observed data,
$$ \theta_a(x) := F_{Y(a)|X}(y \mid x) = P(Y \le y \mid A = a, X = x), $$
and so are the marginals,
$$ F_{Y(a)}(y) = E[\theta_a(X)] = E_X\big[F_{Y(a)|X}(y \mid X)\big] = \int F_{Y(a)|X}(y \mid x)\, dF_X(x). $$

2.1.2 Conditional Fréchet–Hoeffding bounds on counterfactual joint distributions

We aim to move beyond counterfactual marginals to the joint distribution. Generally speaking, by Sklar's theorem, any joint distribution can be represented by its marginals $F_{Y(1)}(y_1)$ and $F_{Y(0)}(y_0)$ and a copula function $C$, such that $F_{Y(1),Y(0)}(y_1, y_0) = C(F_{Y(1)}(y_1), F_{Y(0)}(y_0))$. Without further assumptions, the copula $C$ is only bounded by the Fréchet–Hoeffding bounds $L(u_1, u_0) \le C(u_1, u_0) \le U(u_1, u_0)$, where $L(u_1, u_0) = \max(u_1 + u_0 - 1, 0)$ (the countermonotonicity copula) and $U(u_1, u_0) = \min(u_1, u_0)$ (the comonotonicity copula). This implies the marginal bounds on the joint distribution
$$ \max\{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) - 1, 0\} \le F_{Y(1),Y(0)}(y_1, y_0) \le \min\{F_{Y(1)}(y_1), F_{Y(0)}(y_0)\}. $$
We can also bound the joint distribution via conditional copulas, which yields a tighter result. Using a conditional version of Sklar's theorem, we can write $F_{Y(1),Y(0)|X}(y_1, y_0 \mid x) = C_x(\theta_1(x), \theta_0(x))$, where $\theta_a(x) = F_{Y(a)|X}(y_a \mid x)$ and $C_x$ is a conditional copula that may depend on $x$.
For any $x$, this copula is bounded by the same Fréchet–Hoeffding limits,
$$ \max\{\theta_1(x) + \theta_0(x) - 1, 0\} \le F_{Y(1),Y(0)|X}(y_1, y_0 \mid x) \le \min\{\theta_1(x), \theta_0(x)\}. $$
Integrating over $X$ (by the law of total expectation) gives the unconditional bounds
$$ L(y_1, y_0) \le F_{Y(1),Y(0)}(y_1, y_0) \le U(y_1, y_0), $$
where
$$ L(y_1, y_0) := E_X\big[\max\{\theta_1(X) + \theta_0(X) - 1, 0\}\big], \qquad (1) $$
$$ U(y_1, y_0) := E_X\big[\min\{\theta_1(X), \theta_0(X)\}\big]. \qquad (2) $$
In particular, the conditional upper bound $M(\theta_1(x), \theta_0(x)) = \min\{\theta_1(x), \theta_0(x)\}$ is achieved under the conditional rank-preserving (or conditional comonotonicity) assumption. This assumption states that for each $X = x$, $Y(1)$ and $Y(0)$ are comonotone functions of the same latent rank variable $U \sim \mathrm{Unif}[0,1]$, such that $Y(a) = F^{-1}_{Y(a)|X}(U \mid X)$. The integrated upper bound $U(y_1, y_0)$ thus corresponds to assuming that this conditional rank preservation holds across all $x$.

Theorem 2. Consider bounds on $F_{Y(1),Y(0)}(y_1, y_0)$.
1. The covariate-conditional bounds are derived by first applying the Fréchet–Hoeffding bounds conditional on $X$ and then integrating over $X$:
$$ L(y_1, y_0) := E_X\big[\max\{\theta_1(X) + \theta_0(X) - 1, 0\}\big], \qquad U(y_1, y_0) := E_X\big[\min\{\theta_1(X), \theta_0(X)\}\big]. $$
2. The marginal bounds are derived by applying the Fréchet–Hoeffding bounds directly to the marginal distributions $F_{Y(1)}(y_1)$ and $F_{Y(0)}(y_0)$:
$$ L_{\mathrm{marg}}(y_1, y_0) := \max\{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) - 1, 0\}, \qquad U_{\mathrm{marg}}(y_1, y_0) := \min\{F_{Y(1)}(y_1), F_{Y(0)}(y_0)\}. $$
The bounds on the joint cumulative distribution function (CDF) $F_{Y(1),Y(0)}(y_1, y_0)$ derived from covariate-conditional distributions are always tighter than, or equal to, the bounds derived from the marginal distributions, i.e.,
$$ L_{\mathrm{marg}}(y_1, y_0) \le L(y_1, y_0) \quad \text{and} \quad U(y_1, y_0) \le U_{\mathrm{marg}}(y_1, y_0), $$
which implies that $[L(y_1, y_0), U(y_1, y_0)] \subseteq [L_{\mathrm{marg}}(y_1, y_0), U_{\mathrm{marg}}(y_1, y_0)]$.

The proof relies on a direct application of Jensen's inequality (see Appendix A): since $\min$ is concave and $\max\{\cdot + \cdot - 1, 0\}$ is convex, $E_X[\min\{\theta_1(X), \theta_0(X)\}] \le \min\{E[\theta_1(X)], E[\theta_0(X)]\} = U_{\mathrm{marg}}(y_1, y_0)$, and analogously $L(y_1, y_0) \ge L_{\mathrm{marg}}(y_1, y_0)$. Equality holds if and only if the conditional CDFs $\theta_a(X)$ are constant almost surely with respect to $X$, implying that the covariates $X$ provide no additional information beyond the marginals.

2.2 Estimation and Inference

In what follows we present estimation and inference for the functional of interest
$$ \Psi(P) := E_{X \sim P_X}\big[\phi\big(\theta_0(X), \theta_1(X)\big)\big], \qquad (3) $$
where $\phi_U(x, y) = \min\{x, y\}$ and $\phi_L(x, y) = \max\{x + y - 1, 0\}$ correspond to the upper and lower bound, respectively. We mainly focus on the upper bound $\Psi(P) = U(y_1, y_0)$, since it corresponds to the rank-preserving joint distribution when conditioning on all covariates, which is plausible and meaningful in many real counterfactual reasoning problems. Results for $L(y_1, y_0)$ can be obtained analogously. We use standard notation: $P_n$ is the empirical measure, $\|\cdot\|_{L_2(P)}$ denotes the $L_2(P)$-norm, $\|\cdot\|_{P,\infty}$ the essential sup norm, and $\stackrel{p}{\to}$ convergence in probability. Let $\hat\theta_a(X)$ denote estimators of the nuisance functions $\theta_a(X)$, obtained for example via conditional CDF regression or flexible machine learning methods with cross-fitting. A natural choice is the plug-in estimator $\hat\Psi_{\mathrm{plug\text{-}in}} = P_n\,\phi\{\hat\theta_1(X_i), \hat\theta_0(X_i)\}$, but its accuracy is limited by the estimation error of the nuisance functions $\theta_a(x)$, which generally precludes $\sqrt{n}$-inference.
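To fix ideas, a minimal plug-in sketch is given below. It assumes, purely for illustration, that $\theta_a(x)$ is fit by a gradient-boosted classifier of the event $1\{Y \le y_a\}$ within each treatment arm; the function name and model choice are ours and not prescribed here, and the doubly robust correction discussed next is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def plugin_fh_bounds(Y, A, X, y1, y0):
    """Plug-in estimates of the marginal and covariate-conditional
    Frechet-Hoeffding bounds on P(Y(1) <= y1, Y(0) <= y0).

    theta_a(x) = P(Y <= y_a | A = a, X = x) is fit by classifying the
    binary event 1{Y <= y_a} within each arm (a sketch; any conditional
    CDF learner could be substituted, and both classes must be present)."""
    theta = {}
    for a, ya in [(1, y1), (0, y0)]:
        arm = (A == a)
        clf = GradientBoostingClassifier().fit(X[arm], (Y[arm] <= ya).astype(int))
        theta[a] = clf.predict_proba(X)[:, 1]     # evaluated on the full sample

    # Covariate-conditional bounds, averaged over the empirical law of X
    L_cond = np.mean(np.maximum(theta[1] + theta[0] - 1.0, 0.0))
    U_cond = np.mean(np.minimum(theta[1], theta[0]))

    # Marginal (Makarov) bounds built from F_{Y(a)}(y_a) = E[theta_a(X)]
    F1, F0 = theta[1].mean(), theta[0].mean()
    L_marg = max(F1 + F0 - 1.0, 0.0)
    U_marg = min(F1, F0)
    return (L_marg, U_marg), (L_cond, U_cond)
```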
Instead, we consider the doubly robust, augmented inverse-propensity-weighted (AIPW) form. We first give explicit influence-function components for the conditional CDFs. When $\phi$ is non-smooth (min or max), the partial derivatives do not exist on the boundary set, so we present (A) the direct (non-smooth) route under a polynomial margin condition, and (B) the smooth approximation route (log-sum-exp) to handle the issue Levis et al. (2025).

2.2.1 Direct Estimation under a Margin Condition

We want to estimate $\Psi(P) = E[\min\{\theta_0(X), \theta_1(X)\}] = E[\theta_{d(X)}(X)]$. Write $\pi_a(x) = P(A = a \mid X = x)$. Define the unknown "oracle" selector
$$ d(x) = \arg\min_{a \in \{0,1\}} \theta_a(x), \qquad \theta_a(x) = P(Y \le y \mid A = a, X = x), $$
and replace it by the plug-in selector $\hat d(x) = \arg\min_a \hat\theta_a(x)$. To analyze the asymptotic behavior of this hybrid estimator, we introduce the following polynomial margin condition.

Assumption 2.1 (Polynomial margin). There exist constants $\alpha > 0$ and $C < \infty$ such that for all $t > 0$,
$$ P\big(|\theta_1(X) - \theta_0(X)| \le t\big) \le C\, t^{\alpha}. $$

The parameter $\alpha$ characterizes the degree of separation between the two potential outcome distributions at the unit level. Geometrically, it quantifies the probability mass of the covariate space where the two conditional CDFs, $\theta_1(X)$ and $\theta_0(X)$, are nearly identical. The assumption controls the mass of $X$ near the "tie" or "ambiguity" region $\{x : \theta_0(x) = \theta_1(x)\}$. A large $\alpha$ implies a strong margin, where the population is clearly partitioned into regions in which one potential outcome is strictly more likely to be below the threshold than the other. On the contrary, a small $\alpha$ signifies a weak margin, indicating a high density of individuals whose conditional ranks are indistinguishable. This creates a noise-sensitive boundary where small estimation errors in $\hat\theta_a$ can lead to frequent mis-selection in $\hat d(x)$, which, as we will see, results in a more stringent sup-norm convergence rate requirement to control the bias.

Theorem 3 (Asymptotic properties of the margin-based direct estimator). Suppose Assumption 2.1 holds, the boundedness condition that there exist constants $0 < \underline{c} < \bar{c} < \infty$ with $\underline{c} < \hat\pi_a(x) < \bar{c}$ almost surely is satisfied, and cross-fitting is used (or Donsker conditions hold). Then the estimator
$$ \hat\Psi = P_n\big[\varphi(O; \hat P, \hat d)\big] = P_n\Big[\sum_{a=0}^{1} 1\{\hat d(X) = a\}\Big(\hat\theta_a(X) + \frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big)\Big)\Big] $$
• (consistency) is consistent, $|\hat\Psi - \Psi| = o_p(1)$, when the nuisance estimators satisfy
$$ \max_a \|\hat\theta_a - \theta_a\|_\infty = o_p(1), \qquad \|\hat\theta_a - \theta_a\|_{L_2(P)}\, \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(1), \quad a = 0, 1; $$
• (root-$n$ consistency and asymptotic normality) satisfies
$$ \hat\Psi - \Psi = (P_n - P)\,\varphi(O; P, d) + o_p(n^{-1/2}), \qquad \text{hence} \quad \sqrt{n}(\hat\Psi - \Psi) \stackrel{d}{\to} N\big(0, \mathrm{Var}[\varphi(O; P, d)]\big), $$
when the nuisance estimators satisfy
$$ \max_a \|\hat\theta_a - \theta_a\|_\infty = o_p\big(n^{-1/(2(1+\alpha))}\big), \qquad \|\hat\theta_a - \theta_a\|_{L_2(P)}\, \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/2}), \quad a = 0, 1. $$
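A minimal sketch of this direct estimator is shown below; `theta_hat` and `pi_hat` stand for cross-fitted nuisance estimators supplied by the user (hypothetical interfaces), and the reported standard error is the plug-in variance of the estimated influence-function values.

```python
import numpy as np

def direct_fh_upper(Y, A, X, y_thresh, theta_hat, pi_hat):
    """Direct (non-smooth) estimator of U = E[min{theta_0(X), theta_1(X)}]
    with a plug-in selector d_hat(x) = argmin_a theta_hat_a(x).

    theta_hat[a], pi_hat[a]: callables returning cross-fitted estimates of
    theta_a(x) = P(Y <= y | A = a, X = x) and pi_a(x) = P(A = a | X = x)."""
    th0, th1 = theta_hat[0](X), theta_hat[1](X)
    d_hat = (th1 < th0).astype(int)           # selected arm (1 if theta_1 smaller)
    psi_terms = np.zeros(len(Y))
    for a, th_a in [(0, th0), (1, th1)]:
        selected = (d_hat == a)
        ipw = (A == a).astype(float) / np.clip(pi_hat[a](X), 1e-3, None)
        correction = ipw * ((Y <= y_thresh).astype(float) - th_a)
        psi_terms += selected * (th_a + correction)
    est = psi_terms.mean()
    se = psi_terms.std(ddof=1) / np.sqrt(len(Y))   # plug-in influence-function SE
    return est, se
```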
The rate conditions can be established by following the standard semiparametric argument, i.e., the von Mises decomposition
$$ \hat\Psi - \Psi(P) = P_n\big[\varphi(O; \hat P, \hat d)\big] - P\big[\varphi(O; P, d)\big] \qquad (4) $$
$$ = (P_n - P)\,\varphi(O; P, d) + (P_n - P)\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] + P\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] = S + R_1 + R_2, $$
where $S$ is the standard CLT term, $R_1$ is the empirical process term and can be easily bounded when sample splitting is used, and the bias $R_2$ can be handled by further separating the errors in the nuisance functionals and the selector,
$$ P\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] = \underbrace{P\big[\varphi(O; \hat P, d) - \varphi(O; P, d)\big]}_{\text{nuisance error } B_{\mathrm{nuis}}} + \underbrace{P\big[\varphi(O; \hat P, \hat d) - \varphi(O; \hat P, d)\big]}_{\text{selector error } B_{\mathrm{sel}}}. $$
The detailed proof is given in Appendix B.2.

Remark 1. The selector-rate requirement $\|\hat\theta - \theta\|_\infty = o_p(n^{-1/(2(1+\alpha))})$ is the main extra price paid by the direct method: when we plug in the estimated selector, we need uniform error control. The parameter $\alpha$ quantifies the density of the population near the "tie" region where $\theta_1(X) = \theta_0(X)$, representing the degree of separation between conditional potential outcome ranks. A larger $\alpha$ indicates a stronger margin with fewer ambiguous cases, which enhances the stability of the plug-in selector $\hat d(x)$ and yields faster convergence rates for the non-smooth direct estimator. In applications one typically implements cross-fitting and uses modern ML estimators for $\theta_a$ and $\pi_a$; verifying the sup-norm condition may require specially tailored estimators (series/sieve estimators with tuned complexity or uniformly consistent kernel estimators).

2.2.2 Smooth Approximation via Log-Sum-Exp

In the direct method, the non-differentiability of $\phi(u, v) = \min(u, v)$ creates boundary complications. As an alternative, we approximate $\min$ by a smooth function based on the log-sum-exp operator,
$$ g_t(u, v) := -\frac{1}{t}\log\big(e^{-tu} + e^{-tv}\big), \qquad (u, v) \in [0, 1]^2, $$
where $g_t(u, v)$ is a smooth function satisfying $\min(u, v) - \frac{\log 2}{t} \le g_t(u, v) \le \min(u, v)$ and $\lim_{t \to \infty} g_t(u, v) = \min(u, v)$. Then $\Psi(P)$ can be approximated via $\Psi_t(P) = E\big[g_t(\theta_0(X), \theta_1(X))\big]$, which is continuously Gateaux-differentiable in $P$ for any finite $t$, and $\Psi_t(P) \uparrow \Psi(P)$ as $t \to \infty$.

Theorem 4 (Properties of the smooth log-sum-exp estimator). Let $\Psi_t = E[g_t(\theta_0(X), \theta_1(X))]$ with $g_t(u, v) = -t^{-1}\log(e^{-tu} + e^{-tv})$ be the smooth approximation of $\Psi$, and let $\hat\Psi_t$ be the estimator using nuisance estimators $\hat\theta_a$ and $\hat\pi_a$, $a = 0, 1$, obtained on an independent sample (sample splitting),
$$ \hat\Psi_t = P_n\big[\varphi_t(O; \hat P)\big] = P_n\Big[\sum_{a=0}^{1} \hat w_{a,t}(X)\,\frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big) + g_t\big(\hat\theta_0(X), \hat\theta_1(X)\big)\Big], $$
where
$$ \hat w_{a,t}(X) = \frac{e^{-t\hat\theta_a(X)}}{e^{-t\hat\theta_0(X)} + e^{-t\hat\theta_1(X)}}. $$
Suppose the boundedness assumption holds: there exist constants $0 < \underline{c} < \bar{c} < \infty$ such that $\underline{c} < \hat\pi_a(x) < \bar{c}$ almost surely. Then for a fixed $t$, the estimator $\hat\Psi_t$ satisfies the following properties:
• (consistency) Under the rate conditions on the nuisance estimators
$$ \|\hat\theta_a - \theta_a\|_{L_2(P)} \cdot \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(1), \qquad \|\hat\theta_a - \theta_a\|^2_{L_2(P)} = o_p(1), \quad a = 0, 1, $$
$\hat\Psi_t$ is consistent for $\Psi_t(P)$.
• (root-$n$ consistency and asymptotic normality) Under the product rate conditions on the nuisance estimators
$$ \|\hat\theta_a - \theta_a\|_{L_2(P)} \cdot \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/2}), \qquad \|\hat\theta_a - \theta_a\|^2_{L_2(P)} = o_p(n^{-1/2}), \quad a = 0, 1, $$
i.e., $\|\hat\theta_a - \theta_a\|_{L_2(P)} = o_p(n^{-1/4})$ and $\|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/4})$, the estimator $\hat\Psi_t$ admits the linear expansion
$$ \hat\Psi_t - \Psi_t(P) = (P_n - P)\,\varphi_t(O; P) + o_p(n^{-1/2}), $$
and therefore
$$ \sqrt{n}\big(\hat\Psi_t - \Psi_t(P)\big) \stackrel{d}{\to} N\big(0, \mathrm{Var}(\varphi_t(O; P))\big), $$
so that the estimator converges at the $\sqrt{n}$ rate, $\sqrt{n}(\hat\Psi_t - \Psi_t(P)) = O_p(1)$.

Similarly, the asymptotic normality is established via the von Mises decomposition
$$ \hat\Psi_t - \Psi_t(P) = \underbrace{(P_n - P)\,\varphi_t(O; P)}_{S} + \underbrace{P\big[\varphi_t(O; \hat P) - \varphi_t(O; P)\big]}_{R_{\mathrm{nuis}}} + \underbrace{(P_n - P)\big[\varphi_t(O; \hat P) - \varphi_t(O; P)\big]}_{R_n}, \qquad (5) $$
where $S$ is the standard CLT term, $R_n$ is the empirical process term, and $R_{\mathrm{nuis}}$ is the remaining bias. The detailed proof is given in Appendix B.3.

Note that the above asymptotic properties of the smooth-approximation estimator $\hat\Psi_t$ are established for a fixed smoothing parameter $t$. We are also interested in how well the smooth estimator $\hat\Psi_t$ approximates the true parameter $\Psi(P)$, i.e., in its behavior as $t$ goes to infinity. The smooth estimator gets closer and closer to the truth, but we pay a price for it.

Remark 2 (Bias-variance trade-off in the limit $t \to \infty$). To achieve the semiparametric efficiency bound, we decompose
$$ \sqrt{n}\big(\hat\Psi_t - \Psi(P)\big) = \underbrace{\sqrt{n}\big(\hat\Psi_t - \Psi_t(P)\big)}_{\text{statistical error}} + \underbrace{\sqrt{n}\big(\Psi_t(P) - \Psi(P)\big)}_{\text{approximation bias}}. $$
The term $\sqrt{n}(\Psi(P) - \Psi_t(P))$ is the smoothing bias and is bounded by $O(\sqrt{n}/t)$; to make it vanish we need $\sqrt{n}/t \to 0$, i.e., $t = \omega(\sqrt{n})$. The term $\sqrt{n}(\hat\Psi_t - \Psi_t(P))$ is the statistical error, converging to $N(0, \mathrm{Var}(\varphi_t(O; P)))$ for a fixed $t$. When $t$ is allowed to grow,
$$ \sqrt{n}\,(B_{\mathrm{nuis}} + R_n) \approx \sqrt{n}\cdot O_p\big(t_n \|\hat\theta - \theta\|_2^2\big) + \sqrt{n}\cdot O_p\big(n^{-1/2}\, t_n \|\hat\theta - \theta\|_2\big) = O_p\big(\sqrt{n}\, t_n \|\hat\theta - \theta\|_2^2\big) + O_p\big(t_n \|\hat\theta - \theta\|_2\big). $$
To make $\Psi_t(P) \uparrow \Psi(P)$ and $\varphi_t(O; P) \to \varphi_{\mathrm{oracle}}(O; P)$ in $L_2(P)$, we need $t_n \cdot \|\hat\theta - \theta\|_{L_2(P)} = o_p(1)$ and $\sqrt{n}\, t_n \|\hat\theta - \theta\|^2_{L_2(P)} = o_p(1)$, which requires $\|\hat\theta - \theta\|_{L_2(P)} = o_p(n^{-1/2})$ rather than $o_p(n^{-1/4})$ when $t = \omega(\sqrt{n})$. This illustrates the bias-variance trade-off.

Remark 3 (Lower bound). For the lower bound $\phi(\theta_0, \theta_1) = \max(\theta_0(X) + \theta_1(X) - 1, 0)$, similar logic applies. We can approximate it via
$$ g_t(\theta_0, \theta_1) = \frac{1}{t}\log\big(1 + e^{t[\theta_0(X) + \theta_1(X) - 1]}\big), $$
and the corresponding IF-based estimator is
$$ \hat\Psi^L_t = P_n\Big[\sum_{a=0}^{1} \hat w_t(X)\,\frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big) + g_t\big(\hat\theta_0(X), \hat\theta_1(X)\big)\Big], $$
where $\hat w_t(X) = \sigma(t\hat S) = \frac{e^{t(\hat\theta_0(X) + \hat\theta_1(X) - 1)}}{1 + e^{t(\hat\theta_0(X) + \hat\theta_1(X) - 1)}}$ and $\hat S = \hat\theta_0(X) + \hat\theta_1(X) - 1$.

Beyond addressing non-differentiability, the smooth approximation offers a crucial inferential advantage. When the true parameter is exactly on the boundary, classical normal approximations and standard bootstrap procedures are known to be inconsistent Andrews (2000), while $g_t(\cdot)$ pulls the estimand strictly into the interior of the parameter space, ensuring the asymptotic validity of Wald-type confidence intervals.
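The log-sum-exp estimator of Theorem 4 translates almost directly into code. The sketch below uses a numerically stable softmin; as before, `theta_hat` and `pi_hat` are assumed to be cross-fitted nuisance estimators (hypothetical interfaces), and the Wald interval targets $\Psi_t(P)$ for the fixed $t$ supplied.

```python
import numpy as np

def smooth_fh_upper(Y, A, X, y_thresh, theta_hat, pi_hat, t=50.0):
    """Smooth log-sum-exp estimator of the conditional FH upper bound for a
    fixed smoothing parameter t, with softmin weights w_{a,t}(X)."""
    th = np.stack([theta_hat[0](X), theta_hat[1](X)], axis=0)   # shape (2, n)

    # Numerically stable softmin weights and log-sum-exp value:
    # g_t = -log(sum_a exp(-t*th_a)) / t = m - log(sum_a exp(-t*(th_a - m))) / t
    m = th.min(axis=0)
    expo = np.exp(-t * (th - m))
    w = expo / expo.sum(axis=0)                                 # w_{a,t}(X)
    g_t = m - np.log(expo.sum(axis=0)) / t

    psi_terms = g_t.copy()
    for a in (0, 1):
        ipw = (A == a).astype(float) / np.clip(pi_hat[a](X), 1e-3, None)
        psi_terms += w[a] * ipw * ((Y <= y_thresh).astype(float) - th[a])

    est = psi_terms.mean()
    se = psi_terms.std(ddof=1) / np.sqrt(len(Y))
    return est, (est - 1.96 * se, est + 1.96 * se)              # Wald CI for Psi_t
```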
3 Counterfactual marginals with unmeasured confounding: Triple Machine Learning via a Representation-Learning-Based IV Approach

We now move beyond the simple case with no unobserved confounding to the more complex scenario where unobserved confounding is present. In this setting, even the marginal distributions of the potential outcomes are typically non-identifiable. Furthermore, even the introduction of auxiliary variables, such as IVs in two-stage least squares (2SLS) regressions, either requires (strong) parametric assumptions like linear structural equation models (SEMs) to identify the ATE, or can only identify the local average treatment effect (LATE) for compliers in a nonparametric model under additional assumptions such as monotonicity and binary treatments Angrist et al. (1996), Swanson et al. (2018). Nonparametric IV methods have gained significant attention in recent years Newey (2013), Levis et al. (2024), but they have focused more on the identification and estimation of average treatment effects rather than the full counterfactual distribution.

The challenge of unmeasured confounding has traditionally been viewed as an identification bottleneck that can only be breached by rigid functional forms. However, the emerging field of Causal Representation Learning (CRL) Schölkopf et al. (2021), Moran & Aragam (2026) offers an alternative perspective by framing the recovery of hidden confounders as an identifiable latent variable modeling problem. Unlike standard representation learning, which focuses on extracting features for optimal prediction, CRL aims to identify the underlying causal variables and their structural relations from high-dimensional observations through structural constraints given by auxiliary information such as exogenous instruments. This shift from treating confounders as unreachable nuisances to tractable latent variables provides the conceptual inspiration for our framework: using the exogenous instrument as the key identifying constraint to recover the latent confounding subspace.

In this section, we propose a novel method that leverages IVs and representation learning to construct a latent representation of the unmeasured confounding, followed by nonparametric identification and semiparametric estimation of causal functionals. We first focus on the identification and inference of the marginals or average effects given the latent structure. This approach can be conveniently integrated with the bounding methods discussed in Section 2, which utilized conditional copulas to transition from marginal to joint distributions, thereby enabling inference on the entire joint distribution and individual-level effects; this will be discussed in Section 4.

Specifically, we observe $(S, A, Y)$ with exogenous instrument $S$, treatment $A$, and outcome $Y$. We suspect unobserved confounders $Z_C$ such that $A \leftarrow Z_C \rightarrow Y$. Our goal is to estimate counterfactual marginals and Average Treatment Effects (ATE). We aim to use a representation-learning-based approach that uses $S$ to help identify a representation $\hat Z_C$ of the unobserved $Z_C$ and then identify the ATE (Fig. 1).

3.1 Identification

To identify the marginal distribution of potential outcomes in the presence of unobserved confounding, we exploit recent developments in identifiable CRL with exogenous auxiliary variables Kong et al. (2022), Ng, Xie, Dong, Spirtes & Zhang (2025), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025).
Specifically, we introduce an exogenous instrument $S$ to induce a decomposition of the latent representation of the observed data into components driven by the exogenous variation in $S$ ($Z_S$), and components that capture the variation associated with the unobserved confounding ($Z_C$).

Figure 1: Causal and latent structure underlying the IV-based representation-learning model with exogenous noises. $\varepsilon_C$ and $\varepsilon_S$ are the independent, exogenous noise sources. $Z_C$ (pure confounder) is defined by $\varepsilon_C$. $Z_S$ (IV-related latents) is defined by $Z_C$, $S$, and $\varepsilon_S$. The absence of direct edges $S \to Y$ and $Z_S \to Y$ satisfies the exclusion restriction.

Since the confounders are not observed, $Z_C$ and $Z_S$ cannot simply be assumed to be conditionally independent. Nevertheless, under a sufficient-variability condition on $p(A \mid S)$, $Z_C$ remains identifiable. In practice, since both $S$ and the treatment $A$ are observed, it is unnecessary to estimate $Z_S$ explicitly. Instead, we adopt a reduced-form construction that directly targets $Z_C$, thereby alleviating the additional randomness that would arise from simultaneously estimating multiple latent components (Section 3.3).

Note that, given the dimensionality of the observed variables (which constrains the latent dimensionality), $Z_C$ should not be interpreted as a direct representation of the high-dimensional confounders themselves. Rather, it serves as a low-dimensional sufficient score summarizing the aggregate influence of the unobserved confounding. We have the following identifiability results.

Theorem 5 (Identification). Let the observed data be i.i.d. samples of $(S, A, Y)$, where $S$ is an instrument, $A$ is the treatment, and $Y$ is the observed outcome. Assume the data-generating process admits latent variables $Z = (Z_C, Z_S)$ and exogenous noises $\varepsilon_C, \varepsilon_S$ satisfying the following assumptions.

Assumption 3.1 (Causal structural equations). The latent variables are generated as $Z_C = f(\varepsilon_C)$ and $Z_S = h(Z_C, S, \varepsilon_S)$. The observed variables $(A, Y)$ are generated via
$$ A = g_A(Z_S), \qquad Y = g_Y(A, Z_C), $$
where
• $g_A$ is an arbitrary measurable function (allowing for discrete/binary treatments);
• $g_Y(a, \cdot)$ is injective with respect to $Z_C$ for each observed treatment level $a \in \mathcal{A}$.
Consequently, the joint mapping $(Z_C, Z_S) \mapsto (A, Y)$ allows the unique recovery of $Z_C$ from the observables $(A, Y)$ (i.e., $Z_C$ is observable-measurable). All exogenous noises are mutually independent and $Z_C \perp\!\!\!\perp S$.

Assumption 3.2 (IV conditions). The variable $S$ is a valid instrument satisfying
• Exclusion restriction: $S$ does not directly affect $Y$, i.e., $Y = g_Y(A, Z_C)$ does not depend on $S$; in potential outcome notation, $Y(s, a) = Y(a)$.
• Unconfoundedness: $S \perp\!\!\!\perp (\varepsilon_C, \varepsilon_S)$, or equivalently $S \perp\!\!\!\perp (A(s), Y(s))$.

Assumption 3.3 (Sufficient variability of the IV). The instrument $S$ induces sufficient variation in the latent variable $Z_S$ conditional on $Z_C$. Formally, for almost all $z_c$, the family of conditional distributions
$$ \mathcal{P}_{z_c} = \{P(Z_S \in \cdot \mid Z_C = z_c, S = s) : s \in \mathcal{S}\} $$
is complete.
In terms of sets, this implies that for any measurable set $B \subseteq \mathcal{Z}_S$ with positive measure ($P(B) > 0$), the probability mass assigned to this set varies with $S$:
$$ s \mapsto P(Z_S \in B \mid Z_C = z_c, S = s) \quad \text{is not a constant function a.s.} $$

Assumption 3.4 (Positivity and regularity). For almost all $z_C$ in the support of $Z_C$, $0 < P(A = a \mid Z_C = z_C) < 1$, and the required conditional expectations exist and are finite.

Then the confounding subspace $Z_C$ is subspace-identifiable. In other words, if there exists a learned representation $\hat Z_C$ that satisfies the independence constraint $\hat Z_C \perp\!\!\!\perp S$ and predictive sufficiency $Y \perp\!\!\!\perp Z_C \mid (A, \hat Z_C)$, or $p(Y \mid A, \hat Z_C, Z_C) = p(Y \mid A, \hat Z_C)$ almost surely, then there exists an invertible map $\psi$ such that $\hat Z_C = \psi(Z_C)$ almost surely. Subsequently, the counterfactual marginals and the ATE are nonparametrically identifiable:
$$ E[Y(a)] = E\big[E[Y \mid A = a, \hat Z_C]\big], \qquad \mathrm{ATE}(a, a') = E\big[E[Y \mid A = a, \hat Z_C] - E[Y \mid A = a', \hat Z_C]\big]. $$

Remark 4 (Completeness vs. variability). Assumption 3.3 is formally equivalent to the bounded completeness condition often used in econometrics. We adopt the terminology of sufficient variability (Kong et al. 2022, Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan 2025) to highlight the geometric intuition: the instrument $S$ must create diverse distributional shifts in $Z_S$ such that no subset of the latent space remains "invariant" to $S$. This property is crucial in Step 1 of the proof to rule out representations that entangle $Z_S$.

Remark 5 (On the injectivity assumption). We emphasize that the latent variables $Z_C$ and $Z_S$ should not be interpreted as the high-dimensional raw state of the world (e.g., all genetic factors). Instead, following the philosophy of sufficient dimension reduction or control functions, they represent the low-dimensional (or scalar) projections or scores of these factors (so that we can let $\dim(Z_C) \le \dim(Y)$) that actively drive the variation in $A$ and $Y$. Under this interpretation, the injectivity assumption in Assumption 3.1 implies that the causal mechanism relies on a low-dimensional bottleneck, which is a standard structural assumption in representation learning Kong et al. (2022), Ng, Xie, Dong, Spirtes & Zhang (2025), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025).

Remark 6 (Identifiability of counterfactual marginals). Theorem 5 implies the identifiability of the marginal distribution of the potential outcome $Y(a)$, i.e.,
$$ P(Y(a) \le y) = E\big[P(Y \le y \mid A = a, \hat Z_C)\big]. $$

The proof of Theorem 5 consists of two main stages. First, based on the injectivity and completeness assumptions, we establish the identifiability of the confounding subspace $Z_C$ up to an invertible transformation (Step 1). Second, we utilize this identified representation to identify the marginal counterfactual distributions $F_{Y(a)}(y)$ and, subsequently, the Average Treatment Effect (ATE) via the back-door adjustment formula (Steps 2-4). The detailed proof is given in Appendix C.

3.2 Estimation and Inference

Generally speaking, we can estimate the average potential outcome $\psi(a) = E[Y(a)]$ using the same approaches as under observed confounding, but replacing the observed confounders $Z_C$ with the confounding representation $\hat Z_C$ learned from a separate dataset. We quickly review the standard approaches as follows.
3.2.1 Standard estimators with observed confounding

For binary treatment $A \in \{0, 1\}$, the outcome regression (OR) estimator is $\hat\psi_{\mathrm{OR}}(a) = \frac{1}{n}\sum_{i=1}^{n} \hat\mu(a, Z_{C,i})$, where $\hat\mu(a, z)$ estimates the outcome model $E[Y \mid A = a, Z_C = z]$. The inverse propensity weighting (IPW) estimator is $\hat\psi_{\mathrm{IPW}}(a) = \frac{1}{n}\sum_{i=1}^{n} \frac{1(A_i = a)}{\hat\pi(a \mid Z_{C,i})} Y_i$, where $\hat\pi(a \mid z)$ estimates the propensity score $P(A = a \mid Z_C = z)$. The doubly robust (DR) estimator combines both models as
$$ \hat\psi_{\mathrm{DR}}(a) = \frac{1}{n}\sum_{i=1}^{n}\Big[\hat\mu(a, Z_{C,i}) + \frac{1(A_i = a)}{\hat\pi(a \mid Z_{C,i})}\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\Big] $$
and enjoys double robustness (Bang & Robins 2005).

For continuous treatment $A \in \mathbb{R}$, the indicator $1(A_i = a)$ has zero probability, requiring density- or kernel-based adaptations. Let $\hat r(a \mid z) = \hat f(a \mid z)$ denote the estimated conditional density (generalized propensity score, GPS) and $K_h(\cdot)$ a kernel function with bandwidth $h$. The outcome regression estimator remains $\hat\psi_{\mathrm{OR}}(a) = \frac{1}{n}\sum_{i=1}^{n} \hat\mu(a, Z_{C,i})$. The GPS-IPW estimator (Hirano & Imbens 2004) is
$$ \hat\psi_{\mathrm{GPS}}(a) = \frac{\sum_{i=1}^{n} K_h(A_i - a)\, Y_i / \hat r(A_i \mid Z_{C,i})}{\sum_{i=1}^{n} K_h(A_i - a) / \hat r(A_i \mid Z_{C,i})}, $$
where $K_h(A_i - a)$ weights observations near $a$ and $1/\hat r(A_i \mid Z_{C,i})$ provides inverse density weighting. The density-estimation-based standard DR estimator (Kennedy et al. 2017) can be constructed as
$$ \hat\psi_{\mathrm{D}}(a) = \frac{1}{n}\sum_{i=1}^{n}\Big[\hat\mu(a, Z_{C,i}) + \frac{\hat r(a \mid Z_{C,i})}{\hat r(A_i \mid Z_{C,i})}\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\Big], $$
where the density ratio $\hat r(a \mid Z_{C,i}) / \hat r(A_i \mid Z_{C,i})$ replaces the indicator-based weight of the binary DR estimator, maintaining doubly robust consistency. The DR-kernel estimator employs kernel methods and simplifies this as
$$ \hat\psi_{\mathrm{K}}(a) = \frac{1}{n}\sum_{i=1}^{n}\big[\hat\mu(a, Z_{C,i}) + n \cdot w_i(a)\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\big], \qquad w_i(a) = \frac{K_h(A_i - a)}{\sum_{j=1}^{n} K_h(A_j - a)}, $$
where the normalized kernel weights $w_i(a)$ avoid explicit density estimation, making the estimator computationally efficient while maintaining approximate doubly robust properties (Flores et al. 2012, Fong et al. 2018). The GPS can be estimated using kernel density estimation (KDE) on residuals, $\hat r(a \mid Z) \approx \frac{1}{n}\sum_{j=1}^{n} K_h\big((a - \hat\mu(Z)) - r_j\big)$, where $r_j = A_j - \hat\mu(Z_j)$ are training residuals and $\hat\mu(Z) = E[A \mid Z]$ is the conditional mean of the treatment estimated via gradient boosting. This nonparametric approach accommodates non-Gaussian treatment distributions. For numerical stability, extreme weights can be trimmed. Bandwidth selection follows Silverman's rule $h = 1.06\, \hat\sigma_A\, n^{-1/5}$ Silverman (2018), where $\hat\sigma_A$ is the sample standard deviation of the treatment.

3.2.2 Triple Machine Learning estimator

We now resort to learning confounding representations when there are unmeasured confounders; subsequently, we can employ all of the standard estimators above by replacing the observed covariates with the learned confounding representations. This mimics the standard semiparametric / doubly robust / double machine learning approach Chernozhukov et al. (2018), Kennedy (2024), and we call it "Triple Machine Learning" (TML) since an additional fold is needed to learn the confounding representation first, before the standard double machine learning or doubly robust estimation. As an example, for binary treatments, we can construct an estimator $\hat E[Y(a)]$ using three-fold sample splitting or more complex cross-fitting. We summarize the procedure as follows (a code sketch is given after the list):
• Fold 1 is used to learn the representation $\hat Z_C = \hat\psi(A, Y)$.
• Fold 2 is used to estimate the nuisance parameters:
  – $\hat m_a(z) = \hat E[Y \mid A = a, \hat Z_C = z]$ in the outcome model;
  – $\hat\pi_a(z) = \hat P(A = a \mid \hat Z_C = z)$ in the treatment (propensity) model.
• Fold 3 is used to estimate the causal parameter
$$ \hat E[Y(a)] = P_n\Big[\hat m_a(\hat Z_C) + \frac{1\{A = a\}}{\hat\pi_a(\hat Z_C)}\big(Y_i - \hat m_a(\hat Z_C)\big)\Big]. $$
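A schematic implementation of the three folds for a binary treatment is sketched below. The representation learner is abstracted as a user-supplied callable (e.g., the IV-VAE of Section 3.3); gradient boosting for the nuisance models, the single non-rotated split, and all function names are illustrative assumptions rather than prescribed settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def tml_binary(Y, A, S, a, learn_representation, seed=0):
    """Triple machine learning sketch for E[Y(a)] with a binary treatment.

    learn_representation(Y, A, S) -> encoder, where encoder(Y, A, S) returns
    Z_hat, is a hypothetical interface for the Fold-1 representation learner."""
    rng = np.random.default_rng(seed)
    f1, f2, f3 = np.array_split(rng.permutation(len(Y)), 3)

    # Fold 1: learn the confounding representation Z_hat
    encoder = learn_representation(Y[f1], A[f1], S[f1])

    # Fold 2: nuisance models fitted on the learned representation
    Z2 = np.asarray(encoder(Y[f2], A[f2], S[f2])).reshape(len(f2), -1)
    m_hat = GradientBoostingRegressor().fit(np.column_stack([A[f2], Z2]), Y[f2])
    pi_hat = GradientBoostingClassifier().fit(Z2, A[f2])

    # Fold 3: doubly robust estimate of E[Y(a)]
    Z3 = np.asarray(encoder(Y[f3], A[f3], S[f3])).reshape(len(f3), -1)
    m_a = m_hat.predict(np.column_stack([np.full(len(f3), a), Z3]))
    col = list(pi_hat.classes_).index(a)
    p_a = np.clip(pi_hat.predict_proba(Z3)[:, col], 1e-3, 1.0)   # weight clipping
    phi = m_a + (A[f3] == a) / p_a * (Y[f3] - m_a)
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(f3))
```

In practice the roles of the folds would be rotated (cross-fitting), and the reported standard error would be corrected for the representation-learning contribution discussed in Theorem 6 below.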
The procedure easily generalizes to other distributional targets by modifying the outcome model to other functionals such as $\hat m_a(z, y) = \hat P(Y \le y \mid A = a, \hat Z_C = z)$. Moreover, similar logic applies to the continuous-treatment scenario with GPS, density-, or kernel-based estimators. In principle, the proposed TML framework is sufficiently general to accommodate a variety of estimators, including outcome regression, inverse probability weighting, and doubly robust estimators. We first establish the theoretical properties of the TML estimator under binary treatment assignment using a doubly robust construction (6); the results can be readily generalized to more complex continuous scenarios. We subsequently introduce a variational approach for effective confounding representation learning (Section 3.3) and assess the numerical performance of different estimators within the TML framework through simulation studies (Section 5).

Theorem 6 (Asymptotic properties of the doubly-robust-type representation-IV estimator). Let the identification assumptions of Theorem 5 hold. Let the estimator $\hat E[Y(a)]$ be constructed using $K$-fold cross-fitting, where $\hat Z_C$, $\hat m$, and $\hat\pi$ are estimated on separate folds. Let the boundedness and overlap assumption hold, i.e., there exist constants $0 < \underline{c} < \bar{c} < \infty$ such that $\underline{c} < \pi(a \mid z) < \bar{c}$ almost surely, and $E[Y^2] < \infty$. Then we have:

1. Consistency (double robustness). The estimator $\hat E[Y(a)]$ is consistent for $E[Y(a)]$, i.e., $\hat E[Y(a)] \stackrel{p}{\to} E[Y(a)]$, if the following conditions hold:
(C1a) The representation $\hat Z_C$ is consistent: $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = o_p(1)$.
(C1b) At least one of the nuisance estimators is consistent for the true function defined on $Z_C$, despite being trained on $\hat Z_C$ (i.e., it is EIV-robust and consistent):
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} = o_p(1) \quad \text{or} \quad \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(1). $$

2. Asymptotic normality. The estimator is $\sqrt{n}$-asymptotically normal with a corrected variance,
$$ \sqrt{n}\big(\hat E[Y(a)] - E[Y(a)]\big) \stackrel{d}{\to} N\big(0, V_{\mathrm{total}}(a)\big), \qquad V_{\mathrm{total}}(a) = \mathrm{Var}\big[\mathrm{IF}_a(O; \eta_0) + \mathrm{IF}_{\phi,\mathrm{rep}}(a)\big], $$
if the following hold:
(C2a) The nuisance functions (assumed EIV-robust) satisfy the double machine learning (DML) rate condition
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} \cdot \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(n^{-1/2}). $$
(C2b) The representation $\hat Z_C$ converges as $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = O_p(n^{-1/2})$.
In this case, the first-stage estimation error in $\hat Z_C$ contributes a first-order correction term $\mathrm{IF}_{\phi,\mathrm{rep}}(a)$ to the influence function.
3. Efficiency. The estimator is $\sqrt{n}$-asymptotically normal and achieves the ordinary (semiparametric) efficient variance lower bound,
$$ \sqrt{n}\big(\hat E[Y(a)] - E[Y(a)]\big) \stackrel{d}{\to} N(0, V_a), \qquad V_a = \mathrm{Var}\big[\mathrm{IF}_a(O; \eta_0)\big], $$
if the following hold:
(C3a) The nuisance functions (assumed EIV-robust) satisfy the DML rate condition
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} \cdot \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(n^{-1/2}). $$
(C3b) The representation $\hat Z_C$ converges at a "super-convergence" rate $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = o_p(n^{-1/2})$.
In this case, the first-stage estimation error is asymptotically negligible, and the correction term $\mathrm{IF}_{\phi,\mathrm{rep}}(a)$ disappears from the asymptotic expansion.

The proof relies on the von Mises decomposition
$$ \sqrt{n}(\hat\psi - \psi_0) = \sqrt{n}\big(P_{n,3}\varphi(\hat\eta) - P\varphi(\eta_0)\big) = \underbrace{\sqrt{n}(P_{n,3} - P)\varphi(\eta_0)}_{T_1:\ \text{oracle CLT}} + \underbrace{\sqrt{n}(P_{n,3} - P)\big(\varphi(\hat\eta) - \varphi(\eta_0)\big)}_{T_2:\ \text{empirical process}} + \underbrace{\sqrt{n}\,P\big(\varphi(\hat\eta) - \varphi(\eta_0)\big)}_{T_3:\ \text{bias term}}, \qquad (6) $$
where the bias consists of a nuisance estimation error and a representation learning error,
$$ T_3 = \underbrace{\sqrt{n}\,P\big(\varphi(\hat\eta) - \varphi(\tilde\eta)\big)}_{T_{3a}\ \text{(nuisance bias)}} + \underbrace{\sqrt{n}\,P\big(\varphi(\tilde\eta) - \varphi(\eta_0)\big)}_{T_{3b}\ \text{(representation bias)}}, $$
where the intermediate parameter $\tilde\eta = (m_0, \pi_0, \hat\phi)$ represents the ideal nuisance parameters given the learned representation $\hat\phi$. Note that $T_{3a}$ can be handled via the ordinary Neyman orthogonality technique, while $T_{3b}$ cannot, since $\psi$ is not orthogonal with respect to $Z_C$. Hence, letting $M(\phi) = E[\varphi(O; m_0, \pi_0, \phi)]$ be the expected score functional, we have
$$ T_{3b} = \sqrt{n}\big(M(\hat\phi) - M(\phi_0)\big) = \underbrace{\sqrt{n}\,\nabla_\phi M(\phi_0)[\hat\phi - \phi_0]}_{\text{linear term (I)}} + \underbrace{\sqrt{n}\,R(\hat\phi, \phi_0)}_{\text{remainder term (II)}}, $$
where the variance of the linear term is $\mathrm{Var}(\mathrm{IF}_{\phi,\mathrm{rep}}(O))$. In other words, the representation error accumulates in the subsequent causal estimation, resulting in inflated variance induced by $\mathrm{IF}_{\phi,\mathrm{rep}}(O) = E_{O'}[\nabla_\phi \varphi(O'; \eta_0)] \cdot \xi_\phi(O)$, where $\xi_\phi(O)$ is the influence function of the representation learner, even when the representation attains a parametric convergence rate. The detailed proof is given in Appendix D.

In practice, the explicit calculation of the variance inflation term $V_{\mathrm{rep}} = \mathrm{Var}(\mathrm{IF}_{\phi,\mathrm{rep}})$ presents a significant challenge, as the correction influence function $\mathrm{IF}_{\phi,\mathrm{rep}}$ depends on the infinite-dimensional sensitivity of the causal functional and the algorithmic response of the representation learner. Several empirical methods can be applied to facilitate robust inference. First, one may employ a Hessian-based numerical approximation leveraging recent advances in neural influence functions Koh & Liang (2017). By treating the encoder as a parametric model $\phi_\theta$, the term $\xi_\phi(O)$ can be approximated using the inverse Hessian-vector product (HVP). Specifically, $\widehat{\mathrm{IF}}_{\phi,\mathrm{rep}} \approx P_n[\nabla_\theta \varphi]^\top H_\theta^{-1} \nabla_\theta \mathcal{L}$, where the inverse Hessian $H_\theta^{-1}$ is efficiently computed via stochastic estimation (e.g., the LiSSA algorithm). This approach provides a first-order approximation of how small perturbations in the representation training set propagate to the final causal estimate.
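The sketch below illustrates only the HVP/LiSSA mechanics on a toy parametric "encoder"; mapping its output to $\widehat{\mathrm{IF}}_{\phi,\mathrm{rep}}$ requires the problem-specific gradient of the causal score with respect to the encoder parameters, so the model, loss, and hyperparameters here should all be read as assumptions for illustration.

```python
import torch

torch.manual_seed(0)
n, d = 256, 4
data_x = torch.randn(n, d)
data_y = data_x @ torch.tensor([1.0, -0.5, 0.2, 0.0]) + 0.1 * torch.randn(n)

encoder = torch.nn.Linear(d, 1)            # toy stand-in for the encoder phi_theta
params = list(encoder.parameters())

def train_loss():
    # Representation-learning loss L(theta); a placeholder squared error here.
    return torch.mean((encoder(data_x).squeeze(-1) - data_y) ** 2)

def hvp(vec):
    """Hessian-vector product H_theta @ vec via double backpropagation."""
    grads = torch.autograd.grad(train_loss(), params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params)
    return torch.cat([h.reshape(-1) for h in hv])

def inverse_hvp(v, damping=0.01, scale=25.0, iters=200):
    """LiSSA-style iterative approximation of H^{-1} v (assumed hyperparameters)."""
    est = v.clone()
    for _ in range(iters):
        est = v + (1.0 - damping) * est - hvp(est) / scale
    return est / scale

# Sensitivity of a (toy) downstream functional of the encoder to its parameters,
# propagated through the inverse Hessian, as in IF_hat ~ grad_phi^T H^{-1} grad_L.
score_grad = torch.cat([g.reshape(-1) for g in
                        torch.autograd.grad(encoder(data_x).mean(), params)])
correction_direction = inverse_hvp(score_grad.detach())
```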
Alternatively, a more computationally intensive but nonparametric approach is ensemble-based uncertainty quantification with the bootstrap or the infinitesimal jackknife (IJ), repeating the procedure over $B$ random sample splits Wager & Athey (2018); the excess variance observed across different representation-learning folds effectively captures the stochasticity encoded in $V_{\mathrm{rep}}$. While computationally demanding, this ensemble approach bypasses the need for second-order derivatives and is inherently more robust to the non-convex landscape of deep neural networks.

Remark 7 (The EIV-robustness assumption). The EIV-robustness underpinning all three scenarios requires the existence of estimators for $\hat m$ and $\hat\pi$ that can be trained on the error-laden proxy $\hat Z_C$ instead of the true $Z_C$ and yet still converge to the true functions $m_0$ and $\pi_0$ at the specified rates. This is a non-trivial requirement, as standard estimators in nonparametric errors-in-variables (EIV) settings often suffer from severe attenuation bias and degraded logarithmic convergence rates Fan & Truong (1993). Satisfying this condition implicitly requires the underlying true functional space to be sufficiently smooth, or necessitates the deployment of specialized deconvolution algorithms during the nuisance training phase to actively correct for the measurement error in $\hat Z_C$.

We summarize the observations as follows.

Remark 8. Within the TML framework, a DR-type estimator can be constructed, but its asymptotic properties critically depend on two factors:
1. The EIV-robustness assumption: we need $\hat m, \hat\pi$ to be trained consistently on the estimated representation $\hat Z_C$ instead of the oracle one.
2. The convergence rate of $\hat Z_C$: under the EIV-robustness assumption,
• if $\hat Z_C$ admits an $O_p(n^{-1/2})$ rate, the causal estimator has an inflated variance $V_{\mathrm{total}}$;
• if $\hat Z_C$ admits a super-smooth $o_p(n^{-1/2})$ rate, the causal estimator achieves the semiparametric efficient variance $V_a$; this is generally very hard for flexible nonparametric estimators such as ML methods.

3.3 IV-VAE: Learning the representation $Z_C$

The learning of $Z_C$ is a classic problem in disentangled representation learning. A natural way is to consider the following regularized $\beta$-VAE model:
$$ \mathcal{L} = -E_{q(Z|A,Y)}\big[\log p(Y \mid Z_C, A) + \log p(A \mid Z_S)\big] + \beta\, \mathrm{KL}\big(q(Z \mid A, Y)\,\|\, p(Z_S \mid S, Z_C)\, p(Z_C)\big) + \lambda\, E_q\big[\mathrm{HSIC}(Z_C, S) + \mathrm{HSIC}(Z_C, A \mid Z_S)\big], \qquad (7) $$
where the (conditional) dependence can be measured via (conditional) HSIC Fukumizu et al. (2004, 2007), Sheng & Sriperumbudur (2023). This involves the simultaneous optimization of two coupled objects, $\hat Z_C$ and $\hat Z_S$ (since there exists a link $Z_C \to Z_S$), and relies on the intrinsically difficult measurement and constraint of conditional independence Shah & Peters (2020), He et al. (2025), making stable learning tricky to achieve.

In fact, explicitly inferring $Z_S$ is computationally redundant for the purpose of confounder identification, given that the instrument $S$ is observed and exogenous. We therefore employ a reduced-form specification for the treatment decoder. By substituting the mechanism of $Z_S$ into the treatment assignment function, we approximate the composite function $A = g(h(S), Z_C) = \tilde g(S, Z_C)$. Consequently, we can introduce the IV-VAE, where the generative model (decoder) is defined as
$$ p_\theta(A \mid S, Z_C) = N\big(\mu_A(S, Z_C), \sigma^2_A\big), \qquad p_\theta(Y \mid A, Z_C) = N\big(\mu_Y(A, Z_C), \sigma^2_Y\big). $$
Conditioning directly on $S$ allows the decoder to capture the variation in $A$ induced by the instrument pathway without the need to explicitly model the intermediate latent variable $Z_S$, while $Z_C$ captures the remaining confounding variation. We introduce an inference network (VAE encoder) $q_\phi(Z_C \mid A, Y, S)$, which conditions on $S$ to facilitate the separation of instrument-induced variation from confounder-induced variation. To enforce the structural assumption that the recovered confounder is exogenous to the instrument, we utilize the HSIC Gretton et al. (2005) as a regularization term that explicitly penalizes statistical dependence between the learned latent space $\hat Z_C$ and the instrument $S$, ensuring that $\hat Z_C$ satisfies the properties of a valid confounder. The final optimization objective becomes
$$ \mathcal{L}(\theta, \phi) = \underbrace{-E_{q_\phi}\big[\log p_\theta(A \mid S, Z_C) + \log p_\theta(Y \mid A, Z_C)\big]}_{\text{reconstruction loss}} + \underbrace{\beta\, \mathrm{KL}\big(q_\phi(Z_C \mid \cdot)\,\|\, p(Z_C)\big)}_{\text{KL divergence}} + \underbrace{\lambda\, \mathrm{HSIC}(Z_C, S)}_{\text{independence constraint}}. \qquad (8) $$
This new objective focuses solely on the learning of $\hat Z_C$ and only necessitates constraining an unconditional independence, which significantly facilitates stable, efficient learning and optimization.
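A compact PyTorch sketch of this objective is given below. The architecture sizes, optimizer settings, fixed unit decoder variances, and the biased RBF-kernel HSIC estimator are illustrative assumptions; only the overall structure (encoder $q_\phi(Z_C \mid A, Y, S)$, decoders $p_\theta(A \mid S, Z_C)$ and $p_\theta(Y \mid A, Z_C)$, KL term, and $\mathrm{HSIC}(Z_C, S)$ penalty) follows Eq. (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rbf_gram(x, sigma=1.0):
    """RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return torch.exp(-torch.cdist(x, x).pow(2) / (2 * sigma ** 2))

def hsic(x, z, sigma=1.0):
    """Biased HSIC estimate between batches x and z, each of shape (n, dim)."""
    n = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(z, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

class IVVAE(nn.Module):
    """Sketch of the IV-VAE: encoder q(Z_C | A, Y, S), Gaussian decoders
    p(A | S, Z_C) and p(Y | A, Z_C), and an HSIC(Z_C, S) penalty."""
    def __init__(self, dim_z=1, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * dim_z))
        self.dec_a = nn.Sequential(nn.Linear(1 + dim_z, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.dec_y = nn.Sequential(nn.Linear(1 + dim_z, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def loss(self, A, Y, S, beta=1.0, lam=10.0):
        mu, logvar = self.enc(torch.cat([A, Y, S], dim=1)).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        rec_a = F.mse_loss(self.dec_a(torch.cat([S, z], dim=1)), A)  # Gaussian NLL up
        rec_y = F.mse_loss(self.dec_y(torch.cat([A, z], dim=1)), Y)  # to constants
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=1))
        return rec_a + rec_y + beta * kl + lam * hsic(z, S)

# Usage sketch on synthetic (n, 1) tensors
n = 512
S = torch.rand(n, 1) * 4 - 2
Zc = torch.randn(n, 1)
A = 0.5 * S + 0.2 * Zc + 0.1 * torch.randn(n, 1)
Y = 1 + 2 * A + 3 * Zc + 0.1 * torch.randn(n, 1)
model = IVVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    model.loss(A, Y, S).backward()
    opt.step()
```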
4 From ATE to ITE: Bounds on the counterfactual joint distribution in the presence of unobserved confounding

The identification result in Theorem 5 guarantees the identifiability of the conditional and marginal distributions of potential outcomes, $F_{Y(a)|\hat Z_C}(y)$ and $F_{Y(a)}(y)$. We can then follow the results in Section 2, using (conditional) copulas and the FH inequality to further bound their (conditional) joint distributions.

We now work with the estimated representation $\hat Z_C$ instead of observed covariates $X$, and hence pay an extra price for the representation learning error. The estimation quality will therefore be jointly affected by: 1) the convergence rate of learning $\hat Z_C$; 2) the EIV robustness of the nuisance estimators $\hat m$ and $\hat\theta$; 3) whether the margin condition holds for the direct estimator, or the bias-variance trade-off of the smooth approximation across different smoothing parameters $t$. Cross-fitting is generally helpful for separating these errors from their different sources.

5 Simulations

In the following sections, we conducted empirical evaluations on both simulated and real-world data to further highlight our proposed methods. We first evaluated the conditional rank-preserving FH upper bounds of the direct estimator and the log-sum-exp approximation under varying smoothing parameters. Subsequently, we demonstrated the accuracy of confounding representation learning in the presence of unobserved confounders, along with the estimation performance for the ATE and continuous treatment response curves of different estimators within the TML framework.

5.1 Simulations on boundary estimation of the counterfactual joint distribution

To validate the theoretical properties and estimation performance of the proposed bounds under observed confounding, we conducted a simulation study with $N = 2{,}000$ samples over 100 replications. The covariates were generated as $X \sim N(0, I_2)$ and treatment assignment followed a logistic model
$$ P(A = 1 \mid X) = \sigma(0.5 X_1 - 0.3 X_2 + \epsilon_S), \qquad \epsilon_S \sim N(0, 0.1^2), $$
where $\sigma(\cdot)$ denotes the sigmoid function.
We designed two different data generating processes (DGPs): a linear SCM where
$$ Y(0) = X_1 + 0.5 X_2 + \epsilon_Y, \qquad Y(1) = Y(0) + 1.0, $$
and a nonlinear setting
$$ Y(0) = \sin(2 X_1) + X_2^2 + \epsilon_Y, \qquad Y(1) = \cos(2 X_1) + X_2^2 + 0.5 + \epsilon_Y, $$
where $\epsilon_Y \sim N(0, 1)$.

First, to demonstrate the gain in identification power (tightness), we compared the width of the standard marginal bounds (Makarov bounds) against the proposed conditional bounds calculated using the true oracle distributions. The width reduction is defined as the difference between the marginal width $W_{\mathrm{marg}} = \min(F_1, F_0) - \max(F_1 + F_0 - 1, 0)$ and the expected conditional width $W_{\mathrm{cond}} = E_X[\min(F_{1|X}, F_{0|X}) - \max(F_{1|X} + F_{0|X} - 1, 0)]$. We observe that covariate-assisted conditional copulas lead to a significant tightening of the bounds (Fig. 2a).

Second, we evaluated the finite-sample performance of the proposed estimators. We implemented the doubly robust direct (DR-Direct) estimator and the DR-Smooth estimator (with varying smoothing parameters) using 5-fold cross-fitting to minimize overfitting. We assessed the bias and mean squared error (MSE = bias$^2$ + SE$^2$) in each replicate (Fig. 2b). The simulation results demonstrate the clear superiority of the smooth approximation over the direct estimator, provided the smoothing parameter $t$ is sufficiently large.

Figure 2: Estimation of the bounds on the joint distribution of potential outcomes. (a) Bound width gained via marginal copulas and conditional copulas. (b) Boxplots of bias and MSE of the estimators in each replicate.

5.2 Simulations on marginal estimation with the triple machine learning approach

To evaluate the efficacy of the proposed representation-learning-based IV estimator, we conducted simulations with unobserved confounding. The DGP introduces a valid instrument $S \sim U(-2, 2)$ and a latent confounder $Z_C \sim N(0, 1)$, ensuring the crucial independence assumption $S \perp Z_C$. The treatment is generated with sufficient variability and nonlinear dependence on the instrument,
$$ Z_S = 0.5 S + 0.1 \tanh(S) + 0.3\, \sigma(2 S) + 0.2 Z_C + \epsilon_S, $$
where $\sigma(\cdot)$ is the sigmoid function and $\epsilon_S \sim N(0, 0.2^2)$. We considered two treatment regimes: a binary treatment where $A \sim \mathrm{Bernoulli}(\sigma(Z_S))$, and a continuous treatment where $A = Z_S$. The outcome $Y$ was generated under two distinct scenarios. In the linear setting, the outcome is a simple additive function:
$$ Y_{\mathrm{lin}} = 1 + 2 A + 3 Z_C + \epsilon_Y. $$
In the nonlinear setting, the outcome includes sinusoidal effects, sigmoid transformations, and treatment-confounder interactions ($A \times Z_C$) that violate standard linear IV assumptions:
$$ Y_{\mathrm{nonlin}} = 1 + \big(0.3 A + 0.2 \sin(2 A + 0.5)\big) + \big(0.3 Z_C + 2 \sigma(Z_C)\big) + 0.2 A Z_C + \epsilon_Y. $$
We simulated $N = 6000$ samples and employed a triple-split cross-fitting strategy ($K = 6$ folds) to separate representation learning (VAE training), nuisance parameter estimation, and final estimation (to which different estimators can be applied).

5.2.1 Representation learning quality

We first examined the quality of latent confounding representation learning. We illustrate the correlation between the true $Z_C$ and the learned $\hat Z_C$, and the dependence between the learned $\hat Z_C$ and the instrument $S$ (Fig. 3). As $Z_C$ is identifiable only up to an invertible transformation, both strong positive and strong negative correlations between $Z_C$ and $\hat Z_C$ provide evidence of successful representation learning.

Figure 3: Representation learning quality, demonstrated by the correlation between the true $Z_C$ and the learned $\hat Z_C$, and the dependence between the learned $\hat Z_C$ and the instrument $S$. (a) Linear DGP. (b) Nonlinear DGP.
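For reference, the data-generating process of this subsection can be reproduced with the short script below (a sketch; random seeds are not specified in the text, and the outcome noise is assumed here to be standard normal, as in Section 5.1).

```python
import numpy as np

def simulate_iv_dgp(n=6000, treatment="binary", outcome="linear", seed=0):
    """Section 5.2 data-generating process with a valid instrument S and a
    latent confounder Z_C (outcome noise scale assumed to be 1)."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    S = rng.uniform(-2, 2, n)                    # instrument, independent of Z_C
    Zc = rng.normal(0, 1, n)                     # latent confounder
    eps_s = rng.normal(0, 0.2, n)
    Zs = 0.5 * S + 0.1 * np.tanh(S) + 0.3 * sigmoid(2 * S) + 0.2 * Zc + eps_s
    A = rng.binomial(1, sigmoid(Zs)) if treatment == "binary" else Zs
    eps_y = rng.normal(0, 1, n)                  # assumed unit-variance noise
    if outcome == "linear":
        Y = 1 + 2 * A + 3 * Zc + eps_y
    else:
        Y = (1 + 0.3 * A + 0.2 * np.sin(2 * A + 0.5)
             + 0.3 * Zc + 2 * sigmoid(Zc) + 0.2 * A * Zc + eps_y)
    return S, A, Y, Zc
```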
5.2.2 Benchmarking ATE estimation

Within the triple machine learning (TML, i.e., IV-based representation learning) framework, we can implement different estimators in the final estimation fold (including outcome regression, IPW, and doubly robust estimators). We apply weight clipping to both the IPW and doubly robust estimators to ensure numerical stability. We benchmarked ATE estimation ($A = 1$ vs. $A = 0$). We evaluated different variants of TML estimators against 2SLS baselines over 100 replications, using the bias and MSE (bias$^2$ + SE$^2$) in each replication as the evaluation metrics, shown in Fig. 4. We find that for continuous treatments, the different estimators within the IV-based representation learning framework all significantly outperform the 2SLS baseline. Even in settings with a binary treatment and linear effects, the performance of our nonparametric method is comparable to that of parametric approaches such as 2SLS.

Figure 4: Simulation results for ATE estimation with unmeasured confounding. We evaluate the 2SLS baseline against different estimators (outcome regression, IPW, and doubly robust) within the triple machine learning (TML) framework. (a) Boxplots of bias in each replicate of the different estimators under different DGPs. (b) Boxplots of MSE in each replicate of the different estimators under different DGPs.

5.2.3 Dose-response curve estimation

To evaluate the estimator's capability to recover the full structural relationship between the treatment and the outcome, we conducted a dose-response curve estimation experiment with a continuous treatment regime. As shown in Fig. 5a, TML not only provides accurate estimates of the average treatment effect, but also recovers the dose-response function $E[Y(a)]$ under continuous treatment assignment, with particularly strong performance observed for the kernel-based doubly robust estimator, especially in treatment regions with high observation density (Fig. 5b).

Figure 5: Dose-response curve estimation. (a) Dose-response ($E[Y(a)]$) curves estimated via different TML estimators. (b) Observed treatment distribution in the simulation.

6 Real-world analysis

We illustrate the proposed methodology by analyzing the demand for cigarettes using the CigarettesSW dataset from the R package AER Kleiber et al. (2020). This dataset comprises panel data for 48 U.S. states over the period 1985-1995. The primary objective is to estimate the price elasticity of demand while accounting for unobserved heterogeneity in state-level consumption behaviors. Let $Y$ denote the logarithm of cigarette consumption (packs per capita) and $A$ the logarithm of the real price. The relationship between price and consumption is confounded by unobserved state-specific factors, $Z_C$, such as regional health awareness and cultural habits.
To address this endogeneity, we employ the logarithm of the average excise tax as the instrument S, which satisfies the strong relevance condition through its direct impact on price and is plausibly exogenous to individual preferences.
Figure 6: Causal analysis of cigarette demand. (a) Estimated average dose-response function with 95% pointwise confidence intervals. The convex shape indicates that price elasticity increases (becomes more negative) at higher price levels. (b) The joint CDF surface estimated via the proposed smooth estimator, targeting the FH upper bound. The concentration of probability mass along the diagonal visualizes the theoretical limit of rank preservation. (c) Scatter plot of individual counterfactual pairs (Ŷ(a_low), Ŷ(a_high)) reconstructed by the fitted outcome model for a_low = 4.63 and a_high = 5.17. Fixing the learned latent heterogeneity ẑ_{c,i} for each state, the strict alignment with the identity benchmark (dashed line) demonstrates that the treatment induces a homogeneous structural shift across the population.
We apply the proposed TML framework to recover the latent confounding structure Z_C and estimate the causal mechanism. Figure 6 summarizes the estimated average and distributional effects based on the learned representation. Panel (a) presents the estimated average dose-response function E[Y(a)] across the observed support of log prices. The curve exhibits a monotonic decrease, consistent with the fundamental law of demand. The narrow 95% pointwise confidence intervals indicate precise estimation of the mean effect. Notably, the response is nonlinear: the slope steepens for log prices above a > 5.1, suggesting an increasing price elasticity in the upper price range. This implies that consumption becomes more sensitive to marginal price increases when prices are already elevated.
Beyond the average effect, Panels (b) and (c) examine the joint dependence structure of the potential outcomes. Panel (b) visualizes the joint CDF estimated via the proposed smooth estimator, specifically targeting the FH upper bound P(Y(a_high) ≤ y_1, Y(a_low) ≤ y_0). This estimate reflects the limit of perfect rank preservation. Panel (c) displays the empirical counterpart at the individual level. Since the true counterfactuals are unobserved, we reconstruct them using the fitted outcome model Ŷ(a, z_c). For each state i, we fix its learned latent representation ẑ_{c,i} and compute the predicted potential outcomes under two quartile values of the observed price, a_low = 4.63 and a_high = 5.17, using the trained outcome model. The resulting scatter plot of the predicted pairs (Ŷ(a_low, ẑ_{c,i}), Ŷ(a_high, ẑ_{c,i})) aligns closely with the rank-preserving benchmark (dashed line), confirming that the latent confounder Z_C induces a comonotonic dependence structure.
Crucially, the linearity observed in Panel (c) on the logarithmic scale implies a homogeneous relative treatment effect. While the latent confounder Z_C captures significant heterogeneity in baseline consumption levels (intercepts), the treatment effect manifests as a uniform structural shift across the distribution. This suggests that price interventions dampen consumption proportionally for both heavy and light smokers, preserving the rank ordering of states by consumption intensity.
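For illustration, the following is a minimal sketch of how the Panel (c) construction could be reproduced under stated assumptions: outcome_model is a hypothetical stand-in for the fitted outcome model Ŷ(a, z_c), and z_hat plays the role of the learned state-level representations; only the two quartile price values are taken from the analysis above.

```python
import numpy as np

# Hypothetical fitted outcome model Y-hat(a, z_c); here a stand-in linear form on the log scale.
def outcome_model(a, z_c):
    return 5.0 - 0.9 * a + 0.4 * z_c

# z_hat holds stand-in learned per-state latent representations (one value per state).
rng = np.random.default_rng(1)
z_hat = rng.normal(size=48)

a_low, a_high = 4.63, 5.17              # quartile values of observed log price (from the analysis)
y_low = outcome_model(a_low, z_hat)     # predicted log consumption at the low price
y_high = outcome_model(a_high, z_hat)   # predicted log consumption at the high price

# Rank preservation check: the counterfactual pairs should be comonotonic (identical ordering).
rank_preserved = np.array_equal(np.argsort(y_low), np.argsort(y_high))
shift = np.mean(y_high - y_low)         # a roughly constant shift indicates a homogeneous effect
print(f"rank preserved: {rank_preserved}, mean shift: {shift:.3f}")
```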
7 Discussion
This paper has developed a unified framework for the identification and estimation of counterfactual distributions, addressing two fundamental challenges: bounding joint distributions under observed confounding and recovering marginal distributions under unobserved confounding. By bridging copula theory with causal representation learning, we provide a pathway from identifying population-level effects to bounding individual-level counterfactual dependencies.
Our analysis of conditional copulas underscores the informative value of covariates in tightening the FH bounds. The upper bound, corresponding to conditional rank preservation, serves as a particularly pragmatic benchmark for individualized treatment effect estimation. We proposed two estimation strategies, a direct estimator under margin conditions and a smooth log-sum-exp approximation, to handle the non-differentiable influence function; the two strategies highlight a fundamental bias-variance trade-off.
In the context of unmeasured confounding, our integration of instrumental variables with causal representation learning offers a nonparametric alternative to classical SEMs. Unlike traditional IV methods that focus on local average treatment effects or rely on linearity, our approach leverages the instrument to identify the latent confounding subspace itself. The triple machine learning procedure extends the double machine learning paradigm to account for the representation learning stage. A key theoretical finding of this work is the characterization of the variance inflation induced by the estimated representation. As shown in Theorem 6, unless the representation learner achieves super-convergence rates (a formidable requirement for deep neural networks), standard errors must be corrected to account for the first-stage estimation uncertainty.
Several limitations of the current framework warrant further investigation. First, identification of the latent confounding space relies on injectivity and completeness conditions; relaxing these conditions to allow for partial identification of the latent space would be a valuable extension. Moreover, while we propose an HSIC-regularized VAE-based algorithm with identifiability guarantees, finite sample size, constrained function approximators, and the non-convex nature of such optimization problems still pose challenges in real-world estimation (Hyvärinen et al., 2023).
Furthermore, a practical challenge arises at the inference stage regarding the representation-learning correction. The variance of the proposed estimator is inflated by the estimation error of the representation Ẑ_C. However, quantifying this inflation explicitly requires the gradients of the invertible mapping ψ linking the learned and true representations, which is analytically unknown and implicitly defined by the neural network training dynamics, making estimation of the correction term IF_{ϕ,corr} non-trivial. Relying on uncorrected variance estimators may inflate Type I error rates in hypothesis testing. Developing feasible methods, such as Hessian approximation (Koh & Liang, 2017) or ensemble-based quantification (Wager & Athey, 2018), to validly capture this uncertainty remains an open and critical area for future research (Gawlikowski et al., 2023).
In summary, this work bridges the gap between classical semiparametric theory and modern representation learning. By formally characterizing the identification limits of latent confounding through instrumental variables and establishing the asymptotic distribution of the "triple machine learning" estimator, we provide a principled alternative to ad hoc deep causal estimation. Simultaneously, our copula-based bounds refine the understanding of treatment effect heterogeneity beyond marginal aggregates, offering a rigorous tool for assessing individual-level risks. Collectively, these contributions lay the groundwork for a more robust statistical foundation for causal inference with high-dimensional, unstructured data.
Acknowledgment
We gratefully acknowledge Prof. Edward Kennedy, JungHo Lee, and Ignavier Ng for their valuable comments and suggestions on the manuscript.
References
Andrews, D. W. (2000), 'Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space', Econometrica pp. 399–405.
Angrist, J. D., Imbens, G. W. & Rubin, D. B. (1996), 'Identification of causal effects using instrumental variables', Journal of the American Statistical Association 91(434), 444–455.
Bang, H. & Robins, J. M. (2005), 'Doubly robust estimation in missing data and causal inference models', Biometrics 61(4), 962–973.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018), 'Double/debiased machine learning for treatment and structural parameters'.
Fan, J. & Truong, Y. K. (1993), 'Nonparametric regression with errors in variables', The Annals of Statistics pp. 1900–1925.
Flores, C. A., Flores-Lagunes, A., Gonzalez, A. & Neumann, T. C. (2012), 'Estimating the effects of length of exposure to instruction in a training program: The case of Job Corps', Review of Economics and Statistics 94(1), 153–171.
Fong, C., Hazlett, C. & Imai, K. (2018), 'Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements', The Annals of Applied Statistics 12(1), 156–177.
Fukumizu, K., Bach, F. R. & Jordan, M. I. (2004), 'Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces', Journal of Machine Learning Research 5(Jan), 73–99.
Fukumizu, K., Gretton, A., Sun, X. & Schölkopf, B. (2007), 'Kernel measures of conditional dependence', Advances in Neural Information Processing Systems 20.
Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R. et al. (2023), 'A survey of uncertainty in deep neural networks', Artificial Intelligence Review 56(Suppl 1), 1513–1589.
Gretton, A., Bousquet, O., Smola, A. & Schölkopf, B. (2005), Measuring statistical dependence with Hilbert-Schmidt norms, in 'International Conference on Algorithmic Learning Theory', Springer, pp. 63–77.
He, Z., Pogodin, R., Li, Y., Deka, N., Gretton, A. & Sutherland, D. J. (2025), On the hardness of conditional independence testing in practice, in 'The Thirty-ninth Annual Conference on Neural Information Processing Systems'.
Hirano, K. & Imbens, G. W. (2004), The propensity score with continuous treatments, in A. Gelman & X.-L. Meng, eds, 'Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives', Wiley InterScience, West Sussex, England, pp. 73–84.
Hyvärinen, A., Khemakhem, I.
& Moriok a, H. (2023), ‘Nonlinear indep enden t comp onent analysis for principled disen tanglement in unsup ervised deep learning’, Patterns 4 (10). Jacot, A., Gabriel, F. & Hongler, C. (2018), ‘Neural tangent k ernel: Conv ergence and gen- eralization in neural net works’, A dvanc es in neur al information pr o c essing systems 31 . Kennedy , E. H. (2024), ‘Semiparametric doubly robust targeted double mac hine learning: a review’, Handb o ok of statistic al metho ds for pr e cision me dicine pp. 207–236. Kennedy , E. H., Ma, Z., McHugh, M. D. & Small, D. S. (2017), ‘Non-parametric methods for doubly robust estimation of contin uous treatmen t effects’, Journal of the R oyal Statistic al So ciety Series B: Statistic al Metho dolo gy 79 (4), 1229–1245. Kleib er, C., Zeileis, A. & Zeileis, M. A. (2020), ‘Pac k age ‘aer”, R p ackage version 1 (4). Koh, P . W. & Liang, P . (2017), Understanding blac k-b ox predictions via influence functions, in ‘International conference on mac hine learning’, PMLR, pp. 1885–1894. Kong, L., Xie, S., Y ao, W., Zheng, Y., Chen, G., Sto jano v, P ., Akin wande, V. & Zhang, K. (2022), Partial disen tanglement for domain adaptation, in ‘In ternational conference on mac hine learning’, PMLR, pp. 11455–11472. Levis, A. W., Bonvini, M., Zeng, Z., Keele, L. & Kennedy , E. H. (2025), ‘Co v ariate-assisted b ounds on causal effects with instrumental v ariables’, Journal of the R oyal Statistic al So ciety Series B: Statistic al Metho dolo gy p. qk af028. Levis, A. W., Kennedy , E. H. & Keele, L. (2024), ‘Nonparametric identification and efficient estimation of causal effects with instrumental v ariables’, arXiv pr eprint . Moran, G. & Aragam, B. (2026), ‘T ow ards interpretable deep generativ e mo dels via causal represen tation learning’, Journal of the Americ an Statistic al Asso ciation 0 (ja), 1–32. Nelsen, R. B. (2006), An intr o duction to c opulas , Springer. New ey , W. K. (2013), ‘Nonparametric instrumental v ariables estimation’, A meric an Ec o- nomic R eview 103 (3), 550–556. 34 Ng, I., Bl¨ obaum, P ., Bhandari, S., Zhang, K. & Kasiviswanathan, S. (2025), ‘Debiasing re- w ard models b y represen tation learning with guaran tees’, arXiv pr eprint . Ng, I., Xie, S., Dong, X., Spirtes, P . & Zhang, K. (2025), Causal represen tation learning from general environmen ts under nonparametric mixing, in ‘The 28th In ternational Conference on Artificial In telligence and Statistics’. Sc h¨ olk opf, B., Lo catello, F., Bauer, S., Ke, N. R., Kalch brenner, N., Goy al, A. & Bengio, Y. (2021), ‘T o ward causal represen tation learning’, Pr o c e e dings of the IEEE 109 (5), 612–634. Shah, R. D. & P eters, J. (2020), ‘The hardness of conditional independence testing and the generalised cov ariance measure’, The A nnals of Statistics 48 (3), 1514–1538. Sheng, T. & Sriperumbudur, B. K. (2023), ‘On distance and k ernel measures of conditional dep endence’, Journal of Machine L e arning R ese ar ch 24 (7), 1–16. Silv erman, B. W. (2018), Density estimation for statistics and data analysis , Routledge. Sw anson, S. A., Hern´ an, M. A., Miller, M., Robins, J. M. & Richardson, T. S. (2018), ‘P artial identification of the av erage treatment effect using instrumen tal v ariables: review of metho ds for binary instrumen ts, treatments, and outcomes’, Journal of the Americ an Statistic al Asso ciation 113 (522), 933–947. W ager, S. & A they , S. 
(2018), 'Estimation and inference of heterogeneous treatment effects using random forests', Journal of the American Statistical Association 113(523), 1228–1242.
Wu, P., Li, H., Zheng, C., Zeng, Y., Chen, J., Liu, Y., Guo, R. & Zhang, K. (2025), Learning counterfactual outcomes under rank preservation, in 'Advances in Neural Information Processing Systems'.
Xie, S., Huang, B., Gu, B., Liu, T. & Zhang, K. (2023), 'Advancing counterfactual inference through nonlinear quantile regression', arXiv preprint arXiv:2306.05751.
Technical Appendix
A Proof of Theorem 2
Proof. 1) Proof of the upper bound (U ≤ U_marg). Define the function g: R^2 → R as g(a, b) = min(a, b). The function g is the pointwise minimum of two linear (and thus concave) functions, f_1(a, b) = a and f_2(a, b) = b; therefore g is concave. By Jensen's inequality for concave functions, for any random vector Y we have E[g(Y)] ≤ g(E[Y]). Let the random vector be (θ_1(X), θ_0(X)). Applying the inequality,
E_X[g(θ_1(X), θ_0(X))] ≤ g(E_X[θ_1(X)], E_X[θ_0(X)]).
Substituting the definitions of g, θ_a, U, and U_marg,
E_X[min{θ_1(X), θ_0(X)}] = U(y_1, y_0) ≤ min{E_X[θ_1(X)], E_X[θ_0(X)]}.
Since F_{Y(a)}(y_a) = E_X[θ_a(X)], the right-hand side equals min{F_{Y(1)}(y_1), F_{Y(0)}(y_0)} = U_marg(y_1, y_0). Thus U(y_1, y_0) ≤ U_marg(y_1, y_0).
2) Proof of the lower bound (L_marg ≤ L). Define the function h: R^2 → R as h(a, b) = max(a + b − 1, 0). The function h is the pointwise maximum of two linear (and thus convex) functions, f_1(a, b) = a + b − 1 and f_2(a, b) = 0; therefore h is convex. By Jensen's inequality for convex functions, for any random vector Y we have E[h(Y)] ≥ h(E[Y]). Let the random vector again be (θ_1(X), θ_0(X)). Applying the inequality,
E_X[h(θ_1(X), θ_0(X))] ≥ h(E_X[θ_1(X)], E_X[θ_0(X)]).
Substituting the definitions of h and θ_a,
E_X[max{θ_1(X) + θ_0(X) − 1, 0}] = L(y_1, y_0) ≥ max{E_X[θ_1(X)] + E_X[θ_0(X)] − 1, 0},
where we used the linearity of expectation inside the max on the right-hand side. Since F_{Y(a)}(y_a) = E_X[θ_a(X)], the right-hand side equals max{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) − 1, 0} = L_marg(y_1, y_0). Thus L(y_1, y_0) ≥ L_marg(y_1, y_0).
We have established L_marg ≤ L and U ≤ U_marg, which proves that the interval [L(y_1, y_0), U(y_1, y_0)] is a (potentially tighter) sub-interval of [L_marg(y_1, y_0), U_marg(y_1, y_0)]. Equality holds if and only if the conditional CDFs θ_a(X) are constant almost surely in X, i.e., the covariates X provide no additional information beyond the marginals.
B Proof for the efficient estimator of the conditional FH bound
We first derive the influence function (IF) for the conditional CDF functional θ_a(·), then use the chain rule to obtain the EIF for a smooth functional ϕ, and finally handle the nonsmooth min function using (i) smoothing and (ii) a direct subgradient argument under a margin condition.
B.1 Influence function components under regular conditions
B.1.1 EIF for θ_a(·) and for E[θ_a(X)]
Fix y ∈ R and a ∈ {0, 1}. Define the functional on the full distribution P,
Θ_a(P)(x) := θ_a(x) = P(Y ≤ y | A = a, X = x).
W e are interested in the scalar functional Ψ a ( P ) := E X [ θ a ( X )]. T o obtain the pathwise deriv ativ e (influence function) w e follo w the standard parametric-submo del / Gateaux- deriv ativ e route. Let { P ε : ε ∈ ( − ε 0 , ε 0 ) } b e any regular parametric submodel with score s = o (1) at ε = 0, i.e. dP ε /dP = 1 + εs + o ( ε ) with E P [ s ] = 0. W rite θ a,ε ( x ) for the conditional CDF under P ε . W e compute d dε ε =0 Ψ a ( P ε ) = d dε ε =0 E P ε θ a,ε ( X ) . Differen tiate using pro duct rule d dε E P ε [ θ a,ε ( X )] = E P ˙ θ a ( X ) + E P θ a ( X ) s ( O ) , 37 where ˙ θ a ( x ) := d dε ε =0 θ a,ε ( x ). W e now compute ˙ θ a ( x ) by differentiating the conditional CDF θ a,ε ( x ) = P ε ( Y ≤ y | A = a, X = x ) = E P ε 1 { A = a } 1 { Y ≤ y } | X = x P ε ( A = a | X = x ) . Differen tiate at ε = 0; denote π a ( x ) = P ( A = a | X = x ). Using quotien t rule we obtain ˙ θ a ( x ) = 1 π a ( x ) E P ( 1 { A = a } 1 { Y ≤ y } ) s ( O ) | X = x − θ a ( x ) π a ( x ) E P 1 { A = a } s ( O ) | X = x . (9) Multiply both sides b y the marginal densit y of X and in tegrate to get the Gateaux deriv ativ e of Ψ a . Rearranging and using standard score calculations yields that the influence function for Ψ a is IF a (Ψ a ( P )) = θ a ( X ) − E [ θ a ( X )] + 1 { A = a } π a ( X ) 1 { Y ≤ y } − θ a ( X ) . (10) W e can chec k E [IF a (Ψ a ( P ))] = 0 and that the pathwise deriv ative equals E [ s ( O )IF a (Ψ a ( P ))] for all scores s , which confirms (10) is the canonical gradient for Ψ a . B.1.2 Chain rule: EIF for smo oth ϕ Let ϕ : R 2 → R be con tin uously differen tiable. Define Ψ( P ) = E X ϕ θ 0 ( X ) , θ 1 ( X ) . Consider a parametric submo del P ε with score s and let θ a,ε ( x ) b e the conditional CDF under P ε . Differentiate d dε ε =0 Ψ( P ε ) = E P h 1 X a =0 ∂ a ϕ ( θ 0 ( X ) , θ 1 ( X )) ˙ θ a ( X ) i + E P ϕ ( θ 0 , θ 1 ) s ( O ) . Using (10) which gives the deriv ative of E [ θ a ( X )], we can rewrite the ab ov e as d dε ε =0 Ψ( P ε ) = E P s ( O ) · IF(Ψ( P )) , with IF(Ψ( P )) = 1 X a =0 ∂ a ϕ ( θ 0 ( X ) , θ 1 ( X )) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + ϕ ( θ 0 ( X ) , θ 1 ( X )) − Ψ( P ) . (11) 38 Again we chec k E P [IF(Ψ( P ))] = 0, and for an y score s , d dε Ψ( P ε ) | ε =0 = R (IF · s ) dP , verifying (11) is the canonical gradient. W e no w treat ϕ ( u, v ) = min { u, v } . Since min is not differen tiable on the diagonal u = v , w e consider tw o strategies: Direct Estimation under a Margin Condition or approximation via a smo oth function. B.2 Pro of of asymptotic prop ert y of direct estimator under mar- gin condition W e first prov e the statistical prop erties of direct estimator (Theorem 3) by introducing a p olynomial margin condition (Assumption 2.1). Pr o of. W e follo w the standard semi-parametric argumen t. Step 1. Oracle represen tation and pathwise deriv ativ e (canonical gradien t). Assume for the momen t the selector d ( x ) is known (“oracle” case). Then Ψ( P ) = E 1 { d ( X ) = 0 } θ 0 ( X ) + 1 { d ( X ) = 1 } θ 1 ( X ) . Let { P ε : ε } be a regular parametric submo del with score function s ( O ) at ε = 0 so that d dε ε =0 log dP ε dP = s ( O ) , E [ s ( O )] = 0 . Differen tiate Ψ( P ε ) using pro duct rule and the conditional structure. W e compute the deriv a- tiv e of each term. Fix a ∈ { 0 , 1 } . 
F or the term E [ 1 { d ( X ) = a } θ a,ε ( X )] we hav e d dε ε =0 E P ε [ 1 { d ( X ) = a } θ a,ε ( X )] = E 1 { d ( X ) = a } ˙ θ a ( X ) + E 1 { d ( X ) = a } θ a ( X ) s ( O ) , where ˙ θ a ( x ) = d dε ε =0 θ a,ε ( x ). Using deriv ed equation 9 for ˙ θ a ( x ) ˙ θ a ( x ) = 1 π a ( x ) E ( 1 { A = a } 1 { Y ≤ y } ) s ( O ) | X = x − θ a ( x ) π a ( x ) E 1 { A = a } s ( O ) | X = x , and in tegrating the pathwise deriv ative of Ψ along the score s d dε ε =0 Ψ( P ε ) = E s ( O ) IF oracle (Ψ( P ; d )) , w e obtain the (cen tered) oracle influence function IF oracle (Ψ( P ; d )) = 1 X a =0 1 { d ( X ) = a } 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + θ a ( X ) − Ψ( P ) . (12) 39 W e can c heck E [IF oracle ] = 0. Thus (12) is the canonical gradien t under the oracle selector. Step 2. F easible estimator and v on-Mises decomp osition. In practice d ( x ) is unkno wn; replace it by the plug-in selector in a separated indep endent data (by data splitting or cross-fitting) ˆ d ( x ) = arg min a ˆ θ a ( x ) . W e denote the uncentered influence function corresp onding to IF oracle + Ψ( P ) as φ ( O ; P , d ) = 1 X a =0 1 { d ( X ) = a } 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + θ a ( X ) , plug-in ev aluated at true selector φ ( O ; ˆ P , d ) = X a 1 { d ( X ) = a } ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) , and φ ( O ; ˆ P , ˆ d ) denotes the same expression with ˆ d in place of d , then define the feasible doubly-robust estimator ˆ Ψ = P n [ φ ( O ; ˆ P , ˆ d )] = P n h 1 X a =0 1 { ˆ d ( X ) = a } ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) i . (13) W e compare ˆ Ψ to Ψ b y adding and subtracting the oracle influence function ˆ Ψ − Ψ( P ) = P n [ φ ( O ; ˆ P , ˆ d )] − P [ φ ( O ; P , d )] (14) = ( P n − P ) φ ( O ; P , d ) + ( P n − P )[ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d )] + P [ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d )] , = S + R 1 + R 2 , W e b ound R 2 b y separating nuisance and selector errors P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) i = P h φ ( O ; ˆ P , d ) − φ ( O ; P , d ) i | {z } nuisance error B nuis + P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) i | {z } selector error B sel . Step 3. Bound for the nuisance remainder B nuis . W e first aim to b ound B nuis = P h φ ( O ; ˆ P , d ) − φ ( O ; P , d ) i . Recall that P [ · ] denotes the expectation E O [ · ]. W e use the la w of iterated exp ectations, E O [ · ] = E X E A,Y | X [ · | X ] , and first compute the inner conditional exp ectation. 40 F or a ∈ { 0 , 1 } , let ψ a ( X ; ˆ P ) b e the conditional exp ectation of the a -th comp onen t ψ a ( X ; ˆ P ) := E A,Y | X h φ ( O ; ˆ P , a ) | X i = E A,Y | X ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) | X = ˆ θ a ( X ) + E A | X [ 1 { A = a } | X ] ˆ π a ( X ) E Y | A,X [ 1 { Y ≤ y } | A = a, X ] − ˆ θ a ( X ) = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) . The corresp onding conditional exp ectation ψ a ( X ; P ) ev aluated at the true parameters P is ψ a ( X ; P ) := E A,Y | X [ φ ( O ; P , a ) | X ] = θ a ( X ) + π a ( X ) π a ( X ) ( θ a ( X ) − θ a ( X )) = θ a ( X ) . No w w e can rewrite B nuis b y substituting these conditional exp ectations B nuis = E X " E A,Y | X " 1 X a =0 1 { d ( X ) = a } φ ( O ; ˆ P , a ) − φ ( O ; P , a ) | X ## = E X " 1 X a =0 1 { d ( X ) = a } ψ a ( X ; ˆ P ) − ψ a ( X ; P ) # . 
W e compute the difference ψ a ( X ; ˆ P ) − ψ a ( X ; P ) ψ a ( X ; ˆ P ) − ψ a ( X ; P ) = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) ( θ a ( X ) − ˆ θ a ( X )) − θ a ( X ) = ( ˆ θ a ( X ) − θ a ( X )) − π a ( X ) ˆ π a ( X ) ( ˆ θ a ( X ) − θ a ( X )) = ( ˆ θ a ( X ) − θ a ( X )) 1 − π a ( X ) ˆ π a ( X ) = ( ˆ θ a ( X ) − θ a ( X )) ˆ π a ( X ) − π a ( X ) ˆ π a ( X ) . This iden tity shows that the n uisance remainder is a pro duct of the estimation errors in θ a and π a . This is the k ey “Neyman-Orthogonal” or “Doubly-Robust” structure, whic h ensures the first-order (linear) error terms cancel exactly . Substituting this bac k in to the expression for B nuis yields B nuis = E X " 1 X a =0 1 { d ( X ) = a } ( ˆ θ a ( X ) − θ a ( X )) ˆ π a ( X ) − π a ( X ) ˆ π a ( X ) # . 41 W e no w b ound this remainder. Assuming the estimators are b ounded inf x ˆ π a ( x ) ≥ π > 0, b y Cauc hy-Sc hw arz, w e ha ve | B nuis | ≤ E X " 1 X a =0 1 { d ( X ) = a } ˆ θ a − θ a · ˆ π a − π a ˆ π a # ≲ 1 X a =0 E X h ˆ θ a ( X ) − θ a ( X ) · | ˆ π a ( X ) − π a ( X ) | i ≤ 1 X a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) . Th us, w e ha ve the tigh t, second-order b ound | B nuis | = O p 1 X a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) . (15) This b ound sho ws that B nuis = o p ( n − 1 / 2 ) under the pro duct-rate condition ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ). Step 4. Bound for the selector remainder B sel . The selector remainder arises b ecause w e use ˆ d instead of d . Note φ ( O ; ˆ P , a ) = ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )). F or brevit y we write ∆ d ( X ) := 1 { ˆ d ( X ) = d ( X ) } ∈ { 0 , 1 } , ∆ φ ( O ; ˆ P ) = φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0), ∆( X ) = θ 1 ( X ) − θ 0 ( X ) and similarly ˆ ∆( X ) = ˆ θ 1 ( X ) − ˆ θ 0 ( X ). Using the p oint wise iden tity φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) = X a ∈{ 0 , 1 } [ 1 ( ˆ d ( X ) = a ) φ ( O ; ˆ P , 1)] − X a ∈{ 0 , 1 } [ 1 ( d ( X ) = a ) φ ( O ; ˆ P , 1)] = 1 { b d ( X ) = 1 } − 1 { d ( X ) = 1 } · φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0) = ∆ d ( X ) sgn( ˆ d ( X ) − d ( X )) · ∆ φ ( O ; ˆ P ) (16) Recall ψ a ( X ; ˆ θ , ˆ π ) := E A,Y | X h φ ( O ; ˆ P , a ) | X i = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) . Apply the La w of Iterated Exp ectations and substitute this into the expression for B sel , we hav e B sel = P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) i = E X h E A,Y | X h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) | X ii = E X h ∆ d ( X ) sgn( ˆ d ( X ) − d ( X )) · ( ψ 1 ( X ; ˆ θ , ˆ π ) − ψ 0 ( X ; ˆ θ , ˆ π )) i ≤ E X h 1 { ˆ d = d } · | M ∗ ( X ) | i , 42 where M ∗ ( X ) = ψ 1 ( X ; ˆ θ , ˆ π ) − ψ 0 ( X ; ˆ θ , ˆ π ) = E A,Y | X [∆ φ ( O ; ˆ P )]. Assume the prop ensity estimators are uniformly b ounded aw ay from zero, inf x min a ˆ π a ( x ) ≥ π > 0, and note | π a | ≤ 1, | θ a | ≤ 1 and | ˆ θ a | ≤ 1. Observe that point wise | ψ a | ≤ max a ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) ≤ 1 + 1 π , so | M ∗ | is uniformly in tegrable and sup X | M ∗ | < ∞ . Thus, | B sel | ≤ P ( ˆ d ( X ) = d ( X )) · sup | M ∗ ( X ) | . Note that ˆ d ( X ) = d ( X ) implies that the sign of ∆( X ) is flipp ed by estimation error large enough to o vercome the gap. Indeed, { ˆ d ( X ) = d ( X ) } ⊆ n | ∆( X ) | ≤ | ˆ ∆( X ) − ∆( X ) | o = n | ∆( X ) | ≤ ∥ ˆ ∆( X ) − ∆( X ) ∥ ∞ o . 
Therefore, by Assumption 2.1 (polynomial margin of exp onent α ), P ˆ d ( X ) = d ( X ) ≤ P | ∆( X ) | ≤ ∥ ˆ ∆( X ) − ∆( X ) ∥ ∞ ≲ ∥ ˆ ∆ − ∆ ∥ α ∞ . W e kno w ∥ ˆ ∆ − ∆ ∥ ∞ ≤ ∥ ˆ θ 0 − θ 0 ∥ ∞ + ∥ ˆ θ 1 − θ 1 ∥ ∞ ≤ 2 max a ∥ ˆ θ a − θ a ∥ ∞ . Hence P ( ˆ d = d ) ≲ max a ∥ ˆ θ a − θ a ∥ ∞ α . W e then decompose | M ∗ ( X ) | = ˆ θ 1 + π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) − ˆ θ 0 + π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) = ( ˆ θ 1 − ˆ θ 0 ) + π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) − π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) = | ˆ ∆( X ) + R IPW ( X ) | = | ∆( X ) + ( ˆ ∆( X ) − ∆( X )) + R IPW ( X ) | ≤ | ∆( X ) | + | ( ˆ ∆( X ) − ∆( X )) | + | R IPW ( X ) | . On the ev ent ˆ d = d , w e kno w | ∆( X ) | ≤ | ˆ ∆( X ) − ∆( X ) | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . Also, | R IPW ( X ) | ≤ π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) + π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . Hence w e hav e sup | M ∗ ( X ) | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . 43 Inserting the margin bound for P ( ˆ d = d ) yields | B sel | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ 1+ α . (17) This is the k ey selector-bias b ound: the cost of using the plug-in selector is controlled by a (1 + α ) p ow er of the sup-norm estimation error. Step 5. Bound the empirical pro cess term W e no w b ound the empirical process term R 1 = ( P n − P ) φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) . T o a void the need for Donsker conditions, w e assume sample splitting: the n uisance estimates ( ˆ θ a , ˆ π a ) and selector ˆ d are trained on an auxiliary sample indep enden t of the one used to ev aluate P n . Conditional on this training sample, φ ( O ; ˆ P , ˆ d ) is a fixed measurable function of O , so that standard empirical pro cess inequalities apply . By standard empirical pro cess argumen t, conditional on the indep endent training sample w e ha ve E | R 1 | | ˆ P , ˆ d ≲ 1 √ n ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) . Therefore, it suffices to con trol the L 2 ( P )–distance ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) = o p (1). Similarly , ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) ≤ ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) ∥ L 2 ( P ) + ∥ φ ( O ; ˆ P , d ) − φ ( O ; P , d ) ∥ L 2 ( P ) =: T sel + T nuis . Bound for T nuis . This term corresp onds to p erturbations in the nuisance functions with selector fixed. Similar to expansion as in Step 3 (see Eq. (15)) but no w in L 2 ( P ) norm rather than L 1 ( P ), eac h component is a sum of n uisance estimation errors. φ ( O ; ˆ P , a ) − φ ( O ; P , a ) = ˆ θ a + 1 { A = a } ˆ π a ( 1 { Y ≤ y } − ˆ θ a ) − θ a + 1 { A = a } π a ( 1 { Y ≤ y } − θ a ) =( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } 1 { Y ≤ y } − θ a ˆ π a − 1 { Y ≤ y } − θ a π a =( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } ( 1 { Y ≤ y } − θ a ) π a − ˆ π a ˆ π a π a 44 Then T nuis = 1 X a =0 1 { d ( X ) = a } φ ( O ; ˆ P , a ) − φ ( O ; P , a ) L 2 ( P ) ≤ 1 X a =0 ( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } ( 1 { Y ≤ y } − θ a ) π a − ˆ π a ˆ π a π a L 2 ( P ) ≲ 1 X a =0 | ˆ θ a − θ a | L 2 ( P ) + | ˆ π a − π a | L 2 ( P ) = o p (1) , (18) when both n uisance estimators are consisten t | ˆ θ a − θ a | L 2 ( P ) = o p (1) and | ˆ π a − π a | L 2 ( P ) = o p (1). Bound for T sel . 
Using the p oin twise iden tity equation 16, since it is supp orted only on the set { b d ( X ) = d ( X ) } , taking exp ectation giv es T 2 sel = E h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) 2 i = E ∆ d ( X ) 2 ∆ φ ( O ; ˆ P ) 2 = E X h ∆ d ( X ) · E A,Y | X h ∆ φ ( O ; ˆ P ) 2 | X ii , where ∆ d ( X ) := 1 { ˆ d ( X ) = d ( X ) } ∈ { 0 , 1 } , ∆ φ ( O ; ˆ P ) = φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0). Recall ∆ φ ( O ; ˆ P ) is bounded when ˆ π a ≥ π > 0. Consequently , T sel ≲ P ( ˆ d = d ) 1 / 2 ≲ max a ∥ ˆ θ a − θ a ∥ ∞ α/ 2 = o p (1) . (19) Com bining. Substituting (18) and (19) into the empirical pro cess b ound yields R 1 = o p ( n − 1 / 2 ) . under the mild condition ∥ ˆ θ a − θ a ∥ = o p (1) and ∥ ˆ π a − π a ∥ = o p (1). Step 6. Asymptotic linearit y and normality . Com bining Steps 3–5, all remainders R 1 and R 2 are o p ( n − 1 / 2 ), and th us the estimator ˆ Ψ in (13) admits the asymptotic linear representation √ n ( ˆ Ψ − Ψ) = 1 √ n n X i =1 φ ( O i ; P , d ) + o p (1) , with influence function giv en in (12). Since IF oracle (Ψ( P ; d )) has finite v ariance (it is a b ounded com bination of bounded terms under our assumptions), classical CL T giv es √ n ( ˆ Ψ − Ψ) d → N 0 , V ar( φ ( O ; P , d ) . 45 A consistent v ariance estimator is ˆ σ 2 = P n φ ( O ; ˆ P , ˆ d ) − ˆ Ψ 2 . Under the same remainder conditions one verifies ˆ σ 2 p → V ar( φ ( O ; P , d )). B.3 Pro of of asymptotic prop ert y of smo oth-appro ximation esti- mator W e no w pro ve the statistical prop erties of smooth-approximation estimator with a fixed smo oth parameter t (Theorem 4). Pr o of. The pro of follows standard semiparametric argumen t as well. Step 1. Smo oth approximation and differen tiabilit y . Define, for t > 0, g t ( u, v ) := − 1 t log e − tu + e − tv , ( u, v ) ∈ [0 , 1] 2 . Then g t ( u, v ) is smo oth and satisfies min( u, v ) − log 2 t ≤ g t ( u, v ) ≤ min( u, v ) , and lim t →∞ g t ( u, v ) = min( u, v ) . W e appro ximate the functional Ψ( P ) = E [min( θ 0 ( X ) , θ 1 ( X ))] by Ψ t ( P ) = E g t ( θ 0 ( X ) , θ 1 ( X )) , whic h is con tin uously Gateaux-differentiable in P for any finite t . As t → ∞ , Ψ t ( P ) ↑ Ψ( P ). Step 2. P ath wise deriv ative and canonical gradien t. Our target parameter is Ψ t ( P ) = E [ g t ( θ 0 ( X ) , θ 1 ( X ))]. Let P ε b e a regular parametric submo del with score s ( O ) at ε = 0, the path wise deriv ative of Ψ t ( P ) is d dε Ψ t ( P ε ) ε =0 = d dε E ε [ g t ( θ 0 ,ε ( X ) , θ 1 ,ε ( X ))] ε =0 Using previously deriv ed EIF for smo oth functional (equation 11), w e ha v e IF(Ψ t ( P )) = 1 X a =0 w a,t ( X ) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) − Ψ t ( P ) , (20) where w a,t ( x ) = ∂ a g t ( θ 0 ( x ) , θ 1 ( x )) = e − tθ a ( x ) e − tθ 0 ( x ) + e − tθ 1 ( x ) are smo oth functions b ounded in [0 , 1] with P 1 a =0 w a,t ( x ) = 1, interpreted as a smo oth w eigh ting function betw een θ 0 ( x ) and θ 1 ( x ). 46 Step 3. Doubly-robust estimator. W e denote φ t ( O ; P ) the uncen tered influence func- tion φ t ( O ; P ) = 1 X a =0 w a,t ( X ) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) , and φ t ( O ; ˆ P ) for using estimated n uisance parameters ˆ θ a ( X ) and ˆ π a ( X ) obtained on an indep enden t training sample. Define ˆ Ψ t = P n φ t ( O ; ˆ P ) = P n h 1 X a =0 ˆ w a,t ( X ) 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) i , (21) where ˆ w a,t ( X ) = e − t ˆ θ a ( X ) e − t ˆ θ 0 ( X ) + e − t ˆ θ 1 ( X ) . 
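For concreteness, here is a minimal numerical sketch (not part of the original derivation) of how the smoothed doubly robust estimator in (21) could be computed from cross-fitted nuisance estimates. All array names are illustrative, and the propensity clipping threshold is an assumption made for numerical stability.

```python
import numpy as np

def smooth_min_estimate(theta0, theta1, pi0, pi1, A, Y_ind, t=20.0):
    """Smoothed doubly robust estimate of E[min(theta0(X), theta1(X))], mirroring eq. (21).

    theta0, theta1 : cross-fitted estimates of P(Y <= y | A=a, X) at each evaluation point
    pi0, pi1       : cross-fitted propensity estimates P(A=a | X)
    A              : observed binary treatments
    Y_ind          : indicators 1{Y <= y}
    t              : smoothing parameter of the log-sum-exp approximation
    """
    # g_t(u, v) = -(1/t) * log(exp(-t u) + exp(-t v))
    g_t = -np.logaddexp(-t * theta0, -t * theta1) / t
    # smooth weights w_{a,t}(X) = exp(-t theta_a) / (exp(-t theta_0) + exp(-t theta_1))
    w1 = 1.0 / (1.0 + np.exp(-t * (theta0 - theta1)))
    w0 = 1.0 - w1
    # doubly robust correction terms for each treatment arm
    corr0 = w0 * (A == 0) / np.clip(pi0, 1e-3, None) * (Y_ind - theta0)
    corr1 = w1 * (A == 1) / np.clip(pi1, 1e-3, None) * (Y_ind - theta1)
    scores = g_t + corr0 + corr1
    est = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))   # plug-in standard error
    return est, se
```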
W e will then establish the rate conditions when ˆ Ψ t is ro ot- n consistent and asymptotic normalit y . Step 4. von Mises expansion and remainder decomp osition. Let ˆ η denote estimated n uisances from the indep enden t sample. Then ˆ Ψ t − Ψ t ( P ) = ( P n − P ) φ t ( O ; P ) + P [ φ t ( O ; ˆ P ) − φ t ( O ; P )] | {z } B nuis + ( P n − P )[ φ t ( O ; ˆ P ) − φ t ( O ; P )] | {z } R n , (22) where φ t ( O ; ˆ P ) is the uncentered plug-in influence function with ˆ η . Since the training sample used for ˆ η is indep endent of P n , conditional on ˆ η the term R n is a mean-zero empirical pro cess. W e treat B nuis and R n separately . Step 5. Con trol of B nuis . The n uisance remainder is B nuis = P [ φ t ( O ; ˆ P ) − φ t ( O ; P )]. W e use the La w of Iterated Exp ectations E O [ · ] = E X [ E A,Y | X [ · | X ]]. F or the a -th comp onen t of 47 the uncentered IF φ t ( O , ˆ P ), the conditional exp ectation is E [ φ t ( O , ˆ P ) | X ] = E " 1 X a =0 n ˆ w a,t ( X ) 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) o + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) X # = 1 X a =0 ˆ w a,t ( X ) E [ 1 { A = a } | X ] ˆ π a ( X ) E [ 1 { Y ≤ y } | A = a, X ] − ˆ θ a ( X ) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) = 1 X a =0 ˆ w a,t ( X ) π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) . Similarly , the conditional exp ectation of the φ t ev aluated at the true parameters P is E [ φ t ( O ; P ) | X ] = 1 X a =0 w a,t ( X ) π a ( X ) π a ( X ) ( θ a ( X ) − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) = g t ( θ 0 ( X ) , θ 1 ( X )) . Therefore, B nuis is given by the exp ectation of the difference in conditional means B nuis = E X h E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] i , where E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] = " 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) + g t ( ˆ θ 0 , ˆ θ 1 ) # − g t ( θ 0 , θ 1 ) = 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) − h g t ( θ 0 , θ 1 ) − g t ( ˆ θ 0 , ˆ θ 1 ) i . No w, w e use the first-order T aylor expansion for g t ( θ 0 , θ 1 ) around ˆ θ , g t ( θ 0 , θ 1 ) − g t ( ˆ θ 0 , ˆ θ 1 ) = 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) + R θ , where R θ is a quadratic remainder term | R θ ( X ) | ≲ t · ∥ ˆ θ ( X ) − θ ( X ) ∥ 2 2 = t · P 1 a =0 ( ˆ θ a ( X ) − θ a ( X )) 2 , since ∂ 2 g t ∂ θ 2 0 = − t · e − tθ 0 e − tθ 1 ( e − tθ 0 + e − tθ 1 ) 2 = − t · w 0 ,t · w 1 ,t . Substituting this expansion bac k E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] = 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) − " 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) + R θ # = 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) π a ˆ π a − 1 − R θ = 1 X a =0 ˆ w a,t ( ˆ θ a − θ a ) ˆ π a − π a ˆ π a − R θ . 48 The first term is the key second-order in teraction term. The remainder R θ is O p ( t ∥ ˆ θ − θ ∥ 2 L 2 ( P ) ). The term ˆ w a,t / ˆ π a is b ounded almost surely by sup | ˆ w a,t | / inf ˆ π a ≤ 1 /π . Th us, B nuis | B nuis | ≲ 1 X a =0 E X h | ˆ θ a ( X ) − θ a ( X ) | · | ˆ π a ( X ) − π a ( X ) | i + O p t · ∥ ˆ θ − θ ∥ 2 L 2 ( P ) . Applying the Cauch y-Sch warz inequalit y , the first term is b ounded by P 1 a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) . W e get | B nuis | ≲ O p ( ∥ ˆ θ − θ ∥ L 2 ( P ) · ∥ ˆ π − π ∥ L 2 ( P ) + t ∥ ˆ θ − θ ∥ 2 L 2 ( P ) ) . 
(23) F or a fixed t , under condition ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ) , ∥ ˆ θ a − θ a ∥ 2 L 2 ( P ) = o p ( n − 1 / 2 ) , for example ∥ ˆ θ a − θ a ∥ L 2 ( P ) = o p ( n − 1 / 4 ), ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 4 ), w e conclude that the n uisance remainder term v anishes faster than the √ n rate B nuis = o p ( n − 1 / 2 ). Step 6. Bound for Empirical Pro cess Remainder R n The empirical process re- mainder is R n = ( P n − P )[ φ t ( O ; ˆ P ) − φ t ( O ; P )]. Let the difference function b e ∆ φ t ( O ) = φ t ( O ; ˆ P ) − φ t ( O ; P ). W e aim to show √ nR n = o p (1). Since the n uisance parameters ˆ η are obtained on a sample indep enden t of the ev aluation sample used for P n (sample splitting), w e use the conditional √ n concentration b ound E [ R 2 n | ˆ η ] ≤ 1 n E (∆ φ t ( O )) 2 | ˆ η = 1 n ∥ ∆ φ t ∥ 2 L 2 ( P ) , where ∥ · ∥ L 2 ( P ) denotes the L 2 ( P ) norm of the function ∆ φ t ( O ) giv en ˆ η . The function φ t ( O ; η ) is a smo oth function of η = ( θ 0 , θ 1 , π 0 , π 1 ). Giv en the assumptions that θ a and ˆ θ a are b ounded in [0 , 1] and π a and ˆ π a are b ounded aw ay from zero (i.e., inf ˆ π a > π > 0 a.s.), φ t is lo cally Lipsc hitz con tinuous in η | ∆ φ t ( O ) | ≲ 1 X a =0 sup η ∂ φ t ∂ θ a · | ˆ θ a − θ a | + 1 X a =0 sup η ∂ φ t ∂ π a · | ˆ π a − π a | ! ≲ O ( t ) · 1 X a =0 | ˆ θ a ( X ) − θ a ( X ) | + O (1) · 1 X a =0 | ˆ π a ( X ) − π a ( X ) | ! · K ( O ) 49 where K ( O ) is a b ounded function dep ending on A, Y and the b ounding constants for ˆ π a . Squaring the difference and taking the exp ectation P ∥ ∆ φ t ∥ 2 L 2 ( P ) = E O (∆ φ t ( O )) 2 ≲ t 2 ∥ ˆ θ − θ ∥ 2 L 2 ( P ) + ∥ ˆ π − π ∥ 2 L 2 ( P ) . Substituting this b ound bac k in to the v ariance of R n E [ R 2 n | ˆ η ] ≲ 1 n h t 2 ∥ ˆ θ − θ ∥ 2 L 2 ( P ) + ∥ ˆ π − π ∥ 2 L 2 ( P ) i . T aking the square ro ot and applying the Mark ov inequality yields R n = O p n − 1 / 2 h t ∥ ˆ θ − θ ∥ L 2 ( P ) + ∥ ˆ π − π ∥ L 2 ( P ) i . Then for a fixed t , we hav e R n = o p ( n − 1 / 2 ) under the condition ∥ ˆ θ − θ ∥ L 2 ( P ) = o p (1), ∥ ˆ π − π ∥ L 2 ( P ) = o p (1). Step 7. Asymptotic linearit y and normality . Com bining the b ounds abov e, under ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ) , , we obtain the asymptotic expansion ˆ Ψ t − Ψ t ( P ) = ( P n − P ) φ t ( O ; P ) + o p ( n − 1 / 2 ) , and hence the cen tral limit theorem √ n ( ˆ Ψ t − Ψ t ( P )) d → N (0 , V ar( φ t ( O ; P ))) . A consistent v ariance estimator is ˆ σ 2 t = P n φ t ( O ; ˆ P ) − ˆ Ψ t 2 . C Pro of of iden tifabilit y of triple matc hing learning estimator (Theorem 5). W e pro ve the iden tifiability results in Theorem 5. Pr o of. W e first establish the identifabilit y of confounding represen tation Z C , and then use learned Z C to identify p otential outcomes Y ( a ). 50 Step 1: Iden tifiabilit y of the Confounding Subspace. Let the true data generating pro cess b e defined by the structural equations A = g A ( Z S ) and Y = g Y ( A, Z C ). W e do not assume g A is injective, but w e assume that for each fixed a , the conditional distribution p ( Y | A = a, Z C = · ) is injectiv e with resp ect to Z C (Assumption 3.1). (Note that if Y is deterministic, this reduces to g Y ( a, · ) b eing an injective function). Let ˆ e : A × Y → Z b Z C b e the learned enco der, defining b Z C = ˆ e ( A, Y ). Define the composite map ψ ( z c , z s ) := ˆ e ( g A ( z s ) , g Y ( g A ( z s ) , z c )) . 
W e emplo y the following regularity conditions (i) Indep endence Constraint: The learned represen tation satisfies b Z C ⊥ S . (ii) Predictiv e Sufficiency: The learned represen tation is sufficient for predicting Y from A . That is, Z C pro vides no additional information ab out Y once b Z C is known Y ⊥ ⊥ Z C | ( A, b Z C ) almost surely . In terms of densities, this implies p ( Y | A, b Z C , Z C ) = p ( Y | A, b Z C ) almost surely . (iii) Sufficien t V ariabilit y: As in Assumption 3.3, the family of conditional distributions { p ( Z S | Z C = z c , S = s ) } s is b oundedly complete. 1(a) F unctional independence ( b Z C dep ends only on Z C ). Fix an y measurable set U ⊆ Z b Z C and let D = ψ − 1 ( U ) b e its preimage. By condition (i), P ( ψ ( Z ) ∈ U | S = s ) is in v arian t to s . Using the factorization p ( z | s ) = p ( z c ) p ( z s | z c , s ), we obtain the identit y Z p ( z c ) p ( z s | z c , s 1 ) − p ( z s | z c , s 2 ) 1 D ( z c , z s ) , dz s dz c = 0 . (24) Supp ose for con tradiction that ψ dep ends on z s . Then D is not a Cartesian pro duct almost surely . Let B ∗ = { ( z c , z s ) ∈ D : { z c } × Z S ⊆ D } b e the en tangled region, whic h has p ositiv e measure. By the completeness condition (iii), the in tegral o v er this non-pro duct region cannot v anish for all s 1 , s 2 , contradicting (24). Th us, D = B × Z S almost surely , implying ψ ( z c , z s ) is constan t in z s . Hence, there exists a measurable map ϕ : Z C → Z b Z C suc h that b Z C = ϕ ( Z C ) almost surely . 51 1(b) Injectivit y via Predictiv e Sufficiency . W e show ϕ is injectiv e by contradiction. Supp ose ϕ is not injectiv e. Then there exist disjoint sets Z 1 , Z 2 ⊂ Z C with p ositiv e measure suc h that ϕ ( Z 1 ) = ϕ ( Z 2 ) = ˆ z . Fix a treatment a ∈ A . By the injectivit y of the generativ e mec hanism (Assumption 3.1), the outcome distributions conditioned on distinct latent v alues m ust differ. Th us, for any z 1 ∈ Z 1 and z 2 ∈ Z 2 , p ( Y | A = a, Z C = z 1 ) = p ( Y | A = a, Z C = z 2 ) . (Note: If Y is deterministic, these are distinct Dirac measures δ y 1 = δ y 2 ). No w consider the distribution conditioned on b Z C = ˆ z . By the Predictive Sufficiency condition (ii), w e ha v e conditional indep endence Y ⊥ Z C | ( A, b Z C ). This implies that p ( Y | A = a, b Z C = ˆ z , Z C = z ) = p ( Y | A = a, b Z C = ˆ z ) for almost all z in the fib er ϕ − 1 ( ˆ z ). Ho w ever, the left-hand side p ( Y | A = a, b Z C = ˆ z , Z C = z ) simplifies b ecause A and Z C fully determines the true conditional distribution of Y . Thus, for z 1 ∈ Z 1 and z 2 ∈ Z 2 , we m ust ha ve p ( Y | A = a, Z C = z 1 ) = p ( Y | A = a, b Z C = ˆ z ) = p ( Y | A = a, Z C = z 2 ) . This equates t w o distributions that are kno wn to b e distinct (due to injectivity), yielding a contradiction. Therefore, ϕ m ust b e injective almost surely . Under standard regularit y conditions (contin uity and matching dimensions), ϕ is an in vertible transformation. W e ha v e established that b Z C = ϕ ( Z C ) where ϕ is in v ertible, whic h implies σ ( b Z C ) = σ ( Z C ). This completes the pro of of subspace iden tifiability of Z C . Step 2: Back-door Criterion. Given the structural equations in Assumption 3.1: A = g A ( Z S ) , Y = g Y ( A, Z C ) , Z S = h ( Z C , S, ε S ) . The only common cause of A and Y is Z C (mediated through Z S to A ). Z C blo c ks all back- do or paths from A to Y . Sp ecifically , the p oten tial outcome Y ( a ) is determined b y g Y ( a, Z C ). 
Since Z C encapsulates all confounding information, we hav e conditional exchangeabilit y Y ( a ) ⊥ ⊥ A | Z C . 52 Step 3: Replacement with Iden tified Represen tation. F rom Step 1, we established that b Z C = ψ ( Z C ) where ψ is inv ertible. Th us, σ ( b Z C ) = σ ( Z C ), and conditioning on b Z C is statistically equiv alent to conditioning on Z C . Therefore, Y ( a ) ⊥ ⊥ A | b Z C . Step 4: Identification F ormula. W e explicitly deriv e the iden tification of E [ Y ( a )]. By the Law of Iterated Expectations and the indep endence sho wn in Step 3 E [ Y ( a )] = E b Z C E [ Y ( a ) | b Z C ] = E b Z C E [ Y ( a ) | A = a, b Z C ] (b y ignorabilit y Y ( a ) ⊥ ⊥ A | b Z C ) = E b Z C E [ Y | A = a, b Z C ] (b y consistency Y ( a ) = Y when A = a ) . Assumption 3.4 (Positivit y) guaran tees that P ( A = a | b Z C ) > 0 (since b Z C is isomorphic to Z C ), ensuring the conditional exp ectation E [ Y | A = a, b Z C ] is well-defined. Similarly , the A TE is iden tified as A TE( a, a ′ ) = E b Z C h E [ Y | A = a, b Z C ] − E [ Y | A = a ′ , b Z C ] i . This completes the proof. D Pro of of statistical prop erties of triple mac hine learn- ing estimator (Theorem 6) Pr o of. Let the total sample size b e N . W e randomly partition the data D into three disjoint folds I 1 , I 2 , I 3 , each of size n = N / 3. The estimator is constructed sequentially 1. Stage 1 (Represen tation Learning on I 1 ): Construct b ϕ using only data in I 1 . Th us b ϕ ⊥ ⊥ ( I 2 ∪ I 3 ). 2. Stage 2 (Nuisance Estimation on I 2 ): Using b ϕ and data I 2 , estimate b m and b π . Let b η = ( b m, b π , b ϕ ). Crucially , b η ⊥ ⊥ I 3 . 3. Stage 3 (Ev aluation on I 3 ): Compute the estimator on I 3 b ψ = P n, 3 [ φ ( O ; b η )] = 1 n X i ∈I 3 φ ( O i ; b η ) . 53 W e decomp ose the estimation error √ n ( b ψ − ψ 0 ) √ n ( b ψ − ψ 0 ) = √ n ( P n, 3 φ ( b η ) − P φ ( η 0 )) = √ n ( P n, 3 − P ) φ ( η 0 ) | {z } T 1 :Oracle CL T + √ n ( P n, 3 − P )( φ ( b η ) − φ ( η 0 )) | {z } T 2 :Empirical Pro cess + √ nP ( φ ( b η ) − φ ( η 0 )) | {z } T 3 :Bias T erm . (25) Step 1: Oracle CL T ( T 1 ). The term φ ( O ; η 0 ) is a fixed function. Since observ ations in I 3 are i.i.d., b y the standard CL T T 1 d − → N (0 , σ 2 eff ) , where σ 2 eff = V ar( φ ( O ; η 0 )) . Step 2: Empirical Pro cess ( T 2 ). W e must sho w that T 2 = o p (1). This is equiv alen t to sho wing that the unscaled empirical pro cess term ( P n, 3 − P )( φ ( b η ) − φ ( η 0 )) is o p ( n − 1 / 2 ). Let ∆( O ; b η ) = φ ( O ; b η ) − φ ( O ; η 0 ). Conditioning on the training data D train = I 1 ∪ I 2 , the function ∆( · ; b η ) is deterministic. The term T 2 can b e viewed as a sum of i.i.d. random v ariables with mean zero (conditional on D train ) E [ T 2 | D train ] = √ n E O ∼ P [( P n − P )∆( O ; b η ) | D train ] = 0 . W e analyze the conditional v ariance V ar( T 2 | D train ) = n · V ar( P n ∆( O ; b η ) | D train ) = n · 1 n V ar(∆( O ; b η ) | D train ) ≤ E [( φ ( O ; b η ) − φ ( O ; η 0 )) 2 | D train ] = ∥ φ ( b η ) − φ ( η 0 ) ∥ 2 L 2 ( P ) . Under the consistency assumption (C1), we ha ve ∥ b η − η 0 ∥ p − → 0. Assuming φ satisfies mild Lipsc hitz contin uity or the n uisances are bounded, consistency implies conv ergence in the L 2 norm of the score ∥ φ ( b η ) − φ ( η 0 ) ∥ 2 L 2 ( P ) = o p (1) . By Chebyshev’s inequality , for any ϵ > 0 P ( | T 2 | > ϵ | D train ) ≤ V ar( T 2 | D train ) ϵ 2 = o p (1) ϵ 2 p − → 0 . Th us, T 2 = o p (1). 
This confirms that the estimation noise of b η do es not affect the asymptotic distribution via the empirical pro cess term. 54 Step 3: Bias T erm ( T 3 ). W e analyze the drift term T 3 = √ n E [ φ ( O ; b η ) − φ ( O ; η 0 ) | b η ]. Define the in termediate parameter ˜ η = ( m 0 , π 0 , b ϕ ), which represents the ideal nuisance parameters given the learned represen tation b ϕ . W e decompose the bias in to a nuisance estimation error and a representation learning error T 3 = √ nP ( φ ( b η ) − φ ( ˜ η )) | {z } T 3 a (Nuisance Bias) + √ nP ( φ ( ˜ η ) − φ ( η 0 )) | {z } T 3 b (Representation Bias) . (a) Nuisance Parameter Bias ( T 3 a ). This term captures the error from estimating m and π on I 2 , conditional on the fixed representation b ϕ from I 1 . Utilizing the algebraic prop ert y of the doubly robust score function, for an y m, π and fixed representation z , the difference satisfies the exact identit y E [ φ ( m, π , z ) − φ ( m 0 , π 0 , z )] = E 1 { A = a } π ( z ) ( Y − m ( z )) − 1 { A = a } π 0 ( z ) ( Y − m 0 ( z )) + ( m ( z ) − m 0 ( z )) = − E π ( z ) − π 0 ( z ) π ( z ) m ( z ) − m 0 ( z ) . Applying this to our estimator b η given b ϕ T 3 a = − √ n E " ( b π ( b Z C ) − π 0 ( b Z C ))( b m ( b Z C ) − m 0 ( b Z C )) b π ( b Z C ) I 1 # . Note that the first-order terms v anish iden tically due to Neyman orthogonality . The remain- ing term is strictly second-order. By the Cauc hy-Sc h warz inequality and the b oundedness of 1 / b π | T 3 a | ≲ √ n ∥ b π − π 0 ∥ L 2 ( b ϕ ) ∥ b m − m 0 ∥ L 2 ( b ϕ ) . Under the robustness assumption (pro duct of rates is o p ( n − 1 / 2 )), we hav e T 3 a = o p (1). (b) Representation Bias ( T 3 b ). This term captures the sto chastic error propagated from the representation learning step (Stage 1) to the final estimation (Stage 3). Let M ( ϕ ) = E [ φ ( O ; m 0 , π 0 , ϕ )] b e the exp ected score functional. Since the score φ is gener- ally not orthogonal with resp ect to ϕ , w e p erform a functional T aylor expansion around the true representation ϕ 0 , T 3 b = √ n ( M ( b ϕ ) − M ( ϕ 0 )) = √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] | {z } Linear T erm (I) + √ n R ( b ϕ, ϕ 0 ) | {z } Remainder T erm (II) , where ∇ ϕ M ( ϕ 0 )[ h ] is the Gˆ ateaux deriv ative of M in direction h . 55 The remainder T erm (I I) is b ounded b y the square of the estimation error |R| ≤ C ∥ b ϕ − ϕ 0 ∥ 2 L 2 . F or the remainder to b e asymptotically negligible (i.e., o p (1)), we only require the quarter-rate condition ∥ b ϕ − ϕ 0 ∥ L 2 = o p ( n − 1 / 4 ) . Assuming this holds, the asymptotic behavior of T 3 b is entirely determined by the Linear T erm (I). W e consider tw o regimes • Case 1: Sup er-Efficiency (Oracle V ariance). Supp ose the representation is learned on a massiv e auxiliary dataset or conv erges strictly faster than the parametric rate ∥ b ϕ − ϕ 0 ∥ L 2 = o p ( n − 1 / 2 ) . Then, the Linear T erm (I) satisfies | √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] | ≤ C √ n ∥ b ϕ − ϕ 0 ∥ L 2 = √ n · o p ( n − 1 / 2 ) = o p (1) . The representation bias v anishes. The estimator achiev es the Oracle Efficiency Bound , with asymptotic v ariance V eff = V ar( φ ( O ; η 0 )). • Case 2: Standard Rate (V ariance Inflation). Suppose the representation is learned at the standard parametric rate (e.g., via regression or standard ML on F old 1) ∥ b ϕ − ϕ 0 ∥ L 2 = O p ( n − 1 / 2 ) . In this case, assume that b ϕ admits an asymptotic linear expansion characterized by its o wn influence function ξ ϕ b ϕ ( z ) − ϕ 0 ( z ) = 1 n 1 X j ∈I 1 ξ ϕ ( O j , z ) + o p ( n − 1 / 2 ) . 
and then define IF ϕ, rep ( O ) := ⟨∇ ϕ M ( ϕ 0 ) , ξ ϕ ( O ) ⟩ = E O ′ [ ∇ ϕ φ ( O ′ ; η 0 )] · ξ ϕ ( O ) . By the linearit y of the deriv ativ e, √ nP ( φ ( ˜ η ) − φ ( η 0 )) ≃ 1 √ n n X i =1 IF ϕ, rep ( O i ) , 56 and the Linear T erm (I) conv erges in distribution √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] d − → N (0 , V rep ) , where V rep = V ar(IF ϕ, rep ( O )) is the v ariance con tribution from the represen tation learn- ing step. Because b ϕ is estimated on I 1 and the ev aluation score φ is computed on the indep enden t fold I 3 , the error term ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] and the oracle influence function IF oracle ( O i ) are uncorrelated. This justifies the decoupled summation of v ariances in V total . The total asymptotic v ariance hence inflates to V total = V eff + ρ · V rep , where ρ accoun ts for the ratio of sample sizes betw een folds. Standard errors m ust be corrected to accoun t for V rep . On thing w e need to emphasize is that w e ac knowledge that establishing the exact asymp- totic linearity for highly non-conv ex deep learning mo dels lik e V AEs remains an op en theoret- ical c hallenge. Our deriv ation of IF ϕ, rep op erates under the premise that the representation learner con v erges to an isolated lo cal optim um, b ehaving asymptotically as a regularized M-estimator, or alternativ ely , op erates in a regime where the neural tangen t kernel (NTK) affords linear resp onsiveness Jacot et al. (2018). Com bining the steps, if the bias terms v anish ( o p (1)), w e ha v e √ n ( b ψ − ψ 0 ) = T 1 + o p (1) d − → N (0 , σ 2 eff ). 57
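To make the three-stage splitting concrete, the following is a minimal sketch of the sequential cross-fitting described in Appendix D, assuming a binary treatment. The fitters fit_representation, fit_outcome, and fit_propensity are hypothetical placeholders for the VAE-based encoder and nuisance learners, and the reported standard error is the naive plug-in one, i.e., it does not include the representation-error correction V_rep discussed above.

```python
import numpy as np

def triple_ml_ate(S, A, Y, fit_representation, fit_outcome, fit_propensity, seed=0):
    """Three-fold sequential cross-fitting sketch: Stage 1 learns the representation on I1,
    Stage 2 fits nuisances on I2, Stage 3 evaluates the doubly robust score on I3.

    fit_representation(S, A, Y) -> encoder mapping (A, Y) to Z_C-hat (e.g., an HSIC-regularized VAE)
    fit_outcome(Z, A, Y)        -> function m_hat(a, z) approximating E[Y | A=a, Z_C-hat=z]
    fit_propensity(Z, A)        -> function pi_hat(z) approximating P(A=1 | Z_C-hat=z)
    All three fitters are user-supplied placeholders.
    """
    n = len(Y)
    idx = np.random.default_rng(seed).permutation(n)
    I1, I2, I3 = np.array_split(idx, 3)

    # Stage 1: learn the confounding representation on fold I1 only
    encoder = fit_representation(S[I1], A[I1], Y[I1])

    # Stage 2: estimate nuisances on fold I2, using the fold-I1 encoder
    Z2 = encoder(A[I2], Y[I2])
    m_hat = fit_outcome(Z2, A[I2], Y[I2])
    pi_hat = fit_propensity(Z2, A[I2])

    # Stage 3: evaluate the doubly robust score on fold I3
    Z3 = encoder(A[I3], Y[I3])
    p = np.clip(pi_hat(Z3), 1e-3, 1 - 1e-3)        # weight clipping for numerical stability
    m1, m0 = m_hat(1, Z3), m_hat(0, Z3)
    phi = (m1 - m0
           + A[I3] / p * (Y[I3] - m1)
           - (1 - A[I3]) / (1 - p) * (Y[I3] - m0))
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(phi))
```

In practice one would rotate the roles of the three folds and average, as in the K = 6 scheme used in the simulations, and correct the standard error for the representation-learning contribution when the super-convergence condition fails.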