Nonparametric Identification and Inference for Counterfactual Distributions with Confounding
Jianle Sun¹ & Kun Zhang¹,²
¹Department of Philosophy, Carnegie Mellon University
²Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence
February 19, 2026

Abstract

We propose nonparametric identification and semiparametric estimation of joint potential outcome distributions in the presence of confounding. First, in settings with observed confounding, we derive tighter, covariate-informed bounds on the joint distribution by leveraging conditional copulas. To overcome the non-differentiability of the bounding min/max operators, we establish the asymptotic properties of both a direct estimator under a polynomial margin condition and a smooth approximation based on the log-sum-exp operator, facilitating valid inference for individual-level effects under the canonical rank-preserving assumption. Second, we tackle the challenge of unmeasured confounding by introducing a causal representation learning framework. By utilizing instrumental variables, we prove the nonparametric identifiability of the latent confounding subspace under injectivity and completeness conditions. We develop a "triple machine learning" estimator that employs a cross-fitting scheme to sequentially handle the learned representation, the nuisance parameters, and the target functional. We characterize the asymptotic distribution with variance inflation induced by representation learning error, and provide conditions for semiparametric efficiency. We also propose a practical VAE-based algorithm for confounding representation learning. Simulations and a real-world analysis validate the effectiveness of the proposed methods. By bridging classical semiparametric theory with modern representation learning, this work provides a robust statistical foundation for distributional and counterfactual inference in complex causal systems.

Keywords: counterfactual inference, conditional copulas, instrumental variable, causal representation learning, semiparametric efficiency, double machine learning

1 Introduction

Causal inference fundamentally aims to predict how individuals or populations respond to competing interventions, thereby concerning the comparison of potential outcomes under alternative treatment regimes. While classical estimands such as the Average Treatment Effect (ATE) focus on mean differences, many scientifically relevant questions, such as the probability of benefit, quantile effects, or distributional shifts, depend on the entire distributions of potential outcomes. However, researchers typically face a dual hurdle in capturing these distributions. First, in the presence of confounding, even the marginal distributions of $Y(1)$ and $Y(0)$ are generally not identifiable, and existing instrumental variable (IV) approaches often rely on restrictive parametric assumptions or focus only on local effects Angrist et al. (1996), Swanson et al. (2018). Second, even when conditional ignorability holds and the marginals are identifiable, the joint distribution of $(Y(1), Y(0))$ remains fundamentally unobservable without additional structural assumptions.

This paper bridges these gaps by providing a unified, principled framework for distributional causal inference. Our first contribution addresses the "missing data" problem of the joint distribution under the assumption of no unmeasured confounding.
We develop tight, covariate-informed Fréchet–Hoeffding (FH) bounds Nelsen (2006) on the joint distribution under no unmeasured confounding. Leveraging conditional copulas, we show that the sharp upper bound admits a clear structural interpretation as conditional rank preservation (or conditional comonotonicity) Nelsen (2006), a canonical assumption underlying individual treatment effect estimation and counterfactual reasoning Xie et al. (2023), Wu et al. (2025). To move from theory to practice, we address the non-smoothness of these bounds via two complementary paths: a direct estimator under a polynomial margin condition and a smooth log-sum-exp approximation. We further establish their asymptotic properties, enabling valid frequentist inference and confidence intervals for rank-preserving structures Levis et al. (2025).

Our second contribution tackles the more daunting scenario where confounding is unmeasured; in this scenario even the nonparametric identification of the marginal distributions is non-trivial. Inspired by recent advances in causal representation learning Kong et al. (2022), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025), Moran & Aragam (2026), we propose a representation-learning-based framework that leverages IVs to recover latent confounding structures. Under suitable completeness and independence assumptions, we show that the confounding subspace is identified up to an invertible transformation. This allows the learned representation to serve as a valid proxy for unobserved confounders, thereby enabling identification of marginal potential outcome distributions in complex settings.

To implement this framework, we introduce a triple machine learning (TML) procedure that extends double machine learning Chernozhukov et al. (2018), Kennedy (2024) by incorporating an additional cross-fitting stage for representation learning. We rigorously characterize the impact of the first-stage representation error on the variance inflation of the asymptotic distribution, identifying regimes in which super-convergence of the representation learner yields semiparametric efficiency.

For practical estimation of confounding representations, we propose an Instrumental Variable Variational Autoencoder (IV-VAE) augmented with a Hilbert-Schmidt Independence Criterion (HSIC) penalty Gretton et al. (2005) to ensure that the recovered latent factors are truly exogenous to the instrument, satisfying the core identifying assumptions. Rather than explicitly modeling instrument-dependent latent factors, we adopt a reduced-form design that conditions the decoder directly on the observed instrument, allowing instrument-induced variation to be absorbed while isolating the latent confounding structure.

The remainder of the paper is organized as follows. Section 2 details the identification of rank-preserving bounds via conditional copulas and presents the asymptotic theory for the proposed estimators. Section 3 introduces the representation learning framework for unmeasured confounding, derives the properties of the triple machine learning estimator, and proposes an effective VAE-based learning approach. Section 4 discusses the synthesis of these methods. Section 5 provides simulation results. Section 6 applies the proposed methods to analyze the demand for cigarettes in the US. Proofs and technical details are deferred to the Appendix.
2 Bounding joint counterfactuals with all confounders observed via conditional copulas

When there is no unobserved confounding, the identification of the counterfactual marginals $F_{Y(a)}$ is straightforward. We briefly review the identification results for the marginal distributions of potential outcomes and then discuss how to leverage conditional copulas of the observed confounding covariates to derive tighter bounds for the joint distribution of the potential outcomes. We are particularly interested in the upper bound (which corresponds to the joint distribution under a conditional rank-preserving assumption), as this provides an important foundation for performing individualized counterfactual inference. However, we must also address the estimation challenges posed by the non-differentiable min/max functions.

2.1 Identification

2.1.1 Identification of counterfactual marginals

We observe $n$ i.i.d. draws $O_i = (Y_i, A_i, X_i) \sim P$, $i = 1, \dots, n$. Here $A \in \{0, 1\}$ is the treatment, $Y \in \mathbb{R}$ is the outcome, and $X \in \mathcal{X} \subset \mathbb{R}^d$ are observed covariates. Let $Y(a)$ denote the potential (counterfactual) outcome under treatment level $a$; its marginal distribution can be identified under the standard assumptions.

Definition 1 (Nuisance functions and distributions). Let $X$ be a vector of covariates (observed confounders).
1. Let $\theta_a(X) := F_{Y(a)|X}(y \mid X) = P(Y(a) \le y \mid X)$ be the conditional CDF of the potential outcome $Y(a)$ given $X$.
2. Let $F_{Y(a)}(y) := E_X[\theta_a(X)]$ be the marginal CDF of $Y(a)$, obtained by integrating the conditional CDF over the distribution of $X$.

Theorem 1 (Identification of counterfactual marginals). Assume the standard causal identification conditions:
(i) Consistency: $Y = Y(A)$.
(ii) Conditional ignorability (no unmeasured confounding): $(Y(1), Y(0)) \perp A \mid X$.
(iii) Positivity: $P(A = a \mid X) > 0$ almost surely for $a = 0, 1$.
Under conditions (i)–(iii), the conditional marginals $F_{Y(a)|X}(y \mid x)$ are identifiable from observed data,
$$ \theta_a(x) := F_{Y(a)|X}(y \mid x) = P(Y \le y \mid A = a, X = x), $$
and so are the marginals,
$$ F_{Y(a)}(y) = E[\theta_a(X)] = E_X\big[F_{Y(a)|X}(y \mid X)\big] = \int F_{Y(a)|X}(y \mid x)\, dF_X(x). $$

2.1.2 Conditional Fréchet–Hoeffding bounds on counterfactual joint distributions

We aim to move beyond counterfactual marginals to the joint distribution. Generally speaking, by Sklar's theorem, any joint distribution can be represented by its marginals $F_{Y(1)}(y_1)$ and $F_{Y(0)}(y_0)$ and a copula function $C$, such that $F_{Y(1),Y(0)}(y_1, y_0) = C(F_{Y(1)}(y_1), F_{Y(0)}(y_0))$. Without further assumptions, the copula $C$ is only bounded by the Fréchet–Hoeffding bounds $L(u_1, u_0) \le C(u_1, u_0) \le U(u_1, u_0)$, where $L(u_1, u_0) = \max(u_1 + u_0 - 1, 0)$ (the countermonotonicity copula) and $U(u_1, u_0) = \min(u_1, u_0)$ (the comonotonicity copula). This implies the marginal bounds on the joint distribution
$$ \max\{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) - 1, 0\} \le F_{Y(1),Y(0)}(y_1, y_0) \le \min\{F_{Y(1)}(y_1), F_{Y(0)}(y_0)\}. $$
We can also bound the joint distribution via conditional copulas, which yields a tighter result. Using a conditional version of Sklar's theorem, we can write $F_{Y(1),Y(0)|X}(y_1, y_0 \mid x) = C_x(\theta_1(x), \theta_0(x))$, where $\theta_a(x) = F_{Y(a)|X}(y_a \mid x)$ and $C_x$ is a conditional copula that may depend on $x$.
For any $x$, this copula is bounded by the same Fréchet–Hoeffding limits,
$$ \max\{\theta_1(x) + \theta_0(x) - 1, 0\} \le F_{Y(1),Y(0)|X}(y_1, y_0 \mid x) \le \min\{\theta_1(x), \theta_0(x)\}. $$
Integrating over $X$ (by the law of total expectation) gives the unconditional bounds
$$ L(y_1, y_0) \le F_{Y(1),Y(0)}(y_1, y_0) \le U(y_1, y_0), $$
where
$$ L(y_1, y_0) := E_X\big[\max\{\theta_1(X) + \theta_0(X) - 1, 0\}\big], \qquad (1) $$
$$ U(y_1, y_0) := E_X\big[\min\{\theta_1(X), \theta_0(X)\}\big]. \qquad (2) $$
In particular, the conditional upper bound $M(\theta_1(x), \theta_0(x)) = \min\{\theta_1(x), \theta_0(x)\}$ is achieved under the conditional rank-preserving (or conditional comonotonicity) assumption. This assumption states that for each $X = x$, $Y(1)$ and $Y(0)$ are comonotone functions of the same latent rank variable $U \sim \mathrm{Unif}[0,1]$, such that $Y(a) = F^{-1}_{Y(a)|X}(U \mid X)$. The integrated upper bound $U(y_1, y_0)$ thus corresponds to assuming that this conditional rank preservation holds across all $x$.

Theorem 2. Consider bounds on $F_{Y(1),Y(0)}(y_1, y_0)$.
1. The covariate-conditional bounds are derived by first applying the Fréchet–Hoeffding bounds conditional on $X$ and then integrating over $X$:
$$ L(y_1, y_0) := E_X\big[\max\{\theta_1(X) + \theta_0(X) - 1, 0\}\big], \qquad U(y_1, y_0) := E_X\big[\min\{\theta_1(X), \theta_0(X)\}\big]. $$
2. The marginal bounds are derived by applying the Fréchet–Hoeffding bounds directly to the marginal distributions $F_{Y(1)}(y_1)$ and $F_{Y(0)}(y_0)$:
$$ L_{\mathrm{marg}}(y_1, y_0) := \max\{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) - 1, 0\}, \qquad U_{\mathrm{marg}}(y_1, y_0) := \min\{F_{Y(1)}(y_1), F_{Y(0)}(y_0)\}. $$
The bounds on the joint cumulative distribution function (CDF) $F_{Y(1),Y(0)}(y_1, y_0)$ derived from covariate-conditional distributions are always tighter than, or equal to, the bounds derived from the marginal distributions, i.e.,
$$ L_{\mathrm{marg}}(y_1, y_0) \le L(y_1, y_0) \quad \text{and} \quad U(y_1, y_0) \le U_{\mathrm{marg}}(y_1, y_0), $$
which implies that $[L(y_1, y_0), U(y_1, y_0)] \subseteq [L_{\mathrm{marg}}(y_1, y_0), U_{\mathrm{marg}}(y_1, y_0)]$.

The proof relies on a direct application of Jensen's inequality (see Appendix A): since $\min$ is concave and $\max\{\cdot + \cdot - 1, 0\}$ is convex, $E_X[\min\{\theta_1(X), \theta_0(X)\}] \le \min\{E[\theta_1(X)], E[\theta_0(X)]\} = U_{\mathrm{marg}}(y_1, y_0)$, and analogously $L(y_1, y_0) \ge L_{\mathrm{marg}}(y_1, y_0)$. Equality holds if and only if the conditional CDFs $\theta_a(X)$ are constant almost surely with respect to $X$, implying that the covariates $X$ provide no additional information beyond the marginals.

2.2 Estimation and Inference

In what follows we present estimation and inference for the functional of interest
$$ \Psi(P) := E_{X \sim P_X}\big[\phi\big(\theta_0(X), \theta_1(X)\big)\big], \qquad (3) $$
where $\phi_U(x, y) = \min\{x, y\}$ and $\phi_L(x, y) = \max\{x + y - 1, 0\}$ correspond to the upper and lower bound, respectively. We mainly focus on the upper bound $\Psi(P) = U(y_1, y_0)$, since it corresponds to the rank-preserving joint distribution when conditioning on all covariates, which is plausible and meaningful in many real counterfactual reasoning problems. Results for $L(y_1, y_0)$ can be obtained analogously. We use standard notation: $P_n$ is the empirical measure, $\|\cdot\|_{L_2(P)}$ denotes the $L_2(P)$-norm, $\|\cdot\|_{P,\infty}$ the essential sup norm, and $\stackrel{p}{\to}$ convergence in probability. Let $\hat\theta_a(X)$ denote estimators of the nuisance functions $\theta_a(X)$, obtained for example via conditional CDF regression or flexible machine learning methods with cross-fitting. A natural choice is the plug-in estimator $\hat\Psi_{\mathrm{plug\text{-}in}} = P_n\,\phi\{\hat\theta_1(X_i), \hat\theta_0(X_i)\}$, but its accuracy is limited by the estimation error of the nuisance functions $\theta_a(x)$, which generally precludes $\sqrt{n}$-inference.
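To fix ideas, a minimal plug-in sketch is given below. It assumes, purely for illustration, that $\theta_a(x)$ is fit by a gradient-boosted classifier of the event $1\{Y \le y_a\}$ within each treatment arm; the function name and model choice are ours and not prescribed here, and the doubly robust correction discussed next is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def plugin_fh_bounds(Y, A, X, y1, y0):
    """Plug-in estimates of the marginal and covariate-conditional
    Frechet-Hoeffding bounds on P(Y(1) <= y1, Y(0) <= y0).

    theta_a(x) = P(Y <= y_a | A = a, X = x) is fit by classifying the
    binary event 1{Y <= y_a} within each arm (a sketch; any conditional
    CDF learner could be substituted, and both classes must be present)."""
    theta = {}
    for a, ya in [(1, y1), (0, y0)]:
        arm = (A == a)
        clf = GradientBoostingClassifier().fit(X[arm], (Y[arm] <= ya).astype(int))
        theta[a] = clf.predict_proba(X)[:, 1]     # evaluated on the full sample

    # Covariate-conditional bounds, averaged over the empirical law of X
    L_cond = np.mean(np.maximum(theta[1] + theta[0] - 1.0, 0.0))
    U_cond = np.mean(np.minimum(theta[1], theta[0]))

    # Marginal (Makarov) bounds built from F_{Y(a)}(y_a) = E[theta_a(X)]
    F1, F0 = theta[1].mean(), theta[0].mean()
    L_marg = max(F1 + F0 - 1.0, 0.0)
    U_marg = min(F1, F0)
    return (L_marg, U_marg), (L_cond, U_cond)
```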
Instead, we consider the doubly robust, augmented inverse-propensity-weighted (AIPW) form. We first give explicit influence-function components for the conditional CDFs. When $\phi$ is non-smooth (min or max), the partial derivatives do not exist on the boundary set, so we present (A) the direct (non-smooth) route under a polynomial margin condition, and (B) the smooth approximation route (log-sum-exp) to handle the issue Levis et al. (2025).

2.2.1 Direct Estimation under a Margin Condition

We want to estimate $\Psi(P) = E[\min\{\theta_0(X), \theta_1(X)\}] = E[\theta_{d(X)}(X)]$. Write $\pi_a(x) = P(A = a \mid X = x)$. Define the unknown "oracle" selector
$$ d(x) = \arg\min_{a \in \{0,1\}} \theta_a(x), \qquad \theta_a(x) = P(Y \le y \mid A = a, X = x), $$
and replace it by the plug-in selector $\hat d(x) = \arg\min_a \hat\theta_a(x)$. To analyze the asymptotic behavior of this hybrid estimator, we introduce the following polynomial margin condition.

Assumption 2.1 (Polynomial margin). There exist constants $\alpha > 0$ and $C < \infty$ such that for all $t > 0$,
$$ P\big(|\theta_1(X) - \theta_0(X)| \le t\big) \le C\, t^{\alpha}. $$

The parameter $\alpha$ characterizes the degree of separation between the two potential outcome distributions at the unit level. Geometrically, it quantifies the probability mass of the covariate space where the two conditional CDFs, $\theta_1(X)$ and $\theta_0(X)$, are nearly identical. The assumption controls the mass of $X$ near the "tie" or "ambiguity" region $\{x : \theta_0(x) = \theta_1(x)\}$. A large $\alpha$ implies a strong margin, where the population is clearly partitioned into regions in which one potential outcome is strictly more likely to be below the threshold than the other. On the contrary, a small $\alpha$ signifies a weak margin, indicating a high density of individuals whose conditional ranks are indistinguishable. This creates a noise-sensitive boundary where small estimation errors in $\hat\theta_a$ can lead to frequent mis-selection in $\hat d(x)$, which, as we will see, results in a more stringent sup-norm convergence rate requirement to control the bias.

Theorem 3 (Asymptotic properties of the margin-based direct estimator). Suppose Assumption 2.1 holds, the boundedness condition that there exist constants $0 < \underline{c} < \bar{c} < \infty$ with $\underline{c} < \hat\pi_a(x) < \bar{c}$ almost surely is satisfied, and cross-fitting is used (or Donsker conditions hold). Then the estimator
$$ \hat\Psi = P_n\big[\varphi(O; \hat P, \hat d)\big] = P_n\Big[\sum_{a=0}^{1} 1\{\hat d(X) = a\}\Big(\hat\theta_a(X) + \frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big)\Big)\Big] $$
• (consistency) is consistent, $|\hat\Psi - \Psi| = o_p(1)$, when the nuisance estimators satisfy
$$ \max_a \|\hat\theta_a - \theta_a\|_\infty = o_p(1), \qquad \|\hat\theta_a - \theta_a\|_{L_2(P)}\, \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(1), \quad a = 0, 1; $$
• (root-$n$ consistency and asymptotic normality) satisfies
$$ \hat\Psi - \Psi = (P_n - P)\,\varphi(O; P, d) + o_p(n^{-1/2}), \qquad \text{hence} \quad \sqrt{n}(\hat\Psi - \Psi) \stackrel{d}{\to} N\big(0, \mathrm{Var}[\varphi(O; P, d)]\big), $$
when the nuisance estimators satisfy
$$ \max_a \|\hat\theta_a - \theta_a\|_\infty = o_p\big(n^{-1/(2(1+\alpha))}\big), \qquad \|\hat\theta_a - \theta_a\|_{L_2(P)}\, \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/2}), \quad a = 0, 1. $$
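A minimal sketch of this direct estimator is shown below; `theta_hat` and `pi_hat` stand for cross-fitted nuisance estimators supplied by the user (hypothetical interfaces), and the reported standard error is the plug-in variance of the estimated influence-function values.

```python
import numpy as np

def direct_fh_upper(Y, A, X, y_thresh, theta_hat, pi_hat):
    """Direct (non-smooth) estimator of U = E[min{theta_0(X), theta_1(X)}]
    with a plug-in selector d_hat(x) = argmin_a theta_hat_a(x).

    theta_hat[a], pi_hat[a]: callables returning cross-fitted estimates of
    theta_a(x) = P(Y <= y | A = a, X = x) and pi_a(x) = P(A = a | X = x)."""
    th0, th1 = theta_hat[0](X), theta_hat[1](X)
    d_hat = (th1 < th0).astype(int)           # selected arm (1 if theta_1 smaller)
    psi_terms = np.zeros(len(Y))
    for a, th_a in [(0, th0), (1, th1)]:
        selected = (d_hat == a)
        ipw = (A == a).astype(float) / np.clip(pi_hat[a](X), 1e-3, None)
        correction = ipw * ((Y <= y_thresh).astype(float) - th_a)
        psi_terms += selected * (th_a + correction)
    est = psi_terms.mean()
    se = psi_terms.std(ddof=1) / np.sqrt(len(Y))   # plug-in influence-function SE
    return est, se
```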
The rate conditions can be established by following the standard semiparametric argument, i.e., the von Mises decomposition
$$ \hat\Psi - \Psi(P) = P_n\big[\varphi(O; \hat P, \hat d)\big] - P\big[\varphi(O; P, d)\big] \qquad (4) $$
$$ = (P_n - P)\,\varphi(O; P, d) + (P_n - P)\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] + P\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] = S + R_1 + R_2, $$
where $S$ is the standard CLT term, $R_1$ is the empirical process term and can be easily bounded when sample splitting is used, and the bias $R_2$ can be handled by further separating the errors in the nuisance functionals and the selector,
$$ P\big[\varphi(O; \hat P, \hat d) - \varphi(O; P, d)\big] = \underbrace{P\big[\varphi(O; \hat P, d) - \varphi(O; P, d)\big]}_{\text{nuisance error } B_{\mathrm{nuis}}} + \underbrace{P\big[\varphi(O; \hat P, \hat d) - \varphi(O; \hat P, d)\big]}_{\text{selector error } B_{\mathrm{sel}}}. $$
The detailed proof is given in Appendix B.2.

Remark 1. The selector-rate requirement $\|\hat\theta - \theta\|_\infty = o_p(n^{-1/(2(1+\alpha))})$ is the main extra price paid by the direct method: when we plug in the estimated selector, we need uniform error control. The parameter $\alpha$ quantifies the density of the population near the "tie" region where $\theta_1(X) = \theta_0(X)$, representing the degree of separation between conditional potential outcome ranks. A larger $\alpha$ indicates a stronger margin with fewer ambiguous cases, which enhances the stability of the plug-in selector $\hat d(x)$ and yields faster convergence rates for the non-smooth direct estimator. In applications one typically implements cross-fitting and uses modern ML estimators for $\theta_a$ and $\pi_a$; verifying the sup-norm condition may require specially tailored estimators (series/sieve estimators with tuned complexity or uniformly consistent kernel estimators).

2.2.2 Smooth Approximation via Log-Sum-Exp

In the direct method, the non-differentiability of $\phi(u, v) = \min(u, v)$ creates boundary complications. As an alternative, we approximate $\min$ by a smooth function based on the log-sum-exp operator,
$$ g_t(u, v) := -\frac{1}{t}\log\big(e^{-tu} + e^{-tv}\big), \qquad (u, v) \in [0, 1]^2, $$
where $g_t(u, v)$ is a smooth function satisfying $\min(u, v) - \frac{\log 2}{t} \le g_t(u, v) \le \min(u, v)$ and $\lim_{t \to \infty} g_t(u, v) = \min(u, v)$. Then $\Psi(P)$ can be approximated via $\Psi_t(P) = E\big[g_t(\theta_0(X), \theta_1(X))\big]$, which is continuously Gateaux-differentiable in $P$ for any finite $t$, and $\Psi_t(P) \uparrow \Psi(P)$ as $t \to \infty$.

Theorem 4 (Properties of the smooth log-sum-exp estimator). Let $\Psi_t = E[g_t(\theta_0(X), \theta_1(X))]$ with $g_t(u, v) = -t^{-1}\log(e^{-tu} + e^{-tv})$ be the smooth approximation of $\Psi$, and let $\hat\Psi_t$ be the estimator using nuisance estimators $\hat\theta_a$ and $\hat\pi_a$, $a = 0, 1$, obtained on an independent sample (sample splitting),
$$ \hat\Psi_t = P_n\big[\varphi_t(O; \hat P)\big] = P_n\Big[\sum_{a=0}^{1} \hat w_{a,t}(X)\,\frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big) + g_t\big(\hat\theta_0(X), \hat\theta_1(X)\big)\Big], $$
where
$$ \hat w_{a,t}(X) = \frac{e^{-t\hat\theta_a(X)}}{e^{-t\hat\theta_0(X)} + e^{-t\hat\theta_1(X)}}. $$
Suppose the boundedness assumption holds: there exist constants $0 < \underline{c} < \bar{c} < \infty$ such that $\underline{c} < \hat\pi_a(x) < \bar{c}$ almost surely. Then for a fixed $t$, the estimator $\hat\Psi_t$ satisfies the following properties:
• (consistency) Under the rate conditions on the nuisance estimators
$$ \|\hat\theta_a - \theta_a\|_{L_2(P)} \cdot \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(1), \qquad \|\hat\theta_a - \theta_a\|^2_{L_2(P)} = o_p(1), \quad a = 0, 1, $$
$\hat\Psi_t$ is consistent for $\Psi_t(P)$.
• (root-$n$ consistency and asymptotic normality) Under the product rate conditions on the nuisance estimators
$$ \|\hat\theta_a - \theta_a\|_{L_2(P)} \cdot \|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/2}), \qquad \|\hat\theta_a - \theta_a\|^2_{L_2(P)} = o_p(n^{-1/2}), \quad a = 0, 1, $$
i.e., $\|\hat\theta_a - \theta_a\|_{L_2(P)} = o_p(n^{-1/4})$ and $\|\hat\pi_a - \pi_a\|_{L_2(P)} = o_p(n^{-1/4})$, the estimator $\hat\Psi_t$ admits the linear expansion
$$ \hat\Psi_t - \Psi_t(P) = (P_n - P)\,\varphi_t(O; P) + o_p(n^{-1/2}), $$
and therefore
$$ \sqrt{n}\big(\hat\Psi_t - \Psi_t(P)\big) \stackrel{d}{\to} N\big(0, \mathrm{Var}(\varphi_t(O; P))\big), $$
so that the estimator converges at the $\sqrt{n}$ rate, $\sqrt{n}(\hat\Psi_t - \Psi_t(P)) = O_p(1)$.

Similarly, the asymptotic normality is established via the von Mises decomposition
$$ \hat\Psi_t - \Psi_t(P) = \underbrace{(P_n - P)\,\varphi_t(O; P)}_{S} + \underbrace{P\big[\varphi_t(O; \hat P) - \varphi_t(O; P)\big]}_{R_{\mathrm{nuis}}} + \underbrace{(P_n - P)\big[\varphi_t(O; \hat P) - \varphi_t(O; P)\big]}_{R_n}, \qquad (5) $$
where $S$ is the standard CLT term, $R_n$ is the empirical process term, and $R_{\mathrm{nuis}}$ is the remaining bias. The detailed proof is given in Appendix B.3.

Note that the above asymptotic properties of the smooth-approximation estimator $\hat\Psi_t$ are established for a fixed smoothing parameter $t$. We are also interested in how well the smooth estimator $\hat\Psi_t$ approximates the true parameter $\Psi(P)$, i.e., in its behavior as $t$ goes to infinity. The smooth estimator gets closer and closer to the truth, but we pay a price for it.

Remark 2 (Bias-variance trade-off in the limit $t \to \infty$). To achieve the semiparametric efficiency bound, we decompose
$$ \sqrt{n}\big(\hat\Psi_t - \Psi(P)\big) = \underbrace{\sqrt{n}\big(\hat\Psi_t - \Psi_t(P)\big)}_{\text{statistical error}} + \underbrace{\sqrt{n}\big(\Psi_t(P) - \Psi(P)\big)}_{\text{approximation bias}}. $$
The term $\sqrt{n}(\Psi(P) - \Psi_t(P))$ is the smoothing bias and is bounded by $O(\sqrt{n}/t)$; to make it vanish we need $\sqrt{n}/t \to 0$, i.e., $t = \omega(\sqrt{n})$. The term $\sqrt{n}(\hat\Psi_t - \Psi_t(P))$ is the statistical error, converging to $N(0, \mathrm{Var}(\varphi_t(O; P)))$ for a fixed $t$. When $t$ is allowed to grow,
$$ \sqrt{n}\,(B_{\mathrm{nuis}} + R_n) \approx \sqrt{n}\cdot O_p\big(t_n \|\hat\theta - \theta\|_2^2\big) + \sqrt{n}\cdot O_p\big(n^{-1/2}\, t_n \|\hat\theta - \theta\|_2\big) = O_p\big(\sqrt{n}\, t_n \|\hat\theta - \theta\|_2^2\big) + O_p\big(t_n \|\hat\theta - \theta\|_2\big). $$
To make $\Psi_t(P) \uparrow \Psi(P)$ and $\varphi_t(O; P) \to \varphi_{\mathrm{oracle}}(O; P)$ in $L_2(P)$, we need $t_n \cdot \|\hat\theta - \theta\|_{L_2(P)} = o_p(1)$ and $\sqrt{n}\, t_n \|\hat\theta - \theta\|^2_{L_2(P)} = o_p(1)$, which requires $\|\hat\theta - \theta\|_{L_2(P)} = o_p(n^{-1/2})$ rather than $o_p(n^{-1/4})$ when $t = \omega(\sqrt{n})$. This illustrates the bias-variance trade-off.

Remark 3 (Lower bound). For the lower bound $\phi(\theta_0, \theta_1) = \max(\theta_0(X) + \theta_1(X) - 1, 0)$, similar logic applies. We can approximate it via
$$ g_t(\theta_0, \theta_1) = \frac{1}{t}\log\big(1 + e^{t[\theta_0(X) + \theta_1(X) - 1]}\big), $$
and the corresponding IF-based estimator is
$$ \hat\Psi^L_t = P_n\Big[\sum_{a=0}^{1} \hat w_t(X)\,\frac{1\{A = a\}}{\hat\pi_a(X)}\big(1\{Y \le y\} - \hat\theta_a(X)\big) + g_t\big(\hat\theta_0(X), \hat\theta_1(X)\big)\Big], $$
where $\hat w_t(X) = \sigma(t\hat S) = \frac{e^{t(\hat\theta_0(X) + \hat\theta_1(X) - 1)}}{1 + e^{t(\hat\theta_0(X) + \hat\theta_1(X) - 1)}}$ and $\hat S = \hat\theta_0(X) + \hat\theta_1(X) - 1$.

Beyond addressing non-differentiability, the smooth approximation offers a crucial inferential advantage. When the true parameter is exactly on the boundary, classical normal approximations and standard bootstrap procedures are known to be inconsistent Andrews (2000), while $g_t(\cdot)$ pulls the estimand strictly into the interior of the parameter space, ensuring the asymptotic validity of Wald-type confidence intervals.
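The log-sum-exp estimator of Theorem 4 translates almost directly into code. The sketch below uses a numerically stable softmin; as before, `theta_hat` and `pi_hat` are assumed to be cross-fitted nuisance estimators (hypothetical interfaces), and the Wald interval targets $\Psi_t(P)$ for the fixed $t$ supplied.

```python
import numpy as np

def smooth_fh_upper(Y, A, X, y_thresh, theta_hat, pi_hat, t=50.0):
    """Smooth log-sum-exp estimator of the conditional FH upper bound for a
    fixed smoothing parameter t, with softmin weights w_{a,t}(X)."""
    th = np.stack([theta_hat[0](X), theta_hat[1](X)], axis=0)   # shape (2, n)

    # Numerically stable softmin weights and log-sum-exp value:
    # g_t = -log(sum_a exp(-t*th_a)) / t = m - log(sum_a exp(-t*(th_a - m))) / t
    m = th.min(axis=0)
    expo = np.exp(-t * (th - m))
    w = expo / expo.sum(axis=0)                                 # w_{a,t}(X)
    g_t = m - np.log(expo.sum(axis=0)) / t

    psi_terms = g_t.copy()
    for a in (0, 1):
        ipw = (A == a).astype(float) / np.clip(pi_hat[a](X), 1e-3, None)
        psi_terms += w[a] * ipw * ((Y <= y_thresh).astype(float) - th[a])

    est = psi_terms.mean()
    se = psi_terms.std(ddof=1) / np.sqrt(len(Y))
    return est, (est - 1.96 * se, est + 1.96 * se)              # Wald CI for Psi_t
```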
3 Counterfactual marginals with unmeasured confounding: Triple Machine Learning via a Representation-Learning-Based IV Approach

We now move beyond the simple case with no unobserved confounding to the more complex scenario where unobserved confounding is present. In this setting, even the marginal distributions of the potential outcomes are typically non-identifiable. Furthermore, even the introduction of auxiliary variables, such as IVs in two-stage least squares (2SLS) regressions, either requires (strong) parametric assumptions like linear structural equation models (SEMs) to identify the ATE, or can only identify the local average treatment effect (LATE) for compliers in a nonparametric model under additional assumptions such as monotonicity and binary treatments Angrist et al. (1996), Swanson et al. (2018). Nonparametric IV methods have gained significant attention in recent years Newey (2013), Levis et al. (2024), but they have focused more on the identification and estimation of average treatment effects rather than the full counterfactual distribution.

The challenge of unmeasured confounding has traditionally been viewed as an identification bottleneck that can only be breached by rigid functional forms. However, the emerging field of Causal Representation Learning (CRL) Schölkopf et al. (2021), Moran & Aragam (2026) offers an alternative perspective by framing the recovery of hidden confounders as an identifiable latent variable modeling problem. Unlike standard representation learning, which focuses on extracting features for optimal prediction, CRL aims to identify the underlying causal variables and their structural relations from high-dimensional observations through structural constraints given by auxiliary information such as exogenous instruments. This shift from treating confounders as unreachable nuisances to tractable latent variables provides the conceptual inspiration for our framework: using the exogenous instrument as the key identifying constraint to recover the latent confounding subspace.

In this section, we propose a novel method that leverages IVs and representation learning to construct a latent representation of the unmeasured confounding, followed by nonparametric identification and semiparametric estimation of causal functionals. We first focus on the identification and inference of the marginals or average effects given the latent structure. This approach can be conveniently integrated with the bounding methods discussed in Section 2, which utilized conditional copulas to transition from marginal to joint distributions, thereby enabling inference on the entire joint distribution and individual-level effects; this will be discussed in Section 4.

Specifically, we observe $(S, A, Y)$ with exogenous instrument $S$, treatment $A$, and outcome $Y$. We suspect unobserved confounders $Z_C$ such that $A \leftarrow Z_C \rightarrow Y$. Our goal is to estimate counterfactual marginals and Average Treatment Effects (ATE). We aim to use a representation-learning-based approach that uses $S$ to help identify a representation $\hat Z_C$ of the unobserved $Z_C$ and then identify the ATE (Fig. 1).

3.1 Identification

To identify the marginal distribution of potential outcomes in the presence of unobserved confounding, we exploit recent developments in identifiable CRL with exogenous auxiliary variables Kong et al. (2022), Ng, Xie, Dong, Spirtes & Zhang (2025), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025).
Specifically, we introduce an exogenous instrument $S$ to induce a decomposition of the latent representation of the observed data into components driven by the exogenous variation in $S$ ($Z_S$), and components that capture the variation associated with the unobserved confounding ($Z_C$).

Figure 1: Causal and latent structure underlying the IV-based representation-learning model with exogenous noises. $\varepsilon_C$ and $\varepsilon_S$ are the independent, exogenous noise sources. $Z_C$ (pure confounder) is defined by $\varepsilon_C$. $Z_S$ (IV-related latents) is defined by $Z_C$, $S$, and $\varepsilon_S$. The absence of direct edges $S \to Y$ and $Z_S \to Y$ satisfies the exclusion restriction.

Since the confounders are not observed, $Z_C$ and $Z_S$ cannot simply be assumed to be conditionally independent. Nevertheless, under a sufficient-variability condition on $p(A \mid S)$, $Z_C$ remains identifiable. In practice, since both $S$ and the treatment $A$ are observed, it is unnecessary to estimate $Z_S$ explicitly. Instead, we adopt a reduced-form construction that directly targets $Z_C$, thereby alleviating the additional randomness that would arise from simultaneously estimating multiple latent components (Section 3.3).

Note that, given the dimensionality of the observed variables (which constrains the latent dimensionality), $Z_C$ should not be interpreted as a direct representation of the high-dimensional confounders themselves. Rather, it serves as a low-dimensional sufficient score summarizing the aggregate influence of the unobserved confounding. We have the following identifiability results.

Theorem 5 (Identification). Let the observed data be i.i.d. samples of $(S, A, Y)$, where $S$ is an instrument, $A$ is the treatment, and $Y$ is the observed outcome. Assume the data-generating process admits latent variables $Z = (Z_C, Z_S)$ and exogenous noises $\varepsilon_C, \varepsilon_S$ satisfying the following assumptions.

Assumption 3.1 (Causal structural equations). The latent variables are generated as $Z_C = f(\varepsilon_C)$ and $Z_S = h(Z_C, S, \varepsilon_S)$. The observed variables $(A, Y)$ are generated via
$$ A = g_A(Z_S), \qquad Y = g_Y(A, Z_C), $$
where
• $g_A$ is an arbitrary measurable function (allowing for discrete/binary treatments);
• $g_Y(a, \cdot)$ is injective with respect to $Z_C$ for each observed treatment level $a \in \mathcal{A}$.
Consequently, the joint mapping $(Z_C, Z_S) \mapsto (A, Y)$ allows the unique recovery of $Z_C$ from the observables $(A, Y)$ (i.e., $Z_C$ is observable-measurable). All exogenous noises are mutually independent and $Z_C \perp\!\!\!\perp S$.

Assumption 3.2 (IV conditions). The variable $S$ is a valid instrument satisfying
• Exclusion restriction: $S$ does not directly affect $Y$, i.e., $Y = g_Y(A, Z_C)$ does not depend on $S$; in potential outcome notation, $Y(s, a) = Y(a)$.
• Unconfoundedness: $S \perp\!\!\!\perp (\varepsilon_C, \varepsilon_S)$, or equivalently $S \perp\!\!\!\perp (A(s), Y(s))$.

Assumption 3.3 (Sufficient variability of the IV). The instrument $S$ induces sufficient variation in the latent variable $Z_S$ conditional on $Z_C$. Formally, for almost all $z_c$, the family of conditional distributions
$$ \mathcal{P}_{z_c} = \{P(Z_S \in \cdot \mid Z_C = z_c, S = s) : s \in \mathcal{S}\} $$
is complete.
In terms of sets, this implies that for any measurable set $B \subseteq \mathcal{Z}_S$ with positive measure ($P(B) > 0$), the probability mass assigned to this set varies with $S$:
$$ s \mapsto P(Z_S \in B \mid Z_C = z_c, S = s) \quad \text{is not a constant function a.s.} $$

Assumption 3.4 (Positivity and regularity). For almost all $z_C$ in the support of $Z_C$, $0 < P(A = a \mid Z_C = z_C) < 1$, and the required conditional expectations exist and are finite.

Then the confounding subspace $Z_C$ is subspace-identifiable. In other words, if there exists a learned representation $\hat Z_C$ that satisfies the independence constraint $\hat Z_C \perp\!\!\!\perp S$ and predictive sufficiency $Y \perp\!\!\!\perp Z_C \mid (A, \hat Z_C)$, or $p(Y \mid A, \hat Z_C, Z_C) = p(Y \mid A, \hat Z_C)$ almost surely, then there exists an invertible map $\psi$ such that $\hat Z_C = \psi(Z_C)$ almost surely. Subsequently, the counterfactual marginals and the ATE are nonparametrically identifiable:
$$ E[Y(a)] = E\big[E[Y \mid A = a, \hat Z_C]\big], \qquad \mathrm{ATE}(a, a') = E\big[E[Y \mid A = a, \hat Z_C] - E[Y \mid A = a', \hat Z_C]\big]. $$

Remark 4 (Completeness vs. variability). Assumption 3.3 is formally equivalent to the bounded completeness condition often used in econometrics. We adopt the terminology of sufficient variability (Kong et al. 2022, Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan 2025) to highlight the geometric intuition: the instrument $S$ must create diverse distributional shifts in $Z_S$ such that no subset of the latent space remains "invariant" to $S$. This property is crucial in Step 1 of the proof to rule out representations that entangle $Z_S$.

Remark 5 (On the injectivity assumption). We emphasize that the latent variables $Z_C$ and $Z_S$ should not be interpreted as the high-dimensional raw state of the world (e.g., all genetic factors). Instead, following the philosophy of sufficient dimension reduction or control functions, they represent the low-dimensional (or scalar) projections or scores of these factors (so that we can let $\dim(Z_C) \le \dim(Y)$) that actively drive the variation in $A$ and $Y$. Under this interpretation, the injectivity assumption in Assumption 3.1 implies that the causal mechanism relies on a low-dimensional bottleneck, which is a standard structural assumption in representation learning Kong et al. (2022), Ng, Xie, Dong, Spirtes & Zhang (2025), Ng, Blöbaum, Bhandari, Zhang & Kasiviswanathan (2025).

Remark 6 (Identifiability of counterfactual marginals). Theorem 5 implies the identifiability of the marginal distribution of the potential outcome $Y(a)$, i.e.,
$$ P(Y(a) \le y) = E\big[P(Y \le y \mid A = a, \hat Z_C)\big]. $$

The proof of Theorem 5 consists of two main stages. First, based on the injectivity and completeness assumptions, we establish the identifiability of the confounding subspace $Z_C$ up to an invertible transformation (Step 1). Second, we utilize this identified representation to identify the marginal counterfactual distributions $F_{Y(a)}(y)$ and, subsequently, the Average Treatment Effect (ATE) via the back-door adjustment formula (Steps 2-4). The detailed proof is given in Appendix C.

3.2 Estimation and Inference

Generally speaking, we can estimate the average potential outcome $\psi(a) = E[Y(a)]$ using the same approaches as under observed confounding, but replacing the observed confounders $Z_C$ with the confounding representation $\hat Z_C$ learned from a separate dataset. We quickly review the standard approaches as follows.
3.2.1 Standard estimators with observed confounding

For binary treatment $A \in \{0, 1\}$, the outcome regression (OR) estimator is $\hat\psi_{\mathrm{OR}}(a) = \frac{1}{n}\sum_{i=1}^{n} \hat\mu(a, Z_{C,i})$, where $\hat\mu(a, z)$ estimates the outcome model $E[Y \mid A = a, Z_C = z]$. The inverse propensity weighting (IPW) estimator is $\hat\psi_{\mathrm{IPW}}(a) = \frac{1}{n}\sum_{i=1}^{n} \frac{1(A_i = a)}{\hat\pi(a \mid Z_{C,i})} Y_i$, where $\hat\pi(a \mid z)$ estimates the propensity score $P(A = a \mid Z_C = z)$. The doubly robust (DR) estimator combines both models as
$$ \hat\psi_{\mathrm{DR}}(a) = \frac{1}{n}\sum_{i=1}^{n}\Big[\hat\mu(a, Z_{C,i}) + \frac{1(A_i = a)}{\hat\pi(a \mid Z_{C,i})}\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\Big] $$
and enjoys double robustness (Bang & Robins 2005).

For continuous treatment $A \in \mathbb{R}$, the indicator $1(A_i = a)$ has zero probability, requiring density- or kernel-based adaptations. Let $\hat r(a \mid z) = \hat f(a \mid z)$ denote the estimated conditional density (generalized propensity score, GPS) and $K_h(\cdot)$ a kernel function with bandwidth $h$. The outcome regression estimator remains $\hat\psi_{\mathrm{OR}}(a) = \frac{1}{n}\sum_{i=1}^{n} \hat\mu(a, Z_{C,i})$. The GPS-IPW estimator (Hirano & Imbens 2004) is
$$ \hat\psi_{\mathrm{GPS}}(a) = \frac{\sum_{i=1}^{n} K_h(A_i - a)\, Y_i / \hat r(A_i \mid Z_{C,i})}{\sum_{i=1}^{n} K_h(A_i - a) / \hat r(A_i \mid Z_{C,i})}, $$
where $K_h(A_i - a)$ weights observations near $a$ and $1/\hat r(A_i \mid Z_{C,i})$ provides inverse density weighting. The density-estimation-based standard DR estimator (Kennedy et al. 2017) can be constructed as
$$ \hat\psi_{\mathrm{D}}(a) = \frac{1}{n}\sum_{i=1}^{n}\Big[\hat\mu(a, Z_{C,i}) + \frac{\hat r(a \mid Z_{C,i})}{\hat r(A_i \mid Z_{C,i})}\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\Big], $$
where the density ratio $\hat r(a \mid Z_{C,i}) / \hat r(A_i \mid Z_{C,i})$ replaces the indicator-based weight of the binary DR estimator, maintaining doubly robust consistency. The DR-kernel estimator employs kernel methods and simplifies this as
$$ \hat\psi_{\mathrm{K}}(a) = \frac{1}{n}\sum_{i=1}^{n}\big[\hat\mu(a, Z_{C,i}) + n \cdot w_i(a)\big(Y_i - \hat\mu(A_i, Z_{C,i})\big)\big], \qquad w_i(a) = \frac{K_h(A_i - a)}{\sum_{j=1}^{n} K_h(A_j - a)}, $$
where the normalized kernel weights $w_i(a)$ avoid explicit density estimation, making the estimator computationally efficient while maintaining approximate doubly robust properties (Flores et al. 2012, Fong et al. 2018). The GPS can be estimated using kernel density estimation (KDE) on residuals, $\hat r(a \mid Z) \approx \frac{1}{n}\sum_{j=1}^{n} K_h\big((a - \hat\mu(Z)) - r_j\big)$, where $r_j = A_j - \hat\mu(Z_j)$ are training residuals and $\hat\mu(Z) = E[A \mid Z]$ is the conditional mean of the treatment estimated via gradient boosting. This nonparametric approach accommodates non-Gaussian treatment distributions. For numerical stability, extreme weights can be trimmed. Bandwidth selection follows Silverman's rule $h = 1.06\, \hat\sigma_A\, n^{-1/5}$ Silverman (2018), where $\hat\sigma_A$ is the sample standard deviation of the treatment.

3.2.2 Triple Machine Learning estimator

We now resort to learning confounding representations when there are unmeasured confounders; subsequently, we can employ all of the standard estimators above by replacing the observed covariates with the learned confounding representations. This mimics the standard semiparametric / doubly robust / double machine learning approach Chernozhukov et al. (2018), Kennedy (2024), and we call it "Triple Machine Learning" (TML) since an additional fold is needed to learn the confounding representation first, before the standard double machine learning or doubly robust estimation. As an example, for binary treatments, we can construct an estimator $\hat E[Y(a)]$ using three-fold sample splitting or more complex cross-fitting. We summarize the procedure as follows (a code sketch is given after the list):
• Fold 1 is used to learn the representation $\hat Z_C = \hat\psi(A, Y)$.
• Fold 2 is used to estimate the nuisance parameters:
  – $\hat m_a(z) = \hat E[Y \mid A = a, \hat Z_C = z]$ in the outcome model;
  – $\hat\pi_a(z) = \hat P(A = a \mid \hat Z_C = z)$ in the treatment (propensity) model.
• Fold 3 is used to estimate the causal parameter
$$ \hat E[Y(a)] = P_n\Big[\hat m_a(\hat Z_C) + \frac{1\{A = a\}}{\hat\pi_a(\hat Z_C)}\big(Y_i - \hat m_a(\hat Z_C)\big)\Big]. $$
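A schematic implementation of the three folds for a binary treatment is sketched below. The representation learner is abstracted as a user-supplied callable (e.g., the IV-VAE of Section 3.3); gradient boosting for the nuisance models, the single non-rotated split, and all function names are illustrative assumptions rather than prescribed settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def tml_binary(Y, A, S, a, learn_representation, seed=0):
    """Triple machine learning sketch for E[Y(a)] with a binary treatment.

    learn_representation(Y, A, S) -> encoder, where encoder(Y, A, S) returns
    Z_hat, is a hypothetical interface for the Fold-1 representation learner."""
    rng = np.random.default_rng(seed)
    f1, f2, f3 = np.array_split(rng.permutation(len(Y)), 3)

    # Fold 1: learn the confounding representation Z_hat
    encoder = learn_representation(Y[f1], A[f1], S[f1])

    # Fold 2: nuisance models fitted on the learned representation
    Z2 = np.asarray(encoder(Y[f2], A[f2], S[f2])).reshape(len(f2), -1)
    m_hat = GradientBoostingRegressor().fit(np.column_stack([A[f2], Z2]), Y[f2])
    pi_hat = GradientBoostingClassifier().fit(Z2, A[f2])

    # Fold 3: doubly robust estimate of E[Y(a)]
    Z3 = np.asarray(encoder(Y[f3], A[f3], S[f3])).reshape(len(f3), -1)
    m_a = m_hat.predict(np.column_stack([np.full(len(f3), a), Z3]))
    col = list(pi_hat.classes_).index(a)
    p_a = np.clip(pi_hat.predict_proba(Z3)[:, col], 1e-3, 1.0)   # weight clipping
    phi = m_a + (A[f3] == a) / p_a * (Y[f3] - m_a)
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(f3))
```

In practice the roles of the folds would be rotated (cross-fitting), and the reported standard error would be corrected for the representation-learning contribution discussed in Theorem 6 below.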
The procedure easily generalizes to other distributional targets by modifying the outcome model to other functionals such as $\hat m_a(z, y) = \hat P(Y \le y \mid A = a, \hat Z_C = z)$. Moreover, similar logic applies to the continuous-treatment scenario with GPS, density-, or kernel-based estimators. In principle, the proposed TML framework is sufficiently general to accommodate a variety of estimators, including outcome regression, inverse probability weighting, and doubly robust estimators. We first establish the theoretical properties of the TML estimator under binary treatment assignment using a doubly robust construction (6); the results can be readily generalized to more complex continuous scenarios. We subsequently introduce a variational approach for effective confounding representation learning (Section 3.3) and assess the numerical performance of different estimators within the TML framework through simulation studies (Section 5).

Theorem 6 (Asymptotic properties of the doubly-robust-type representation-IV estimator). Let the identification assumptions of Theorem 5 hold. Let the estimator $\hat E[Y(a)]$ be constructed using $K$-fold cross-fitting, where $\hat Z_C$, $\hat m$, and $\hat\pi$ are estimated on separate folds. Let the boundedness and overlap assumption hold, i.e., there exist constants $0 < \underline{c} < \bar{c} < \infty$ such that $\underline{c} < \pi(a \mid z) < \bar{c}$ almost surely, and $E[Y^2] < \infty$. Then we have:

1. Consistency (double robustness). The estimator $\hat E[Y(a)]$ is consistent for $E[Y(a)]$, i.e., $\hat E[Y(a)] \stackrel{p}{\to} E[Y(a)]$, if the following conditions hold:
(C1a) The representation $\hat Z_C$ is consistent: $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = o_p(1)$.
(C1b) At least one of the nuisance estimators is consistent for the true function defined on $Z_C$, despite being trained on $\hat Z_C$ (i.e., it is EIV-robust and consistent):
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} = o_p(1) \quad \text{or} \quad \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(1). $$

2. Asymptotic normality. The estimator is $\sqrt{n}$-asymptotically normal with a corrected variance,
$$ \sqrt{n}\big(\hat E[Y(a)] - E[Y(a)]\big) \stackrel{d}{\to} N\big(0, V_{\mathrm{total}}(a)\big), \qquad V_{\mathrm{total}}(a) = \mathrm{Var}\big[\mathrm{IF}_a(O; \eta_0) + \mathrm{IF}_{\phi,\mathrm{rep}}(a)\big], $$
if the following hold:
(C2a) The nuisance functions (assumed EIV-robust) satisfy the double machine learning (DML) rate condition
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} \cdot \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(n^{-1/2}). $$
(C2b) The representation $\hat Z_C$ converges as $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = O_p(n^{-1/2})$.
In this case, the first-stage estimation error in $\hat Z_C$ contributes a first-order correction term $\mathrm{IF}_{\phi,\mathrm{rep}}(a)$ to the influence function.
3. Efficiency. The estimator is $\sqrt{n}$-asymptotically normal and achieves the ordinary (semiparametric) efficient variance lower bound,
$$ \sqrt{n}\big(\hat E[Y(a)] - E[Y(a)]\big) \stackrel{d}{\to} N(0, V_a), \qquad V_a = \mathrm{Var}\big[\mathrm{IF}_a(O; \eta_0)\big], $$
if the following hold:
(C3a) The nuisance functions (assumed EIV-robust) satisfy the DML rate condition
$$ \|\hat m(\hat Z_C) - m_0(Z_C)\|_{L_2,P} \cdot \|\hat\pi(\hat Z_C) - \pi_0(Z_C)\|_{L_2,P} = o_p(n^{-1/2}). $$
(C3b) The representation $\hat Z_C$ converges at a "super-convergence" rate $\|\hat Z_C - \psi(Z_C)\|_{L_2,P} = o_p(n^{-1/2})$.
In this case, the first-stage estimation error is asymptotically negligible, and the correction term $\mathrm{IF}_{\phi,\mathrm{rep}}(a)$ disappears from the asymptotic expansion.

The proof relies on the von Mises decomposition
$$ \sqrt{n}(\hat\psi - \psi_0) = \sqrt{n}\big(P_{n,3}\varphi(\hat\eta) - P\varphi(\eta_0)\big) = \underbrace{\sqrt{n}(P_{n,3} - P)\varphi(\eta_0)}_{T_1:\ \text{oracle CLT}} + \underbrace{\sqrt{n}(P_{n,3} - P)\big(\varphi(\hat\eta) - \varphi(\eta_0)\big)}_{T_2:\ \text{empirical process}} + \underbrace{\sqrt{n}\,P\big(\varphi(\hat\eta) - \varphi(\eta_0)\big)}_{T_3:\ \text{bias term}}, \qquad (6) $$
where the bias consists of a nuisance estimation error and a representation learning error,
$$ T_3 = \underbrace{\sqrt{n}\,P\big(\varphi(\hat\eta) - \varphi(\tilde\eta)\big)}_{T_{3a}\ \text{(nuisance bias)}} + \underbrace{\sqrt{n}\,P\big(\varphi(\tilde\eta) - \varphi(\eta_0)\big)}_{T_{3b}\ \text{(representation bias)}}, $$
where the intermediate parameter $\tilde\eta = (m_0, \pi_0, \hat\phi)$ represents the ideal nuisance parameters given the learned representation $\hat\phi$. Note that $T_{3a}$ can be handled via the ordinary Neyman orthogonality technique, while $T_{3b}$ cannot, since $\psi$ is not orthogonal with respect to $Z_C$. Hence, letting $M(\phi) = E[\varphi(O; m_0, \pi_0, \phi)]$ be the expected score functional, we have
$$ T_{3b} = \sqrt{n}\big(M(\hat\phi) - M(\phi_0)\big) = \underbrace{\sqrt{n}\,\nabla_\phi M(\phi_0)[\hat\phi - \phi_0]}_{\text{linear term (I)}} + \underbrace{\sqrt{n}\,R(\hat\phi, \phi_0)}_{\text{remainder term (II)}}, $$
where the variance of the linear term is $\mathrm{Var}(\mathrm{IF}_{\phi,\mathrm{rep}}(O))$. In other words, the representation error accumulates in the subsequent causal estimation, resulting in inflated variance induced by $\mathrm{IF}_{\phi,\mathrm{rep}}(O) = E_{O'}[\nabla_\phi \varphi(O'; \eta_0)] \cdot \xi_\phi(O)$, where $\xi_\phi(O)$ is the influence function of the representation learner, even when the representation attains a parametric convergence rate. The detailed proof is given in Appendix D.

In practice, the explicit calculation of the variance inflation term $V_{\mathrm{rep}} = \mathrm{Var}(\mathrm{IF}_{\phi,\mathrm{rep}})$ presents a significant challenge, as the correction influence function $\mathrm{IF}_{\phi,\mathrm{rep}}$ depends on the infinite-dimensional sensitivity of the causal functional and the algorithmic response of the representation learner. Several empirical methods can be applied to facilitate robust inference. First, one may employ a Hessian-based numerical approximation leveraging recent advances in neural influence functions Koh & Liang (2017). By treating the encoder as a parametric model $\phi_\theta$, the term $\xi_\phi(O)$ can be approximated using the inverse Hessian-vector product (HVP). Specifically, $\widehat{\mathrm{IF}}_{\phi,\mathrm{rep}} \approx P_n[\nabla_\theta \varphi]^\top H_\theta^{-1} \nabla_\theta \mathcal{L}$, where the inverse Hessian $H_\theta^{-1}$ is efficiently computed via stochastic estimation (e.g., the LiSSA algorithm). This approach provides a first-order approximation of how small perturbations in the representation training set propagate to the final causal estimate.
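The sketch below illustrates only the HVP/LiSSA mechanics on a toy parametric "encoder"; mapping its output to $\widehat{\mathrm{IF}}_{\phi,\mathrm{rep}}$ requires the problem-specific gradient of the causal score with respect to the encoder parameters, so the model, loss, and hyperparameters here should all be read as assumptions for illustration.

```python
import torch

torch.manual_seed(0)
n, d = 256, 4
data_x = torch.randn(n, d)
data_y = data_x @ torch.tensor([1.0, -0.5, 0.2, 0.0]) + 0.1 * torch.randn(n)

encoder = torch.nn.Linear(d, 1)            # toy stand-in for the encoder phi_theta
params = list(encoder.parameters())

def train_loss():
    # Representation-learning loss L(theta); a placeholder squared error here.
    return torch.mean((encoder(data_x).squeeze(-1) - data_y) ** 2)

def hvp(vec):
    """Hessian-vector product H_theta @ vec via double backpropagation."""
    grads = torch.autograd.grad(train_loss(), params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params)
    return torch.cat([h.reshape(-1) for h in hv])

def inverse_hvp(v, damping=0.01, scale=25.0, iters=200):
    """LiSSA-style iterative approximation of H^{-1} v (assumed hyperparameters)."""
    est = v.clone()
    for _ in range(iters):
        est = v + (1.0 - damping) * est - hvp(est) / scale
    return est / scale

# Sensitivity of a (toy) downstream functional of the encoder to its parameters,
# propagated through the inverse Hessian, as in IF_hat ~ grad_phi^T H^{-1} grad_L.
score_grad = torch.cat([g.reshape(-1) for g in
                        torch.autograd.grad(encoder(data_x).mean(), params)])
correction_direction = inverse_hvp(score_grad.detach())
```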
Alternatively, a more computationally intensive but nonparametric approach is ensemble-based uncertainty quantification with the bootstrap or the infinitesimal jackknife (IJ), repeating the procedure over $B$ random sample splits Wager & Athey (2018); the excess variance observed across different representation-learning folds effectively captures the stochasticity encoded in $V_{\mathrm{rep}}$. While computationally demanding, this ensemble approach bypasses the need for second-order derivatives and is inherently more robust to the non-convex landscape of deep neural networks.

Remark 7 (The EIV-robustness assumption). The EIV-robustness underpinning all three scenarios requires the existence of estimators for $\hat m$ and $\hat\pi$ that can be trained on the error-laden proxy $\hat Z_C$ instead of the true $Z_C$ and yet still converge to the true functions $m_0$ and $\pi_0$ at the specified rates. This is a non-trivial requirement, as standard estimators in nonparametric errors-in-variables (EIV) settings often suffer from severe attenuation bias and degraded logarithmic convergence rates Fan & Truong (1993). Satisfying this condition implicitly requires the underlying true functional space to be sufficiently smooth, or necessitates the deployment of specialized deconvolution algorithms during the nuisance training phase to actively correct for the measurement error in $\hat Z_C$.

We summarize the observations as follows.

Remark 8. Within the TML framework, a DR-type estimator can be constructed, but its asymptotic properties critically depend on two factors:
1. The EIV-robustness assumption: we need $\hat m, \hat\pi$ to be trained consistently on the estimated representation $\hat Z_C$ instead of the oracle one.
2. The convergence rate of $\hat Z_C$: under the EIV-robustness assumption,
• if $\hat Z_C$ admits an $O_p(n^{-1/2})$ rate, the causal estimator has an inflated variance $V_{\mathrm{total}}$;
• if $\hat Z_C$ admits a super-smooth $o_p(n^{-1/2})$ rate, the causal estimator achieves the semiparametric efficient variance $V_a$; this is generally very hard for flexible nonparametric estimators such as ML methods.

3.3 IV-VAE: Learning the representation $Z_C$

The learning of $Z_C$ is a classic problem in disentangled representation learning. A natural way is to consider the following regularized $\beta$-VAE model:
$$ \mathcal{L} = -E_{q(Z|A,Y)}\big[\log p(Y \mid Z_C, A) + \log p(A \mid Z_S)\big] + \beta\, \mathrm{KL}\big(q(Z \mid A, Y)\,\|\, p(Z_S \mid S, Z_C)\, p(Z_C)\big) + \lambda\, E_q\big[\mathrm{HSIC}(Z_C, S) + \mathrm{HSIC}(Z_C, A \mid Z_S)\big], \qquad (7) $$
where the (conditional) dependence can be measured via (conditional) HSIC Fukumizu et al. (2004, 2007), Sheng & Sriperumbudur (2023). This involves the simultaneous optimization of two coupled objects, $\hat Z_C$ and $\hat Z_S$ (since there exists a link $Z_C \to Z_S$), and relies on the intrinsically difficult measurement and constraint of conditional independence Shah & Peters (2020), He et al. (2025), making stable learning tricky to achieve.

In fact, explicitly inferring $Z_S$ is computationally redundant for the purpose of confounder identification, given that the instrument $S$ is observed and exogenous. We therefore employ a reduced-form specification for the treatment decoder. By substituting the mechanism of $Z_S$ into the treatment assignment function, we approximate the composite function $A = g(h(S), Z_C) = \tilde g(S, Z_C)$. Consequently, we can introduce the IV-VAE, where the generative model (decoder) is defined as
$$ p_\theta(A \mid S, Z_C) = N\big(\mu_A(S, Z_C), \sigma^2_A\big), \qquad p_\theta(Y \mid A, Z_C) = N\big(\mu_Y(A, Z_C), \sigma^2_Y\big). $$
Conditioning directly on $S$ allows the decoder to capture the variation in $A$ induced by the instrument pathway without the need to explicitly model the intermediate latent variable $Z_S$, while $Z_C$ captures the remaining confounding variation. We introduce an inference network (VAE encoder) $q_\phi(Z_C \mid A, Y, S)$, which conditions on $S$ to facilitate the separation of instrument-induced variation from confounder-induced variation. To enforce the structural assumption that the recovered confounder is exogenous to the instrument, we utilize the HSIC Gretton et al. (2005) as a regularization term that explicitly penalizes statistical dependence between the learned latent space $\hat Z_C$ and the instrument $S$, ensuring that $\hat Z_C$ satisfies the properties of a valid confounder. The final optimization objective becomes
$$ \mathcal{L}(\theta, \phi) = \underbrace{-E_{q_\phi}\big[\log p_\theta(A \mid S, Z_C) + \log p_\theta(Y \mid A, Z_C)\big]}_{\text{reconstruction loss}} + \underbrace{\beta\, \mathrm{KL}\big(q_\phi(Z_C \mid \cdot)\,\|\, p(Z_C)\big)}_{\text{KL divergence}} + \underbrace{\lambda\, \mathrm{HSIC}(Z_C, S)}_{\text{independence constraint}}. \qquad (8) $$
This new objective focuses solely on the learning of $\hat Z_C$ and only necessitates constraining an unconditional independence, which significantly facilitates stable, efficient learning and optimization.
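A compact PyTorch sketch of this objective is given below. The architecture sizes, optimizer settings, fixed unit decoder variances, and the biased RBF-kernel HSIC estimator are illustrative assumptions; only the overall structure (encoder $q_\phi(Z_C \mid A, Y, S)$, decoders $p_\theta(A \mid S, Z_C)$ and $p_\theta(Y \mid A, Z_C)$, KL term, and $\mathrm{HSIC}(Z_C, S)$ penalty) follows Eq. (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rbf_gram(x, sigma=1.0):
    """RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return torch.exp(-torch.cdist(x, x).pow(2) / (2 * sigma ** 2))

def hsic(x, z, sigma=1.0):
    """Biased HSIC estimate between batches x and z, each of shape (n, dim)."""
    n = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(z, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

class IVVAE(nn.Module):
    """Sketch of the IV-VAE: encoder q(Z_C | A, Y, S), Gaussian decoders
    p(A | S, Z_C) and p(Y | A, Z_C), and an HSIC(Z_C, S) penalty."""
    def __init__(self, dim_z=1, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * dim_z))
        self.dec_a = nn.Sequential(nn.Linear(1 + dim_z, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.dec_y = nn.Sequential(nn.Linear(1 + dim_z, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def loss(self, A, Y, S, beta=1.0, lam=10.0):
        mu, logvar = self.enc(torch.cat([A, Y, S], dim=1)).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        rec_a = F.mse_loss(self.dec_a(torch.cat([S, z], dim=1)), A)  # Gaussian NLL up
        rec_y = F.mse_loss(self.dec_y(torch.cat([A, z], dim=1)), Y)  # to constants
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=1))
        return rec_a + rec_y + beta * kl + lam * hsic(z, S)

# Usage sketch on synthetic (n, 1) tensors
n = 512
S = torch.rand(n, 1) * 4 - 2
Zc = torch.randn(n, 1)
A = 0.5 * S + 0.2 * Zc + 0.1 * torch.randn(n, 1)
Y = 1 + 2 * A + 3 * Zc + 0.1 * torch.randn(n, 1)
model = IVVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    model.loss(A, Y, S).backward()
    opt.step()
```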
4 From ATE to ITE: Bounds on the counterfactual joint distribution in the presence of unobserved confounding

The identification result in Theorem 5 guarantees the identifiability of the conditional and marginal distributions of potential outcomes, $F_{Y(a)|\hat Z_C}(y)$ and $F_{Y(a)}(y)$. We can then follow the results in Section 2, using (conditional) copulas and the FH inequality to further bound their (conditional) joint distributions.

We now work with the estimated representation $\hat Z_C$ instead of observed covariates $X$, and hence pay an extra price for the representation learning error. The estimation quality will therefore be jointly affected by: 1) the convergence rate of learning $\hat Z_C$; 2) the EIV robustness of the nuisance estimators $\hat m$ and $\hat\theta$; 3) whether the margin condition holds for the direct estimator, or the bias-variance trade-off of the smooth approximation across different smoothing parameters $t$. Cross-fitting is generally helpful for separating these errors from their different sources.

5 Simulations

In the following sections, we conducted empirical evaluations on both simulated and real-world data to further highlight our proposed methods. We first evaluated the conditional rank-preserving FH upper bounds of the direct estimator and the log-sum-exp approximation under varying smoothing parameters. Subsequently, we demonstrated the accuracy of confounding representation learning in the presence of unobserved confounders, along with the estimation performance for the ATE and continuous treatment response curves of different estimators within the TML framework.

5.1 Simulations on boundary estimation of the counterfactual joint distribution

To validate the theoretical properties and estimation performance of the proposed bounds under observed confounding, we conducted a simulation study with $N = 2{,}000$ samples over 100 replications. The covariates were generated as $X \sim N(0, I_2)$ and treatment assignment followed a logistic model
$$ P(A = 1 \mid X) = \sigma(0.5 X_1 - 0.3 X_2 + \epsilon_S), \qquad \epsilon_S \sim N(0, 0.1^2), $$
where $\sigma(\cdot)$ denotes the sigmoid function.
We designed two different data generating processes (DGPs): a linear SCM where
$$ Y(0) = X_1 + 0.5 X_2 + \epsilon_Y, \qquad Y(1) = Y(0) + 1.0, $$
and a nonlinear setting
$$ Y(0) = \sin(2 X_1) + X_2^2 + \epsilon_Y, \qquad Y(1) = \cos(2 X_1) + X_2^2 + 0.5 + \epsilon_Y, $$
where $\epsilon_Y \sim N(0, 1)$.

First, to demonstrate the gain in identification power (tightness), we compared the width of the standard marginal bounds (Makarov bounds) against the proposed conditional bounds calculated using the true oracle distributions. The width reduction is defined as the difference between the marginal width $W_{\mathrm{marg}} = \min(F_1, F_0) - \max(F_1 + F_0 - 1, 0)$ and the expected conditional width $W_{\mathrm{cond}} = E_X[\min(F_{1|X}, F_{0|X}) - \max(F_{1|X} + F_{0|X} - 1, 0)]$. We observe that covariate-assisted conditional copulas lead to a significant tightening of the bounds (Fig. 2a).

Second, we evaluated the finite-sample performance of the proposed estimators. We implemented the doubly robust direct (DR-Direct) estimator and the DR-Smooth estimator (with varying smoothing parameters) using 5-fold cross-fitting to minimize overfitting. We assessed the bias and mean squared error (MSE = bias$^2$ + SE$^2$) in each replicate (Fig. 2b). The simulation results demonstrate the clear superiority of the smooth approximation over the direct estimator, provided the smoothing parameter $t$ is sufficiently large.

Figure 2: Estimation of the bounds on the joint distribution of potential outcomes. (a) Bound width gained via marginal copulas and conditional copulas. (b) Boxplots of bias and MSE of the estimators in each replicate.

5.2 Simulations on marginal estimation with the triple machine learning approach

To evaluate the efficacy of the proposed representation-learning-based IV estimator, we conducted simulations with unobserved confounding. The DGP introduces a valid instrument $S \sim U(-2, 2)$ and a latent confounder $Z_C \sim N(0, 1)$, ensuring the crucial independence assumption $S \perp Z_C$. The treatment is generated with sufficient variability and nonlinear dependence on the instrument,
$$ Z_S = 0.5 S + 0.1 \tanh(S) + 0.3\, \sigma(2 S) + 0.2 Z_C + \epsilon_S, $$
where $\sigma(\cdot)$ is the sigmoid function and $\epsilon_S \sim N(0, 0.2^2)$. We considered two treatment regimes: a binary treatment where $A \sim \mathrm{Bernoulli}(\sigma(Z_S))$, and a continuous treatment where $A = Z_S$. The outcome $Y$ was generated under two distinct scenarios. In the linear setting, the outcome is a simple additive function:
$$ Y_{\mathrm{lin}} = 1 + 2 A + 3 Z_C + \epsilon_Y. $$
In the nonlinear setting, the outcome includes sinusoidal effects, sigmoid transformations, and treatment-confounder interactions ($A \times Z_C$) that violate standard linear IV assumptions:
$$ Y_{\mathrm{nonlin}} = 1 + \big(0.3 A + 0.2 \sin(2 A + 0.5)\big) + \big(0.3 Z_C + 2 \sigma(Z_C)\big) + 0.2 A Z_C + \epsilon_Y. $$
We simulated $N = 6000$ samples and employed a triple-split cross-fitting strategy ($K = 6$ folds) to separate representation learning (VAE training), nuisance parameter estimation, and final estimation (to which different estimators can be applied).

5.2.1 Representation learning quality

We first examined the quality of latent confounding representation learning. We illustrate the correlation between the true $Z_C$ and the learned $\hat Z_C$, and the dependence between the learned $\hat Z_C$ and the instrument $S$ (Fig. 3). As $Z_C$ is identifiable only up to an invertible transformation, both strong positive and strong negative correlations between $Z_C$ and $\hat Z_C$ provide evidence of successful representation learning.

Figure 3: Representation learning quality, demonstrated by the correlation between the true $Z_C$ and the learned $\hat Z_C$, and the dependence between the learned $\hat Z_C$ and the instrument $S$. (a) Linear DGP. (b) Nonlinear DGP.
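For reference, the data-generating process of this subsection can be reproduced with the short script below (a sketch; random seeds are not specified in the text, and the outcome noise is assumed here to be standard normal, as in Section 5.1).

```python
import numpy as np

def simulate_iv_dgp(n=6000, treatment="binary", outcome="linear", seed=0):
    """Section 5.2 data-generating process with a valid instrument S and a
    latent confounder Z_C (outcome noise scale assumed to be 1)."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    S = rng.uniform(-2, 2, n)                    # instrument, independent of Z_C
    Zc = rng.normal(0, 1, n)                     # latent confounder
    eps_s = rng.normal(0, 0.2, n)
    Zs = 0.5 * S + 0.1 * np.tanh(S) + 0.3 * sigmoid(2 * S) + 0.2 * Zc + eps_s
    A = rng.binomial(1, sigmoid(Zs)) if treatment == "binary" else Zs
    eps_y = rng.normal(0, 1, n)                  # assumed unit-variance noise
    if outcome == "linear":
        Y = 1 + 2 * A + 3 * Zc + eps_y
    else:
        Y = (1 + 0.3 * A + 0.2 * np.sin(2 * A + 0.5)
             + 0.3 * Zc + 2 * sigmoid(Zc) + 0.2 * A * Zc + eps_y)
    return S, A, Y, Zc
```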
5.2.2 Benchmarking ATE estimation

Within the triple machine learning (TML, i.e., IV-based representation learning) framework, we can implement different estimators in the final estimation fold (including outcome regression, IPW, and doubly robust estimators). We apply weight clipping to both the IPW and doubly robust estimators to ensure numerical stability. We benchmarked ATE estimation ($A = 1$ vs. $A = 0$). We evaluated different variants of TML estimators against 2SLS baselines over 100 replications, using the bias and MSE (bias$^2$ + SE$^2$) in each replication as the evaluation metrics, shown in Fig. 4. We find that for continuous treatments, the different estimators within the IV-based representation learning framework all significantly outperform the 2SLS baseline. Even in settings with a binary treatment and linear effects, the performance of our nonparametric method is comparable to that of parametric approaches such as 2SLS.

Figure 4: Simulation results for ATE estimation with unmeasured confounding. We evaluate the 2SLS baseline against different estimators (outcome regression, IPW, and doubly robust) within the triple machine learning (TML) framework. (a) Boxplots of bias in each replicate of the different estimators under different DGPs. (b) Boxplots of MSE in each replicate of the different estimators under different DGPs.

5.2.3 Dose-response curve estimation

To evaluate the estimator's capability to recover the full structural relationship between the treatment and the outcome, we conducted a dose-response curve estimation experiment with a continuous treatment regime. As shown in Fig. 5a, TML not only provides accurate estimates of the average treatment effect, but also recovers the dose-response function $E[Y(a)]$ under continuous treatment assignment, with particularly strong performance observed for the kernel-based doubly robust estimator, especially in treatment regions with high observation density (Fig. 5b).

Figure 5: Dose-response curve estimation. (a) Dose-response ($E[Y(a)]$) curves estimated via different TML estimators. (b) Observed treatment distribution in the simulation.

6 Real-world analysis

We illustrate the proposed methodology by analyzing the demand for cigarettes using the CigarettesSW dataset from the R package AER Kleiber et al. (2020). This dataset comprises panel data for 48 U.S. states over the period 1985-1995. The primary objective is to estimate the price elasticity of demand while accounting for unobserved heterogeneity in state-level consumption behaviors. Let $Y$ denote the logarithm of cigarette consumption (packs per capita) and $A$ the logarithm of the real price. The relationship between price and consumption is confounded by unobserved state-specific factors, $Z_C$, such as regional health awareness and cultural habits.
To address this endogeneity, we employ the logarithm of the average excise tax as the instrument S, which satisfies the strong relevance condition through its direct impact on price and is plausibly exogenous to individual preferences.
Figure 6: Causal analysis of cigarette demand. (a) Estimated average dose-response function with 95% pointwise confidence intervals. The convex shape indicates that price elasticity increases (becomes more negative) at higher price levels. (b) The joint CDF surface estimated via the proposed smooth estimator, targeting the FH upper bound. The concentration of probability mass along the diagonal visualizes the theoretical limit of rank preservation. (c) Scatter plot of individual counterfactual pairs (Ŷ(a_low), Ŷ(a_high)) reconstructed by the fitted outcome model for a_low = 4.63 and a_high = 5.17. Fixing the learned latent heterogeneity ẑ_{c,i} for each state, the strict alignment with the identity benchmark (dashed line) demonstrates that the treatment induces a homogeneous structural shift across the population.
We apply the proposed TML framework to recover the latent confounding structure Z_C and estimate the causal mechanism. Figure 6 summarizes the estimated average and distributional effects based on the learned representation. Panel (a) presents the estimated average dose-response function E[Y(a)] across the observed support of log prices. The curve exhibits a monotonic decrease, consistent with the fundamental law of demand. The narrow 95% pointwise confidence intervals indicate precise estimation of the mean effect. Notably, the response is nonlinear: the slope steepens for log prices above a > 5.1, suggesting an increasing price elasticity in the upper price range. This implies that consumption becomes more sensitive to marginal price increases when prices are already elevated.
Beyond the average effect, Panels (b) and (c) examine the joint dependence structure of the potential outcomes. Panel (b) visualizes the joint CDF estimated via the proposed smooth estimator, specifically targeting the FH upper bound P(Y(a_high) ≤ y_1, Y(a_low) ≤ y_0). This estimate reflects the limit of perfect rank preservation. Panel (c) displays the empirical counterpart at the individual level. Since the true counterfactuals are unobserved, we reconstruct them using the fitted outcome model Ŷ(a, z_c). For each state i, we fix its learned latent representation ẑ_{c,i} and compute the predicted potential outcomes under two quartile values of the observed price, a_low = 4.63 and a_high = 5.17, using the trained outcome model. The resulting scatter plot of the predicted pairs (Ŷ(a_low, ẑ_{c,i}), Ŷ(a_high, ẑ_{c,i})) aligns closely with the rank-preserving benchmark (dashed line), confirming that the latent confounder Z_C induces a comonotonic dependence structure.
Crucially, the linearity observed in Panel (c) on the logarithmic scale implies a homogeneous relative treatment effect. While the latent confounder Z_C captures significant heterogeneity in baseline consumption levels (intercepts), the treatment effect manifests as a uniform structural shift across the distribution. This suggests that price interventions dampen consumption proportionally for both heavy and light smokers, preserving the rank ordering of states by consumption intensity.
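For illustration, the following is a minimal sketch of how the Panel (c) construction could be reproduced under stated assumptions: outcome_model is a hypothetical stand-in for the fitted outcome model Ŷ(a, z_c), and z_hat plays the role of the learned state-level representations; only the two quartile price values are taken from the analysis above.

```python
import numpy as np

# Hypothetical fitted outcome model Y-hat(a, z_c); here a stand-in linear form on the log scale.
def outcome_model(a, z_c):
    return 5.0 - 0.9 * a + 0.4 * z_c

# z_hat holds stand-in learned per-state latent representations (one value per state).
rng = np.random.default_rng(1)
z_hat = rng.normal(size=48)

a_low, a_high = 4.63, 5.17              # quartile values of observed log price (from the analysis)
y_low = outcome_model(a_low, z_hat)     # predicted log consumption at the low price
y_high = outcome_model(a_high, z_hat)   # predicted log consumption at the high price

# Rank preservation check: the counterfactual pairs should be comonotonic (identical ordering).
rank_preserved = np.array_equal(np.argsort(y_low), np.argsort(y_high))
shift = np.mean(y_high - y_low)         # a roughly constant shift indicates a homogeneous effect
print(f"rank preserved: {rank_preserved}, mean shift: {shift:.3f}")
```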
7 Discussion
This paper has developed a unified framework for the identification and estimation of counterfactual distributions, addressing two fundamental challenges: bounding joint distributions under observed confounding and recovering marginal distributions under unobserved confounding. By bridging copula theory with causal representation learning, we provide a pathway from identifying population-level effects to bounding individual-level counterfactual dependencies.
Our analysis of conditional copulas underscores the informative value of covariates in tightening the FH bounds. The upper bound, corresponding to conditional rank preservation, serves as a particularly pragmatic benchmark for individualized treatment effect estimation. We proposed two estimation strategies, a direct estimator under margin conditions and a smooth log-sum-exp approximation, to handle the non-differentiable influence function; the two strategies highlight a fundamental bias-variance trade-off.
In the context of unmeasured confounding, our integration of instrumental variables with causal representation learning offers a nonparametric alternative to classical SEMs. Unlike traditional IV methods that focus on local average treatment effects or rely on linearity, our approach leverages the instrument to identify the latent confounding subspace itself. The triple machine learning procedure extends the double machine learning paradigm to account for the representation learning stage. A key theoretical finding of this work is the characterization of the variance inflation induced by the estimated representation. As shown in Theorem 6, unless the representation learner achieves super-convergence rates (a formidable requirement for deep neural networks), standard errors must be corrected to account for the first-stage estimation uncertainty.
Several limitations of the current framework warrant further investigation. First, identification of the latent confounding space relies on injectivity and completeness conditions; relaxing these conditions to allow for partial identification of the latent space would be a valuable extension. Moreover, while we propose an HSIC-regularized VAE-based algorithm with identifiability guarantees, finite sample size, constrained function approximators, and the non-convex nature of such optimization problems still pose challenges in real-world estimation (Hyvärinen et al., 2023).
Furthermore, a practical challenge arises at the inference stage regarding the representation-learning correction. The variance of the proposed estimator is inflated by the estimation error of the representation Ẑ_C. However, quantifying this inflation explicitly requires the gradients of the invertible mapping ψ linking the learned and true representations, which is analytically unknown and implicitly defined by the neural network training dynamics, making estimation of the correction term IF_{ϕ,corr} non-trivial. Relying on uncorrected variance estimators may inflate Type I error rates in hypothesis testing. Developing feasible methods, such as Hessian approximation (Koh & Liang, 2017) or ensemble-based quantification (Wager & Athey, 2018), to validly capture this uncertainty remains an open and critical area for future research (Gawlikowski et al., 2023).
In summary, this work bridges the gap between classical semiparametric theory and modern representation learning. By formally characterizing the identification limits of latent confounding through instrumental variables and establishing the asymptotic distribution of the "triple machine learning" estimator, we provide a principled alternative to ad hoc deep causal estimation. Simultaneously, our copula-based bounds refine the understanding of treatment effect heterogeneity beyond marginal aggregates, offering a rigorous tool for assessing individual-level risks. Collectively, these contributions lay the groundwork for a more robust statistical foundation for causal inference with high-dimensional, unstructured data.
Acknowledgment
We gratefully acknowledge Prof. Edward Kennedy, JungHo Lee, and Ignavier Ng for their valuable comments and suggestions on the manuscript.
References
Andrews, D. W. (2000), 'Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space', Econometrica pp. 399–405.
Angrist, J. D., Imbens, G. W. & Rubin, D. B. (1996), 'Identification of causal effects using instrumental variables', Journal of the American Statistical Association 91(434), 444–455.
Bang, H. & Robins, J. M. (2005), 'Doubly robust estimation in missing data and causal inference models', Biometrics 61(4), 962–973.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018), 'Double/debiased machine learning for treatment and structural parameters'.
Fan, J. & Truong, Y. K. (1993), 'Nonparametric regression with errors in variables', The Annals of Statistics pp. 1900–1925.
Flores, C. A., Flores-Lagunes, A., Gonzalez, A. & Neumann, T. C. (2012), 'Estimating the effects of length of exposure to instruction in a training program: The case of Job Corps', Review of Economics and Statistics 94(1), 153–171.
Fong, C., Hazlett, C. & Imai, K. (2018), 'Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements', The Annals of Applied Statistics 12(1), 156–177.
Fukumizu, K., Bach, F. R. & Jordan, M. I. (2004), 'Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces', Journal of Machine Learning Research 5(Jan), 73–99.
Fukumizu, K., Gretton, A., Sun, X. & Schölkopf, B. (2007), 'Kernel measures of conditional dependence', Advances in Neural Information Processing Systems 20.
Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R. et al. (2023), 'A survey of uncertainty in deep neural networks', Artificial Intelligence Review 56(Suppl 1), 1513–1589.
Gretton, A., Bousquet, O., Smola, A. & Schölkopf, B. (2005), Measuring statistical dependence with Hilbert-Schmidt norms, in 'International Conference on Algorithmic Learning Theory', Springer, pp. 63–77.
He, Z., Pogodin, R., Li, Y., Deka, N., Gretton, A. & Sutherland, D. J. (2025), On the hardness of conditional independence testing in practice, in 'The Thirty-ninth Annual Conference on Neural Information Processing Systems'.
Hirano, K. & Imbens, G. W. (2004), The propensity score with continuous treatments, in A. Gelman & X.-L. Meng, eds, 'Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives', Wiley InterScience, West Sussex, England, pp. 73–84.
Hyvärinen, A., Khemakhem, I.
& Moriok a, H. (2023), ‘Nonlinear indep enden t comp onent analysis for principled disen tanglement in unsup ervised deep learning’, Patterns 4 (10). Jacot, A., Gabriel, F. & Hongler, C. (2018), ‘Neural tangent k ernel: Conv ergence and gen- eralization in neural net works’, A dvanc es in neur al information pr o c essing systems 31 . Kennedy , E. H. (2024), ‘Semiparametric doubly robust targeted double mac hine learning: a review’, Handb o ok of statistic al metho ds for pr e cision me dicine pp. 207–236. Kennedy , E. H., Ma, Z., McHugh, M. D. & Small, D. S. (2017), ‘Non-parametric methods for doubly robust estimation of contin uous treatmen t effects’, Journal of the R oyal Statistic al So ciety Series B: Statistic al Metho dolo gy 79 (4), 1229–1245. Kleib er, C., Zeileis, A. & Zeileis, M. A. (2020), ‘Pac k age ‘aer”, R p ackage version 1 (4). Koh, P . W. & Liang, P . (2017), Understanding blac k-b ox predictions via influence functions, in ‘International conference on mac hine learning’, PMLR, pp. 1885–1894. Kong, L., Xie, S., Y ao, W., Zheng, Y., Chen, G., Sto jano v, P ., Akin wande, V. & Zhang, K. (2022), Partial disen tanglement for domain adaptation, in ‘In ternational conference on mac hine learning’, PMLR, pp. 11455–11472. Levis, A. W., Bonvini, M., Zeng, Z., Keele, L. & Kennedy , E. H. (2025), ‘Co v ariate-assisted b ounds on causal effects with instrumental v ariables’, Journal of the R oyal Statistic al So ciety Series B: Statistic al Metho dolo gy p. qk af028. Levis, A. W., Kennedy , E. H. & Keele, L. (2024), ‘Nonparametric identification and efficient estimation of causal effects with instrumental v ariables’, arXiv pr eprint . Moran, G. & Aragam, B. (2026), ‘T ow ards interpretable deep generativ e mo dels via causal represen tation learning’, Journal of the Americ an Statistic al Asso ciation 0 (ja), 1–32. Nelsen, R. B. (2006), An intr o duction to c opulas , Springer. New ey , W. K. (2013), ‘Nonparametric instrumental v ariables estimation’, A meric an Ec o- nomic R eview 103 (3), 550–556. 34 Ng, I., Bl¨ obaum, P ., Bhandari, S., Zhang, K. & Kasiviswanathan, S. (2025), ‘Debiasing re- w ard models b y represen tation learning with guaran tees’, arXiv pr eprint . Ng, I., Xie, S., Dong, X., Spirtes, P . & Zhang, K. (2025), Causal represen tation learning from general environmen ts under nonparametric mixing, in ‘The 28th In ternational Conference on Artificial In telligence and Statistics’. Sc h¨ olk opf, B., Lo catello, F., Bauer, S., Ke, N. R., Kalch brenner, N., Goy al, A. & Bengio, Y. (2021), ‘T o ward causal represen tation learning’, Pr o c e e dings of the IEEE 109 (5), 612–634. Shah, R. D. & P eters, J. (2020), ‘The hardness of conditional independence testing and the generalised cov ariance measure’, The A nnals of Statistics 48 (3), 1514–1538. Sheng, T. & Sriperumbudur, B. K. (2023), ‘On distance and k ernel measures of conditional dep endence’, Journal of Machine L e arning R ese ar ch 24 (7), 1–16. Silv erman, B. W. (2018), Density estimation for statistics and data analysis , Routledge. Sw anson, S. A., Hern´ an, M. A., Miller, M., Robins, J. M. & Richardson, T. S. (2018), ‘P artial identification of the av erage treatment effect using instrumen tal v ariables: review of metho ds for binary instrumen ts, treatments, and outcomes’, Journal of the Americ an Statistic al Asso ciation 113 (522), 933–947. W ager, S. & A they , S. 
(2018), 'Estimation and inference of heterogeneous treatment effects using random forests', Journal of the American Statistical Association 113(523), 1228–1242.
Wu, P., Li, H., Zheng, C., Zeng, Y., Chen, J., Liu, Y., Guo, R. & Zhang, K. (2025), Learning counterfactual outcomes under rank preservation, in 'Advances in Neural Information Processing Systems'.
Xie, S., Huang, B., Gu, B., Liu, T. & Zhang, K. (2023), 'Advancing counterfactual inference through nonlinear quantile regression', arXiv preprint arXiv:2306.05751.
Technical Appendix
A Proof of Theorem 2
Proof. 1) Proof of the upper bound (U ≤ U_marg). Define the function g: R^2 → R as g(a, b) = min(a, b). The function g is the pointwise minimum of two linear (and thus concave) functions, f_1(a, b) = a and f_2(a, b) = b; therefore g is concave. By Jensen's inequality for concave functions, for any random vector Y we have E[g(Y)] ≤ g(E[Y]). Let the random vector be (θ_1(X), θ_0(X)). Applying the inequality,
E_X[g(θ_1(X), θ_0(X))] ≤ g(E_X[θ_1(X)], E_X[θ_0(X)]).
Substituting the definitions of g, θ_a, U, and U_marg,
E_X[min{θ_1(X), θ_0(X)}] = U(y_1, y_0) ≤ min{E_X[θ_1(X)], E_X[θ_0(X)]}.
Since F_{Y(a)}(y_a) = E_X[θ_a(X)], the right-hand side equals min{F_{Y(1)}(y_1), F_{Y(0)}(y_0)} = U_marg(y_1, y_0). Thus U(y_1, y_0) ≤ U_marg(y_1, y_0).
2) Proof of the lower bound (L_marg ≤ L). Define the function h: R^2 → R as h(a, b) = max(a + b − 1, 0). The function h is the pointwise maximum of two linear (and thus convex) functions, f_1(a, b) = a + b − 1 and f_2(a, b) = 0; therefore h is convex. By Jensen's inequality for convex functions, for any random vector Y we have E[h(Y)] ≥ h(E[Y]). Let the random vector again be (θ_1(X), θ_0(X)). Applying the inequality,
E_X[h(θ_1(X), θ_0(X))] ≥ h(E_X[θ_1(X)], E_X[θ_0(X)]).
Substituting the definitions of h and θ_a,
E_X[max{θ_1(X) + θ_0(X) − 1, 0}] = L(y_1, y_0) ≥ max{E_X[θ_1(X)] + E_X[θ_0(X)] − 1, 0},
where we used the linearity of expectation inside the max on the right-hand side. Since F_{Y(a)}(y_a) = E_X[θ_a(X)], the right-hand side equals max{F_{Y(1)}(y_1) + F_{Y(0)}(y_0) − 1, 0} = L_marg(y_1, y_0). Thus L(y_1, y_0) ≥ L_marg(y_1, y_0).
We have established L_marg ≤ L and U ≤ U_marg, which proves that the interval [L(y_1, y_0), U(y_1, y_0)] is a (potentially tighter) sub-interval of [L_marg(y_1, y_0), U_marg(y_1, y_0)]. Equality holds if and only if the conditional CDFs θ_a(X) are constant almost surely in X, i.e., the covariates X provide no additional information beyond the marginals.
B Proof for the efficient estimator of the conditional FH bound
We first derive the influence function (IF) for the conditional CDF functional θ_a(·), then use the chain rule to obtain the EIF for a smooth functional ϕ, and finally handle the nonsmooth min function using (i) smoothing and (ii) a direct subgradient argument under a margin condition.
B.1 Influence function components under regular conditions
B.1.1 EIF for θ_a(·) and for E[θ_a(X)]
Fix y ∈ R and a ∈ {0, 1}. Define the functional on the full distribution P,
Θ_a(P)(x) := θ_a(x) = P(Y ≤ y | A = a, X = x).
W e are interested in the scalar functional Ψ a ( P ) := E X [ θ a ( X )]. T o obtain the pathwise deriv ativ e (influence function) w e follo w the standard parametric-submo del / Gateaux- deriv ativ e route. Let { P ε : ε ∈ ( − ε 0 , ε 0 ) } b e any regular parametric submodel with score s = o (1) at ε = 0, i.e. dP ε /dP = 1 + εs + o ( ε ) with E P [ s ] = 0. W rite θ a,ε ( x ) for the conditional CDF under P ε . W e compute d dε ε =0 Ψ a ( P ε ) = d dε ε =0 E P ε θ a,ε ( X ) . Differen tiate using pro duct rule d dε E P ε [ θ a,ε ( X )] = E P ˙ θ a ( X ) + E P θ a ( X ) s ( O ) , 37 where ˙ θ a ( x ) := d dε ε =0 θ a,ε ( x ). W e now compute ˙ θ a ( x ) by differentiating the conditional CDF θ a,ε ( x ) = P ε ( Y ≤ y | A = a, X = x ) = E P ε 1 { A = a } 1 { Y ≤ y } | X = x P ε ( A = a | X = x ) . Differen tiate at ε = 0; denote π a ( x ) = P ( A = a | X = x ). Using quotien t rule we obtain ˙ θ a ( x ) = 1 π a ( x ) E P ( 1 { A = a } 1 { Y ≤ y } ) s ( O ) | X = x − θ a ( x ) π a ( x ) E P 1 { A = a } s ( O ) | X = x . (9) Multiply both sides b y the marginal densit y of X and in tegrate to get the Gateaux deriv ativ e of Ψ a . Rearranging and using standard score calculations yields that the influence function for Ψ a is IF a (Ψ a ( P )) = θ a ( X ) − E [ θ a ( X )] + 1 { A = a } π a ( X ) 1 { Y ≤ y } − θ a ( X ) . (10) W e can chec k E [IF a (Ψ a ( P ))] = 0 and that the pathwise deriv ative equals E [ s ( O )IF a (Ψ a ( P ))] for all scores s , which confirms (10) is the canonical gradient for Ψ a . B.1.2 Chain rule: EIF for smo oth ϕ Let ϕ : R 2 → R be con tin uously differen tiable. Define Ψ( P ) = E X ϕ θ 0 ( X ) , θ 1 ( X ) . Consider a parametric submo del P ε with score s and let θ a,ε ( x ) b e the conditional CDF under P ε . Differentiate d dε ε =0 Ψ( P ε ) = E P h 1 X a =0 ∂ a ϕ ( θ 0 ( X ) , θ 1 ( X )) ˙ θ a ( X ) i + E P ϕ ( θ 0 , θ 1 ) s ( O ) . Using (10) which gives the deriv ative of E [ θ a ( X )], we can rewrite the ab ov e as d dε ε =0 Ψ( P ε ) = E P s ( O ) · IF(Ψ( P )) , with IF(Ψ( P )) = 1 X a =0 ∂ a ϕ ( θ 0 ( X ) , θ 1 ( X )) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + ϕ ( θ 0 ( X ) , θ 1 ( X )) − Ψ( P ) . (11) 38 Again we chec k E P [IF(Ψ( P ))] = 0, and for an y score s , d dε Ψ( P ε ) | ε =0 = R (IF · s ) dP , verifying (11) is the canonical gradient. W e no w treat ϕ ( u, v ) = min { u, v } . Since min is not differen tiable on the diagonal u = v , w e consider tw o strategies: Direct Estimation under a Margin Condition or approximation via a smo oth function. B.2 Pro of of asymptotic prop ert y of direct estimator under mar- gin condition W e first prov e the statistical prop erties of direct estimator (Theorem 3) by introducing a p olynomial margin condition (Assumption 2.1). Pr o of. W e follo w the standard semi-parametric argumen t. Step 1. Oracle represen tation and pathwise deriv ativ e (canonical gradien t). Assume for the momen t the selector d ( x ) is known (“oracle” case). Then Ψ( P ) = E 1 { d ( X ) = 0 } θ 0 ( X ) + 1 { d ( X ) = 1 } θ 1 ( X ) . Let { P ε : ε } be a regular parametric submo del with score function s ( O ) at ε = 0 so that d dε ε =0 log dP ε dP = s ( O ) , E [ s ( O )] = 0 . Differen tiate Ψ( P ε ) using pro duct rule and the conditional structure. W e compute the deriv a- tiv e of each term. Fix a ∈ { 0 , 1 } . 
F or the term E [ 1 { d ( X ) = a } θ a,ε ( X )] we hav e d dε ε =0 E P ε [ 1 { d ( X ) = a } θ a,ε ( X )] = E 1 { d ( X ) = a } ˙ θ a ( X ) + E 1 { d ( X ) = a } θ a ( X ) s ( O ) , where ˙ θ a ( x ) = d dε ε =0 θ a,ε ( x ). Using deriv ed equation 9 for ˙ θ a ( x ) ˙ θ a ( x ) = 1 π a ( x ) E ( 1 { A = a } 1 { Y ≤ y } ) s ( O ) | X = x − θ a ( x ) π a ( x ) E 1 { A = a } s ( O ) | X = x , and in tegrating the pathwise deriv ative of Ψ along the score s d dε ε =0 Ψ( P ε ) = E s ( O ) IF oracle (Ψ( P ; d )) , w e obtain the (cen tered) oracle influence function IF oracle (Ψ( P ; d )) = 1 X a =0 1 { d ( X ) = a } 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + θ a ( X ) − Ψ( P ) . (12) 39 W e can c heck E [IF oracle ] = 0. Thus (12) is the canonical gradien t under the oracle selector. Step 2. F easible estimator and v on-Mises decomp osition. In practice d ( x ) is unkno wn; replace it by the plug-in selector in a separated indep endent data (by data splitting or cross-fitting) ˆ d ( x ) = arg min a ˆ θ a ( x ) . W e denote the uncentered influence function corresp onding to IF oracle + Ψ( P ) as φ ( O ; P , d ) = 1 X a =0 1 { d ( X ) = a } 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + θ a ( X ) , plug-in ev aluated at true selector φ ( O ; ˆ P , d ) = X a 1 { d ( X ) = a } ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) , and φ ( O ; ˆ P , ˆ d ) denotes the same expression with ˆ d in place of d , then define the feasible doubly-robust estimator ˆ Ψ = P n [ φ ( O ; ˆ P , ˆ d )] = P n h 1 X a =0 1 { ˆ d ( X ) = a } ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) i . (13) W e compare ˆ Ψ to Ψ b y adding and subtracting the oracle influence function ˆ Ψ − Ψ( P ) = P n [ φ ( O ; ˆ P , ˆ d )] − P [ φ ( O ; P , d )] (14) = ( P n − P ) φ ( O ; P , d ) + ( P n − P )[ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d )] + P [ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d )] , = S + R 1 + R 2 , W e b ound R 2 b y separating nuisance and selector errors P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) i = P h φ ( O ; ˆ P , d ) − φ ( O ; P , d ) i | {z } nuisance error B nuis + P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) i | {z } selector error B sel . Step 3. Bound for the nuisance remainder B nuis . W e first aim to b ound B nuis = P h φ ( O ; ˆ P , d ) − φ ( O ; P , d ) i . Recall that P [ · ] denotes the expectation E O [ · ]. W e use the la w of iterated exp ectations, E O [ · ] = E X E A,Y | X [ · | X ] , and first compute the inner conditional exp ectation. 40 F or a ∈ { 0 , 1 } , let ψ a ( X ; ˆ P ) b e the conditional exp ectation of the a -th comp onen t ψ a ( X ; ˆ P ) := E A,Y | X h φ ( O ; ˆ P , a ) | X i = E A,Y | X ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) | X = ˆ θ a ( X ) + E A | X [ 1 { A = a } | X ] ˆ π a ( X ) E Y | A,X [ 1 { Y ≤ y } | A = a, X ] − ˆ θ a ( X ) = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) . The corresp onding conditional exp ectation ψ a ( X ; P ) ev aluated at the true parameters P is ψ a ( X ; P ) := E A,Y | X [ φ ( O ; P , a ) | X ] = θ a ( X ) + π a ( X ) π a ( X ) ( θ a ( X ) − θ a ( X )) = θ a ( X ) . No w w e can rewrite B nuis b y substituting these conditional exp ectations B nuis = E X " E A,Y | X " 1 X a =0 1 { d ( X ) = a } φ ( O ; ˆ P , a ) − φ ( O ; P , a ) | X ## = E X " 1 X a =0 1 { d ( X ) = a } ψ a ( X ; ˆ P ) − ψ a ( X ; P ) # . 
W e compute the difference ψ a ( X ; ˆ P ) − ψ a ( X ; P ) ψ a ( X ; ˆ P ) − ψ a ( X ; P ) = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) ( θ a ( X ) − ˆ θ a ( X )) − θ a ( X ) = ( ˆ θ a ( X ) − θ a ( X )) − π a ( X ) ˆ π a ( X ) ( ˆ θ a ( X ) − θ a ( X )) = ( ˆ θ a ( X ) − θ a ( X )) 1 − π a ( X ) ˆ π a ( X ) = ( ˆ θ a ( X ) − θ a ( X )) ˆ π a ( X ) − π a ( X ) ˆ π a ( X ) . This iden tity shows that the n uisance remainder is a pro duct of the estimation errors in θ a and π a . This is the k ey “Neyman-Orthogonal” or “Doubly-Robust” structure, whic h ensures the first-order (linear) error terms cancel exactly . Substituting this bac k in to the expression for B nuis yields B nuis = E X " 1 X a =0 1 { d ( X ) = a } ( ˆ θ a ( X ) − θ a ( X )) ˆ π a ( X ) − π a ( X ) ˆ π a ( X ) # . 41 W e no w b ound this remainder. Assuming the estimators are b ounded inf x ˆ π a ( x ) ≥ π > 0, b y Cauc hy-Sc hw arz, w e ha ve | B nuis | ≤ E X " 1 X a =0 1 { d ( X ) = a } ˆ θ a − θ a · ˆ π a − π a ˆ π a # ≲ 1 X a =0 E X h ˆ θ a ( X ) − θ a ( X ) · | ˆ π a ( X ) − π a ( X ) | i ≤ 1 X a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) . Th us, w e ha ve the tigh t, second-order b ound | B nuis | = O p 1 X a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) . (15) This b ound sho ws that B nuis = o p ( n − 1 / 2 ) under the pro duct-rate condition ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ). Step 4. Bound for the selector remainder B sel . The selector remainder arises b ecause w e use ˆ d instead of d . Note φ ( O ; ˆ P , a ) = ˆ θ a ( X ) + 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )). F or brevit y we write ∆ d ( X ) := 1 { ˆ d ( X ) = d ( X ) } ∈ { 0 , 1 } , ∆ φ ( O ; ˆ P ) = φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0), ∆( X ) = θ 1 ( X ) − θ 0 ( X ) and similarly ˆ ∆( X ) = ˆ θ 1 ( X ) − ˆ θ 0 ( X ). Using the p oint wise iden tity φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) = X a ∈{ 0 , 1 } [ 1 ( ˆ d ( X ) = a ) φ ( O ; ˆ P , 1)] − X a ∈{ 0 , 1 } [ 1 ( d ( X ) = a ) φ ( O ; ˆ P , 1)] = 1 { b d ( X ) = 1 } − 1 { d ( X ) = 1 } · φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0) = ∆ d ( X ) sgn( ˆ d ( X ) − d ( X )) · ∆ φ ( O ; ˆ P ) (16) Recall ψ a ( X ; ˆ θ , ˆ π ) := E A,Y | X h φ ( O ; ˆ P , a ) | X i = ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) . Apply the La w of Iterated Exp ectations and substitute this into the expression for B sel , we hav e B sel = P h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) i = E X h E A,Y | X h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) | X ii = E X h ∆ d ( X ) sgn( ˆ d ( X ) − d ( X )) · ( ψ 1 ( X ; ˆ θ , ˆ π ) − ψ 0 ( X ; ˆ θ , ˆ π )) i ≤ E X h 1 { ˆ d = d } · | M ∗ ( X ) | i , 42 where M ∗ ( X ) = ψ 1 ( X ; ˆ θ , ˆ π ) − ψ 0 ( X ; ˆ θ , ˆ π ) = E A,Y | X [∆ φ ( O ; ˆ P )]. Assume the prop ensity estimators are uniformly b ounded aw ay from zero, inf x min a ˆ π a ( x ) ≥ π > 0, and note | π a | ≤ 1, | θ a | ≤ 1 and | ˆ θ a | ≤ 1. Observe that point wise | ψ a | ≤ max a ˆ θ a ( X ) + π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) ≤ 1 + 1 π , so | M ∗ | is uniformly in tegrable and sup X | M ∗ | < ∞ . Thus, | B sel | ≤ P ( ˆ d ( X ) = d ( X )) · sup | M ∗ ( X ) | . Note that ˆ d ( X ) = d ( X ) implies that the sign of ∆( X ) is flipp ed by estimation error large enough to o vercome the gap. Indeed, { ˆ d ( X ) = d ( X ) } ⊆ n | ∆( X ) | ≤ | ˆ ∆( X ) − ∆( X ) | o = n | ∆( X ) | ≤ ∥ ˆ ∆( X ) − ∆( X ) ∥ ∞ o . 
Therefore, by Assumption 2.1 (polynomial margin of exp onent α ), P ˆ d ( X ) = d ( X ) ≤ P | ∆( X ) | ≤ ∥ ˆ ∆( X ) − ∆( X ) ∥ ∞ ≲ ∥ ˆ ∆ − ∆ ∥ α ∞ . W e kno w ∥ ˆ ∆ − ∆ ∥ ∞ ≤ ∥ ˆ θ 0 − θ 0 ∥ ∞ + ∥ ˆ θ 1 − θ 1 ∥ ∞ ≤ 2 max a ∥ ˆ θ a − θ a ∥ ∞ . Hence P ( ˆ d = d ) ≲ max a ∥ ˆ θ a − θ a ∥ ∞ α . W e then decompose | M ∗ ( X ) | = ˆ θ 1 + π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) − ˆ θ 0 + π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) = ( ˆ θ 1 − ˆ θ 0 ) + π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) − π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) = | ˆ ∆( X ) + R IPW ( X ) | = | ∆( X ) + ( ˆ ∆( X ) − ∆( X )) + R IPW ( X ) | ≤ | ∆( X ) | + | ( ˆ ∆( X ) − ∆( X )) | + | R IPW ( X ) | . On the ev ent ˆ d = d , w e kno w | ∆( X ) | ≤ | ˆ ∆( X ) − ∆( X ) | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . Also, | R IPW ( X ) | ≤ π 1 ˆ π 1 ( θ 1 − ˆ θ 1 ) + π 0 ˆ π 0 ( θ 0 − ˆ θ 0 ) ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . Hence w e hav e sup | M ∗ ( X ) | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ . 43 Inserting the margin bound for P ( ˆ d = d ) yields | B sel | ≲ max a ∥ ˆ θ a − θ a ∥ ∞ 1+ α . (17) This is the k ey selector-bias b ound: the cost of using the plug-in selector is controlled by a (1 + α ) p ow er of the sup-norm estimation error. Step 5. Bound the empirical pro cess term W e no w b ound the empirical process term R 1 = ( P n − P ) φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) . T o a void the need for Donsker conditions, w e assume sample splitting: the n uisance estimates ( ˆ θ a , ˆ π a ) and selector ˆ d are trained on an auxiliary sample indep enden t of the one used to ev aluate P n . Conditional on this training sample, φ ( O ; ˆ P , ˆ d ) is a fixed measurable function of O , so that standard empirical pro cess inequalities apply . By standard empirical pro cess argumen t, conditional on the indep endent training sample w e ha ve E | R 1 | | ˆ P , ˆ d ≲ 1 √ n ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) . Therefore, it suffices to con trol the L 2 ( P )–distance ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) = o p (1). Similarly , ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; P , d ) ∥ L 2 ( P ) ≤ ∥ φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) ∥ L 2 ( P ) + ∥ φ ( O ; ˆ P , d ) − φ ( O ; P , d ) ∥ L 2 ( P ) =: T sel + T nuis . Bound for T nuis . This term corresp onds to p erturbations in the nuisance functions with selector fixed. Similar to expansion as in Step 3 (see Eq. (15)) but no w in L 2 ( P ) norm rather than L 1 ( P ), eac h component is a sum of n uisance estimation errors. φ ( O ; ˆ P , a ) − φ ( O ; P , a ) = ˆ θ a + 1 { A = a } ˆ π a ( 1 { Y ≤ y } − ˆ θ a ) − θ a + 1 { A = a } π a ( 1 { Y ≤ y } − θ a ) =( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } 1 { Y ≤ y } − θ a ˆ π a − 1 { Y ≤ y } − θ a π a =( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } ( 1 { Y ≤ y } − θ a ) π a − ˆ π a ˆ π a π a 44 Then T nuis = 1 X a =0 1 { d ( X ) = a } φ ( O ; ˆ P , a ) − φ ( O ; P , a ) L 2 ( P ) ≤ 1 X a =0 ( ˆ θ a − θ a ) 1 − 1 { A = a } ˆ π a + 1 { A = a } ( 1 { Y ≤ y } − θ a ) π a − ˆ π a ˆ π a π a L 2 ( P ) ≲ 1 X a =0 | ˆ θ a − θ a | L 2 ( P ) + | ˆ π a − π a | L 2 ( P ) = o p (1) , (18) when both n uisance estimators are consisten t | ˆ θ a − θ a | L 2 ( P ) = o p (1) and | ˆ π a − π a | L 2 ( P ) = o p (1). Bound for T sel . 
Using the p oin twise iden tity equation 16, since it is supp orted only on the set { b d ( X ) = d ( X ) } , taking exp ectation giv es T 2 sel = E h φ ( O ; ˆ P , ˆ d ) − φ ( O ; ˆ P , d ) 2 i = E ∆ d ( X ) 2 ∆ φ ( O ; ˆ P ) 2 = E X h ∆ d ( X ) · E A,Y | X h ∆ φ ( O ; ˆ P ) 2 | X ii , where ∆ d ( X ) := 1 { ˆ d ( X ) = d ( X ) } ∈ { 0 , 1 } , ∆ φ ( O ; ˆ P ) = φ ( O ; ˆ P , 1) − φ ( O ; ˆ P , 0). Recall ∆ φ ( O ; ˆ P ) is bounded when ˆ π a ≥ π > 0. Consequently , T sel ≲ P ( ˆ d = d ) 1 / 2 ≲ max a ∥ ˆ θ a − θ a ∥ ∞ α/ 2 = o p (1) . (19) Com bining. Substituting (18) and (19) into the empirical pro cess b ound yields R 1 = o p ( n − 1 / 2 ) . under the mild condition ∥ ˆ θ a − θ a ∥ = o p (1) and ∥ ˆ π a − π a ∥ = o p (1). Step 6. Asymptotic linearit y and normality . Com bining Steps 3–5, all remainders R 1 and R 2 are o p ( n − 1 / 2 ), and th us the estimator ˆ Ψ in (13) admits the asymptotic linear representation √ n ( ˆ Ψ − Ψ) = 1 √ n n X i =1 φ ( O i ; P , d ) + o p (1) , with influence function giv en in (12). Since IF oracle (Ψ( P ; d )) has finite v ariance (it is a b ounded com bination of bounded terms under our assumptions), classical CL T giv es √ n ( ˆ Ψ − Ψ) d → N 0 , V ar( φ ( O ; P , d ) . 45 A consistent v ariance estimator is ˆ σ 2 = P n φ ( O ; ˆ P , ˆ d ) − ˆ Ψ 2 . Under the same remainder conditions one verifies ˆ σ 2 p → V ar( φ ( O ; P , d )). B.3 Pro of of asymptotic prop ert y of smo oth-appro ximation esti- mator W e no w pro ve the statistical prop erties of smooth-approximation estimator with a fixed smo oth parameter t (Theorem 4). Pr o of. The pro of follows standard semiparametric argumen t as well. Step 1. Smo oth approximation and differen tiabilit y . Define, for t > 0, g t ( u, v ) := − 1 t log e − tu + e − tv , ( u, v ) ∈ [0 , 1] 2 . Then g t ( u, v ) is smo oth and satisfies min( u, v ) − log 2 t ≤ g t ( u, v ) ≤ min( u, v ) , and lim t →∞ g t ( u, v ) = min( u, v ) . W e appro ximate the functional Ψ( P ) = E [min( θ 0 ( X ) , θ 1 ( X ))] by Ψ t ( P ) = E g t ( θ 0 ( X ) , θ 1 ( X )) , whic h is con tin uously Gateaux-differentiable in P for any finite t . As t → ∞ , Ψ t ( P ) ↑ Ψ( P ). Step 2. P ath wise deriv ative and canonical gradien t. Our target parameter is Ψ t ( P ) = E [ g t ( θ 0 ( X ) , θ 1 ( X ))]. Let P ε b e a regular parametric submo del with score s ( O ) at ε = 0, the path wise deriv ative of Ψ t ( P ) is d dε Ψ t ( P ε ) ε =0 = d dε E ε [ g t ( θ 0 ,ε ( X ) , θ 1 ,ε ( X ))] ε =0 Using previously deriv ed EIF for smo oth functional (equation 11), w e ha v e IF(Ψ t ( P )) = 1 X a =0 w a,t ( X ) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) − Ψ t ( P ) , (20) where w a,t ( x ) = ∂ a g t ( θ 0 ( x ) , θ 1 ( x )) = e − tθ a ( x ) e − tθ 0 ( x ) + e − tθ 1 ( x ) are smo oth functions b ounded in [0 , 1] with P 1 a =0 w a,t ( x ) = 1, interpreted as a smo oth w eigh ting function betw een θ 0 ( x ) and θ 1 ( x ). 46 Step 3. Doubly-robust estimator. W e denote φ t ( O ; P ) the uncen tered influence func- tion φ t ( O ; P ) = 1 X a =0 w a,t ( X ) 1 { A = a } π a ( X ) ( 1 { Y ≤ y } − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) , and φ t ( O ; ˆ P ) for using estimated n uisance parameters ˆ θ a ( X ) and ˆ π a ( X ) obtained on an indep enden t training sample. Define ˆ Ψ t = P n φ t ( O ; ˆ P ) = P n h 1 X a =0 ˆ w a,t ( X ) 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) i , (21) where ˆ w a,t ( X ) = e − t ˆ θ a ( X ) e − t ˆ θ 0 ( X ) + e − t ˆ θ 1 ( X ) . 
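For concreteness, here is a minimal numerical sketch (not part of the original derivation) of how the smoothed doubly robust estimator in (21) could be computed from cross-fitted nuisance estimates. All array names are illustrative, and the propensity clipping threshold is an assumption made for numerical stability.

```python
import numpy as np

def smooth_min_estimate(theta0, theta1, pi0, pi1, A, Y_ind, t=20.0):
    """Smoothed doubly robust estimate of E[min(theta0(X), theta1(X))], mirroring eq. (21).

    theta0, theta1 : cross-fitted estimates of P(Y <= y | A=a, X) at each evaluation point
    pi0, pi1       : cross-fitted propensity estimates P(A=a | X)
    A              : observed binary treatments
    Y_ind          : indicators 1{Y <= y}
    t              : smoothing parameter of the log-sum-exp approximation
    """
    # g_t(u, v) = -(1/t) * log(exp(-t u) + exp(-t v))
    g_t = -np.logaddexp(-t * theta0, -t * theta1) / t
    # smooth weights w_{a,t}(X) = exp(-t theta_a) / (exp(-t theta_0) + exp(-t theta_1))
    w1 = 1.0 / (1.0 + np.exp(-t * (theta0 - theta1)))
    w0 = 1.0 - w1
    # doubly robust correction terms for each treatment arm
    corr0 = w0 * (A == 0) / np.clip(pi0, 1e-3, None) * (Y_ind - theta0)
    corr1 = w1 * (A == 1) / np.clip(pi1, 1e-3, None) * (Y_ind - theta1)
    scores = g_t + corr0 + corr1
    est = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))   # plug-in standard error
    return est, se
```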
W e will then establish the rate conditions when ˆ Ψ t is ro ot- n consistent and asymptotic normalit y . Step 4. von Mises expansion and remainder decomp osition. Let ˆ η denote estimated n uisances from the indep enden t sample. Then ˆ Ψ t − Ψ t ( P ) = ( P n − P ) φ t ( O ; P ) + P [ φ t ( O ; ˆ P ) − φ t ( O ; P )] | {z } B nuis + ( P n − P )[ φ t ( O ; ˆ P ) − φ t ( O ; P )] | {z } R n , (22) where φ t ( O ; ˆ P ) is the uncentered plug-in influence function with ˆ η . Since the training sample used for ˆ η is indep endent of P n , conditional on ˆ η the term R n is a mean-zero empirical pro cess. W e treat B nuis and R n separately . Step 5. Con trol of B nuis . The n uisance remainder is B nuis = P [ φ t ( O ; ˆ P ) − φ t ( O ; P )]. W e use the La w of Iterated Exp ectations E O [ · ] = E X [ E A,Y | X [ · | X ]]. F or the a -th comp onen t of 47 the uncentered IF φ t ( O , ˆ P ), the conditional exp ectation is E [ φ t ( O , ˆ P ) | X ] = E " 1 X a =0 n ˆ w a,t ( X ) 1 { A = a } ˆ π a ( X ) ( 1 { Y ≤ y } − ˆ θ a ( X )) o + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) X # = 1 X a =0 ˆ w a,t ( X ) E [ 1 { A = a } | X ] ˆ π a ( X ) E [ 1 { Y ≤ y } | A = a, X ] − ˆ θ a ( X ) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) = 1 X a =0 ˆ w a,t ( X ) π a ( X ) ˆ π a ( X ) θ a ( X ) − ˆ θ a ( X ) + g t ( ˆ θ 0 ( X ) , ˆ θ 1 ( X )) . Similarly , the conditional exp ectation of the φ t ev aluated at the true parameters P is E [ φ t ( O ; P ) | X ] = 1 X a =0 w a,t ( X ) π a ( X ) π a ( X ) ( θ a ( X ) − θ a ( X )) + g t ( θ 0 ( X ) , θ 1 ( X )) = g t ( θ 0 ( X ) , θ 1 ( X )) . Therefore, B nuis is given by the exp ectation of the difference in conditional means B nuis = E X h E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] i , where E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] = " 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) + g t ( ˆ θ 0 , ˆ θ 1 ) # − g t ( θ 0 , θ 1 ) = 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) − h g t ( θ 0 , θ 1 ) − g t ( ˆ θ 0 , ˆ θ 1 ) i . No w, w e use the first-order T aylor expansion for g t ( θ 0 , θ 1 ) around ˆ θ , g t ( θ 0 , θ 1 ) − g t ( ˆ θ 0 , ˆ θ 1 ) = 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) + R θ , where R θ is a quadratic remainder term | R θ ( X ) | ≲ t · ∥ ˆ θ ( X ) − θ ( X ) ∥ 2 2 = t · P 1 a =0 ( ˆ θ a ( X ) − θ a ( X )) 2 , since ∂ 2 g t ∂ θ 2 0 = − t · e − tθ 0 e − tθ 1 ( e − tθ 0 + e − tθ 1 ) 2 = − t · w 0 ,t · w 1 ,t . Substituting this expansion bac k E [ φ t ( O ; ˆ P ) | X ] − E [ φ t ( O ; P ) | X ] = 1 X a =0 ˆ w a,t π a ˆ π a ( θ a − ˆ θ a ) − " 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) + R θ # = 1 X a =0 ˆ w a,t ( θ a − ˆ θ a ) π a ˆ π a − 1 − R θ = 1 X a =0 ˆ w a,t ( ˆ θ a − θ a ) ˆ π a − π a ˆ π a − R θ . 48 The first term is the key second-order in teraction term. The remainder R θ is O p ( t ∥ ˆ θ − θ ∥ 2 L 2 ( P ) ). The term ˆ w a,t / ˆ π a is b ounded almost surely by sup | ˆ w a,t | / inf ˆ π a ≤ 1 /π . Th us, B nuis | B nuis | ≲ 1 X a =0 E X h | ˆ θ a ( X ) − θ a ( X ) | · | ˆ π a ( X ) − π a ( X ) | i + O p t · ∥ ˆ θ − θ ∥ 2 L 2 ( P ) . Applying the Cauch y-Sch warz inequalit y , the first term is b ounded by P 1 a =0 ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) . W e get | B nuis | ≲ O p ( ∥ ˆ θ − θ ∥ L 2 ( P ) · ∥ ˆ π − π ∥ L 2 ( P ) + t ∥ ˆ θ − θ ∥ 2 L 2 ( P ) ) . 
(23) F or a fixed t , under condition ∥ ˆ θ a − θ a ∥ L 2 ( P ) ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ) , ∥ ˆ θ a − θ a ∥ 2 L 2 ( P ) = o p ( n − 1 / 2 ) , for example ∥ ˆ θ a − θ a ∥ L 2 ( P ) = o p ( n − 1 / 4 ), ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 4 ), w e conclude that the n uisance remainder term v anishes faster than the √ n rate B nuis = o p ( n − 1 / 2 ). Step 6. Bound for Empirical Pro cess Remainder R n The empirical process re- mainder is R n = ( P n − P )[ φ t ( O ; ˆ P ) − φ t ( O ; P )]. Let the difference function b e ∆ φ t ( O ) = φ t ( O ; ˆ P ) − φ t ( O ; P ). W e aim to show √ nR n = o p (1). Since the n uisance parameters ˆ η are obtained on a sample indep enden t of the ev aluation sample used for P n (sample splitting), w e use the conditional √ n concentration b ound E [ R 2 n | ˆ η ] ≤ 1 n E (∆ φ t ( O )) 2 | ˆ η = 1 n ∥ ∆ φ t ∥ 2 L 2 ( P ) , where ∥ · ∥ L 2 ( P ) denotes the L 2 ( P ) norm of the function ∆ φ t ( O ) giv en ˆ η . The function φ t ( O ; η ) is a smo oth function of η = ( θ 0 , θ 1 , π 0 , π 1 ). Giv en the assumptions that θ a and ˆ θ a are b ounded in [0 , 1] and π a and ˆ π a are b ounded aw ay from zero (i.e., inf ˆ π a > π > 0 a.s.), φ t is lo cally Lipsc hitz con tinuous in η | ∆ φ t ( O ) | ≲ 1 X a =0 sup η ∂ φ t ∂ θ a · | ˆ θ a − θ a | + 1 X a =0 sup η ∂ φ t ∂ π a · | ˆ π a − π a | ! ≲ O ( t ) · 1 X a =0 | ˆ θ a ( X ) − θ a ( X ) | + O (1) · 1 X a =0 | ˆ π a ( X ) − π a ( X ) | ! · K ( O ) 49 where K ( O ) is a b ounded function dep ending on A, Y and the b ounding constants for ˆ π a . Squaring the difference and taking the exp ectation P ∥ ∆ φ t ∥ 2 L 2 ( P ) = E O (∆ φ t ( O )) 2 ≲ t 2 ∥ ˆ θ − θ ∥ 2 L 2 ( P ) + ∥ ˆ π − π ∥ 2 L 2 ( P ) . Substituting this b ound bac k in to the v ariance of R n E [ R 2 n | ˆ η ] ≲ 1 n h t 2 ∥ ˆ θ − θ ∥ 2 L 2 ( P ) + ∥ ˆ π − π ∥ 2 L 2 ( P ) i . T aking the square ro ot and applying the Mark ov inequality yields R n = O p n − 1 / 2 h t ∥ ˆ θ − θ ∥ L 2 ( P ) + ∥ ˆ π − π ∥ L 2 ( P ) i . Then for a fixed t , we hav e R n = o p ( n − 1 / 2 ) under the condition ∥ ˆ θ − θ ∥ L 2 ( P ) = o p (1), ∥ ˆ π − π ∥ L 2 ( P ) = o p (1). Step 7. Asymptotic linearit y and normality . Com bining the b ounds abov e, under ∥ ˆ θ a − θ a ∥ L 2 ( P ) · ∥ ˆ π a − π a ∥ L 2 ( P ) = o p ( n − 1 / 2 ) , , we obtain the asymptotic expansion ˆ Ψ t − Ψ t ( P ) = ( P n − P ) φ t ( O ; P ) + o p ( n − 1 / 2 ) , and hence the cen tral limit theorem √ n ( ˆ Ψ t − Ψ t ( P )) d → N (0 , V ar( φ t ( O ; P ))) . A consistent v ariance estimator is ˆ σ 2 t = P n φ t ( O ; ˆ P ) − ˆ Ψ t 2 . C Pro of of iden tifabilit y of triple matc hing learning estimator (Theorem 5). W e pro ve the iden tifiability results in Theorem 5. Pr o of. W e first establish the identifabilit y of confounding represen tation Z C , and then use learned Z C to identify p otential outcomes Y ( a ). 50 Step 1: Iden tifiabilit y of the Confounding Subspace. Let the true data generating pro cess b e defined by the structural equations A = g A ( Z S ) and Y = g Y ( A, Z C ). W e do not assume g A is injective, but w e assume that for each fixed a , the conditional distribution p ( Y | A = a, Z C = · ) is injectiv e with resp ect to Z C (Assumption 3.1). (Note that if Y is deterministic, this reduces to g Y ( a, · ) b eing an injective function). Let ˆ e : A × Y → Z b Z C b e the learned enco der, defining b Z C = ˆ e ( A, Y ). Define the composite map ψ ( z c , z s ) := ˆ e ( g A ( z s ) , g Y ( g A ( z s ) , z c )) . 
W e emplo y the following regularity conditions (i) Indep endence Constraint: The learned represen tation satisfies b Z C ⊥ S . (ii) Predictiv e Sufficiency: The learned represen tation is sufficient for predicting Y from A . That is, Z C pro vides no additional information ab out Y once b Z C is known Y ⊥ ⊥ Z C | ( A, b Z C ) almost surely . In terms of densities, this implies p ( Y | A, b Z C , Z C ) = p ( Y | A, b Z C ) almost surely . (iii) Sufficien t V ariabilit y: As in Assumption 3.3, the family of conditional distributions { p ( Z S | Z C = z c , S = s ) } s is b oundedly complete. 1(a) F unctional independence ( b Z C dep ends only on Z C ). Fix an y measurable set U ⊆ Z b Z C and let D = ψ − 1 ( U ) b e its preimage. By condition (i), P ( ψ ( Z ) ∈ U | S = s ) is in v arian t to s . Using the factorization p ( z | s ) = p ( z c ) p ( z s | z c , s ), we obtain the identit y Z p ( z c ) p ( z s | z c , s 1 ) − p ( z s | z c , s 2 ) 1 D ( z c , z s ) , dz s dz c = 0 . (24) Supp ose for con tradiction that ψ dep ends on z s . Then D is not a Cartesian pro duct almost surely . Let B ∗ = { ( z c , z s ) ∈ D : { z c } × Z S ⊆ D } b e the en tangled region, whic h has p ositiv e measure. By the completeness condition (iii), the in tegral o v er this non-pro duct region cannot v anish for all s 1 , s 2 , contradicting (24). Th us, D = B × Z S almost surely , implying ψ ( z c , z s ) is constan t in z s . Hence, there exists a measurable map ϕ : Z C → Z b Z C suc h that b Z C = ϕ ( Z C ) almost surely . 51 1(b) Injectivit y via Predictiv e Sufficiency . W e show ϕ is injectiv e by contradiction. Supp ose ϕ is not injectiv e. Then there exist disjoint sets Z 1 , Z 2 ⊂ Z C with p ositiv e measure suc h that ϕ ( Z 1 ) = ϕ ( Z 2 ) = ˆ z . Fix a treatment a ∈ A . By the injectivit y of the generativ e mec hanism (Assumption 3.1), the outcome distributions conditioned on distinct latent v alues m ust differ. Th us, for any z 1 ∈ Z 1 and z 2 ∈ Z 2 , p ( Y | A = a, Z C = z 1 ) = p ( Y | A = a, Z C = z 2 ) . (Note: If Y is deterministic, these are distinct Dirac measures δ y 1 = δ y 2 ). No w consider the distribution conditioned on b Z C = ˆ z . By the Predictive Sufficiency condition (ii), w e ha v e conditional indep endence Y ⊥ Z C | ( A, b Z C ). This implies that p ( Y | A = a, b Z C = ˆ z , Z C = z ) = p ( Y | A = a, b Z C = ˆ z ) for almost all z in the fib er ϕ − 1 ( ˆ z ). Ho w ever, the left-hand side p ( Y | A = a, b Z C = ˆ z , Z C = z ) simplifies b ecause A and Z C fully determines the true conditional distribution of Y . Thus, for z 1 ∈ Z 1 and z 2 ∈ Z 2 , we m ust ha ve p ( Y | A = a, Z C = z 1 ) = p ( Y | A = a, b Z C = ˆ z ) = p ( Y | A = a, Z C = z 2 ) . This equates t w o distributions that are kno wn to b e distinct (due to injectivity), yielding a contradiction. Therefore, ϕ m ust b e injective almost surely . Under standard regularit y conditions (contin uity and matching dimensions), ϕ is an in vertible transformation. W e ha v e established that b Z C = ϕ ( Z C ) where ϕ is in v ertible, whic h implies σ ( b Z C ) = σ ( Z C ). This completes the pro of of subspace iden tifiability of Z C . Step 2: Back-door Criterion. Given the structural equations in Assumption 3.1: A = g A ( Z S ) , Y = g Y ( A, Z C ) , Z S = h ( Z C , S, ε S ) . The only common cause of A and Y is Z C (mediated through Z S to A ). Z C blo c ks all back- do or paths from A to Y . Sp ecifically , the p oten tial outcome Y ( a ) is determined b y g Y ( a, Z C ). 
Since Z C encapsulates all confounding information, we hav e conditional exchangeabilit y Y ( a ) ⊥ ⊥ A | Z C . 52 Step 3: Replacement with Iden tified Represen tation. F rom Step 1, we established that b Z C = ψ ( Z C ) where ψ is inv ertible. Th us, σ ( b Z C ) = σ ( Z C ), and conditioning on b Z C is statistically equiv alent to conditioning on Z C . Therefore, Y ( a ) ⊥ ⊥ A | b Z C . Step 4: Identification F ormula. W e explicitly deriv e the iden tification of E [ Y ( a )]. By the Law of Iterated Expectations and the indep endence sho wn in Step 3 E [ Y ( a )] = E b Z C E [ Y ( a ) | b Z C ] = E b Z C E [ Y ( a ) | A = a, b Z C ] (b y ignorabilit y Y ( a ) ⊥ ⊥ A | b Z C ) = E b Z C E [ Y | A = a, b Z C ] (b y consistency Y ( a ) = Y when A = a ) . Assumption 3.4 (Positivit y) guaran tees that P ( A = a | b Z C ) > 0 (since b Z C is isomorphic to Z C ), ensuring the conditional exp ectation E [ Y | A = a, b Z C ] is well-defined. Similarly , the A TE is iden tified as A TE( a, a ′ ) = E b Z C h E [ Y | A = a, b Z C ] − E [ Y | A = a ′ , b Z C ] i . This completes the proof. D Pro of of statistical prop erties of triple mac hine learn- ing estimator (Theorem 6) Pr o of. Let the total sample size b e N . W e randomly partition the data D into three disjoint folds I 1 , I 2 , I 3 , each of size n = N / 3. The estimator is constructed sequentially 1. Stage 1 (Represen tation Learning on I 1 ): Construct b ϕ using only data in I 1 . Th us b ϕ ⊥ ⊥ ( I 2 ∪ I 3 ). 2. Stage 2 (Nuisance Estimation on I 2 ): Using b ϕ and data I 2 , estimate b m and b π . Let b η = ( b m, b π , b ϕ ). Crucially , b η ⊥ ⊥ I 3 . 3. Stage 3 (Ev aluation on I 3 ): Compute the estimator on I 3 b ψ = P n, 3 [ φ ( O ; b η )] = 1 n X i ∈I 3 φ ( O i ; b η ) . 53 W e decomp ose the estimation error √ n ( b ψ − ψ 0 ) √ n ( b ψ − ψ 0 ) = √ n ( P n, 3 φ ( b η ) − P φ ( η 0 )) = √ n ( P n, 3 − P ) φ ( η 0 ) | {z } T 1 :Oracle CL T + √ n ( P n, 3 − P )( φ ( b η ) − φ ( η 0 )) | {z } T 2 :Empirical Pro cess + √ nP ( φ ( b η ) − φ ( η 0 )) | {z } T 3 :Bias T erm . (25) Step 1: Oracle CL T ( T 1 ). The term φ ( O ; η 0 ) is a fixed function. Since observ ations in I 3 are i.i.d., b y the standard CL T T 1 d − → N (0 , σ 2 eff ) , where σ 2 eff = V ar( φ ( O ; η 0 )) . Step 2: Empirical Pro cess ( T 2 ). W e must sho w that T 2 = o p (1). This is equiv alen t to sho wing that the unscaled empirical pro cess term ( P n, 3 − P )( φ ( b η ) − φ ( η 0 )) is o p ( n − 1 / 2 ). Let ∆( O ; b η ) = φ ( O ; b η ) − φ ( O ; η 0 ). Conditioning on the training data D train = I 1 ∪ I 2 , the function ∆( · ; b η ) is deterministic. The term T 2 can b e viewed as a sum of i.i.d. random v ariables with mean zero (conditional on D train ) E [ T 2 | D train ] = √ n E O ∼ P [( P n − P )∆( O ; b η ) | D train ] = 0 . W e analyze the conditional v ariance V ar( T 2 | D train ) = n · V ar( P n ∆( O ; b η ) | D train ) = n · 1 n V ar(∆( O ; b η ) | D train ) ≤ E [( φ ( O ; b η ) − φ ( O ; η 0 )) 2 | D train ] = ∥ φ ( b η ) − φ ( η 0 ) ∥ 2 L 2 ( P ) . Under the consistency assumption (C1), we ha ve ∥ b η − η 0 ∥ p − → 0. Assuming φ satisfies mild Lipsc hitz contin uity or the n uisances are bounded, consistency implies conv ergence in the L 2 norm of the score ∥ φ ( b η ) − φ ( η 0 ) ∥ 2 L 2 ( P ) = o p (1) . By Chebyshev’s inequality , for any ϵ > 0 P ( | T 2 | > ϵ | D train ) ≤ V ar( T 2 | D train ) ϵ 2 = o p (1) ϵ 2 p − → 0 . Th us, T 2 = o p (1). 
This confirms that the estimation noise of b η do es not affect the asymptotic distribution via the empirical pro cess term. 54 Step 3: Bias T erm ( T 3 ). W e analyze the drift term T 3 = √ n E [ φ ( O ; b η ) − φ ( O ; η 0 ) | b η ]. Define the in termediate parameter ˜ η = ( m 0 , π 0 , b ϕ ), which represents the ideal nuisance parameters given the learned represen tation b ϕ . W e decompose the bias in to a nuisance estimation error and a representation learning error T 3 = √ nP ( φ ( b η ) − φ ( ˜ η )) | {z } T 3 a (Nuisance Bias) + √ nP ( φ ( ˜ η ) − φ ( η 0 )) | {z } T 3 b (Representation Bias) . (a) Nuisance Parameter Bias ( T 3 a ). This term captures the error from estimating m and π on I 2 , conditional on the fixed representation b ϕ from I 1 . Utilizing the algebraic prop ert y of the doubly robust score function, for an y m, π and fixed representation z , the difference satisfies the exact identit y E [ φ ( m, π , z ) − φ ( m 0 , π 0 , z )] = E 1 { A = a } π ( z ) ( Y − m ( z )) − 1 { A = a } π 0 ( z ) ( Y − m 0 ( z )) + ( m ( z ) − m 0 ( z )) = − E π ( z ) − π 0 ( z ) π ( z ) m ( z ) − m 0 ( z ) . Applying this to our estimator b η given b ϕ T 3 a = − √ n E " ( b π ( b Z C ) − π 0 ( b Z C ))( b m ( b Z C ) − m 0 ( b Z C )) b π ( b Z C ) I 1 # . Note that the first-order terms v anish iden tically due to Neyman orthogonality . The remain- ing term is strictly second-order. By the Cauc hy-Sc h warz inequality and the b oundedness of 1 / b π | T 3 a | ≲ √ n ∥ b π − π 0 ∥ L 2 ( b ϕ ) ∥ b m − m 0 ∥ L 2 ( b ϕ ) . Under the robustness assumption (pro duct of rates is o p ( n − 1 / 2 )), we hav e T 3 a = o p (1). (b) Representation Bias ( T 3 b ). This term captures the sto chastic error propagated from the representation learning step (Stage 1) to the final estimation (Stage 3). Let M ( ϕ ) = E [ φ ( O ; m 0 , π 0 , ϕ )] b e the exp ected score functional. Since the score φ is gener- ally not orthogonal with resp ect to ϕ , w e p erform a functional T aylor expansion around the true representation ϕ 0 , T 3 b = √ n ( M ( b ϕ ) − M ( ϕ 0 )) = √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] | {z } Linear T erm (I) + √ n R ( b ϕ, ϕ 0 ) | {z } Remainder T erm (II) , where ∇ ϕ M ( ϕ 0 )[ h ] is the Gˆ ateaux deriv ative of M in direction h . 55 The remainder T erm (I I) is b ounded b y the square of the estimation error |R| ≤ C ∥ b ϕ − ϕ 0 ∥ 2 L 2 . F or the remainder to b e asymptotically negligible (i.e., o p (1)), we only require the quarter-rate condition ∥ b ϕ − ϕ 0 ∥ L 2 = o p ( n − 1 / 4 ) . Assuming this holds, the asymptotic behavior of T 3 b is entirely determined by the Linear T erm (I). W e consider tw o regimes • Case 1: Sup er-Efficiency (Oracle V ariance). Supp ose the representation is learned on a massiv e auxiliary dataset or conv erges strictly faster than the parametric rate ∥ b ϕ − ϕ 0 ∥ L 2 = o p ( n − 1 / 2 ) . Then, the Linear T erm (I) satisfies | √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] | ≤ C √ n ∥ b ϕ − ϕ 0 ∥ L 2 = √ n · o p ( n − 1 / 2 ) = o p (1) . The representation bias v anishes. The estimator achiev es the Oracle Efficiency Bound , with asymptotic v ariance V eff = V ar( φ ( O ; η 0 )). • Case 2: Standard Rate (V ariance Inflation). Suppose the representation is learned at the standard parametric rate (e.g., via regression or standard ML on F old 1) ∥ b ϕ − ϕ 0 ∥ L 2 = O p ( n − 1 / 2 ) . In this case, assume that b ϕ admits an asymptotic linear expansion characterized by its o wn influence function ξ ϕ b ϕ ( z ) − ϕ 0 ( z ) = 1 n 1 X j ∈I 1 ξ ϕ ( O j , z ) + o p ( n − 1 / 2 ) . 
and then define IF ϕ, rep ( O ) := ⟨∇ ϕ M ( ϕ 0 ) , ξ ϕ ( O ) ⟩ = E O ′ [ ∇ ϕ φ ( O ′ ; η 0 )] · ξ ϕ ( O ) . By the linearit y of the deriv ativ e, √ nP ( φ ( ˜ η ) − φ ( η 0 )) ≃ 1 √ n n X i =1 IF ϕ, rep ( O i ) , 56 and the Linear T erm (I) conv erges in distribution √ n ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] d − → N (0 , V rep ) , where V rep = V ar(IF ϕ, rep ( O )) is the v ariance con tribution from the represen tation learn- ing step. Because b ϕ is estimated on I 1 and the ev aluation score φ is computed on the indep enden t fold I 3 , the error term ∇ ϕ M ( ϕ 0 )[ b ϕ − ϕ 0 ] and the oracle influence function IF oracle ( O i ) are uncorrelated. This justifies the decoupled summation of v ariances in V total . The total asymptotic v ariance hence inflates to V total = V eff + ρ · V rep , where ρ accoun ts for the ratio of sample sizes betw een folds. Standard errors m ust be corrected to accoun t for V rep . On thing w e need to emphasize is that w e ac knowledge that establishing the exact asymp- totic linearity for highly non-conv ex deep learning mo dels lik e V AEs remains an op en theoret- ical c hallenge. Our deriv ation of IF ϕ, rep op erates under the premise that the representation learner con v erges to an isolated lo cal optim um, b ehaving asymptotically as a regularized M-estimator, or alternativ ely , op erates in a regime where the neural tangen t kernel (NTK) affords linear resp onsiveness Jacot et al. (2018). Com bining the steps, if the bias terms v anish ( o p (1)), w e ha v e √ n ( b ψ − ψ 0 ) = T 1 + o p (1) d − → N (0 , σ 2 eff ). 57
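To make the three-stage splitting concrete, the following is a minimal sketch of the sequential cross-fitting described in Appendix D, assuming a binary treatment. The fitters fit_representation, fit_outcome, and fit_propensity are hypothetical placeholders for the VAE-based encoder and nuisance learners, and the reported standard error is the naive plug-in one, i.e., it does not include the representation-error correction V_rep discussed above.

```python
import numpy as np

def triple_ml_ate(S, A, Y, fit_representation, fit_outcome, fit_propensity, seed=0):
    """Three-fold sequential cross-fitting sketch: Stage 1 learns the representation on I1,
    Stage 2 fits nuisances on I2, Stage 3 evaluates the doubly robust score on I3.

    fit_representation(S, A, Y) -> encoder mapping (A, Y) to Z_C-hat (e.g., an HSIC-regularized VAE)
    fit_outcome(Z, A, Y)        -> function m_hat(a, z) approximating E[Y | A=a, Z_C-hat=z]
    fit_propensity(Z, A)        -> function pi_hat(z) approximating P(A=1 | Z_C-hat=z)
    All three fitters are user-supplied placeholders.
    """
    n = len(Y)
    idx = np.random.default_rng(seed).permutation(n)
    I1, I2, I3 = np.array_split(idx, 3)

    # Stage 1: learn the confounding representation on fold I1 only
    encoder = fit_representation(S[I1], A[I1], Y[I1])

    # Stage 2: estimate nuisances on fold I2, using the fold-I1 encoder
    Z2 = encoder(A[I2], Y[I2])
    m_hat = fit_outcome(Z2, A[I2], Y[I2])
    pi_hat = fit_propensity(Z2, A[I2])

    # Stage 3: evaluate the doubly robust score on fold I3
    Z3 = encoder(A[I3], Y[I3])
    p = np.clip(pi_hat(Z3), 1e-3, 1 - 1e-3)        # weight clipping for numerical stability
    m1, m0 = m_hat(1, Z3), m_hat(0, Z3)
    phi = (m1 - m0
           + A[I3] / p * (Y[I3] - m1)
           - (1 - A[I3]) / (1 - p) * (Y[I3] - m0))
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(phi))
```

In practice one would rotate the roles of the three folds and average, as in the K = 6 scheme used in the simulations, and correct the standard error for the representation-learning contribution when the super-convergence condition fails.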