Identification and estimation of the conditional average treatment effect with nonignorable missing covariates, treatment, and outcome

Treatment effect heterogeneity is central to policy evaluation, social science, and precision medicine, where interventions can affect individuals differently. In observational studies, covariates, treatment, and outcomes are often only partially obs…

Authors: Shuozhi Zuo, Yixin Wang, Fan Yang

Identification and estimation of the conditional average treatment effect with nonignorable missing covariates, treatment, and outcome
Iden tification and estimation of the conditional a v erage treatmen t effect with nonignorable missing co v ariates, treatmen t, and outcome Sh uozhi Zuo ∗ Yixin W ang † F an Y ang ‡ Abstract T reatment effect heterogeneity is central to p olicy ev aluation, so cial science, and precision medicine, where in terven tions can affect individuals differently . In observ a- tional studies, cov ariates, treatmen t, and outcomes are often only partially observ ed. When missingness dep ends on unobserv ed v alues (missing not at random; MNAR), standard metho ds can yield biased estimates of the conditional av erage treatment ef- fect (CA TE). This pap er establishes nonparametric iden tification of the CA TE under m ultiv ariate MNAR mec hanisms that allo w co v ariates, treatmen t, and outcomes to b e MNAR. It also develops nonparametric and parametric estimators and prop oses a sensitivit y anal ysis framew ork for assessing robustness to violations of the missingness assumptions. Keywor ds: missing data; missing not at random; nonparametric identification; causal in- ference; sensitivit y analysis ∗ Departmen t of Statistics, Universit y of Michigan, Ann Arbor † Departmen t of Statistics, Universit y of Michigan, Ann Arbor ‡ Y au Mathematical Sciences Cen ter, Tsinghua Univ ersity; Y anqi Lak e Beijing Institute of Mathematical Sciences and Applications. Corresp onding author: yangfan1987@tsinghua.edu.cn 1 1 In tro duction In man y observ ational studies, researchers seek to understand how treatment effects v ary across individuals. This heterogeneit y is captured b y the conditional av erage treatment effect (CA TE), defined as the exp ected treatment effect given co v ariates. Accurate CA TE estimation is essen tial for individualized decisions in b oth p olicy and medicine. In practice, ho w ever, cov ariates, treatmen t, and outcomes are often only partially observed. Missingness ma y dep end on a v ariable’s unobserv ed v alue as well as on other v ariables or missingness indicators, yielding complex multiv ariate missing not at random (MNAR) patterns. Standard approac hes for handling missing data, such as complete-case analysis (CCA), m ultiple imputation (MI) (Rubin, 1987), and in v erse probabilit y w eigh ting (IPW) (Seaman and White, 2013), rely on strong assumptions ab out the missingness mec hanism. CCA as- sumes missing completely at random (MCAR), meaning that missingness is indep endent of all v ariables. MI and IPW t ypically assume missing at random (MAR), under whic h missingness is indep endent of the unobserv ables conditional on the observ ables. In con- trast, MNAR allows missingness to dep end on the unobserv ables even conditional on all observ ables. Existing practical guidance (Lee et al., 2021) do es not cov er identification and estimation under multiv ariate MNAR mec hanisms. Across disciplines, reviews sho w that missing data are common but often underrep orted or handled by discarding incomplete cases. In the so cial sciences, Berch told (2019) reviewed quan titativ e articles published in 2017 across six leading journals and found that 69.5% (105/151) of quan titativ e articles included missing data, yet only 44.4% clearly rep orted its presence; most studies relied on discarding incomplete cases. Similar patterns also app ear in education and psyc hology (Dong and Peng, 2013). In epidemiology , Mainzer et al. (2024) review ed 130 observ ational studies published b etw een 2019 and 2021 in fiv e 2 leading journals. The review rep orted that 88% had missing data in multiple v ariables, and only 34% of studies stated any assumption about the missing-data mechanism. The National Job Corps Study (NJCS) (Schochet et al., 2006) illustrates the c hal- lenges that motiv ate this work. The goal is to c haracterize how the effect of obtaining a creden tial on earnings v aries across participan ts. Several k ey v ariables exhibit substan- tial nonresp onse, including co v ariates suc h as prior-year earnings and arrest history , the treatmen t of credential attainment, and the outcome of earnings, with missingness rates ranging b etw een 7% and 21%. Because sev eral measures are sensitiv e and collected b y sur- v ey , selective nonresponse is likely . In particular, rep orting of arrest history , prior earnings, creden tial status, and earnings ma y dep end on the v alues themselv es. Missingness ma y also dep end on other partially observ ed v ariables. F or example, baseline c haracteristics can affect follow-up participation, so later treatmen t and outcome missingness can dep end on baseline v alues that are not alwa ys observed. T ogether, these features mak e MNAR a cen tral concern in the NJCS. 1.1 Related w ork Directed acyclic graphs (DA Gs) enco de dep endencies among v ariables and missingness indicators, with directed edges representing p otential causal relationships. D A Gs mak e assumptions explicit and clarify which asp ects of the data structure are restricted. Early w ork by F ay (1986), Ma et al. (2003), Mohan and P earl (2014), and Mohan and P earl (2021) used D A Gs to deriv e identification results under a range of missingness patterns and clarified when parameters can b e learned from incomplete data. Despite this progress, there is still no general algorithm that determines from a DA G whether a given parameter is iden tifiable from the observ able data. 3 Previous work on MNAR in observ ational studies has largely fo cused on iden tifying the av erage treatment effect (A TE) under either restrictive missingness mechanisms or b y limiting missingness to only a few v ariables. When only co v ariates are missing, Blak e et al. (2020) use outcome regression augmented with missingness indicators without re- stricting the missingness mec hanism. Ho w ever, their approac h requires a modified strong ignorabilit y assumption (Rosenbaum and Rubin, 1984) that allows a co v ariate to confound treatmen t and outcome only when observ ed, an assumption difficult to justify in practice. T o a void this assumption, other work imp oses structure on the missingness mechanism. Under outcome-indep endent cov ariate missingness, where missingness is indep endent of the outcome conditional on cov ariates and treatment, Y ang et al. (2019), Guan and Y ang (2024), and Sun and Liu (2021) establish A TE identification under a completeness condi- tion. Sun and F u (2025) consider treatment-independent cov ariate missingness and derive iden tification results for the A TE under parametric mo dels when missingness exists in a single cov ariate. When b oth treatment and cov ariates ha v e missing data, W en and McGee (2025) study A TE iden tification requiring treatment missingness to b e independent of the outcome conditional on treatment and co v ariates, while co v ariate missingness m ust be in- dep enden t of b oth the outcome and its own unobserv ed v alue given treatment and observ ed v ariables. Iden tification under MNAR b ecomes more c hallenging when the outcome is also miss- ing. F rom a distribution recov ery p ersp ectiv e, Moreno-Betancur et al. (2018) sho w that iden tification often fails when v ariables are allo wed to affect their o wn missingness, i.e., self-censoring. F rom a regression p ersp ective, Hughes et al. (2019) demonstrate that the treatmen t co efficient can b e reco vered b y CCA when the missingness of all v ariables is con- ditionally indep endent of the outcome. Landsiedel et al. (2025) study A TE identification 4 also under the constraint that outcome missingness do es not dep end on the outcome it- self. Chen et al. (2023) directly address outcome self-censoring in A TE identification using treatmen t as a shadow v ariable but do not allow cov ariates or treatmen t to be missing. In observ ational studies, most of the literature on missing data targets the A TE, whic h t ypically requires reco vering the join t distribution of treatment, co v ariates and outcome, and th us also iden tifies the CA TE when ac hiev able. Our fo cus is complemen tary: when sev eral v ariables are MNAR, reco vering the full joint distribution may require strong miss- ingness assumptions, while the CA TE can remain identifiable under weak er conditions b ecause it dep ends only on the conditional outcome distribution. W e illustrate this dis- tinction in Section S1.1 of the supplemen tary material through a counterexample where the CA TE is iden tifiable but the A TE is not. Metho dological w ork that directly targets the CA TE under missingness remains limited. Kuzmano vic et al. (2023) study the esti- mation of CA TE when treatmen t is MAR. Ding and Geng (2014) establish iden tification of the CA TE with MNAR co v ariates in randomized studies with fully observ ed treatmen t and outcome. A key distinction from randomized settings is that in observ ational studies, co v ariates also act as confounders, further complicating iden tification. 1.2 Our con tributions This pap er studies identification and estimation of the CA TE under general m ultiv ariate MNAR mechanisms, allowing cov ariates, treatment, and outcome to b e partially observ ed, including self-censoring. In prosp ective studies where cov ariates and treatment are mea- sured b efore the outcome, the framew ork allo ws general MNAR mechanisms for co v ari- ates and treatment while imp osing restrictions only on the outcome-missingness process. W e study and establish nonparametric iden tification of the CA TE under three outcome- 5 missingness mec hanisms: outcome-indep enden t, treatment-independent, and co v ariate- indep enden t mec hanisms. Counterexamples in Section S1 of the supplemen tary material sho w that these restrictions do not generally iden tify the A TE and that the CA TE b e- comes uniden tifiable under more general outcome-missingness mechanisms. The pap er also dev elops corresponding nonparametric and parametric estimators. Because each baseline assumption remo ves exactly one path wa y in to outcome missingness, sensitivity analysis can b e implemen ted by reintroducing that path wa y through a single-parameter extension. 1.3 Organization of the pap er Section 2 introduces the notations, definitions, and causal assumptions. Section 3 presents the missingness mec hanisms and corresp onding identification results. Section 4 describes the proposed estimation approac hes. Section 5 reports simulation studies comparing esti- mators under differen t missingness assumptions. Section 6 presents the NJCS application and the sensitivity analysis. Section 7 concludes. 2 Notation, definition, and causal assumptions Consider a sample of size n consisting of indep endent and identically distributed observ a- tions drawn from a sup erp opulation. Let X , T , and Y denote the co v ariates, treatmen t, and outcome, with supports X , T , and Y , respectively . Let R Y b e the resp onse indicator suc h that R Y = 1 if Y is observed and R Y = 0 otherwise. Similarly , let R T and R X denote the resp onse indicators for T and X , resp ectiv ely . In general, X is a v ector of co v ariates, in which case w e write X = ( X 1 , . . . , X p ) and R X = ( R X 1 , . . . , R X p ) for the corresp onding comp onen twise resp onse indicators. F or notational simplicity , w e use the scalar notation X and R X throughout, and all iden tification results and estimators extend directly to the 6 m ultiv ariate case. W e adopt the p oten tial outcomes framew ork, where eac h unit has p o- ten tial outcomes { Y ( t ) : t ∈ T } for all treatment lev els. The parameter of interest is the CA TE comparing a pair of treatment levels t 1 ∈ T and t 0 ∈ T , τ t 1 ,t 0 ( x ) = E { Y ( t 1 ) − Y ( t 0 ) | X = x } , defined for each cov ariate v alue x ∈ X . W e imp ose the follo wing causal assumptions: (i) the stable unit treatment v alue as- sumption (SUTV A), whic h requires consistency and no in terference, so that the observ ed outcome satisfies Y = Y ( T ); (ii) strong ignorability , { Y ( t ) : t ∈ T } ⊥ ⊥ T | X ; and (iii) p ositivit y , P ( T = t | X = x ) > 0 for all t ∈ T and all x ∈ X . Under these assumptions, E { Y ( t ) | X = x } = E { Y | T = t, X = x } , so iden tification of P ( Y | T , X ) suffices to iden tify τ t 1 ,t 0 ( x ). Tw o of our iden tification results rely on a completeness condition. A function f ( A, B ) is said to b e c omplete in B if, for an y square-integrable function g , Z g ( A ) f ( A, B ) dν ( A ) = 0 = ⇒ g ( A ) = 0 a.s. , where ν ( · ) denotes the Leb esgue measure for con tinuous A and the coun ting measure for discrete A . 3 Missingness mec hanisms and iden tification W e fo cus on prosp ectiv e studies in whic h X and T are measured b efore Y , reflecting the natural temp oral ordering commonly observ ed in applied researc h. Accordingly , it is reasonable to assume that future outcomes do not influence the past even ts; that is, ( Y , R Y ) are assumed not to affect ( X , T , R X , R T ). T ypically , cov ariates X are measured 7 b efore treatment T is administered; ho wev er, in settings where X and T are obtained within the same survey , b oth v ariables ma y influence R X and R T , and our framework accommo dates this p ossibilit y . Under this setup, the DA Gs in Figure 1 illustrate the most general missing data assumption, as w ell as the MCAR, MAR, and MNAR Assumptions 1– 3. D AGs enco de missingness assumptions as conditional indep endence restrictions at the v ariable lev el. The MAR assumption represented by DA Gs differs from the form typically used in MI, where missingness ma y dep end on a v ariable only when it is observ ed. As noted by Mealli and Rubin (2015), this MI form ulation is not a conditional indep endence assumption and b ecomes difficult to in terpret or justify when multiple v ariables are sub ject to missingness. F or this reason, we adopt the MAR assumption formalized via D AGs (Mohan and Pearl, 2014). In the first row of Figure 1, the most general missingness assumption allows the miss- ingness indicators ( R X , R T , R Y ) to dep end arbitrarily on ( X, T , Y ) and on one another, sub ject only to the temp oral restriction describ ed ab ov e. Under this general mec hanism, the observed data do not, in general, iden tify τ t 1 ,t 0 ( x ); a simple coun terexample is provided in Section S1.2 of the supplementary material. Under MCAR, eac h missingness indicator is indep enden t of all other v ariables. Under MAR, the missingness indicators ma y dep end on one another but cannot dep end on the partially observ ed v ariables ( X , T , Y ). The three MNAR mec hanisms in the second row are derived from the most general assumption b y remo ving a single arrow into R Y , thereby imp osing additional restrictions on the outcome- missingness pro cess as formalized in Assumptions 1 – 3. Assumption 1 R X ⊥ ⊥ Y | ( X , T ) , R T ⊥ ⊥ Y | ( X , T , R X ) , R Y ⊥ ⊥ Y | ( X , T , R X , R T ) . Assumption 2 R X ⊥ ⊥ Y | ( X , T ) , R T ⊥ ⊥ Y | ( X , T , R X ) , R Y ⊥ ⊥ T | ( X , Y , R X , R T ) . Assumption 3 R X ⊥ ⊥ Y | ( X , T ) , R T ⊥ ⊥ Y | ( X , T , R X ) , R Y ⊥ ⊥ X | ( T , Y , R X , R T ) . 8 Assumptions 1 – 3 main tain the general missingness structure for X and T . Sp ecifically , w e allo w R X to dep end on ( X , T ) and R T to dep end on ( X , T , R X ). The differences among Assumptions 1 – 3 lie in the sp ecification of the missingness mechanism for Y . Under Assumption 1, R Y ma y dep end on ( X , T , R X , R T ) but is conditionally indep enden t of Y . Under Assumption 2, R Y ma y dep end on ( X , Y , R X , R T ) but is conditionally indep enden t of T . Under Assumption 3, R Y ma y dep end on ( T , Y , R X , R T ) but is conditionally indep endent of X . While identification of the CA TE under Assumption 1 has b een discussed previously (Moreno-Betancur et al., 2018), Assumptions 2 and 3 enable new identification results for the CA TE in settings where the outcome may b e self-censored. X T Y R X R T R Y General missing assumption X T Y R X R T R Y MCAR X T Y R X R T R Y MAR X T Y R X R T R Y MNAR Assumption 1 X T Y R X R T R Y MNAR Assumption 2 X T Y R X R T R Y MNAR Assumption 3 Figure 1: Eac h MNAR assumption restricts exactly one arro w into R Y relativ e to the general missingness DA G. T op row: general missingness, MCAR, and MAR. Bottom row: MNAR Assumptions 1 – 3. 9 The follo wing theorems presen t the conditions required for nonparametric iden tification of τ t 1 ,t 0 ( x ) under MNAR Assumptions 1 – 3. Theorem 1 Under the c ausal assumptions (i) to (iii) and Assumption 1, if P ( R Y = 1 , R T = 1 , R X = 1 | X = x, T = t ) > 0 for al l x , t , then P ( Y | T , X ) is identifiable, and ther efor e, τ t 1 ,t 0 ( x ) is identifiable. Assumption 1 rules out outcome self-censoring, implying that the outcome do es not influence its own missingness. This corresp onds to case (m-DA G E) in the framework of Moreno-Betancur et al. (2018), under which the conditional outcome distribution is iden tifiable from the complete cases because P ( Y | X , T ) = P ( Y | X , T , R X = 1 , R T = 1 , R Y = 1). MCAR and MAR are b oth sp ecial cases of Assumption 1. Consequently , when the missingness in Y is conditionally indep endent of Y itself, CCA yields consistent estimates of P ( Y | T , X ) and hence of τ t 1 ,t 0 ( x ). Theorem 2 Under the c ausal assumptions (i) to (iii) and Assumption 2, supp ose P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) > 0 for al l x, y and P ( R T = 1 , R X = 1 | T = t, X = x ) > 0 for al l t, x . (i) If P ( Y , R Y = 1 | T , X = x, R T = 1 , R X = 1) is c omplete in T for x ∈ X , then P ( Y | T , X = x ) is identifiable, and henc e τ t 1 ,t 0 ( x ) is identifiable. (ii) F or any x such that Y ⊥ ⊥ T | X = x , the nul l str atum-sp e cific effe ct τ t 1 ,t 0 ( x ) = 0 is identifiable, although the c ompleteness c ondition in p art (i) fails at that x . P art (i) of Theorem 2 relies on a completeness condition, an assumption widely used in nonparametric identification problems. In MNAR settings where a partially observed v ariable ma y influence its o wn missingness (Miao et al., 2024; Y ang et al., 2019; Li et al., 2023; Zuo et al., 2025), i.e., self-censoring, the completeness condition leverages a shadow 10 v ariable (Miao et al., 2024) that is asso ciated with the partially observ ed v ariable but, conditional on other observed information, do es not affect its missingness. Completeness then guarantees that the missingness mechanism is uniquely identified from the observ ed data. Sufficient conditions for completeness are pro vided in D’Haultfœuille (2010). This prop ert y also holds for many commonly used parametric families under some assumptions, including exp onential-family mo dels (Newey and P ow ell, 2003) and certain lo cation-scale families (Hu and Shiu, 2018). W e pro vide simple examples tailored to Theorems 2 and 3 in Section S3 of the supplemen tary material. T o build in tuition for part (i) and to illustrate ho w completeness facilitates identifi- cation, w e consider a simple discrete example. Supp ose that T tak es J distinct v alues { t 1 , . . . , t J } and Y takes K distinct v alues { y 1 , . . . , y K } . Because P ( Y = y | X = x, T = t ) = P ( Y = y , R Y = 1 | X = x, T = t, R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) , the conditional distribution P ( Y = y | X = x, T = t ) is iden tifiable if P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) is identifiable. F or each x , define ζ x ( y k ) = P ( R Y = 0 | X = x, Y = y k , R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y k , R X = 1 , R T = 1) , k = 1 , . . . , K , whic h represents the unknown o dds that Y is missing at each lev el y k . F or each treatment lev el t j , w e hav e the follo wing identit y P ( R Y = 0 | T = t j , X = x, R T = 1 , R X = 1) = K X k =1 P ( Y = y k , R Y = 1 | T = t j , X = x, R T = 1 , R X = 1) ζ x ( y k ) . Th us, for each x , we obtain a system of J linear equations in the K unknowns { ζ x ( y k ) } K k =1 . Let Θ x denote the J × K matrix with entries Θ x ( j, k ) = P ( Y = y k , R Y = 1 | T = t j , X = x, R T = 1 , R X = 1) . 11 T o ensure a unique solution for ζ x ( y k )’s and hence for P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) = { 1 + ζ x ( y ) } − 1 , it suffices to imp ose the full-rank condition rank(Θ x ) = K . This is the completeness condition in the discrete case. This condition essentially requires that J ≥ K and T  ⊥ ⊥ Y | X . P art (ii) highlights an imp ortan t case in whic h Y ⊥ ⊥ T | X = x holds for some x . In this situation, T no longer serv es as a shadow v ariable, and consequently the completeness con- dition fails at X = x . Therefore, the conditional distribution P ( Y | T , X = x ) is no longer iden tifiable. Nevertheless, under Assumption 2, we show that τ t 1 ,t 0 ( x ) remains iden tifi- able and equals 0. In tuitively , b ecause R Y ⊥ ⊥ T | ( X , Y , R X , R T ), the outcome-missingness mec hanism is iden tical across treatmen t lev els after conditioning on ( X , Y , R X , R T ). There- fore, any differences in observ ed outcome distributions across treatmen t lev els reflect differ- ences in the underlying conditional outcome distribution P ( Y | T , X = x ). When no such differences exist at X = x , τ t 1 ,t 0 ( x ) = 0 is correctly iden tified even though P ( Y | T , X = x ) ma y not b e recov erable. F ull pro ofs are provided in Section S2 of the supplemen tary material. Theorem 3 Under the c ausal assumptions (i) to (iii) and Assumption 3, if P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) > 0 for al l t, y , P ( R T = 1 , R X = 1 | T = t, X = x ) > 0 for al l t, x , and P ( Y , R Y = 1 | T = t, X , R T = 1 , R X = 1) is c omplete in X for al l t , then P ( Y | T , X ) is identifiable, and ther efor e, τ t 1 ,t 0 ( x ) is identifiable. Remark 1 When X is multivariate, we c an r elax The or em 3 as fol lows. Write X = ( X id , X c ) , wher e X id denotes the c ovariate c omp onent that satisfies (i) the mo difie d c on- ditional indep endenc e assumption in Assumption 3, R Y ⊥ ⊥ X id | ( T , Y , R X id , R X c , R T , X c ) , and (ii) the mo difie d c ompleteness c ondition r e quir e d in The or em 3, P ( Y , R Y = 1 | T = t, X , R T = 1 , R X = 1) is c omplete in X id for al l t . Under these two mo difie d c onditions, 12 the c onclusions of The or em 3 c ontinue to hold. The estimation and infer enc e str ate gies describ e d in Se ction 4 c an b e str aightforwar d ly extende d under the mo difie d assumptions. The in tuition b ehind Theorem 3 parallels part (i) of Theorem 2, with the roles of T and X reversed. Under Assumption 3, R Y ⊥ ⊥ X | ( T , Y , R X , R T ), implying that, within eac h treatmen t stratum T = t , the outcome-missingness mechanism is inv ariant across v alues of X . The completeness condition ensures that X pro vides sufficien t v ariation relativ e to Y (within each T = t ) to identify this missingness mechanism, thereb y allowing reco v ery of P ( Y | T , X ) from the observed data. Summary . Theorems 1 – 3 iden tify the CA TE in prosp ective studies where ( X , T ) are measured b efore Y , precluding ( Y , R Y ) from influencing the missingness of X or T . Under this time ordering, w e allo w the most general MNAR mechanisms for ( X, T ) and restrict only the outcome-missingness process, leading to three iden tifiable cases in Figure 1: (i) Assumption 1, whic h rules out outcome self-censoring and iden tifies P ( Y | X , T ) from complete cases; (ii) Assumption 2, whic h allows outcome self-censoring but imp oses R Y ⊥ ⊥ T | ( X , Y , R X , R T ); and (iii) Assumption 3, which allo ws outcome self-censoring but imp oses R Y ⊥ ⊥ X | ( T , Y , R X , R T ). Under Assumptions 2 – 3, identification additionally requires the corresp onding completeness condition, except when Y ⊥ ⊥ T | X = x under Assumption 2, in whic h case the CA TE is identifiable and equals zero even without completeness. These identification results imply that estimation and inference for the CA TE should b e conducted using analyses restricted to units with observed ( X , T ), as illustrated in the sketc h of the pro of for Theorem 2. An imp ortan t distinction betw een iden tifying the CA TE and the A TE is that the CA TE depends only on P ( Y | X , T ), and th us can remain iden tifiable even when the joint distribution of ( X , T ) is not. As a result, the CA TE can b e iden tified under weak er and more realistic missingness assumptions than those required for 13 the A TE. Assumptions 1 – 3 do not generally iden tify the full-data join t law and therefore do not, in general, iden tify the A TE; we pro vide coun terexamples in Section S1.1 of the supplemen tary material. Imp ortan tly , even with correctly sp ecified mo dels for the join t distribution of ( X , T , Y , R X , R T , R Y ), the MLE based on the full-data likelihoo d may yield biased estimates of the CA TE under the MNAR mechanisms in Assumptions 1 – 3, precisely b ecause the join t distribution in v olv es non-iden tifiable comp onen ts ( X , T ). This p oin t is reflected in our simulation results. The missingness mechanisms in Assumptions 1 – 3 are highly flexible, as each excludes only a single pathw ay into R Y relativ e to the most general missing DA G. This struc- ture mak es sensitivity analysis particularly straigh tforw ard. In Section 6, w e em b ed each baseline missingness mo del in a single-parameter family that rein tro duces the excluded de- p endence (e.g., allowing Y → R Y , T → R Y , or X → R Y ) and assess the robustness of the estimated CA TE ov er a range of quantified departures from the baseline assumption. 4 Estimation Under Assumption 1, P ( Y | X , T ) = P ( Y | X , T , R X = 1 , R T = 1 , R Y = 1), the complete-case estimator is consistent for P ( Y | X , T ) and hence for τ t 1 ,t 0 ( x ). F or example, a nonparametric estimator can b e obtained by flexibly regressing Y on ( X , T ) among complete cases (e.g., via k ernel-based or spline smo othing), yielding b µ ( t, x ) ≈ E ( Y | T = t, X = x ) and b τ t 1 ,t 0 ( x ) = b µ ( t 1 , x ) − b µ ( t 0 , x ). Alternatively , under a parametric outcome mo del P β ( y | x, t ), one can estimate β by maximizing the complete-case lik eliho od and then obtain τ t 1 ,t 0 ( x ) via plug-in. The remainder of this section fo cuses on estimation under Assumptions 2 and 3. Under these t wo assumptions, R Y ma y depend on Y itself, rendering the complete-case estima- tor generally biased due to selection. T o address this, we dev elop b oth nonparametric 14 and parametric estimators that implement the corresp onding identification results: (i) a nonparametric series tw o-stage least squares (2SLS) estimator that solves the associated F redholm equation for the resp onse-o dds function via a siev e appro ximation and quadratic regularization (Kress et al., 1999; New ey and Po well, 2003; Y ang et al., 2019); and (ii) a parametric likelihoo d-based estimator computed via the exp ectation–maximization (EM) algorithm (Dempster et al., 1977). 4.1 Nonparametric series 2SLS estimator under Theorem 2 or 3 W e implement the iden tification results in Theorems 2 and 3 by (i) recov ering the outcome resp onse mec hanism through an in tegral equation solved via a series 2SLS approach, and (ii) using the estimated resp onse o dds to construct a selection-corrected estimator of E ( Y | X , T ) and consequen tly of τ t 1 ,t 0 ( x ). Based on the iden tification results, we restrict analysis to units with observed ( X , T ), i.e., { i : R X = R T = 1 } . W e presen t the construction under Theorem 2; an analogous pro cedure applies to Theorem 3 after sw apping the roles of T and X . Define the resp onse o dds function ζ ( x, y ) = P ( R Y = 0 | X = x, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) , and let π ( x, y ) = P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) = { 1 + ζ ( x, y ) } − 1 . The probabilities b elow satisfy the F redholm equation P ( R Y = 0 | T = t, X = x, R X = 1 , R T = 1) = Z P ( Y = y , R Y = 1 | T = t, X = x, R X = 1 , R T = 1) ζ ( x, y ) dy . (1) T o op erationalize (1), we appro ximate function ζ in a finite-dimensional sieve space and enforce the integral equation through its sample analogue, yielding a regularized least- squares problem. Let h ( y ) = ( h 1 ( y ) , . . . , h J ( y )) ⊤ b e a chosen sieve basis in y (e.g., a 15 Hermite-t yp e env elop e basis). When x is discrete, for each x we use the appro ximation ζ ( x, y ) ≈ h ( y ) ⊤ β ( x ), whic h leads to a linear system b b x ≈ c M x β ( x ) obtained from the sample analogue of (1) ov er t ∈ T . W e estimate β ( x ) b y minimizing the integral-equation residual sub ject to quadratic regularization, b β ( x ) = arg min β ∥ b b x − c M x β ∥ 2 sub ject to β ⊤ Λ β ≤ B , where Λ is a p ositiv e definite p enalty matrix that regularizes the sieve co efficients and B > 0 con trols the regularization strength. W e then set b ζ ( x, y ) = h ( y ) ⊤ b β ( x ). When X is con tin uous, w e instead use a tensor-pro duct sieve ζ ( x, y ) ≈ ϕ ( x, y ) ⊤ θ with ϕ ( x, y ) = g ( x ) ⊗ h ( y ), where g ( x ) = ( g 1 ( x ) , . . . , g J x ( x )) ⊤ is a chosen siev e basis in x (e.g., the same Hermite- t yp e env elop e basis applied to a standardized x ). Finally , w e form weigh ts b w i = 1 + b ζ ( X i , Y i ) and estimate E ( Y | X , T ) via w eighted nonparametric regression on complete cases, i.e., units { i : R X = R T = R Y = 1 } , yielding b τ t 1 ,t 0 ( x ). Implementation details are pro vided in Section S4 of the supplementary material. As highlighted in part (ii) of Theorem 2, when Y ⊥ ⊥ T | X = x , the completeness condition in part (i) of Theorem 2 fails, rendering P ( Y | T , X = x ) (and thus E ( Y | X = x, T )) not identifiable. In this case, an estimator targeting E ( Y | X = x, T ) ma y b e biased, y et the bias in the estimated τ t 1 ,t 0 ( x ) is exp ected to b e negligible, since τ t 1 ,t 0 ( x ) can still b e correctly iden tified as 0. In contrast, when the completeness condition holds (part (i) of Theorem 2), the biases in b oth E ( Y | X = x, T ) and τ t 1 ,t 0 ( x ) are exp ected to be small. These exp ectations are consistent with our sim ulation results. F or inference, we follo w the b o otstrap pro cedure recommended in Y ang et al. (2019) with all tuning parameters fixed across b o otstrap resamples. 16 4.2 P arametric estimation under Theorem 2 or 3 The series 2SLS estimator directly implemen ts the nonparametric identification results and is flexible, but it can b e unstable due to the ill-p osed nature of the integral equation and its sensitivit y to tuning parameters. As a more stable alternative, w e consider a parametric estimator implemented via the EM algorithm. This approac h imp oses structure on the outcome and missingness mo dels and can yield smoother and more stable estimates when the parametric assumptions are adequate. W e again restrict our analysis to the subset of units with observed ( X , T ). W e sp ecify a parametric mo del for the outcome, P β ( y | x, t ) , for example a generalized linear model g { E β ( Y | x, t ) } = U ( x, t ) ⊤ β . Under Assumptions 2 and 3, P β ( y | x, t ) = P β ( y | x, t, R X = 1 , R T = 1). W e also sp ecify a parametric mo del for the outcome response probabilit y , π λ ( x, t, y ) = P λ ( R Y = 1 | X = x, T = t, Y = y , R X = 1 , R T = 1) = expit { λ ⊤ Z ( x, t, y ) } , where the design vector Z ( x, t, y ) is chosen to enforce the relev an t missingness restriction: under Assumption 2, the vector Z dep ends only on ( x, y ); and under Assumption 3, Z dep ends only on ( t, y ). F or a unit with R Y = 1, its contribution to the lik eliho o d is P β ( Y | X , T , R X = 1 , R T = 1) π λ ( X , T , Y ) . F or a unit with R Y = 0, the outcome is laten t and the contribution is Z P β ( y | X , T , R X = 1 , R T = 1) { 1 − π λ ( X , T , y ) } dy . W e employ the EM algorithm to handle the laten t outcome, in the presence of outcome missingness. 17 In the E-step, for a unit with R Y = 0, the conditional distribution of the missing outcome under the current parameters is P β ,λ ( y | X , T , R Y = 0 , R X = 1 , R T = 1) ∝ P β ( y | X , T , R X = 1 , R T = 1) { 1 − π λ ( X , T , y ) } . If Y is discrete, this conditional distribution can b e computed exactly . When Y is contin- uous, direct ev aluation is t ypically infeasible. W e therefore focus on fractional imputation (FI) (Kim, 2011), which provides a con venien t appro ximation to the E-step of the EM algorithm in this setting. • I-step (imputation). F or each unit i with R Y i = 0, dra w M candidate v alues y ∗ ij ∼ h ( · | X i , T i ), j = 1 , . . . , M , from a prop osal distribution h . A con venien t default is to take h as the complete-case outcome mo del, h ( · | X i , T i ) = P β (0) ( · | X i , T i , R X i = 1 , R T i = 1) , where β (0) is obtained b y fitting P β ( · | X , T , R X = 1 , R T = 1) on units with R Y = 1. F urther discussion of the choice of the prop osal distribution h is deferred to Y ang and Kim (2016). • W-step (weighting). Compute fractional weigh ts w ( m ) ij ∝ P β ( m ) ( y ∗ ij | X i , T i , R X i = 1 , R T i = 1) { 1 − π λ ( m ) ( X i , T i , y ∗ ij ) } h ( y ∗ ij | X i , T i ) , M X j =1 w ( m ) ij = 1 . • M-step (maximization). Maximize the Mon te Carlo approximation to the EM Q - function: Q ( m ) ( β , λ ) = X i : R Y i =1  log P β ( Y i | X i , T i , R X i = 1 , R T i = 1) + log π λ ( X i , T i , Y i )  + X i : R Y i =0 M X j =1 w ( m ) ij  log P β ( y ∗ ij | X i , T i , R X i = 1 , R T i = 1) + log[1 − π λ ( X i , T i , y ∗ ij )]  . 18 W e iterate un til con vergence. After obtaining b β , we estimate τ t 1 ,t 0 ( x ) accordingly . W e again emplo y the b o otstrap for inference. 5 Sim ulation study This section ev aluates how different estimators recov er τ t 1 ,t 0 ( x ) under Assumptions 1 – 3. 5.1 Setups and estimators Eac h sim ulation uses N = 1000 observ ations and is rep eated 500 times. W e consider eight data-t yp e com binations, corresp onding to all binary/con tinuous choices of X , T , and Y . P erformance is summarized by the p ercen t bias of τ t 1 ,t 0 ( x ) at a fixed x with ( t 1 , t 0 ) = (1 , 0). P arameters are calibrated so that the marginal missingness rates of X , T , and Y are eac h appro ximately 20%. F ull details are pro vided in Section S5 of the supplementary material. W e compare the following estimators: • Or acle : fits the correctly specified outcome mo del P β ( Y | T , X ) to the full data by plugging in the true v alues of the missing data. • CCA : fits the correctly sp ecified outcome mo del P β ( Y | T , X ) using only complete cases, i.e., units with R X = R T = R Y = 1. • X -miss-indic ator + CCA : augmen ts the co v ariate with the missingness indicator I ( R X = 0), restricts the analysis to units with observed treatment and outcome ( R T = R Y = 1), and fits the correctly sp ecified outcome model for Y giv en T and the augmen ted cov ariates. • MI (r estricte d) : a MI procedure (v an Buuren and Gro othuis-Oudshoorn, 2011) with predictiv e imputation mo dels that exclude Y when imputing X and T , ensuring the 19 outcome does not drive reconstruction of the co v ariate and treatment, consisten t with the design-stage principle of Rubin (2007). The outcome Y is imputed using the same mo del as in the outcome analysis. • MI (al l) : a MI pro cedure (v an Buuren and Gro othuis-Oudshoorn, 2011) that condi- tions on all other v ariables when imputing each of X , T , and Y . • NP : the prop osed nonparametric estimator. • Par a : the proposed parametric estimator. • Par a (ful l) : unlik e the subset analysis used in Par a , this approach additionally sp eci- fies parametric mo dels for ( X , T , R X , R T ) and maximizes the full observed-data lik e- liho o d. This is included as a cautionary b enchmark, since the full join t distribution is generally not identified under Assumptions 1 – 3. 5.2 Results F or each estimator, we obtain estimates b τ t 1 ,t 0 ( x ) across 500 sim ulation replicates. W e summarize p erformance using replicate-level p ercent bias P ercen t bias ( r ) = 100 × b τ ( r ) t 1 ,t 0 ( x ) − τ t 1 ,t 0 ( x ) τ t 1 ,t 0 ( x ) , r = 1 , . . . , 500 . W e rep ort the distribution of this quantit y using b o xplots. The center of each b o x reflects the mean p ercent bias across replicates, and the spread reflects the v ariability of the esti- mator. W e also examine a n ull effect setting with τ t 1 ,t 0 ( x ) = 0, corresp onding to part (ii) of Theorem 2. Since p ercent bias is not well-defined when τ t 1 ,t 0 ( x ) = 0, we instead assess p erformance using the difference b τ t 1 ,t 0 ( x ) − τ t 1 ,t 0 ( x ). Results for this null effect setting are rep orted in Section S5 of the supplementary material and are consisten t with the theory . 20 Figure 2: CCA is un biased only under Assumption 1; under Assumptions 2 – 3, bias can b e substan tial. Bo xplots sho w p ercent bias in τ 1 , 0 (1) (binary X ) and τ 1 , 0 (0) (con tin uous X ) with binary Y ; closer to zero is better. Rows corresp ond to ( X , T ) t yp e com binations and columns to Assumptions 1 – 3. Metho ds: Oracle, CCA, X -miss-indicator + CCA, MI (restricted), MI (all), NP , P ara, and P ara (full). 21 Figure 3: Under Assumptions 2 – 3, estimators that ignore outcome-dep endent missingness can b e biased. Boxplots sho w p ercent bias in τ 1 , 0 (1) (binary X ) and τ 1 , 0 (0) (contin uous X ) with contin uous Y ; closer to zero is b etter. Ro ws corresp ond to ( X , T ) t yp e combinations and columns to Assumptions 1 – 3. Metho ds: Oracle, CCA, X -miss-indicator + CCA, MI (restricted), MI (all), NP , P ara, and P ara (full). 22 F or settings with τ t 1 ,t 0 ( x )  = 0, percent bias is sho wn in Figure 2 for binary Y and in Figure 3 for contin uous Y . Rows corresp ond to ( X , T ) data t yp e com binations and columns to Assumptions 1 – 3. Under Assumption 1, P ( Y | X , T ) is iden tifiable from the complete cases, so CCA, NP , and Para remain nearly unbiased across all eigh t data-type combinations, while other estimators exhibit noticeable bias in some settings. Under Assumptions 2 and 3, identification of P ( Y | X , T ) additionally requires the corresp onding completeness condition. In Figure 2, and in Figure 3 for the cases with con- tin uous T under Assumption 2 and con tin uous X under Assumption 3, the completeness condition holds, and b oth NP and Para stay nearly un biased. In con trast, in Figure 3 for the cases with binary T under Assumption 2 and binary X under Assumption 3, the completeness condition fails without further assumptions. In these settings, the F red- holm equation underlying NP is underdetermined in the nonparametric mo del and NP can exhibit non trivial bias, consisten t with the theory . In con trast, Para remains un biased b ecause the completeness condition holds under the parametric mo dels emplo y ed (see the parametric completeness examples in Section S3 of the supplemen tary material). Other estimators show non trivial biases. In general, NP displa ys larger v ariabilit y than P ara, as expected from a fully nonparametric approach that estimates and in v erts an ill-posed in tegral equation. The p erformance of P ara (full) illustrates that mo deling additional components of the join t distribution, whic h are not generally identifiable, can lead to biased estimates. This reinforces the imp ortance of subset analyses that directly target the iden tified ob ject P ( Y | X , T ) in P ara and NP . 23 6 Application to the NJCS and sensitivit y analysis 6.1 Data and discussion on assumptions The data describ e 8 , 707 eligible applicants in the mid-1990s who lived in areas selected for in-p erson baseline in terviews. Sub jects were randomized to Job Corps program or control groups in the NJCS (Sc ho c het et al., 2001). While the original study fo cused on the effect of the Job Corps program, here w e seek to understand the causal relationship b et ween p ost-baseline educational attainment and subsequent earnings. Let X denote baseline cov ariates, including program assignment, gender, age, race, education, prior-year earnings, parenthoo d, and ev er arrested; in this application, all com- p onen ts of X are co ded as categorical indicators. Let T b e a binary indicator for obtaining an education or v o cational creden tial at 30 months of follo w-up, and let Y denote w eekly earnings at 48 mon ths of follo w-up. Our goal is to estimate how the effect of obtaining a creden tial ( T ) on weekly earnings ( Y ) v aries with baseline co v ariates X , that is, the CA TE τ t 1 ,t 0 ( x ) with t 1 = 1 and t 0 = 0. This setting aligns with our prosp ectiv e structure. The baseline cov ariates X and T (measured at 30 months) are both measured b efore Y (48 mon ths), so Y cannot influence missingness in X or T . These data exhibit substantial m ultiv ariate missingness. Several baseline co v ariates ha v e missing v alues (education ≈ 0 . 65%, prior-year earnings ≈ 9 . 6%, ev er arrested ≈ 6 . 7%), and approximately 21% of T and 21% of Y are missing. Conse- quen tly , appropriately addressing missingness in ( X, T , Y ) is central to the analysis. W e no w discuss the plausibility of each assumption. Assumption 1 requires that earn- ings do not affect their o wn missingness. This assumption may b e questionable in practice, since earnings information is often sensitive. In survey settings, resp ondents may selectiv ely 24 withhold earnings dep ending on the earnings v alues themselv es. Assumption 2 requires that educational attainment do es not directly affect earnings missingness except through miss- ingness in other v ariables or earnings themselves. This assumption is arguably reasonable b ecause resp ondents’ willingness to disclose earnings lik ely dep ends on their general atti- tudes to w ard sensitiv e information (reflected in the missingness across other v ariables) or the earnings v alues themselves. The completeness condition in Theorem 2 further requires that educational attainment influences earnings. Empirical supp ort for this assumption comes from prior works. Zuo et al. (2025) and Qin et al. (2019, 2021) do cumented indirect effects of Job Corps on earnings through educational attainmen t, pro viding evidence of a nonzero treatment effect in our setting. How ever, previous analyses emplo yed more restric- tiv e assumptions regarding missing data. W e instead fo cus on the CA TE of educational attainmen t under weak er and more flexible assumptions. Assumption 3 in its m ultiv ariate extension (Remark 1) requires the existence of baseline co v ariates that do not directly affect outcome missingness, except through co v ariate miss- ingness, educational attainment, its missingness, or earnings. As an illustration, w e tak e race and decompose X = ( X id , X c ), where X id denotes the race v ariable and X c collects the remaining baseline cov ariates. The asso ciated completeness condition in Theorem 3 further requires that race is associated with earnings. Because these missingness mechanisms are in trinsically un testable, it is essen tial to as- sess the robustness of causal conclusions to different missingness assumptions. In addition, recognizing that all three missingness mechanisms may b e violated in practice, we conduct sensitivit y analyses to ev aluate the stabilit y of our findings under quan tified departures from eac h assumption. 25 6.2 Mo dels W e apply b oth nonparametric and parametric approac hes. Because weekly earnings con tain a substantial p oin t mass at zero (14.85%), we adopt a t wo-part mo del that separately estimates the probability of p ositive earnings and the magnitude of p ositiv e earnings. Let D = I ( Y > 0) indicate whether an individual has an y earnings. Under Assumptions 2 and 3, missingness can dep end on Y . W e assume this dep endence op erates through the binary indicator D (zero v ersus p ositive earnings) rather than through the magnitude of p ositiv e earnings conditional on D = 1. This restriction allows the completeness conditions in Theorems 2 and 3 to hold with discrete support for T and X id ev en without additional parametric assumptions, pro vided D fully captures the information relev ant to missingness. W e mo del the probability for positive earnings using logistic regression: logit P ( D = 1 | T , X ) = α + β T T + W ( X ) ⊤ γ + { T · W ( X ) } ⊤ η , where W ( X ) is a v ector of dumm y v ariables for the cov ariates (with one baseline category p er co v ariate), and T · W ( X ) denotes treatmen t-co v ariate in teractions. Denote p ( t, x ) = P ( D = 1 | T = t, X = x ) . W e mo del the conditional mean of Y among the p ositive earners ( D = 1) using Gamma regression with a log link: log E { Y | D = 1 , T , X } = α Y + θ T T + W ( X ) ⊤ θ + { T · W ( X ) } ⊤ κ. Since w e assume the missingness dep ends on Y only through D , conditioning on D = 1 remo v es the need for further missingness correction. Consequently , the parametric estima- tor is fit using complete cases with Y > 0, while the nonparametric estimator replaces the parametric functional form with nonparametric regression. W riting the conditional mean 26 as m ( t, x ) = E ( Y | D = 1 , T = t, X = x ) , w e hav e E ( Y | T = t, X = x ) = p ( t, x ) m ( t, x ) , and therefore the CA TE is τ t 1 ,t 0 ( x ) = p ( t 1 , x ) m ( t 1 , x ) − p ( t 0 , x ) m ( t 0 , x ) . T o address outcome missingness, w e sp ecify a logistic mo del for the response indicator: π λ ( x, t, d ) = P λ ( R Y = 1 | X = x, T = t, D = d, R X = 1 , R T = 1) = expit { λ ⊤ Z ( x, t, d ) } , where Z ( x, t, d ) is chosen to match Assumptions 1 – 3. Under Assumption 3 with multiv ari- ate X = ( X id , X c ), w e enforce the w eaker restriction R Y ⊥ ⊥ X id | ( T , D , R X = 1 , R T = 1 , X c ) , so the baseline R Y mo del ma y dep end on X c but excludes X id (race). W e rep ort τ t 1 ,t 0 ( x ) ev aluated at a reference profile x ref , chosen as the most common category of eac h baseline co v ariate among units with observed treatment and cov ariates: assigned to Job Corps program, male gender, age b etw een 16 and 17, black race, no child, nev er arrested, no high school diploma or GED, and zero prior earnings. The target estimand is thus τ 1 , 0 ( x ref ). W e compare six estimation approaches: CCA, X -miss-indicator + CCA, MI (restricted), MI (all), NP , and P ara. F or eac h estimator, w e compute b τ 1 , 0 ( x ref ) with 95% b o otstrap p ercen tile confidence interv als (CIs) based on 500 bo otstrap resamples. 6.3 Results. Figure 4 summarizes treatment effect estimates for the reference group. W e rep ort results for six estimators (CCA, X-miss-ind + CCA, MI(restricted), MI(all), NP , and P ara), with 27 NP and P ara estimates shown separately under Assumptions 1, 2, and 3. Despite this metho dological and assumption heterogeneit y , all sp ecifications yield the same qualitative conclusion: the 95% bo otstrap CIs for τ 1 , 0 ( x ref ) exclude zero, confirming a p ositiv e earnings effect. P oint estimates v ary mo destly across estimators, while the nonparametric approac h pro duces wider CIs due to its greater estimation v ariability . Figure 4: Credential attainment increases earnings for the reference profile: all 95% b o ot- strap p ercen tile CIs for τ 1 , 0 ( x ref ) exclude zero. Sho wn are CCA, X -miss-indicator + CCA, MI (restricted), MI (all), NP , and Para; NP and Para are reported under Assumptions 1 – 3. Larger v alues indicate larger earnings gains. 6.4 Sensitivit y analysis Assumptions 1 – 3 impose restrictions on ho w R Y ma y dep end on ( X , T , D ), but these re- strictions are not testable from the observed data. When an additional dep endence is in tro duced (as in Figure 5), the observ able data no longer identify τ t 1 ,t 0 ( x ) without further 28 X T D R X R T R Y Assumption 1 departure: allow D → R Y (offset δ D ). X T D R X R T R Y Assumption 2 departure: allow T → R Y (offset δ T ). X id T D R X R T R Y Assumption 3 departure: allow X id → R Y (offset δ s ( X id )). Figure 5: Sensitivit y analysis adds one excluded edge in to R Y to quantify departures from eac h baseline MNAR assumption. Each dotted red arro w indicates the violation in tro duced b y offset δ . A t δ = 0, the baseline missingness mo del holds. In the Assumption 3 panel, X id denotes race (the iden tifying co v ariate comp onent); the remaining baseline cov ariates X c , whic h may influence any other v ariables in the DA G, are omitted for readabilit y . assumptions. T o assess robustness, a sensitivity analysis is conducted using the parametric approac h b y em b edding eac h missingness model in a single-parameter family indexed b y an offset δ added to the logit for R Y . Denote the baseline logit b y g ( · ). W e consider Assumption 1: logit P ( R Y =1 | X , T , D , R X = 1 , R T = 1) = g ( X , T , R X = 1 , R T = 1) + δ D , Assumption 2: logit P ( R Y =1 | X , T , D , R X = 1 , R T = 1) = g ( X , D , R X = 1 , R T = 1) + δ T , Assumption 3: logit P ( R Y =1 | X , T , D , R X = 1 , R T = 1) = g ( T , D , X c , R X = 1 , R T = 1) + δ s ( X id ) , where s ( X id ) denotes the v ector of dummies for the race v ariable. At δ = 0 the correspond- ing missingness assumption holds. When δ  = 0, one additional directed edge ( D → R Y , T → R Y , or X id → R Y ) is introduced in to the corresp onding missingness mec hanism, as illustrated in Figure 5. W e consider a broad range of sensitivit y v alues, with δ ∈ [ − 2 , 2] on the log-o dds scale (Chen et al., 2010). F or each sensitivit y v alue, w e refit the prop osed parametric mo dels under the corresp onding offset v alue to obtain b p ( t, x ref ; δ ). W e then rep ort the implied 29 Figure 6: The estimated effect remains p ositive across δ ∈ [ − 2 , 2] under all three sensitivit y mo dels. Sho wn are P ara estimates of τ 1 , 0 ( x ref ; δ ) with 95% b o otstrap p ercentile CIs under Assumptions 1 – 3 and their single-parameter sensitivit y v ariants indexed b y δ ∈ [ − 2 , 2]. effect b τ 1 , 0 ( x ref ; δ ) = b p (1 , x ref ; δ ) b m (1 , x ref ) − b p (0 , x ref ; δ ) b m (0 , x ref ) . Figure 6 rep orts the sensitivit y analysis indexed by δ ∈ [ − 2 , 2]. The estimated effect remains p ositive throughout this range, demonstrating substan tial robustness to departures from eac h baseline assumption. 7 Discussion This pap er studies iden tification and estimation of th e CA TE in observ ational studies where co v ariates, treatmen t, and outcome ma y b e MNAR. In prospective settings, the frame- w ork allows general missingness in cov ariates and treatment and considers three outcome- missingness mechanisms (including self-censoring) under whic h the CA TE remains iden ti- fiable. W e develop t w o estimation strategies. The nonparametric approach implemen ts the iden tification results by solving an integral equation, but can b e numerically unstable, sen- 30 sitiv e to tuning choices, and less suited for sensitivit y analysis. The parametric approach trades flexibility for stability via structured outcome and missingness mo dels and admits a simple and interpretable sensitivity analysis. In practice, we recommend using the parametric estimator as the primary analysis, rep orting results under m ultiple missingness assumptions, and conducting sensitivity anal- ysis. The nonparametric estimator then serves as a robustness chec k on the parametric sp ecification rather than the default approac h in applied analyses. Data and co de a v ailabilit y Data and co de to repro duce all analyses are a v ailable at https://github.com/sushi133/ CATE- MNAR . Ac kno wledgemen ts Sh uozhi Zuo and Yixin W ang w ere supp orted in part by funding from the Office of Na v al Researc h under gran t N00014-23-1-2590, the National Science F oundation under gran t No. 2310831, No. 2428059, No. 2435696, No. 2440954, a Michigan Institute for Data Science Prop elling Original Data Science (PODS) gran t, Tw o Sigma In vestmen ts LP , and LG Man- agemen t Dev elopment Institute AI Researc h. An y opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors. 31 References Berc h told, A. (2019). T reatment and rep orting of item-level missing data in social science researc h. International Journal of So cial R ese ar ch Metho dolo gy , 22(5):431–439. Blak e, H. A., Leyrat, C., Mansfield, K. E., T omlinson, L. A., Carp en ter, J., and Williamson, E. J. (2020). Estimating treatmen t effects with partially observed co v ariates using out- come regression with missing indicators. Biometric al Journal , 62(2):428–443. Chen, H., Cohen, P ., and Chen, S. (2010). Ho w big is a big o dds ratio? in terpreting the magnitudes of o dds ratios in epidemiological studies. Communic ations in Statistics - Simulation and Computation , 39(4):860–864. Chen, J. M., Malinsky , D., and Bhattac harya, R. (2023). Causal inference with outcome- dep enden t missingness and self-censoring. Pr o c e e dings of Machine L e arning R ese ar ch , 216:358–368. Dempster, A. P ., Laird, N. M., and Rubin, D. B. (1977). Maximum lik eliho o d from in- complete data via the EM algorithm (with discussion). Journal of the R oyal Statistic al So ciety: Series B (Metho dolo gic al) , 39(1):1–22. Ding, P . and Geng, Z. (2014). Identifiabilit y of subgroup causal effects in randomized exp erimen ts with nonignorable missing co v ariates. Statistics in Me dicine , 33(7):1121– 1133. Dong, Y. and Peng, C.-Y. J. (2013). Principled missing data metho ds for researc hers. SpringerPlus , 2(1):222. D’Haultfœuille, X. (2010). A new instrumental metho d for dealing with endogenous selec- tion. Journal of Ec onometrics , 154(1):1–15. 32 F a y , R. E. (1986). Causal models for patterns of nonresp onse. Journal of the Americ an Statistic al Asso ciation , 81(394):354–365. Guan, Q. and Y ang, S. (2024). A unified inference framew ork for m ultiple imputation using martingales. Statistic a Sinic a , 34:1649–1673. Hu, Y. and Shiu, J.-L. (2018). Nonparametric identification using instrumen tal v ariables: sufficien t conditions for completeness. Ec onometric The ory , 34(3):659–693. Hughes, R. A., Heron, J., Sterne, J. A. C., and Tilling, K. (2019). Accounting for missing data in statistical analyses: m ultiple imputation is not alwa ys the answer. International Journal of Epidemiolo gy , 48(4):1294–1304. Kim, J. K. (2011). P arametric fractional imputation for missing data analysis. Biometrika , 98(1):119–132. Kress, R., Maz’ya, V., and Kozlov, V. (1999). Line ar Inte gr al Equations . New Y ork: Springer, 2nd edition. Kuzmano vic, M., Hatt, T., and F euerriegel, S. (2023). Estimating conditional a v erage treat- men t effects with missing treatmen t information. In Pr o c e e dings of the 26th International Confer enc e on A rtificial Intel ligenc e and Statistics , volume 206, pages 746–766. Landsiedel, K. E., Abb ott, R., Mucunguzi, A., Mw angwa, F., Kak ande, E., Charleb ois, E. D., Marquez, C., Kamy a, M. R., and Balzer, L. B. (2025). Causal inference with missing exp osures and missing outcomes. arXiv pr eprint arXiv:2506.03336 . Lee, K. J., Tilling, K. M., Cornish, R. P ., Little, R. J. A., Bell, M. L., Go etgheb eur, E., Hogan, J. W., Carp en ter, J. R., and on b ehalf of the STRA TOS initiativ e (2021). F ramew ork for the treatment and rep orting of missing data in observ ational studies: the 33 treatmen t and rep orting of missing data in observ ational studies framew ork. Journal of Clinic al Epidemiolo gy , 134:79–88. Li, Y., Miao, W., Shpitser, I., and Tchetgen, E. J. T. (2023). A self-censoring mo del for m ultiv ariate nonignorable nonmonotone missing data. Biometrics , 79(4):3203–3214. Ma, W.-Q., Geng, Z., and Hu, Y.-H. (2003). Iden tification of graphical mo dels for nonig- norable nonresp onse of binary outcomes in longitudinal studies. Journal of Multivariate A nalysis , 87(1):24–45. Mainzer, R. M., Moreno-Betancur, M., Nguy en, C. D., Simpson, J. A., Carlin, J. B., and Lee, K. J. (2024). Gaps in the usage and rep orting of multiple imputation for incomplete data: findings from a scoping review of observ ational studies addressing causal questions. BMC Me dic al R ese ar ch Metho dolo gy , 24(193):1–15. Mealli, F. and Rubin, D. B. (2015). Clarifying missing at random and related definitions, and implications when coupled with exc hangeability . Biometrika , 102(4):995–1000. Miao, W., Liu, L., Li, Y., Tc hetgen Tc hetgen, E. J., and Geng, Z. (2024). Identification and semiparametric efficiency theory of nonignorable missing data with a shado w v ariable. A CM/IMS Journal of Data Scienc e , 1(2):1–23. Mohan, K. and P earl, J. (2014). Graphical mo dels for recov ering probabilistic and causal queries from missing data. In A dvanc es in Neur al Information Pr o c essing Systems , v ol- ume 27. Mohan, K. and P earl, J. (2021). Graphical mo dels for pro cessing missing data. Journal of the A meric an Statistic al Asso ciation , 116(534):1023–1037. 34 Moreno-Betancur, M., Lee, K. J., Leacy , F. P ., White, I. R., Simpson, J. A., and Carlin, J. B. (2018). Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. Americ an Journal of Epidemiolo gy , 187(12):2705–2715. New ey , W. and Po well, J. (2003). Instrumental v ariable estimation of nonparametric mo d- els. Ec onometric a , 71(5):1565–1578. Qin, X., Deutsch, J., and Hong, G. (2021). Unpac king complex mediation mechanisms and their heterogeneity b etw een sites in a job corps ev aluation. Journal of Policy A nalysis and Management , 40(1):158–190. Qin, X., Hong, G., Deutsch, J., and Bein, E. (2019). Multisite causal mediation analysis in the presence of complex sample and surv ey designs and non-random non-resp onse. Journal of the R oyal Statistic al So ciety: Series A (Statistics in So ciety) , 182(4):1343– 1370. Rosen baum, P . R. and Rubin, D. B. (1984). Reducing bias in observ ational studies using sub classification on the prop ensit y score. Journal of the A meric an Statistic al Asso ciation , 79(387):516–524. Rubin, D. B. (1987). Multiple Imputation for Nonr esp onse in Surveys . New Y ork: Wiley . Rubin, D. B. (2007). The design versus the analysis of observ ational studies for causal effects: parallels with the design of randomized trials. Statistics in Me dicine , 26(1):20– 36. Sc ho chet, P . Z., Burghardt, J., and Glazerman, S. (2001). National job corps study: The impacts of job corps on participan ts’ employmen t and related outcomes. Princ eton, NJ: Mathematic a Policy R ese ar ch. 35 Sc ho chet, P . Z., Burghardt, J., and McConnell, S. (2006). National job corps study and longer-term follo w-up study: Impact and benefit-cost findings using surv ey and summary earnings records data. Washington, DC: Employment and T r aining A dministr ation, U.S. Dep artment of L ab or. Seaman, S. R. and White, I. R. (2013). Review of inv erse probability weigh ting for dealing with missing data. Statistic al Metho ds in Me dic al R ese ar ch , 22(3):278–295. Sun, J. and F u, B. (2025). Identification and estimation of causal effects with confounders missing not at random. Biostatistics , 26(1):kxaf015. Sun, Z. and Liu, L. (2021). Semiparametric inference of causal effect with nonignorable missing confounders. Statistic a Sinic a , 31(4):1669–1688. v an Buuren, S. and Gro othuis-Oudshoorn, K. (2011). mice: Multiv ariate imputation b y c hained equations in r. Journal of Statistic al Softwar e , 45(3):1–67. W en, L. and McGee, G. (2025). Estimating a verage causal effects with incomplete exposure and confounders. arXiv pr eprint arXiv:2506.21786 . Y ang, S. and Kim, J. K. (2016). F ractional imputation in survey sampling: A comparative review. Statistic al Scienc e , 31(3):415–432. Y ang, S., W ang, L., and Ding, P . (2019). Causal inference with confounders missing not at random. Biometrika , 106(4):875–888. Zuo, S., Ghosh, D., Ding, P ., and Y ang, F. (2025). Mediation analysis with the mediator and outcome missing not at random. Journal of the A meric an Statistic al Asso ciation , 120(550):794–804. 36 Supplemen tary material Section S1 presents tw o counterexamples referenced in the main paper. Section S2 contains the pro ofs of all theorems. Section S3 presents parametric examples for the completeness conditions. Section S4 details the nonparametric estimator. Section S5 provides additional details and results on the simulation study . S1 Coun terexamples This section presents tw o counterexamples referenced in the main pap er. The first es- tablishes non-identifiabilit y of the A TE under Assumptions 1 – 3. The second establishes non-iden tifiabilit y of the CA TE when R Y admits an additional dep endence path beyond Assumptions 1 – 3. S1.1 Coun terexample 1 W e allow general missingness in ( X , T ) under Assumptions 1 – 3 and assume that Y is fully observ ed: P ( X , T , Y , R X , R T ) = P ( X ) P ( T | X ) P ( Y | X , T ) P ( R X | X , T ) P ( R T | X , T , R X ) . Assume X , T , Y ∈ { 0 , 1 } . F or simplicity , consider the sp ecial case R T = R X , so that ( X , T ) are either jointly observed or jointly missing. The observ ed data identify p 11 xt = P ( X = x, T = t, Y = 1 , R X = 1 , R T = 1) , p 01 xt = P ( X = x, T = t, Y = 0 , R X = 1 , R T = 1) , p +1 = P ( Y = 1 , R X = 0 , R T = 0) , p +0 = P ( Y = 0 , R X = 0 , R T = 0) , S1 with P x,t ( p 11 xt + p 01 xt ) + p +1 + p +0 = 1. In this coun terexample, ( p 11 00 , p 01 00 ) =  45 800 , 180 800  , ( p 11 01 , p 01 01 ) =  72 800 , 18 800  , ( p 11 10 , p 01 10 ) =  28 800 , 7 800  , ( p 11 11 , p 01 11 ) =  12 800 , 48 800  , ( p +1 , p +0 ) =  123 800 , 267 800  . Let γ xt = P ( X = x, T = t ), θ xt = P ( Y = 1 | X = x, T = t ), and π xt = P ( R X = 1 , R T = 1 | X = x, T = t ). Then p 11 xt = γ xt θ xt π xt , p 01 xt = γ xt (1 − θ xt ) π xt . Th us the observed data identify θ xt = p 11 xt / ( p 11 xt + p 01 xt ) but do not identify γ xt , since only the pro ducts γ xt π xt are observ ed. W e no w construct t wo differen t parameterizations, models A and B, that b oth repro duce the same observ ed data probabilities but imply different A TEs. Both mo dels share the same conditional outcome distribution: ( θ 00 , θ 01 , θ 10 , θ 11 ) =  1 5 , 4 5 , 4 5 , 1 5  . Mo del A has P ( X = 1) = 1 4 , P ( T = 1 | X = 0) = 1 4 , P ( T = 1 | X = 1) = 3 4 , with resp onse probabilities ( π 00 , π 01 , π 10 , π 11 ) =  1 2 , 3 5 , 7 10 , 2 5  . Mo del B has P ( X = 1) = 1 2 , P ( T = 1 | X = 0) = 17 50 , P ( T = 1 | X = 1) = 21 25 , with resp onse probabilities ( π 00 , π 01 , π 10 , π 11 ) =  75 88 , 45 68 , 35 64 , 5 28  . S2 Both parameterizations yield the same observ ed data distribution ( p 11 xt , p 01 xt , p +1 , p +0 ). Ho w ever, with the shared θ xt ab o ve, θ 01 − θ 00 = 3 5 and θ 11 − θ 10 = − 3 5 , so A TE = 3 5 { 1 − 2 P ( X = 1) } . Therefore, A TE A = 3 10 but A TE B = 0 , sho wing that the A TE is not iden tifiable from the observed data under the missingness mec hanisms for ( X , T ) under Assumptions 1 – 3. S1.2 Coun terexample 2 W e consider X and T fully observ ed, while allo wing an additional dependence path in to R Y b ey ond Assumptions 1 – 3: P ( X , T , Y , R Y ) = P ( X ) P ( T | X ) P ( Y | X , T ) P ( R Y | X , T , Y ) . Assume X , T , Y ∈ { 0 , 1 } . The observed data identify p 11 xt = P ( X = x, T = t, Y = 1 , R Y = 1) , p 01 xt = P ( X = x, T = t, Y = 0 , R Y = 1) , p +0 xt = P ( X = x, T = t, R Y = 0) , with p 11 xt + p 01 xt + p +0 xt = P ( X = x, T = t ). In our counterexample, ( p 11 00 , p 01 00 , p +0 00 ) =  30 800 , 192 800 , 78 800  , ( p 11 01 , p 01 01 , p +0 01 ) =  24 800 , 28 800 , 48 800  , ( p 11 10 , p 01 10 , p +0 10 ) =  12 800 , 36 800 , 52 800  , ( p 11 11 , p 01 11 , p +0 11 ) =  126 800 , 81 800 , 93 800  . S3 Let θ xt = P ( Y = 1 | X = x, T = t ) and π xty = P ( R Y = 1 | X = x, T = t, Y = y ). Then, p 11 xt = P ( X = x, T = t ) θ xt π xt 1 , p 01 xt = P ( X = x, T = t ) (1 − θ xt ) π xt 0 . Th us, the observ ed data iden tify only the pro ducts θ xt π xt 1 and (1 − θ xt ) π xt 0 , lea ving θ xt uniden tified without further assumptions. W e no w construct t w o parameterizations, mo dels A and B, that b oth repro duce the same observed data probabilities but imply different CA TEs. Both mo dels share the same ( X , T ) distribution: P ( X = 1) = 1 2 , P ( T = 1 | X = 0) = 1 4 , P ( T = 1 | X = 1) = 3 4 . Mo del A has ( θ 00 , θ 01 , θ 10 , θ 11 ) =  1 5 , 3 5 , 2 5 , 7 10  , with resp onse probabilities ( π 000 , π 001 , π 010 , π 011 , π 100 , π 101 , π 110 , π 111 ) =  4 5 , 1 2 , 7 10 , 2 5 , 3 5 , 3 10 , 9 10 , 3 5  . Mo del B has ( θ 00 , θ 01 , θ 10 , θ 11 ) =  3 10 , 1 2 , 1 2 , 3 5  , with resp onse probabilities ( π 000 , π 001 , π 010 , π 011 , π 100 , π 101 , π 110 , π 111 ) =  32 35 , 1 3 , 14 25 , 12 25 , 18 25 , 6 25 , 27 40 , 7 10  . Both parameterizations yield the same observ ed data distribution. How ev er, with τ 1 , 0 ( x ) = θ x 1 − θ x 0 , τ 1 , 0; A (0) = 2 5 , τ 1 , 0; A (1) = 3 10 , τ 1 , 0; B (0) = 1 5 , τ 1 , 0; B (1) = 1 10 . Therefore, the same observed data are compatible with distinct v alues of the CA TE, sho w- ing that the CA TE is not iden tifiable under the general outcome missingness mec hanism. S4 S2 Pro ofs Throughout this section, we main tain the causal assumptions ( i ) to ( iii ), so iden tification of P ( Y | T , X ) implies identification of τ t 1 ,t 0 ( x ). S2.1 Pro of of Theorem 1 Iden tification of P ( Y = y | X = x, T = t ) follo ws from P ( Y = y | X = x, T = t ) = P ( Y = y | X = x, T = t, R X = 1 , R T = 1 , R Y = 1) . S2.2 Pro of of Theorem 2 P art (i). W e fo cus on identifying P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1). Define P y 1 | tx 11 = P ( Y = y , R Y = 1 | T = t, X = x, R T = 1 , R X = 1) , P +0 | tx 11 = P ( R Y = 0 | T = t, X = x, R T = 1 , R X = 1) , ζ x ( y ) = P ( R Y = 0 | X = x, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) . Since P y 1 | tx 11 = P ( Y = y | T = t, X = x ) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) , it follo ws that P +0 | tx 11 = Z y ∈Y P ( Y = y , R Y = 0 | T = t, X = x, R T = 1 , R X = 1) d y = Z y ∈Y P ( Y = y | T = t, X = x ) P ( R Y = 0 | X = x, Y = y , R X = 1 , R T = 1) d y = Z y ∈Y P y 1 | tx 11 P ( R Y = 0 | X = x, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) d y = Z y ∈Y P y 1 | tx 11 ζ x ( y ) d y . S5 The uniqueness of solutions ζ x ( y ) requires that P ( Y , R Y = 1 | T , X = x, R T = 1 , R X = 1) is complete in T for x . F or discrete T and discrete Y , the completeness assumption is equiv alen t to Rank (Θ x ) = K , where Θ x is a J × K matrix with P y 1 | tx 11 as the ( t, y )th elemen t. F or binary Y , the rank condition further reduces to T ⊥ ⊥ Y | X . F or con tin uous T and con tinuous Y , the dimension of T must b e no smaller than that of Y in general, as required b y the completeness assumption. W e can subsequently iden tify P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) once ζ x ( y ) is iden tified. Then, the iden tification of P ( Y = y | X = x, T = t ) follo ws from P ( Y = y | X = x, T = t ) = P ( Y = y | X = x, T = t, R X = 1 , R T = 1) = P ( Y = y , R Y = 1 | X = x, T = t, R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) . P art (ii). F or any t 1 , t 0 , w e hav e P ( Y = y | X = x, T = t 1 ) − P ( Y = y | X = x, T = t 0 ) =    P ( Y = y , R Y = 1 | X = x, T = t 1 , R X = 1 , R T = 1) − P ( Y = y , R Y = 1 | X = x, T = t 0 , R X = 1 , R T = 1)    P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) . When Y ⊥ ⊥ T | X = x , the numerator do es not dep end on T and is therefore zero for all y . Consequen tly , the null effect τ t 1 ,t 0 ( x ) = 0 is identified. S2.3 Pro of of Theorem 3 W e fo cus on identifying P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1). Define P y 1 | tx 11 = P ( Y = y , R Y = 1 | T = t, X = x, R T = 1 , R X = 1) , P +0 | tx 11 = P ( R Y = 0 | T = t, X = x, R T = 1 , R X = 1) , ζ t ( y ) = P ( R Y = 0 | T = t, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) . S6 Since P y 1 | tx 11 = P ( Y = y | T = t, X = x ) P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) , it follo ws that P +0 | tx 11 = Z y ∈Y P ( Y = y , R Y = 0 | T = t, X = x, R T = 1 , R X = 1) dy = Z y ∈Y P ( Y = y | T = t, X = x ) P ( R Y = 0 | T = t, Y = y , R X = 1 , R T = 1) dy = Z y ∈Y P y 1 | tx 11 P ( R Y = 0 | T = t, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) dy = Z y ∈Y P y 1 | tx 11 ζ t ( y ) dy . for each t ∈ T . The uniqueness of solutions ζ t ( y ) requires that P ( Y , R Y = 1 | T = t, X , R T = 1 , R X = 1) is complete in X for all t . F or discrete X and discrete Y , the completeness assumption is equiv alen t to Rank (Θ t ) = K , where Θ t is a L × K matrix with P y 1 | tx 11 as the ( x, y )th element. F or binary Y , the rank condition further reduces to X ⊥ ⊥ Y | T . F or con tinuous X and contin uous Y , the dimension of X must b e no smaller than that of Y in general, as required by the completeness assumption. W e can subsequen tly identify P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) once ζ t ( y ) is iden tified. Then, the iden tification of P ( Y = y | X = x, T = t ) follo ws from P ( Y = y | X = x, T = t ) = P ( Y = y | X = x, T = t, R X = 1 , R T = 1) = P ( Y = y , R Y = 1 | X = x, T = t, R X = 1 , R T = 1) P ( R Y = 1 | T = t, Y = y , R X = 1 , R T = 1) . S3 P arametric examples for completeness conditions This section presents parametric examples under which the completeness conditions in Theorems 2 and 3 hold. Theorem 2.2 in Newey and P o well (2003) establishes a general S7 completeness result for exp onential-family data distributions, whic h w e state here for ref- erence. Result 1 The distribution P ( A, B ) = ψ ( B ) h ( A ) exp { λ ( A ) ⊤ η ( B ) } is c omplete in A if (i) ψ ( B ) > 0 , (ii) the supp ort of λ ( A ) c ontains an op en set, and (iii) the mapping B 7→ η ( B ) is one-to-one. F or illustration, we present parametric outcome mo dels that satisfy the corresp onding completeness assumptions for Theorems 2 and 3. Prop osition 1 (example for Theorem 2) F or c ontinuous Y , under the line ar mo del Y = β 0 + β t T + β ⊤ x X + β tx T · X + ε, ε ∼ N (0 , σ 2 ) , if β t + β tx x  = 0 for a given x , then the c onditional joint distribution P ( T , Y , R T = 1 , R Y = 1 | X = x, R X = 1) = P ( Y | T , X = x ) P ( T , R T = 1 , R Y = 1 | X = x, R X = 1) = 1 (2 π σ 2 ) 1 / 2 exp ( −  Y − β 0 − β t T − β ⊤ x x − β tx T · x  2 2 σ 2 ) P ( T , R T = 1 , R Y = 1 | X = x, R X = 1) is c omplete in T . Prop osition 1 follows from Result 1 with A = T and B = Y by rewriting the join t distribution ab o ve in exp onential-family form. Sp ecifically , one can take λ ( T ) = T , η x ( Y ) = σ − 2  β t + β tx x  Y . Prop osition 2 (example for Theorem 3) F or c ontinuous Y , under the line ar mo del Y = α 0 + α t T + α x X + α tx T · X + ε, ε ∼ N (0 , σ 2 ) , S8 if α x + α tx t  = 0 for a given t , then the c onditional joint distribution P ( X , Y , R X = 1 , R Y = 1 | T = t, R T = 1) = P ( Y | X , T = t ) P ( X , R X = 1 , R Y = 1 | T = t, R T = 1) = 1 (2 π σ 2 ) 1 / 2 exp ( −  Y − α 0 − α t t − α x X − α tx t · X  2 2 σ 2 ) P ( X , R X = 1 , R Y = 1 | T = t, R T = 1) is c omplete in X . Prop osition 2 follo ws from Result 1 with A = X and B = Y b y rewriting the join t distribution ab o ve in exp onential-family form. Sp ecifically , one can take λ ( X ) = X , η t ( Y ) = σ − 2  α x + α tx t  Y . S4 Implemen tation details for the 2SLS estimator This section provides implementation details for the series 2SLS estimator under Assump- tion 2. W e fo cus on recov ering the resp onse-o dds function ζ from the F redholm equation and then using b ζ to construct selection-corrected w eights and estimate τ t 1 ,t 0 ( x ). In tegral equation. The resp onse o dds function is ζ ( x, y ) = P ( R Y = 0 | X = x, Y = y , R X = 1 , R T = 1) P ( R Y = 1 | X = x, Y = y , R X = 1 , R T = 1) , π ( x, y ) = { 1 + ζ ( x, y ) } − 1 , and it solves P ( R Y = 0 | T = t, X = x, R X = 1 , R T = 1) = Z P ( Y = y , R Y = 1 | T = t, X = x, R X = 1 , R T = 1) ζ ( x, y ) dy . (S1) Siev e appro ximation and sample analogue. When Y is con tinuous, ζ ( x, y ) is infinite dimensional, so we approximate ζ ( x, y ) ≈ h ( y ) ⊤ β ( x ) , S9 where h ( y ) = ( h 1 ( y ) , . . . , h J ( y )) ⊤ is a chosen siev e basis in y . W e use a Hermite-type en v elop e basis h j ( y ) = exp( − ˜ y 2 ) ˜ y j − 1 with ˜ y = ( y − ¯ y ) /s y and j = 1 , . . . , J , where ¯ y and s y are the sample mean and sample standard deviation computed from complete cases. When Y is binary , we use a saturated t wo-point basis. Define p 0 x ( t ) = P ( R Y = 0 | T = t, X = x, R X = 1 , R T = 1) , (S2) p 1 x ( t ) = P ( R Y = 1 | T = t, X = x, R X = 1 , R T = 1) , (S3) H x ( t ) = E  h ( Y ) | T = t, R Y = 1 , X = x, R X = 1 , R T = 1  . (S4) When X is discrete, (S1) reduces to p 0 x ( t ) ≈ p 1 x ( t ) H x ( t ) ⊤ β ( x ) . (S5) Replacing p 0 x ( t ), p 1 x ( t ), and H x ( t ) by nonparametric estimates and stac king o v er the t - v alues used in the sample analogue yields a linear system b b x ≈ c M x β ( x ), where b b x stac ks b p 0 x ( t ) and c M x stac ks { b p 1 x ( t ) b H x ( t ) ⊤ } ro w-wise. When X is con tinuous, we use a tensor-pro duct siev e ζ ( x, y ) ≈ ϕ ( x, y ) ⊤ θ , ϕ ( x, y ) = g ( x ) ⊗ h ( y ) , where g ( x ) = ( g 1 ( x ) , . . . , g J x ( x )) ⊤ is a siev e basis in x (w e use the same Hermite-type en v elop e basis on the standardized x ). Th us ϕ ( x, y ) ∈ R J x J and θ ∈ R J x J . When both X and Y are contin uous, w e optionally apply an affine whitening transformation to ( X , Y ) based on complete cases before ev aluating the Hermite-en velope basis to improv e n umerical conditioning. S10 Define p 0 ( t, x ) = P ( R Y = 0 | T = t, X = x, R X = 1 , R T = 1) , p 1 ( t, x ) = P ( R Y = 1 | T = t, X = x, R X = 1 , R T = 1) , H ( t, x ) = E  h ( Y ) | T = t, R Y = 1 , X = x, R X = 1 , R T = 1  . Then equation (S1) implies the momen t condition p 0 ( t, x ) ≈ p 1 ( t, x ) ( g ( x ) ⊗ H ( t, x )) ⊤ θ , (S6) whose sample analogue yields a linear system b b ≈ c M θ in stac k ed form. Regularized solution. Because (S1) is an ill-p osed in verse problem, noise in the first- stage estimates can b e amplified when solving for ζ . Given b b x and c M x , we obtain b β ( x ) by solving the regularized least-squares problem b β ( x ) = arg min β ∥ b b x − c M x β ∥ 2 2 sub ject to β ⊤ Λ β ≤ B , and analogously obtain b θ in the tensor-siev e case. Here, Λ is a p ositiv e definite p enalty matrix imp osing a compactness restriction on the siev e co efficients (Y ang et al., 2019). The sieve dimensions and the b ound B join tly control the flexibility of b ζ : larger J (and J x ) can reduce appro ximation bias but often w orsen conditioning and amplify sampling noise, leading to unstable implied weigh ts, whereas smaller B stabilizes the solution at the cost of additional shrink age bias. Selection-corrected regression. W e then form the implied w eights b ω i = 1 + b ζ ( X i , Y i ) , and estimate µ ( t, x ) = E ( Y | T = t, X = x ) by a weigh ted nonparametric regression on complete cases. The complexity of this final regression introduces an additional bias– v ariance trade-off: stronger smo othing reduces v ariance but can atten uate heterogeneity in S11 the estimated treatmen t effect, while weak er smo othing can reco ver finer structure at the cost of increased v ariability when weigh ts are unstable. Finally , b τ t 1 ,t 0 ( x ) = b µ ( t 1 , x ) − b µ ( t 0 , x ) . T uning and inference. In practice, the first-stage smo othing lev els, siev e dimensions, and regularization b ound B are selected jointly to ac hieve a small integral equation residual and w ell-b ehav ed implied weigh ts, av oiding solutions with many nonp ositive or excessively large w eights. Let π min > 0 b e a user-sp ecified lo wer b ound on the response probability . In implementation, w e enforce feasibility/bounded-p ositivity b y truncating the implied w eigh ts to lie in [1 , 1 /π min ] (equiv alently , truncating ˆ π to lie in [ π min , 1]) b efore the final w eigh ted regression. The complexity of the outcome regression is selected separately . Once selected, all tuning parameters are held fixed across bo otstrap resamples for inference. S5 Sim ulation details and additional results S5.1 Data generating pro cess Binary v ariables are generated from Bernoulli distributions and contin uous v ariables from Gaussian distributions. The co v ariate X is generated marginally . The treatment T is generated conditional on X using either a logistic mo del (binary T ) or a linear Gaussian mo del (con tinuous T ). The outcome Y is generated conditional on ( T , X ) using either a logistic mo del (binary Y ) or a linear Gaussian mo del (contin uous Y ), allowing for main effects of T and X and a T × X interaction. Resp onse indicators ( R X , R T , R Y ) follow logistic regression mo dels with fixed slop e co efficien ts. In tercepts are calibrated so that P ( R · = 1) ≈ 0 . 8. Outcome missingness is S12 sp ecified to align with Assumptions 1 – 3 via logit P ( R Y = 1 | · ) = ϕ 0 + 0 . 4 R X + 0 . 4 R T + u a , a ∈ { 1 , 2 , 3 } . T able S1 lists all fixed slop e co efficients and outcome mo del parameters. In tercepts ( γ 0 , η 0 , ϕ 0 ) are calibrated to achiev e the target missingness rates and are omitted. T able S1: Sim ulation parameter v alues. Mo del Binary case Con tinuous case X p X = 0 . 5 ( µ X , σ X ) = (0 . 2 , 1 . 0) T | X ( α 0 , α x ) = ( − 0 . 3 , 0 . 9) ( α 0 , α x , σ T ) = (0 . 5 , 0 . 9 , 1 . 0) Y | ( T , X ) ( β 0 , β t , β x , β tx ) = ( − 0 . 4 , 1 . 1 , 0 . 9 , 0 . 5) ( β 0 , β t , β x , β tx , σ Y ) = ( − 0 . 3 , 1 . 0 , 0 . 8 , 0 . 5 , 1 . 0) R X | ( X, T ) ( γ x , γ t ) = (0 . 6 , − 0 . 4) ( γ x , γ t ) = ( − 1 . 0 , 0 . 6) R T | ( R X , X, T ) ( η x , η t , η r ) = (0 . 4 , 0 . 4 , 0 . 5) ( η x , η t , η r ) = ( − 0 . 6 , − 0 . 4 , 0 . 5) R Y A1 ( u 1 ) u 1 = − 0 . 8 X + 0 . 9 T u 1 = − 0 . 4 X − 0 . 4 T R Y A2 ( u 2 ) u 2 = − 0 . 8 X + 2 . 2 Y u 2 = − 0 . 4 X − 1 . 8 Y R Y A3 ( u 3 ) u 3 = 0 . 9 T + 2 . 2 Y u 3 = − 0 . 4 T − 1 . 8 Y S5.2 Additional results under a null treatmen t effect. W e consider a null effect v arian t of the simulation in whic h τ t 1 ,t 0 ( x ) = 0 for all t 1 , t 0 , x . This setting uses the same data generating pro cess as the main simulation, except that the treatmen t effect parameters in the outcome mo del are set to zero, so that Y ⊥ ⊥ T | X = x for ev ery x . As a result, the completeness condition in Theorem 2 fails by construction. Because p ercen t bias is not meaningful when τ t 1 ,t 0 ( x ) = 0, we assess p erformance using the estimation error b τ t 1 ,t 0 ( x ) − τ t 1 ,t 0 ( x ). In this n ull setting (Figure S1), b oth NP and S13 Figure S1: Under a null effect ( τ 1 , 0 ( x ) = 0), b oth NP and Para remain cen tered near zero. The figure sho ws estimation error b τ 1 , 0 ( x ) − τ 1 , 0 ( x ) across eigh t data-t yp e combinations under Assumption 2; v alues closer to zero indicate b etter p erformance. P ara remain centered near zero under Assumption 2, providing empirical supp ort for the iden tification result in part (ii) of Theorem 2. References New ey , W. and Po well, J. (2003). Instrumental v ariable estimation of nonparametric mo d- els. Ec onometric a , 71(5):1565–1578. S14

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment