Testing Effect Homogeneity and Confounding in High-Dimensional Experimental and Observational Studies

Ana Armendariz
University of St. Gallen, School of Economics and Political Science

Martin Huber
University of Fribourg, Dept. of Economics

February 24, 2026

Abstract

We propose a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational studies. Our approach leverages multiple randomized trials to assess whether treatment effects vary with unobserved heterogeneity that differs across trials: if CATEs are homogeneous, this indicates the absence of interactions between treatment and unobservables in the mean effect. Comparing CATEs between experimental and observational data further allows evaluation of potential confounding: if the estimands coincide, there is no unobserved confounding; if they differ, deviations may arise from unobserved confounding, effect heterogeneity, or both. We extend the framework to settings with alternative identification strategies, namely instrumental variable settings and panel data with parallel trends assumptions based on differences in differences, where effects are identified only locally for subpopulations such as compliers or treated units. In these contexts, testing homogeneity is useful for assessing whether local effects can be extrapolated to the total population. We suggest a test based on double machine learning that accommodates high-dimensional covariates in a data-driven way and investigate its finite-sample performance through a simulation study. Finally, we apply the test to the International Stroke Trial (IST), a large multi-country randomized controlled trial in patients with acute ischaemic stroke that evaluated whether early treatment with aspirin altered subsequent clinical outcomes.
Our methodology provides a flexible tool for both validating identification assumptions and understanding the generalizability of estimated treatment effects.

Keywords: Treatment Effect Heterogeneity, Combining Data, Conditional Average Treatment Effects, Observational Data, Randomized Controlled Trials.

JEL Codes: C10, C12, C21

∗ We thank Federica Mascolo for helpful comments.

1 Introduction

Experiments are widely considered the benchmark for credible causal inference because randomization eliminates confounding and delivers unbiased estimates of treatment effects. In practice, however, experimental data often come with important limitations: sample sizes tend to be modest, covariate information is restricted, and external validity is frequently uncertain. By contrast, modern observational data sets are typically much larger and richer, offering detailed high-dimensional covariates and broad population coverage. These features make them attractive for studying treatment effect heterogeneity, but causal interpretation is hindered by the possibility of unobserved confounding.

The increasing availability of both experimental and observational data raises an interesting question: to what extent do treatment effect estimates align across data sources once we condition on observed covariates, and what does this reveal about the internal and external validity of identification strategies in experiments and observational studies?

This paper proposes a framework for testing the homogeneity of conditional average treatment effects (CATEs), given covariates, across multiple experimental and observational studies (sites) such as different regions or countries. When applied solely to randomized experiments, where unobserved confounding is ruled out by design, our test allows assessing treatment effect heterogeneity arising from unobservables that differ across experiments.
If CATEs are homogeneous across experiments, this indicates an absence of interactions between the treatment and unobserved factors in the average treatment effect, implying that CATEs are externally valid across experimental settings. Moving beyond experiments, comparing CATEs between experimental and observational data additionally permits assessing the presence of confounding. If the CATE estimates are asymptotically equivalent across experiments and observational studies, then CATEs are unconfounded by unobservables (i.e., internally valid) and homogeneous across settings (i.e., externally valid). Conversely, if they differ, the discrepancy may stem from unobserved confounding, effect heterogeneity, or both. This may motivate a sequential application of the testing approach: first, across experiments to assess effect homogeneity, and, if homogeneity is not rejected, subsequently across experimental and observational studies to assess unconfoundedness in addition to external validity.

Our testing approach builds on the double machine learning (DML) framework (Chernozhukov et al., 2018), which combines doubly robust (DR) treatment effect estimation (Hahn, 1998; Robins & Rotnitzky, 1995; Robins, Rotnitzky, & Zhao, 1994) with machine learning to flexibly adjust for high-dimensional covariates. More specifically, we extend the Neyman (1959)-orthogonal score function introduced by Apfel, Hatamyar, Huber, and Kueck (2024), who propose a test for whether CATEs within a single study are jointly zero. In contrast, we use an analogous orthogonal formulation to test whether differences in CATEs across studies, experimental and/or observational, are jointly zero.
The resulting test is $\sqrt{n}$-consistent (where $n$ denotes the sample size) and asymptotically normal under specific regularity conditions, in particular when the machine learning estimators for the treatment and outcome models converge at rate $o(n^{-1/4})$.

As an additional methodological contribution, we extend the framework to settings where identification relies on alternative strategies. These include instrumental-variable designs that identify the local average treatment effect (LATE) for compliers whose treatment status responds to the instrument (Angrist, Imbens, & Rubin, 1996; G. W. Imbens & Angrist, 1994), as well as panel-data settings that rely on parallel trends to identify the average treatment effect on the treated (ATET) based on difference-in-differences (Snow, 1855). Because both the conditional LATE and the conditional ATET pertain to specific subpopulations, rather than the full population conditional on covariates, testing homogeneity is informative for evaluating whether these local effects can be extrapolated to the total population, in the spirit of Angrist and Fernández-Val (2010) and Aronow and Carnegie (2013).

We then investigate the finite-sample behavior of our testing approach in a simulation study that mimics settings with multiple experimental and/or observational sites. First, we study the finite-sample size and power of the test in a multi-site experimental design by comparing CATEs across randomized sites under data-generating processes with homogeneous versus heterogeneous CATEs. Second, we consider a mixed design with randomized treatment in some sites and observational identification in others; we introduce unobserved confounding in the observational sites and evaluate the ability of the test to detect cross-site CATE differences both in the absence and presence of confounding.
For each design, we run 1000 Monte Carlo replications and report summary statistics that describe the performance of our method, such as the rejection rate, standard deviation, and sample size.

Furthermore, we provide an empirical application of our testing approach using data from the International Stroke Trial, a large multi-country randomized controlled trial which evaluated whether early aspirin allocation to patients with acute ischaemic stroke improved their subsequent health. We treat countries as separate experimental sites and test whether the conditional effect of randomized aspirin assignment on six-month death or dependency is homogeneous across countries after adjusting for baseline patient characteristics.

A growing literature combines experimental and observational data to improve causal inference (Colnet et al., 2024). Existing studies use such designs to (i) generalize randomized trial findings to broader populations (Athey, Chetty, & Imbens, 2025; Cole & Stuart, 2010; Ghassami, Yang, Richardson, Shpitser, & Tchetgen, 2022; Hatt, Tschernutter, & Feuerriegel, 2022; G. Imbens, Kallus, Mao, & Wang, 2025; Lelova, Cooper, & Triantafillou, 2025; Parikh et al., 2025; Park & Sasaki, 2024; Pearl & Bareinboim, 2011; Stuart, Bradshaw, & Leaf, 2015; Triantafillou, Jabbari, & Cooper, 2023; Van Goffrier, Maystre, & Gilligan-Lee, 2023), (ii) increase statistical efficiency, especially for heterogeneous treatment effects (Brantner et al., 2024; Cheng & Cai, 2021; Epanomeritakis & Viviano, 2025; Hatt, Berrevoets, Curth, Feuerriegel, & van der Schaar, 2022; E. T. Rosenman, Basse, Owen, & Baiocchi, 2023; E. T. R. Rosenman, Owen, Baiocchi, & Banack, 2022; Wu & Yang, 2022; S. Yang, Gao, Zeng, & Wang, 2023; S. Yang, Zeng, & Wang, 2020; X. Yang, Lin, Athey, Jordan, & Imbens, 2025), and (iii) diagnose or correct bias in observational analyses by using experimental benchmarks (Chen, Aebersold, Puhan, & Serra-Burriel, 2025; Kallus, Puli, & Shalit, 2018; Lelova et al., 2025; Liu & Xie, 2025; Parikh et al., 2025; Triantafillou et al., 2023; Wu & Yang, 2022; S. Yang et al., 2023, 2020). We complement these efforts by introducing a framework that tests whether CATEs are homogeneous across multiple experimental and observational sites. By comparing CATEs across experiments, the test can detect heterogeneity across experimental sites driven by unobservables and evaluate the external validity of experimental CATEs. By comparing CATEs between experimental and observational sites, the test can additionally diagnose the presence of hidden confounding and assess the internal validity of treatment effects. Therefore, our approach provides a unified way to assess both internal and external validity of CATEs using distinct research designs.

The remainder of this paper is organized as follows. Section 2 provides a detailed literature survey on studies combining multiple experimental and observational data for treatment effect evaluation. Section 3 introduces the identifying assumptions and outlines our method. Section 4 extends the testing framework to instrumental and panel data contexts. Section 5 provides a simulation study that investigates the finite-sample performance of our proposed test. Section 6 illustrates the method using the International Stroke Trial. Section 7 concludes.

2 Literature survey

Our study contributes to the three strands of research mentioned above by combining experimental and observational evidence to address unobserved confounding and to evaluate the internal as well as the external validity of CATEs.
Within this body of work, the literature largely follows two approaches: (i) one explicitly models and estimates a confounding function by pooling information from RCTs and observational datasets; (ii) the other develops methods that combine or pool CATE estimators from experimental and observational samples using adaptive weights chosen to balance bias and efficiency.

The first approach focuses on the confounding function, defined as the conditional gap between causal and observational treatment effects, as a central object of interest. Early contributions such as Kallus et al. (2018) propose a method that first learns an observational CATE function and then estimates a low-dimensional correction term using experimental data, so that the observational estimate matches the randomized benchmark even with only partial covariate overlap. Building on this idea, S. Yang et al. (2020) introduce a data fusion framework in which both the CATE and the confounding function are identifiable once observational and experimental samples are coupled. They show that their method improves efficiency relative to using experimental samples alone. Subsequent work by Wu and Yang (2022) extends this framework by proposing an R-learner that incorporates flexible machine learning methods to approximate the CATE, the confounding function, and other nuisance components. More recently, S. Yang et al. (2023) introduce a test-based elastic approach for integrating trial and real-world data. Their method first uses the RCT as a benchmark to test whether the observational sample suffers from bias. If the test indicates bias, only the RCT data are used; if the test supports comparability, the two sources are combined for efficiency.
Complementing these elastic integration ideas, Liu and Xie (2025) develop a direct hypothesis test for unconfoundedness by comparing treatment-outcome contrasts estimated from the RCT and the observational data. Unlike S. Yang et al. (2023), who couple a pretest with an adaptive estimator, Liu and Xie (2025) focus on diagnosis by flagging when the observational sample is likely confounded before applying fusion or machine learning estimators that assume ignorability. Parikh et al. (2025) push this diagnostic idea further by asking which assumption breaks when the two sources disagree: they develop a double machine learning framework that introduces a statistical quantity distinguishing failures of ignorability in the observational sample from failures of external validity of the experiment.

A complementary Bayesian approach incorporates uncertainty about when observational data are safe to use. Triantafillou et al. (2023) propose Bayesian CATE estimation that adaptively borrows from observational data, while Lelova et al. (2025) study identification and transportability of CATEs under an unknown causal graph when combining experimental and observational samples. Beyond the econometric literature, related work in computer science by Hatt, Berrevoets, et al. (2022) proposes a representation learning framework which first learns the shared covariate structure from observational data and then uses experimental data to calibrate the estimation of treatment effects. They formalize the bias from unmeasured confounding as a confounding function, learn this bias by comparing observational and experimental predictions, and then use it to debias CATE estimates.

The second strand of research avoids modeling a confounding function and instead combines CATE estimators computed separately in experimental and observational samples.
Cheng and Cai (2021) combine kernel-based CATE estimates using data-driven weights that default to the experimental data when bias is suspected and combine both experimental and observational sources when estimates align. X. Yang et al. (2025) generalize this idea by choosing the weights given to each data source through cross-validation in a joint loss framework, trading off bias and variance across the two sources.

Related work by E. T. R. Rosenman et al. (2022) proposes to combine RCTs and observational data by stratifying on the observational propensity score and placing experimental units into the same strata. Within each stratum, they estimate treatment effects from experimental and observational sources and then merge them either by spiking experimental data into observational bins or through a data-driven weighting scheme that balances bias and variance. In subsequent work, E. T. Rosenman et al. (2023) extend this approach using Stein-type shrinkage estimators. Their method adaptively shrinks observational estimates toward unbiased experimental estimates. They show that these estimators reduce the mean squared error relative to analyses based on experimental data alone.

While this line of work focuses on optimally combining experimental and observational evidence to improve efficiency, it implicitly assumes that treatment effects are sufficiently stable across settings. In contrast, Chen et al. (2025) investigate whether causal machine learning methods can produce reliable CATE estimates using data from two large RCTs. They show that individualized treatment effects derived from a wide range of machine learning methods fail to replicate across training and test splits or across trials, even in the absence of confounding. This highlights the difficulty of obtaining externally valid CATE estimates and the need for systematic approaches to test the stability of CATEs across settings.
For a broader synthesis of the literature, Colnet et al. (2024) provide a systematic review of approaches that integrate experimental and observational data. Additionally, Brantner et al. (2023) provide a review of methods to combine multiple RCTs or RCTs with observational data, with a focus on treatment effect heterogeneity. They classify approaches by the type of data available and discuss both parametric and machine learning strategies for estimating CATEs. A key takeaway is that comparing CATEs across sources provides a way to assess stability and detect potential confounding.

Despite advances in combining experimental and observational data, most applications focus on a single RCT paired with one observational dataset. As a result, little is known about whether CATEs align across multiple experiments and observational sources. An exception is Brantner et al. (2024), who develop and study methods for estimating CATEs when several RCTs are available. They adapt S-learner, X-learner, and causal forest estimators to the multi-trial setting. They show that strategies allowing trial-level heterogeneity outperform naive pooling that ignores study differences. Their work highlights the challenges of integrating multiple experiments, but does not examine the alignment of CATEs between experimental and observational data. Our study fills this gap by focusing on testing the homogeneity of CATEs across both experimental and observational sources.

3 Assumptions and testing approach

Let $D$ denote the binary treatment and $Y$ the outcome of interest. Using the potential outcomes framework as advocated in Neyman (1923) and D. B. Rubin (1974), we denote by $Y(d)$ the potential outcome when exogenously setting the treatment $D$ of a subject to value $d \in \{1, 0\}$. More generally, we will use capital letters for random variables and lower case letters for their realizations.
By representing the potential outcome $Y(d)$ as a function solely of a subject's own treatment status $D = d$, we implicitly assume that the potential outcomes of one subject are not influenced by the treatment status of others. This assumption is known as the stable unit treatment value assumption (SUTVA), see D. Rubin (1980) and Cox (1958), and is invoked throughout. Furthermore, let $X$ denote a set of observed pretreatment covariates, and let $Z$ be a discrete variable that indexes different setups or studies in which the experimental or observational data were collected (for example, sites or regions). The variable can take integer values $z \in \{1, ..., L\}$, with $L$ denoting the number of setups.

We suggest a method to test effect homogeneity in conditional average treatment effects (CATEs) across different experiments, or selection-on-observables (and effect homogeneity) across experimental and observational data, respectively. First, we consider the case of comparisons within experimental studies. Suppose that treatment is randomly assigned within each site or region $Z$, possibly conditional on covariates $X$ (as in stratified randomization). This corresponds to the standard selection-on-observables assumption, also known as unconfoundedness or conditional independence (G. W. Imbens, 2004).

Assumption 1 (Conditional independence of the treatment).
\[
\{Y(1), Y(0)\} \perp\!\!\!\perp D \mid X, Z,
\]
where $\perp\!\!\!\perp$ denotes statistical independence.

In addition, we require a condition ensuring that treated and untreated units are observed in all relevant subpopulations of $X$ and $Z$:

Assumption 2 (Common support).
\[
0 < \Pr(D = d, Z = z \mid X) < 1, \quad \forall\, d \in \{1, 0\} \text{ and } z \in \{1, ..., L\}.
\]

The common support assumption guarantees overlap in the treatment assignment across different experiments and covariate profiles.
It rules out situations where, conditional on covariates $X$, treatment assignment or assignment to a specific experiment is deterministic.

The conditional average treatment effect (CATE) given covariates $X$ and experiment $Z$ is defined as
\[
\Delta_{x,z} = E[Y(1) - Y(0) \mid X = x, Z = z]. \tag{1}
\]
Under (conditional) treatment randomization, which implies the satisfaction of Assumption 1, we have that $\Delta_{x,z}$ corresponds to
\[
\delta_{x,z} = E[Y \mid D = 1, X = x, Z = z] - E[Y \mid D = 0, X = x, Z = z]. \tag{2}
\]
We are interested in testing whether these conditional effects vary across experiments. Formally, we impose:

Assumption 3 (Conditional effect homogeneity).
\[
E[Y(1) - Y(0) \mid X, Z] = E[Y(1) - Y(0) \mid X].
\]

This assumption states that CATEs are homogeneous across experiments $Z$. Such homogeneity may hold for two distinct reasons. First, treatment effects may not interact with unobserved heterogeneity once we condition on $X$. An example where this is satisfied is the following structural model:
\[
Y = \kappa(D, X) + \eta(U), \tag{3}
\]
where $U$ denotes unobserved characteristics and $\kappa$ and $\eta$ are unknown functions. While the effect of $D$ may vary arbitrarily across $X$, it does not vary with $U$ conditional on $X$ due to the additive separability of $\kappa$ and $\eta$. In such cases, even if unobserved characteristics differ across experiments, they do not generate treatment effect heterogeneity.

Second, treatment effects may indeed interact with unobserved heterogeneity, but the distribution of this heterogeneity remains stable across experiments. For instance, consider the structural model
\[
Y = \kappa(D, X, U), \tag{4}
\]
where the effect of $D$ on $Y$ may arbitrarily interact with $U$. This second case, in which effect heterogeneity exists but is not detectable because the distribution of $U$ is identical across experiments, appears less plausible when experimental sites differ substantially in institutional, geographic, or temporal contexts.
Such differences typically shift the distribution of unobserved characteristics. The more variation there is in unobserved heterogeneity across experiments, the more informative the data become for detecting interactions between treatment effects and unobservables.

Conditional on Assumption 1, Assumption 3 yields the following testable null hypothesis:
\[
H_0: \underbrace{\mu_{1,x,z} - \mu_{0,x,z}}_{\delta_{x,z}} - \big(\underbrace{\mu_{1,x,z'} - \mu_{0,x,z'}}_{\delta_{x,z'}}\big) = 0, \quad \forall\, z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}, \tag{5}
\]
where $\mathcal{X}$ denotes the support of $X$ and $\mu_{d,x,z} = E[Y \mid D = d, X = x, Z = z]$ denotes the conditional mean outcome.

To test the null hypothesis in (5), we adapt the doubly robust conditional independence test of Apfel et al. (2024) to the problem of effect homogeneity. That is, we extend their approach, which tests whether CATEs differ from zero, to instead test whether differences in CATEs across experiments are equal to zero. Note that (5) can equivalently be written as
\[
H_0: \theta_0 = E\left[ \sum_{z=1}^{L} \Big[ (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-})^2 + (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \Big] \right] = 0, \tag{6}
\]
where $z^-$ denotes values of $Z$ different from $z$, such that $Z \neq z$. Denote by $p_{d,z}(X) = \Pr(D = d, Z = z \mid X)$ the joint propensity of treatment and being in a specific experiment or site. We denote the nuisance parameters by
\[
\eta = \big( p_{1,z}(X), p_{1,z^-}(X), \mu_{1,X,z}, \mu_{1,X,z^-}, p_{0,z}(X), p_{0,z^-}(X), \mu_{0,X,z}, \mu_{0,X,z^-} \big).
\]
We modify the approach of Apfel et al. (2024), who consider a simple difference in conditional means to test whether any CATE is different from zero, by using double differences (differences in differences) of conditional means to test whether any difference in CATEs across experiments is different from zero.
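As a rough numerical illustration of the contrast tested in (5), the following Python sketch simulates two randomized sites under an additive model of type (3) and under an interacting model of type (4), with the mean of the unobservable $U$ shifted in the second site. The functional forms, the binary covariate, the sample size, and the helper names `simulate` and `cate_gap` are illustrative assumptions and not part of the paper; conditional means are estimated by simple cell averages rather than machine learning.

```python
import numpy as np

def simulate(n, interact, seed):
    """Two randomized sites; U has mean 0 in site 1 and mean 1 in site 2."""
    rng = np.random.default_rng(seed)
    z = rng.integers(1, 3, size=n)            # site indicator Z in {1, 2}
    x = rng.integers(0, 2, size=n)            # binary covariate X
    d = rng.integers(0, 2, size=n)            # randomized treatment D
    u = rng.normal(loc=(z == 2) * 1.0)        # unobservable, shifted in site 2
    if interact:
        y = d * (1.0 + x + u) + u             # type (4): effect of D depends on U
    else:
        y = d * (1.0 + x) + u                 # type (3): additively separable in U
    return y, d, x, z

def cate_gap(y, d, x, z):
    """Cell-mean estimate of delta_{x,1} - delta_{x,2} for each value of x."""
    gaps = {}
    for xv in (0, 1):
        delta = {}
        for zv in (1, 2):
            cell = (x == xv) & (z == zv)
            delta[zv] = y[cell & (d == 1)].mean() - y[cell & (d == 0)].mean()
        gaps[xv] = delta[1] - delta[2]
    return gaps

gaps_additive = cate_gap(*simulate(200_000, interact=False, seed=0))
gaps_interact = cate_gap(*simulate(200_000, interact=True, seed=0))
print("additive model:", gaps_additive)
print("interacting model:", gaps_interact)
```

By construction, the gaps are close to zero in the additive design even though the distribution of $U$ differs across sites, whereas in the interacting design they are close to $E[U \mid Z = 1] - E[U \mid Z = 2] = -1$.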
Testing with a multivalued $Z$ can be based on the following score function, in which $O = (Y, D, X, Z)$ denotes the random variables:
\[
\begin{aligned}
\psi(O, \theta, \eta) ={}& \sum_{z=1}^{L} (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-})^2 \\
&+ \sum_{z=1}^{L} 2 (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \bigg[ \frac{(Y - \mu_{1,X,z}) \mathbf{1}(D = 1, Z = z)}{p_{1,z}(X)} - \frac{(Y - \mu_{0,X,z}) \mathbf{1}(D = 0, Z = z)}{p_{0,z}(X)} \\
&\qquad\qquad - \frac{(Y - \mu_{1,X,z^-}) \mathbf{1}(D = 1, Z \neq z)}{p_{1,z^-}(X)} + \frac{(Y - \mu_{0,X,z^-}) \mathbf{1}(D = 0, Z \neq z)}{p_{0,z^-}(X)} \bigg] \\
&+ \sum_{z=1}^{L} (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \\
&+ \sum_{z=1}^{L} \bigg[ \frac{(Y - \mu_{1,X,z}) \mathbf{1}(D = 1, Z = z)}{p_{1,z}(X)} - \frac{(Y - \mu_{0,X,z}) \mathbf{1}(D = 0, Z = z)}{p_{0,z}(X)} \\
&\qquad\qquad - \frac{(Y - \mu_{1,X,z^-}) \mathbf{1}(D = 1, Z \neq z)}{p_{1,z^-}(X)} + \frac{(Y - \mu_{0,X,z^-}) \mathbf{1}(D = 0, Z \neq z)}{p_{0,z^-}(X)} \bigg] - \theta.
\end{aligned}
\tag{7}
\]
This score has a variance that is bounded away from zero, is zero in expectation under the null hypothesis in equation (6) when $\theta_0 = 0$, and is Neyman-orthogonal; see the proof provided in the appendix of Apfel et al. (2024). These properties follow directly from the proofs in Apfel et al. (2024), as our score function is based on applying their type of score function twice to turn it into a double (rather than a single) difference across $\mu_{d,X,z}$. As the double difference is just a linear combination of the single differences, the asymptotic findings in Apfel et al. (2024) directly apply to our case, too. In particular, cross-fitted estimators of $\theta_0$ based on the score function (7) are asymptotically normal and $\sqrt{n}$-consistent under specific regularity conditions, in particular if the machine learners used for estimating the nuisance parameters $\eta$ have a convergence rate of $o(n^{-1/4})$.

The same testing framework can also be applied for comparing experimental and observational studies in a second step following the within-experiments comparison.
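As a rough illustration of how one summand of the score in (7) could be evaluated, the following Python sketch computes the contribution of a single site $z$, compared with all remaining sites, using two-fold cross-fitting, and forms a t-statistic from the sample mean of that summand. It is only a sketch under simplifying assumptions that are not part of the paper: the covariate is discrete, so cell means and cell frequencies stand in for the machine learners, and the names `score_summand`, `homogeneity_test`, and `simulate_sites`, as well as the simulated design, are hypothetical. With $L$ sites, the full score would add up such terms over $z$.

```python
import numpy as np

def score_summand(y, d, x, in_z, n_folds=2, seed=0):
    """Cross-fitted values of one summand of the score (before subtracting theta).

    in_z marks observations with Z = z; all others form the group Z != z.
    x must be discrete: nuisances are cell means / cell shares (an assumption).
    """
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % n_folds
    psi = np.empty(n)
    for k in range(n_folds):
        train, test = fold != k, fold == k
        y_te, d_te, x_te = y[test], d[test], x[test]
        contrast = np.zeros(test.sum())   # mu_1z - mu_0z - mu_1z- + mu_0z-
        resid = np.zeros(test.sum())      # inverse-probability-weighted residuals
        for xv in np.unique(x):
            tr_x = train & (x == xv)
            te_x = x_te == xv
            for dv, s_d in ((1, 1.0), (0, -1.0)):
                for grp, s_z in ((in_z, 1.0), (~in_z, -1.0)):
                    cell = tr_x & (d == dv) & grp
                    mu = y[cell].mean()              # outcome regression (training fold)
                    p = cell.sum() / tr_x.sum()      # joint propensity (training fold)
                    hit = te_x & (d_te == dv) & grp[test]
                    contrast[te_x] += s_d * s_z * mu
                    resid[hit] += s_d * s_z * (y_te[hit] - mu) / p
        psi[test] = contrast**2 + 2 * contrast * resid + contrast + resid
    return psi

def homogeneity_test(psi):
    """Estimate theta by the mean score; return (theta_hat, t_statistic)."""
    theta_hat = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return theta_hat, theta_hat / se

def simulate_sites(n, gap, seed):
    """Randomized treatment; the effect in site z exceeds the others by `gap`."""
    rng = np.random.default_rng(seed)
    in_z = rng.integers(0, 2, size=n).astype(bool)
    x = rng.integers(0, 2, size=n)
    d = rng.integers(0, 2, size=n)
    y = d * (1.0 + x + gap * in_z) + rng.normal(size=n)
    return y, d, x, in_z

theta_null, t_null = homogeneity_test(score_summand(*simulate_sites(20_000, 0.0, seed=1)))
theta_alt, t_alt = homogeneity_test(score_summand(*simulate_sites(20_000, 1.0, seed=1)))
print("homogeneous design:", theta_null, t_null)
print("heterogeneous design:", theta_alt, t_alt)
```

In this simulated design, the t-statistic under homogeneous effects should behave roughly like a standard normal draw, while the design with a larger effect in site $z$ should produce a large positive statistic. In an application with high-dimensional covariates, the cell-based nuisance estimates would be replaced by cross-fitted machine learning predictions.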
In purely observational data, where Assumption 1 cannot be taken for granted, rejection of (5) or (6) may reflect violations of Assumption 1, Assumption 3, or both. However, if experimental data suggest that CATEs are homogeneous, i.e., Assumption 3 holds, then comparisons of CATEs between experimental and observational studies provide a direct test of Assumption 1. Specifically, one can define $Z$ such that $Z = 1$ indicates observations from experimental studies, while $Z = 2, \ldots, L$ index observational studies. Alternatively, $Z$ may distinguish between observational studies only. In both cases, testing can again be implemented using the score in (7).

4 Alternative identifying assumptions

Our method for testing effect homogeneity can also be adapted to settings where treatment is not conditionally exogenous. For instance, access to treatment may be randomly assigned conditional on $X$, while actual treatment take-up deviates from assignment due to noncompliance. In this case, we may use assignment, henceforth denoted by $W$, as an instrument for actual treatment $D$. Following G. W. Imbens and Angrist (1994) and Angrist et al. (1996), we denote by $D(w)$ the potential treatment as a function of instrument $W$ and by $Y(w, d)$ the potential outcome as a function of $W$ and $D$. We impose the following instrumental variable (IV) assumptions conditional on covariates $X$ and experiment $Z$, see Abadie (2003):

Assumption 4 (IV assumptions).
\[
\begin{aligned}
&\{D(w), Y(w', d)\} \perp\!\!\!\perp W \mid X, Z \ \text{ for } w, w', d \in \{0, 1\}, \quad \Pr(Y(1, d) = Y(0, d) = Y(d) \mid X, Z) = 1, \\
&\Pr(D(1) \geq D(0) \mid X, Z) = 1, \quad E[D \mid W = 1, X, Z] - E[D \mid W = 0, X, Z] \neq 0, \\
&0 < \Pr(W = 1 \mid X, Z) < 1.
\end{aligned}
\]

The first line of Assumption 4 requires that the instrument is as good as randomly assigned and satisfies the exclusion restriction conditional on $X$ and $Z$.
The second line rules out the existence of defiers, but it also requires the existence of compliers conditional on $X$, due to the nonzero conditional first stage. The third line imposes common support on the instrument, implying that assignment is not deterministic in $X$ and $Z$. Assumption 4 permits identifying the conditional local average treatment effect (CLATE) among the subgroup of compliers, denoted by $c$, who are treated only if the instrument is equal to one: $c: D(1) = 1, D(0) = 0$.

The CLATE given covariates $X$ and experiment $Z$ is defined as
\[
\Delta_{c,x,z} = E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X = x, Z = z]. \tag{8}
\]
The CLATE is identified using a Wald-type estimand (Wald, 1940), defined as the ratio of the reduced-form effect of the instrument on the outcome to the first-stage effect of the instrument on the treatment, conditional on $X$ and $Z$:
\[
\delta_{c,x,z} = \frac{E[Y \mid W = 1, X = x, Z = z] - E[Y \mid W = 0, X = x, Z = z]}{E[D \mid W = 1, X = x, Z = z] - E[D \mid W = 0, X = x, Z = z]} = \frac{\bar{g}_{x,z}}{\bar{h}_{x,z}}, \tag{9}
\]
where we define the reduced-form and first-stage effects as $\bar{g}_{x,z} = m_{1,x,z} - m_{0,x,z}$ and $\bar{h}_{x,z} = r_{1,x,z} - r_{0,x,z}$, with $m_{w,x,z} = E[Y \mid W = w, X = x, Z = z]$ and $r_{w,x,z} = E[D \mid W = w, X = x, Z = z]$. In analogy to equation (5) for the CATE, the null hypothesis
\[
H_0: \delta_{c,x,z} - \delta_{c,x,z'} = 0, \quad \forall\, z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}, \tag{10}
\]
permits testing the following effect homogeneity assumption among compliers:

Assumption 5 (Conditional effect homogeneity among compliers).
\[
E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X, Z] = E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X].
\]

Since $\delta_{c,x,z} = \bar{g}_{x,z} / \bar{h}_{x,z}$ and the first stages are nonzero by Assumption 4, the null hypothesis (10) can be equivalently written in cross-multiplied form as
\[
\bar{\Theta}_{x,z} = \bar{g}_{x,z} \bar{h}_{x,z'} - \bar{g}_{x,z'} \bar{h}_{x,z} = 0, \quad \forall\, z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}. \tag{11}
\]
Hence, a ratio-free version of the null hypothesis for the CLATE that is analogous to equation (6) for the CATE is given by
\[
H_0: \theta_0 = E\left[ \sum_{z=1}^{L} \Big[ (\bar{g}_{X,z} \bar{h}_{X,z^-} - \bar{g}_{X,z^-} \bar{h}_{X,z})^2 + (\bar{g}_{X,z} \bar{h}_{X,z^-} - \bar{g}_{X,z^-} \bar{h}_{X,z}) \Big] \right] = 0, \tag{12}
\]
where $z^-$ denotes values of $Z$ different from $z$. We also define the propensity score $\pi_{w,z}(x) = \Pr(W = w, Z = z \mid X = x)$ and collect the nuisance parameters in
\[
\eta = \big( m_{w,x,z}, r_{w,x,z}, \pi_{w,z}(x), \pi_{w,z^-}(x) \big)_{w \in \{0,1\},\, z \in \{1, ..., L\}}.
\]
Following a similar logic as in the CATE case, we construct a DR score function based on (12) for the CLATE setting. To this end, define the DR augmentations for the reduced-form and first-stage effects as
\[
\begin{aligned}
g_{X,z} &= m_{1,X,z} - m_{0,X,z} + \frac{(Y - m_{1,X,z}) \mathbf{1}(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(Y - m_{0,X,z}) \mathbf{1}(W = 0, Z = z)}{\pi_{0,z}(X)}, \\
h_{X,z} &= r_{1,X,z} - r_{0,X,z} + \frac{(D - r_{1,X,z}) \mathbf{1}(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(D - r_{0,X,z}) \mathbf{1}(W = 0, Z = z)}{\pi_{0,z}(X)}.
\end{aligned}
\]
Using these, we denote the cross-product difference by $\Theta_{X,z} = g_{X,z} h_{X,z^-} - g_{X,z^-} h_{X,z}$. Denoting by $O = (Y, D, X, Z, W)$ the random variables, a DR score function for testing the null hypothesis in (12) is
\[
\begin{aligned}
\psi^{CLATE}(O, \theta, \eta) ={}& \sum_{z=1}^{L} \big[ \bar{\Theta}_{X,z}^2 + \Theta_{X,z} \big] \\
&+ \sum_{z=1}^{L} 2 \bar{\Theta}_{X,z} \bigg\{ h_{X,z^-} \bigg[ \frac{(Y - m_{1,X,z}) \mathbf{1}(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(Y - m_{0,X,z}) \mathbf{1}(W = 0, Z = z)}{\pi_{0,z}(X)} \bigg] \\
&\qquad\qquad - h_{X,z} \bigg[ \frac{(Y - m_{1,X,z^-}) \mathbf{1}(W = 1, Z \neq z)}{\pi_{1,z^-}(X)} + \frac{(Y - m_{0,X,z^-}) \mathbf{1}(W = 0, Z \neq z)}{\pi_{0,z^-}(X)} \bigg] \bigg\} \\
&+ \sum_{z=1}^{L} 2 \bar{\Theta}_{X,z} \bigg\{ g_{X,z^-} \bigg[ \frac{(D - r_{1,X,z}) \mathbf{1}(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(D - r_{0,X,z}) \mathbf{1}(W = 0, Z = z)}{\pi_{0,z}(X)} \bigg] \\
&\qquad\qquad - g_{X,z} \bigg[ \frac{(D - r_{1,X,z^-}) \mathbf{1}(W = 1, Z \neq z)}{\pi_{1,z^-}(X)} + \frac{(D - r_{0,X,z^-}) \mathbf{1}(W = 0, Z \neq z)}{\pi_{0,z^-}(X)} \bigg] \bigg\} - \theta.
\end{aligned}
\tag{13}
\]
This score function has zero mean under the null hypothesis $H_0: \theta_0 = 0$ and is Neyman-orthogonal with respect to the nuisance parameters $\eta$, see the proof provided in Appendix 7.
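To see how the ratio-free contrast in (11) relates to the difference in Wald estimands in (10), the following Python sketch simulates two sites with a randomized instrument and one-sided noncompliance and computes both quantities from cell means. It is a plug-in illustration only: it omits covariates and the DR augmentations entering (13), and the data-generating process, the complier shares, and the function names `simulate_iv`, `wald_parts`, and `contrasts` are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def simulate_iv(n, late_shift, seed):
    """Two sites, randomized instrument W, one-sided noncompliance (no defiers)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(1, 3, size=n)               # site Z in {1, 2}
    w = rng.integers(0, 2, size=n)               # randomized instrument W
    v = rng.uniform(size=n)                      # unobserved resistance to treatment
    share = np.where(z == 1, 0.5, 0.8)           # complier share differs by site
    d = ((v < share) & (w == 1)).astype(int)     # treated only if complier and W = 1
    effect = 1.0 + late_shift * (z == 2)         # complier effect, shifted in site 2
    y = effect * d + rng.normal(size=n)
    return y, d, w, z

def wald_parts(y, d, w, z, site):
    """Reduced form g and first stage h for one site, estimated by cell means."""
    s = z == site
    g = y[s & (w == 1)].mean() - y[s & (w == 0)].mean()
    h = d[s & (w == 1)].mean() - d[s & (w == 0)].mean()
    return g, h

def contrasts(y, d, w, z):
    """Return (cross-product contrast as in (11), difference of Wald ratios as in (10))."""
    g1, h1 = wald_parts(y, d, w, z, 1)
    g2, h2 = wald_parts(y, d, w, z, 2)
    return g1 * h2 - g2 * h1, g1 / h1 - g2 / h2

cross_hom, wald_hom = contrasts(*simulate_iv(200_000, late_shift=0.0, seed=2))
cross_het, wald_het = contrasts(*simulate_iv(200_000, late_shift=1.0, seed=2))
print("homogeneous CLATE:", cross_hom, wald_hom)
print("heterogeneous CLATE:", cross_het, wald_het)
```

By construction, both contrasts are close to zero when the complier effect is the same in both sites, even though the complier shares differ, and both move away from zero when the effect in site 2 is shifted. The cross-product form avoids dividing by an estimated first stage, which is the motivation for the ratio-free formulation.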
Consequently, cross-fitted estimators of $\theta_0$ based on (13) are $\sqrt{n}$-consistent and asymptotically normal under specific regularity conditions, in particular if machine learning estimators of the nuisance functions converge at rate $o(n^{-1/4})$. We note that, although the construction of the CLATE score function is related to the approach of Apfel et al. (2024), there is a conceptual difference compared to the CATE framework considered in their paper and in our Section 3. While the DR function for the CATE is linear (but not quadratic) in the debiasing terms, in which outcome regression residuals are reweighted by the inverse of propensity scores (also known as augmented residuals), the CLATE score involves a cross-product between the DR estimators of the reduced-form and first-stage effects. Consequently, the CLATE moment condition contains both $\Theta_{X,z}$ and its square, reflecting the bilinear structure of the CLATE, which depends on the ratio of two conditional effects. This introduces second-order terms in the score but does not alter Neyman orthogonality, as the influence of the nuisance parameters still cancels out through the residual orthogonality conditions $E[Y - m_{w,X,z} \mid X, Z, W] = 0$ and $E[D - r_{w,X,z} \mid X, Z, W] = 0$. In this sense, the CLATE score extends the construction of scores based on squared differences in regression functions to a setting where both the numerator (reduced form) and denominator (first stage) of the parameter of interest must be debiased simultaneously.

As for Assumption 3 in the CATE case, it is worth noting that Assumption 5 may hold for two reasons: either CLATEs do not depend on unobservables, or the distribution of unobservables is stable across experiments.
To illustrate, consider the outcome model in equation (4) together with a threshold-crossing treatment model

$$D = I\{\lambda(Z, X) > \eta(V)\}, \tag{14}$$

where $I\{\cdot\}$ is the indicator function that is equal to one if its argument is satisfied and zero otherwise, $\lambda$ and $\eta$ are unknown functions, and $V$ are unobservables affecting the treatment. We note that the threshold-crossing model for treatment assignment in equation (14) both implies and is implied by treatment monotonicity, as shown in Vytlacil (2002). Regarding effect heterogeneity, the unobservable $V$ may be arbitrarily associated with $U$ under our IV assumptions, so that heterogeneity of treatment effects in $U$ generally also induces heterogeneity with respect to $V$ and hence across compliance types, defined by whether $I\{\lambda(1, X) > \eta(V)\} = 1$ and $I\{\lambda(0, X) > \eta(V)\} = 0$. Therefore, if it can be assumed that unobservables differ across $Z$, then satisfaction of Assumption 5 points to homogeneous effects. This, in turn, implies that treatment effects do not depend on compliance behavior, which is itself a function of unobservables, conditional on $X$, an assumption discussed in Angrist and Fernández-Val (2010) and Aronow and Carnegie (2013):

Assumption 6 (CLATE equals CATE).

$$E[Y(1) - Y(0) \mid D(1), D(0), X, Z] = E[Y(1) - Y(0) \mid X].$$

An important implication of this assumption is that it allows extrapolating the CLATE to the entire population, since under Assumption 6 the CLATE coincides with the CATE. In other words, the identified effect is no longer local to compliers, but represents the average conditional effect for the full population.

Further, alternative identifying assumptions can be considered when panel data (or also repeated cross sections) are available, in which outcomes are observed both before and after the introduction of treatment.
To this end, we introduce the time index $t \in \{0, 1\}$, where $t = 0$ refers to the pre-treatment period and $t = 1$ to the post-treatment period, and denote by $Y_t$ and $Y_t(d)$ the outcome and the potential outcome (given $D = d$) at time $t$, respectively. This setting permits effect identification based on the parallel trends assumption, which requires conditional independence in outcome trends rather than in outcome levels (as imposed by Assumption 1). A set of sufficient assumptions for identifying the conditional average treatment effect on the treated (CATET) in panel data based on the difference-in-differences (DiD) approach is the following, see, e.g., Abadie (2005); Lechner (2011):

Assumption 7 (DiD assumptions).

$$E[Y_1(0) - Y_0(0) \mid D = 1, X, Z] = E[Y_1(0) - Y_0(0) \mid D = 0, X, Z],$$

$$E[Y_0(1) - Y_0(0) \mid D = 1, X, Z] = 0,$$

$$\Pr(D = 1 \mid X, Z) < 1.$$

The first condition in Assumption 7 formalizes the conditional common trends assumption: given (presumably exogenous) covariates $X$ and experiment $Z$, no unobserved factors simultaneously affect both treatment assignment and the trend of mean potential outcomes under non-treatment. In DiD settings, it is worth noting that multiple experiments $Z$ may, for instance, correspond to multiple treated regions observed within the same dataset. The second condition rules out average anticipation effects among the treated, conditional on $X$. It requires that treatment status $D$ does not causally influence pre-treatment outcomes in expectation of the treatment to come. The third line imposes a specific common support condition for identifying the CATET, requiring that for every covariate profile $X$ and experiment $Z$ observed among the treated, there also exist some untreated observations with the same $(X, Z)$.
When replacing $Y$ by the outcome difference $Y_1 - Y_0$ in the definitions of the conditional mean outcomes $\mu_{D,X,Z}$ introduced in Section 3, the null hypotheses (5) and (6), as well as the score function (7), can be redefined to evaluate effect heterogeneity across CATETs in different experiments. A natural question is whether these effects can be extrapolated to the total population, i.e., whether the CATET coincides with the CATE. In general, this is not the case because the parallel trends condition in Assumption 7 is only imposed for the untreated potential outcomes, such that identification is restricted to the treated group. In particular, treatment effects may differ with respect to time-invariant confounders that are allowed to differ between treated and untreated units. However, if experiments differ in such time-invariant confounders, and effect homogeneity across experiments is not rejected, this suggests that treatment effects do not depend on them. In this case, the CATET coincides with the CATE, as expressed in the following assumption:

Assumption 8 (CATET equals CATE).

$$E[Y(1) - Y(0) \mid D = 1, X, Z] = E[Y(1) - Y(0) \mid X].$$

The following structural model illustrates the role of time-invariant unobservables. Let

$$Y_t = \kappa_t(D, X) + \eta(D, U) + \varepsilon_t, \tag{15}$$

where $\kappa_t(D, X)$ is an unknown, time-varying function of covariates $X$ and treatment $D$, $\eta(D, U)$ is a time-invariant function of unobservables $U$ that may interact with treatment, and $\varepsilon_t$ is an idiosyncratic, time-varying error. For the potential outcome under non-treatment,

$$Y_t(0) = \kappa_t(0, X) + \eta(0, U) + \varepsilon_t, \tag{16}$$

it follows that differencing across time, $Y_1(0) - Y_0(0)$, eliminates $\eta(0, U)$ due to its additive separability. Hence, the parallel trends condition holds with respect to $Y_t(0)$ conditional on $X$, even if the distribution of $U$ differs between treatment groups.
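The differencing argument can be checked numerically. The sketch below is illustrative only: the functional forms chosen for $\kappa_t$ and $\eta(0, U)$, the binary covariate, and the shift in the distribution of $U$ between treatment groups are assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.integers(0, 2, size=n)          # binary covariate (assumption)
d = rng.integers(0, 2, size=n)          # treatment group indicator
u = rng.normal(loc=d, scale=1.0)        # U is shifted upward for the treated

def kappa(t, x):
    # Hypothetical time-varying component under non-treatment.
    return t * (1.0 + 0.5 * x)

def eta0(u):
    # Hypothetical time-invariant component under non-treatment.
    return 3.0 * u

# Untreated potential outcomes in both periods, as in equation (16).
y0_pre = kappa(0, x) + eta0(u) + rng.standard_normal(n)
y0_post = kappa(1, x) + eta0(u) + rng.standard_normal(n)
trend = y0_post - y0_pre                # eta0(u) cancels here

gaps = {}
for xv in (0, 1):
    treated = (d == 1) & (x == xv)
    control = (d == 0) & (x == xv)
    level_gap = y0_pre[treated].mean() - y0_pre[control].mean()
    trend_gap = trend[treated].mean() - trend[control].mean()
    gaps[xv] = (level_gap, trend_gap)
    print(f"X={xv}: level gap {level_gap:.3f}, trend gap {trend_gap:.3f}")
```

In this toy setup the pre-treatment levels differ markedly between groups because $U$ differs, while the gap in trends is close to zero within each covariate cell, which is what the conditional parallel trends condition requires.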
However, treatment effects may still vary across groups, since arbitrary interactions between $D$ and $U$ are allowed in $\eta(D, U)$. Now consider the alternative model

$$Y_t = \kappa_t(D, X) + \eta(U) + \varepsilon_t, \tag{17}$$

which rules out such interactions and implies additive separability of $U$ and $D$. In this case, treatment effects are homogeneous in $U$, as in classical linear panel regression models. Therefore, if $U$ plausibly varies across experiments $Z$ but CATETs are found to be constant across $Z$ (and thus across distributions of $U$), this provides evidence in favor of Assumption 8, which justifies extrapolating effects identified for the treated (CATET) to the entire population (CATE).

5 Simulation Study

This section describes a simulation study to investigate the finite sample behavior of our proposed test of homogeneity of CATEs across experimental and observational sites. We first consider comparisons across experiments and base our simulations on the following data generating process (DGP):

$$Y = D + D X'\beta + \delta D Z + X'\beta + U,$$

$$D \sim \text{Bernoulli}(q), \quad X \sim N(0, \Sigma), \quad Z \sim \text{Bernoulli}(\pi), \quad U \sim N(0, 1),$$

where the outcome $Y$ is a function of the treatment $D$ and the covariates $X$, $Z$ is an indicator of an experimental site, and $U$ denotes the error term. In the experimental design, $D$ is randomly assigned based on a Bernoulli distribution with probability $q$, and Assumption 1 (conditional independence) holds by construction. $X$ is a vector of covariates of dimension $p$, drawn from a multivariate normal distribution with zero mean and covariance matrix $\Sigma$. In this specification, $\Sigma$ equals the identity matrix, implying that all covariates are independent and have unit variance. $Z$ is an indicator of an experimental (or observational) site, generated independently from a Bernoulli distribution with probability $\pi$. The coefficients $\beta$ determine the impact of the covariates $X$ on $Y$.
Finally, $U$ is a random and normally distributed error term. In the observational design, the element that changes is the treatment assignment, which is no longer randomized:

$$D_{obs} = I\{X'\beta + \rho U + V > 0\}, \quad V \sim N(0, 1).$$

Thus, we consider the case where treatment ($D_{obs}$) depends on covariates $X$ and the error terms $U$ and $V$, where $\rho$ determines the strength of confounding in the observational sites. At the same time, the parameter $\delta$ from the outcome equation governs the degree of effect heterogeneity across experimental and/or observational sites. Consequently, when $\delta = 0$, treatment effects are constant across sites, while $\delta \neq 0$ induces heterogeneity in CATEs across sites, indexed by $Z$. Likewise, when $\rho = 0$, there is no confounding in the observational sites, while $\rho \neq 0$ introduces confounding.

We implement a cross-fitted, doubly robust (DR) score test based on the double difference score in (7), adapted from Apfel et al. (2024), to test whether CATEs are homogeneous across sites $Z$. For each $Z$, we estimate conditional means and joint propensities by a penalized lasso regression using five-fold ($K = 5$) cross-fitting with default parameters as in the glmnet package in R. We then build DR residuals and combine them into a double difference score across individual sites $z$. To ensure overlap, we trim the estimated conditional probabilities below 0.05 and above 0.95. The overall test statistic is computed as the sample mean of the individual scores over the retained sample after trimming. In addition, the standard error is obtained from the score variance scaled by the sample size. Finally, a normal approximation is used to compute $p$-values.

The simulation scenarios vary in several dimensions. We consider three sample sizes, $n = 500$, $2000$, and $8000$. In our main specification, we set $K = 5$, $L = 2$, $p = 100$, use Lasso as the machine learner, and fix the trimming threshold at $\varepsilon = 0.05$.
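The data generating process described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' simulation code: the text does not report the values of $\beta$, $q$, or $\pi$, nor which site is observational in the mixed design, so the sparse `beta`, `q = pi = 0.5`, and the convention that units with $Z = 1$ follow the observational assignment rule are all assumptions.

```python
import numpy as np

def simulate(n, p=100, delta=0.0, rho=0.0, q=0.5, pi=0.5, mixed=False, seed=0):
    """Draw one sample from the (assumed parameterization of the) DGP.

    mixed=False: treatment is randomized everywhere (experimental sites only).
    mixed=True : units with Z = 1 get the observational assignment D_obs
                 (which site is observational is an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:5] = 0.5                       # assumed sparse coefficients
    x = rng.standard_normal((n, p))      # X ~ N(0, I_p)
    z = rng.binomial(1, pi, n)           # site indicator
    u = rng.standard_normal(n)           # outcome error
    d = rng.binomial(1, q, n)            # randomized treatment
    xb = x @ beta
    if mixed:
        v = rng.standard_normal(n)
        d_obs = (xb + rho * u + v > 0).astype(int)
        d = np.where(z == 1, d_obs, d)   # observational rule in site Z = 1
    y = d + d * xb + delta * d * z + xb + u
    return y, d, x, z

# Usage: with delta = 1 the average effect is 1 in site 0 and 2 in site 1.
y, d, x, z = simulate(20_000, delta=1.0, seed=1)
site_ate = {
    s: y[(d == 1) & (z == s)].mean() - y[(d == 0) & (z == s)].mean()
    for s in (0, 1)
}
print(site_ate)
```

Setting `delta=0.0` imposes the null of homogeneous effects across sites, and `mixed=True` with `rho=0.5` reproduces the kind of confounded observational assignment described in the text.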
We perform $R = 1000$ Monte Carlo replications. To assess size, we impose the null of cross-site homogeneity by setting $\delta = 0$. To assess power, we introduce cross-site heterogeneity by setting $\delta = 1$. In the mixed design, we further allow for unobserved confounding in observational sites by setting $\rho \in \{0, 0.5\}$, so that we study the performance of the test under both unconfounded and confounded observational assignment. On the one hand, when $\delta = 0$, where we impose homogeneity of CATEs across sites, the rejection rate of the test should approach the nominal significance level (5%) as $N$ increases, reflecting correct size of the test. On the other hand, when $\delta = 1$, where we impose heterogeneous CATEs across sites, the rejection rate should increase with $N$, demonstrating the power of the test.

We assess the performance of the proposed test using several summary measures. Across $R = 1000$ Monte Carlo replications, we report the average estimate ($\hat{\theta}$) based on (7), its standard deviation (std), and the average estimated standard error (mean se). We report empirical rejection rates at the 5% level, interpreted as the size when $\delta = 0$ and power when $\delta \neq 0$, as functions of $N$ and $\delta$. Finally, we report the effective sample size as a function of the trimming rate.

Table 1: Simulations: Size under $\delta = 0$ across experimental sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.041 | 0.063 | 0.063 | 10% | 492 |
| 2000 | 0.027 | 0.028 | 0.028 | 8% | 1994 |
| 8000 | 0.005 | 0.014 | 0.013 | 8% | 7994 |

Notes. '$N$' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
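The summary measures reported in the simulation tables can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the $p$-value is taken to be two-sided (the text only says that a normal approximation is used), and the demonstration uses synthetic standard normal draws as stand-ins for the per-observation scores rather than the actual score in (7).

```python
import math
import numpy as np

def score_test(scores):
    """Sample mean of the scores, its standard error, and a normal p-value."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    theta_hat = scores.mean()
    se = scores.std(ddof=1) / math.sqrt(n)      # score variance scaled by n
    t_stat = theta_hat / se
    # Two-sided p-value 2 * (1 - Phi(|t|)); two-sidedness is an assumption.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return theta_hat, se, p_value

def summarize(results, alpha=0.05):
    """Aggregate (theta_hat, se, p_value) triples over Monte Carlo replications."""
    theta, se, p = (np.asarray(col) for col in zip(*results))
    return {
        "theta_hat": theta.mean(),          # average estimate
        "std": theta.std(ddof=1),           # Monte Carlo standard deviation
        "mean_se": se.mean(),               # average estimated standard error
        "reject": (p < alpha).mean(),       # empirical rejection rate
    }

# Demonstration with placeholder scores that have mean zero (null holds).
rng = np.random.default_rng(0)
results = [score_test(rng.standard_normal(400)) for _ in range(500)]
summary = summarize(results)
print(summary)
```

Under a true null, the rejection rate should be close to the nominal level and 'std' should be close to 'mean se', which is the comparison made in the discussion of the tables.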
Table 1 reports the simulation results based on our main specification when $\delta = 0$ across experimental sites only, that is, when the null hypothesis of homogeneous CATEs across experimental sites is true. The average estimate $\hat{\theta}$ of the test decreases toward zero as the sample size increases, consistent with the null. The standard deviation and the average standard error decrease by roughly half when the sample size $N$ quadruples, indicating that the estimator is root-$N$ consistent. The empirical rejection rate is slightly above the nominal 5%. However, it moves towards the correct level as $N$ increases. Lastly, the effective sample size is close to the nominal sample size $N$ in all specifications, suggesting that trimming is limited and weights are stable. Overall, the results indicate that our test behaves correctly under homogeneous treatment effects across experimental sites and is well calibrated under the null.

Table 2: Simulations: Power under $\delta = 1$ across experimental sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.204 | 0.073 | 0.073 | 79% | 492 |
| 2000 | -0.231 | 0.030 | 0.032 | 100% | 1993 |
| 8000 | -0.243 | 0.015 | 0.015 | 100% | 7994 |

Notes. '$N$' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Table 2 reports the simulation results based on our main specification when $\delta = 1$, where CATEs vary across experimental sites and the null hypothesis of homogeneous CATEs is false. In this case, the test should reject with high probability. The average estimate $\hat{\theta}$ becomes increasingly negative as the sample size increases.
Both the standard deviation and the mean estimated standard error of the estimator decrease roughly by half when the sample size quadruples, again consistent with root-$N$ convergence. The empirical rejection rate rises sharply with sample size, increasing from 79% to 100% for larger samples. Finally, the effective sample size remains close to the nominal sample size for all $N$ in all specifications. The results show that our test has strong power to detect violations of effect homogeneity across sites.

We now consider the setting which combines experimental and observational sites in the absence of confounding, where $\rho = 0$. This mixed design mirrors many empirical applications in which treatment is randomized in some sites but not in others, with the latter relying on observational variation. This allows us to examine whether our method can distinguish violations of homogeneity from violations of unconfoundedness. When experimental sites indicate homogeneous CATEs, systematic differences between experimental and observational CATEs provide evidence against the validity of the observational identification strategy.

Table 3: Simulations: Size under $\delta = 0$ & $\rho = 0$ across experimental & observational sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.057 | 0.064 | 0.065 | 14% | 489 |
| 2000 | 0.020 | 0.027 | 0.028 | 9.1% | 1971 |
| 8000 | 0.006 | 0.013 | 0.014 | 7.2% | 7854 |

Notes. '$N$' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
Table 3 reports simulation results for the mixed experimental-observational design under the null hypothesis of homogeneous treatment effects across sites, $\delta = 0$, and no unobserved confounding in the observational sites, $\rho = 0$. Consequently, the identifying assumptions hold, and the test should reject at the nominal significance level. The mean estimate $\hat{\theta}$ moves toward zero as $N$ increases, consistent with the null. The standard deviation and the average estimated standard error decline at the expected rate. The empirical rejection rate is above the nominal 5% level in smaller samples, but it declines with sample size. Finally, the effective sample size remains close to the nominal sample size in all designs. Overall, the results indicate that our test behaves as expected when experimental and observational sites are combined and the observational identifying assumptions hold, although it exhibits some over-rejection in smaller samples.

We next consider a mixed design which combines experimental and observational sites, but now we allow for unobserved confounding in the observational sites, captured by $\rho = 0.5$. This value introduces strong confounding in the observational sites. In this setting, the observational identifying assumptions fail, so discrepancies between experimental and observational CATEs are driven by confounding rather than by true effect heterogeneity. As a result, the test may experience distortions in its size, reflecting sensitivity of the test to violations of the identifying assumptions in the observational sites.

Table 4: Simulations: Size under $\delta = 0$ & $\rho = 0.5$ across experimental & observational sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.174 | 0.070 | 0.068 | 72.6% | 489 |
| 2000 | 0.137 | 0.028 | 0.030 | 99.8% | 1974 |
| 8000 | 0.123 | 0.014 | 0.014 | 100% | 7870 |

Notes. '$N$' is the sample size per replication.
'$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Table 4 reports the results for the mixed experimental-observational design under homogeneous treatment effects across sites, where $\delta = 0$, and strong unobserved confounding in the observational sites, where $\rho = 0.5$. The mean estimate $\hat{\theta}$ decreases at a slower rate than under the scenario of no confounding as the sample size increases. However, both the Monte Carlo standard deviation and the mean estimated standard error shrink at the expected root-$N$ rate. The effective sample size also remains close to the nominal sample size. Moreover, the rejection rates increase substantially even for moderate sample sizes. The results suggest that, as $N$ grows, our test increasingly detects differences in CATEs between experimental and observational sites which are driven by unmeasured confounding rather than by true treatment effect heterogeneity across sites.

Table 5: Simulations: Power under $\delta = 1$ & $\rho = 0$ across experimental & observational sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.177 | 0.073 | 0.074 | 67.7% | 488 |
| 2000 | -0.225 | 0.031 | 0.032 | 100% | 1972 |
| 8000 | -0.242 | 0.015 | 0.015 | 100% | 7855 |

Notes. '$N$' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
Table 5 reports the results for the mixed experimental-observational design under heterogeneous treatment effects across sites, where $\delta = 1$, and no unobserved confounding in the observational sites, where $\rho = 0$. Across sample sizes, the mean test statistic $\hat{\theta}$ is negative and becomes slightly more negative as $N$ increases. The standard deviation and the mean estimated standard error decrease approximately at the root-$N$ rate. Consistent with this, the rejection rate rises as the sample size increases and reaches essentially one for $N \geq 2000$. The effective sample size remains close to the nominal $N$ in all cases. Overall, the simulation results indicate that our test has high power to detect cross-site heterogeneity when identification is valid in both experimental and observational sites, and its power increases rapidly with sample size.

Table 6: Simulations: Power under $\delta = 1$ & $\rho = 0.5$ across experimental & observational sites.

| $N$ | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.063 | 0.080 | 0.078 | 15.1% | 489 |
| 2000 | -0.108 | 0.033 | 0.034 | 89.8% | 1975 |
| 8000 | -0.125 | 0.016 | 0.016 | 100% | 7869 |

Notes. '$N$' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Finally, Table 6 reports the results for the mixed experimental-observational design under heterogeneous treatment effects across sites, where $\delta = 1$, and unobserved confounding in the observational sites, where $\rho = 0.5$. The results show that the test statistic becomes increasingly negative as the sample size increases, while the standard deviation and average standard error decrease at the root-$N$ rate.
Regarding power, the results show that the test has lower power in small samples, while the rejection rates increase sharply as the sample size grows. This indicates that strong confounding can weaken the detection of effect heterogeneity in finite samples, but the test regains power for larger sample sizes. Finally, the retained sample size is close to the original sample size. Importantly, because effect heterogeneity and confounding are present simultaneously in this scenario, the test detects differences in treatment effects between experimental and observational sites, but it cannot attribute that difference uniquely to true treatment effect heterogeneity versus violations of the identifying assumptions in the observational sites. Overall, the simulations show that our proposed test is well behaved when its identifying assumptions hold and becomes increasingly informative as the sample size grows.

6 Application

We illustrate our method using data from the International Stroke Trial (IST), a large multi-centre randomized controlled trial in acute ischaemic stroke conducted by the IST Collaborative Group (Sandercock, Niewada, Członkowska, & Group, 2011). The IST investigated whether early administration of aspirin, heparin, both, or neither affects clinical outcomes after stroke. Patients were eligible if they had a clinical diagnosis of acute ischaemic stroke within 48 hours of symptom onset and had no clear indication for, or contraindication to, either treatment. After a CT scan to support the diagnosis, clinicians contacted a central randomization service that recorded baseline characteristics and returned the assigned treatment. The dataset contains anonymized individual-level information on 19,435 patients treated in 467 hospitals across 36 countries. It includes baseline characteristics, clinical status at randomization, short-run outcomes measured at 14 days, and follow-up outcomes at six months.
The primary outcome of interest of the trial was death or dependency in daily living six months after randomization. Our empirical analysis focuses on the randomized assignment to aspirin. We define the treatment indicator $D$ as assignment to aspirin and the outcome $Y$ as an indicator equal to one if the patient is dead or dependent at six months, and zero otherwise. We restrict the sample to patients with non-missing information on treatment assignment and the six-month outcome. Additionally, for comparability, we focus on observations from the main trial and discard observations from the pilot trial.

Table 7: Sample construction.

| Step | $N$ |
|---|---|
| Full IST dataset | 19,435 |
| Main trial only (drop pilot) | 18,451 |
| Non-missing $D$ and $Y$ | 18,273 |
| Final analysis sample ($\geq 50$ per country) | 18,189 |

Table 7 summarizes the sample construction. Starting from the full IST dataset, we first drop observations from the pilot phase to focus on the main trial for comparability reasons. We then restrict the sample to patients with non-missing treatment assignment and six-month outcome. Finally, we impose a minimum sample size requirement at the site level and retain only countries with at least 50 observations. The final analysis sample contains $N = 18{,}189$ patients from 31 countries listed in Table 10.

As baseline covariates $X$, we use pre-treatment patient characteristics measured at randomization: age, sex, systolic blood pressure, indicators for baseline level of consciousness (fully alert, drowsy, unconscious), and whether a CT scan was performed before randomization. These covariates capture clinically relevant differences in baseline health between patients.

Table 8: Baseline covariate balance: Standardized Differences in Means.
| Variable | Mean Control (SD) | Mean Treated (SD) | SMD |
|---|---|---|---|
| Age | 71.87 (11.53) | 71.89 (11.61) | 0.002 |
| Systolic blood pressure | 160.45 (27.63) | 160.04 (27.83) | 0.015 |
| Female | 0.46 (0.50) | 0.47 (0.50) | 0.018 |
| CT before randomization | 0.68 (0.47) | 0.67 (0.47) | 0.016 |
| Fully alert | 0.77 (0.42) | 0.77 (0.42) | 0.003 |
| Drowsy | 0.22 (0.41) | 0.22 (0.41) | 0.003 |
| Unconscious | 0.01 (0.12) | 0.01 (0.12) | < 0.001 |
| $N$ | 9,101 | 9,088 | |

Notes. Entries report mean (standard deviation). SMD denotes the standardized mean difference between treated and control groups.

Table 8 summarizes baseline covariate balance between treated and control patients using standardized differences in means (SMDs). The treated and control groups are very similar across all characteristics, as all SMDs are below conventional thresholds for meaningful imbalance. In particular, age, systolic blood pressure, sex, pre-randomization CT use, and baseline consciousness status are nearly identical across treatment arms. These patterns support a successful treatment randomization in the IST study. In addition, treatment assignment is also balanced within each country, although sample sizes within countries vary substantially (see Figure 2 and Table 10).

In this context, we take countries as experimental sites, indexed by $Z$, and apply our test of effect homogeneity between these countries. This setting is well suited to our approach because a common randomized protocol was implemented across all participating countries. However, clinical environments and baseline risk may differ between sites, as can be seen in Figure 1.

Figure 1: Outcome rates by country (site): Pr(dead or dependent at 6 months) with 95% confidence intervals.

Figure 1 displays outcome rates by country. The outcome rates vary markedly across sites, ranging from 0.324 to 0.802, which likely reflects differences in baseline risk, case mix, and clinical practice across countries.
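Returning briefly to the balance check in Table 8, the standardized mean differences can be approximately reproduced from the reported means and standard deviations. The paper does not state the exact SMD formula, so the sketch below assumes the common definition that divides the absolute difference in means by the square root of the average of the two group variances; the function name `smd` is hypothetical.

```python
import math

def smd(mean_treated, sd_treated, mean_control, sd_control):
    """Absolute standardized mean difference with a pooled standard deviation.

    Assumes the pooled SD is sqrt((sd_treated^2 + sd_control^2) / 2); the
    source does not spell out which variant it uses.
    """
    pooled_sd = math.sqrt((sd_treated ** 2 + sd_control ** 2) / 2.0)
    return abs(mean_treated - mean_control) / pooled_sd

# Inputs are the rounded values reported in Table 8.
smd_age = smd(71.89, 11.61, 71.87, 11.53)
smd_sbp = smd(160.04, 27.83, 160.45, 27.63)
print(round(smd_age, 3), round(smd_sbp, 3))
```

For age and systolic blood pressure, this definition gives values that round to the SMDs shown in Table 8. For the binary covariates, the rounded table entries are too coarse to reproduce the reported SMDs exactly.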
This heterogeneity in outcome levels motivates testing whether the treatment effects are homogeneous across countries, specifically whether the conditional effect of aspirin on death or dependency is stable between countries.

We therefore apply our proposed test of CATE homogeneity across multiple experimental sites to the IST dataset. The estimand is the average treatment effect of randomized aspirin assignment on six-month death or dependency, adjusting for baseline covariates. In the spirit of our framework, failing to reject homogeneity indicates no systematic interactions between treatment and unobserved determinants of outcomes in the mean effect and supports generalizing the estimated effect across countries. In contrast, rejecting homogeneity suggests that the effect varies across sites in ways not captured by observed covariates, consistent with treatment-unobservable interactions and/or other forms of site-level heterogeneity.

Table 9: Test of effect homogeneity across experimental sites (countries)

| $\varepsilon$ | $\hat{\theta}$ | se | p-value | n_eff |
|---|---|---|---|---|
| 0.05 | 0.012 | 0.006 | 0.046 | 15805 |
| 0.10 | 0.002 | 0.004 | 0.706 | 9349 |

Notes. $\varepsilon$ is the trimming threshold for the propensity-score components used to construct weights. n_eff is the effective sample size after trimming.

Table 9 reports the results of our test of CATE homogeneity across countries, implemented with two trimming thresholds, $\varepsilon \in \{0.05, 0.10\}$. For the baseline threshold $\varepsilon = 0.05$, the test rejects homogeneity at the 5% level ($p = 0.046$), providing marginal evidence that the conditional effect of aspirin may vary across countries. Increasing trimming to $\varepsilon = 0.10$ yields a much smaller test statistic and a large $p$-value ($p = 0.706$), so we no longer reject homogeneity. The shift in inference is accompanied by a sharp decline in the effective sample size. The results show the impact that trimming, and therefore overlap, can have on the conclusions.
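To illustrate how the trimming threshold interacts with inverse-probability weights, consider the following small sketch. It is not the authors' code: the helper `trim_mask` and the five example probabilities are hypothetical, and the rule of retaining observations whose estimated probabilities lie in $[\varepsilon, 1 - \varepsilon]$ is an assumption based on the description of trimming in the simulation section.

```python
import numpy as np

def trim_mask(propensities, eps):
    """True for observations whose estimated probabilities all lie in [eps, 1 - eps]."""
    p = np.asarray(propensities, dtype=float)
    if p.ndim == 1:
        p = p[:, None]                      # treat a vector as a single column
    return np.all((p >= eps) & (p <= 1.0 - eps), axis=1)

# Hypothetical estimated site-specific probabilities for five observations.
p_hat = np.array([0.50, 0.20, 0.08, 0.03, 0.96])
weights = 1.0 / p_hat                       # inverse-probability weights

kept = {}
for eps in (0.05, 0.10):
    keep = trim_mask(p_hat, eps)
    kept[eps] = (int(keep.sum()), float(weights[keep].max()))
    print(f"eps={eps}: retained {kept[eps][0]} of {p_hat.size}, "
          f"largest retained weight {kept[eps][1]:.1f}")
```

In this toy example, moving from the looser to the stricter threshold drops one more observation and more than halves the largest retained weight, which is the mechanism behind the lower effective sample size and the more stable statistic under stricter trimming.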
Importantly, in this context, the sensitivity of the results to trimming is not due to limited overlap in treatment assignment, as aspirin is randomized within each country and the estimated propensity scores are tightly concentrated around 0.5. Rather, it reflects limited support for certain country-covariate combinations. In some countries, specific covariate profiles are rare and/or the country sample size is small, resulting in estimated site-specific propensity scores that can be very low. This poses a problem because the estimation weights are based on the inverse of the propensity scores, so small scores translate into very large weights. As a result, a small number of observations in sparsely supported regions of the country-specific covariate distribution can receive extreme weights and disproportionately influence the test statistic. Trimming prevents this by excluding such cases with extreme propensity scores. With stricter trimming that focuses on regions with improved common support, the test does not reject the null hypothesis of homogeneous treatment effects across countries.

7 Conclusion

In this work, we introduced a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational sites. The proposed test is built on a Neyman-orthogonal score that extends Apfel et al. (2024) to a double difference setting. Under specific regularity conditions (in particular, $o(n^{-1/4})$ convergence rates for the nuisance estimators) and with cross-fitting, the resulting estimator is $\sqrt{n}$-consistent and asymptotically normal. We also showed how the same logic carries over to settings with alternative identification strategies, such as instrumental variables and panel designs with parallel trends.
The simulation study indicated that the test is well behaved when the identifying assumptions hold and becomes increasingly informative as sample size grows. In randomized multi-site designs, the test has the correct size under homogeneity and high power against heterogeneity. In mixed designs that combine experimental and observational sites, the test rejects systematically when unobserved confounding is present in the observational sites, even when treatment effects are homogeneous, highlighting the usefulness of the test as a diagnostic for flagging potential confounding.

We then illustrated the approach using data from the International Stroke Trial, treating countries as sites and testing whether the conditional effect of randomized aspirin assignment on six-month death or dependency is homogeneous across countries. With more permissive trimming, we reject homogeneity, while with stricter trimming we do not. This sensitivity likely arises because some countries have very few observations for certain patient profiles. With a more permissive trimming rule, these rare profiles can receive very large weights and therefore have a disproportionate impact on the results. With stricter trimming, however, we exclude those poorly represented cases, and the analysis relies on patient profiles that are more common within each country.

Overall, the proposed framework provides a practical and flexible tool for assessing homogeneity of CATEs across experimental and observational data. The test helps researchers to diagnose confounding as well as to evaluate the internal and external validity of their estimates. This is increasingly valuable as researchers now often have access to both randomized trials and rich observational datasets, but lack principled ways to determine when estimates from these sources can be compared, combined, or extrapolated across settings.

References

Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113, 231–263.
Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies, 72, 1–19.
Angrist, J., & Fernández-Val, I. (2010). Extrapolate-ing: External validity and overidentification in the LATE framework. NBER Working Paper 16566.
Angrist, J., Imbens, G., & Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444–472 (with discussion).
Apfel, N., Hatamyar, J., Huber, M., & Kueck, J. (2024). Learning control variables and instruments for causal analysis in observational data. arXiv preprint arXiv:2407.04448.
Aronow, P. M., & Carnegie, A. (2013). Beyond LATE: Estimation of the average treatment effect with an instrumental variable. Political Analysis, 21, 492–506.
Athey, S., Chetty, R., & Imbens, G. (2025, May). The experimental selection correction estimator: Using experiments to remove biases in observational estimates (Working Paper No. 33817). National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w33817 doi: 10.3386/w33817
Brantner, C. L., Chang, T. H., Nguyen, T. Q., Hong, H., Di Stefano, L., & Stuart, E. A. (2023, November). Methods for integrating trials and non-experimental data to examine treatment effect heterogeneity. Statistical Science, 38(4), 640–654. (Epub 2023 Nov 6) doi: 10.1214/23-sts890
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in Medicine, 43(7), 1291–1314. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955 doi: 10.1002/sim.9955
Chen, H., Aebersold, H., Puhan, M. A., & Serra-Burriel, M. (2025). Causal machine learning methods for estimating personalised treatment effects–insights on validity from two large trials. arXiv preprint arXiv:2501.04061.
Cheng, D., & Cai, T. (2021). Adaptive combination of randomized and observational data. arXiv preprint arXiv:2111.15012.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21, C1–C68.
Cole, S. R., & Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. American Journal of Epidemiology, 172(1), 107–115.
Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., ... Yang, S. (2024). Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 39(1), 165–191.
Cox, D. (1958). Planning of experiments. New York: Wiley.
Epanomeritakis, A., & Viviano, D. (2025). Choosing what to learn: Experimental design when combining experimental with observational evidence. arXiv preprint arXiv:2510.23434.
Ghassami, A., Yang, A., Richardson, D., Shpitser, I., & Tchetgen, E. T. (2022). Combining experimental and observational data for identification and estimation of long-term causal effects. arXiv preprint arXiv:2201.10743.
Hahn, J. (1998, Mar.). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331.
Hatt, T., Berrevoets, J., Curth, A., Feuerriegel, S., & van der Schaar, M. (2022). Combining observational and randomized data for estimating heterogeneous treatment effects. arXiv preprint arXiv:2202.12891.
Hatt, T., Tschernutter, D., & Feuerriegel, S. (2022, 01–05 Aug). Generalizing off-policy learning under sample selection bias. In J. Cussens & K. Zhang (Eds.), Proceedings of the thirty-eighth conference on uncertainty in artificial intelligence (Vol. 180, pp. 769–779). PMLR. Retrieved from https://proceedings.mlr.press/v180/hatt22a.html
Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). Long-term causal inference under persistent confounding via data combination. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(2), 362–388.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. The Review of Economics and Statistics, 86, 4–29.
Imbens, G. W., & Angrist, J. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–475.
Kallus, N., Puli, A. M., & Shalit, U. (2018). Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems, 31.
Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends in Econometrics, 4, 165–224.
Lelova, K., Cooper, G. F., & Triantafillou, S. (2025). Testing identifiability and transportability with observational and experimental data. arXiv preprint arXiv:2505.12801.
Liu, M., & Xie, J. (2025). When is causal inference possible? A statistical test for unmeasured confounding. arXiv preprint arXiv:2508.20366.
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Statistical Science, Reprint, 5, 463–480.
Neyman, J. (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and statistics (pp. 416–444). Wiley.
Parikh, H., et al. (2025). A double machine learning approach for combining experimental and observational studies. Observational Studies, 11(3), 249–300. Retrieved from https://dx.doi.org/10.1353/obs.2025.a973068 doi: 10.1353/obs.2025.a973068
Park, Y., & Sasaki, Y. (2024). The informativeness of combined experimental and observational data under dynamic selection. arXiv preprint arXiv:2403.16177.
Pearl, J., & Bareinboim, E. (2011, Aug.). Transportability of causal and statistical relations: A formal approach. Proceedings of the AAAI Conference on Artificial Intelligence, 25(1), 247–254. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/7861 doi: 10.1609/aaai.v25i1.7861
Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122–129.
Robins, J. M., Rotnitzky, A., & Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.
Rosenman, E. T., Basse, G., Owen, A. B., & Baiocchi, M. (2023). Combining observational and experimental datasets using shrinkage estimators. Biometrics, 79(4), 2961–2973.
Rosenman, E. T. R., Owen, A. B., Baiocchi, M., & Banack, H. R. (2022, January). Propensity score methods for merging observational and experimental datasets. Statistics in Medicine, 41(1), 65–86. (Epub 2021 Oct 20) doi: 10.1002/sim.9223
Rubin, D. (1980). Comment on 'Randomization analysis of experimental data: The Fisher randomization test' by D. Basu. Journal of the American Statistical Association, 75, 591–593.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Sandercock, P. A., Niewada, M., Członkowska, A., & the International Stroke Trial Collaborative Group. (2011). The International Stroke Trial database. Trials, 12, 101. Retrieved from https://doi.org/10.1186/1745-6215-12-101 doi: 10.1186/1745-6215-12-101
Snow, J. (1855). On the mode of communication of cholera (J. Churchill, Ed.).
Stuart, E. A., Bradshaw, C. P., & Leaf, P. J. (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science, 16(3), 475–485. doi: 10.1007/s11121-014-0513-z
Triantafillou, S., Jabbari, F., & Cooper, G. F. (2023). Learning treatment effects from observational and experimental data. In International conference on artificial intelligence and statistics (pp. 7126–7146).
Van Goffrier, G., Maystre, L., & Gilligan-Lee, C. M. (2023). Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. In Conference on causal learning and reasoning (pp. 791–813).
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70, 331–341.
Wald, A. (1940). The fitting of straight lines if both variables are subject to error. Annals of Mathematical Statistics, 11, 284–300.
Wu, L., & Yang, S. (2022). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In Conference on causal learning and reasoning (pp. 904–926).
Yang, S., Gao, C., Zeng, D., & Wang, X. (2023). Elastic integrative analysis of randomised trial and real-world data for treatment heterogeneity estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3), 575–596.
Yang, S., Zeng, D., & Wang, X. (2020). Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.
Yang, X., Lin, L., Athey, S., Jordan, M. I., & Imbens, G. W. (2025). Cross-validated causal inference: a modern method to combine experimental and observational data. arXiv preprint arXiv:2511.00727.

Appendix

Proof: Moment condition and Neyman orthogonality for $\psi_{\text{CLATE}}$

Step 1: Definitions.
Let $m_{w,x,z} = E[Y \mid W = w, X = x, Z = z]$, $r_{w,x,z} = E[D \mid W = w, X = x, Z = z]$, $\bar{g}_{X,z} = m_{1,X,z} - m_{0,X,z}$, and $\bar{h}_{X,z} = r_{1,X,z} - r_{0,X,z}$. Define the augmentation (or debiasing) terms containing the residuals that are weighted by the inverse of the propensity score:

$$A^{(Y)}_{X,z} = \frac{(Y - m_{1,X,z})\,\mathbb{1}\{W = 1, Z = z\}}{\pi_{1,z}(X)} - \frac{(Y - m_{0,X,z})\,\mathbb{1}\{W = 0, Z = z\}}{\pi_{0,z}(X)},$$

$$A^{(Y)}_{X,z^-} = \frac{(Y - m_{1,X,z^-})\,\mathbb{1}\{W = 1, Z \neq z\}}{\pi_{1,z^-}(X)} - \frac{(Y - m_{0,X,z^-})\,\mathbb{1}\{W = 0, Z \neq z\}}{\pi_{0,z^-}(X)},$$

and $A^{(D)}_{X,z}$, $A^{(D)}_{X,z^-}$ analogously with $D$ instead of $Y$. Let $g_{X,z} = \bar{g}_{X,z} + A^{(Y)}_{X,z}$ and $h_{X,z} = \bar{h}_{X,z} + A^{(D)}_{X,z}$. The DR cross-products are

$$\Theta_{X,z} = g_{X,z}\, h_{X,z^-} - g_{X,z^-}\, h_{X,z},$$

which are composed of the plain regression cross-products

$$\bar{\Theta}_{X,z} = \bar{g}_{X,z}\, \bar{h}_{X,z^-} - \bar{g}_{X,z^-}\, \bar{h}_{X,z},$$

and the augmentation term cross-products

$$\mathcal{A}^{(Y)}_{X,z} = A^{(Y)}_{X,z}\, h_{X,z^-} - A^{(Y)}_{X,z^-}\, h_{X,z}, \qquad \mathcal{A}^{(D)}_{X,z} = A^{(D)}_{X,z}\, g_{X,z^-} - A^{(D)}_{X,z^-}\, g_{X,z}.$$

Adding $\theta$ to the expectation of score function (13) yields the map

$$M(\eta) = E\Big[\sum_{z=1}^{L} \Big\{ \bar{\Theta}^2_{X,z} + \Theta_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(Y)}_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(D)}_{X,z} \Big\}\Big]. \tag{A.1}$$

We note that quadratic augmentation terms like $(\mathcal{A}^{(Y)}_{X,z})^2$ or $(\mathcal{A}^{(D)}_{X,z})^2$ could be added in the expectation defining $M(\eta)$, which would recognize that the null hypothesis in (12) is based on $E\big[\sum_{z=1}^{L} \big(\Theta^2_{X,z} + \Theta_{X,z}\big)\big]$. However, such terms are of second order in the residuals. They therefore do not affect Neyman orthogonality and are not required for defining the moment condition underlying our test, so they are not included in $M(\eta)$ or in the following proof.

Step 2: Moment condition.

Expanding $\Theta_{X,z}$ yields

$$\begin{aligned}
\Theta_{X,z} &= (\bar{g}_{X,z} + A^{(Y)}_{X,z})(\bar{h}_{X,z^-} + A^{(D)}_{X,z^-}) - (\bar{g}_{X,z^-} + A^{(Y)}_{X,z^-})(\bar{h}_{X,z} + A^{(D)}_{X,z}) \\
&= \underbrace{\bar{g}_{X,z}\bar{h}_{X,z^-}}_{(a)} + \underbrace{\bar{g}_{X,z}A^{(D)}_{X,z^-}}_{(b)} + \underbrace{A^{(Y)}_{X,z}\bar{h}_{X,z^-}}_{(c)} + \underbrace{A^{(Y)}_{X,z}A^{(D)}_{X,z^-}}_{(d)} \\
&\quad - \underbrace{\bar{g}_{X,z^-}\bar{h}_{X,z}}_{(e)} - \underbrace{\bar{g}_{X,z^-}A^{(D)}_{X,z}}_{(f)} - \underbrace{A^{(Y)}_{X,z^-}\bar{h}_{X,z}}_{(g)} - \underbrace{A^{(Y)}_{X,z^-}A^{(D)}_{X,z}}_{(h)}.
\end{aligned} \tag{A.2}$$

Grouping, we see that

$$\Theta_{X,z} = \bar{\Theta}_{X,z} + \big[(b) + (c) - (f) - (g)\big] + \big[(d) - (h)\big].$$

We note that score function (13) is equal to

$$\psi_{\text{CLATE}}(O, \theta, \eta) = \sum_{z=1}^{L} \Big\{ \bar{\Theta}^2_{X,z} + \Theta_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(Y)}_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(D)}_{X,z} \Big\} - \theta. \tag{A.3}$$

Therefore, $M(\eta) = E[\psi_{\text{CLATE}}(O, \theta, \eta) + \theta]$. At the true nuisance functions $\eta_0$, the augmentation terms are conditionally mean zero given $X$: $E[A^{(Y)}_{X,z} \mid X] = E[A^{(D)}_{X,z} \mid X] = 0$. This implies that terms $(b)$, $(c)$, $(f)$, and $(g)$ in equation (A.2), which involve augmentation terms, have zero conditional expectation. Terms $(d)$ and $(h)$, which are products of augmentation terms, are conditionally mean zero as well. It follows that $E[\Theta_{X,z} \mid X] = E[\bar{\Theta}_{X,z} \mid X]$. Furthermore, the terms $2\,\bar{\Theta}_{X,z}\,\mathcal{A}^{(Y)}_{X,z}$ and $2\,\bar{\Theta}_{X,z}\,\mathcal{A}^{(D)}_{X,z}$ in equation (A.3) are conditionally mean zero, too. It follows by the law of iterated expectations that

$$E\Big[\sum_{z=1}^{L} \Big( \bar{\Theta}^2_{X,z} + \Theta_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(Y)}_{X,z} + 2\,\bar{\Theta}_{X,z}\, \mathcal{A}^{(D)}_{X,z} \Big)\Big] = E\Big[\sum_{z=1}^{L} \bar{\Theta}^2_{X,z} + \bar{\Theta}_{X,z}\Big]. \tag{A.4}$$

As the term $E\big[\sum_{z=1}^{L} \bar{\Theta}^2_{X,z} + \bar{\Theta}_{X,z}\big]$ corresponds to the definition of $\theta_0$ in equation (12), the moment condition

$$E[\psi_{\text{CLATE}}(O, \theta_0, \eta)] = 0 \tag{A.5}$$

holds, implying that the expectation of the score function is zero at the true values of the nuisance parameters $\eta$ and the target parameter $\theta_0$.

Step 3: Neyman orthogonality

Perturbations in outcome model $m$. Consider $m^{(t)}_{w,x,\cdot} = m_{w,x,\cdot} + t\,\delta m_{w,x,\cdot}$ and let $\partial_t$ denote differentiation w.r.t. $t$ at $t = 0$. The derivative of $M(\eta)$, as defined in equation (A.1), w.r.t. $t$ is

$$\partial_t M(\eta) = \sum_{z=1}^{L} E\Big[ \underbrace{2\,\bar{\Theta}_{X,z}\,\partial_t \bar{\Theta}_{X,z}}_{I} + \underbrace{\partial_t \Theta_{X,z}}_{II} + \underbrace{2(\partial_t \bar{\Theta}_{X,z})\,\mathcal{A}^{(Y)}_{X,z} + 2(\partial_t \bar{\Theta}_{X,z})\,\mathcal{A}^{(D)}_{X,z}}_{III} + \underbrace{2\,\bar{\Theta}_{X,z}\,(\partial_t \mathcal{A}^{(Y)}_{X,z})}_{IV} + \underbrace{2\,\bar{\Theta}_{X,z}\,(\partial_t \mathcal{A}^{(D)}_{X,z})}_{V} \Big]. \tag{A.6}$$

First, consider the derivatives of the augmentation terms:

$$\partial_t A^{(Y)}_{X,z} = -\frac{\delta m_{1,X,z}\,\mathbb{1}\{W = 1, Z = z\}}{\pi_{1,z}(X)} + \frac{\delta m_{0,X,z}\,\mathbb{1}\{W = 0, Z = z\}}{\pi_{0,z}(X)}, \tag{A.7}$$

$$\partial_t A^{(Y)}_{X,z^-} = -\frac{\delta m_{1,X,z^-}\,\mathbb{1}\{W = 1, Z \neq z\}}{\pi_{1,z^-}(X)} + \frac{\delta m_{0,X,z^-}\,\mathbb{1}\{W = 0, Z \neq z\}}{\pi_{0,z^-}(X)},$$

$$\partial_t \mathcal{A}^{(Y)}_{X,z} = \partial_t A^{(Y)}_{X,z}\, h_{X,z^-} - \partial_t A^{(Y)}_{X,z^-}\, h_{X,z}. \tag{A.8}$$

Taking conditional expectations given $X$ implies that $E[h_{X,z} \mid X] = \bar{h}_{X,z}$ by the augmentation term property $E[A^{(D)}_{X,z} \mid X] = 0$. When additionally considering the propensity score property $E[\mathbb{1}\{W = w, Z = z\}/\pi_{w,z}(X) \mid X] = 1$, we obtain, with $\delta\bar{g}_{X,z} = \delta m_{1,X,z} - \delta m_{0,X,z}$, the conditional average derivatives

$$E[\partial_t \mathcal{A}^{(Y)}_{X,z} \mid X] = -\delta\bar{g}_{X,z}\,\bar{h}_{X,z^-} + \delta\bar{g}_{X,z^-}\,\bar{h}_{X,z}. \tag{A.9}$$

Furthermore,

$$E[\partial_t \mathcal{A}^{(D)}_{X,z} \mid X] = 0. \tag{A.10}$$

Therefore, it follows from the law of iterated expectations and (A.10) that the expectation of term $V$ in equation (A.6) is zero. By the augmentation term property $E[A^{(Y)}_{X,z} \mid X] = E[A^{(D)}_{X,z} \mid X] = 0$, the expectation of term $III$ is zero, too.

Next, we expand $\Theta_{X,z} = g_{X,z} h_{X,z^-} - g_{X,z^-} h_{X,z}$ into the blocks

$$\Theta_{X,z} = \bar{g}_{X,z}\bar{h}_{X,z^-} + \bar{g}_{X,z}A^{(D)}_{X,z^-} + A^{(Y)}_{X,z}\bar{h}_{X,z^-} + A^{(Y)}_{X,z}A^{(D)}_{X,z^-} - \bar{g}_{X,z^-}\bar{h}_{X,z} - \bar{g}_{X,z^-}A^{(D)}_{X,z} - A^{(Y)}_{X,z^-}\bar{h}_{X,z} - A^{(Y)}_{X,z^-}A^{(D)}_{X,z},$$

and take derivatives w.r.t. $t$:

$$\begin{aligned}
\partial_t \Theta_{X,z} &= \underbrace{\delta\bar{g}_{X,z}\,\bar{h}_{X,z^-}}_{(a)} + \underbrace{\delta\bar{g}_{X,z}\,A^{(D)}_{X,z^-}}_{(b)} + \underbrace{\partial_t A^{(Y)}_{X,z}\,\bar{h}_{X,z^-}}_{(c)} + \underbrace{\partial_t A^{(Y)}_{X,z}\,A^{(D)}_{X,z^-}}_{(d)} \\
&\quad - \underbrace{\delta\bar{g}_{X,z^-}\,\bar{h}_{X,z}}_{(e)} - \underbrace{\delta\bar{g}_{X,z^-}\,A^{(D)}_{X,z}}_{(f)} - \underbrace{\partial_t A^{(Y)}_{X,z^-}\,\bar{h}_{X,z}}_{(g)} - \underbrace{\partial_t A^{(Y)}_{X,z^-}\,A^{(D)}_{X,z}}_{(h)}.
\end{aligned} \tag{A.11}$$

Taking conditional expectations sets the blocks containing an augmentation term to zero, namely $(b)$, $(d)$, $(f)$, and $(h)$. Furthermore, making use of (A.9), $(a)$ and $(c)$ cancel out, as do $(e)$ and $(g)$. It follows that the expectation of term $II$ in equation (A.6) is zero (when also applying the law of iterated expectations).

Finally, we note that

$$\partial_t \bar{\Theta}_{X,z} = \delta\bar{g}_{X,z}\,\bar{h}_{X,z^-} - \delta\bar{g}_{X,z^-}\,\bar{h}_{X,z}, \tag{A.12}$$

which enters term $I$ in equation (A.6) and corresponds to the negative of equation (A.9), which enters term $IV$. For this reason, terms $I$ and $IV$ cancel out. Therefore, we have that

$$\partial_t M(\eta) = 0. \tag{A.13}$$

Perturbations in the treatment model $r$. By symmetry (interchanging $Y \leftrightarrow D$), the same arguments demonstrate orthogonality w.r.t. perturbations in $r$.

Perturbations in the propensity score $\pi$. Differentiating with respect to $\pi$ affects only the augmentation terms. The derivative introduces terms of the form

$$-\frac{(Y - m_{w,X,\cdot})\,\mathbb{1}\{W = w, Z = \cdot\}\,\delta\pi_{w,\cdot}(X)}{\pi_{w,\cdot}(X)^2}, \tag{A.14}$$

whose conditional expectation given $X$ is zero because $E[Y - m_{w,X,\cdot} \mid X, W = w, Z = \cdot] = 0$. The same argument holds for the $D$-residuals. Hence any derivative with respect to $\pi$ vanishes in expectation.

Step 4: Conclusion.

The moment condition $E[\psi_{\text{CLATE}}(O, \theta_0, \eta)] = 0$ is satisfied at the true value of $\eta$, identifying $\theta_0$. Furthermore, $\partial_t E[\psi_{\text{CLATE}}(O, \theta_0, \eta + t\,\delta\eta)]\big|_{t=0} = 0$, such that $\psi_{\text{CLATE}}$ is Neyman-orthogonal at the true value of $\eta$.
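The mean-zero property of the augmentation terms used in Step 2 can be checked exactly in a toy model. The sketch below is not the score $\psi_{\text{CLATE}}$ itself; it only considers the single-site building block $g = \bar{g} + A^{(Y)}$ for one covariate cell with binary $W$ and $Y$. The constants `PI1` and `M`, the function name `expected_g`, and the misspecified values are hypothetical choices for illustration.

```python
from itertools import product

# Hypothetical toy model for one covariate cell and a single site.
PI1 = 0.3             # true P(W = 1 | X)
M = {1: 0.6, 0: 0.4}  # true E[Y | W = w, X] for binary Y


def expected_g(m_hat, pi1_hat):
    """Exact E[g | X] for g = (m1 - m0) + augmentation term, under candidate nuisances.

    The expectation is taken over the true distribution of (W, Y) by enumeration.
    """
    total = 0.0
    for w, y in product((0, 1), (0, 1)):
        p_w = PI1 if w == 1 else 1.0 - PI1
        p_y = M[w] if y == 1 else 1.0 - M[w]
        if w == 1:
            aug = (y - m_hat[1]) / pi1_hat
        else:
            aug = -(y - m_hat[0]) / (1.0 - pi1_hat)
        total += p_w * p_y * ((m_hat[1] - m_hat[0]) + aug)
    return total


truth = expected_g(M, PI1)                   # all nuisances correct
wrong_m = expected_g({1: 0.9, 0: 0.1}, PI1)  # wrong outcome model, true propensity
wrong_pi = expected_g(M, 0.5)                # true outcome model, wrong propensity
print(round(truth, 10), round(wrong_m, 10), round(wrong_pi, 10))
```

All three expectations equal the true contrast $m_1 - m_0 = 0.2$: at the true nuisances the augmentation term has mean zero, and an error in only one of the two nuisances leaves the expectation unchanged. This is the single-term analogue of the first-order insensitivity that the proof establishes for the full score.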
Tables

Table 10: Sample Size by Country and Treatment Status

Country    Control    Treated    Total
ARGE       266        253        519
AUSL       281        281        562
AUST       115        114        229
BELG       132        131        263
BRAS       38         40         78
CANA       59         58         117
CHIL       29         29         58
CZEC       214        217        431
EIRE       26         26         52
FINL       26         27         53
GREE       74         76         150
HONG       53         55         108
HUNG       52         52         104
INDI       102        104        206
ISRA       53         57         110
ITAL       1557       1554       3111
NETH       361        350        711
NEW        224        225        449
NORW       263        262        525
POLA       377        378        755
PORT       190        189        379
SING       71         68         139
SLOK       43         41         84
SLOV       26         27         53
SOUT       30         32         62
SPAI       232        231        463
SWED       313        317        630
SWIT       815        814        1629
TURK       138        140        278
UK         2882       2883       5765
USA        59         57         116
Total      9101       9088       18189

Figures

Figure 2: Sample size by country (site) and treated vs. control shares.

Note: Figure 2 shows country sample sizes and aspirin assignment shares. Country sizes are uneven, ranging from 52 to 5,765 patients (median 229) per country. Nonetheless, treatment assignment is well balanced within countries.
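The summary figures quoted in the note to Figure 2 can be recomputed from Table 10. The short Python sketch below transcribes the per-country counts and derives the number of countries, the range and median of country sizes, the overall total, and the largest deviation of the treated share from one half. The variable names are illustrative.

```python
from statistics import median

# (control, treated) counts per country, transcribed from Table 10.
counts = {
    "ARGE": (266, 253), "AUSL": (281, 281), "AUST": (115, 114),
    "BELG": (132, 131), "BRAS": (38, 40), "CANA": (59, 58),
    "CHIL": (29, 29), "CZEC": (214, 217), "EIRE": (26, 26),
    "FINL": (26, 27), "GREE": (74, 76), "HONG": (53, 55),
    "HUNG": (52, 52), "INDI": (102, 104), "ISRA": (53, 57),
    "ITAL": (1557, 1554), "NETH": (361, 350), "NEW": (224, 225),
    "NORW": (263, 262), "POLA": (377, 378), "PORT": (190, 189),
    "SING": (71, 68), "SLOK": (43, 41), "SLOV": (26, 27),
    "SOUT": (30, 32), "SPAI": (232, 231), "SWED": (313, 317),
    "SWIT": (815, 814), "TURK": (138, 140), "UK": (2882, 2883),
    "USA": (59, 57),
}

# Country sample sizes, sorted for the range and median.
sizes = sorted(ctrl + trt for ctrl, trt in counts.values())

# Largest absolute deviation of the within-country treated share from 0.5.
max_imbalance = max(abs(trt / (ctrl + trt) - 0.5) for ctrl, trt in counts.values())

print(len(sizes), min(sizes), max(sizes), median(sizes), sum(sizes))
# prints: 31 52 5765 229 18189
print(max_imbalance < 0.02)
# prints: True
```

The output reproduces the range of 52 to 5,765 patients, the median of 229, and the total of 18189, and it shows that the treated share deviates from one half by less than two percentage points in every country.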
