Testing Effect Homogeneity and Confounding in High-Dimensional Experimental and Observational Studies

We propose a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational studies. Our approach leverages multiple randomized trials to assess whether treatment effects vary wit…

Authors: Ana Armendariz, Martin Huber

Ana Armendariz, University of St. Gallen, School of Economics and Political Science
Martin Huber, University of Fribourg, Dept. of Economics

February 24, 2026

Abstract

We propose a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational studies. Our approach leverages multiple randomized trials to assess whether treatment effects vary with unobserved heterogeneity that differs across trials: if CATEs are homogeneous, this indicates the absence of interactions between treatment and unobservables in the mean effect. Comparing CATEs between experimental and observational data further allows evaluation of potential confounding: if the estimands coincide, there is no unobserved confounding; if they differ, deviations may arise from unobserved confounding, effect heterogeneity, or both. We extend the framework to settings with alternative identification strategies, namely instrumental variable settings and panel data with parallel trends assumptions based on differences in differences, where effects are identified only locally for subpopulations such as compliers or treated units. In these contexts, testing homogeneity is useful for assessing whether local effects can be extrapolated to the total population. We suggest a test based on double machine learning that accommodates high-dimensional covariates in a data-driven way and investigate its finite-sample performance through a simulation study. Finally, we apply the test to the International Stroke Trial (IST), a large multi-country randomized controlled trial in patients with acute ischaemic stroke that evaluated whether early treatment with aspirin altered subsequent clinical outcomes.
Our methodology provides a flexible tool for both validating identification assumptions and understanding the generalizability of estimated treatment effects.

Keywords: Treatment Effect Heterogeneity, Combining Data, Conditional Average Treatment Effects, Observational Data, Randomized Controlled Trials.

JEL Codes: C10, C12, C21

∗ We thank Federica Mascolo for helpful comments.

1 Introduction

Experiments are widely considered the benchmark for credible causal inference because randomization eliminates confounding and delivers unbiased estimates of treatment effects. In practice, however, experimental data often come with important limitations: sample sizes tend to be modest, covariate information is restricted, and external validity is frequently uncertain. By contrast, modern observational data sets are typically much larger and richer, offering detailed high-dimensional covariates and broad population coverage. These features make them attractive for studying treatment effect heterogeneity, but causal interpretation is hindered by the possibility of unobserved confounding. The increasing availability of both experimental and observational data raises an interesting question: to what extent do treatment effect estimates align across data sources once we condition on observed covariates, and what does this reveal about the internal and external validity of identification strategies in experiments and observational studies?

This paper proposes a framework for testing the homogeneity of conditional average treatment effects (CATEs), given covariates, across multiple experimental and observational studies (sites) such as different regions or countries. When applied solely to randomized experiments, where unobserved confounding is ruled out by design, our test allows assessing treatment effect heterogeneity arising from unobservables that differ across experiments.
If CATEs are homogeneous across experiments, this indicates an absence of interactions between the treatment and unobserved factors in the average treatment effect, implying that CATEs are externally valid across experimental settings. Moving beyond experiments, comparing CATEs between experimental and observational data additionally permits assessing the presence of confounding. If the CATE estimates are asymptotically equivalent across experiments and observational studies, then CATEs are unconfounded by unobservables (i.e., internally valid) and homogeneous across settings (i.e., externally valid). Conversely, if they differ, the discrepancy may stem from unobserved confounding, effect heterogeneity, or both. This may motivate a sequential application of the testing approach: first, across experiments to assess effect homogeneity, and, if homogeneity is not rejected, subsequently across experimental and observational studies to assess unconfoundedness in addition to external validity.

Our testing approach builds on the double machine learning (DML) framework (Chernozhukov et al., 2018), which combines doubly robust (DR) treatment effect estimation (Hahn, 1998; Robins & Rotnitzky, 1995; Robins, Rotnitzky, & Zhao, 1994) with machine learning to flexibly adjust for high-dimensional covariates. More specifically, we extend the Neyman (1959)-orthogonal score function introduced by Apfel, Hatamyar, Huber, and Kueck (2024), who propose a test for whether CATEs within a single study are jointly zero. In contrast, we use an analogous orthogonal formulation to test whether differences in CATEs across studies, experimental and/or observational, are jointly zero.
The resulting test is $\sqrt{n}$-consistent (where $n$ denotes the sample size) and asymptotically normal under specific regularity conditions, in particular when the machine learning estimators for the treatment and outcome models converge at rate $o(n^{-1/4})$.

As an additional methodological contribution, we extend the framework to settings where identification relies on alternative strategies. These include instrumental-variable designs that identify the local average treatment effect (LATE) for compliers whose treatment status responds to the instrument (Angrist, Imbens, & Rubin, 1996; G. W. Imbens & Angrist, 1994), as well as panel-data settings that rely on parallel trends to identify the average treatment effect on the treated (ATET) based on difference-in-differences (Snow, 1855). Because both the conditional LATE and the conditional ATET pertain to specific subpopulations, rather than the full population conditional on covariates, testing homogeneity is informative for evaluating whether these local effects can be extrapolated to the total population, in the spirit of Angrist and Fernández-Val (2010) and Aronow and Carnegie (2013).

We then investigate the finite-sample behavior of our testing approach in a simulation study that mimics settings with multiple experimental and/or observational sites. First, we study the finite-sample size and power of the test in a multi-site experimental design by comparing CATEs across randomized sites under data-generating processes with homogeneous versus heterogeneous CATEs. Second, we consider a mixed design with randomized treatment in some sites and observational identification in others; we introduce unobserved confounding in the observational sites and evaluate the ability of the test to detect cross-site CATE differences both in the absence and presence of confounding.
For each design, we run 1000 Monte Carlo replications and report summary statistics that describe the performance of our method, such as the rejection rate, standard deviation, and sample size.

Furthermore, we provide an empirical application of our testing approach using data from the International Stroke Trial, a large multi-country randomized controlled trial which evaluated whether early aspirin allocation to patients with acute ischaemic stroke improved patient health after they had experienced a stroke. We treat countries as separate experimental sites and test whether the conditional effect of randomized aspirin assignment on six-month death or dependency is homogeneous across countries after adjusting for baseline patient characteristics.

A growing literature combines experimental and observational data to improve causal inference (Colnet et al., 2024). Existing studies use such designs to (i) generalize randomized trial findings to broader populations (Athey, Chetty, & Imbens, 2025; Cole & Stuart, 2010; Ghassami, Yang, Richardson, Shpitser, & Tchetgen, 2022; Hatt, Tschernutter, & Feuerriegel, 2022; G. Imbens, Kallus, Mao, & Wang, 2025; Lelova, Cooper, & Triantafillou, 2025; Parikh et al., 2025; Park & Sasaki, 2024; Pearl & Bareinboim, 2011; Stuart, Bradshaw, & Leaf, 2015; Triantafillou, Jabbari, & Cooper, 2023; Van Goffrier, Maystre, & Gilligan-Lee, 2023), (ii) increase statistical efficiency, especially for heterogeneous treatment effects (Brantner et al., 2024; Cheng & Cai, 2021; Epanomeritakis & Viviano, 2025; Hatt, Berrevoets, Curth, Feuerriegel, & van der Schaar, 2022; E. T. Rosenman, Basse, Owen, & Baiocchi, 2023; E. T. R. Rosenman, Owen, Baiocchi, & Banack, 2022; Wu & Yang, 2022; S. Yang, Gao, Zeng, & Wang, 2023; S. Yang, Zeng, & Wang, 2020; X.
Yang, Lin, Athey, Jordan, & Imbens, 2025), and (iii) diagnose or correct bias in observational analyses by using experimental benchmarks (Chen, Aebersold, Puhan, & Serra-Burriel, 2025; Kallus, Puli, & Shalit, 2018; Lelova et al., 2025; Liu & Xie, 2025; Parikh et al., 2025; Triantafillou et al., 2023; Wu & Yang, 2022; S. Yang et al., 2023, 2020). We complement these efforts by introducing a framework that tests whether CATEs are homogeneous across multiple experimental and observational sites. By comparing CATEs across experiments, the test can detect heterogeneity across experimental sites driven by unobservables and evaluate the external validity of experimental CATEs. By comparing CATEs from experiments and observational sites, our test can additionally diagnose the presence of hidden confounding and assess the internal validity of treatment effects. Therefore, our approach provides a unified way to assess both internal and external validity of CATEs using distinct research designs.

The remainder of this paper is organized as follows. Section 2 provides a detailed literature survey on studies combining multiple experimental and observational data for treatment effect evaluation. Section 3 introduces the identifying assumptions and outlines our method. Section 4 extends the testing framework to instrumental and panel data contexts. Section 5 provides a simulation study that investigates the finite-sample performance of our proposed test. Section 6 illustrates the method using the International Stroke Trial. Section 7 concludes.

2 Literature survey

Our study contributes to the three strands of research mentioned above by combining experimental and observational evidence to address unobserved confounding and to evaluate the internal as well as the external validity of CATEs.
Within this body of work, the literature largely follows two approaches: (i) one explicitly models and estimates a confounding function by pooling information from RCTs and observational datasets; (ii) the other develops methods that combine or pool CATE estimators from experimental and observational samples using adaptive weights chosen to balance bias and efficiency.

The first approach focuses on the confounding function, defined as the conditional gap between causal and observational treatment effects, as a central object of interest. Early contributions such as Kallus et al. (2018) propose a method that first learns an observational CATE function and then estimates a low-dimensional correction term using experimental data, so that the observational estimate matches the randomized benchmark even with only partial covariate overlap. Building on this idea, S. Yang et al. (2020) introduce a data fusion framework in which both the CATE and the confounding function are identifiable once observational and experimental samples are coupled. They show that their method improves efficiency relative to using experimental samples alone. Subsequent work by Wu and Yang (2022) extends this framework by proposing an R-learner that incorporates flexible machine learning methods to approximate the CATE, the confounding function, and other nuisance components. Even more recently, S. Yang et al. (2023) introduce a test-based elastic approach for integrating trial and real-world data. Their method first uses the RCT as a benchmark to test whether the observational sample suffers from bias. If the test detects bias, only the RCT data are used; if the test supports comparability, the two sources are combined for efficiency.
Complementing these elastic integration ideas, Liu and Xie (2025) develop a direct hypothesis test for unconfoundedness by comparing treatment–outcome contrasts estimated from the RCT and the observational data. Unlike S. Yang et al. (2023), who couple a pretest with an adaptive estimator, Liu and Xie (2025) focus on diagnosis by flagging when the observational sample is likely confounded before applying fusion or machine learning estimators that assume ignorability. Parikh et al. (2025) push this diagnostic idea further by asking which assumption breaks when the two sources disagree: they develop a double machine learning framework that introduces a statistical quantity distinguishing failures of ignorability in the observational sample from failures of external validity of the experiment.

A complementary Bayesian approach incorporates uncertainty about when observational data are safe to use. Triantafillou et al. (2023) propose Bayesian CATE estimation that adaptively borrows from observational data, while Lelova et al. (2025) study identification and transportability of CATEs under an unknown causal graph when combining experimental and observational samples.

Beyond the econometric literature, related work in computer science by Hatt, Berrevoets, et al. (2022) proposes a representation learning framework which first learns the shared covariate structure from observational data and then uses experimental data to calibrate the estimation of treatment effects. They formalize the bias from unmeasured confounding as a confounding function, learn this bias by comparing observational and experimental predictions, and then use it to debias CATE estimates.

The second strand of research avoids modeling a confounding function and instead combines CATE estimators computed separately in experimental and observational samples.
Cheng and Cai (2021) combine kernel-based CATE estimates using data-driven weights that default to the experimental data when bias is suspected and combine both experimental and observational sources when estimates align. X. Yang et al. (2025) generalize this idea by choosing the weights given to each data source through cross-validation in a joint loss framework, trading off bias and variance across the two sources.

Related work by E. T. R. Rosenman et al. (2022) proposes to combine RCTs and observational data by stratifying on the observational propensity score and placing experimental units into the same strata. Within each stratum, they estimate treatment effects from experimental and observational sources and then merge them either by spiking experimental data into observational bins or through a data-driven weighting scheme that balances bias and variance. In subsequent work, E. T. Rosenman et al. (2023) extend this approach using Stein-type shrinkage estimators. Their method adaptively shrinks observational estimates toward unbiased experimental estimates. They show that these estimators reduce the mean squared error relative to performing only experimental analyses.

While this line of work focuses on optimally combining experimental and observational evidence to improve efficiency, it implicitly assumes that treatment effects are sufficiently stable across settings. In contrast, Chen et al. (2025) investigate whether causal machine learning methods can produce reliable CATE estimates using data from two large RCTs. They show that individualized treatment effects derived from a wide range of machine learning methods fail to replicate across training and test splits or across trials, even in the absence of confounding. This highlights the difficulty of obtaining externally valid CATE estimates and the need for systematic approaches to test the stability of CATEs across settings.
For a broader synthesis of the literature, Colnet et al. (2024) provide a systematic review of approaches that integrate experimental and observational data. Additionally, Brantner et al. (2023) provide a review of methods to combine multiple RCTs or RCTs with observational data, with a focus on treatment effect heterogeneity. They classify approaches by the type of data available and discuss both parametric and machine learning strategies for estimating CATEs. A key takeaway is that comparing CATEs across sources provides a way to assess stability and detect potential confounding.

Despite advances in combining experimental and observational data, most applications focus on a single RCT paired with one observational dataset. As a result, little is known about whether CATEs align across multiple experiments and observational sources. An exception is Brantner et al. (2024), who develop and study methods for estimating CATEs when several RCTs are available. They adapt S-learner, X-learner, and causal forest estimators to the multi-trial setting. They show that strategies allowing trial-level heterogeneity outperform naive pooling that ignores study differences. Their work highlights the challenges of integrating multiple experiments, but does not examine the alignment of CATEs between experimental and observational data. Our study fills this gap by focusing on testing the homogeneity of CATEs across both experimental and observational sources.

3 Assumptions and testing approach

$D$ denotes the binary treatment and $Y$ the outcome of interest. Using the potential outcomes framework as advocated in Neyman (1923) and D. B. Rubin (1974), we denote by $Y(d)$ the potential outcome when exogenously setting the treatment $D$ of a subject to value $d \in \{1, 0\}$. More generally, we will use capital letters for random variables and lower case letters for their realizations.
By representing the potential outcome $Y(d)$ as a function solely of a subject's own treatment status $D = d$, we implicitly assume that the potential outcomes of one subject are not influenced by the treatment status of others. This assumption is known as the stable unit treatment value assumption (SUTVA), see D. Rubin (1980) and Cox (1958), and is invoked throughout. Furthermore, let $X$ denote a set of observed pretreatment covariates, and let $Z$ be a discrete variable that indexes different setups or studies in which the experimental or observational data were collected (for example, sites or regions). The variable can take integer values $z \in \{1, ..., L\}$, with $L$ denoting the number of setups.

We suggest a method to test effect homogeneity in conditional average treatment effects (CATEs) across different experiments, or selection-on-observables (and effect homogeneity) across experimental and observational data, respectively. First, we consider the case of comparisons within experimental studies. Suppose that treatment is randomly assigned within each site or region $Z$, possibly conditional on covariates $X$ (as in stratified randomization). This corresponds to the standard selection-on-observables assumption, also known as unconfoundedness or conditional independence (G. W. Imbens, 2004).

Assumption 1 (Conditional independence of the treatment).
$$\{Y(1), Y(0)\} \perp\!\!\!\perp D \mid X, Z,$$
where $\perp\!\!\!\perp$ denotes statistical independence.

In addition, we require a condition ensuring that treated and untreated units are observed in all relevant subpopulations of $X$ and $Z$:

Assumption 2 (Common support).
$$0 < \Pr(D = d, Z = z \mid X) < 1, \quad \forall d \in \{1, 0\} \text{ and } z \in \{1, ..., L\}.$$

The common support assumption guarantees overlap in the treatment assignment across different experiments and covariate profiles.
It rules out situations where, conditional on covariates $X$, treatment assignment or assignment to a specific experiment is deterministic.

The conditional average treatment effect (CATE) given covariates $X$ and experiment $Z$ is defined as
$$\Delta_{x,z} = E[Y(1) - Y(0) \mid X = x, Z = z]. \tag{1}$$
Under (conditional) treatment randomization, which implies the satisfaction of Assumption 1, we have that $\Delta_{x,z}$ corresponds to
$$\delta_{x,z} = E[Y \mid D = 1, X = x, Z = z] - E[Y \mid D = 0, X = x, Z = z]. \tag{2}$$
We are interested in testing whether these conditional effects vary across experiments. Formally, we impose:

Assumption 3 (Conditional effect homogeneity).
$$E[Y(1) - Y(0) \mid X, Z] = E[Y(1) - Y(0) \mid X].$$

This assumption states that CATEs are homogeneous across experiments $Z$. Such homogeneity may hold for two distinct reasons. First, treatment effects may not interact with unobserved heterogeneity once we condition on $X$. An example where this is satisfied is the following structural model:
$$Y = \kappa(D, X) + \eta(U), \tag{3}$$
where $U$ denotes unobserved characteristics and $\kappa$ and $\eta$ are unknown functions. While the effect of $D$ may vary arbitrarily across $X$, it does not vary with $U$ conditional on $X$ due to the additive separability of $\kappa$ and $\eta$. In such cases, even if unobserved characteristics differ across experiments, they do not generate treatment effect heterogeneity. Second, treatment effects may indeed interact with unobserved heterogeneity, but the distribution of this heterogeneity remains stable across experiments. For instance, consider the structural model
$$Y = \kappa(D, X, U), \tag{4}$$
where the effect of $D$ on $Y$ may arbitrarily interact with $U$. This second case, in which effect heterogeneity exists but is not detectable because the distribution of $U$ is identical across experiments, appears less plausible when experimental sites differ substantially in institutional, geographic, or temporal contexts.
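As a worked step, the CATE can be written out under each of the two structural models. This is our own elaboration of (3) and (4), not a derivation taken from the original text: $F_{U \mid X=x, Z=z}$ is notation introduced here for the conditional distribution of $U$, and we assume that $U$ is not itself affected by $D$.

```latex
\begin{align*}
\text{Model (3):}\quad
\Delta_{x,z} &= E[\kappa(1,x) + \eta(U) - \kappa(0,x) - \eta(U) \mid X = x, Z = z]
              = \kappa(1,x) - \kappa(0,x), \\
\text{Model (4):}\quad
\Delta_{x,z} &= \int \{\kappa(1,x,u) - \kappa(0,x,u)\}\, dF_{U \mid X=x, Z=z}(u).
\end{align*}
```

In the first line the site index $z$ drops out for any distribution of $U$. In the second line, $\Delta_{x,z}$ is constant in $z$ if $F_{U \mid X=x, Z=z}$ does not vary with $z$, and may otherwise differ across sites.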
Such contextual differences typically shift the distribution of unobserved characteristics. The more variation there is in unobserved heterogeneity across experiments, the more informative the data become for detecting interactions between treatment effects and unobservables.

Given Assumption 1, Assumption 3 yields the following testable null hypothesis:
$$H_0: \underbrace{\mu_{1,x,z} - \mu_{0,x,z}}_{\delta_{x,z}} - (\underbrace{\mu_{1,x,z'} - \mu_{0,x,z'}}_{\delta_{x,z'}}) = 0, \quad \forall z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}, \tag{5}$$
where $\mathcal{X}$ denotes the support of $X$ and $\mu_{d,x,z} = E[Y \mid D = d, X = x, Z = z]$ denotes the conditional mean outcome.

To test the null hypothesis in (5), we adapt the doubly robust conditional independence test of Apfel et al. (2024) to the problem of effect homogeneity. That is, we extend their approach, which tests whether CATEs differ from zero, to instead test whether differences in CATEs across experiments are equal to zero. Note that (5) can equivalently be written as
$$H_0: \theta_0 = E\left[ \sum_{z=1}^{L} \left[ (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-})^2 + (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \right] \right] = 0, \tag{6}$$
where $z^-$ denotes values of $Z$ different from $z$, i.e., $Z \neq z$. Denote by $p_{d,z}(X) = \Pr(D = d, Z = z \mid X)$ the joint propensity of treatment and being in a specific experiment or site, with $p_{d,z^-}(X)$ and $\mu_{d,X,z^-}$ defined analogously for $Z \neq z$. We denote the nuisance parameters by
$$\eta = (p_{1,z}(X), p_{1,z^-}(X), \mu_{1,X,z}, \mu_{1,X,z^-}, p_{0,z}(X), p_{0,z^-}(X), \mu_{0,X,z}, \mu_{0,X,z^-}).$$
We modify the approach of Apfel et al. (2024), who consider simple differences in conditional means to test whether any CATE is different from zero, to double differences (or differences in differences), in order to test whether any difference in CATEs (across experiments) is different from zero.
More concisely, testing with a multivalued $Z$ can be based on the following score function, in which $O = (Y, D, X, Z)$ denotes the random variables:
$$
\begin{aligned}
\psi(O, \theta, \eta) ={}& \sum_{z=1}^{L} (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-})^2 \\
&+ \sum_{z=1}^{L} 2(\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \Bigg[ \frac{(Y - \mu_{1,X,z})\, 1(D = 1, Z = z)}{p_{1,z}(X)} - \frac{(Y - \mu_{0,X,z})\, 1(D = 0, Z = z)}{p_{0,z}(X)} \\
&\qquad\qquad - \frac{(Y - \mu_{1,X,z^-})\, 1(D = 1, Z \neq z)}{p_{1,z^-}(X)} + \frac{(Y - \mu_{0,X,z^-})\, 1(D = 0, Z \neq z)}{p_{0,z^-}(X)} \Bigg] \\
&+ \sum_{z=1}^{L} (\mu_{1,X,z} - \mu_{0,X,z} - \mu_{1,X,z^-} + \mu_{0,X,z^-}) \\
&+ \sum_{z=1}^{L} \Bigg[ \frac{(Y - \mu_{1,X,z})\, 1(D = 1, Z = z)}{p_{1,z}(X)} - \frac{(Y - \mu_{0,X,z})\, 1(D = 0, Z = z)}{p_{0,z}(X)} \\
&\qquad\qquad - \frac{(Y - \mu_{1,X,z^-})\, 1(D = 1, Z \neq z)}{p_{1,z^-}(X)} + \frac{(Y - \mu_{0,X,z^-})\, 1(D = 0, Z \neq z)}{p_{0,z^-}(X)} \Bigg] - \theta.
\end{aligned}
\tag{7}
$$
This score has a variance that is bounded away from zero, has zero expectation under the null hypothesis in equation (6), i.e., when $\theta_0 = 0$, and is Neyman-orthogonal; see the proof provided in the appendix of Apfel et al. (2024). This follows directly from the proofs in Apfel et al. (2024), as our score function is based on applying their type of score function twice to turn it into a double (rather than a single) difference across $\mu_{d,X,z}$. As the double difference is just a linear combination of the single differences, the asymptotic findings in Apfel et al. (2024) directly apply to our case, too. In particular, cross-fitted estimators of $\theta_0$ based on the score function (7) are asymptotically normal and $\sqrt{n}$-consistent under specific regularity conditions, in particular if the machine learners used for estimating the nuisance parameters $\eta$ have a convergence rate of $o(n^{-1/4})$.

The same testing framework can also be applied for comparing experimental and observational studies in a second step following the within-experiments comparison.
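A minimal sketch of how a cross-fitted test based on the score in (7) could be put together is given below. It is an illustration under stated assumptions, not the authors' implementation: the simulated design, the function names (`simulate`, `fit_nuisances`, `score`, `homogeneity_test`), the use of five folds, and the two-sided normal p-value are all choices made here. With a single binary covariate, simple cell means stand in for the machine learners; in a high-dimensional setting they would be replaced by flexible estimators. We read $\mu_{d,X,z^-}$ and $p_{d,z^-}(X)$ as pooled over all sites with $Z \neq z$.

```python
import math
import random
from collections import defaultdict

def simulate(n, gap, seed):
    """Three randomized sites of unequal size with a binary covariate (hypothetical design).
    gap shifts the treatment effect in site 3 only (gap = 0: homogeneous CATEs)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.choices([1, 2, 3], weights=[0.5, 0.3, 0.2])[0]
        x, d = rng.randint(0, 1), rng.randint(0, 1)
        effect = 1.0 + 0.5 * x + (gap if z == 3 else 0.0)
        data.append((0.3 * z + x + effect * d + rng.gauss(0.0, 1.0), d, x, z))
    return data

def fit_nuisances(train, sites):
    """Cell-mean stand-ins for the learners: mu and p for Z = z ("in") and Z != z ("out")."""
    n_x, cnt, tot = defaultdict(int), defaultdict(int), defaultdict(float)
    for y, d, x, z in train:
        n_x[x] += 1
        cnt[d, x, z] += 1
        tot[d, x, z] += y
    mu, p = {}, {}
    for d in (0, 1):
        for x in list(n_x):
            for z in sites:
                c_out = sum(cnt[d, x, k] for k in sites if k != z)
                s_out = sum(tot[d, x, k] for k in sites if k != z)
                mu[d, x, z, "in"] = tot[d, x, z] / cnt[d, x, z]
                mu[d, x, z, "out"] = s_out / c_out
                p[d, x, z, "in"] = cnt[d, x, z] / n_x[x]
                p[d, x, z, "out"] = c_out / n_x[x]
    return mu, p

def score(obs, mu, p, sites):
    """Score (7) for one observation, without the final '- theta'."""
    y, d, x, z_obs = obs
    psi = 0.0
    for z in sites:
        # a: double difference of conditional means; b: inverse-probability-weighted residual
        a = (mu[1, x, z, "in"] - mu[0, x, z, "in"]
             - mu[1, x, z, "out"] + mu[0, x, z, "out"])
        side = "in" if z_obs == z else "out"
        sign = (1 if d == 1 else -1) * (1 if side == "in" else -1)
        b = sign * (y - mu[d, x, z, side]) / p[d, x, z, side]
        psi += a * a + 2 * a * b + a + b
    return psi

def homogeneity_test(data, n_folds=5, seed=0):
    """Cross-fitted estimate of theta_0, its standard error, and a two-sided normal p-value."""
    sites = sorted({z for _, _, _, z in data})
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    psi = []
    for k in range(n_folds):
        held_out = set(idx[k::n_folds])
        train = [data[i] for i in idx if i not in held_out]
        mu, p = fit_nuisances(train, sites)      # nuisances from the other folds only
        psi.extend(score(data[i], mu, p, sites) for i in held_out)
    n = len(psi)
    theta = sum(psi) / n
    se = math.sqrt(sum((v - theta) ** 2 for v in psi) / (n - 1) / n)
    return theta, se, math.erfc(abs(theta / se) / math.sqrt(2))

theta_null, se_null, p_null = homogeneity_test(simulate(20000, gap=0.0, seed=1))
theta_alt, se_alt, p_alt = homogeneity_test(simulate(20000, gap=1.0, seed=2))
print(f"homogeneous:   theta = {theta_null:.3f}, se = {se_null:.3f}, p = {p_null:.3f}")
print(f"heterogeneous: theta = {theta_alt:.3f}, se = {se_alt:.3f}, p = {p_alt:.3g}")
```

In this toy design the estimate should be close to zero when `gap = 0` and clearly positive when the effect in site 3 is shifted. The sites are deliberately of unequal size, a choice made for this sketch so that the linear part of the score does not cancel across sites.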
In purely observational data, where Assumption 1 cannot be taken for granted, rejection of (5) or (6) may reflect violations of Assumption 1, Assumption 3, or both. However, if experimental data suggest that CATEs are homogeneous, i.e., Assumption 3 holds, then comparisons of CATEs between experimental and observational studies provide a direct test of Assumption 1. Specifically, one can define $Z$ such that $Z = 1$ indicates observations from experimental studies, while $Z = 2, \ldots, L$ index observational studies. Alternatively, $Z$ may distinguish between observational studies only. In both cases, testing can again be implemented using the score in (7).

4 Alternative identifying assumptions

Our method for testing effect homogeneity can also be adapted to settings where treatment is not conditionally exogenous. For instance, access to treatment may be randomly assigned conditional on $X$, while actual treatment take-up deviates from assignment due to noncompliance. In this case, we may use assignment, henceforth denoted by $W$, as an instrument for actual treatment $D$. Following G. W. Imbens and Angrist (1994) and Angrist et al. (1996), we denote by $D(w)$ the potential treatment as a function of the instrument $W$ and by $Y(w, d)$ the potential outcome as a function of $W$ and $D$. We impose the following instrumental variable (IV) assumptions conditional on covariates $X$ and experiment $Z$, see Abadie (2003):

Assumption 4 (IV assumptions).
$$
\begin{gathered}
\{D(w), Y(w', d)\} \perp\!\!\!\perp W \mid X, Z \text{ for } w, w', d \in \{0, 1\}, \quad \Pr(Y(1, d) = Y(0, d) = Y(d) \mid X, Z) = 1, \\
\Pr(D(1) \geq D(0) \mid X, Z) = 1, \quad E[D \mid W = 1, X, Z] - E[D \mid W = 0, X, Z] \neq 0, \\
0 < \Pr(W = 1 \mid X, Z) < 1.
\end{gathered}
$$

The first line of Assumption 4 requires that the instrument is as good as randomly assigned and satisfies the exclusion restriction conditional on $X$ and $Z$.
The second line rules out the existence of defiers, but it also requires the existence of compliers conditional on $X$, due to the nonzero conditional first stage. The third line imposes common support on the instrument, implying that assignment is not deterministic in $X$ and $Z$. Assumption 4 permits identifying the conditional local average treatment effect (CLATE) among the subgroup of compliers, denoted by $c$, who are treated only if the instrument is equal to one: $c: D(1) = 1, D(0) = 0$. The CLATE given covariates $X$ and experiment $Z$ is defined as
$$\Delta_{c,x,z} = E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X = x, Z = z]. \tag{8}$$
The CLATE is identified using a Wald-type estimand (Wald, 1940), defined as the ratio of the reduced-form effect of the instrument on the outcome to the first-stage effect of the instrument on the treatment, conditional on $X$ and $Z$:
$$\delta_{c,x,z} = \frac{E[Y \mid W = 1, X = x, Z = z] - E[Y \mid W = 0, X = x, Z = z]}{E[D \mid W = 1, X = x, Z = z] - E[D \mid W = 0, X = x, Z = z]} = \frac{\bar{g}_{x,z}}{\bar{h}_{x,z}}, \tag{9}$$
where we define the reduced-form and first-stage effects as $\bar{g}_{x,z} = m_{1,x,z} - m_{0,x,z}$ and $\bar{h}_{x,z} = r_{1,x,z} - r_{0,x,z}$, with $m_{w,x,z} = E[Y \mid W = w, X = x, Z = z]$ and $r_{w,x,z} = E[D \mid W = w, X = x, Z = z]$. In analogy to equation (5) for the CATE, the null hypothesis
$$H_0: \delta_{c,x,z} - \delta_{c,x,z'} = 0, \quad \forall z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}, \tag{10}$$
permits testing the following effect homogeneity assumption among compliers:

Assumption 5 (Conditional effect homogeneity among compliers).
$$E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X, Z] = E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0, X].$$

Since $\delta_{c,x,z} = \bar{g}_{x,z} / \bar{h}_{x,z}$, we note that the null hypothesis (10) can be equivalently written in cross-multiplied form as
$$\bar{\Theta}_{x,z} = \bar{g}_{x,z} \bar{h}_{x,z'} - \bar{g}_{x,z'} \bar{h}_{x,z} = 0, \quad \forall z, z' \in \{1, .., L\} \text{ and } x \in \mathcal{X}. \tag{11}$$
Hence, a ratio-free version of the null hypothesis for the CLATE that is analogous to equation (6) for the CATE is given by
$$H_0: \theta_0 = E\left[ \sum_{z=1}^{L} \left[ (\bar{g}_{X,z} \bar{h}_{X,z^-} - \bar{g}_{X,z^-} \bar{h}_{X,z})^2 + (\bar{g}_{X,z} \bar{h}_{X,z^-} - \bar{g}_{X,z^-} \bar{h}_{X,z}) \right] \right] = 0, \tag{12}$$
where $z^-$ denotes values of $Z$ different from $z$. We also define the propensity score $\pi_{w,z}(x) = \Pr(W = w, Z = z \mid X = x)$ and collect the nuisance parameters in
$$\eta = (m_{w,x,z}, r_{w,x,z}, \pi_{w,z}(x), \pi_{w,z^-}(x))_{w \in \{0,1\},\, z \in \{1, ..., L\}}.$$
Following a similar logic as in the CATE case, we construct a DR score function based on (12) for the CLATE setting. To this end, define the DR augmentations for the reduced-form and first-stage effects as
$$
\begin{aligned}
g_{X,z} &= m_{1,X,z} - m_{0,X,z} + \frac{(Y - m_{1,X,z})\, 1(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(Y - m_{0,X,z})\, 1(W = 0, Z = z)}{\pi_{0,z}(X)}, \\
h_{X,z} &= r_{1,X,z} - r_{0,X,z} + \frac{(D - r_{1,X,z})\, 1(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(D - r_{0,X,z})\, 1(W = 0, Z = z)}{\pi_{0,z}(X)}.
\end{aligned}
$$
Using these, we denote the cross-product difference by $\Theta_{X,z} = g_{X,z} h_{X,z^-} - g_{X,z^-} h_{X,z}$. Denoting by $O = (Y, D, X, Z, W)$ the random variables, a DR score function for testing the null hypothesis in (12) is
$$
\begin{aligned}
\psi^{\mathrm{CLATE}}(O, \theta, \eta) ={}& \sum_{z=1}^{L} \left[ \bar{\Theta}_{X,z}^2 + \Theta_{X,z} \right] \\
&+ \sum_{z=1}^{L} 2 \bar{\Theta}_{X,z} \Bigg[ h_{X,z^-} \left( \frac{(Y - m_{1,X,z})\, 1(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(Y - m_{0,X,z})\, 1(W = 0, Z = z)}{\pi_{0,z}(X)} \right) \\
&\qquad\qquad - h_{X,z} \left( \frac{(Y - m_{1,X,z^-})\, 1(W = 1, Z \neq z)}{\pi_{1,z^-}(X)} + \frac{(Y - m_{0,X,z^-})\, 1(W = 0, Z \neq z)}{\pi_{0,z^-}(X)} \right) \Bigg] \\
&+ \sum_{z=1}^{L} 2 \bar{\Theta}_{X,z} \Bigg[ g_{X,z^-} \left( \frac{(D - r_{1,X,z})\, 1(W = 1, Z = z)}{\pi_{1,z}(X)} - \frac{(D - r_{0,X,z})\, 1(W = 0, Z = z)}{\pi_{0,z}(X)} \right) \\
&\qquad\qquad - g_{X,z} \left( \frac{(D - r_{1,X,z^-})\, 1(W = 1, Z \neq z)}{\pi_{1,z^-}(X)} + \frac{(D - r_{0,X,z^-})\, 1(W = 0, Z \neq z)}{\pi_{0,z^-}(X)} \right) \Bigg] - \theta.
\end{aligned}
\tag{13}
$$
This score function has zero mean under the null hypothesis $H_0: \theta_0 = 0$ and is Neyman-orthogonal with respect to the nuisance parameters $\eta$, see the proof provided in Appendix 7.
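To illustrate the Wald-type estimand in (9) and the cross-multiplied contrast in (11), the following sketch computes simple plug-in (not doubly robust) versions of $\bar{g}$, $\bar{h}$, and $\bar{\Theta}$ for simulated sites within a single covariate cell. Everything about the design is an assumption made for illustration: the function names, the always-taker share of 0.2, the complier shares, and the constant treatment effects are hypothetical.

```python
import random

def simulate_site(n, complier_share, effect, seed):
    """One site with a randomized instrument W and imperfect compliance (hypothetical design)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.randint(0, 1)                    # randomized assignment (instrument)
        v = rng.random()                         # unobservable driving take-up
        d = int(v < 0.2 + complier_share * w)    # threshold crossing, so no defiers
        y = 1.0 + effect * d + rng.gauss(0.0, 1.0)
        rows.append((y, d, w))
    return rows

def arm_mean(rows, col, w):
    """Mean of column col (0: outcome Y, 1: treatment D) among units with W = w."""
    vals = [r[col] for r in rows if r[2] == w]
    return sum(vals) / len(vals)

def reduced_form_and_first_stage(rows):
    """Plug-in reduced-form effect (W on Y) and first-stage effect (W on D)."""
    g = arm_mean(rows, 0, 1) - arm_mean(rows, 0, 0)
    h = arm_mean(rows, 1, 1) - arm_mean(rows, 1, 0)
    return g, h

def cross_product_contrast(site_a, site_b):
    """Ratio-free contrast g_a * h_b - g_b * h_a, zero when both Wald ratios coincide."""
    g_a, h_a = reduced_form_and_first_stage(site_a)
    g_b, h_b = reduced_form_and_first_stage(site_b)
    return g_a * h_b - g_b * h_a

# Sites 1 and 2 share the same effect but differ in compliance; site 3 has a larger effect.
site_1 = simulate_site(100_000, complier_share=0.5, effect=2.0, seed=1)
site_2 = simulate_site(100_000, complier_share=0.3, effect=2.0, seed=2)
site_3 = simulate_site(100_000, complier_share=0.3, effect=3.0, seed=3)

g1, h1 = reduced_form_and_first_stage(site_1)
g2, h2 = reduced_form_and_first_stage(site_2)
wald_1, wald_2 = g1 / h1, g2 / h2
theta_same = cross_product_contrast(site_1, site_2)
theta_diff = cross_product_contrast(site_1, site_3)
print(f"Wald ratios: {wald_1:.2f}, {wald_2:.2f}")
print(f"contrast, equal effects: {theta_same:.3f}; unequal effects: {theta_diff:.3f}")
```

The cross-multiplied form avoids dividing by an estimated first stage, which is the practical appeal of working with the contrast rather than with the difference of ratios. The contrast stays near zero for sites 1 and 2 even though their first stages differ, and moves away from zero once the complier effect differs across sites.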
Consequently, cross-fitted estimators of $\theta_0$ based on (13) are $\sqrt{n}$-consistent and asymptotically normal under specific regularity conditions, in particular if machine learning estimators of the nuisance functions converge at rate $o(n^{-1/4})$.

We note that, although the construction of the CLATE score function is related to the approach of Apfel et al. (2024), there is a conceptual difference compared to the CATE framework considered in their paper and in our Section 3. While the DR function for the CATE is linear (but not quadratic) in the debiasing terms, in which outcome regression residuals are reweighted by the inverse of propensity scores (also known as augmented residuals), the CLATE score involves a cross-product between the DR estimators of the reduced-form and first-stage effects. Consequently, the CLATE moment condition contains both $\Theta_{X,z}$ and its square, reflecting the bilinear structure of the CLATE, which depends on the ratio of two conditional effects. This introduces second-order terms in the score but does not alter Neyman orthogonality, as the influence of the nuisance parameters still cancels out through the residual orthogonality conditions $E[Y - m_{w,X,z} \mid X, Z, W] = 0$ and $E[D - r_{w,X,z} \mid X, Z, W] = 0$. In this sense, the CLATE score extends the construction of scores based on squared differences in regression functions to a setting where both the numerator (reduced form) and denominator (first stage) of the parameter of interest must be debiased simultaneously.

As for Assumption 3 in the CATE case, it is worth noting that Assumption 5 may hold for two reasons: either CLATEs do not depend on unobservables, or the distribution of unobservables is stable across experiments.
To illustrate, consider the outcome model in equation (4) together with a threshold-crossing treatment model

$$D = I\{\lambda(Z, X) > \eta(V)\}, \tag{14}$$

where $I\{\cdot\}$ is the indicator function that is equal to one if its argument is satisfied and zero otherwise, $\lambda$ and $\eta$ are unknown functions, and $V$ are unobservables affecting the treatment. We note that the threshold-crossing model for treatment assignment in equation (14) both implies and is implied by treatment monotonicity, as shown in Vytlacil (2002). Regarding effect heterogeneity, the unobservable $V$ may be arbitrarily associated with $U$ under our IV assumptions, so that heterogeneity of treatment effects in $U$ generally also induces heterogeneity with respect to $V$ and hence across compliance types, defined by whether $I\{\lambda(1, X) > \eta(V)\} = 1$ and $I\{\lambda(0, X) > \eta(V)\} = 0$. Therefore, if it can be assumed that unobservables differ across $Z$, then satisfaction of Assumption 5 points to homogeneous effects. This, in turn, implies that treatment effects do not depend on compliance behavior (which is itself a function of unobservables) conditional on $X$, an assumption discussed in Angrist and Fernández-Val (2010) and Aronow and Carnegie (2013):

Assumption 6 (CLATE equals CATE).

$$E[Y(1) - Y(0) \mid D(1), D(0), X, Z] = E[Y(1) - Y(0) \mid X].$$

An important implication of this assumption is that it allows extrapolating the CLATE to the entire population, since under Assumption 6 the CLATE coincides with the CATE. In other words, the identified effect is no longer local to compliers, but represents the average conditional effect for the full population.

Further, alternative identifying assumptions can be considered when panel data (or also repeated cross sections) are available, in which outcomes are observed both before and after the introduction of treatment.
To this end, we introduce a time index $t \in \{0, 1\}$, where $t = 0$ refers to the pre-treatment period and $t = 1$ to the post-treatment period, and denote by $Y_t$ and $Y_t(d)$ the outcome and the potential outcome (given $D = d$) at time $t$, respectively. This setting permits effect identification based on the parallel trends assumption, which requires conditional independence in outcome trends rather than in outcome levels (as imposed by Assumption 1). A set of sufficient assumptions for identifying the conditional average treatment effect on the treated (CATET) in panel data based on the difference-in-differences (DiD) approach is the following; see, e.g., Abadie (2005); Lechner (2011):

Assumption 7 (DiD assumptions).

$$E[Y_1(0) - Y_0(0) \mid D = 1, X, Z] = E[Y_1(0) - Y_0(0) \mid D = 0, X, Z],$$

$$E[Y_0(1) - Y_0(0) \mid D = 1, X, Z] = 0,$$

$$\Pr(D = 1 \mid X, Z) < 1.$$

The first condition in Assumption 7 formalizes the conditional common trends assumption: given (presumably exogenous) covariates $X$ and experiment $Z$, no unobserved factors simultaneously affect both treatment assignment and the trend of mean potential outcomes under non-treatment. It is worth noting that in the context of DiD, multiple experiments $Z$ may for instance correspond to multiple treated regions observed within the same dataset. The second condition rules out average anticipation effects among the treated, conditional on $X$. It requires that treatment status $D$ does not causally influence pre-treatment outcomes in expectation of the treatment to come. The third line imposes a specific common support condition for identifying the CATET, requiring that for every covariate profile $X$ and experiment $Z$ observed among the treated, there also exist some untreated observations with the same $(X, Z)$.
When replacing $Y$ by the outcome difference $Y_1 - Y_0$ in the definitions of the conditional mean outcomes $\mu_{D,X,Z}$ introduced in Section 3, the null hypotheses (5) and (6), as well as the score function (7), can be redefined to evaluate effect heterogeneity across CATETs in different experiments. A natural question is whether these effects can be extrapolated to the total population, i.e., whether the CATET coincides with the CATE. In general, this is not the case because the parallel trends condition in Assumption 7 is only imposed for the untreated potential outcomes, such that identification is restricted to the treated group. In particular, treatment effects may differ with respect to time-invariant confounders that are allowed to differ between treated and untreated units. However, if experiments differ in such time-invariant confounders, and effect homogeneity across experiments is not rejected, this suggests that treatment effects do not depend on them. In this case, the CATET coincides with the CATE, as expressed in the following assumption:

Assumption 8 (CATET equals CATE).

$$E[Y(1) - Y(0) \mid D = 1, X, Z] = E[Y(1) - Y(0) \mid X].$$

The following structural model illustrates the role of time-invariant unobservables. Let

$$Y_t = \kappa_t(D, X) + \eta(D, U) + \varepsilon_t, \tag{15}$$

where $\kappa_t(D, X)$ is an unknown, time-varying function of covariates $X$ and treatment $D$, $\eta(D, U)$ is a time-invariant function of unobservables $U$ that may interact with treatment, and $\varepsilon_t$ is an idiosyncratic, time-varying error. For the potential outcome under non-treatment,

$$Y_t(0) = \kappa_t(0, X) + \eta(0, U) + \varepsilon_t, \tag{16}$$

it follows that differencing across time, $Y_1(0) - Y_0(0)$, eliminates $\eta(0, U)$ due to its additive separability. Hence, the parallel trends condition holds with respect to $Y_t(0)$ conditional on $X$, even if the distribution of $U$ differs between treatment groups.
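The differencing argument can be checked numerically. The sketch below simulates the untreated potential outcomes in equation (16) for a fixed covariate value, letting the distribution of $U$ differ between treatment groups. The functional form $\eta(0, U) = U^2$, the values of $\kappa_t(0, X)$, and the mean shift in $U$ are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

d = rng.integers(0, 2, size=n)            # treatment group indicator
u = rng.normal(loc=2.0 * d, scale=1.0)    # U is shifted among the treated (assumption)
eta0 = u ** 2                             # hypothetical time-invariant eta(0, U)
kappa_pre, kappa_post = 0.0, 1.0          # hypothetical kappa_t(0, X) at a fixed X

# Untreated potential outcomes in both periods, as in equation (16).
y0_pre = kappa_pre + eta0 + rng.normal(size=n)
y0_post = kappa_post + eta0 + rng.normal(size=n)

# Group difference in pre-period levels versus group difference in trends.
level_gap = y0_pre[d == 1].mean() - y0_pre[d == 0].mean()
trend = y0_post - y0_pre
trend_gap = trend[d == 1].mean() - trend[d == 0].mean()
print(f"level gap: {level_gap:.3f}, trend gap: {trend_gap:.3f}")
```

Under these assumptions the level gap is far from zero because $E[U^2]$ differs across groups, whereas the trend gap is close to zero because $\eta(0, U)$ cancels in the difference.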
However, treatment effects may still vary across groups, since arbitrary interactions between $D$ and $U$ are allowed in $\eta(D, U)$. Now consider the alternative model

$$Y_t = \kappa_t(D, X) + \eta(U) + \varepsilon_t, \tag{17}$$

which rules out such interactions and implies additive separability of $U$ and $D$. In this case, treatment effects are homogeneous in $U$, as in classical linear panel regression models. Therefore, if $U$ plausibly varies across experiments $Z$ but CATETs are found to be constant across $Z$ (and thus across distributions of $U$), this provides evidence in favor of Assumption 8, which justifies extrapolating effects identified for the treated (CATET) to the entire population (CATE).

5 Simulation Study

This section describes a simulation study to investigate the finite-sample behavior of our proposed test of homogeneity of CATEs across experimental and observational sites. We first consider comparisons across experiments and base our simulations on the following data generating process (DGP):

$$Y = D + D X'\beta + \delta D Z + X'\beta + U,$$

$$D \sim \text{Bernoulli}(q), \quad X \sim N(0, \Sigma), \quad Z \sim \text{Bernoulli}(\pi), \quad U \sim N(0, 1),$$

where the outcome $Y$ is a function of the treatment $D$, the covariates $X$, the indicator $Z$ of an experimental site, and the error term $U$. In the experimental design, $D$ is randomly assigned based on a Bernoulli distribution with probability $q$ and Assumption 1 (conditional independence) holds by construction. $X$ is a vector of covariates of dimension $p$, drawn from a multivariate normal distribution with zero mean and covariance matrix $\Sigma$. In this specification, $\Sigma$ equals the identity matrix, implying that all covariates are independent and have unit variance. $Z$ is an indicator of an experimental (or observational) site, generated independently from a Bernoulli distribution with probability $\pi$. The coefficients $\beta$ determine the impact of the covariates $X$ on $Y$.
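A minimal sketch of the experimental DGP above is given below. The text does not report the values of $q$, $\pi$, or $\beta$, so the defaults used here (`q=0.5`, `pi=0.5`, and a sparse `beta` with five nonzero entries of 0.5) are purely illustrative assumptions, as is the function name `simulate_experimental`.

```python
import numpy as np

def simulate_experimental(n, p=100, delta=0.0, q=0.5, pi=0.5, beta=None, seed=0):
    """Draw (Y, D, X, Z) from Y = D + D*X'b + delta*D*Z + X'b + U.

    q, pi and the default beta are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = np.zeros(p)
        beta[:5] = 0.5                   # hypothetical sparse coefficient vector
    x = rng.standard_normal((n, p))      # Sigma = identity
    d = rng.binomial(1, q, size=n)       # randomized treatment
    z = rng.binomial(1, pi, size=n)      # site indicator
    u = rng.standard_normal(n)           # error term
    xb = x @ beta
    y = d + d * xb + delta * d * z + xb + u
    return y, d, x, z

# With delta = 1, the raw treated-control gap should be about one unit larger in site Z = 1.
y, d, x, z = simulate_experimental(20_000, p=10, delta=1.0, seed=1)
gap = [y[(d == 1) & (z == s)].mean() - y[(d == 0) & (z == s)].mean() for s in (0, 1)]
site_difference = gap[1] - gap[0]
```

Because treatment is randomized and the covariates have mean zero, the simple difference in means within each site recovers $1 + \delta z$ up to sampling noise, so `site_difference` gives a quick sanity check on $\delta$.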
Finally, $U$ is a random and normally distributed error term. In the observational design, the element that changes is the treatment assignment, which is no longer randomized:

$$D_{obs} = I\{X'\beta + \rho U + V > 0\}, \quad V \sim N(0, 1).$$

Thus, we consider the case where treatment ($D_{obs}$) depends on covariates $X$ and the error terms $U$ and $V$, where $\rho$ determines the strength of confounding in the observational sites. At the same time, the parameter $\delta$ from the outcome equation governs the degree of effect heterogeneity across experimental and/or observational sites. Consequently, when $\delta = 0$, treatment effects are constant across sites, while $\delta \neq 0$ induces heterogeneity in CATEs across sites, indexed by $Z$. Likewise, when $\rho = 0$, there is no confounding in the observational sites, while $\rho \neq 0$ introduces confounding.

We implement a cross-fitted, doubly robust (DR) score test based on the double difference score in (7), adapted from Apfel et al. (2024), to test whether CATEs are homogeneous across sites $Z$. For each $Z$, we estimate conditional means and joint propensities by lasso regression using five-fold ($K = 5$) cross-fitting with default parameters as in the glmnet package in R. We then build DR residuals and combine them into a double difference score across individual sites $z$. To ensure overlap, we trim the estimated conditional probabilities below 0.05 and above 0.95. The overall test statistic is computed as the sample mean of the individual scores over the retained sample after trimming. In addition, the standard error is obtained from the score variance scaled by the sample size. Finally, a normal approximation is used to compute $p$-values.

The simulation scenarios vary in several dimensions. We consider three sample sizes, $n = 500$, $2000$ and $8000$. In our main specification, we set $K = 5$, $L = 2$, $p = 100$, use Lasso as the machine learner, and fix the trimming threshold at $\varepsilon = 0.05$.
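The last steps of the procedure (trimming, averaging the scores, and the normal approximation) can be sketched as follows. This is a simplified illustration that takes the individual score values and estimated probabilities as given. The function name `trimmed_score_test`, the use of a two-sided $p$-value, and the rule that an observation is dropped if any of its estimated probabilities falls outside $[\varepsilon, 1 - \varepsilon]$ are assumptions rather than details confirmed by the text.

```python
import math
import numpy as np

def trimmed_score_test(scores, propensities, eps=0.05):
    """Mean-of-scores test on the trimmed sample (illustrative sketch).

    scores:       array of individual score values, shape (n,)
    propensities: estimated probabilities, shape (n,) or (n, k)
    Returns (theta_hat, se, p_value, n_retained).
    """
    scores = np.asarray(scores, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    keep = (propensities >= eps) & (propensities <= 1.0 - eps)
    if keep.ndim == 2:
        keep = keep.all(axis=1)          # assumed rule: all probabilities must be in range
    kept = scores[keep]
    n_retained = kept.size
    theta_hat = kept.mean()              # test statistic: sample mean of scores
    se = kept.std(ddof=1) / math.sqrt(n_retained)
    z_stat = theta_hat / se
    # Two-sided p-value under the normal approximation (assumed to be two-sided).
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return theta_hat, se, p_value, n_retained

theta_hat, se, p_value, n_retained = trimmed_score_test(
    scores=[1.0, 2.0, 3.0, 4.0, 5.0],
    propensities=[0.5, 0.5, 0.5, 0.5, 0.01],
)
```

In this toy call the last observation is trimmed because its probability is below 0.05, so the statistic is the mean of the remaining four scores.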
We perform $R = 1000$ Monte Carlo replications. To assess size, we impose the null of cross-site homogeneity by setting $\delta = 0$. To assess power, we introduce cross-site heterogeneity by setting $\delta = 1$. In the mixed design, we further allow for unobserved confounding in observational sites by setting $\rho \in \{0, 0.5\}$, so that we study the performance of the test under both unconfounded and confounded observational assignment. On the one hand, when $\delta = 0$, where we impose homogeneity of CATEs across sites, the rejection rate of the test should approach the nominal significance level (5%) as $N$ increases, reflecting correct size of the test. On the other hand, when $\delta = 1$, where we impose heterogeneous CATEs across sites, the rejection rate should increase with $N$, demonstrating the power of the test.

We assess the performance of the proposed test using several summary measures. Across $R = 1000$ Monte Carlo replications, we report the average estimate ($\hat{\theta}$) based on (7), its standard deviation (std), and the average estimated standard error (mean se). We report empirical rejection rates at the 5% level, interpreted as the size when $\delta = 0$ and power when $\delta \neq 0$, as functions of $N$ and $\delta$. Finally, we report the effective sample size as a function of the trimming rate.

Table 1: Simulations: Size under $\delta = 0$ across experimental sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.041 | 0.063 | 0.063 | 10% | 492 |
| 2000 | 0.027 | 0.028 | 0.028 | 8% | 1994 |
| 8000 | 0.005 | 0.014 | 0.013 | 8% | 7994 |

Notes. 'N' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
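The 'reject 5%' column is the share of replications whose $p$-value falls below 0.05. As a hedged illustration, the sketch below computes this share together with its binomial Monte Carlo standard error, which is useful when judging how far an empirical rejection rate is from the nominal level. The helper name `rejection_rate` is hypothetical.

```python
import math
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Empirical rejection rate and its binomial Monte Carlo standard error."""
    p_values = np.asarray(p_values, dtype=float)
    rate = float(np.mean(p_values < alpha))
    mc_se = math.sqrt(rate * (1.0 - rate) / p_values.size)
    return rate, mc_se

# Toy input: two of four replications reject at the 5% level.
rate, mc_se = rejection_rate([0.01, 0.20, 0.03, 0.60])
```

For reference, with $R = 1000$ replications and a true rejection probability of 5%, the Monte Carlo standard error is $\sqrt{0.05 \times 0.95 / 1000} \approx 0.007$.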
Table 1 reports the simulation results based on our main specification when $\delta = 0$ across experimental sites only, that is, when the null hypothesis of homogeneous CATEs across experimental sites is true. The average estimate $\hat{\theta}$ of the test decreases toward zero as the sample size increases, consistent with the null. The standard deviation and the average standard error decrease by roughly half when the sample size $N$ quadruples, indicating that the estimator is root-$N$ consistent. The empirical rejection rate is slightly above the nominal 5%. However, it moves towards the correct level as $N$ increases. Lastly, the effective sample size is close to the nominal sample size $N$ in all specifications, suggesting that trimming is limited and weights are stable. Overall, the results indicate that our test behaves correctly under homogeneous treatment effects across experimental sites and is well calibrated under the null.

Table 2: Simulations: Power under $\delta = 1$ across experimental sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.204 | 0.073 | 0.073 | 79% | 492 |
| 2000 | -0.231 | 0.030 | 0.032 | 100% | 1993 |
| 8000 | -0.243 | 0.015 | 0.015 | 100% | 7994 |

Notes. 'N' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Table 2 reports the simulation results based on our main specification when $\delta = 1$, where CATEs vary across experimental sites and the null hypothesis of homogeneous CATEs is false. In this case, the test should reject with high probability. The average estimate $\hat{\theta}$ becomes increasingly negative as the sample size increases.
Both the standard deviation and the mean estimated standard error of the estimator decrease roughly by half when the sample size quadruples, again consistent with root-$N$ convergence. The empirical rejection rate rises sharply with sample size, increasing from 79% to 100% for larger samples. Finally, the effective sample size remains close to the nominal sample size for all $N$ in all specifications. The results show that our test has strong power to detect violations of effect homogeneity across sites.

We now consider the setting which combines experimental and observational sites in the absence of confounding, where $\rho = 0$. This mixed design mirrors many empirical applications in which treatment is randomized in some sites but not in others, with the latter relying on observational variation. This allows us to examine whether our method can distinguish violations of homogeneity from violations of unconfoundedness. When experimental sites indicate homogeneous CATEs, systematic differences between experimental and observational CATEs provide evidence against the validity of the observational identification strategy.

Table 3: Simulations: Size under $\delta = 0$ & $\rho = 0$ across experimental & observational sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.057 | 0.064 | 0.065 | 14% | 489 |
| 2000 | 0.020 | 0.027 | 0.028 | 9.1% | 1971 |
| 8000 | 0.006 | 0.013 | 0.014 | 7.2% | 7854 |

Notes. 'N' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
Table 3 reports simulation results for the mixed experimental-observational design under the null hypothesis of homogeneous treatment effects across sites, $\delta = 0$, and no unobserved confounding in the observational sites, $\rho = 0$. Consequently, the identifying assumptions hold, and the test should reject at the nominal significance level. The mean estimate $\hat{\theta}$ moves toward zero as $N$ increases, consistent with the null. The standard deviation and the average estimated standard error decline at the expected rate. The empirical rejection rate is above the nominal 5% level in smaller samples but declines with sample size. Finally, the effective sample size remains close to the nominal sample size in all designs. Overall, the results indicate that our test behaves as expected when experimental and observational sites are combined and the observational identifying assumptions hold, although it exhibits some over-rejection in smaller samples.

We next consider a mixed design which combines experimental and observational sites, but now we allow for unobserved confounding in the observational sites, captured by $\rho = 0.5$. This value introduces strong confounding in the observational sites. In this setting, the observational identifying assumptions fail, so discrepancies between experimental and observational CATEs are driven by confounding rather than by true effect heterogeneity. As a result, the test may experience distortions in its size, reflecting sensitivity of the test to violations of the identifying assumptions in the observational sites.

Table 4: Simulations: Size under $\delta = 0$ & $\rho = 0.5$ across experimental & observational sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | 0.174 | 0.070 | 0.068 | 72.6% | 489 |
| 2000 | 0.137 | 0.028 | 0.030 | 99.8% | 1974 |
| 8000 | 0.123 | 0.014 | 0.014 | 100% | 7870 |

Notes. 'N' is the sample size per replication.
'$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Table 4 reports the results for the mixed experimental-observational design under homogeneous treatment effects across sites, where $\delta = 0$, and strong unobserved confounding in the observational sites, where $\rho = 0.5$. The mean estimate $\hat{\theta}$ decreases at a slower rate than under the scenario of no confounding as the sample size increases. However, both the Monte Carlo standard deviation and the mean estimated standard error shrink at the expected root-$N$ rate. The effective sample size also remains close to the nominal sample size. Moreover, the rejection rates increase substantially even for moderate sample sizes. The results suggest that, as $N$ grows, our test increasingly detects differences in CATEs between experimental and observational sites which are driven by unmeasured confounding rather than by true treatment effect heterogeneity across sites.

Table 5: Simulations: Power under $\delta = 1$ & $\rho = 0$ across experimental & observational sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.177 | 0.073 | 0.074 | 67.7% | 488 |
| 2000 | -0.225 | 0.031 | 0.032 | 100% | 1972 |
| 8000 | -0.242 | 0.015 | 0.015 | 100% | 7855 |

Notes. 'N' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.
Table 5 reports the results for the mixed experimental-observational design under heterogeneous treatment effects across sites, where $\delta = 1$, and no unobserved confounding in the observational sites, where $\rho = 0$. Across sample sizes, the mean test statistic $\hat{\theta}$ is negative and becomes slightly more negative as $N$ increases. The standard deviation and the mean estimated standard error decrease approximately at the root-$N$ rate. Consistent with this, the rejection rate rises as the sample size increases and reaches essentially one for $N \geq 2000$. The effective sample size remains close to the nominal $N$ in all cases. Overall, the simulation results indicate that our test has high power to detect cross-site heterogeneity when identification is valid in both experimental and observational sites, and its power increases rapidly with sample size.

Table 6: Simulations: Power under $\delta = 1$ & $\rho = 0.5$ across experimental & observational sites.

| N | $\hat{\theta}$ | std | mean se | reject 5% | n_eff (mean) |
|---|---|---|---|---|---|
| 500 | -0.063 | 0.080 | 0.078 | 15.1% | 489 |
| 2000 | -0.108 | 0.033 | 0.034 | 89.8% | 1975 |
| 8000 | -0.125 | 0.016 | 0.016 | 100% | 7869 |

Notes. 'N' is the sample size per replication. '$\hat{\theta}$' is the average of the test statistic; 'std' is the standard deviation; 'mean se' is the average estimated standard error. 'reject 5%' is the fraction of replications with $p < 0.05$ (empirical size under $\delta = 0$, power under $\delta = 1$). 'n_eff (mean)' is the average effective sample size under normalized weights. Baseline: $K = 5$ folds, Lasso, $L = 2$, $p = 100$, $\varepsilon = 0.05$.

Finally, Table 6 reports the results for the mixed experimental-observational design under heterogeneous treatment effects across sites, where $\delta = 1$, and unobserved confounding in the observational sites, where $\rho = 0.5$. The results show that the test statistic becomes increasingly negative as the sample size increases, while the standard deviation and average standard error decrease at the root-$N$ rate.
Regarding power, the results show that the test has lower power in small samples, while the rejection rates increase sharply as the sample size grows. This indicates that strong confounding can weaken the detection of effect heterogeneity in finite samples, but the test regains power for larger sample sizes. Finally, the retained sample size is close to the original sample size. Importantly, because effect heterogeneity and confounding are present simultaneously in this scenario, the test detects differences in treatment effects between experimental and observational sites, but it cannot attribute that difference uniquely to true treatment effect heterogeneity versus violations of the identifying assumptions in the observational sites. Overall, the simulations show that our proposed test is well behaved when its identifying assumptions hold and becomes increasingly informative as the sample size grows.

6 Application

We illustrate our method using data from the International Stroke Trial (IST), a large multi-centre randomized controlled trial in acute ischaemic stroke conducted by the IST Collaborative Group (Sandercock, Niewada, Członkowska, & Group, 2011). The IST investigated whether early administration of aspirin, heparin, both, or neither affects clinical outcomes after stroke. Patients were eligible if they had a clinical diagnosis of acute ischaemic stroke within 48 hours of symptom onset and had no clear indication for, or contraindication to, either treatment. After a CT scan to support the diagnosis, clinicians contacted a central randomization service that recorded baseline characteristics and returned the assigned treatment. The dataset contains anonymized individual-level information on 19,435 patients treated in 467 hospitals across 36 countries.
It includes baseline characteristics, clinical status at randomization, short-run outcomes measured at 14 days, and follow-up outcomes at six months. The primary outcome of interest of the trial was death or dependency in daily living six months after randomization.

Our empirical analysis focuses on the randomized assignment to aspirin. We define the treatment indicator $D$ as assignment to aspirin and the outcome $Y$ as an indicator equal to one if the patient is dead or dependent at six months, and zero otherwise. We restrict the sample to patients with non-missing information on treatment assignment and the six-month outcome. Additionally, for comparability, we focus on observations from the main trial and discard observations from the pilot trial.

Table 7: Sample construction.

| Step | N |
|---|---|
| Full IST dataset | 19,435 |
| Main trial only (drop pilot) | 18,451 |
| Non-missing D and Y | 18,273 |
| Final analysis sample (≥ 50 per country) | 18,189 |

Table 7 summarizes the sample construction. Starting from the full IST dataset, we first drop observations from the pilot phase to focus on the main trial for comparability reasons. We then restrict the sample to patients with non-missing treatment assignment and six-month outcome. Finally, we impose a minimum sample size requirement at the site level and retain only countries with at least 50 observations. The final analysis sample contains $N = 18{,}189$ patients from 31 countries listed in Table 10.

As baseline covariates $X$, we use pre-treatment patient characteristics measured at randomization: age, sex, systolic blood pressure, indicators for baseline level of consciousness (fully alert, drowsy, unconscious), and whether a CT scan was performed before randomization. These covariates capture clinically relevant differences in baseline health between patients.

Table 8: Baseline covariate balance: Standardized Differences in Means.
| Variable | Mean Control (SD) | Mean Treated (SD) | SMD |
|---|---|---|---|
| Age | 71.87 (11.53) | 71.89 (11.61) | 0.002 |
| Systolic blood pressure | 160.45 (27.63) | 160.04 (27.83) | 0.015 |
| Female | 0.46 (0.50) | 0.47 (0.50) | 0.018 |
| CT before randomization | 0.68 (0.47) | 0.67 (0.47) | 0.016 |
| Fully alert | 0.77 (0.42) | 0.77 (0.42) | 0.003 |
| Drowsy | 0.22 (0.41) | 0.22 (0.41) | 0.003 |
| Unconscious | 0.01 (0.12) | 0.01 (0.12) | < 0.001 |
| N | 9,101 | 9,088 | |

Notes. Entries report mean (standard deviation). SMD denotes the standardized mean difference between treated and control groups.

Table 8 summarizes baseline covariate balance between treated and control patients using standardized differences in means (SMDs). The treated and control groups are very similar across all characteristics, as all SMDs are below conventional thresholds for meaningful imbalance. In particular, age, systolic blood pressure, sex, pre-randomization CT use, and baseline consciousness status are nearly identical across treatment arms. These patterns support a successful treatment randomization in the IST study. In addition, treatment assignment is also balanced within each country, although sample sizes within countries vary substantially (see Figure 2 and Table 10).

In this context, we take countries as experimental sites, indexed by $Z$, and apply our test of effect homogeneity between these countries. This setting is well suited to our approach because a common randomized protocol was implemented across all participating countries. However, clinical environments and baseline risk may differ between sites, as can be seen in Figure 1.

Figure 1: Outcome rates by country (site): Pr(dead or dependent at 6 months) with 95% confidence intervals.

Figure 1 displays outcome rates by country. The outcome rates vary markedly across sites, ranging from 0.324 to 0.802, which likely reflects differences in baseline risk, case mix, and clinical practice across countries.
This heterogeneity in outcome levels motivates testing whether the treatment effects are homogeneous across countries, specifically whether the conditional effect of aspirin on death or dependency is stable between countries.

We therefore apply our proposed test of CATE homogeneity across multiple experimental sites to the IST dataset. The estimand is the average treatment effect of randomized aspirin assignment on six-month death or dependency, adjusting for baseline covariates. In the spirit of our framework, failing to reject homogeneity indicates no systematic interactions between treatment and unobserved determinants of outcomes in the mean effect and supports generalizing the estimated effect across countries. In contrast, rejecting homogeneity suggests that the effect varies across sites in ways not captured by observed covariates, consistent with treatment-unobservable interactions and/or other forms of site-level heterogeneity.

Table 9: Test of effect homogeneity across experimental sites (countries).

| $\varepsilon$ | $\hat{\theta}$ | se | p-value | $n_{eff}$ |
|---|---|---|---|---|
| 0.05 | 0.012 | 0.006 | 0.046 | 15805 |
| 0.10 | 0.002 | 0.004 | 0.706 | 9349 |

Notes. $\varepsilon$ is the trimming threshold for the propensity-score components used to construct weights. $n_{eff}$ is the effective sample size after trimming.

Table 9 reports the results of our test of CATE homogeneity across countries, implemented with two trimming thresholds, $\varepsilon \in \{0.05, 0.10\}$. For the baseline threshold $\varepsilon = 0.05$, the test rejects homogeneity at the 5% level ($p = 0.046$), providing marginal evidence that the conditional effect of aspirin may vary across countries. Increasing trimming to $\varepsilon = 0.10$ yields a much smaller test statistic and a large $p$-value ($p = 0.706$), so we no longer reject homogeneity. The shift in inference is accompanied by a sharp decline in the effective sample size, from 15,805 to 9,349. The results show the impact that trimming, and therefore overlap, can have on the conclusions.
Importantly, in this context, the sensitivity of the results to trimming is not due to limited overlap in treatment assignment, as aspirin is randomized within each country and the estimated propensity scores are tightly concentrated around 0.5. Rather, it reflects limited support for certain country-covariate combinations. In some countries, specific covariate profiles are rare and/or the country sample size is small, resulting in estimated site-specific propensity scores that can be very low. This poses a problem because the estimation weights are based on the inverse of the propensity scores, so small scores translate into very large weights. As a result, a small number of observations in sparsely supported regions of the country-specific covariate distribution can receive extreme weights and disproportionately influence the test statistic. Trimming prevents this by excluding such cases with extreme propensity scores. With stricter trimming that focuses on regions with improved common support, the test does not reject the null hypothesis of homogeneous treatment effects across countries.

7 Conclusion

In this work, we introduced a framework for testing the homogeneity of conditional average treatment effects (CATEs) across multiple experimental and observational sites. The proposed test is built on a Neyman-orthogonal score that extends Apfel et al. (2024) to a double difference setting. Under specific regularity conditions (in particular, $o(n^{-1/4})$ convergence rates for the nuisance estimators) and with cross-fitting, the resulting estimator is $\sqrt{n}$-consistent and asymptotically normal. We also showed how the same logic carries over to settings with alternative identification strategies, such as instrumental variables and panel designs with parallel trends.
The simulation study indicated that the test is well behaved when the identifying assumptions hold and becomes increasingly informative as sample size grows. In randomized multi-site designs, the test has the correct size under homogeneity and has high power against heterogeneity. In mixed designs that combine experimental and observational sites, the test rejects systematically when unobserved confounding is present in the observational sites, even when treatment effects are homogeneous, highlighting the usefulness of the test as a diagnostic for flagging potential confounding.

We then illustrated the approach using data from the International Stroke Trial, treating countries as sites and testing whether the conditional effect of randomized aspirin assignment on six-month death or dependency is homogeneous across countries. With more permissive trimming, we reject homogeneity, while with stricter trimming we do not reject. This sensitivity likely arises because some countries have very few observations for certain patient profiles. With a more permissive trimming rule, these rare profiles can receive very large weights and therefore have a disproportionate impact on the results. However, with stricter trimming, we exclude those poorly represented cases, and the analysis relies on patient profiles that are more common within each country.

Overall, the proposed framework provides a practical and flexible tool for assessing homogeneity of CATEs across experimental and observational data. The test helps researchers to diagnose confounding as well as to evaluate the internal and external validity of their estimates. This is increasingly valuable as researchers now often have access to both randomized trials and rich observational datasets, but lack principled ways to determine when estimates from these sources can be compared, combined, or extrapolated across settings.
References

Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113, 231-263.
Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies, 72, 1-19.
Angrist, J., & Fernández-Val, I. (2010). Extrapolate-ing: External validity and overidentification in the LATE framework. NBER working paper 16566.
Angrist, J., Imbens, G., & Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of American Statistical Association, 91, 444-472 (with discussion).
Apfel, N., Hatamyar, J., Huber, M., & Kueck, J. (2024). Learning control variables and instruments for causal analysis in observational data. arXiv preprint arXiv:2407.04448.
Aronow, P. M., & Carnegie, A. (2013). Beyond LATE: Estimation of the average treatment effect with an instrumental variable. Political Analysis, 21, 492-506.
Athey, S., Chetty, R., & Imbens, G. (2025, May). The experimental selection correction estimator: Using experiments to remove biases in observational estimates (Working Paper No. 33817). National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w33817 doi: 10.3386/w33817
Brantner, C. L., Chang, T. H., Nguyen, T. Q., Hong, H., Di Stefano, L., & Stuart, E. A. (2023, November). Methods for integrating trials and non-experimental data to examine treatment effect heterogeneity. Statistical Science, 38(4), 640–654. (Epub 2023 Nov 6) doi: 10.1214/23-sts890
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in Medicine, 43(7), 1291-1314. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955 doi: 10.1002/sim.9955
Chen, H., Aebersold, H., Puhan, M. A., & Serra-Burriel, M.
(2025). Causal machine learning methods for estimating personalised treatment effects–insights on validity from two large trials. arXiv preprint arXiv:2501.04061.
Cheng, D., & Cai, T. (2021). Adaptive combination of randomized and observational data. arXiv preprint arXiv:2111.15012.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21, C1-C68.
Cole, S. R., & Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. American Journal of Epidemiology, 172(1), 107–115.
Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., ... Yang, S. (2024). Causal inference methods for combining randomized trials and observational studies: a review. Statistical Science, 39(1), 165–191.
Cox, D. (1958). Planning of experiments. New York: Wiley.
Epanomeritakis, A., & Viviano, D. (2025). Choosing what to learn: Experimental design when combining experimental with observational evidence. arXiv preprint arXiv:2510.23434.
Ghassami, A., Yang, A., Richardson, D., Shpitser, I., & Tchetgen, E. T. (2022). Combining experimental and observational data for identification and estimation of long-term causal effects. arXiv preprint arXiv:2201.10743.
Hahn, J. (1998, Mar.). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315-331.
Hatt, T., Berrevoets, J., Curth, A., Feuerriegel, S., & van der Schaar, M. (2022). Combining observational and randomized data for estimating heterogeneous treatment effects. arXiv preprint arXiv:2202.12891.
Hatt, T., Tschernutter, D., & Feuerriegel, S. (2022, 01–05 Aug). Generalizing off-policy learning under sample selection bias. In J. Cussens & K.
Zhang (Eds.), Proceedings of the thirty-eighth conference on uncertainty in artificial intelligence (Vol. 180, pp. 769–779). PMLR. Retrieved from https://proceedings.mlr.press/v180/hatt22a.html
Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). Long-term causal inference under persistent confounding via data combination. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(2), 362–388.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. The Review of Economics and Statistics, 86, 4-29.
Imbens, G. W., & Angrist, J. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467-475.
Kallus, N., Puli, A. M., & Shalit, U. (2018). Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems, 31.
Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends in Econometrics, 4, 165-224.
Lelova, K., Cooper, G. F., & Triantafillou, S. (2025). Testing identifiability and transportability with observational and experimental data. arXiv preprint arXiv:2505.12801.
Liu, M., & Xie, J. (2025). When is causal inference possible? A statistical test for unmeasured confounding. arXiv preprint arXiv:2508.20366.
Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Statistical Science, Reprint, 5, 463-480.
Neyman, J. (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and statistics (p. 416-444). Wiley.
Parikh, H., et al. (2025). A double machine learning approach for combining experimental and observational studies. Observational Studies, 11(3), 249–300. Retrieved from https://dx.doi.org/10.1353/obs.2025.a973068 doi: 10.1353/obs.2025.a973068
Park, Y., & Sasaki, Y. (2024).
The informativeness of combined experimental and observational data under dynamic selection. arXiv preprint arXiv:2403.16177.
Pearl, J., & Bareinboim, E. (2011, Aug.). Transportability of causal and statistical relations: A formal approach. Proceedings of the AAAI Conference on Artificial Intelligence, 25(1), 247-254. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/7861 doi: 10.1609/aaai.v25i1.7861
Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122-129.
Robins, J. M., Rotnitzky, A., & Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 90, 846-866.
Rosenman, E. T., Basse, G., Owen, A. B., & Baiocchi, M. (2023). Combining observational and experimental datasets using shrinkage estimators. Biometrics, 79(4), 2961–2973.
Rosenman, E. T. R., Owen, A. B., Baiocchi, M., & Banack, H. R. (2022, January). Propensity score methods for merging observational and experimental datasets. Statistics in Medicine, 41(1), 65–86. (Epub 2021 Oct 20) doi: 10.1002/sim.9223
Rubin, D. (1980). Comment on 'Randomization analysis of experimental data: The Fisher randomization test' by D. Basu. Journal of American Statistical Association, 75, 591-593.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688-701.
Sandercock, P. A., Niewada, M., Członkowska, A., & Group, I. S. T. C. (2011). The international stroke trial database. Trials, 12, 101. Retrieved from https://doi.org/10.1186/1745-6215-12-101 doi: 10.1186/1745-6215-12-101
Snow, J. (1855). On the mode of communication of cholera (J. Churchill, Ed.).
Stuart, E. A., Bradshaw, C. P., & Leaf, P. J. (2015).
Assessing the generalizability of randomized trial results to target populations. Prevention Science, 16(3), 475–485. doi: 10.1007/s11121-014-0513-z
Triantafillou, S., Jabbari, F., & Cooper, G. F. (2023). Learning treatment effects from observational and experimental data. In International conference on artificial intelligence and statistics (pp. 7126–7146).
Van Goffrier, G., Maystre, L., & Gilligan-Lee, C. M. (2023). Estimating long-term causal effects from short-term experiments and long-term observational data with unobserved confounding. In Conference on causal learning and reasoning (pp. 791–813).
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70, 331-341.
Wald, A. (1940). The fitting of straight lines if both variables are subject to error. Annals of Mathematical Statistics, 11, 284-300.
Wu, L., & Yang, S. (2022). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In Conference on causal learning and reasoning (pp. 904–926).
Yang, S., Gao, C., Zeng, D., & Wang, X. (2023). Elastic integrative analysis of randomised trial and real-world data for treatment heterogeneity estimation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3), 575–596.
Yang, S., Zeng, D., & Wang, X. (2020). Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.
Yang, X., Lin, L., Athey, S., Jordan, M. I., & Imbens, G. W. (2025). Cross-validated causal inference: a modern method to combine experimental and observational data. arXiv preprint arXiv:2511.00727.

Appendix

Proof: Moment condition and Neyman orthogonality for $\psi_{\mathrm{CLATE}}$

Step 1: Definitions.
Let
$$ m_{w,x,z} = E[Y \mid W = w, X = x, Z = z], \qquad r_{w,x,z} = E[D \mid W = w, X = x, Z = z], $$
$$ \bar g_{X,z} = m_{1,X,z} - m_{0,X,z}, \qquad \bar h_{X,z} = r_{1,X,z} - r_{0,X,z}. $$
Define the augmentation (or debiasing) terms, which contain the residuals weighted by the inverse of the propensity score:
$$ A^{(Y)}_{X,z} = (Y - m_{1,X,z})\,\frac{\mathbf{1}\{W = 1, Z = z\}}{\pi_{1,z}(X)} - (Y - m_{0,X,z})\,\frac{\mathbf{1}\{W = 0, Z = z\}}{\pi_{0,z}(X)}, $$
$$ A^{(Y)}_{X,z^-} = (Y - m_{1,X,z^-})\,\frac{\mathbf{1}\{W = 1, Z \neq z\}}{\pi_{1,z^-}(X)} - (Y - m_{0,X,z^-})\,\frac{\mathbf{1}\{W = 0, Z \neq z\}}{\pi_{0,z^-}(X)}, $$
and $A^{(D)}_{X,z}$, $A^{(D)}_{X,z^-}$ analogously with $D$ instead of $Y$ (and $r$ instead of $m$). Let
$$ g_{X,z} = \bar g_{X,z} + A^{(Y)}_{X,z}, \qquad h_{X,z} = \bar h_{X,z} + A^{(D)}_{X,z}. $$
The doubly robust (DR) cross-products are
$$ \Theta_{X,z} = g_{X,z}\, h_{X,z^-} - g_{X,z^-}\, h_{X,z}, $$
which are composed of the plain regression cross-products
$$ \bar\Theta_{X,z} = \bar g_{X,z}\, \bar h_{X,z^-} - \bar g_{X,z^-}\, \bar h_{X,z}, $$
and the augmentation term cross-products
$$ \mathcal{A}^{(Y)}_{X,z} = A^{(Y)}_{X,z}\, h_{X,z^-} - A^{(Y)}_{X,z^-}\, h_{X,z}, \qquad \mathcal{A}^{(D)}_{X,z} = A^{(D)}_{X,z}\, g_{X,z^-} - A^{(D)}_{X,z^-}\, g_{X,z}. $$
The expectation of score function (13), before subtracting $\theta$, corresponds to the map
$$ M(\eta) = E\Big[\sum_{z=1}^{L} \Big\{ \bar\Theta_{X,z}^2 + \Theta_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(Y)}_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(D)}_{X,z} \Big\}\Big]. \tag{A.1} $$
We note that quadratic augmentation terms like $(\mathcal{A}^{(Y)}_{X,z})^2$ or $(\mathcal{A}^{(D)}_{X,z})^2$ could be added in the expectation defining $M(\eta)$, which would recognize that the null hypothesis in (12) is based on $E\big[\sum_{z=1}^{L} \big(\Theta_{X,z}^2 + \Theta_{X,z}\big)\big]$. However, such terms are of second order in the residuals and therefore do not affect Neyman orthogonality, nor are they required for defining the moment condition underlying our test. For this reason, they are not included in $M(\eta)$ or in the following proof.

Step 2: Moment condition.
Expanding $\Theta_{X,z}$ yields
$$ \begin{aligned} \Theta_{X,z} &= \big(\bar g_{X,z} + A^{(Y)}_{X,z}\big)\big(\bar h_{X,z^-} + A^{(D)}_{X,z^-}\big) - \big(\bar g_{X,z^-} + A^{(Y)}_{X,z^-}\big)\big(\bar h_{X,z} + A^{(D)}_{X,z}\big) \\ &= \underbrace{\bar g_{X,z}\,\bar h_{X,z^-}}_{(a)} + \underbrace{\bar g_{X,z}\,A^{(D)}_{X,z^-}}_{(b)} + \underbrace{A^{(Y)}_{X,z}\,\bar h_{X,z^-}}_{(c)} + \underbrace{A^{(Y)}_{X,z}\,A^{(D)}_{X,z^-}}_{(d)} \\ &\quad - \underbrace{\bar g_{X,z^-}\,\bar h_{X,z}}_{(e)} - \underbrace{\bar g_{X,z^-}\,A^{(D)}_{X,z}}_{(f)} - \underbrace{A^{(Y)}_{X,z^-}\,\bar h_{X,z}}_{(g)} - \underbrace{A^{(Y)}_{X,z^-}\,A^{(D)}_{X,z}}_{(h)}. \end{aligned} \tag{A.2} $$
Grouping, we see that
$$ \Theta_{X,z} = \bar\Theta_{X,z} + \big[(b) + (c) - (f) - (g)\big] + \big[(d) - (h)\big]. $$
We note that score function (13) is equal to
$$ \psi_{\mathrm{CLATE}}(O, \theta, \eta) = \sum_{z=1}^{L} \Big\{ \bar\Theta_{X,z}^2 + \Theta_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(Y)}_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(D)}_{X,z} \Big\} - \theta. \tag{A.3} $$
Therefore, $M(\eta) = E[\psi_{\mathrm{CLATE}}(O, \theta, \eta) + \theta]$. At the true nuisance functions $\eta_0$, the augmentation terms have conditional mean zero given $X$:
$$ E[A^{(Y)}_{X,z} \mid X] = E[A^{(D)}_{X,z} \mid X] = 0. $$
This implies that terms $(b)$, $(c)$, $(f)$, and $(g)$ in equation (A.2), each of which is linear in an augmentation term, have conditional mean zero. Terms $(d)$ and $(h)$, which are products of augmentation terms, are conditionally mean zero as well; in fact, each of them pairs an indicator for $Z = z$ with an indicator for $Z \neq z$ and is therefore identically zero. It follows that $E[\Theta_{X,z} \mid X] = E[\bar\Theta_{X,z} \mid X]$. Furthermore, the terms $2\bar\Theta_{X,z}\,\mathcal{A}^{(Y)}_{X,z}$ and $2\bar\Theta_{X,z}\,\mathcal{A}^{(D)}_{X,z}$ in equation (A.3) are conditionally mean zero as well. It follows by the law of iterated expectations that
$$ E\Big[\sum_{z=1}^{L} \Big( \bar\Theta_{X,z}^2 + \Theta_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(Y)}_{X,z} + 2\bar\Theta_{X,z}\,\mathcal{A}^{(D)}_{X,z} \Big)\Big] = E\Big[\sum_{z=1}^{L} \big( \bar\Theta_{X,z}^2 + \bar\Theta_{X,z} \big)\Big]. \tag{A.4} $$
As the term $E\big[\sum_{z=1}^{L} \big(\bar\Theta_{X,z}^2 + \bar\Theta_{X,z}\big)\big]$ corresponds to the definition of $\theta_0$ in equation (12), the moment condition
$$ E[\psi_{\mathrm{CLATE}}(O, \theta_0, \eta)] = 0 \tag{A.5} $$
holds, implying that the expectation of the score function is zero at the true values of the nuisance parameters $\eta$ and the target parameter $\theta_0$.

Step 3: Neyman orthogonality.

Perturbations in outcome model $m$. Consider $m^{(t)}_{w,x,\cdot} = m_{w,x,\cdot} + t\,\delta m_{w,x,\cdot}$ and let $\partial_t$ denote differentiation w.r.t.
$t$ at $t = 0$. The derivative of $M(\eta)$, as defined in equation (A.1), w.r.t. $t$ is
$$ \partial_t M(\eta) = \sum_{z=1}^{L} E\Big[ \underbrace{2\bar\Theta_{X,z}\,\partial_t \bar\Theta_{X,z}}_{I} + \underbrace{\partial_t \Theta_{X,z}}_{II} + \underbrace{2(\partial_t \bar\Theta_{X,z})\,\mathcal{A}^{(Y)}_{X,z} + 2(\partial_t \bar\Theta_{X,z})\,\mathcal{A}^{(D)}_{X,z}}_{III} + \underbrace{2\bar\Theta_{X,z}\,(\partial_t \mathcal{A}^{(Y)}_{X,z})}_{IV} + \underbrace{2\bar\Theta_{X,z}\,(\partial_t \mathcal{A}^{(D)}_{X,z})}_{V} \Big]. \tag{A.6} $$
First, consider the derivatives of the augmentation terms
$$ \begin{aligned} \partial_t A^{(Y)}_{X,z} &= -\delta m_{1,X,z}\,\frac{\mathbf{1}\{W = 1, Z = z\}}{\pi_{1,z}(X)} + \delta m_{0,X,z}\,\frac{\mathbf{1}\{W = 0, Z = z\}}{\pi_{0,z}(X)}, \\ \partial_t A^{(Y)}_{X,z^-} &= -\delta m_{1,X,z^-}\,\frac{\mathbf{1}\{W = 1, Z \neq z\}}{\pi_{1,z^-}(X)} + \delta m_{0,X,z^-}\,\frac{\mathbf{1}\{W = 0, Z \neq z\}}{\pi_{0,z^-}(X)}, \end{aligned} \tag{A.7} $$
$$ \partial_t \mathcal{A}^{(Y)}_{X,z} = \partial_t A^{(Y)}_{X,z}\, h_{X,z^-} - \partial_t A^{(Y)}_{X,z^-}\, h_{X,z}. \tag{A.8} $$
Taking conditional expectations given $X$ implies that $E[h_{X,z} \mid X] = \bar h_{X,z}$ by the augmentation term property $E[A^{(D)}_{X,z} \mid X] = 0$. When additionally considering the propensity score property $E[\mathbf{1}\{W = w, Z = z\}/\pi_{w,z}(X) \mid X] = 1$, we obtain the conditional average derivatives
$$ E[\partial_t \mathcal{A}^{(Y)}_{X,z} \mid X] = -\delta\bar g_{X,z}\,\bar h_{X,z^-} + \delta\bar g_{X,z^-}\,\bar h_{X,z}, \tag{A.9} $$
where $\delta\bar g_{X,z} = \delta m_{1,X,z} - \delta m_{0,X,z}$. Furthermore,
$$ E[\partial_t \mathcal{A}^{(D)}_{X,z} \mid X] = 0. \tag{A.10} $$
Therefore, it follows from the law of iterated expectations and (A.10) that the expectation of term $V$ in equation (A.6) is zero. By the augmentation term property $E[A^{(Y)}_{X,z} \mid X] = E[A^{(D)}_{X,z} \mid X] = 0$, the expectation of term $III$ is zero, too. Next, we expand $\Theta_{X,z} = g_{X,z}\,h_{X,z^-} - g_{X,z^-}\,h_{X,z}$ into the blocks
$$ \Theta_{X,z} = \bar g_{X,z}\,\bar h_{X,z^-} + \bar g_{X,z}\,A^{(D)}_{X,z^-} + A^{(Y)}_{X,z}\,\bar h_{X,z^-} + A^{(Y)}_{X,z}\,A^{(D)}_{X,z^-} - \bar g_{X,z^-}\,\bar h_{X,z} - \bar g_{X,z^-}\,A^{(D)}_{X,z} - A^{(Y)}_{X,z^-}\,\bar h_{X,z} - A^{(Y)}_{X,z^-}\,A^{(D)}_{X,z}, $$
and take derivatives w.r.t.
$t$:
$$ \begin{aligned} \partial_t \Theta_{X,z} &= \underbrace{\delta\bar g_{X,z}\,\bar h_{X,z^-}}_{(a)} + \underbrace{\delta\bar g_{X,z}\,A^{(D)}_{X,z^-}}_{(b)} + \underbrace{\partial_t A^{(Y)}_{X,z}\,\bar h_{X,z^-}}_{(c)} + \underbrace{\partial_t A^{(Y)}_{X,z}\,A^{(D)}_{X,z^-}}_{(d)} \\ &\quad - \underbrace{\delta\bar g_{X,z^-}\,\bar h_{X,z}}_{(e)} - \underbrace{\delta\bar g_{X,z^-}\,A^{(D)}_{X,z}}_{(f)} - \underbrace{\partial_t A^{(Y)}_{X,z^-}\,\bar h_{X,z}}_{(g)} - \underbrace{\partial_t A^{(Y)}_{X,z^-}\,A^{(D)}_{X,z}}_{(h)}. \end{aligned} \tag{A.11} $$
Taking conditional expectations sets the blocks containing an augmentation term to zero, namely $(b)$, $(d)$, $(f)$, and $(h)$. Furthermore, making use of (A.9), $(a)$ and $(c)$ cancel out, as do $(e)$ and $(g)$. It follows that the expectation of term $II$ in equation (A.6) is zero (when also applying the law of iterated expectations).

Finally, we note that
$$ \partial_t \bar\Theta_{X,z} = \delta\bar g_{X,z}\,\bar h_{X,z^-} - \delta\bar g_{X,z^-}\,\bar h_{X,z}, \tag{A.12} $$
which enters term $I$ in equation (A.6) and corresponds to the negative of equation (A.9), which enters term $IV$. For this reason, terms $I$ and $IV$ cancel out. Therefore, we have that
$$ \partial_t M(\eta) = 0. \tag{A.13} $$

Perturbations in the treatment model $r$. By symmetry (interchanging $Y \leftrightarrow D$), the same arguments demonstrate orthogonality w.r.t. perturbations in $r$.

Perturbations in the propensity score $\pi$. Differentiating with respect to $\pi$ affects only the augmentation terms. The derivative introduces terms of the form
$$ -(Y - m_{w,X,\cdot})\,\frac{\mathbf{1}\{W = w, Z = \cdot\}\,\delta\pi_{w,\cdot}(X)}{\pi_{w,\cdot}(X)^2}, \tag{A.14} $$
whose conditional expectation given $X$ is zero because $E[Y - m_{w,X,\cdot} \mid X, W = w, Z = \cdot] = 0$. The same argument holds for the $D$-residuals. Hence any derivative w.r.t. $\pi$ vanishes in expectation.

Step 4: Conclusion. The moment condition $E[\psi_{\mathrm{CLATE}}(O, \theta_0, \eta)] = 0$ is satisfied at the true value of $\eta$, identifying $\theta_0$. Furthermore,
$$ \partial_t E\big[\psi_{\mathrm{CLATE}}(O, \theta_0, \eta + t\,\delta\eta)\big]\Big|_{t=0} = 0 $$
for a perturbation direction $\delta\eta$, such that $\psi_{\mathrm{CLATE}}$ is Neyman-orthogonal at the true value of $\eta$.
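The properties of the augmentation term used in Steps 2 and 3 can be checked numerically. The following Monte Carlo sketch uses a hypothetical data-generating process (two sites, a fair-coin treatment, a linear outcome model) that is not taken from the paper, and it reads $\pi_{w,z}(X)$ as $P(W = w, Z = z \mid X)$, as implied by the propensity score property above. It evaluates the sample mean of $A^{(Y)}_{X,z}$ at the true nuisances, under a perturbed propensity score, and under a perturbed outcome model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: two sites, randomized treatment.
X = rng.uniform(-1.0, 1.0, n)
p_site = 1.0 / (1.0 + np.exp(-X))              # P(Z = 1 | X)
Z = (rng.uniform(size=n) < p_site).astype(int)
W = rng.integers(0, 2, n)                      # P(W = 1) = 0.5, independent of (X, Z)

def m(w, x, z):
    """True outcome regression E[Y | W=w, X=x, Z=z]."""
    return 1.0 + 0.5 * w + x + 0.3 * w * z

Y = m(W, X, Z) + rng.normal(size=n)

def pi(w, z, x):
    """True joint propensity P(W=w, Z=z | X=x); identical for w=0,1 here."""
    p_z = 1.0 / (1.0 + np.exp(-x))
    return 0.5 * (p_z if z == 1 else 1.0 - p_z)

def augmentation(z, m_fn, pi_fn):
    """A^{(Y)}_{X,z}: inverse-propensity-weighted outcome residuals for site z."""
    treated = ((W == 1) & (Z == z)) / pi_fn(1, z, X)
    control = ((W == 0) & (Z == z)) / pi_fn(0, z, X)
    return (Y - m_fn(1, X, z)) * treated - (Y - m_fn(0, X, z)) * control

# (i) True nuisances: the mean should be close to zero.
mean_true = augmentation(1, m, pi).mean()
# (ii) Perturbed propensity, true outcome model: still close to zero.
mean_pi_shift = augmentation(1, m, lambda w, z, x: pi(w, z, x) * (1.0 + 0.3 * x)).mean()
# (iii) Outcome model shifted by 0.2 in the treated arm: mean close to -0.2.
mean_m_shift = augmentation(1, lambda w, x, z: m(w, x, z) + 0.2 * w, pi).mean()

print(f"true nuisances:       {mean_true:+.3f}")
print(f"perturbed propensity: {mean_pi_shift:+.3f}")
print(f"perturbed outcome:    {mean_m_shift:+.3f}")
```

Case (iii) corresponds to $\delta m_{1,X,z} = 0.2$ and $\delta m_{0,X,z} = 0$, so the mean of the augmentation term shifts by $-\delta\bar g_{X,z} = -0.2$. This is the first-order effect that the regression cross-products offset in terms $I$ and $IV$.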
Tables

Table 10: Sample Size by Country and Treatment Status

| Country | Control | Treated | Total |
|---|---|---|---|
| ARGE | 266 | 253 | 519 |
| AUSL | 281 | 281 | 562 |
| AUST | 115 | 114 | 229 |
| BELG | 132 | 131 | 263 |
| BRAS | 38 | 40 | 78 |
| CANA | 59 | 58 | 117 |
| CHIL | 29 | 29 | 58 |
| CZEC | 214 | 217 | 431 |
| EIRE | 26 | 26 | 52 |
| FINL | 26 | 27 | 53 |
| GREE | 74 | 76 | 150 |
| HONG | 53 | 55 | 108 |
| HUNG | 52 | 52 | 104 |
| INDI | 102 | 104 | 206 |
| ISRA | 53 | 57 | 110 |
| ITAL | 1557 | 1554 | 3111 |
| NETH | 361 | 350 | 711 |
| NEW | 224 | 225 | 449 |
| NORW | 263 | 262 | 525 |
| POLA | 377 | 378 | 755 |
| PORT | 190 | 189 | 379 |
| SING | 71 | 68 | 139 |
| SLOK | 43 | 41 | 84 |
| SLOV | 26 | 27 | 53 |
| SOUT | 30 | 32 | 62 |
| SPAI | 232 | 231 | 463 |
| SWED | 313 | 317 | 630 |
| SWIT | 815 | 814 | 1629 |
| TURK | 138 | 140 | 278 |
| UK | 2882 | 2883 | 5765 |
| USA | 59 | 57 | 116 |
| Total | 9101 | 9088 | 18189 |

Figures

Figure 2: Sample size by country (site) and treated vs. control shares.

Note: Figure 2 shows country sample sizes and aspirin assignment shares. Country sizes are uneven, ranging from 52 to 5,765 patients (median 229) per country. Nonetheless, treatment assignment is well balanced within countries.
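The summary figures in the note can be recomputed directly from Table 10. The sketch below hard-codes the control and treated counts from the table. The variable names are illustrative.

```python
import numpy as np

# (control, treated) counts per country, copied from Table 10.
counts = {
    "ARGE": (266, 253), "AUSL": (281, 281), "AUST": (115, 114), "BELG": (132, 131),
    "BRAS": (38, 40), "CANA": (59, 58), "CHIL": (29, 29), "CZEC": (214, 217),
    "EIRE": (26, 26), "FINL": (26, 27), "GREE": (74, 76), "HONG": (53, 55),
    "HUNG": (52, 52), "INDI": (102, 104), "ISRA": (53, 57), "ITAL": (1557, 1554),
    "NETH": (361, 350), "NEW": (224, 225), "NORW": (263, 262), "POLA": (377, 378),
    "PORT": (190, 189), "SING": (71, 68), "SLOK": (43, 41), "SLOV": (26, 27),
    "SOUT": (30, 32), "SPAI": (232, 231), "SWED": (313, 317), "SWIT": (815, 814),
    "TURK": (138, 140), "UK": (2882, 2883), "USA": (59, 57),
}

control = np.array([c for c, _ in counts.values()])
treated = np.array([t for _, t in counts.values()])
total = control + treated
treated_share = treated / total

print(f"countries: {len(counts)}, patients: {total.sum()}")
print(f"country size: min={total.min()}, median={np.median(total):.0f}, max={total.max()}")
print(f"treated share: min={treated_share.min():.3f}, max={treated_share.max():.3f}")
```

The treated share stays within a narrow band around one half in every country, which is consistent with the statement that assignment is well balanced within countries.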
