
📝 Original Info

  • ArXiv ID: 2512.18737

📝 Abstract

The estimation of individual treatment effects (ITE) focuses on predicting the outcome changes that result from a change in treatment. A fundamental challenge in observational data is that while we need to infer outcome differences under alternative treatments, we can only observe each individual's outcome under a single treatment. Existing approaches address this limitation either by training with inferred pseudo-outcomes or by creating matched instance pairs. However, recent work has largely overlooked the potential impact of post-treatment variables on the outcome. This oversight prevents existing methods from fully capturing outcome variability, resulting in increased variance in counterfactual predictions. This paper introduces Pseudo-outcome Imputation with Post-treatment Variables for CounterFactual Regression (PIPCFR), a novel approach that incorporates post-treatment variables to improve pseudo-outcome imputation. We analyze the challenges inherent in utilizing post-treatment variables and establish a novel theoretical bound for ITE risk that explicitly connects post-treatment variables to ITE estimation accuracy. Unlike existing methods that ignore these variables or impose restrictive assumptions, PIPCFR learns effective representations that preserve informative components while mitigating bias. Empirical evaluations on both real-world and simulated datasets demonstrate that PIPCFR achieves significantly lower ITE errors compared to existing methods.

📄 Full Content

Predicting the causal effect of different treatments or interventions is essential in domains such as medicine, finance, and advertising. While traditional approaches relied on randomized controlled trials (RCTs), recent advances focus on leveraging large-scale observational data to estimate treatment effects. For example, marketing strategists can analyze campaign effectiveness, such as variations in monthly ROI under different advertising campaigns, to model how promotional strategies influence profitability. However, the fundamental challenge lies in the fact that we observe each individual under only one treatment, with no direct supervision regarding how an individual's outcome would change if the treatment were different.

Existing works address this challenge by imputing pseudo-outcomes for missing treatments in the training data and using them as supervision. These imputation methods include meta-learners (Curth et al., 2021; Nie & Wager, 2021; Künzel et al., 2019), matching methods (Nagalapatti et al., 2024; Schwab et al., 2018; Kallus, 2020), and generative models (Yoon et al., 2018; Louizos et al., 2017; Bica et al., 2020), and their success depends critically on the quality of the inferred pseudo-outcomes. Recently, Nagalapatti et al. (2024) propose PairNet, a paired instance-based training strategy that avoids relying on potentially noisy pseudo-outcome supervision by instead creating neighbors for each training instance and working directly with observed factual outcomes.

Despite substantial progress in recent years, existing methods face a critical limitation: they largely overlook the potential impact of post-treatment variables on the outcome, particularly when the outcome is sensitive to noise, which leads to increased variance in counterfactual predictions. Consider a real-world advertising example illustrated in Figure 1. Real-world data is often sequential in nature (F_0 ∼ F_K). When analyzing the causal effect of an advertising campaign T (e.g., treatment) at step k, marketers typically collect user behavior metrics that combine pre-treatment variables X = [F_0, F_1, …, F_{k-1}] (e.g., engagement levels before campaign launch, reflecting user preferences that act as confounding variables) and post-treatment variables S = [F_k, F_{k+1}, …, F_K] (e.g., user browsing behaviors after the campaign). Both pre- and post-treatment variables affect the outcome Y (e.g., monthly ROI).

Uncertainty in post-treatment variables introduces substantial variance into counterfactual predictions. Without accounting for these post-treatment variables, existing methods cannot fully capture outcome variability, resulting in increased uncertainty in counterfactual estimates. Figure 1 confirms this issue in PairNet, a recent state-of-the-art method: as the post-treatment variable uncertainty increases, PairNet shows greater variance in counterfactual predictions, leading to significantly higher ITE estimation errors. In this paper, we propose Pseudo-outcome Imputation with Post-treatment Variables for CounterFactual Regression (PIPCFR), an alternative training strategy that incorporates post-treatment variables to improve pseudo-outcome imputation. Unlike existing methods that either remove post-treatment variables (Zhu et al., 2024) or impose restrictive assumptions like the front-door criterion (Xu et al., 2023), our approach learns effective post-treatment representations that preserve informative components while mitigating bias.

PIPCFR consists of three modules: 1) a post-treatment representation network extracts an effective representation ϕ from S; 2) a pseudo-counterfactual constructor takes ϕ and X as inputs to predict counterfactual outcomes; 3) a counterfactual regression module takes X as input and learns from both factual outcomes and inferred counterfactual outcomes. We analyze the challenges of using post-treatment variables and derive a novel theoretical bound for ITE risk that explicitly connects post-treatment variables to estimation accuracy. As shown in Figure 1, PIPCFR significantly reduces counterfactual variance, thereby achieving lower ITE estimation errors. Our experiments show that PIPCFR outperforms existing methods on both real-world and simulated datasets, with notable improvements in challenging scenarios with high exogenous noise and long temporal dependencies. Furthermore, we show that PIPCFR is compatible with various ITE estimation techniques, consistently improving their performance when combined.

In summary, our contributions are:

  1. We introduce PIPCFR, a novel approach that imputes pseudo-outcomes using post-treatment variables to reduce the variance of counterfactual predictions while mitigating potential bias.

  2. We establish a theoretical bound for ITE risk that explicitly connects post-treatment variables to estimation accuracy, providing principled guidance for algorithm design.

  3. We show that PIPCFR significantly outperforms existing ITE methods on both real-world and simulated datasets, particularly in challenging scenarios with high exogenous noise and long temporal dependencies.

Pseudo-outcome Imputation. Meta-Learners employ a two-stage approach: first training a base model on observational data, then constructing an ITE model using derived pseudo-outcomes. X-Learner (Künzel et al., 2019) combines predictions from separate treatment groups. DR-Learner (Kennedy, 2023) incorporates propensity scores for doubly robust estimates. R-Learner (Nie & Wager, 2021) directly learns treatment effects using Robinson's decomposition. However, these methods suffer from error propagation between stages and ignore post-treatment variables. Matching Methods (Schwab et al., 2018; Kallus, 2020; Iacus et al., 2012) impute outcomes from similar instances using strategies like propensity score matching and covariate-based distance measures. While PairNet (Nagalapatti et al., 2024) improves upon these by using only observed outcomes and creating neighbors, it still depends heavily on accurate distance metrics. Like meta-learners, matching methods also fail to account for post-treatment variables. Generative Methods synthesize pseudo-outcomes through various approaches. GANITE (Yoon et al., 2018) employs GANs to generate counterfactuals, SciGAN (Bica et al., 2020) extends this to continuous treatments, while other works explore Gaussian Processes (Zhang et al., 2020) and Variational Autoencoders (Rissanen & Marttinen, 2021).
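To make the meta-learner idea concrete, below is a minimal sketch of the first (imputation) stage of an X-Learner-style method. The regressor choice and the function name are our illustrative assumptions, not details taken from the cited works; a second-stage model would then be fit on the returned pseudo-outcomes.

```python
from sklearn.ensemble import GradientBoostingRegressor

def xlearner_pseudo_outcomes(x, t, y):
    """Schematic first stage of an X-Learner: fit per-group outcome models,
    then impute a pseudo-effect for each unit using the opposite group's model."""
    mu0 = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])  # control outcome model
    mu1 = GradientBoostingRegressor().fit(x[t == 1], y[t == 1])  # treated outcome model
    d1 = y[t == 1] - mu0.predict(x[t == 1])   # imputed effects for treated units
    d0 = mu1.predict(x[t == 0]) - y[t == 0]   # imputed effects for control units
    return d0, d1  # second-stage effect models are trained on these pseudo-outcomes
```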

In causal inference, existing methods primarily handle post-treatment variables in two ways. The first employs front-door adjustment (Jeong et al., 2022; Wienöbst et al., 2022; Xu et al., 2023), which requires post-treatment variables to satisfy the front-door criterion (Pearl, 2009): for instance, the variables must intercept all directed paths from treatment to outcome. This requirement limits practical application in real-world scenarios where such strict conditions rarely hold. The second approach attempts to disentangle post-treatment variables from observed variables and systematically remove their influence (Zhu et al., 2024; Acharya et al., 2016; Elwert & Winship, 2014; King & Zeng, 2006). While this strategy effectively avoids post-treatment bias by deliberately not controlling for these variables, it inevitably discards valuable information contained within S that could improve estimation accuracy. Unlike these existing works, our method neither relies on the restrictive front-door criterion nor completely discards post-treatment information. Instead, PIPCFR extracts effective representations of post-treatment variables while simultaneously mitigating post-treatment bias.

We follow the Neyman-Rubin potential outcomes framework (Rubin, 2005), where an individual with observed covariates x ∈ X, when subjected to a treatment t ∈ T, exhibits an outcome Y(t) ∈ R. We denote post-treatment variables as s ∈ S. We consider binary treatment (T = {0, 1}) in this paper. During training, we have an observational dataset D = {(x_i, t_i, s_i, y_i)}_{i=1}^{N} that includes post-treatment variables, whereas during testing, we must infer the ITE based solely on pre-treatment covariates x. Samples (x, t, s, y) are drawn from a joint distribution p(x, t, s, y). We denote the covariate distribution as p(x) and the conditional distribution p(x | t) as p_t(x). Define p_t(x, s) = p(x, s | t). The marginal treatment distribution is u_t = p(T = t). Let the true outcome function be f*_t(x) = E[Y(t) | X = x].

Our goal is to learn a model τ(x) that estimates the change in outcome τ*(x) := E[Y(1) - Y(0) | X = x] = f*_1(x) - f*_0(x).

Solving this problem requires us to minimize the following ITE risk (formalized in Definition 4):

ϵ_ITE = ∫_X (τ(x) - τ*(x))^2 p(x) dx.

Assumptions. Following prior work, we make the following standard assumptions:

  • A1 (Overlap): Every individual has a non-zero probability of being assigned any treatment, i.e., p(t | x) ∈ (0, 1) for all x, t.

  • A2 (Stable Unit Treatment Value Assumption): The outcomes for any sample i are independent of the treatment assignments of other samples j ≠ i.

  • A3 (Unconfoundedness): The observed covariates block all backdoor paths between treatments and outcomes.

It is worth noting that the assumptions in this paper are the standard assumptions of causal inference; we make no additional assumptions about S or the functional form.

We denote the representation of post-treatment variables as ψ_η : S → R. We define the outcome predictor as f : X × T → R, and define the pseudo-outcome constructor as q : X × T × R → R. To simplify notation, we denote ϕ = ψ_η(s) as the post-treatment representation. We also denote f_t(x) = f(x, t) and q_t(x, ϕ) = q(x, t, ϕ). We define p_t(ϕ | x) = p(ϕ | x, t).

Definition 1. The factual errors of f are:

Definition 2. The counterfactual errors of f and q are:


Definition 3. The expected difference between the counterfactual predictions of f and q is:

Definition 4. The ITE risk is defined in terms of residuals as:

ϵ_ITE = ∫_X (r_1(x) - r_0(x))^2 p(x) dx,

where r_t(x) is the error residual of the outcome predictor under treatment t (Definition 6).

Definition 5. The ITE estimation of the outcome predictor is: τ(x) := f_1(x) - f_0(x).

Proposition 1. The ITE risk can be decomposed into a sum of factual error, counterfactual error, and error residuals as follows:

4 Methodology

The key insight of our method is to leverage post-treatment variables to construct accurate counterfactual pseudo-labels, thereby enabling causal inference models to estimate precise Individual Treatment Effects (ITEs). Using the pseudo-counterfactual outcomes generated by q, we propose an alternative training loss, the PIP loss (ϵ_PIP), to minimize the ITE risk as follows:

where εCF (Definition 3) represents the expected difference between the counterfactual predictions of q and f .

The key in Eq. 5 is how to obtain the post-treatment representation ϕ. Prior research (Acharya et al., 2016; Elwert & Winship, 2014; Zhu et al., 2024) has investigated the negative effects of controlling for post-treatment variables and shown that improper use of post-treatment variables can introduce post-treatment bias into ITE estimation.

To leverage post-treatment variables effectively, we present a simple study to illustrate how post-treatment variables affect the counterfactual estimation in our setting.

Example 1. Consider the structural equation model:

The parameters σ_x, σ_t, σ_u define the scale of random variation within the model. The variable u_s is unobserved exogenous noise in S. It is drawn from N(0, σ_u^2) and is independent of the treatment T. Consider a sample (x, t, s, y(t)). To predict the counterfactual outcome y(1-t), we can:

where →_d denotes convergence in distribution.

Please refer to Appendix section A.4 for the proof. This example demonstrates that: (a) When predicting counterfactual outcomes based solely on pre-treatment covariates X and treatment T, the variance of the prediction error is influenced by σ_u. Higher values of σ_u result in greater prediction uncertainty. (b) While incorporating the post-treatment variable S as an input can reduce this variance, it introduces bias into the predictions. This bias occurs because S is influenced by the treatment T, causing its distribution to vary between treatment and control groups. (c) In contrast, using the variable u_s as a feature offers the best of both worlds: it reduces the variance of counterfactual predictions without introducing bias, since the distribution of u_s remains invariant across different treatments. Although u_s is typically unobservable, we can potentially extract information about u_s from the observable S.
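To make this intuition concrete, the following is a minimal simulation sketch of the three strategies. Since the structural equations of Example 1 are not reproduced above, the data-generating process used here (x drives t, s depends linearly on x, t, and u_s, and y depends linearly on x, t, s) is an assumed stand-in, not the paper's exact model; it only illustrates the variance/bias trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_u = 200_000, 2.0

# Assumed stand-in for Example 1's structural equations: x -> t, (x, t, u_s) -> s, (x, t, s) -> y.
x = rng.normal(0.0, 1.0, n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
u_s = rng.normal(0.0, sigma_u, n)           # exogenous noise, independent of t
s = x + 2.0 * t + u_s                       # post-treatment variable

def outcome(t_val, s_val):
    return x + t_val + s_val                # outcome mechanism y(t)

# Ground-truth counterfactual: s itself would change under the other treatment.
s_cf = x + 2.0 * (1 - t) + u_s
y_cf = outcome(1 - t, s_cf)

# (a) Predict from (x, t) only: unbiased, but the error inherits Var(u_s).
pred_a = x + (1 - t) + (x + 2.0 * (1 - t))
# (b) Reuse the factual s as a feature: low variance but biased within each group,
#     because s was generated under the factual treatment, not 1 - t.
pred_b = x + (1 - t) + s
# (c) Recover u_s from s and rebuild the counterfactual s: low variance, no bias,
#     since the distribution of u_s is invariant across treatments.
u_hat = s - (x + 2.0 * t)
pred_c = x + (1 - t) + (x + 2.0 * (1 - t) + u_hat)

for name, pred in [("(a) x,t only", pred_a), ("(b) reuse s", pred_b), ("(c) use u_s", pred_c)]:
    err = pred - y_cf
    print(f"{name:13s} bias|t=0: {err[t == 0].mean():+.2f}  "
          f"bias|t=1: {err[t == 1].mean():+.2f}  std: {err.std():.2f}")
```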

Therefore, to construct effective pseudo-labels for the PIP loss (5), we need to learn representations of post-treatment variables that extract useful variance-reducing information while eliminating components that would introduce bias.

To achieve this, we investigate the gap between the ITE risk ϵ_ITE and our PIP loss ϵ_PIP, and provide a novel bound for the ITE risk (Theorem 1, inequality (7)) that explicitly connects post-treatment variables to estimation accuracy.

where

Q_t(ψ_η, q) = ∫_X ∫_S [f*(x) - q(x, 1-t, ψ_η(s))]^2 p_t(s, x) ds dx

can be regarded as evaluating how well the predictor q models (or represents) counterfactual outcome information, this modeling being facilitated by the introduction of s. Additionally, there exists a constant B such that (1/B) r_t(x, ϕ) ∈ G, where r_t(x, ϕ) = f_t(x) - q_t(x, ϕ). Please refer to Appendix section A.2 for the proof. Theorem 1 shows that the ITE risk is upper bounded by the sum of: (1) the PIP loss; (2) an Integral Probability Metric (IPM) that measures the distributional distance of the representations ϕ across treatment groups, a source of post-treatment bias that our model aims to minimize in order to encourage ϕ ⊥⊥ t | x; and (3) the generalization error of the pseudo-outcome constructor q. The last term measures the additional information about counterfactual outcomes gained by the pseudo-outcome constructor q after incorporating post-treatment variables; when q is optimally trained, this term depends only on the inherent characteristics of the data.

Building upon our theoretical results, we propose PIPCFR, an end-to-end algorithm for ITE estimation. Motivated by Theorem 1, we minimize the upper bound in (7) in order to minimize the ITE risk.

Minimizing Post-treatment Bias First, to eliminate the post-treatment bias, we need to minimize the IPM term in (7). However, in the observational data, we only have representations ϕ under one treatment, and cannot access both p_0(ϕ | x) and p_1(ϕ | x) at the same time. A straightforward way to compute this IPM term is to use a matching strategy that creates neighbors to estimate the representations ϕ under the missing treatment, but this method is time-consuming and relies heavily on distance metrics. We propose an alternative solution that instead minimizes the Kullback-Leibler (KL) divergence KL(p(t | x) ∥ p(t | x, ϕ)).

Proposition 2 (Relation between IPM and KL Divergence). Let G be the family of norm-1 functions in a Reproducing Kernel Hilbert Space (RKHS). Assume G is generated by a normalized kernel. The Integral Probability Metric (IPM) between the two distributions p_0(ϕ | x) and p_1(ϕ | x) is bounded by their conditional KL divergence as follows:

Please refer to Appendix section A.3 for the proof. Minimizing the KL term amounts to enforcing the conditional independence between ϕ and t given x. To achieve this, we introduce two propensity score models, g(t, x) = p(t | x) and g(t, x, ϕ) = p(t | x, ϕ), which are trained to predict treatments using the following objective:

We then optimize the post-treatment representation networks ψ η to minimize the KL loss:

where γ is a hyperparameter. This induces conditional independence by ensuring that a propensity score model does not gain any additional information about the treatment when given access to ψ_η(s_i).
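Below is a minimal PyTorch sketch of how objectives (8) and (9) might be implemented. The network sizes, the binary cross-entropy loss for the propensity models, and the per-sample Bernoulli KL estimate are implementation assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PropensityNet(nn.Module):
    """Predicts p(t = 1 | inputs); used for both g(t, x) and g(t, x, phi)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, inp):
        return torch.sigmoid(self.net(inp)).squeeze(-1)

def propensity_loss(g_x, g_xphi, x, phi, t):
    """Objective (8): train both propensity models to predict the treatment.
    t is a float tensor of 0/1 treatment indicators."""
    bce = F.binary_cross_entropy
    return bce(g_x(x), t) + bce(g_xphi(torch.cat([x, phi], dim=-1)), t)

def kl_loss(g_x, g_xphi, x, phi, t, gamma=0.5):
    """Objective (9): a batch estimate of KL(p(t|x) || p(t|x, phi)),
    penalizing any extra treatment information carried by phi.
    Gradients of this term are intended to update the representation
    network psi_eta (through phi), not the propensity models."""
    p_x = g_x(x).detach()                        # treat p(t|x) as the reference
    p_xphi = g_xphi(torch.cat([x, phi], dim=-1))
    eps = 1e-6
    kl = (p_x * torch.log((p_x + eps) / (p_xphi + eps))
          + (1 - p_x) * torch.log((1 - p_x + eps) / (1 - p_xphi + eps)))
    return gamma * kl.mean()
```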

Minimizing Generalization Error of q Second, to ensure the quality of the pseudo-outcomes, it is necessary to minimize the generalization error of q. In principle, we can employ any existing causal inference algorithm for this purpose. In this paper, we simply use a TARNet-like method to construct the pseudo-outcome predictor. Specifically, we adopt CFRNet (Shalit et al., 2017) to minimize ε_CF. The learning objective of CFRNet combines a weighted factual loss with an IPM distance that measures the divergence of covariate representations between treatment and control groups. We define the pseudo-counterfactual constructor as q(x, t, ϕ) = h(ψ_α(x), t, ϕ), where ψ_α is a representation extraction function shared across treatments, and h is the treatment-specific outcome prediction function. The learning objective is as follows:

Through optimization of this term, representations ϕ extract information from S that provides additional predictive power for outcomes conditional on x, effectively capturing the randomness of S.
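For concreteness, here is a sketch of a CFRNet-style objective for the pseudo-outcome constructor q, continuing the PyTorch sketch above. The linear-kernel MMD as the IPM term, the inverse-frequency weighting, and the function names are assumptions for illustration rather than the paper's exact configuration.

```python
def mmd_linear(rep0, rep1):
    """Linear-kernel MMD between control and treated representations;
    a simple stand-in for the IPM term in the CFRNet-style objective."""
    return ((rep0.mean(dim=0) - rep1.mean(dim=0)) ** 2).sum()

def q_loss(psi_alpha, h, x, t, phi, y, alpha=1.0):
    """CFRNet-style objective for q(x, t, phi) = h(psi_alpha(x), t, phi):
    weighted factual loss plus an IPM penalty on the shared representation."""
    tf = t.float()
    rep = psi_alpha(x)                           # representation of x shared across treatments
    y_hat = h(rep, tf, phi)                      # treatment-specific heads live inside h
    # Re-weight the factual loss by inverse treatment frequency (u_t in the paper).
    u1 = tf.mean().clamp(1e-3, 1 - 1e-3)
    w = tf / (2 * u1) + (1 - tf) / (2 * (1 - u1))
    factual = (w * (y_hat - y) ** 2).mean()
    ipm = mmd_linear(rep[tf == 0], rep[tf == 1])
    return factual + alpha * ipm
```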

Minimizing PIP Loss Third, we minimize the ϵ_PIP term. We use a TARNet-like architecture to construct the outcome predictor f. The empirical loss can be written as:

Algorithm Summary The whole training pipeline is summarized in Algorithm 1: PIPCFR is trained end-to-end by jointly optimizing the objectives (8), (9), (10), and (11) using stochastic gradient descent, updating the outcome predictor f with the loss L_pip in each iteration. Please refer to Appendix C for more details.
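The following is a minimal sketch of what this joint training loop could look like, reusing the propensity_loss, kl_loss, and q_loss helpers from the sketches above. The update order, the use of separate optimizers per module, and the squared-error form of the PIP loss on factual and pseudo-counterfactual outcomes are assumptions about one reasonable implementation, not a transcription of the paper's Algorithm 1.

```python
import torch

def train_pipcfr(loader, psi_eta, psi_alpha, h, f, g_x, g_xphi,
                 opt_prop, opt_q, opt_rep, opt_f, gamma=0.5, epochs=100):
    """Sketch of end-to-end PIPCFR training with SGD over objectives (8)-(11)."""
    for _ in range(epochs):
        for x, t, s, y in loader:
            t = t.float()

            # (8) train the two propensity models on the current representation
            phi = psi_eta(s).detach()
            opt_prop.zero_grad()
            propensity_loss(g_x, g_xphi, x, phi, t).backward()
            opt_prop.step()

            # (9) update the post-treatment representation to remove treatment info
            phi = psi_eta(s)
            opt_rep.zero_grad()
            kl_loss(g_x, g_xphi, x, phi, t, gamma).backward()
            opt_rep.step()

            # (10) train the pseudo-outcome constructor q on factual outcomes
            phi = psi_eta(s).detach()
            opt_q.zero_grad()
            q_loss(psi_alpha, h, x, t, phi, y).backward()
            opt_q.step()

            # (11) PIP loss: fit f on factual outcomes and pseudo-counterfactuals
            with torch.no_grad():
                y_cf = h(psi_alpha(x), 1 - t, phi)   # pseudo-counterfactual outcomes
            loss_f = ((f(x, t) - y) ** 2).mean() + ((f(x, 1 - t) - y_cf) ** 2).mean()
            opt_f.zero_grad()
            loss_f.backward()
            opt_f.step()
```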

In this section, we aim to answer the following research questions:

• RQ1: How does PIPCFR perform compared to state-of-the-art methods in ITE estimation?

• RQ2: How robust is PIPCFR when post-treatment variables exhibit complexity and noise?

• RQ3: How does PIPCFR perform when combined with other methods?

• RQ4: How sensitive is PIPCFR to the choice of hyper-parameters?

• RQ5: How does PIPCFR perform in real-world data?

Dataset. We evaluate our approach on both real-world and simulated datasets, including IHDP (Hill, 2011), News (Johansson et al., 2016), a synthetic dataset, and a real-world dataset. For IHDP and News, post-treatment variables are generated following (Cheng et al., 2021). For the synthetic dataset, we simulate a temporal causal system with interacting variables over time. Additional details are provided in Appendix B.

Baseline. We compare PIPCFR with several competitive baselines, as shown in Table 1. (1) Tree Learners: causal forest (Athey et al., 2019), which builds upon traditional random forests to estimate ITE. (2) Meta-learners such as X-Learner (Künzel et al., 2019), which learn ITE directly after imputing pseudo-outcomes for missing treatments. (3) Representation-learning methods like TARNet (Shalit et al., 2017), CFRNet (Shalit et al., 2017), and DRCFR (Hassanpour et al., 2019); these approaches share representations between different treatment heads while learning treatment-specific estimators with varied regularization techniques. (4) Propensity-score learners such as DragonNet (Shi et al., 2019), a doubly robust method that imposes weighted factual losses. (5) Matching methods including PairNet (Nagalapatti et al., 2024) and PerfectMatch (Schwab et al., 2018), which employ matching strategies by creating neighboring samples to impute outcomes for missing treatments. Additionally, we consider two extended baselines, PairNet+S and PerfectMatch+S, which incorporate post-treatment variables S when matching neighboring samples.

Metrics. We evaluate ITE risk on a dataset D_tst using the Precision in Estimating Heterogeneous Effects (ϵ_PEHE) (Johansson et al., 2016), defined as:

ϵ_PEHE = (1 / |D_tst|) Σ_{x ∈ D_tst} (τ(x) - τ*(x))^2

We quantify PEHE (in) error for training instances and PEHE (out) error for testing instances. To ensure the reliability of the results, we sampled 50 datasets (D_train, D_test) from the data distribution p(x, t, s, y) for training and testing, and reported the mean and standard deviation.
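As a concrete reference, computing this metric is straightforward; the short NumPy sketch below uses our own function and variable names rather than the paper's code, and notes that some works report the square root of this quantity.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Precision in Estimating Heterogeneous Effects over a dataset:
    mean squared error between estimated and true individual effects.
    (Some papers report sqrt(PEHE); check the convention being compared against.)"""
    tau_hat, tau_true = np.asarray(tau_hat), np.asarray(tau_true)
    return float(np.mean((tau_hat - tau_true) ** 2))

# Usage: tau_hat = f(x, 1) - f(x, 0) evaluated on D_train gives PEHE (in),
#        and on D_test gives PEHE (out).
```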

We present the results in Table 1. PIPCFR consistently outperforms existing methods from five categories of prior ITE techniques. Meta-learners show poor performance due to their two-stage regression approach. Missing outcomes imputed during the first stage create errors that propagate to the second stage, resulting in suboptimal ITE estimation. Through joint training approaches, representation learners and PS-learners outperform meta-learners by enabling effective information sharing across different treatments. However, a key limitation of these methods is that they lack specific mechanisms to address the variance that emerges from post-treatment variables. Matching methods perform poorly as they depend on distance metrics to create neighboring samples for missing treatments, which can be unreliable when post-treatment variables introduce variance. Notably, even when post-treatment variables (S) are incorporated in the matching process (PairNet+S, PerfectMatch+S), these methods still underperform. This confirms our analysis in Section 4.2 that directly using S as input introduces post-treatment bias. In contrast to these baselines, PIPCFR demonstrates superior performance by effectively extracting useful information from post-treatment variables while simultaneously mitigating post-treatment bias and imputing accurate pseudo-outcomes for missing treatments.

Impact of Exogenous Noise in S To evaluate the impact of exogenous noise, we vary the noise scales ϵ_u = [1, 2, 3, 4, 5] as defined in Appendix B. Figure 3 (left) demonstrates a clear pattern: as noise increases, all methods exhibit increasing PEHE error. Notably, PIPCFR consistently outperforms all baselines across different noise scales, demonstrating its robust performance in the presence of significant exogenous noise. More importantly, PIPCFR's performance advantage grows with increasing noise, indicating its superior capability in handling the uncertainty in post-treatment variables compared to existing approaches.

In real-world scenarios, post-treatment variables can manifest as long sequences where noise accumulates over time, making ITE estimation increasingly challenging as the time horizon extends. We consider post-treatment variables as a K-step sequence, as defined in the Appendix B. We experiment with a fixed noise scale while varying K ∈ [1, 100]. The results are presented in Figure 3 (middle). As expected, ITE estimation errors increase for all methods as K increases. Notably, PIPCFR achieves lower PEHE error compared to the baselines. Furthermore, PIPCFR’s performance advantage actually widens as K increases, demonstrating its superior ability to capture long temporal dependencies between post-treatment variables and outcomes.

In Section 4.4, we employ CFRNet (Shalit et al., 2017) to minimize the generalization error of q. However, our approach is fundamentally flexible: in principle, we can utilize any existing causal inference method to minimize this generalization error, making PIPCFR inherently compatible with a wide range of existing ITE estimation techniques. To demonstrate this compatibility, we conduct experiments in which we replace the CFRNet method with several alternatives, including DRCFR (Hassanpour et al., 2019), DragonNet (Shi et al., 2019), and ESCFR (Wang et al., 2024). The results, presented in Table 2, clearly show that PIPCFR consistently enhances performance when integrated with these existing ITE estimation methods.

We examine the impact of KL loss weight γ in (9), as shown in Figure 3 (right). The performance drop at γ = 0 highlights the importance of conditional independence when learning representations of post-treatment variables. Large γ values lead to performance degradation, likely because the KL divergence term dominates the optimization process. The model achieves optimal performance within a stable range of γ ∈ [0.1, 1], showing that PIPCFR is not very sensitive to hyperparameter choice. This stability reduces the need for precise hyperparameter tuning, enhancing the practical applicability of our approach.

We evaluate our model on a product dataset comprising over 3 million samples collected from an online gaming platform. The pre-treatment variable X contains 621 features that describe users' static profiles and their recent gaming histories. The post-treatment variable S contains 320 features that describe users' performance and behavior after treatment. The outcome Y is defined as the users' cumulative game rounds in the next 3 days. We establish ITE ground truth by performing matching methods (e.g., KNN and PSM) against the entire dataset. The ϵ_PEHE (in) and ϵ_PEHE (out) values of different models are summarized in Table 3. PIPCFR consistently outperforms competitive baselines, achieving the lowest PEHE under both KNN-based and PSM-based ground truth estimations.
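Because counterfactuals are unobserved in this product dataset, the ground-truth ITE is approximated by matching. A minimal sketch of the KNN variant of such a procedure is below; the neighbor count, distance metric, and function name are our assumptions, since the exact matching configuration is not specified above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_pseudo_ite(x, t, y, k=5):
    """Approximate each unit's ITE by matching it to its k nearest neighbors
    in the opposite treatment group and differencing the outcomes.
    x: (n, d) feature matrix, t: (n,) 0/1 treatments, y: (n,) outcomes."""
    ite = np.empty(len(y))
    for group in (0, 1):
        this, other = (t == group), (t != group)
        nn = NearestNeighbors(n_neighbors=k).fit(x[other])
        _, idx = nn.kneighbors(x[this])
        y_cf = y[other][idx].mean(axis=1)        # matched counterfactual outcome
        # ITE = y(1) - y(0): the sign depends on the unit's own treatment group.
        ite[this] = (y[this] - y_cf) if group == 1 else (y_cf - y[this])
    return ite
```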

In this paper, we addressed a critical limitation in existing ITE estimation methods: the oversight in accounting for post-treatment variables that introduce significant variance in counterfactual predictions. We introduced PIPCFR, a novel approach that leverages post-treatment variables to impute more accurate pseudo-outcomes. Our theoretical analysis established a new bound for ITE risk that explicitly connects post-treatment variables to estimation accuracy. Experiments demonstrated that PIPCFR consistently outperforms state-of-the-art methods from various categories. We showed that PIPCFR is compatible with existing ITE methods.

Our work opens new directions for causal inference research by highlighting the importance of properly handling post-treatment variables. While promising, this work has limitations, particularly its reliance on standard causal inference assumptions such as unconfoundedness. Several interesting research directions lie ahead: (1) incorporating PIPCFR into more sophisticated models that account for hidden confounders, and (2) exploring advanced network architectures to enhance the representation learning of post-treatment variables.

A.1 Definitions

Definition 6. The error residual is:

Definition 7. The counterfactual errors are:

A.2 Proof of Theorem 1

Theorem. Assume there exists δ > 0 such that E_{x,t,s ∼ p(X,T,S)}[r_t(x) r_{1-t}(x, s)] ≤ δ ε_CF; then we have

where Q_t(ψ_η, q) = ∫_X ∫_S [f*(x) - q(x, 1-t, ψ_η(s))]^2 p_t(s, x) ds dx can be regarded as evaluating how well the predictor q models (or represents) counterfactual outcome information, this modeling being facilitated by the introduction of s. Additionally, there exists a constant B such that (1/B) r_t(x, ϕ) ∈ G, where r_t(x, ϕ) = f_t(x) - q_t(x, ϕ).

∫_X [ r_0(x)^2 - ∫_S r_0(x, s)^2 p_1(s | x) ds ] p_1(x) dx

In the following, we derive H_1, H_2, H_3, and H_4 respectively.

Therefore, the bound of ITE risk is:

A.3 Proof of Proposition 2

Proposition (Relation between IPM and KL Divergence). Let G be the family of norm-1 functions in a Reproducing Kernel Hilbert Space (RKHS). Assume G is generated by a normalized kernel. The Integral Probability Metric (IPM) between the two distributions p_0(ϕ | x) and p_1(ϕ | x) is bounded by their conditional KL divergence as follows:

where

Proof. In the following, we denote

and JS_π(p_1 ∥ p_0) denotes the Jensen-Shannon divergence, defined as follows:

Our proof includes three steps:

The detailed steps are as follows:

Step 1.

With the following equations, E_{ϕ ∼ p(ϕ|x)} KL(p(t | x, ϕ) ∥ p(t | x))

Combining Lemma 5 and Theorem 7 from (Hoyos-Osorio & Sanchez-Giraldo, 2023) yields the following inequality for any kernel κ satisfying κ(x, x) = 1:

Since G is generated by a normalized kernel, we have:

In particular, if the expected KL term vanishes, then KL(p(t | x, ϕ) ∥ p(t | x)) = 0 for all ϕ, x in the support of p(ϕ, X) (or almost everywhere with respect to p(Φ, X)).

Furthermore, for any given x and ϕ, KL(p(t | x, ϕ) ∥ p(t | x)) = 0 implies p(t | x, ϕ) = p(t | x), and hence p(t, ϕ | x) = p(t | x) p(ϕ | x).

This factorization is the definition of conditional independence, t ⊥ ⊥ ϕ | x. The conditional independence t ⊥ ⊥ ϕ | x implies that the distribution of ϕ given x is independent of the value of t. Specifically, p(ϕ | x, t) = p(ϕ | x) for all t. For t ∈ {0, 1}, this means p(ϕ | x, t = 0) = p(ϕ | x, t = 1) almost everywhere with respect to the measure for ϕ. Thus, p 0 (ϕ | x) = p 1 (ϕ | x) almost everywhere for each x.

From the definition of the Integral Probability Metric (IPM), we know that IPM_G(P_1, P_2) = 0 if and only if P_1 = P_2 (for a suitable class of functions G). Therefore, for every x, since the conditional distributions p_0(ϕ | x) and p_1(ϕ | x) are equal as distributions over ϕ, we have

Finally, taking the expectation over x, we obtain

A.4 Proof of Example 1

Proof. First, we define the true counterfactual outcome as

B Datasets

The IHDP dataset (Hill, 2011) is a semi-synthetic benchmark derived from the Infant Health and Development Program, in which units are each represented by 25 pre-treatment variables related to the children and mothers. Please refer to (Hill, 2011) for more details.

Following (Cheng et al., 2021), we define the post-treatment variables for each unit i as a time series {s_{ik}^{t_i}}_{k=1}^{K} of length K. We consider m-dimensional post-treatment variables, s_{ik}^{t_i} ∈ R^m, for each time step k ∈ {1, 2, . . . , K}. These variables are generated based on the unit's treatment indicator t_i ∈ {0, 1}, baseline covariates x_i ∈ R^{1×25}, and previous values in the time series.

For k = 1, the initial post-treatment variable s_{i1}^{t_i} ∈ R^m for unit i is directly taken from the observed outcome in the IHDP dataset (distinct from the variable y in our model).

For time steps k > 1, the variable s_{ik}^{t_i} is generated based on the unit's treatment indicator t_i ∈ {0, 1}, baseline covariates x_i ∈ R^{25}, and previous values in the time series {s_{ij}^{t_i}}_{j=1}^{k-1}. The generative model for k > 1 is as follows:

Here, β_{t_i} represents the coefficient matrix specific to the treatment group (β_0 ∈ R^{25×m} if t_i = 0 and β_1 ∈ R^{25×m} if t_i = 1) and C_1 is a scalar scaling factor. The exogenous noise vectors ϵ_u ∈ R^m are generated for k > 1 by sampling each of their m elements independently from the Laplace distribution with mean 0 and scale 1.

The coefficient matrices β_0 ∈ R^{25×m} and β_1 ∈ R^{25×m} are generated by sampling each of their elements independently according to the specified discrete probability distributions:

• Elements of β 0 : sampled from {0, 1, 2, 3, 4} with probabilities {0.5, 0.2, 0.15, 0.1, 0.05}, respectively.

• Elements of β 1 : sampled from {-2, -1, 0, 1, 2} with probabilities {0.2, 0.2, 0.2, 0.2, 0.2}, respectively.

The outcome is defined in terms of the post-treatment series {s_{jk}}. We randomly split the samples into train/validation/test with a 60/20/20 ratio.
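A sketch of how such a post-treatment time series could be generated is given below. Because the recurrence equation itself is not reproduced above, the specific update rule (previous value plus a scaled x_i β_{t_i} term plus Laplace noise) is an assumption; only the β sampling scheme and the Laplace noise follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta(values, probs, shape=(25, 1)):
    """Sample a coefficient matrix elementwise from a discrete distribution,
    following the beta_0 / beta_1 sampling scheme described above."""
    return rng.choice(values, size=shape, p=probs)

beta = {
    0: sample_beta([0, 1, 2, 3, 4], [0.5, 0.2, 0.15, 0.1, 0.05]),
    1: sample_beta([-2, -1, 0, 1, 2], [0.2] * 5),
}

def gen_post_treatment(x_i, t_i, s_init, K, C1=0.1, m=1):
    """Generate {s_ik}_{k=1..K} for one unit.
    The recurrence (previous value + C1 * x_i beta_{t_i} + Laplace noise)
    is an assumed functional form used only for illustration."""
    s = np.zeros((K, m))
    s[0] = s_init                                  # k = 1: taken from the observed IHDP outcome
    for k in range(1, K):
        eps_u = rng.laplace(loc=0.0, scale=1.0, size=m)
        s[k] = s[k - 1] + C1 * (x_i @ beta[t_i]) + eps_u
    return s

# Example usage for a single unit with 25 covariates and K = 60 steps:
x_i = rng.normal(size=25)
series = gen_post_treatment(x_i, t_i=1, s_init=0.5, K=60)
```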

The News dataset simulates a media consumer's opinions on various news items, viewed either on a mobile device or desktop. Each news item is represented by word counts x_i ∈ N^V, where V is the number of words. The outcome y_i ∈ R reflects reader experience, and the treatment t_i ∈ {0, 1} indicates the device: desktop (t = 0) or mobile (t = 1). A topic model trained on a large document set represents consumer preferences. Let k be the number of topics and z(x) ∈ R^k be the topic distribution of news item x, with z_c^1 (mobile) and z_c^0 (desktop) as centroids in the topic space. The outcome is determined by the similarity between z(x_i) and z_c^t: y_i^{t_i} = C(z(x_i)^T z_c^0 + t_i · z(x_i)^T z_c^1) + ϵ, where C is a scaling factor and ϵ ∼ N(0, 1) is noise. In total, News comprises 5,000 news items, each with 3,477 word counts. For more details, please refer to (Johansson et al., 2016).
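For reference, the News outcome model above can be written as a short sketch; the variable names and the default value of the scaling factor C are ours, not from the original benchmark code.

```python
import numpy as np

def news_outcome(z_x, z_c0, z_c1, t, C=50.0, rng=None):
    """Outcome for a news item with topic vector z_x under device treatment t,
    following y = C * (z_x . z_c0 + t * z_x . z_c1) + eps with eps ~ N(0, 1).
    The default scaling factor C is an illustrative assumption."""
    if rng is None:
        rng = np.random.default_rng()
    return C * (float(z_x @ z_c0) + t * float(z_x @ z_c1)) + rng.normal(0.0, 1.0)
```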

We define the post-treatment variables for each unit i as a time series {s_{ik}^{t_i}}_{k=1}^{K} of length K. We consider multidimensional s_{ik}^{t_i} ∈ R^{m×1} as follows:

where ϵ_u represents exogenous noise generated from the Laplace distribution with mean 0 and scale 1, C_2 is a scaling factor, and A ∈ R^{m×m} is the transfer matrix with eigenvalues less than 0.9. For IHDP, the outcome is defined as the last post-treatment variable; for News, the outcome is defined in terms of the post-treatment series {s_{jk}}. Table 1 presents the results for K = 60. We randomly split the samples into train/validation/test with a 60/20/20 ratio.

The synthetic dataset simulates a temporal causal system with interacting variables over time. The initial state variables x_{i0} ∈ R^N are sampled from a standard normal distribution. The state evolution follows:

where a_x ∈ R^{N×1} is a coefficient matrix. The binary treatment t_{ik} is assigned through:

