Enhanced inference for distributions and quantiles of individual treatment effects in various experiments


Understanding treatment effect heterogeneity has become increasingly important in many fields. In this paper we study distributions and quantiles of individual treatment effects, which provide a more comprehensive and robust understanding of treatment effects than the usual averages, even though they are more challenging to infer owing to their nonidentifiability from observed data. Recent randomization-based approaches offer finite-sample valid inference for treatment effect distributions and quantiles in both completely randomized and stratified randomized experiments, but they can be overly conservative because they assume the worst-case scenario in which the units with large effects are all assigned to the treated (or control) group. We introduce two improved methods that enhance the power of these existing approaches. The first reinterprets the existing approaches as inferring treatment effects among only the treated or only the control units, and then combines the two inferences to infer treatment effects for all units. The second explicitly controls for the actual number of treated units with large effects. Both simulations and applications demonstrate the substantial gains from the improved methods. The methods are further extended to sampling-based experiments as well as quasi-experiments based on matching, settings in which the ideas behind the two improved methods play critical and complementary roles.


💡 Research Summary

This paper tackles the challenging problem of inference for the full distribution and quantiles of individual treatment effects (ITEs) in randomized experiments. While average treatment effects dominate causal inference literature, understanding the entire ITE distribution is crucial for questions such as “what proportion of units benefit?” or “what is the median effect?” Existing randomization‑based methods (Caughey et al., 2023; Su & Li, 2024) provide finite‑sample exact tests but are overly conservative because they evaluate the worst‑case scenario in which all units with large effects are placed in the same arm (treated or control). In practice, random assignment makes such configurations extremely unlikely, leading to p‑values that are far from the nominal level and confidence intervals that are uninformative for quantiles below the median.
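To make the worst-case logic concrete, here is a minimal toy sketch. It is not the actual rank-score procedure of Caughey et al., which tests the quantile nulls H_{n,k,c} via an optimization; instead it tests the simpler bounded null that every ITE is at most c, imputing each missing potential outcome at the null boundary, which is the basic conservative device these methods build on.

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_case_pvalue(y, z, c, n_draws=2000):
    """Conservative Fisher randomization p-value for the bounded null
    H0: tau_i <= c for every unit i (a simplified stand-in for the
    quantile nulls discussed above).  Missing potential outcomes are
    imputed at the null boundary, Y_i(1) = Y_i(0) + c and
    Y_i(0) = Y_i(1) - c, which can only inflate the p-value."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=int)
    y1 = np.where(z == 1, y, y + c)   # imputed treated outcomes
    y0 = np.where(z == 1, y - c, y)   # imputed control outcomes
    t_obs = y[z == 1].mean() - y[z == 0].mean()
    n, m = len(y), int(z.sum())
    draws = np.empty(n_draws)
    for b in range(n_draws):
        zb = np.zeros(n, dtype=int)
        zb[rng.choice(n, size=m, replace=False)] = 1
        draws[b] = y1[zb == 1].mean() - y0[zb == 0].mean()
    # add-one correction keeps the Monte Carlo p-value strictly positive
    return (1 + np.sum(draws >= t_obs)) / (1 + n_draws)
```

Larger c makes the null easier to satisfy, so the p-value grows with c; rejecting for small c yields a one-sided confidence bound on the largest ITE.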

The authors propose two complementary improvements that substantially increase power while preserving exact finite‑sample validity.

  1. Treated- and Control-Specific Prediction-Interval Combination
    The original test of the composite null H_{n,k,c} (that the k-th smallest ITE satisfies τ_(k) ≤ c) can be reinterpreted as constructing a prediction interval for the random quantile of ITEs among the treated units only (or among the control units only). Under the null, the most “adverse” configuration, the one that minimizes the rank-score test statistic, assigns the largest observed outcomes to the treated units and sets their unobserved potential outcomes to infinity, while the remaining units receive the bound c. By computing one prediction interval for the treated units and another for the control units, and then intersecting (or otherwise combining) the two, one obtains a valid confidence interval for the quantile over the entire finite population. This avoids having to consider the worst-case configuration for the whole sample simultaneously, and thereby extracts more information from the randomization.

  2. Berger–Boos-Style Adjustment Controlling the Number of Large-Effect Treated Units
    Building on Berger and Boos (1994), the authors treat the unknown count θ of treated units whose ITE exceeds the threshold c as a nuisance parameter. They first bound θ with a high-probability (e.g., 95%) upper confidence limit derived from the randomization distribution. For each admissible θ below that limit they compute the Fisher randomization p-value, take the supremum over θ, and add the limit’s error probability to preserve validity. The resulting p-value is guaranteed to be conservative, but far less so than the original worst-case p-value, because the adjustment respects the actual (or at least plausible) number of large-effect treated units observed in the data.

Both methods retain the design‑based (finite‑population) perspective: the potential outcomes are treated as fixed, and randomness stems solely from the treatment assignment. The authors extend the techniques to several important settings:

  • Stratified Randomized Experiments – applying the two ideas within each stratum and aggregating across strata, improving upon the stratified extension of Su & Li (2024).
  • Sampling‑Based Experiments – where the experimental units are a sample from a larger target population; weighting schemes are incorporated to infer the ITE distribution for the whole population.
  • Quasi‑Experiments via Matching – the methods are adapted to matched observational studies, and a sensitivity analysis is provided to assess robustness against unmeasured confounding.

Simulation studies covering a range of effect heterogeneity patterns (light, moderate, heavy‑tailed) and sample sizes (n = 100, 500, 2000) demonstrate that the proposed methods achieve 15–30 percentage‑point gains in power relative to the original Caughey et al. approach. Notably, for quantiles at or below the median, the new confidence intervals are frequently finite and informative, whereas the original method often yields the entire real line.

Two real‑world applications illustrate practical impact. In an educational program evaluation, the median ITE on test scores is estimated at 0.12 points, larger than the average effect of 0.08, suggesting that half of the students experience a meaningful gain. In a medical intervention study, the 25th‑percentile ITE is negative (−0.05), revealing that a non‑trivial minority of patients may be harmed despite a positive overall average effect. These findings underscore how distributional inference can guide more nuanced policy decisions.
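As a sense check on why these quantile summaries matter, the constructed toy vector below uses hypothetical numbers chosen only to match the summary statistics quoted above (it is not the studies' data), showing how a positive average can coexist with a negative lower quartile:

```python
import numpy as np

# Hypothetical ITE vector chosen to match the reported summaries:
# mean 0.08, median 0.12, 25th percentile -0.05 (NOT real study data).
taus = np.array([-0.40, -0.05, 0.12, 0.30, 0.43])

avg = taus.mean()                  # ~0.08: a modest average effect
med = np.quantile(taus, 0.50)     # 0.12: the median unit gains more
q25 = np.quantile(taus, 0.25)     # -0.05: a quarter may be harmed
share_benefit = (taus > 0).mean() # fraction of units with a gain
```

An average-effect analysis would report only `avg`; the quantile view exposes both the harmed minority and the better-than-average median unit.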

In summary, the paper delivers two theoretically sound, computationally feasible, and broadly applicable enhancements to finite‑sample randomization‑based inference for ITE distributions. By mitigating the conservatism of worst‑case assumptions while preserving exact validity, the methods enable researchers and policymakers to move beyond average effects and rigorously assess treatment effect heterogeneity across a variety of experimental and quasi‑experimental designs.

