Distinguishing cause from effect using observational data: methods and benchmarks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the “ground truth” causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10 % and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.


💡 Research Summary

The paper tackles one of the most fundamental problems in causal discovery: determining the direction of causality between two observed variables X and Y using only observational data. Under the simplifying assumptions of no hidden confounders, no selection bias, and no feedback loops, the authors focus on the bivariate case where a joint distribution P_{X,Y} is given and the goal is to infer whether X → Y or Y → X.

Two major families of methods are reviewed in depth. The first is the Additive Noise Model (ANM) originally proposed by Hoyer et al. (2009). ANM assumes that the effect can be expressed as Y = f(X) + N, where f is a (potentially nonlinear) deterministic function and the noise term N is statistically independent of the cause X. In practice, one fits a regression model to estimate f, computes residuals, and then tests for independence between the residuals and the putative cause. If independence holds in one direction but not the other, the independent direction is taken as the causal direction. The paper discusses several concrete implementations of ANM, including Gaussian‑process regression, kernel ridge regression, and polynomial regression for estimating f, and a variety of independence tests such as HSIC and distance‑based statistics. Importantly, the authors provide a rigorous consistency proof for the original ANM algorithm, showing that as the sample size tends to infinity the method converges to the correct causal direction under the model assumptions.
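The ANM decision rule described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact implementation: it uses scikit-learn's Gaussian-process regression (with a white-noise kernel, an assumption on my part) to estimate f in each direction, and a simple biased HSIC estimate with median-heuristic RBF bandwidths as the dependence score. The direction whose residuals are least dependent on the putative cause is returned.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def _rbf_gram(v):
    """RBF Gram matrix with median-heuristic bandwidth."""
    d2 = (v[:, None] - v[None, :]) ** 2
    sigma2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / sigma2)

def hsic(a, b):
    """Biased HSIC estimate: a simple nonparametric dependence score."""
    n = len(a)
    K, L = _rbf_gram(a), _rbf_gram(b)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def anm_direction(x, y):
    """Return 'X->Y' or 'Y->X' by comparing residual dependence."""
    scores = {}
    for name, (u, v) in {"X->Y": (x, y), "Y->X": (y, x)}.items():
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
        gp.fit(u[:, None], v)
        resid = v - gp.predict(u[:, None])
        scores[name] = hsic(u, resid)  # smaller = more independent
    return min(scores, key=scores.get)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.tanh(x) + 0.2 * rng.normal(size=200)  # additive noise, ground truth X->Y
print(anm_direction(x, y))
```

In the wrong direction the residuals are typically heteroscedastic in the regressor, which is what the independence test picks up.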

The second family is Information‑Geometric Causal Inference (IGCI), which is designed for the (near) deterministic case. IGCI exploits the postulated independence between the distribution of the cause and the slope of the causal function. By comparing the log‑density of the cause with the derivative of the function that maps cause to effect, IGCI computes a score Ĉ for each direction; the direction with the smaller score is inferred, so Ĉ_{X→Y} < Ĉ_{Y→X} indicates X → Y. The authors present several variants of the IGCI score, including normalized differences and bidirectional averaging, and discuss how to estimate the required quantities from finite samples.
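One standard finite-sample estimator of the IGCI score is the slope-based one: after rescaling both variables to [0, 1] (the uniform reference measure), sort by the putative cause and average the log absolute slopes between consecutive points. A minimal sketch under those assumptions:

```python
import numpy as np

def igci_slope_score(x, y):
    """Slope-based IGCI estimate of C_{X->Y} (uniform reference measure)."""
    x = (x - x.min()) / (x.max() - x.min())  # rescale to [0, 1]
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    dx, dy = np.diff(x[order]), np.diff(y[order])
    keep = (dx != 0) & (dy != 0)  # drop ties to avoid log(0) and division by zero
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

def igci_direction(x, y):
    """Infer the direction with the smaller score."""
    return "X->Y" if igci_slope_score(x, y) < igci_slope_score(y, x) else "Y->X"

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = x ** 3  # deterministic nonlinear relation, ground truth X->Y
print(igci_direction(x, y))
```

For this example the population scores can be computed in closed form: C_{X→Y} = E[log 3x²] = log 3 − 2 ≈ −0.90 and C_{Y→X} ≈ +0.90, so the forward score is indeed the smaller one.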

To evaluate these methods, the authors introduce the CauseEffectPairs benchmark. This dataset comprises 100 real‑world cause‑effect pairs drawn from 37 distinct domains (meteorology, biology, medicine, engineering, economics, etc.). For each pair, the authors assign a ground‑truth causal direction based on domain expertise and prior experimental evidence, carefully selecting pairs that are believed to be free of confounding and selection bias. In addition to the real data, synthetic datasets with known ground truth (linear and nonlinear relationships, varying noise levels) are generated to test robustness.
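A toy generator for such synthetic pairs might look as follows. This is a hypothetical illustration, not the paper's simulation protocol (which samples random functions and noise distributions); the function choices and parameter names here are my own.

```python
import numpy as np

def simulate_pair(n, nonlinear=True, noise_level=0.3, seed=0):
    """Generate a synthetic cause-effect pair with known direction X -> Y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                       # the cause
    f = np.sin if nonlinear else (lambda v: 0.8 * v)
    y = f(x) + noise_level * rng.normal(size=n)  # effect = f(cause) + noise
    return x, y

x, y = simulate_pair(500, nonlinear=True, noise_level=0.1)
print(x.shape, y.shape)
```

Sweeping `nonlinear` and `noise_level` over a grid gives a family of pairs on which accuracy can be measured against the known direction.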

Experimental results show that ANM‑based approaches achieve the highest overall performance on the real‑world benchmark, with an accuracy of 63 % ± 10 % and an area‑under‑the‑ROC curve (AUC) of 0.74 ± 0.05. Among the ANM variants, Gaussian‑process regression combined with the HSIC independence test yields the most stable results. IGCI performs competitively when the data are almost noise‑free but degrades substantially in the presence of moderate to high noise, leading to lower overall accuracy on the benchmark. The authors also report extensive ablation studies, demonstrating that the choice of regression model and independence test can affect performance by several percentage points.

The paper’s contributions are fourfold: (1) a comprehensive review of ANM and IGCI, including implementation details; (2) the release of the CauseEffectPairs benchmark as a public resource for future method comparison; (3) an extensive empirical evaluation on both synthetic and real data, highlighting strengths and weaknesses of each method; and (4) a theoretical consistency proof for the original ANM algorithm, strengthening its methodological foundation.

Limitations are acknowledged. The benchmark, while diverse, contains only 100 pairs, which limits statistical power for definitive conclusions. Moreover, the study assumes away confounding, selection bias, and feedback, conditions that are rarely satisfied in complex real‑world systems. The authors suggest future work on multivariate extensions, methods that can handle latent confounders, and the incorporation of cyclic causal structures. They also call for larger, more varied benchmark collections to enable robust statistical validation of causal discovery algorithms.

