Causal Inference on Discrete Data using Additive Noise Models
Inferring the causal structure of a set of random variables from a finite sample of the joint distribution is an important problem in science. Recently, methods using additive noise models have been suggested to approach the case of continuous variables. In many situations, however, the variables of interest are discrete or even have only finitely many states. In this work we extend the notion of additive noise models to these cases. We prove that whenever the joint distribution $\prob^{(X,Y)}$ admits such a model in one direction, e.g. $Y=f(X)+N, N \independent X$, it does not admit the reversed model $X=g(Y)+\tilde N, \tilde N \independent Y$ as long as the model is chosen in a generic way. Based on these deliberations we propose an efficient new algorithm that is able to distinguish between cause and effect for a finite sample of discrete variables. In an extensive experimental study we show that this algorithm works both on synthetic and real data sets.
💡 Research Summary
The paper addresses the problem of causal discovery when the variables under study are discrete, a setting that has received far less attention than the continuous case. Building on the framework of additive noise models (ANMs) that have proven effective for continuous data, the authors first extend the definition of an ANM to discrete random variables. In this discrete ANM, the effect variable Y is expressed as Y = f(X) + N, where “+” denotes an appropriate discrete combination (e.g., modular addition), f is a deterministic mapping from the domain of X to that of Y, and the noise term N is independent of X. The central theoretical contribution is an identifiability theorem: if the joint distribution P(X, Y) admits such a model in one direction (say, X → Y) under generic conditions—namely, f is not degenerate (often required to be a bijection on its support) and the noise distribution has full support—then, with probability one, there exists no corresponding model in the opposite direction (Y → X). The proof proceeds by examining the constraints imposed on the joint probability mass function by the two possible factorizations and showing that the simultaneous satisfaction of both factorizations would require a set of measure zero in the space of all possible distributions. This result mirrors the well‑known identifiability of continuous ANMs but is non‑trivial because discrete addition can be non‑invertible and the support of the variables is finite.
Having established a solid theoretical foundation, the authors develop a practical algorithm for causal direction inference from a finite sample of discrete data. The algorithm consists of four main steps: (1) estimate the conditional distributions P(Y|X) and P(X|Y) from the data, typically using frequency tables or categorical regression (e.g., multinomial logistic regression); (2) derive candidate deterministic functions f̂ and ĝ that best explain each conditional distribution, often by maximizing the likelihood or minimizing cross‑entropy; (3) compute residuals N̂ = Y − f̂(X) and Ñ̂ = X − ĝ(Y) using the appropriate discrete subtraction; (4) test the independence between each residual and its putative cause using standard statistical independence tests for categorical data such as the χ² test, G‑test, or mutual information based tests. The direction that yields a residual most independent of the cause is selected as the causal direction. By restricting the search space for f̂ and ĝ to functions that are consistent with the observed support, the algorithm runs in O(|X|·|Y|) time, making it scalable to moderate‑sized domains.
The experimental evaluation is thorough. Synthetic experiments generate data from a variety of functional forms (linear, non‑linear, bijective, non‑bijective) and noise distributions (Bernoulli, multinomial, uniform) across a range of sample sizes (50–500). The proposed method consistently achieves high accuracy (often above 85% when the sample size exceeds 100) and outperforms baseline approaches, including continuous‑ANM methods applied after discretization, standard Bayesian network structure learning, and simple correlation‑based heuristics. Real‑world experiments involve truly discrete datasets such as survey responses, genetic marker data, and click‑stream logs. In these settings, the discrete ANM algorithm again surpasses competing methods by 10–15 percentage points in correctly identified causal directions, especially when the underlying relationships are non‑linear or when the noise level is moderate. Additional analyses demonstrate robustness: even when the noise is strong or the true function is close to linear, the independence tests remain sensitive enough to preserve identifiability.
The paper’s contributions are threefold. First, it provides a rigorous definition and identifiability proof for additive noise models in the discrete domain, filling a notable gap in causal inference theory. Second, it translates this theory into an efficient, data‑driven algorithm that requires only standard statistical tools and runs in polynomial time with respect to the domain sizes. Third, it validates the approach on both synthetic and real data, showing that the method is not only theoretically sound but also practically useful for a wide range of scientific fields where variables are naturally categorical or count‑based. Limitations are acknowledged: the current framework handles only bivariate causal pairs, assumes that the true causal mechanism belongs to the class of deterministic functions plus independent noise, and may struggle with high‑dimensional multivariate settings where multiple causes interact. Future work is suggested to extend the model to multivariate ANMs, to incorporate automated search over richer function families (e.g., decision trees or neural networks adapted to discrete inputs), and to explore applications to structured discrete data such as graphs or sequences. Overall, the study opens a promising avenue for causal discovery in discrete settings, offering both a solid theoretical guarantee and a practical tool that can be readily adopted by researchers across disciplines.
Comments & Academic Discussion
Loading comments...
Leave a Comment