Clustering and Pruning in Causal Data Fusion

Clustering and Pruning in Causal Data Fusion
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.


💡 Research Summary

The paper addresses the computational challenges inherent in causal data fusion, where observational and experimental datasets are combined to identify causal effects that are not identifiable from any single source. While do‑calculus provides a complete theoretical tool for causal identification, practical algorithms such as Do‑search become infeasible as the number of variables and the complexity of the underlying directed acyclic graph (DAG) increase. To mitigate this, the authors propose two preprocessing operations—pruning and clustering—that reduce the size of the causal graph without compromising identifiability.

Pruning removes vertices that are provably irrelevant for the target causal effect p(y | do(x)). Extending earlier results that were limited to a single observational source, the authors derive sufficient conditions for pruning in the multi‑source setting. A vertex can be pruned if it is not a descendant of the outcome, does not appear in any input distribution, or is connected to the rest of the graph by a single edge. Under these conditions, they prove an “identification invariance” property: the causal effect is identifiable from the original graph G and the set of input distributions I if and only if it is identifiable from the pruned graph G′ and the correspondingly reduced input set I′.

Clustering aggregates a set of vertices into a single cluster node T, thereby coarsening the graph. Prior clustering approaches risk losing identifiability because the coarse representation may hide crucial structural details. The authors introduce the notion of transit clusters and provide sufficient conditions under which clustering preserves identifiability. Specifically, all vertices in a cluster must share the same parent and child sets, the cluster must have only unidirectional connections to the rest of the graph, and every input distribution must involve the entire cluster (i.e., the cluster is observed or intervened on as a whole). When these conditions hold, the clustered graph G′ yields the same identifiability status as the original graph, and any identifying functional derived on G′ can be directly “lifted” to G.

The theoretical contributions are validated through two empirical studies. First, a simulation experiment generates random DAGs with 20–30 variables and applies Do‑search with and without the proposed preprocessing. Pruning alone reduces average runtime by roughly 45 %, while adding clustering cuts runtime by more than 70 %, demonstrating substantial computational savings. Second, real‑world case studies from epidemiology (salt‑adding behavior → blood pressure) and social science (educational intervention → income) illustrate how the methods handle partially overlapping data sources. In both cases, the causal effect remains identifiable after pruning and clustering, and the resulting identifying expressions are simpler and more interpretable.

Overall, the paper makes three key contributions: (1) it extends pruning and clustering theory to the general multi‑source causal fusion problem, providing clear sufficient conditions; (2) it formalizes identification invariance as an equivalence relation between original and transformed graphs, enabling seamless integration of preprocessing with existing identification algorithms; and (3) it empirically shows that such preprocessing dramatically improves algorithmic efficiency without sacrificing correctness. Limitations include the fact that the presented conditions are sufficient but not necessary, leaving open the challenge of characterizing broader classes of admissible transformations. Future work may explore automated detection of prune‑able or cluster‑able substructures and extensions to settings with partially observed or latent variables.


Comments & Academic Discussion

Loading comments...

Leave a Comment