CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios
Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The code and datasets are available at https://github.com/huiyang-yi/CausalCompass.
💡 Research Summary
The paper addresses a critical gap in the evaluation of time‑series causal discovery (TSCD) methods: existing benchmarks largely assume that all underlying causal assumptions hold, which is rarely the case in real‑world applications. To systematically assess robustness under assumption violations, the authors introduce CausalCompass, an extensible benchmark suite that generates synthetic datasets with controlled violations of eight common modeling assumptions.
Two “vanilla” generative models serve as the baseline: a linear vector autoregressive (VAR) model and a nonlinear Lorenz‑96 system. Each baseline can be transformed by eight independent perturbations: (1) measurement error (additive Gaussian noise with varying variance), (2) non‑stationarity (time‑varying noise scale sampled from a Gaussian process), (3) latent confounders (randomly inserted latent variables with configurable probability), (4) Z‑score standardization across the temporal dimension, (5) min‑max normalization, (6) trend and seasonality components, (7) mixed‑distribution data, and (8) missing values. This design enables the creation of datasets that simultaneously breach multiple assumptions, mimicking the complexity of real data.
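As an illustration of this vanilla-plus-perturbation design, the sketch below simulates a linear VAR baseline and applies the measurement-error perturbation (additive Gaussian noise with a chosen variance). It is a minimal stdlib-only sketch: the function names, the VAR(1) restriction, and the chain graph are illustrative assumptions, not the CausalCompass API.

```python
import random

def simulate_var(coeffs, n_steps, noise_scale=1.0, seed=0):
    """Simulate a linear VAR(1) process x_t = A @ x_{t-1} + eps_t.

    `coeffs` is a d x d weight matrix A (list of lists); innovations
    are i.i.d. Gaussian with standard deviation `noise_scale`.
    """
    rng = random.Random(seed)
    d = len(coeffs)
    x = [0.0] * d
    series = []
    for _ in range(n_steps):
        # The comprehension reads the previous state before x is rebound.
        x = [sum(coeffs[i][j] * x[j] for j in range(d)) + rng.gauss(0.0, noise_scale)
             for i in range(d)]
        series.append(x)
    return series

def add_measurement_error(series, error_std, seed=1):
    """Perturbation (1): additive Gaussian measurement noise on every observation."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, error_std) for v in row] for row in series]

# A 3-variable causal chain x0 -> x1 -> x2, each variable with a self-loop.
A = [[0.5, 0.0, 0.0],
     [0.4, 0.5, 0.0],
     [0.0, 0.4, 0.5]]
clean = simulate_var(A, n_steps=500)
noisy = add_measurement_error(clean, error_std=0.5)
```

The other perturbations (non-stationary noise scale, min-max scaling, missing values, etc.) would slot in as further transforms of `clean` in the same style.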
The benchmark evaluates eleven representative TSCD algorithms spanning six methodological families: constraint‑based (PCMCI), noise‑based (VAR‑LiNGAM), score‑based (DYNOTEARS, NTS‑NOTEARS), topology‑based (TSCI), Granger‑causality‑based (LGC), and deep‑learning‑based (cMLP, cLSTM, CUTS, CUTS+). Table 1 in the paper summarizes which causal graph type (summary vs. window) and which assumptions each algorithm explicitly supports.
The experimental protocol includes exhaustive hyper‑parameter sweeps for each method (e.g., conditioning depth for PCMCI, regularization λ for NOTEARS variants, hidden‑layer size and learning rate for cMLP/cLSTM) together with a sensitivity analysis of these hyper‑parameters. Performance is measured using structural Hamming distance (SHD), AUROC, and AUPRC on both summary and window graphs. In total, more than 110,000 runs are performed across the linear and nonlinear baselines and all eight violation scenarios.
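The graph-comparison side of this protocol can be sketched in a few lines. Below is a minimal pure-Python SHD for binary adjacency matrices, assuming the common convention that a reversed edge counts as a single error (other conventions count it as two, and the paper's exact variant may differ):

```python
def shd(true_adj, est_adj):
    """Structural Hamming distance between two binary adjacency matrices:
    counts missing edges, extra edges, and reversed edges, with a reversal
    counted once. Self-loops (relevant for TSCD summary graphs) are
    compared on the diagonal."""
    d = len(true_adj)
    dist = 0
    for i in range(d):
        if true_adj[i][i] != est_adj[i][i]:
            dist += 1  # disagreement on a self-loop
        for j in range(i + 1, d):
            # Compare the ordered edge pair (i->j, j->i) between the graphs;
            # any mismatch (addition, deletion, or reversal) costs 1.
            if (true_adj[i][j], true_adj[j][i]) != (est_adj[i][j], est_adj[j][i]):
                dist += 1
    return dist

true = [[0, 1, 0],
        [0, 0, 1],
        [0, 0, 0]]      # chain x0 -> x1 -> x2
est  = [[0, 0, 0],
        [1, 0, 1],
        [0, 0, 0]]      # x0 -> x1 reversed, x1 -> x2 recovered
print(shd(true, est))   # -> 1
```

AUROC and AUPRC would instead be computed from continuous edge scores before thresholding, which is why they are reported alongside SHD.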
Key findings: (1) No single algorithm dominates across all eight violation settings, confirming that many TSCD methods are fragile when assumptions are broken. (2) Deep‑learning approaches (cMLP, cLSTM) achieve the highest average performance, especially under nonlinear dynamics, non‑stationarity, measurement error, and mixed‑data conditions, indicating superior expressive power and robustness. (3) NTS‑NOTEARS exhibits a dramatic performance drop on raw (“vanilla”) data but recovers strong results after Z‑score standardization, highlighting its reliance on variance‑sortability cues. (4) Classical methods such as PCMCI and DYNOTEARS perform adequately when assumptions hold (linear, stationary) but degrade sharply under non‑linear or non‑stationary violations. (5) Hyper‑parameter sensitivity analysis shows deep‑learning models are relatively stable across a range of learning rates and batch sizes, whereas NOTEARS‑type methods are highly sensitive to the regularization parameter λ and to data scaling.
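Finding (3) hinges on what Z-score standardization does to marginal variances. The sketch below is a generic illustration, not the paper's setup: in a linear additive-noise chain, variance grows along the causal order (the "variance-sortability" cue), and standardizing each variable along the temporal dimension flattens that cue.

```python
import random
import statistics

def zscore_columns(series):
    """Z-score standardization along the temporal dimension: each variable
    is rescaled to zero mean and unit variance (assumes nonconstant columns)."""
    out_cols = []
    for col in zip(*series):
        mu = statistics.fmean(col)
        sigma = statistics.pstdev(col)
        out_cols.append([(v - mu) / sigma for v in col])
    return [list(row) for row in zip(*out_cols)]

rng = random.Random(0)
# Causal chain x0 -> x1 -> x2 with unit-weight edges and unit-variance noise:
# marginal variances are roughly 1, 2, 3, so sorting by variance recovers
# the causal order -- until standardization erases the signal.
rows = []
for _ in range(5000):
    x0 = rng.gauss(0, 1)
    x1 = x0 + rng.gauss(0, 1)
    x2 = x1 + rng.gauss(0, 1)
    rows.append([x0, x1, x2])

raw_vars = [statistics.pvariance(col) for col in zip(*rows)]
std_vars = [statistics.pvariance(col) for col in zip(*zscore_columns(rows))]
# raw_vars increases along the chain; std_vars is ~1.0 for every variable.
```

Whether a given method benefits from or is hurt by removing this cue is an empirical question, which is exactly what the NTS-NOTEARS result above probes.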
The authors release CausalCompass as open‑source code and datasets, enabling the community to benchmark new TSCD algorithms under realistic, assumption‑violating conditions. They argue that robustness evaluation should become a standard component of TSCD research, and that preprocessing choices (standardization, normalization) must be reported explicitly because they can substantially affect outcomes. Future extensions could incorporate more sophisticated latent‑confounder structures, structural non‑stationarity (time‑varying causal graphs), and online/streaming causal inference scenarios. Overall, the work provides a much‑needed systematic framework for assessing the practical reliability of TSCD methods, paving the way for broader adoption of causal discovery in real‑world time‑series analysis.