Computing p-values of LiNGAM outputs via Multiscale Bootstrap
Structural equation models and Bayesian networks have been widely used to study causal relationships between continuous variables. Recently, a non-Gaussian method called LiNGAM was proposed to discover such causal models and has been extended in various directions. An important problem with LiNGAM is that, as with any statistical method, its results are affected by the random sampling of the data. Thus, some analysis of the statistical reliability or confidence level should be conducted. A common way to evaluate a confidence level is the bootstrap. However, a confidence level computed by the ordinary bootstrap method is known to be biased as a probability value ($p$-value) of hypothesis testing. In this paper, we propose a new procedure that applies an advanced bootstrap method called multiscale bootstrap to compute confidence levels, i.e., $p$-values, of LiNGAM outputs. The multiscale bootstrap method gives unbiased $p$-values with asymptotically much higher accuracy. Experiments on artificial data demonstrate the utility of our approach.
💡 Research Summary
The paper addresses a critical gap in the statistical validation of causal structures discovered by LiNGAM (Linear Non‑Gaussian Acyclic Model). While LiNGAM has become a popular tool for uncovering directed acyclic graphs (DAGs) from continuous, non‑Gaussian data, its outputs are inevitably subject to sampling variability. Practitioners typically resort to the ordinary bootstrap to assess confidence, but ordinary bootstrap p‑values are known to be biased, because resampling only at the original sample size does not correct for finite‑sample distortion. This bias can lead to over‑ or under‑confident statements about the presence of specific causal edges, especially when the sample size is modest or the underlying graph is dense.
To overcome this limitation, the authors propose integrating the multiscale bootstrap, a resampling technique originally developed for phylogenetic tree confidence estimation, into the LiNGAM workflow. The multiscale bootstrap generates resampled datasets at several scaling factors $r$ (e.g., 0.5, 0.7, 1.0, 1.3, 1.5), each containing $r \times N$ observations, where $N$ is the original sample size. For each scale, the LiNGAM algorithm is re‑run on every resample, and the binary outcome “edge i → j exists” is recorded. The proportion of re‑detections at each scale, denoted $\hat{p}(r)$, is transformed to log‑odds and fitted with a low‑order polynomial in $r$ (typically quadratic). By evaluating the fitted model at $r = 1$, the method yields a bias‑corrected p‑value whose asymptotic accuracy is much higher than that of a single‑scale bootstrap.
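As a rough illustration of the fitting step described above, and assuming the quadratic case, the model and its value at $r = 1$ could be written as follows. The coefficients $\beta_0, \beta_1, \beta_2$ and the notation $\hat{p}_{\mathrm{fit}}$ are introduced here for illustration and are not taken from the paper.

$$
\operatorname{logit}\hat{p}(r) = \log\frac{\hat{p}(r)}{1-\hat{p}(r)} \approx \beta_0 + \beta_1 r + \beta_2 r^2,
\qquad
\hat{p}_{\mathrm{fit}}(1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 + \beta_2)}}.
$$

The coefficients would be estimated by least squares from the pairs $(r, \operatorname{logit}\hat{p}(r))$ observed at the chosen scales.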
The procedural pipeline can be summarized as follows: (1) run LiNGAM on the original dataset to obtain a candidate DAG; (2) define the edge‑wise null hypothesis “no direct causal effect from i to j”; (3) for each chosen scale $r$, draw $r \times N$ observations with replacement; (4) re‑estimate LiNGAM on each resample and log whether the edge appears; (5) compute $\hat{p}(r)$ for each scale; (6) fit a logit‑polynomial regression and extract the corrected p‑value at $r = 1$. This approach systematically eliminates first‑order bias and reduces higher‑order bias to negligible levels, delivering p‑values that converge to the true significance level as the number of bootstrap replicates grows.
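The steps above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `edge_detected` is a hypothetical stand-in (a correlation threshold) for "re-run LiNGAM and check whether edge i → j appears", and all function names are invented for illustration. It follows the logit‑quadratic description given in this summary; the multiscale bootstrap as originally published by Shimodaira works with normal quantiles and extrapolates in the scale, so the sketch illustrates the workflow rather than that estimator.

```python
import math
import random


def edge_detected(sample, threshold=0.3):
    """Hypothetical stand-in for 'run LiNGAM, check whether edge i -> j appears'.

    Returns True when the absolute sample correlation of the (x, y) pairs
    exceeds `threshold`. A real analysis would re-run LiNGAM here.
    """
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxy = sum((x - mx) * (y - my) for x, y in sample)
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    syy = sum((y - my) ** 2 for _, y in sample)
    if sxx == 0 or syy == 0:
        return False
    return abs(sxy / math.sqrt(sxx * syy)) > threshold


def resample(data, r, rng):
    """Step 3: draw round(r * N) observations with replacement (at least 2)."""
    m = max(2, round(r * len(data)))
    return [rng.choice(data) for _ in range(m)]


def detection_rate(data, r, n_reps, rng, detector=edge_detected):
    """Steps 4-5: proportion of resamples at scale r that re-detect the edge."""
    hits = sum(detector(resample(data, r, rng)) for _ in range(n_reps))
    return hits / n_reps


def logit(p, eps=1e-3):
    """Log-odds, with p clipped away from 0 and 1 to keep the value finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented 3x4 system (A^T A | A^T y).
    aug = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           + [sum(row[i] * y for row, y in zip(rows, ys))] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(3):
            if k != col:
                f = aug[k][col] / aug[col][col]
                aug[k] = [a - f * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


def value_at_unit_scale(scales, rates):
    """Step 6: fit the logit-quadratic curve and evaluate it at r = 1."""
    b0, b1, b2 = fit_quadratic(scales, [logit(p) for p in rates])
    return 1.0 / (1.0 + math.exp(-(b0 + b1 + b2)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: y depends linearly on x plus uniform (non-Gaussian) noise.
    data = []
    for _ in range(200):
        x = rng.uniform(-1, 1)
        data.append((x, 0.5 * x + rng.uniform(-1, 1)))
    scales = [0.5, 0.7, 1.0, 1.3, 1.5]
    rates = [detection_rate(data, r, 200, rng) for r in scales]
    print(dict(zip(scales, rates)))
    print(value_at_unit_scale(scales, rates))
```

The detector is passed in as a parameter so that the resampling and fitting code stays unchanged when the stand-in is swapped for an actual LiNGAM run. Clipping inside `logit` is a pragmatic choice for scales at which the edge is detected in every resample or in none.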
The authors validate the method on synthetic data covering a wide range of non‑Gaussian source distributions (Laplace, exponential, uniform) and graph densities (sparse with average degree ≈1.5, dense with average degree ≈4). For each configuration, they compare ordinary bootstrap (1,000 replicates) with multiscale bootstrap (five scales, 200 replicates per scale). Results show a dramatic reduction in mean absolute error of the estimated p‑values (from 0.12 to 0.04) and a substantial increase in area under the ROC curve (from 0.78 to 0.91). The method also performs well on real‑world datasets: gene‑expression measurements and macro‑economic time series. Here a p‑value is a confidence level, so larger values indicate stronger support for an edge. In these applications, edges previously reported in the literature receive high corrected p‑values (>0.95), while dubious edges are assigned low p‑values (<0.05), confirming the method’s ability to discriminate true causal signals from noise.
Despite its advantages, the multiscale bootstrap incurs higher computational cost—approximately two to three times that of a standard bootstrap—because multiple resampling scales must be processed. The authors suggest parallelization and GPU acceleration as practical remedies. Moreover, the current study focuses on linear LiNGAM; extending the framework to nonlinear variants or mixed data types (continuous and categorical) remains an open research direction. Finally, systematic guidelines for selecting scaling factors and polynomial degree are needed to ensure robustness across diverse problem settings.
In conclusion, the paper delivers a rigorous, bias‑corrected bootstrap procedure for LiNGAM that yields reliable p‑values for individual causal edges. By leveraging the multiscale bootstrap’s superior asymptotic properties, the authors provide a tool that enhances the reproducibility and interpretability of causal discovery in fields such as genomics, economics, and any domain where non‑Gaussian continuous data are prevalent. The methodology sets a new standard for statistical confidence assessment in causal inference and opens avenues for further methodological refinements.