Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We consider variable selection in high-dimensional linear models where the number of covariates greatly exceeds the sample size. We introduce the new concept of partial faithfulness and use it to infer associations between the covariates and the response. Under partial faithfulness, we develop a simplified version of the PC algorithm (Spirtes et al., 2000), the PC-simple algorithm, which is computationally feasible even with thousands of covariates and provides consistent variable selection under conditions on the random design matrix that are of a different nature than coherence conditions for penalty-based approaches like the Lasso. Simulations and application to real data show that our method is competitive compared to penalty-based approaches. We provide an efficient implementation of the algorithm in the R-package pcalg.


💡 Research Summary

The paper addresses the challenging problem of variable selection in high‑dimensional linear regression where the number of covariates p far exceeds the sample size n. Traditional penalty‑based approaches such as the Lasso, Elastic Net, SCAD, or MCP rely on stringent conditions on the design matrix—mutual incoherence, restricted eigenvalue, or similar coherence assumptions—to guarantee consistent selection. In many modern applications (genomics, imaging, time‑series) these conditions are violated because covariates are highly correlated.

To overcome this limitation the authors introduce a novel probabilistic concept called partial faithfulness. Inspired by the notion of faithfulness in graphical models, partial faithfulness requires that whenever a covariate is uncorrelated with the response given some subset of the other covariates, its coefficient in the full linear model is zero. The condition concerns only covariate–response relations, so the set of active predictors can be recovered by testing a sequence of partial correlations, without demanding global faithfulness of the joint distribution of all variables.
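
In symbols (a paraphrase of the paper's definition, where ρ(Y, Xⱼ | 𝐗_S) denotes the partial correlation of the response Y and covariate Xⱼ given the covariates indexed by S):

```latex
\rho\!\left(Y, X_j \mid \mathbf{X}_S\right) = 0
  \ \text{for some}\ S \subseteq \{1,\dots,p\} \setminus \{j\}
\;\Longrightarrow\;
\rho\!\left(Y, X_j \mid \mathbf{X}_{\{1,\dots,p\} \setminus \{j\}}\right) = 0
```

Under the linear model the right-hand side is equivalent to β_j = 0, so a covariate that can be "screened out" by any conditioning set truly has a zero coefficient.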

Building on this idea, the authors develop a simplified version of the PC algorithm, named PC‑simple. The classic PC algorithm estimates a full graphical model among all variables, testing conditional independences over many conditioning subsets, which becomes computationally burdensome when p is large. PC‑simple targets only the covariate–response relations and reduces the search space in two stages:

  1. Screening – compute marginal correlations between each covariate and the response and retain those that are statistically significant. This yields a candidate set that is typically much smaller than p.
  2. Iterative pruning – for each variable in the candidate set, test whether the response is conditionally independent of that variable given subsets of the remaining candidates, with the size of the conditioning sets growing from one iteration to the next. As soon as one such test accepts independence, the variable is removed; the procedure stops once the conditioning sets can no longer grow within the shrinking candidate set.
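
The two stages can be sketched as follows. This is an illustrative Python implementation, not the pcalg code; the partial-correlation routine (via regression residuals), the `z_crit` threshold, and the `max_order` cap are choices made here for compactness:

```python
import itertools
import math

import numpy as np

def partial_corr(y, x, Z):
    """Sample partial correlation of y and x given the columns of Z."""
    if Z.shape[1] == 0:
        return float(np.corrcoef(y, x)[0, 1])
    # Regress both y and x on Z (with intercept) and correlate the residuals.
    Z1 = np.column_stack([np.ones(len(y)), Z])
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    return float(np.corrcoef(ry, rx)[0, 1])

def is_significant(r, n, order, z_crit):
    """Fisher z-transform test of H0: the (partial) correlation is zero."""
    r = max(min(r, 0.9999), -0.9999)              # guard atanh at the boundary
    z = math.sqrt(n - order - 3) * math.atanh(r)  # ~ N(0,1) under H0
    return abs(z) > z_crit

def pc_simple(X, y, z_crit=1.96, max_order=3):
    """PC-simple sketch: marginal screening, then pruning with growing sets."""
    n, p = X.shape
    # Stage 1 (screening): keep covariates with significant marginal correlation.
    active = [j for j in range(p)
              if is_significant(partial_corr(y, X[:, j], X[:, :0]), n, 0, z_crit)]
    # Stage 2 (pruning): partial correlations of order 1, 2, ... within the
    # shrinking candidate set; drop j as soon as one test accepts independence.
    order = 1
    while order <= min(max_order, len(active) - 1):
        for j in list(active):
            others = [k for k in active if k != j]
            for S in itertools.combinations(others, order):
                r = partial_corr(y, X[:, j], X[:, list(S)])
                if not is_significant(r, n, order, z_crit):
                    active.remove(j)
                    break
        order += 1
    return sorted(active)
```

On simulated data with a few strong true predictors among independent noise covariates, this sketch reliably retains the true predictors while the pruning stage removes most screening false positives.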

Conditional independence is assessed through sample partial correlations, tested after Fisher's z‑transform against a significance level α, which serves as the algorithm's single tuning parameter. Because the candidate set typically shrinks quickly after the screening stage, the number of tests remains manageable, making the procedure scalable to thousands of covariates.
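
A minimal numerical sketch of that test (assuming Fisher's z‑transform for a partial correlation of a given order; the 0.3 and 0.05 example values are chosen here for illustration):

```python
import math

def fisher_z_pvalue(r, n, order):
    """Two-sided p-value for H0: the partial correlation of the given order is 0."""
    # Fisher's z-transform; sqrt(n - |S| - 3) standardizes it to ~ N(0,1) under H0.
    z = math.sqrt(n - order - 3) * math.atanh(r)
    # Two-sided standard-normal tail probability via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# A marginal correlation (order 0) of 0.3 with n = 100 observations is clearly
# significant at usual levels, while 0.05 is not.
strong = fisher_z_pvalue(0.30, 100, 0)
weak = fisher_z_pvalue(0.05, 100, 0)
```

A covariate is kept at a given stage only while every such p-value falls below the chosen α.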

The authors provide a rigorous theoretical analysis showing that partial faithfulness holds, for instance, almost surely when the nonzero regression coefficients are generated from an absolutely continuous distribution, and that this assumption is of a fundamentally different nature than the coherence‑type conditions required for Lasso‑type methods. Moreover, they prove that PC‑simple achieves model selection consistency in regimes where p can grow almost exponentially with n: with probability tending to one it recovers exactly the set of true predictors as n → ∞.

Extensive simulations explore a range of scenarios: varying sparsity levels, signal‑to‑noise ratios, and correlation structures (independent, AR(1), block‑wise). Across these settings, PC‑simple consistently yields lower false‑positive rates while maintaining comparable or higher true‑positive rates than Lasso, SCAD, and MCP, especially when covariates are strongly correlated. Computationally, PC‑simple matches Lasso for moderate p (≈ 500) and outperforms it dramatically for larger p (≥ 2000), completing in seconds where Lasso‑based solvers require minutes.

The method is also applied to real high‑dimensional data sets: a genomics study with several thousand gene expression measurements and a medical imaging data set with thousands of voxel‑derived features. In both cases PC‑simple selects a parsimonious subset of variables that achieves predictive performance on par with or superior to penalty‑based models, and the selected variables exhibit clear biological or clinical relevance.

Finally, the authors release an efficient implementation within the R package pcalg, including parallelization options and detailed documentation, facilitating immediate adoption by practitioners.

In summary, the paper contributes a new probabilistic framework (partial faithfulness) and a practical algorithm (PC‑simple) that together provide a theoretically sound, computationally efficient, and empirically competitive alternative to existing high‑dimensional variable selection techniques. This work broadens the toolbox for statisticians and data scientists dealing with ultra‑high‑dimensional data where traditional coherence assumptions fail.

