Stable Graphical Model Estimation with Random Forests for Discrete, Continuous, and Mixed Variables
A conditional independence graph is a concise representation of pairwise conditional independence among many variables. Graphical Random Forests (GRaFo) are a novel method for estimating pairwise conditional independence relationships among mixed-type, i.e. continuous and discrete, variables. The number of edges is a tuning parameter in any graphical model estimator, and there is no obvious number that constitutes a good choice. Stability Selection helps to choose this parameter with respect to a bound on the expected number of false positives (error control). The performance of GRaFo is evaluated and compared with various other methods for p = 50, 100, and 200 possibly mixed-type variables, with sample size n = 100 (n = 500 for maximum likelihood). Furthermore, GRaFo is applied to data from the Swiss Health Survey in order to evaluate how well it can reproduce the interconnection of functional health components, personal, and environmental factors, as hypothesized by the World Health Organization’s International Classification of Functioning, Disability and Health (ICF). Finally, GRaFo is used to identify risk factors which may be associated with adverse neurodevelopment of children who suffer from trisomy 21 and underwent open-heart surgery. GRaFo performs well with mixed data, and thanks to Stability Selection, it provides an error control mechanism for false positive selection.
💡 Research Summary
The paper introduces Graphical Random Forests (GRaFo), a novel approach for estimating conditional independence graphs (CIGs) in high‑dimensional settings where variables may be continuous, discrete, or a mixture of both. Traditional graphical model estimators, such as LASSO‑based neighborhood selection, are primarily designed for Gaussian or purely discrete data and lack a natural way to handle mixed‑type variables. GRaFo fills this gap by leveraging the variable‑importance measures from Random Forests (RF) and Conditional Forests (cRF) to perform a series of nonlinear regressions, one for each variable against all others. Because importance scores for continuous and categorical responses are not directly comparable, the authors compute a local ranking for each regression and assign each undirected edge the worse (higher) rank of the two directed regressions that involve the edge. This conservative ranking scheme yields a list of candidate edges.
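The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `local_ranks` and `rank_edges`, the toy importance matrix, and the tie handling (ties broken by predictor order) are all assumptions made for illustration. The sketch takes a matrix in which row j holds the importance scores from the regression of variable j on all others, ranks the predictors within each row, and gives each undirected edge the worse (higher) of its two directed ranks.

```python
from itertools import combinations


def local_ranks(importances, response):
    """Rank the predictors of one regression; rank 1 is the most important.

    Ties are broken by predictor order, which is an assumption of this sketch.
    """
    predictors = [k for k in range(len(importances)) if k != response]
    ordered = sorted(predictors, key=lambda k: importances[response][k], reverse=True)
    return {k: position + 1 for position, k in enumerate(ordered)}


def rank_edges(importances):
    """Return undirected edges with the worse of their two directed ranks, best first."""
    p = len(importances)
    ranks = [local_ranks(importances, j) for j in range(p)]
    edges = [
        ((j, k), max(ranks[j][k], ranks[k][j]))  # conservative: keep the higher rank
        for j, k in combinations(range(p), 2)
    ]
    return sorted(edges, key=lambda item: item[1])


# Hypothetical importance scores; row 2 is on a different scale than rows 0 and 1,
# mimicking a response of a different type. Diagonal entries are ignored.
toy = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.3],
    [5.0, 40.0, 0.0],
]
ranked = rank_edges(toy)
print(ranked)
```

Because only within-row ranks are used, the different scale of row 2 does not affect the result. The edge between variables 1 and 2 ends up with rank 2 even though variable 1 is the top predictor of variable 2, since the reverse regression ranks it second.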
A key contribution is the integration of Stability Selection (Meinshausen & Bühlmann, 2010) to control the expected number of false positive edges (E