Group lasso based selection for high-dimensional mediation analysis


Mediation analysis aims to identify and estimate the effect of an exposure on an outcome that is mediated through one or more intermediate variables. In the presence of multiple intermediate variables, two pertinent methodological questions arise: estimating mediated effects when mediators are correlated, and performing high-dimensional mediation analyses when the number of mediators exceeds the sample size. This paper presents a two-step procedure for high-dimensional mediation analyses. The first step selects a reduced number of candidate mediators using an ad-hoc lasso penalty. The second step applies a procedure we previously developed to estimate the mediated effects, accounting for the correlation structure among the retained candidate mediators. We compare the performance of the proposed two-step procedure with state-of-the-art methods using simulated data. Additionally, we demonstrate its practical application by estimating the causal role of DNA methylation in the pathway between smoking and rheumatoid arthritis using real data.


Research Summary

This paper tackles the challenging problem of mediation analysis when the number of potential mediators far exceeds the sample size, a situation common in modern genomics and epigenomics studies. The authors propose a two‑step procedure, named MAHI (Mediation Analysis with High‑dimensional data), that integrates a group‑lasso based variable‑selection stage with a low‑dimensional quasi‑Bayesian inference stage.

Step 1 – Group‑lasso selection with stability screening
The authors formulate a joint loss function that combines the prediction errors of the mediator models (either Gaussian or logistic) and the outcome model (also Gaussian or logistic). A weight $w_Y > 0$ scales the contribution of the outcome loss, allowing the procedure to accommodate mediators whose exposure‑mediator coefficients $\alpha_{1k}$ and mediator‑outcome coefficients $\beta_k$ lie on different scales. The key penalty is a group‑lasso term $\lambda \sum_{k} \sqrt{\alpha_{1k}^2 + \beta_k^2}$, which forces the pair $(\alpha_{1k}, \beta_k)$ to be set to zero together. Consequently, a mediator is retained only if it has the potential to contribute to a non‑zero indirect effect.
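
The joint zeroing of each pair can be illustrated with the proximal operator of the group‑lasso penalty, the block soft‑thresholding step that a proximal‑gradient solver applies after each gradient update. The following is a minimal sketch, not the authors' implementation: the function name `group_soft_threshold` and the `threshold` argument (standing for step size times $\lambda$) are illustrative assumptions.

```python
import math

def group_soft_threshold(alpha, beta, threshold):
    """Proximal operator of threshold * sqrt(alpha**2 + beta**2) for one pair.

    `threshold` is assumed to equal (step size) * lambda; the name is
    hypothetical and not taken from the paper.
    """
    norm = math.hypot(alpha, beta)
    if norm <= threshold:
        # The whole pair is dropped: both coefficients become zero together.
        return 0.0, 0.0
    # Otherwise both coefficients are shrunk by the same factor.
    scale = 1.0 - threshold / norm
    return scale * alpha, scale * beta

# A pair with a small joint norm is removed entirely ...
weak_pair = group_soft_threshold(0.3, 0.4, threshold=1.0)
# ... while a pair with a large joint norm is shrunk but kept.
strong_pair = group_soft_threshold(3.0, 4.0, threshold=1.0)
```

Because the threshold acts on the joint norm of the pair, a mediator with a large $\alpha_{1k}$ but a small $\beta_k$ can still survive this step, in line with the summary's wording that a retained mediator has the potential to carry an indirect effect.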

Optimization is carried out with a proximal‑gradient algorithm. Rather than fine‑tuning $\lambda$ to achieve a specific sparsity level, the authors choose a relatively small $\lambda$ so that as many true mediators as possible survive, while still guaranteeing that the final number of selected variables, $K_{\max}$, is below the sample size. To mitigate the well‑known instability of lasso solutions, the authors embed a stability‑selection scheme: they draw $N_{\text{boot}}$ bootstrap samples, run the group lasso on each, and record how often each mediator is selected. Mediators are then ranked by selection frequency, and the top $K_{\max}$ are kept for the second stage. A grid search over $w_Y$ further ensures that mediators with a wide range of $\alpha$ and $\beta$ magnitudes are captured.
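
The bootstrap ranking described here can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: `select_fn` is a hypothetical callback that would run the group lasso on the resampled rows and return the indices of the mediators it keeps, and `toy_selector` is a stand‑in used only to make the example runnable.

```python
import random
from collections import Counter

def stability_ranking(n_samples, select_fn, n_boot=100, k_max=10, seed=0):
    """Rank mediators by how often they are selected across bootstrap samples.

    select_fn(rows) is a hypothetical callback: given a list of resampled row
    indices, it returns the indices of the mediators selected on that sample.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Bootstrap: draw n_samples row indices with replacement.
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        counts.update(set(select_fn(rows)))
    freq = {k: c / n_boot for k, c in counts.items()}
    # Sort by decreasing selection frequency, breaking ties by mediator index.
    ranked = sorted(freq, key=lambda k: (-freq[k], k))
    return ranked[:k_max], freq

def toy_selector(rows):
    """Stand-in for a group-lasso fit: always keeps 0 and 2, sometimes 5."""
    selected = [0, 2]
    if rows[0] % 2 == 0:
        selected.append(5)
    return selected

top, freq = stability_ranking(n_samples=50, select_fn=toy_selector,
                              n_boot=200, k_max=2)
```

In this toy run, mediators 0 and 2 are selected on every bootstrap sample and fill the two available slots, while the unstable mediator 5 is ranked below them and dropped.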

Step 2 – Low‑dimensional inference of direct and indirect effects
With a reduced set of candidate mediators (now fewer than the sample size $n$), the authors apply a previously developed low‑dimensional multiple‑mediator algorithm (Zhang et al., 2020). This method fits parametric outcome and mediator models, draws $J$ samples (typically 1000) of the estimated coefficient vector from its asymptotic multivariate normal distribution, and for each draw simulates counterfactual mediator values and outcomes. The average difference in potential outcomes when the mediator is set to its “treated” versus “control” level yields an estimate of the average indirect effect $\delta_k$. Empirical distributions of $\delta_k$ provide point estimates, p‑values, and confidence intervals. Multiple‑testing correction (e.g., FDR) is applied across the retained mediators, and those whose intervals exclude zero are declared significant.
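
A heavily simplified, single‑mediator version of this simulation idea is sketched below. It is not the algorithm of the cited work: it assumes linear Gaussian mediator and outcome models with no exposure‑mediator interaction, draws the two coefficients independently instead of from a joint multivariate normal, and uses hypothetical names and input values throughout.

```python
import random

def quasi_bayes_indirect_effect(alpha_hat, alpha_se, beta_hat, beta_se,
                                sigma_m=1.0, n_draws=1000, n_units=200, seed=0):
    """Simulation-based estimate of one mediator's average indirect effect.

    Assumptions (not from the paper): linear models, independent normal draws
    for the exposure-mediator (alpha) and mediator-outcome (beta) coefficients.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_draws):
        a = rng.gauss(alpha_hat, alpha_se)
        b = rng.gauss(beta_hat, beta_se)
        total = 0.0
        for _ in range(n_units):
            # Counterfactual mediator values under exposure = 1 and exposure = 0.
            m_treated = a + rng.gauss(0.0, sigma_m)
            m_control = rng.gauss(0.0, sigma_m)
            # Intercepts and the direct effect cancel in a linear outcome model.
            total += b * m_treated - b * m_control
        effects.append(total / n_units)
    effects.sort()
    estimate = sum(effects) / n_draws
    ci = (effects[int(0.025 * n_draws)], effects[int(0.975 * n_draws) - 1])
    # Two-sided empirical p-value from the sign of the simulated effects.
    n_neg = sum(e <= 0 for e in effects)
    n_pos = sum(e >= 0 for e in effects)
    p_value = min(1.0, 2 * min(n_neg, n_pos) / n_draws)
    return estimate, ci, p_value

# Hypothetical coefficient estimates and standard errors, for illustration only.
estimate, ci, p_value = quasi_bayes_indirect_effect(0.5, 0.05, 0.4, 0.05)
```

Under these linear assumptions the simulated effect centers on the product of the two coefficients (here about 0.2), which is why a mediator needs both a non‑zero $\alpha_{1k}$ and a non‑zero $\beta_k$ to show an indirect effect.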

Simulation study
The authors conduct two extensive simulation experiments (100 replicates each) with $n = 100$ and $K = 500$. The first scenario assumes independent mediators; the second introduces realistic correlation structures among mediators. Both continuous and binary outcomes are examined. Competing methods include HDMAX2, HDMAX2‑FDR, HIMA, and a Bayesian sparse linear mixed model. Performance metrics are true‑positive rate (TPR), false‑positive rate (FPR), and mean‑squared error (MSE) of indirect‑effect estimates. MAHI consistently achieves the highest TPR (≈0.85–0.92) while keeping FPR below 0.05, and reduces MSE by roughly 30% relative to the best alternatives, especially when mediators are correlated.
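
For reference, the three metrics can be computed for a single replicate as in the sketch below. It uses the standard definitions (TPR as the share of true mediators that are selected, FPR as the share of non‑mediators that are selected); the paper's exact conventions may differ, and the function names and toy inputs are assumptions.

```python
def selection_metrics(true_mediators, selected, n_candidates):
    """Return (TPR, FPR) for one simulated data set."""
    true_set, selected_set = set(true_mediators), set(selected)
    true_positives = len(true_set & selected_set)
    false_positives = len(selected_set - true_set)
    n_null = n_candidates - len(true_set)
    tpr = true_positives / len(true_set) if true_set else float("nan")
    fpr = false_positives / n_null if n_null else float("nan")
    return tpr, fpr

def indirect_effect_mse(true_effects, estimated_effects):
    """Mean-squared error between true and estimated indirect effects."""
    squared_errors = [(t - e) ** 2 for t, e in zip(true_effects, estimated_effects)]
    return sum(squared_errors) / len(squared_errors)

# Toy replicate: 4 true mediators among K = 500 candidates; 3 are recovered
# and 1 non-mediator is selected by mistake.
tpr, fpr = selection_metrics([0, 1, 2, 3], [0, 1, 2, 10], n_candidates=500)
mse = indirect_effect_mse([0.2, 0.0], [0.1, 0.1])
```

In a study like the one described, these per‑replicate values would be averaged over the 100 replicates.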

Real‑data application
The method is applied to a French cohort investigating the causal pathway from smoking (exposure) to rheumatoid arthritis (outcome) mediated by DNA methylation at CpG sites. After a preliminary filter, 5,000 CpGs are fed into MAHI. Step 1 selects 150 candidates; Step 2 identifies 12 CpGs with statistically significant indirect effects after FDR correction. Four of these CpGs map to genes previously implicated in inflammation, while the remaining eight suggest novel epigenetic mechanisms linking smoking to autoimmune disease. The estimated indirect effects are modest (standardized 0.02–0.05) but biologically plausible.

Discussion and limitations
MAHI’s main strengths are: (1) simultaneous consideration of exposure‑mediator and mediator‑outcome coefficients via a group‑lasso penalty, (2) robustness to sampling variability through stability selection, and (3) accurate inference of indirect effects using a quasi‑Bayesian simulation framework. The approach works well for both continuous and binary outcomes and accommodates correlated mediators. Limitations include the need for user‑specified grids for $w_Y$ and $\lambda$, potential saturation of $K_{\max}$ in ultra‑high‑dimensional settings (requiring an additional pre‑screening step), and the current focus on linear (or logistic) relationships. Extensions to nonlinear mediators, multiple exposures, longitudinal data, or time‑varying mediation remain open research avenues.

Conclusion
The proposed MAHI framework provides a practical, statistically rigorous solution for high‑dimensional mediation analysis. By coupling group‑lasso variable selection with a robust low‑dimensional inference algorithm, it improves both the identification of true mediators and the precision of indirect‑effect estimates. Simulation results and an epigenetic case study demonstrate its superiority over existing methods, suggesting broad applicability in genomics, epigenomics, and other fields where high‑dimensional causal pathways are of interest.

