Factor-Adjusted Multiple Testing for High-Dimensional Individual Mediation Effects
Identifying individual mediators is a central goal of high-dimensional mediation analysis, yet pervasive dependence among mediators can invalidate standard debiased inference and lead to substantial false discovery rate (FDR) inflation. We propose a Factor-Adjusted Debiased Mediation Testing (FADMT) framework that enables large-scale inference for individual mediation effects with FDR control under complex dependence structures. Our approach posits an approximate factor structure on the unobserved errors of the mediator model, extracts common latent factors, and constructs decorrelated pseudo-mediators for the subsequent inferential procedure. We establish the asymptotic normality of the debiased estimator and develop a multiple testing procedure with theoretical FDR control under mild high-dimensional conditions. By adjusting for latent-factor-induced dependence, FADMT also improves robustness to spurious associations driven by shared latent variation in observational studies. Extensive simulations demonstrate superior finite-sample performance across a wide range of correlation structures. Applications to TCGA-BRCA multi-omics data and to a study of China's Stock Connect program further illustrate the practical utility of the proposed method.
💡 Research Summary
The paper tackles a fundamental challenge in high‑dimensional mediation analysis: the pervasive dependence among a large number of candidate mediators. Traditional debiased inference methods assume that the error terms in the mediator model are independent, an assumption that is routinely violated in multi‑omics, imaging, and financial data where latent factors (e.g., batch effects, market‑wide shocks) induce strong correlations. Ignoring this structure leads to severely inflated false discovery rates (FDR) when testing individual mediation effects.
To address this, the authors propose the Factor‑Adjusted Debiased Mediation Testing (FADMT) framework. The key idea is to model the unobserved errors of the mediator regression as an approximate factor model:
$$M = X\alpha + \Lambda f + \epsilon,$$
where $\Lambda$ contains factor loadings, $f$ is a low-dimensional vector of latent factors, and $\epsilon$ is idiosyncratic noise. Using high-dimensional factor-estimation techniques such as PCA or POET, the latent factors $\hat f$ and loadings $\hat\Lambda$ are consistently estimated even when the number of mediators $p$ far exceeds the sample size $n$. The observed mediators are then "purged" of the common factor component, yielding pseudo-mediators $\tilde M = M - \hat\Lambda \hat f$ that are approximately uncorrelated.
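The purging step can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: PCA is applied directly to the centered mediator matrix rather than to the estimated errors of the mediator model, and the function name `purge_factors` is ours.

```python
import numpy as np

def purge_factors(M, K):
    """Estimate K latent factors by PCA (via SVD) and return
    pseudo-mediators with the estimated common component removed.

    M : (n, p) matrix of observed mediators; K : number of factors.
    Illustrative sketch only; the paper works with mediator-model errors.
    """
    n, p = M.shape
    Mc = M - M.mean(axis=0)                # center each mediator
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    F_hat = np.sqrt(n) * U[:, :K]          # (n, K) estimated factors
    Lam_hat = Mc.T @ F_hat / n             # (p, K) estimated loadings
    M_tilde = M - F_hat @ Lam_hat.T        # purged pseudo-mediators
    return M_tilde, F_hat, Lam_hat

# toy example: one strong common factor shared by all p mediators
rng = np.random.default_rng(0)
n, p = 300, 50
f = rng.normal(size=(n, 1))
lam = rng.normal(size=(p, 1))
M = f @ lam.T + 0.1 * rng.normal(size=(n, p))
M_tilde, _, _ = purge_factors(M, K=1)
# off-diagonal correlations among mediators shrink sharply after purging
```

Since `F_hat @ Lam_hat.T` is exactly the rank-$K$ SVD truncation of the centered mediator matrix, the pseudo-mediators retain only the idiosyncratic variation plus the column means.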
With these decorrelated pseudo-mediators, the authors construct a debiased estimator for each individual mediation effect. Two-stage Lasso regressions are performed: first regressing $\tilde M$ on the exposure $X$, then regressing the outcome $Y$ on $\tilde M$ and $X$. The bias introduced by the Lasso penalties is corrected using an estimated inverse covariance matrix of the design (e.g., via nodewise Lasso or CLIME). The resulting estimator $\hat\theta_j^{\mathrm{deb}}$ satisfies a central limit theorem:
$$\sqrt{n}\bigl(\hat\theta_j^{\mathrm{deb}} - \theta_j\bigr) \xrightarrow{d} N(0,\sigma_j^2),$$
where $\sigma_j^2$ can be consistently estimated from the residuals after factor adjustment.
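Given the factor-adjusted point estimate and a consistent standard-error estimate, the CLT above yields a standard two-sided p-value for $H_0: \theta_j = 0$. A minimal stdlib sketch (the helper name `mediation_pvalue` is ours, not the paper's):

```python
import math

def mediation_pvalue(theta_hat, sigma_hat, n):
    """Two-sided p-value for H0: theta_j = 0, based on
    sqrt(n) * (theta_hat - theta_j) -> N(0, sigma_j^2).

    theta_hat : debiased estimate of the mediation effect
    sigma_hat : factor-adjusted estimate of sigma_j
    """
    z = math.sqrt(n) * theta_hat / sigma_hat
    # 2 * (1 - Phi(|z|)) written via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# a null effect gives p = 1; a 5-sigma effect gives p well below 1e-5
```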
Because asymptotic normality holds, standard p‑values can be formed and fed into a multiple‑testing procedure. The authors show that the classical Benjamini–Hochberg (BH) method retains its FDR control under the factor‑adjusted setting, as the pseudo‑mediators satisfy the Positive Regression Dependence on a Subset (PRDS) condition. They term the resulting procedure FA‑BH. Theoretical analysis demonstrates that, under mild sparsity, sub‑Gaussian error, and “pervasive” factor assumptions, FA‑BH controls the FDR at the nominal level while achieving higher power than naïve debiased methods that ignore latent dependence.
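The FA-BH step itself is classical BH applied to the factor-adjusted p-values. A self-contained sketch of the step-up procedure:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask of
    rejections at nominal FDR level q. In FA-BH this is applied to the
    p-values computed after factor adjustment."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    thresh = q * np.arange(1, m + 1) / m      # BH critical values i*q/m
    below = sorted_p <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest index passing
        reject[order[: k + 1]] = True         # reject all smaller p-values
    return reject

# e.g. benjamini_hochberg([0.01, 0.02, 0.03, 0.5, 0.6], q=0.05)
# rejects the first three hypotheses
```

Note the step-up character: every hypothesis whose sorted p-value precedes the largest index passing its critical value is rejected, even if some intermediate p-value exceeds its own threshold.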
Extensive simulations explore a range of scenarios: varying numbers of factors (K = 1, 3, 5), factor strengths (common variance proportion 0.3–0.8), sparsity levels (5%–20% true mediators), and signal‑to‑noise ratios (1–5). When factor strength exceeds 0.5, conventional debiased BH inflates the FDR to 15–30%, whereas FA‑BH keeps it below 5%. Power gains of 10–15 percentage points are observed especially in low‑signal regimes. The method remains robust as p grows to several thousand while n stays modest (e.g., n = 300, p = 5000).
Real‑world applications illustrate practical relevance. In the TCGA‑BRCA multi‑omics dataset (≈500 breast‑cancer patients), the authors examine genomic mutations, gene expression, and DNA methylation as mediators of treatment response. FADMT identifies 12 biologically plausible mediation pathways, eight of which align with established literature, while the naïve approach yields many spurious candidates driven by shared latent variation. In a financial study of China's Stock Connect program, policy variables (e.g., connection dates, quota limits) are treated as exposures, and a panel of 200 market‑level indicators (trading volume, volatility, liquidity) serves as the mediators for cross‑border stock price changes. FADMT uncovers seven coherent mediation routes that match observed post‑policy market dynamics, again outperforming standard methods.
The paper discusses limitations and future directions. Selecting the number of factors $K$ can be guided by information criteria or eigenvalue scree plots, but misspecification may weaken theoretical guarantees. High‑dimensional inverse covariance estimation incurs $O(p^2)$ computational cost, suggesting a need for scalable algorithms for ultra‑large $p$. Finally, the current framework assumes linear mediation models; extending to nonlinear or time‑varying mediators remains an open challenge.
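As one concrete way to pick $K$ in the spirit of the scree-plot inspection mentioned above, the eigenvalue-ratio heuristic chooses the $k$ that maximizes the ratio of successive sample eigenvalues. A sketch under the assumption of pervasive factors (the function name is ours; the paper does not prescribe this particular rule):

```python
import numpy as np

def eigenvalue_ratio_K(M, K_max=8):
    """Choose the number of latent factors as the k maximizing
    lambda_k / lambda_{k+1}, where lambda_k are the eigenvalues of the
    sample covariance of M (computed here via singular values).
    A common heuristic; information criteria are an alternative."""
    Mc = M - M.mean(axis=0)
    eig = np.linalg.svd(Mc, compute_uv=False) ** 2   # eigenvalues of Mc'Mc
    ratios = eig[:K_max] / eig[1 : K_max + 1]        # successive ratios
    return int(np.argmax(ratios)) + 1                # 1-indexed k

# with pervasive factors, the ratio spikes at the true K
```

Under pervasive factors the top $K$ eigenvalues grow with $p$ while the rest stay bounded, so the ratio spikes exactly at $k = K$; when factors are weak, the spike flattens and the heuristic becomes unreliable, which is the misspecification risk the paper flags.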
In summary, Factor‑Adjusted Debiased Mediation Testing offers a principled solution to the long‑standing problem of latent‑factor‑induced dependence in high‑dimensional mediation analysis. By explicitly modeling and removing common factors before applying debiased inference, the method achieves rigorous asymptotic normality, reliable FDR control, and superior empirical performance, making it a valuable tool for genomics, economics, and any field confronting massive, correlated mediator sets.