High-dimensional Statistical Inference and Variable Selection Using Sufficient Dimension Association
Simultaneous variable selection and statistical inference is challenging in high-dimensional data analysis. Most existing post-selection inference methods require explicitly specified regression models, which are often linear, as well as sparsity in the regression model. The performance of such procedures can be poor under either misspecified nonlinear models or a violation of the sparsity assumption. In this paper, we propose a sufficient dimension association (SDA) technique that measures the association between each predictor and the response variable, conditioning on the other predictors, in the high-dimensional setting. Our proposed SDA method requires neither a specific form of regression model nor sparsity in the regression. Instead, our method assumes normalized or Gaussian-distributed predictors with a Markov blanket property. We propose an estimator for the SDA and establish its asymptotic properties. We construct three types of test statistics for the SDA and propose a multiple testing procedure to control the false discovery rate. Extensive simulation studies demonstrate the validity and advantages of our SDA method. Gene expression data from the Alzheimer's Disease Neuroimaging Initiative illustrate a real application.
💡 Research Summary
This paper introduces a novel framework called Sufficient Dimension Association (SDA) for simultaneous variable selection and statistical inference in high‑dimensional settings. Traditional post‑selection inference methods typically rely on explicitly specified regression models—most often linear—and assume sparsity of the regression coefficients. These assumptions break down when the underlying relationship is nonlinear or when sparsity does not hold, which is common in modern biomedical data such as genomics or neuroimaging. Moreover, existing sufficient dimension reduction (SDR) techniques, such as sliced inverse regression (SIR), require accurate estimation of the central subspace, a task that becomes unstable and computationally intensive as the number of predictors grows.
SDA circumvents these limitations by focusing on the conditional dependence structure among the predictors themselves rather than on a prespecified outcome model. The authors assume that the predictor vector \(X = (X_1,\dots,X_p)^\top\) follows a multivariate normal distribution with mean zero and precision matrix \(\Theta = \Sigma^{-1}\). Crucially, \(\Theta\) is assumed to be sparse, which implies that many pairs of predictors are conditionally independent given the rest. The dependence between the response \(Y\) and the predictors is captured through a Markov blanket \(A\): \(Y \perp\!\!\!\perp X_i \mid X_{-i}\) for all \(i \notin A\). The goal is to test, for each predictor \(X_i\), whether it belongs to this blanket.
To operationalize this, the authors express each predictor as a linear regression on the remaining variables:
\[X_i = \zeta_i^\top X_{-i} + Z_i,\]
where \(Z_i\) is the residual (conditionally independent of \(X_{-i}\)) and \(\zeta_i\) is derived from the sparse precision matrix. The residual \(Z_i\) plays the role of a "partialized" version of \(X_i\). For a collection of transformation functions \(\{g_h(\cdot)\}_{h=1}^H\) applied to the response, they define the sufficient dimension association measures \(\nu_{hi} = \operatorname{Cov}(Z_i, g_h(Y))\). Proposition 1 shows that \(i\) belongs to the Markov blanket if and only if at least one \(\nu_{hi}\) is non-zero; conversely, \(\nu_{hi}=0\) for all \(h\) implies exclusion from the blanket. This formulation does not require any explicit model for \(Y\) and accommodates nonlinear relationships through the choice of multiple \(g_h\). In the special case of a linear model with the single transformation \(g_1(Y)=Y\), SDA reduces to the familiar semipartial correlation.
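The partialized-residual construction can be sketched numerically. Below is a minimal version assuming independent Gaussian predictors, with a hand-rolled ISTA solver standing in for the paper's nodewise lasso step; the names `lasso_ista` and `sda_measures`, the penalty choice, and the simulated data are illustrative, not from the paper:

```python
import numpy as np

def lasso_ista(A, b, lam, n_iter=500):
    """Proximal-gradient (ISTA) lasso: min_w (1/2n)||b - Aw||^2 + lam*||w||_1."""
    n, p = A.shape
    w = np.zeros(p)
    step = n / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = w - step * (A.T @ (A @ w - b) / n)    # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

def sda_measures(X, y, i, transforms):
    """Estimate nu_hi = Cov(Z_i, g_h(Y)) for one predictor X_i."""
    n, p = X.shape
    X_rest = np.delete(X, i, axis=1)
    # Nodewise lasso: regress X_i on the remaining predictors
    zeta = lasso_ista(X_rest, X[:, i], lam=np.sqrt(np.log(p) / n))
    Z = X[:, i] - X_rest @ zeta                   # partialized residual
    # Sample covariance of the residual with each transformed response
    return np.array([np.cov(Z, g(y))[0, 1] for g in transforms])

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.standard_normal(200)
transforms = [lambda t: t, lambda t: t ** 2]      # g_1(Y) = Y, g_2(Y) = Y^2
nu_hat = sda_measures(X, y, 0, transforms)        # X_0 is in the Markov blanket
```

A nonzero estimate for either transform flags the predictor as a Markov-blanket candidate; in practice the standardized versions of these estimates feed the test statistics described below.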
The paper establishes the asymptotic properties of the SDA estimator. By fitting a high-dimensional LASSO regression to obtain \(\hat\zeta_i\) and the residuals \(\hat Z_i\), the authors construct sample covariances \(\hat\nu_{hi}\). They prove consistency and asymptotic normality of \(\hat\nu_{hi}\) under standard regularity conditions on the sparsity of \(\Theta\) and the growth rates of \(p\) and \(n\). Standard error estimates are derived from the asymptotic variance, enabling the construction of three distinct test statistics:
- Chi-square statistic – the sum of squared standardized \(\hat\nu_{hi}\) across all \(h\), which follows a \(\chi^2_H\) distribution under the null.
- Kolmogorov–Smirnov (KS) statistic – the maximum absolute deviation between the empirical distribution of \(\hat\nu_{hi}\) and the standard normal CDF, offering robustness to heavy-tailed errors.
- Cramér–von Mises (CvM) statistic – an integrated squared deviation, providing greater power when the departure from the null is spread across the distribution.
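The first of these combinations is the simplest to sketch: with standardized estimates \(t_h = \hat\nu_{hi}/\widehat{\mathrm{se}}_h\), the statistic \(T = \sum_h t_h^2\) is referred to a \(\chi^2_H\) distribution under the null. The helper name and the numbers below are illustrative; 7.815 is the standard \(\chi^2_3\) critical value at level 0.05:

```python
import numpy as np

def chi_square_sda_stat(nu_hat, se):
    """T = sum_h (nu_hat_h / se_h)^2, compared to a chi^2_H reference under H0."""
    t = np.asarray(nu_hat, dtype=float) / np.asarray(se, dtype=float)
    return float(np.sum(t ** 2))

# Illustrative numbers (not from the paper): H = 3 transformations
T = chi_square_sda_stat([0.02, 0.15, -0.01], [0.05, 0.05, 0.05])
CHI2_3_95 = 7.815                     # chi^2_3 upper 0.05 critical value
reject = T > CHI2_3_95                # reject H0: predictor outside the blanket
```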
Because each predictor yields \(H\) individual hypotheses, the authors adopt a multiple-testing framework to control the false discovery rate (FDR). They extend the Benjamini–Hochberg procedure to accommodate the dependence among the three test statistics and across predictors, guaranteeing that the overall FDR is bounded by a pre-specified level \(\alpha\).
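The plain Benjamini–Hochberg step-up rule that the paper's procedure builds on can be sketched as follows; this simple version omits the dependence adjustments the authors develop:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Plain BH step-up: boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare sorted p-values to the step-up thresholds alpha * k / m
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))    # largest rank passing its threshold
        reject[order[: k + 1]] = True             # reject that p-value and all smaller
    return reject

# Illustrative per-predictor p-values (e.g. from the chi-square tests)
mask = benjamini_hochberg([0.001, 0.03, 0.4, 0.002, 0.9], alpha=0.05)
```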
Extensive simulations evaluate SDA against several benchmarks: LASSO-based post-selection inference, high-dimensional partial correlation tests, and recent data-splitting FDR methods. Scenarios vary the underlying regression function (linear vs. nonlinear), signal-to-noise ratios, the sparsity of \(\Theta\), and the size of the true Markov blanket. Across all settings, SDA consistently achieves higher power while maintaining accurate FDR control, especially when the model is misspecified or the sparsity assumption on the regression coefficients fails.
The methodology is applied to a real Alzheimer’s Disease Neuroimaging Initiative (ADNI) gene‑expression dataset comprising 49,386 probes measured on 745 subjects. SDA identifies a set of genes whose conditional association with disease status remains significant after adjusting for all other genes. Many of these genes overlap with known Alzheimer’s pathways, while several novel candidates emerge that were missed by conventional LASSO or partial‑correlation approaches.
In summary, the paper contributes a flexible, model‑agnostic inference tool for high‑dimensional data. By leveraging the Gaussianity and sparsity of the predictor precision matrix, SDA isolates the conditional effect of each variable without imposing a regression model on the outcome. The three complementary test statistics and the FDR‑controlling multiple‑testing scheme make the approach both powerful and reliable. Potential extensions include relaxing the Gaussian predictor assumption, handling more complex Markov blanket structures, and scaling the algorithm to streaming or distributed data environments.