Sparse group principal component analysis via double thresholding with application to multi-cellular programs
Multi-cellular programs (MCPs) are coordinated patterns of gene expression across interacting cell types that collectively drive complex biological processes such as tissue development and immune responses. While MCPs are typically estimated from high-dimensional gene expression data using methods like sparse principal component analysis or latent factor models, these approaches often suffer from high computational costs and limited statistical power. In this work, we propose Sparse Group Principal Component Analysis (SGPCA) to estimate MCPs by leveraging their inherent group and individual sparsity. We introduce an efficient double-thresholding algorithm based on power iteration. In each iteration, a group thresholding step first identifies relevant gene groups, followed by an individual thresholding step to select active cell types. This algorithm achieves a linear computational complexity of $O(np)$, making it highly efficient and scalable for large-scale genomic analyses. We establish theoretical guarantees for SGPCA, including statistical consistency and a convergence rate that surpasses competing methods. Through extensive simulations, we demonstrate that SGPCA achieves superior estimation accuracy and improved statistical power for signal detection. Furthermore, we apply SGPCA to a lupus study, discovering differentially expressed MCPs that distinguish lupus patients from normal subjects.
💡 Research Summary
This paper introduces Sparse Group Principal Component Analysis (SGPCA), a novel method designed to uncover multi‑cellular programs (MCPs) from high‑dimensional single‑cell RNA‑seq data while explicitly modeling hierarchical sparsity. The authors observe that genes naturally form groups (one group per gene, containing that gene's expression in each cell type) and that only a small subset of genes participates in any MCP, with further sparsity at the cell‑type level within each active gene. To capture this structure, SGPCA extends classical power iteration by inserting a double‑thresholding step in each iteration: (1) group‑wise soft‑thresholding based on the ℓ₂ norm to select active gene groups, and (2) entry‑wise soft‑thresholding based on the ℓ₁ norm to induce sparsity across cell types within the selected groups. Each iteration costs O(np) operations for the matrix‑vector multiplications plus O(p) for thresholding, which is far cheaper than existing Fantope‑based approaches that scale as O(p³).
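The double-thresholding iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the representation of groups as a list of index arrays, the use of absolute threshold values, the optional `v0` argument, and the sign-invariant convergence check are all assumptions made for the example.

```python
import numpy as np

def group_soft_threshold(v, groups, eta):
    """Shrink each group's l2 norm by eta; groups with norm <= eta are set to zero."""
    out = np.zeros_like(v, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(v[idx])
        if norm > eta:
            out[idx] = (1.0 - eta / norm) * v[idx]
    return out

def entry_soft_threshold(v, tau):
    """Shrink every entry toward zero by tau (the l1 soft-thresholding operator)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sgpca_leading_pc(X, groups, eta, tau, v0=None, n_iter=100, tol=1e-8, seed=0):
    """Hypothetical sketch: power iteration with group then entry thresholding."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center the columns
    if v0 is None:
        v0 = np.random.default_rng(seed).standard_normal(p)
    v = np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)
    for _ in range(n_iter):
        # Two matrix-vector products, O(np); the p x p covariance is never formed.
        w = Xc.T @ (Xc @ v) / n
        w = group_soft_threshold(w, groups, eta)  # step 1: select gene groups
        w = entry_soft_threshold(w, tau)          # step 2: select entries within groups
        norm = np.linalg.norm(w)
        if norm == 0.0:
            return w                              # thresholds removed everything
        w = w / norm
        # Compare up to sign, since v and -v span the same direction.
        change = min(np.linalg.norm(w - v), np.linalg.norm(w + v))
        v = w
        if change < tol:
            break
    return v
```

Applying the matrix as `Xc.T @ (Xc @ v)` instead of building the sample covariance first is what keeps the per-iteration cost at O(np). The thresholds here act on the unnormalized update, so sensible values of `eta` and `tau` depend on the scale of the data.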
Theoretical analysis under a spiked covariance model with hierarchical sparsity proves that the estimated loading vectors converge at rate $O\big(\sqrt{(s_g \log G + s_e \log p)/n}\big)$, where $s_g$ and $s_e$ denote the numbers of active groups and active entries per group, respectively. This rate improves upon previous bounds that ignore group structure. The authors also provide a rigorous initialization scheme based on diagonal thresholding and a stopping rule based on subspace distance.
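To make the initialization and stopping ideas concrete, here is a sketch of the classical diagonal-thresholding initializer (keep the highest-variance coordinates, then run ordinary PCA on that submatrix) together with a sign-invariant distance between two one-dimensional subspaces. The summary does not spell out the paper's exact scheme, so the function names, the choice of keeping a fixed number `k` of coordinates, and the sine-of-angle distance are assumptions for illustration only.

```python
import numpy as np

def diagonal_thresholding_init(X, k):
    """Hypothetical initializer: PCA restricted to the k highest-variance columns."""
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    variances = (Xc ** 2).sum(axis=0) / n          # diagonal of the sample covariance
    keep = np.sort(np.argsort(variances)[-k:])     # indices of the k largest variances
    S_sub = Xc[:, keep].T @ Xc[:, keep] / n        # k x k covariance submatrix
    _, eigvecs = np.linalg.eigh(S_sub)             # eigenvalues in ascending order
    v0 = np.zeros(p)
    v0[keep] = eigvecs[:, -1]                      # leading eigenvector, zero elsewhere
    return v0

def subspace_distance(u, v):
    """Sine of the angle between span(u) and span(v); unaffected by sign flips."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    c = abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    c = min(1.0, c)                                # guard against rounding above 1
    return float(np.sqrt(1.0 - c * c))
```

A stopping rule in this spirit would halt the iteration once `subspace_distance` between successive iterates falls below a small tolerance.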
For tuning the group and entry thresholds (η, τ), the paper proposes a stability‑selection procedure that repeatedly resamples the data, computes PCs, and selects thresholds maximizing alignment across resamples. Empirical results show that this data‑driven criterion yields lower type‑I and type‑II errors than conventional criteria such as explained variance or fixed sparsity levels.
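The resampling idea can be sketched as a small grid search. This is an illustration under stated assumptions, not the paper's procedure: the function name `select_thresholds`, the half-size subsampling without replacement, the mean pairwise absolute inner product as the alignment score, and the `fit(X, eta, tau)` callable (any estimator returning a loading vector) are all hypothetical choices.

```python
import itertools
import numpy as np

def select_thresholds(X, fit, etas, taus, n_resamples=20, frac=0.5, seed=0):
    """Pick (eta, tau) whose estimated loadings agree most across random subsamples.

    `fit(X_sub, eta, tau)` is a user-supplied estimator returning a length-p vector.
    Requires n_resamples >= 2 so that pairwise alignment is defined.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = max(2, int(frac * n))
    # Draw the subsamples once so every candidate is scored on the same data.
    subsamples = [rng.choice(n, size=m, replace=False) for _ in range(n_resamples)]
    pairs = np.triu_indices(n_resamples, k=1)
    best, best_score = None, -np.inf
    for eta, tau in itertools.product(etas, taus):
        V = []
        for rows in subsamples:
            v = np.asarray(fit(X[rows], eta, tau), dtype=float)
            norm = np.linalg.norm(v)
            V.append(v / norm if norm > 0 else v)  # all-zero estimates score 0
        V = np.array(V)
        # |<v_i, v_j>| is sign-invariant; average it over all distinct pairs.
        score = np.abs(V @ V.T)[pairs].mean()
        if score > best_score:
            best, best_score = (eta, tau), score
    return best, best_score
```

Scoring alignment alone rewards any setting that gives the same answer on every subsample, so in practice the candidate grid should exclude thresholds so large that they zero out the estimate.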
Extensive simulations varying signal‑to‑noise ratios, sparsity levels, and sample sizes demonstrate that SGPCA consistently outperforms standard sparse PCA, group‑lasso‑based PCA, and the recent convex relaxation method of Xiao and Xiao (2024) in terms of mean‑squared error, precision/recall, and false‑discovery rate.
The method is applied to a lupus scRNA‑seq dataset. SGPCA identifies several MCPs that differentiate lupus patients from healthy controls, highlighting gene‑cell‑type patterns consistent with known immunological pathways and revealing novel candidate genes.
Overall, SGPCA offers a computationally scalable (linear in n and p), statistically optimal, and biologically interpretable framework for MCP discovery. The authors release an R package (SGPCA) with interactive tuning and reproducible scripts, facilitating adoption by the genomics community. Future work may extend the approach to nonlinear manifolds, multi‑view integration, and online updating for real‑time clinical applications.