Sparse Bayesian Partially Identified Models for Sequence Count Data

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

In genomics, differential abundance and expression analyses are complicated by the compositional nature of sequence count data, which reflect only relative, not absolute, abundances or expression levels. Many existing methods attempt to address this limitation through data normalizations, but we have shown that such approaches imply strong, often biologically implausible assumptions about total microbial load or total gene expression. Even modest violations of these assumptions can inflate Type I and Type II error rates to over 70%. Sparse estimators have been proposed as an alternative, leveraging the assumption that only a small subset of taxa (or genes) change between conditions. However, we show that current sparse methods suffer from similar pathologies because they treat sparsity assumptions as fixed and ignore the uncertainty inherent in these assumptions. We introduce a sparse Bayesian Partially Identified Model (PIM) that addresses this limitation by explicitly modeling uncertainty in sparsity assumptions. Our method extends the Scale-Reliant Inference (SRI) framework to the sparse setting, providing a principled approach to differential analysis under scale uncertainty. We establish theoretical consistency of the proposed estimator and, through extensive simulations and real data analyses, demonstrate substantial reductions in both Type I and Type II errors compared to existing methods.

💡 Research Summary

The paper addresses a fundamental challenge in the analysis of sequence count data, such as 16S rRNA microbiome surveys and RNA‑seq gene expression profiling: the data are compositional, providing only relative abundances and lacking direct information about absolute scale (total microbial load or total mRNA content). Traditional normalization techniques (e.g., Total Sum Scaling, centered log‑ratio transformations) implicitly impose strong, often biologically implausible assumptions about that unobserved total. Prior work by the authors has shown that even modest violations of these assumptions can inflate both Type I and Type II error rates to over 70%, undermining the reliability of differential abundance or expression studies.
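To illustrate why compositional data carry no scale information, here is a minimal Python sketch of the two normalizations named above. The function names, the pseudocount, and the toy abundance matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tss(counts):
    """Total Sum Scaling: divide each sample (column) by its column total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True)

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log counts minus the per-sample mean log count.

    The pseudocount (an assumed value) avoids log(0) for zero counts.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=0, keepdims=True)

# Hypothetical absolute abundances: rows are taxa, columns are samples.
# Sample 2 carries exactly twice the load of sample 1 for every taxon.
absolute = np.array([[10.0, 20.0],
                     [30.0, 60.0],
                     [60.0, 120.0]])

proportions = tss(absolute)

# Both samples have identical proportions, so the doubling of the
# total load is invisible once the data are made relative.
same_composition = np.allclose(proportions[:, 0], proportions[:, 1])
```

In this toy case `same_composition` is `True`: any analysis that sees only the proportions cannot distinguish "every taxon doubled" from "nothing changed" without an extra assumption about scale.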

To overcome this, the authors build on the Scale‑Reliant Inference (SRI) framework, which explicitly acknowledges the partial identifiability of the underlying absolute abundances. In SRI, the observed count matrix Y is modeled as a noisy observation of a latent absolute abundance matrix W, which can be factorized into a compositional component W∥ (columns lie on the simplex) and a scale component W⊥ (sample‑specific total loads). The key insight is that the log‑fold change (LFC) vector θ can be expressed as a sum of a compositional part θ∥ and a uniform shift θ⊥·1_D, where θ⊥ captures the unknown scale shift. Because Y provides no information about θ⊥, it remains partially identified; SRI resolves this by placing a prior on θ⊥ and propagating its uncertainty through a Bayesian partially identified model (PIM) implemented via Scale Simulation Random Variables (SSRVs).
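The decomposition of the LFC vector can be checked numerically. The sketch below assumes one sample per condition, natural logarithms, and made-up abundances; the variable names are hypothetical and chosen only to mirror θ, θ∥, and θ⊥.

```python
import numpy as np

# Hypothetical absolute abundances for D = 4 features in two conditions.
w_control = np.array([100.0, 200.0, 300.0, 400.0])
w_treated = np.array([200.0, 400.0, 600.0, 3200.0])

# True log-fold changes on the absolute scale (theta).
theta = np.log(w_treated) - np.log(w_control)

# Compositional part (theta_parallel): LFCs computed from proportions only.
p_control = w_control / w_control.sum()
p_treated = w_treated / w_treated.sum()
theta_comp = np.log(p_treated) - np.log(p_control)

# Scale part (theta_perp): log-fold change of the total load.
theta_scale = np.log(w_treated.sum()) - np.log(w_control.sum())

# theta = theta_parallel + theta_perp * 1_D
reconstructed = theta_comp + theta_scale * np.ones_like(theta_comp)
```

Here `reconstructed` matches `theta` up to floating-point error. Sequencing data inform only `theta_comp`; the single scalar `theta_scale` is the quantity that must come from a prior or from additional assumptions.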

Existing sparse methods for differential analysis (e.g., LASSO‑type penalized regressions, sum‑to‑zero constraints, spike‑and‑slab priors) attempt to exploit the belief that only a small subset of taxa or genes change between conditions. However, these approaches typically treat sparsity as a fixed constraint, often enforcing unrealistic symmetry (sum‑to‑zero) or overly concentrated priors around zero. Such rigid assumptions can lead to biased estimates, especially when the true sparsity level differs from the imposed one or when total abundance shifts are asymmetric.

The authors propose a novel Sparse Bayesian PIM that treats sparsity itself as uncertain. They model each LFC θ_d as an independent draw from a common continuous density g that possesses a unique, finite mode. The mode of g is interpreted as the central tendency of the unchanged features; the scale shift θ⊥ is then estimated as the amount needed to center the bulk of the compositional LFCs (θ∥) at zero. Rather than placing a direct prior on θ⊥, the model infers it by aligning the mode of the observed LFC distribution, thereby propagating uncertainty about sparsity and scale simultaneously. This yields Sparse SSRVs, an extension of the original SSRV framework that incorporates a probabilistic sparsity component.
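The mode-alignment idea can be sketched as a simple plug-in point estimate. This is not the authors' Sparse SSRV procedure, which propagates uncertainty rather than returning a single value. The Gaussian kernel, the normal-reference bandwidth rule, the simulated data, and the helper name `kde_mode` are all assumptions made for illustration.

```python
import numpy as np

def kde_mode(values, grid_size=512, bandwidth=None):
    """Locate the mode of a Gaussian kernel density estimate on a grid."""
    values = np.asarray(values, dtype=float)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default).
        bandwidth = 1.06 * values.std(ddof=1) * values.size ** (-1 / 5)
    grid = np.linspace(values.min(), values.max(), grid_size)
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1)  # unnormalized KDE
    return grid[np.argmax(density)]

rng = np.random.default_rng(0)
D = 500
true_scale_shift = 1.5

# Simulated truth: 90% of features unchanged, 10% strongly increased.
theta = np.zeros(D)
theta[:50] = rng.normal(3.0, 0.5, size=50)

# Compositional LFCs are the true LFCs minus the unknown scale shift;
# the added noise stands in for estimation error.
theta_comp = theta - true_scale_shift + rng.normal(0.0, 0.1, size=D)

# Sparsity assumption: the bulk (mode) of the true LFCs sits at zero,
# so the scale shift is whatever moves the mode of theta_comp to zero.
scale_shift_hat = -kde_mode(theta_comp)
theta_hat = theta_comp + scale_shift_hat
```

With this simulated data the estimate lands close to the assumed shift of 1.5, and `theta_hat` is centered near zero for the unchanged features. If fewer features were unchanged, the mode would be less sharply defined, which is the kind of uncertainty the paper's model is meant to carry through to inference.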

Theoretical contributions include proofs of consistency and asymptotic normality for the Sparse SSRV estimators of both θ∥ and θ⊥, under mild regularity conditions. The authors show that as the proportion of truly unchanged features grows, the estimator of the scale shift converges to the true value, and the posterior contracts around the true LFC vector.

Empirical evaluation is extensive. In simulation studies, the authors vary the true sparsity level (10% to 70% of features truly differential), the magnitude of scale shifts, and sample sizes. They compare Sparse SSRVs against several state‑of‑the‑art methods: LASSO‑based differential analysis, ANCOM‑BC2, LinDA, MIMIX, and dense SSRVs. Results demonstrate that Sparse SSRVs maintain Type I error rates below 5% across all scenarios, while achieving power above 80% when sparsity assumptions hold, and only modest power loss when they do not. Importantly, when sparsity is misspecified, Sparse SSRVs default to conservative inference rather than inflating false discoveries, a desirable property for exploratory omics studies.
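For readers unfamiliar with how such rates are tallied in a simulation, the following sketch computes an empirical Type I error rate and power from a vector of true labels and a vector of method calls. The function name and the toy labels are hypothetical; the paper's own evaluation code is not reproduced here.

```python
import numpy as np

def error_rates(is_truly_differential, is_called):
    """Return (Type I error rate, power) for one simulated dataset."""
    truth = np.asarray(is_truly_differential, dtype=bool)
    called = np.asarray(is_called, dtype=bool)
    n_null = (~truth).sum()
    n_alt = truth.sum()
    # Type I error: fraction of unchanged features flagged as differential.
    type_i = (called & ~truth).sum() / n_null if n_null else float("nan")
    # Power: fraction of truly differential features that were flagged.
    power = (called & truth).sum() / n_alt if n_alt else float("nan")
    return type_i, power

# Toy example: 10 features, 2 truly differential, 2 called by a method.
truth = [True, True, False, False, False, False, False, False, False, False]
called = [True, False, True, False, False, False, False, False, False, False]
type_i, power = error_rates(truth, called)
type_ii = 1.0 - power  # Type II error rate is the complement of power
```

In this toy example one of eight unchanged features is falsely called (Type I rate 0.125) and one of two differential features is found (power 0.5). A simulation study would average these quantities over many replicate datasets.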

Real‑data analyses further validate the approach. The authors apply their method to a gut microbiome dataset with external measurements of absolute bacterial load (via qPCR) and to an oncology RNA‑seq dataset with spike‑in controls providing absolute transcript counts. In both cases, Sparse SSRVs recover LFC estimates that correlate more strongly with the ground‑truth absolute differences than estimates from TSS‑normalized or CLR‑based pipelines. Moreover, the inferred scale shifts align with the independently measured total loads, confirming that the model successfully captures the hidden scale factor.

The discussion acknowledges limitations: the need to specify the form of the density g (the authors use a flexible mixture of Gaussians), the computational burden of MCMC sampling in high‑dimensional settings, and the current omission of explicit phylogenetic or network structures that could inform the prior on θ∥. Future work is suggested on variational approximations for scalability, hierarchical extensions to incorporate phylogenetic covariance, and joint modeling of multiple experimental conditions.

In summary, this paper introduces a principled Bayesian framework that simultaneously addresses two critical sources of uncertainty in compositional count data—scale ambiguity and sparsity uncertainty. By extending the SRI paradigm to the sparse regime, the Sparse Bayesian Partially Identified Model provides more reliable differential abundance and expression inference, markedly reducing false discovery rates while preserving statistical power, and offering a robust alternative to conventional normalization‑based pipelines.
