Finite mixture representations of zero-and-$N$-inflated distributions for count-compositional data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We provide novel probabilistic portrayals of two multivariate models designed to handle zero-inflation in count-compositional data. We develop a new unifying framework that represents both as finite mixture distributions. One of these distributions, based on Dirichlet-multinomial components, has been studied before, but has not yet been properly characterised as a sampling distribution of the counts. The other, based on multinomial components, is a new contribution. Using our finite mixture representations enables us to derive key statistical properties, including moments, marginal distributions, and special cases for both distributions. We develop enhanced Bayesian inference schemes with efficient Gibbs sampling updates, wherever possible, for parameters and auxiliary variables, demonstrating improvements over existing methods in the literature. We conduct simulation studies to evaluate the efficiency of the Bayesian inference procedures and present applications to a human gut microbiome dataset to illustrate the practical utility of the proposed distributions.


💡 Research Summary

This paper introduces a unified probabilistic framework for modelling multivariate count‑compositional data that exhibit both excess zeros and “N‑inflation” – the situation where, in extreme cases, all trials fall into a single category. Two families of distributions are developed: the Zero‑and‑N‑inflated Multinomial (ZANIM) and the Zero‑and‑N‑inflated Dirichlet‑Multinomial (ZANIDM). Both are expressed as finite mixtures with 2^d components, where d is the number of categories.

For ZANIM, the authors start from the multinomial‑Poisson transformation, introduce an auxiliary Gamma variable ϕ, and modify each Poisson‑type likelihood term with a zero‑inflation parameter ζ_j for category j. Integrating out ϕ yields a mixture of: (i) a standard multinomial component, (ii) N‑inflated components where exactly one category receives all N counts, (iii) reduced‑dimension multinomials corresponding to configurations with several zero‑inflated categories, and (iv) a degenerate component where all counts are zero. The mixture weights η are simple functions of the ζ parameters, guaranteeing they sum to one.
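The 2^d component structure can be made concrete with a short enumeration. This is a minimal sketch, not the authors' code: it assumes the weight of each zero pattern is a product of independent Bernoulli terms in ζ_j, which is consistent with the description above but is not a formula quoted from the paper. The names `mixture_weights` and `component_kind` are hypothetical.

```python
from itertools import product
from math import prod


def mixture_weights(zeta):
    """Map each zero pattern z in {0,1}^d to a weight eta_z.

    Assumption: z_j = 1 marks category j as a structural zero with
    probability zeta_j, independently across categories.
    """
    weights = {}
    for z in product((0, 1), repeat=len(zeta)):
        weights[z] = prod(q if zj else 1.0 - q for zj, q in zip(z, zeta))
    return weights


def component_kind(z):
    """Label a pattern with the four component types listed in the text."""
    active = len(z) - sum(z)  # categories that are not structural zeros
    if active == 0:
        return "all-zero"
    if active == 1:
        return "N-inflated"
    if active == len(z):
        return "full"
    return "reduced"


zeta = [0.1, 0.3, 0.5]          # illustrative values only
eta = mixture_weights(zeta)
for z, w in eta.items():
    print(z, component_kind(z), round(w, 4))
```

Under this assumption the weights sum to one automatically, because they form the joint distribution of d independent Bernoulli variables.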

ZANIDM builds on the hierarchical Dirichlet‑Multinomial representation. Each λ_j is either set to zero (with probability ζ_j) or drawn from a Gamma(α_j,1) distribution (with probability 1‑ζ_j). Conditional on the λ’s, the counts follow a multinomial with probabilities proportional to λ. After marginalising over the Bernoulli indicators and the λ’s, the resulting PMF again consists of 2^d mixture components: a full Dirichlet‑Multinomial, N‑inflated Dirichlet‑Multinomial terms, reduced‑dimension Dirichlet‑Multinomials, and the same degenerate zero‑only component as in ZANIM. The concentration parameters α control over‑dispersion, while ζ controls structural zeros, allowing the two phenomena to be estimated independently.
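The hierarchy just described can be simulated directly. The following is a minimal sketch using only the Python standard library; the function name `sample_zanidm` and the parameter values are illustrative assumptions, and returning an all-zero vector when every λ_j is zero is how this sketch handles the degenerate component.

```python
import random


def sample_zanidm(n_trials, alpha, zeta, rng):
    """Draw one count vector from the hierarchical construction in the text."""
    # lambda_j is zero with probability zeta_j, otherwise Gamma(alpha_j, 1).
    lam = [
        0.0 if rng.random() < z else rng.gammavariate(a, 1.0)
        for a, z in zip(alpha, zeta)
    ]
    counts = [0] * len(alpha)
    if sum(lam) == 0.0:
        # Every category is a structural zero: the degenerate component.
        return counts
    # Multinomial draw with probabilities proportional to lambda.
    for j in rng.choices(range(len(alpha)), weights=lam, k=n_trials):
        counts[j] += 1
    return counts


rng = random.Random(42)
draw = sample_zanidm(50, alpha=[2.0, 1.0, 0.5], zeta=[0.2, 0.5, 0.8], rng=rng)
print(draw)
```

When exactly one λ_j is non-zero, all `n_trials` counts land in that category, which is the N-inflation behaviour the summary refers to.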

The authors derive closed‑form expressions for moments, covariances, and marginal distributions for both families, exploiting the mixture representation. They also provide stochastic representations that facilitate simulation: a set of independent Bernoulli variables z_j determines which mixture component is active, and conditional on z the counts are drawn from the appropriate (multinomial or Dirichlet‑Multinomial) distribution.
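To illustrate how a mixture representation gives moments, the sketch below computes the ZANIM mean by summing conditional means over all 2^d zero patterns. It rests on two assumptions that are not quoted from the paper: the pattern weights are products of independent Bernoulli terms, and conditional on a pattern the counts are multinomial with the base probabilities θ renormalised over the active categories. The name `zanim_mean` is hypothetical.

```python
from itertools import product
from math import prod


def zanim_mean(n_trials, theta, zeta):
    """E[Y] by the law of total expectation over zero patterns z."""
    d = len(theta)
    mean = [0.0] * d
    for z in product((0, 1), repeat=d):
        # Assumed weight of this pattern (independent Bernoulli indicators).
        weight = prod(q if zj else 1.0 - q for zj, q in zip(z, zeta))
        # Total base probability of the categories that remain active.
        active_mass = sum(t for zj, t in zip(z, theta) if not zj)
        if active_mass == 0.0:
            continue  # all-zero component contributes nothing to the mean
        for j in range(d):
            if not z[j]:
                # Conditional mean of a multinomial on renormalised theta.
                mean[j] += weight * n_trials * theta[j] / active_mass
    return mean


mean = zanim_mean(10, theta=[0.2, 0.3, 0.5], zeta=[0.1, 0.3, 0.5])
print(mean)
```

A useful sanity check under these assumptions is that the component means add up to N times the probability that at least one category is active, since every non-degenerate component allocates all N trials.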

Bayesian inference is performed via Gibbs sampling, with direct updates for parameters and auxiliary variables wherever possible. For ZANIM, full conditional distributions of the Bernoulli indicators, the auxiliary ϕ, and the multinomial probabilities are standard (Beta, Gamma, Dirichlet) and can be sampled directly. For ZANIDM, the authors improve upon earlier work by marginalising the latent λ_j variables, which reduces the dimensionality of the Markov chain and increases the effective sample size. Where such direct updates are available, the samplers avoid Metropolis‑Hastings steps.

Simulation studies compare the marginalised ZANIDM sampler with a naïve Metropolis‑Hastings implementation and evaluate ZANIM and ZANIDM against conventional multinomial and Dirichlet‑Multinomial models. Results show that the mixture‑based models achieve higher log‑likelihoods, better predictive accuracy, and substantially larger effective sample sizes, especially when the data contain both zero‑inflation and N‑inflation.

The methodology is applied to a human gut microbiome dataset comprising thousands of samples and hundreds of microbial taxa. Many taxa are rarely observed (structural zeros), yet a few taxa dominate certain samples, creating N‑inflation. Both ZANIM and ZANIDM outperform standard models in terms of fit and out‑of‑sample prediction. Moreover, the estimated ζ_j and α_j parameters provide biologically interpretable insights: taxa with high ζ_j are identified as truly absent in most environments, while α_j captures the variability among present taxa.

In summary, the paper makes three major contributions: (1) a novel finite‑mixture representation for zero‑and‑N‑inflated multinomial and Dirichlet‑Multinomial distributions, (2) thorough theoretical characterisation of their statistical properties, and (3) efficient Bayesian inference algorithms that substantially improve over existing approaches. The framework is broadly applicable to any multivariate count data with excess zeros and extreme concentration, such as gene‑expression counts, ecological abundance data, or environmental monitoring, offering a powerful new tool for statisticians and applied researchers alike.

