MCMC Inference for a Model with Sampling Bias: An Illustration using SAGE data

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

This paper explores Bayesian inference for a biased sampling model in situations where the population of interest cannot be sampled directly, but rather through an indirect and inherently biased method. Observations are viewed as the result of a multinomial sampling process from a tagged population which is, in turn, a biased sample from the original population of interest. This paper presents several Gibbs sampling techniques to estimate the joint posterior distribution of the original population based on the observed counts of the tagged population. These algorithms efficiently sample from the joint posterior distribution of a very large multinomial parameter vector. Samples from this method can be used to generate both joint and marginal posterior inferences. We also present an iterative optimization procedure, based upon the conditional distributions of the Gibbs sampler, which directly computes the mode of the posterior distribution. To illustrate our approach, we apply it to a tagged population of messenger RNAs (mRNAs) generated using a common high-throughput technique, Serial Analysis of Gene Expression (SAGE). Inferences for the mRNA expression levels in the yeast Saccharomyces cerevisiae are reported.


💡 Research Summary

The paper addresses a common problem in modern high‑throughput biology: the population of interest cannot be sampled directly, and the observed data are obtained through a biased, indirect procedure. Using the example of Serial Analysis of Gene Expression (SAGE), the authors model the data‑generating process as a two‑stage hierarchical system. First, the true mRNA expression proportions (π) of a large set of genes are transformed into a “tagged” population (θ) by a bias matrix B that captures gene‑specific tagging efficiencies, enzyme biases, and other experimental artefacts. Second, the observed tag counts y are drawn from a multinomial distribution with total count N and cell probabilities θ.
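The two-stage generative process described above can be sketched in a few lines of NumPy. This is an illustrative simulation, not the authors' code: the dimensions, the diagonal-dominant form of the bias matrix B, and all variable names are assumptions made for the example.

```python
# Illustrative sketch of the two-stage biased sampling model:
# true proportions pi are distorted by a bias matrix B into
# tagged-population probabilities theta, and the observed tag
# counts y are a multinomial draw from theta.
import numpy as np

rng = np.random.default_rng(0)

G = 5          # number of genes (tiny, for illustration)
N = 1000       # total observed tag count

# True expression proportions pi over the G genes
pi = rng.dirichlet(np.ones(G))

# Bias matrix B: column g is the distribution of tag types produced
# by gene g (diagonal-dominant: most tags map to their own gene)
B = np.full((G, G), 0.02)
np.fill_diagonal(B, 0.92)
B /= B.sum(axis=0)            # each column is a probability vector

# Tagged-population probabilities and observed multinomial counts
theta = B @ pi
y = rng.multinomial(N, theta)

print(y, y.sum())             # G tag-class counts summing to N
```

In a real SAGE analysis G is in the thousands and N in the millions, but the structure of the model is unchanged.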

In a Bayesian framework, π is assigned a Dirichlet prior and each row of B receives a Beta (or Dirichlet) prior reflecting prior knowledge about tagging probabilities. The joint posterior p(π, B | y) is analytically intractable because of the high dimensionality (thousands of genes) and the coupling introduced by B. The authors therefore develop a suite of Gibbs sampling algorithms that exploit the conjugacy of the Dirichlet‑multinomial pair for π|B,y and the Beta‑binomial (or Dirichlet‑multinomial) structure for B|π,y. By iteratively sampling from these two conditional distributions, the sampler efficiently explores the posterior even when the parameter vector has tens of thousands of components.
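A minimal Gibbs-sampler sketch for the conditional π | B, y is shown below, treating B as known for simplicity. The latent-allocation step and update rules are an assumption illustrating the Dirichlet-multinomial conjugacy the summary describes, not the paper's exact algorithm: each observed tag of type j is attributed to a latent source gene, after which π has a closed-form Dirichlet conditional.

```python
# Sketch of a Gibbs update for pi given a known bias matrix B, using
# latent allocation of each tag type to its source gene followed by
# the conjugate Dirichlet draw. Names and priors are illustrative.
import numpy as np

def gibbs_pi(y, B, alpha=1.0, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    G = B.shape[1]
    pi = np.full(G, 1.0 / G)          # initial value
    draws = []
    for _ in range(iters):
        # Latent step: allocate the y[j] tags of type j to source
        # genes with probabilities proportional to B[j, g] * pi[g]
        counts = np.zeros(G)
        for j, n_j in enumerate(y):
            if n_j == 0:
                continue
            w = B[j] * pi
            counts += rng.multinomial(n_j, w / w.sum())
        # Conjugate step: pi | latent counts ~ Dirichlet(alpha + counts)
        pi = rng.dirichlet(alpha + counts)
        draws.append(pi)
    return np.array(draws)

# Tiny usage example with a near-identity 2x2 bias matrix
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
y = np.array([80, 20])
draws = gibbs_pi(y, B, iters=200)
print(draws[-1])                      # one posterior draw of pi
```

In the full model the sampler alternates this step with a draw of B | π, y from its Beta/Dirichlet conditional; the per-tag-type allocation loop is what sparse sufficient statistics make cheap at genome scale.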

Key computational tricks include: (1) storing only the sparse sufficient statistics of the tag counts, which dramatically reduces memory usage; (2) vectorized updates that avoid looping over genes; and (3) a convergence diagnostic based on the Gelman‑Rubin statistic applied to multiple parallel chains. In addition to full posterior sampling, the authors propose an iterative optimization scheme that uses the same conditional expectations as the Gibbs steps to perform a coordinate‑ascent search for the posterior mode (MAP estimate). This provides a fast alternative when a point estimate is sufficient.
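The coordinate-ascent MAP idea can be sketched by replacing each Gibbs draw with the corresponding conditional expectation or mode, which yields an EM-style fixed-point iteration. The update rules below are an assumption consistent with the Dirichlet-multinomial structure described above, not the paper's exact procedure.

```python
# EM-style coordinate-ascent sketch for the posterior mode of pi
# (bias matrix B held fixed): the latent multinomial draw becomes an
# expected-count step, and the Dirichlet draw becomes its mode.
import numpy as np

def map_pi(y, B, alpha=1.0, iters=200, tol=1e-10):
    G = B.shape[1]
    pi = np.full(G, 1.0 / G)
    for _ in range(iters):
        # Expected latent counts: E[z_g] = sum_j y_j * P(gene g | tag j)
        w = B * pi                          # (J, G) allocation weights
        r = w / w.sum(axis=1, keepdims=True)
        counts = y @ r
        # Mode of Dirichlet(alpha + counts), assuming alpha + counts >= 1
        new_pi = (alpha + counts - 1).clip(min=0)
        new_pi /= new_pi.sum()
        if np.max(np.abs(new_pi - pi)) < tol:
            pi = new_pi
            break
        pi = new_pi
    return pi

# Same tiny example as the model sketch: near-identity bias matrix
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
y = np.array([80, 20])
pi_hat = map_pi(y, B)
print(pi_hat)
```

For this toy input the fixed point solves Bπ = y/N exactly (π ≈ [0.875, 0.125]), illustrating why the MAP estimates track the posterior means so closely in the yeast application.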

The methodology is applied to a real SAGE dataset from the yeast Saccharomyces cerevisiae, comprising roughly 6 000 genes and about 2 × 10⁶ total tags. The bias matrix B is initialized using published enzyme efficiency data and refined through the Gibbs iterations. After 10 000 iterations (with a 2 000‑iteration burn‑in), the sampler converges, yielding posterior distributions for each gene’s expression proportion. Compared with naïve tag‑count normalization, the Bayesian estimates have substantially narrower credible intervals, especially for low‑abundance transcripts, and the MAP estimates closely match the posterior means, confirming the reliability of the optimization routine.

The contributions of the paper are threefold: (i) a principled Bayesian formulation of biased sampling that separates the true population from the observed, biased sample; (ii) scalable Gibbs‑sampling algorithms capable of handling very high‑dimensional multinomial parameters; and (iii) a practical MAP‑optimization method derived from the same conditional structure. The authors discuss extensions such as time‑varying bias matrices for longitudinal studies, non‑parametric bias functions, and application to other high‑throughput platforms like RNA‑seq. Overall, the work provides a robust statistical toolkit for correcting sampling bias in large‑scale genomic experiments, improving both point estimates and uncertainty quantification for gene expression levels.

