MCMC Inference for a Model with Sampling Bias: An Illustration using SAGE data

📝 Original Info

  • Title: MCMC Inference for a Model with Sampling Bias: An Illustration using SAGE data
  • ArXiv ID: 0711.3765
  • Date: 2007-11-26
  • Authors: Author information was not provided. (Check the original paper if possible.)

📝 Abstract

This paper explores Bayesian inference for a biased sampling model in situations where the population of interest cannot be sampled directly, but rather through an indirect and inherently biased method. Observations are viewed as being the result of a multinomial sampling process from a tagged population which is, in turn, a biased sample from the original population of interest. This paper presents several Gibbs Sampling techniques to estimate the joint posterior distribution of the original population based on the observed counts of the tagged population. These algorithms efficiently sample from the joint posterior distribution of a very large multinomial parameter vector. Samples from this method can be used to generate both joint and marginal posterior inferences. We also present an iterative optimization procedure based upon the conditional distributions of the Gibbs Sampler which directly computes the mode of the posterior distribution. To illustrate our approach, we apply it to a tagged population of messenger RNAs (mRNA) generated using a common high-throughput technique, Serial Analysis of Gene Expression (SAGE). Inferences for the mRNA expression levels in the yeast Saccharomyces cerevisiae are reported.

📄 Full Content

This paper develops methods for making Bayesian inferences about the composition of a population whose members have different probabilities of being observed. Our approach applies to situations where the categorical composition of a population is of interest and where some members of the population may be more easily observed than others. The sampling process can be viewed as a multinomial process in which the probability of a sample being chosen differs for each category in a known way. One example is survey sampling of male and female birds that differ greatly in their coloration, markings, and degree of vocalization. Alternatively, sampling rates may differ among age classes or species that vary in their activity level or size. Similar problems exist in molecular biology, where observations are generally indirect and the ability to observe a molecule varies by type. Specific examples include proteins that differ in their hydrophobicity or size. We illustrate our ideas by considering a data set generated using Serial Analysis of Gene Expression (SAGE), a bioinformatic technique used to measure mRNA expression levels.

SAGE is a high-throughput method for inferring mRNA expression levels from an experimentally generated set of sequence tags. A SAGE dataset consists of a list of counts for the number of tags that can be unambiguously attributed to the mRNA of a specific gene. These observed tag counts can be thought of as a sample from a much larger pool of mRNA tags. The tag counts are then used to make inferences about the proportion of mRNA from each gene within the mRNA population from a group of cells. The standard approach for interpreting SAGE data is through the use of a multinomial sampling model (Velculescu et al., 1995, 1997). Morris et al. (2003) directly applied a Bayesian multinomial-Dirichlet model to the observed vector of tag counts. This approach improves upon most earlier work by considering simultaneous inference on all proportions. They provide a simple, computationally tractable approach and consider the statistical shrinkage effect, which improves estimates for proportions with low tag counts while underestimating the expression proportions for tags with large counts. This leads them to propose a mixture Dirichlet prior in order to mitigate the propensity to underestimate highly expressed genes.
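The shrinkage effect described above can be seen in a minimal sketch of the posterior mean under a multinomial likelihood with a symmetric Dirichlet prior. The tag counts and the prior weight `alpha` below are hypothetical values chosen for illustration; this is not the mixture prior of Morris et al. (2003), only the plain conjugate case.

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of multinomial proportions under a symmetric
    Dirichlet(alpha, ..., alpha) prior (conjugate update)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Hypothetical tag counts for four genes (illustrative only).
counts = [0, 1, 9, 90]

# Raw sample proportions (the multinomial maximum likelihood estimate).
raw = [c / sum(counts) for c in counts]

# Posterior means with a hypothetical prior weight of 1 per gene.
post = dirichlet_posterior_mean(counts, alpha=1.0)

for c, r, p in zip(counts, raw, post):
    print(f"count={c:3d}  raw={r:.4f}  posterior mean={p:.4f}")
```

In this sketch the estimates for the low-count genes are pulled upward and the estimate for the high-count gene is pulled downward, which is the behavior that motivates a prior less prone to underestimating highly expressed genes.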

An alternative analysis, which was not based on a multinomial model directly, was developed by Thygesen & Zwinderman (2006), who modeled the marginal distribution of the counts across tag types as though they were independent observations from a Poisson distribution. They applied a hierarchical zero-truncated Poisson model whose mean parameter followed either a gamma or a log-normal distribution. A non-parametric adjustment factor was required in order to correctly capture the overabundance of larger counts. Similar analyses (Kuznetsov et al., 2002; Kuznetsov, 2006) modeled SAGE counts using a discrete Pareto-like distribution. These studies found that this model effectively predicted counts greater than zero. One drawback is that the variance of expression cannot be explicitly separated from sample variance. However, in the context of differential expression, Baggerly et al. (2003) suggest that treating genes individually offers less power than a model, such as the multinomial, which incorporates all tags simultaneously.

One common thread in these approaches is the explicit assumption that an mRNA’s frequency in the sampled tag pool is equivalent to its frequency in the mRNA population from which the tag pool was derived. However, because the ability to form tags from an mRNA transcript varies from gene to gene, the tag pool that is sampled is actually a biased sample of the mRNA population (Gilchrist et al., 2007). Gilchrist et al. (2007) illustrated how the tag formation bias could be estimated and incorporated through the calculation of a gene-specific tag formation probability. The complicating factor is that the sampling bias of one gene is not only a function of its own tag formation probability, but is also inversely proportional to a mean tag formation probability where the contribution of each gene to this mean is weighted by its frequency in the mRNA population, i.e. the very parameters we wish to estimate.
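One way to read the bias structure described above is that a gene's frequency in the tag pool equals its mRNA frequency times its tag formation probability, divided by the frequency-weighted mean tag formation probability. The sketch below encodes that reading. The function names, the mRNA frequencies, and the tag formation probabilities are all hypothetical, and the simple inversion shown is a point correction on noise-free frequencies, not the Bayesian procedure developed in the paper.

```python
def tag_pool_frequencies(mrna_freq, tag_prob):
    """Map mRNA frequencies to tag-pool frequencies under the assumed
    bias model: theta_i = m_i * phi_i / sum_j(m_j * phi_j)."""
    weighted = [m * phi for m, phi in zip(mrna_freq, tag_prob)]
    mean_tag_prob = sum(weighted)  # mean phi, weighted by mRNA frequency
    return [w / mean_tag_prob for w in weighted]


def debias(tag_freq, tag_prob):
    """Invert the map above: m_i is proportional to theta_i / phi_i."""
    ratio = [t / phi for t, phi in zip(tag_freq, tag_prob)]
    total = sum(ratio)
    return [r / total for r in ratio]


# Hypothetical mRNA frequencies and tag formation probabilities.
mrna_freq = [0.5, 0.3, 0.2]
tag_prob = [0.9, 0.5, 0.1]

theta = tag_pool_frequencies(mrna_freq, tag_prob)
recovered = debias(theta, tag_prob)

print("tag pool frequencies:", [round(t, 4) for t in theta])
print("recovered mRNA frequencies:", [round(m, 4) for m in recovered])
```

The normalizing term depends on every gene's mRNA frequency, so each gene's tag-pool frequency cannot be computed in isolation. With observed counts instead of exact frequencies, the same coupling is what makes joint inference over the whole parameter vector necessary.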

As a result of the sampling bias, the probability of observing an individual gene depends on both its tag formation probability and the distribution of these probabilities across all other genes weighted by their mRNA frequency. Gilchrist et al. (2007) derive an implicit solution for the maximum likelihood and joint posterior mode estimators of the composition of the mRNA population. However, there are a number of numerical stability issues that severely limit the range of prior parameters that can be used, i.e. those priors with appreciable weight relative to the observed sample sizes. There is also the restrictive assumption that

Reference

This content is AI-processed based on open access ArXiv data.
