Distributions associated with general runs and patterns in hidden Markov models

Reading time: 6 minutes

📝 Original Info

  • Title: Distributions associated with general runs and patterns in hidden Markov models
  • ArXiv ID: 0706.3985
  • Date: 2007-12-18
  • Authors: Researchers from original ArXiv paper

📝 Abstract

This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), and thus, the theory includes as special cases results for a large class of problems that have wide application. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples are given to illustrate the use of the methodology. Whereas the first application is more to illustrate the basic steps in applying the theory, the second is a more detailed application to DNA sequences, and shows that the methods can be adapted to include restrictions related to biological knowledge.

📄 Full Content

1. Introduction. Hidden Markov models (HMMs) provide a rich structure for use in a wide range of statistical applications. As examples, they serve as models in speech recognition [Rabiner (1989)], image processing [Li and Gray (2000)], DNA sequence analysis [Durbin, Eddy, Krogh and Mitchison (1998) and Koski (2001)], DNA microarray time course analysis [Yuan and Kendziorski (2006)] and econometrics [Hamilton (1989) and Sims and Zha (2006)], to name just a few. HMMs essentially specify two structures, an underlying model for the unobserved state of the system, and one for the observations, conditional on the unobserved states. Thus, HMMs are a sub-class of state space models [Harvey (1989)], but have the restriction that the models for the hidden states are defined on finite dimensional spaces.
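As a concrete illustration of this two-part structure, here is a minimal sketch of a first-order HMM in Python. The two-state, two-symbol model and its parameter values are hypothetical, chosen only to show the roles of the initial distribution, the hidden-state transition matrix, and the emission matrix:

```python
import numpy as np

# Hypothetical two-state, two-symbol HMM (illustrative parameters only).
# The state sequence is hidden; only the emissions are observed.
pi = np.array([0.6, 0.4])            # initial state distribution
A = np.array([[0.7, 0.3],            # A[i, j] = P(next state j | current state i)
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],            # B[i, k] = P(observation k | state i)
              [0.3, 0.7]])

def sample(T, rng=np.random.default_rng(0)):
    """Generate a hidden state path and an observation sequence of length T."""
    states, obs = [], []
    s = rng.choice(2, p=pi)
    for _ in range(T):
        states.append(int(s))
        obs.append(int(rng.choice(2, p=B[s])))
        s = rng.choice(2, p=A[s])
    return states, obs
```

In this notation, the "two structures" are the pair (pi, A) for the hidden chain and B for the observations conditional on the hidden states; a higher-order HMM would replace A with transition probabilities depending on several previous states.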

HMMs have been studied extensively, especially for the case where the hidden sequence is first-order Markovian (a Markov chain); see, for example, Rabiner (1989). Higher-order HMMs are less frequently used, but are gaining in popularity, especially in areas such as bioinformatics [Krogh (1997) and Ching, Ng and Fung (2003)]. For practical purposes, three fundamental problems associated with first-order HMMs have been examined thoroughly and solved [Rabiner (1989)]: (1) the efficient computation of the likelihood of the sequence of observations given the HMM [Baum and Eagon (1967)]; (2) the determination of a best sequence of underlying states to maximize the likelihood of the observation sequence [Viterbi (1967)]; and (3) the adjustment of model parameters to maximize the likelihood of the observations [the Baum-Welch algorithm; see Baum and Eagon (1967)].
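Problems (1) and (2) admit short dynamic-programming solutions; the following is a minimal Python sketch (a scaled forward recursion and a log-space Viterbi for a generic first-order HMM — the variable names and structure are ours, not taken from the paper):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Problem (1): log P(observations) via the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]          # alpha[i] = P(o_0, state_0 = i)
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
        logp += np.log(alpha.sum())    # accumulate the scaling factors
        alpha = alpha / alpha.sum()    # rescale to avoid underflow
    return logp

def viterbi(obs, pi, A, B):
    """Problem (2): most probable hidden state path, computed in log space."""
    T, S = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, S), dtype=int)  # psi[t, j] = best state at t-1 given j at t
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):      # backtrack through the pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Both run in O(T·S²) time, in contrast to the exponential cost of enumerating all S^T hidden paths.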

However, little is known about probabilities for patterns or collections of patterns (also known as words or motifs, resp.) in heterogeneous sequences such as those of HMMs [Reinert, Schbath and Waterman (2005)]. In this paper a fourth problem, one that is increasingly important for applications such as bioinformatics and data mining, is considered: the probability that a pattern has occurred or will occur in the hidden state sequence of an HMM.

Currently, inference on patterns in the hidden state sequence of an HMM usually proceeds as follows. The HMM is determined and the Viterbi algorithm is used to find the most probable state sequence among all possible ones, conditional on the observations. This state sequence is then treated as if it is “deterministically correct” and patterns are found by examining it. However, the conditional distribution (given the observations) of patterns over all state sequences is more relevant. If, for example, the number of genes present in a DNA sequence is of interest and the Viterbi sequence of an HMM is used [as in methods based on Krogh, Mian and Haussler (1994)], then counting genes from the Viterbi sequence cannot be guaranteed to even give a good estimate of the number of genes. This is because there could be gene counts that correspond to many state sequences, and when accumulating probabilities over those sequences, one could find that those counts are much more likely than the count corresponding to the Viterbi sequence. This could especially be true if there are many different sequences all with likelihood close to that of the Viterbi sequence. If a single choice of gene count is needed, then the mean of the conditional distribution over state sequences, given the observations, would seem to be a more reasonable choice. Thus, a method to compute pattern distributions in state sequences modeled as HMMs would be helpful.
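For intuition only, the conditional distribution of a pattern count can be obtained by brute-force enumeration over hidden paths when the sequence is very short. The paper's auxiliary Markov chain is what makes this tractable at realistic lengths; the sketch below (with illustrative parameters and a hypothetical count function) merely makes the quantity of interest concrete:

```python
import numpy as np
from itertools import product

def path_joint(path, obs, pi, A, B):
    """Joint probability P(hidden path, observations) for a first-order HMM."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

def posterior_count_dist(obs, pi, A, B, count_fn):
    """Distribution of count_fn(hidden path) given the observations,
    by exhaustive enumeration (exponential in len(obs): demo only)."""
    dist = {}
    for path in product(range(len(pi)), repeat=len(obs)):
        w = path_joint(path, obs, pi, A, B)
        c = count_fn(path)               # e.g. number of pattern occurrences
        dist[c] = dist.get(c, 0.0) + w
    z = sum(dist.values())               # P(observations)
    return {c: w / z for c, w in dist.items()}
```

The posterior mean count, `sum(c * p for c, p in dist.items())`, is the kind of single-number summary argued for above, as opposed to reading the count off the single Viterbi path.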

In this paper a computational method for finding such pattern distributions is developed, and waiting time probabilities for patterns under the general framework given in Aston and Martin (2005) are extended to the state sequence of HMMs. The probabilities are computed under the paradigm that T observations of the output sequence of the HMM have been realized, and the generation process is either complete or set to continue. Waiting time probabilities are then computed for patterns in the unobserved state sequence up to a time T* ≥ T. Note that waiting time probabilities for patterns in the observations of an HMM when no data has yet been observed (i.e., T = 0), a case that relates directly to computations in the standard case of Markovian sequences with no "hidden" states, were dealt with in Cheung (2004).

The methodology of this paper will be applied to two examples, but one application will be studied in detail, that of CpG Island analysis in DNA sequences. A CpG island is a short segment of DNA in which the frequency of CG pairs is higher than in other regions. The “p” indicates that C and G are connected by a phosphodiester bond. The C in a CG pair is often modified by methylation, and if that happens, there is a relatively high chance that it will mutate to a T, and thus, CG pairs are under-represented in DNA sequences. Upstream from a gene, the methylation process is suppressed in a sh

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
