Distributions associated with general runs and patterns in hidden Markov models
This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), and thus, the theory includes as special cases results for a large class of problems that have wide application. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples are given to illustrate the use of the methodology. Whereas the first application is more to illustrate the basic steps in applying the theory, the second is a more detailed application to DNA sequences, and shows that the methods can be adapted to include restrictions related to biological knowledge.
💡 Research Summary
This paper introduces a comprehensive methodology for computing the probability distributions of arbitrary patterns occurring in the hidden state sequence of a hidden Markov model (HMM), conditional on having observed all or part of the output sequence. While classical HMM theory focuses on three fundamental problems—likelihood evaluation, most probable state path (Viterbi), and parameter estimation (Baum‑Welch)—the authors address a fourth, increasingly important problem: the conditional distribution of patterns (or motifs) in the hidden states.
The authors allow the hidden chain to be of any finite order m, i.e., the state at time t depends on the previous m states. By augmenting the state vector to X̃_t = (X_{t−m+1}, …, X_t), they extend the standard forward (α) and backward (β) recursions to higher‑order HMMs. The forward variable α_t(X̃_t) = P(Y_1, …, Y_t, X̃_t) and the backward variable β_t(X̃_t) = P(Y_{t+1}, …, Y_T | X̃_t) are computed exactly as in the first‑order case; the only change is the enlarged state space.
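The augmentation can be sketched in a few lines: once the m-tuples are treated as the states of a first-order chain, the textbook forward-backward recursions apply unchanged. The following is a minimal illustration, not the paper's implementation; all state names, probabilities, and the toy observation sequence are invented for the example.

```python
import itertools

def augment_states(states, m):
    """Enumerate all ordered m-tuples of hidden states (the augmented space)."""
    return list(itertools.product(states, repeat=m))

def forward_backward(obs, aug_states, init, trans, emit):
    """Textbook forward-backward run on the augmented (now first-order) chain.

    init[s]      : P(X~_1 = s) for an augmented state s (an m-tuple)
    trans[s][s2] : P(X~_{t+1} = s2 | X~_t = s); nonzero only when the
                   tuples overlap, i.e. s2[:-1] == s[1:]
    emit[x][y]   : P(Y_t = y | X_t = x); emission depends only on the last
                   coordinate x = s[-1] of the augmented state
    """
    T = len(obs)
    alpha = [{} for _ in range(T)]
    beta = [{} for _ in range(T)]
    for s in aug_states:
        alpha[0][s] = init[s] * emit[s[-1]][obs[0]]
        beta[T - 1][s] = 1.0
    for t in range(1, T):                      # forward pass
        for s2 in aug_states:
            alpha[t][s2] = emit[s2[-1]][obs[t]] * sum(
                alpha[t - 1][s] * trans[s].get(s2, 0.0) for s in aug_states)
    for t in range(T - 2, -1, -1):             # backward pass
        for s in aug_states:
            beta[t][s] = sum(trans[s].get(s2, 0.0)
                             * emit[s2[-1]][obs[t + 1]] * beta[t + 1][s2]
                             for s2 in aug_states)
    return alpha, beta

# Toy example (all numbers hypothetical): two hidden states, order m = 2.
states = [0, 1]
aug = augment_states(states, 2)
init = {s: 1.0 / len(aug) for s in aug}
trans = {s: {s[1:] + (c,): 0.5 for c in states} for s in aug}
emit = {0: {'A': 0.9, 'B': 0.1}, 1: {'A': 0.2, 'B': 0.8}}
alpha, beta = forward_backward(['A', 'B', 'A'], aug, init, trans, emit)
likelihood = sum(alpha[-1].values())           # P(Y_1, ..., Y_T)
```

The usual sanity check carries over: Σ_s α_t(s)·β_t(s) equals the observation likelihood P(Y_1, …, Y_T) at every t.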
Pattern classes are defined very generally. A simple pattern Λ_i is a fixed sequence of symbols from the hidden state alphabet, possibly with repetitions. A compound pattern Λ is the union of several simple patterns. Competing patterns consist of c compound patterns, each required to appear a prescribed number of times r_j; the waiting time of interest is the first time any of them reaches its quota. Generalized later patterns instead require that all c patterns achieve their required counts, the waiting time being when the last quota is met. When c = 1, both reduce to the ordinary waiting‑time distribution for a single pattern. The framework also accommodates “runs” of length k and distinguishes between overlapping and non‑overlapping counting schemes.
To make the computation tractable, the authors construct an auxiliary Markov chain Z_t that embeds the progress of pattern matching into its state description. Each Z_t contains both the current hidden state X_t and the current length of the longest prefix of each pattern that has been matched so far (essentially a deterministic finite automaton attached to the HMM). Transition probabilities of Z_t are the product of the original HMM transition/observation probabilities and the deterministic update of the matching automaton. Because Z_t is itself a Markov chain, the standard forward‑backward machinery can be applied directly to Z_t. The probability that a pattern finishes at time t is obtained by summing α_t(z)·β_t(z) over all auxiliary states z that correspond to a completed pattern. By iterating this sum over t up to a user‑specified horizon T* (≥ T), the full waiting‑time distribution is recovered.
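This construction can be illustrated for the simplest case: a first-order HMM and a single simple pattern, with matching progress tracked by a KMP-style automaton. The sketch below is an illustration under those simplifying assumptions, not the paper's algorithm or notation; it computes the posterior probability that an occurrence of the pattern in the hidden sequence ends at each time t by summing α_t(z)·β_t(z) over auxiliary states z whose match counter equals the full pattern length.

```python
from collections import defaultdict

def kmp_fail(p):
    """Failure function of the KMP automaton for pattern p."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def advance(q, sym, p, fail):
    """Longest prefix of p matched after reading sym from match-state q
    (overlapping convention: a completed match falls back via its failure link)."""
    if q == len(p):
        q = fail[q - 1]
    while q and p[q] != sym:
        q = fail[q - 1]
    return q + 1 if p[q] == sym else 0

def occurrence_end_probs(obs, pattern, states, pi, A, E):
    """P(an occurrence of `pattern` in the hidden sequence ends at t | Y),
    via a forward recursion on the auxiliary chain Z_t = (X_t, q_t)."""
    fail, L, T = kmp_fail(pattern), len(pattern), len(obs)
    alpha = [defaultdict(float) for _ in range(T)]  # alpha_t(z) = P(Y_1..Y_t, Z_t = z)
    for x in states:
        alpha[0][(x, advance(0, x, pattern, fail))] = pi[x] * E[x][obs[0]]
    for t in range(1, T):
        for (x, q), a in alpha[t - 1].items():
            for x2 in states:
                z2 = (x2, advance(q, x2, pattern, fail))
                alpha[t][z2] += a * A[x][x2] * E[x2][obs[t]]
    # beta computed on the original chain: the match counter does not affect
    # the future likelihood of the observations, so beta_t(x, q) = beta_t(x)
    beta = [{} for _ in range(T)]
    beta[T - 1] = {x: 1.0 for x in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {x: sum(A[x][x2] * E[x2][obs[t + 1]] * beta[t + 1][x2]
                          for x2 in states) for x in states}
    p_y = sum(alpha[T - 1].values())                # P(Y_1..Y_T)
    return [sum(a * beta[t][x] for (x, q), a in alpha[t].items() if q == L) / p_y
            for t in range(T)]
```

Because the match counter q_t updates deterministically given the hidden state, the auxiliary chain inherits the Markov property, and the forward pass is the only extra cost relative to ordinary forward-backward.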
Two counting conventions are explicitly treated. In non‑overlapping counting (system‑wide or within‑pattern), once a pattern is counted the counting process restarts, preventing partially completed patterns from being finished later. In overlapping counting, any partially completed pattern may be completed regardless of other occurrences. The authors illustrate the difference with concrete binary sequences and show how the auxiliary chain naturally enforces the chosen convention.
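The distinction is easy to see on a plain symbol sequence; a minimal counting sketch (the sequence and pattern below are illustrative, and the auxiliary chain enforces the same logic through its automaton's reset rule):

```python
def count_pattern(seq, pattern, overlapping=True):
    """Count occurrences of `pattern` in `seq` under either counting scheme.
    After a match, overlapping counting resumes one symbol later, while
    non-overlapping counting restarts after the whole matched block."""
    n, L, count, i = len(seq), len(pattern), 0, 0
    while i + L <= n:
        if tuple(seq[i:i + L]) == tuple(pattern):
            count += 1
            i += 1 if overlapping else L   # the restart point encodes the convention
        else:
            i += 1
    return count
```

On the binary sequence 1111, the pattern 11 occurs 3 times under overlapping counting but only 2 times under non-overlapping counting.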
The methodology is demonstrated on two examples. The first is a toy geological data set where a simple run of identical symbols is analyzed, illustrating the basic steps of building the auxiliary chain, running forward‑backward recursions, and extracting the waiting‑time distribution, even when only a partial observation sequence is available (T < T*).
The second, more substantive example concerns CpG island detection in human DNA. CpG islands are genomic regions with elevated CG dinucleotide frequency, often associated with gene promoters. The authors model the genome as a two‑state HMM (CpG island vs. non‑island) with higher‑order dependencies to capture nucleotide context. They define compound patterns representing the start and end of an island, impose biologically motivated constraints (minimum island length, minimum separation between islands), and embed these constraints into the auxiliary chain. Using the proposed algorithm, they compute the posterior distribution of the number of islands, their lengths, and inter‑island distances, conditioned on the observed nucleotide sequence. The results reveal that the Viterbi path—commonly used in practice—can substantially mis‑estimate both the count and the length of islands; for example, the Viterbi segmentation may produce a single long island where the posterior distribution assigns high probability to several shorter islands.
In the discussion, the authors emphasize that their framework provides a principled way to quantify uncertainty about pattern occurrence in HMMs, moving beyond the deterministic “most likely path” approach. The method is applicable to any domain where HMMs are used and where pattern statistics are of interest, such as speech recognition (phoneme runs), image analysis (texture motifs), and bioinformatics (motif occurrences). They suggest extensions to continuous‑valued observations, scalable approximations for very large state spaces, and online updating for streaming data.
Overall, the paper delivers a mathematically rigorous, algorithmically feasible, and practically useful solution to the problem of pattern‑based inference in hidden Markov models, broadening the toolkit available to statisticians and data scientists working with sequential data.