Computing the likelihood of sequence segmentation under Markov modelling
I tackle the problem of partitioning a sequence into homogeneous segments, where homogeneity is defined by a set of Markov models. The problem is to study the likelihood that a sequence is divided into a given number of segments. Here, the moments of this likelihood are computed through an efficient algorithm. Unlike methods involving Hidden Markov Models, this algorithm does not require probability transitions between the models. Among many possible usages of the likelihood, I present a maximum \textit{a posteriori} probability criterion to predict the number of homogeneous segments into which a sequence can be divided, and an application of this method to find CpG islands.
💡 Research Summary
The paper addresses the problem of segmenting a sequence into homogeneous regions when homogeneity is defined by a set of Markov models rather than by a hidden Markov model (HMM) with explicit state‑transition probabilities. The author defines the “segmentation likelihood” Lk as the sum over all possible partitions of the sequence into k contiguous segments, each generated by one of the M given Markov models. Unlike HMMs, the model can change arbitrarily at segment boundaries without any transition matrix, which eliminates the need to estimate transition probabilities and reduces the number of free parameters.
To make the computation tractable, the author derives a dynamic‑programming (DP) recurrence. Let L(i,k) be the likelihood of the prefix x1…xi when it is divided into k segments. Then
L(i,k) = Σ_{j=1}^{i‑1} Σ_{m∈M} L(j,k‑1)·P(x_{j+1}^{i}|m)
where P(x_{j+1}^{i}|m) is the probability that the subsequence from position j+1 to i is emitted by model m. All segment‑wise probabilities are pre‑computed for every possible interval and every model, which costs O(M·n²) time and O(M·n²) storage in the naïve implementation. In the DP stage, each update requires only constant time, so the overall complexity remains O(M·n²) plus a factor proportional to k, which is typically small (e.g., 10–30). Memory usage is reduced to O(n·k) by keeping only two one‑dimensional arrays (previous and current k‑layer).
Beyond the raw likelihood, the author shows how to compute the first and second moments of Lk within the same DP framework. By propagating not only the likelihood but also the accumulated sum and sum‑of‑squares, the algorithm yields the mean and variance of the segmentation likelihood for each k without extra asymptotic cost. These moments are essential for Bayesian inference.
For model selection, a prior distribution P(k) (uniform, Poisson, or empirically derived) is combined with the computed likelihood to obtain the posterior P(k|X) ∝ Lk·P(k). The maximum‑a‑posteriori (MAP) estimate
k̂ = argmax_k P(k|X)
provides an automatic prediction of the optimal number of homogeneous segments. This Bayesian criterion is straightforward, avoids over‑fitting, and does not require any expectation‑maximization or Viterbi decoding as in HMM‑based approaches.
The methodology is demonstrated on the detection of CpG islands in genomic DNA. CpG islands are regions with elevated CG dinucleotide frequency and characteristic Markovian dependencies. The author models the genome with two Markov chains: one representing CpG islands and another representing the background. Applying the DP algorithm to human chromosome fragments yields a MAP estimate of roughly 12–15 segments per megabase, closely matching known CpG island annotations. The resulting precision (≈92 %) and recall (≈88 %) outperform a standard HMM‑based CpG island finder, while requiring far fewer parameters (no transition matrix) and minimal tuning.
Complexity analysis shows that the algorithm scales quadratically with sequence length, which is acceptable for moderate‑size sequences (up to a few hundred kilobases) on standard desktop hardware. For whole‑genome scales, the author suggests possible extensions: sliding‑window approximations, parallelization on GPUs, or hierarchical segmentation to keep the quadratic term manageable.
Limitations are acknowledged. Very short segments may produce unreliable Markov likelihoods; imposing a minimum segment length or adding Bayesian regularization can mitigate this. The quadratic time cost becomes prohibitive for multi‑gigabase genomes without further optimization. Moreover, the computational burden grows linearly with the number of candidate models; therefore, model selection or clustering before segmentation is advisable when many models are considered.
In conclusion, the paper introduces a novel, transition‑free Markov‑model framework for sequence segmentation, provides an efficient DP algorithm to compute the full likelihood distribution and its moments, and demonstrates a practical Bayesian criterion for determining the number of segments. The approach offers a parsimonious alternative to HMM‑based segmentation, with clear advantages in parameter economy and ease of application to biological problems such as CpG island detection. Future work will likely focus on scaling the algorithm to whole‑genome data, integrating model selection strategies, and extending the framework to other domains where segment homogeneity is naturally described by distinct stochastic processes.
Comments & Academic Discussion
Loading comments...
Leave a Comment