Discovering Patterns in Biological Sequences by Optimal Segmentation
Computational methods for discovering patterns of local correlations in sequences are important in computational biology. Here we show how to determine the optimal partitioning of aligned sequences into non-overlapping segments such that positions in the same segment are strongly correlated while positions in different segments are not. Our approach involves discovering the hidden variables of a Bayesian network that interact with observed sequences so as to form a set of independent mixture models. We introduce a dynamic program to efficiently discover the optimal segmentation, or equivalently the optimal set of hidden variables. We evaluate our approach on two computational biology tasks. One task is related to the design of vaccines against polymorphic pathogens and the other task involves analysis of single nucleotide polymorphisms (SNPs) in human DNA. We show how common tasks in these problems naturally correspond to inference procedures in the learned models. Error rates of our learned models for the prediction of missing SNPs are up to 1/3 less than the error rates of a state-of-the-art SNP prediction method. Source code is available at www.uwm.edu/~joebock/segmentation.
💡 Research Summary
The paper addresses a fundamental challenge in computational biology: how to capture local correlations in aligned biological sequences without imposing a globally coupled model such as a full‑length hidden Markov model. The authors formulate the problem as an “optimal segmentation” task—partitioning a set of aligned sequences into non‑overlapping contiguous segments so that positions within the same segment are strongly correlated, while positions belonging to different segments are essentially independent.
To achieve this, they embed a set of hidden variables into a Bayesian network. Each segment is represented by a single hidden variable that acts as a cluster label for all positions in that segment. The hidden variable is connected to every observed nucleotide (or amino‑acid) position in the segment, forming a star‑shaped sub‑network. Conditional on the hidden variable, the observed positions are assumed independent, each following a categorical distribution whose parameters are specific to the hidden state. Consequently, each segment becomes an independent mixture model: the hidden variable selects a mixture component (a “cluster”), and the component generates the observed symbols in that segment.
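The per-segment model described above can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: the segment length, number of hidden states, mixture weights `pi`, and emission tables `theta` are all made-up toy values, and the DNA alphabet is assumed for concreteness. It shows the two operations the text describes: the hidden variable selects a mixture component, and, conditional on it, each position independently emits a symbol from its own categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical segment: 4 aligned positions over the DNA alphabet,
# modeled by a mixture with 2 hidden states (toy sizes for illustration).
ALPHABET = "ACGT"
n_positions, n_states = 4, 2

# pi[k]: mixture weight of hidden state k.
# theta[k, j, s]: probability that state k emits symbol s at position j.
pi = np.array([0.6, 0.4])
theta = rng.dirichlet(np.ones(len(ALPHABET)), size=(n_states, n_positions))

def sample_segment():
    """Generate one sequence for the segment from the mixture model."""
    k = rng.choice(n_states, p=pi)  # hidden cluster label for the segment
    # Conditional on k, positions are independent categorical draws.
    cols = [rng.choice(len(ALPHABET), p=theta[k, j]) for j in range(n_positions)]
    return "".join(ALPHABET[c] for c in cols)

def segment_log_likelihood(seq):
    """log p(seq) for this segment, marginalizing over the hidden state."""
    idx = [ALPHABET.index(c) for c in seq]
    # p(seq | k) is a product over positions; weight by pi and sum over k.
    per_state = pi * np.prod([theta[:, j, s] for j, s in enumerate(idx)], axis=0)
    return np.log(per_state.sum())
```

Fitting `pi` and `theta` to the sequences that fall in a segment is a standard EM problem for a categorical mixture; the paper runs such a fit for every candidate segment considered by the dynamic program.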
The key technical contribution is a dynamic‑programming (DP) algorithm that simultaneously discovers the optimal segment boundaries and the optimal number of hidden states per segment. For every candidate segment length ℓ (up to a user‑defined maximum L) and every admissible number of hidden states k, the algorithm fits a mixture model using Expectation‑Maximization and records a model‑selection score (log‑likelihood, BIC, or MDL). The DP recurrence then computes the best score S(t) for the prefix of the sequence ending at position t as

S(t) = max over 1 ≤ ℓ ≤ L of [ S(t − ℓ) + score(t − ℓ + 1, t) ],  with S(0) = 0,

where score(a, b) is the best model‑selection score, over all admissible numbers of hidden states k, of a single segment spanning positions a through b. Backtracking through the maximizing choices recovers the optimal segmentation of the full alignment.
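The recurrence can be sketched in a few lines. This is a hedged illustration under simplifying assumptions: `seg_score(a, b)` stands in for the per-segment model-selection score (in the paper, the best EM-fitted mixture score over admissible k; here it is a user-supplied black box), positions are 0-indexed half-open ranges, and `max_len` is the user-defined maximum segment length.

```python
import math

def optimal_segmentation(T, seg_score, max_len):
    """DP over prefixes: S[t] = max over lengths l of S[t-l] + seg_score(t-l, t).

    seg_score(a, b) is assumed to return the model-selection score of a
    single segment covering positions a..b-1 (a hypothetical callback
    standing in for the paper's EM-fitted mixture score).
    """
    S = [-math.inf] * (T + 1)   # S[t]: best score for the prefix of length t
    S[0] = 0.0                  # empty prefix
    back = [0] * (T + 1)        # back-pointer: start of the last segment
    for t in range(1, T + 1):
        for l in range(1, min(max_len, t) + 1):
            cand = S[t - l] + seg_score(t - l, t)
            if cand > S[t]:
                S[t], back[t] = cand, t - l
    # Recover the segment boundaries by following back-pointers from T.
    bounds, t = [], T
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    return S[T], bounds[::-1]
```

With a toy score that rewards length-3 segments, `optimal_segmentation(9, lambda a, b: 1.0 if b - a == 3 else 0.0, 4)` partitions positions 0–8 into three length-3 segments. The cost is O(T · L) calls to `seg_score`; the expensive part in practice is the EM fit hidden inside each call, which is why the paper caps the segment length.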