In this paper, we extend a previously developed recursive entropic segmentation scheme for applications to biological sequences. Instead of Bernoulli chains, we model the statistically stationary segments in a biological sequence as Markov chains, and define a generalized Jensen-Shannon divergence for distinguishing between two Markov chains. We then undertake a mean-field analysis, based on which we identify pitfalls associated with the recursive Jensen-Shannon segmentation scheme. Following this, we explain the need for segmentation optimization, and describe two local optimization schemes for improving the positions of domain walls discovered at each recursion stage. We also develop a new termination criterion for recursive Jensen-Shannon segmentation based on the strength of statistical fluctuations up to a minimum statistically reliable segment length, avoiding the need for unrealistic null and alternative segment models of the target sequence. Finally, we compare the extended scheme against the original scheme by recursively segmenting the Escherichia coli K-12 MG1655 genome.
Deep Dive into Extending the Recursive Jensen-Shannon Segmentation of Biological Sequences.
Large-scale genomic rearrangements, such as transpositions, inversions, and horizontal gene transfer (HGT), play important roles in the evolution of bacteria. Biological functions can be lost or gained in such recombination events. For example, virulence genes are frequently found near the boundaries of HGT islands, suggesting that virulence arises from the incorporation of foreign genetic material [1]-[3]. In a recent essay, Goldenfeld and Woese argued that the mosaic nature of bacterial genomes resulting from such large-scale genomic rearrangements requires us to rethink familiar notions of phylogeny and evolution [4]. As a first step in unraveling the complex sequence of events that shape the probable evolutionary history of a bacterium, we need to identify the recombination sites bounding recombined segments, which are frequently distinguishable statistically from their flanking sequences.
We do this by modeling the native and recombined segments in a genome as stationary Markov chains. The boundaries between such statistically stationary segments (or domains) are called change points in the statistical modeling literature, or domain walls in the statistical physics literature. Given a nucleotide or amino acid sequence of length N, the problem of finding M segments generated by P stationary Markov chains is called segmentation [5], [6]. Many segmentation schemes can be found in the literature (see minireview by Braun and Muller [7]).
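To make the terminology concrete, the following minimal sketch builds a toy sequence with M = 3 segments drawn from P = 2 stationary first-order Markov chains, and records the positions of the two domain walls. Everything in it is an illustrative assumption rather than something taken from the paper: the helper names (`sticky`, `sample_markov_chain`, `build_toy_sequence`), the transition probabilities, and the segment lengths are all hypothetical.

```python
import random

ALPHABET = "ACGT"

def sticky(p_repeat):
    """Hypothetical first-order transition matrix: repeat the previous symbol
    with probability p_repeat, otherwise pick one of the other three uniformly."""
    return {s: {t: (p_repeat if t == s else (1.0 - p_repeat) / 3.0) for t in ALPHABET}
            for s in ALPHABET}

def sample_markov_chain(transition, length, start, rng):
    """Sample `length` symbols from a first-order Markov chain, beginning at `start`."""
    seq = [start]
    while len(seq) < length:
        symbols, weights = zip(*transition[seq[-1]].items())
        seq.append(rng.choices(symbols, weights=weights)[0])
    return "".join(seq)

def build_toy_sequence(plan, seed=0):
    """Concatenate segments described by (transition matrix, length) pairs.
    Returns the sequence and the interior domain-wall positions."""
    rng = random.Random(seed)
    pieces, walls, position = [], [], 0
    for transition, length in plan:
        pieces.append(sample_markov_chain(transition, length, rng.choice(ALPHABET), rng))
        position += length
        walls.append(position)
    return "".join(pieces), walls[:-1]  # the last boundary is the sequence end, not a wall

# Two segment types (P = 2) arranged as three segments (M = 3).
native, foreign = sticky(0.7), sticky(0.1)
sequence, domain_walls = build_toy_sequence([(native, 300), (foreign, 200), (native, 300)])
```

A segmentation scheme is given only `sequence` and must recover `domain_walls`; the toy example also shows why M and P need not coincide, since the first and third segments share one model.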
For M = P both unknown, Gionis and Mannila showed that finding the optimal segmentation for a given sequence is NP-hard [8]. Therefore, some segmentation schemes assume P = M, while others assume that P is small, and known beforehand.
Of these, the recursive segmentation scheme introduced by Bernaola-Galván et al. [9], [10] is conceptually appealing because of its simplicity. In this scheme, a given sequence is recursively partitioned into finer and finer segments, all modeled as Bernoulli chains, based on their Jensen-Shannon divergences. The unknown number of segments M (assumed to be equal to the number of segment types P) is then discovered when segmentation is terminated based on an appropriate statistical criterion. In this paper, we describe our extensions to this recursive segmentation scheme. In Sec. II, we explain how Markov chains model the short-range correlations in a given sequence better than Bernoulli chains, and thereafter generalize the Jensen-Shannon divergence to distinguish between two Markov chains. In Sec. III, we carry out a mean-field analysis to better understand the recursive segmentation scheme and its pitfalls, before describing two local segmentation optimization algorithms for improving the statistical significance of domain walls in Sec. IV. We also develop in Sec. V a new termination criterion, based on the intrinsic statistical fluctuations of the sequence to be segmented, for the recursive segmentation scheme. Finally, we compare our extended scheme against the original scheme by recursively segmenting the Escherichia coli K-12 MG1655 genome in Sec. VI, before concluding in Sec. VII.
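The recursive idea can be sketched in a few lines of Python. The sketch below models every segment as a Bernoulli chain, places each cut where the length-weighted Jensen-Shannon divergence between the left and right 1-mer distributions is largest, and recurses on both halves. The stopping rule is a deliberate placeholder: the parameters `min_gain` (a threshold on N times the divergence, in nats) and `min_len` are hypothetical, and stand in for neither the significance test of [9], [10] nor the termination criterion developed in Sec. V.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution given by `counts`."""
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def best_cut(seq, min_len=1):
    """Return (cut, js): the cut maximizing the length-weighted Jensen-Shannon
    divergence between seq[:cut] and seq[cut:], or (None, 0.0) if none is positive."""
    n = len(seq)
    total = Counter(seq)
    h_all = entropy(total)
    left = Counter()
    best_pos, best_js = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        if i < min_len or n - i < min_len:
            continue  # keep both candidate segments at least min_len long
        right = total - left  # Counter subtraction keeps only positive counts
        js = h_all - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js

def segment(seq, offset=0, min_gain=10.0, min_len=20):
    """Recursively split `seq`; return the sorted list of domain-wall positions.
    Placeholder stopping rule (an assumption): stop when N * JS < min_gain."""
    cut, js = best_cut(seq, min_len)
    if cut is None or len(seq) * js < min_gain:
        return []
    return (segment(seq[:cut], offset, min_gain, min_len)
            + [offset + cut]
            + segment(seq[cut:], offset + cut, min_gain, min_len))

walls = segment("A" * 200 + "C" * 200 + "A" * 200)
```

Because each recursion stage commits to a single best cut before looking at the halves, the position of an early cut is never revisited in this sketch.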
In the earliest recursive segmentation scheme proposed by Bernaola-Galván et al. [9], [10], the divergence between 1-mer statistics from two or more subsequences of a given sequence is examined. These subsequences are modeled as Bernoulli chains (equivalent to Markov chains of order K = 0), even though it is well known that biological sequences exhibit dinucleotide correlations and codon biases [11]-[15]. Later versions of the recursive segmentation scheme examine higher order subsequence statistics, so as to take advantage of different codon usage in coding and noncoding regions [16]-[18], but these are still assumed to be drawn from Bernoulli chains, albeit with extended alphabets. The first study we are aware of modeling subsequences as Markov chains for recursive segmentation is the work by Thakur et al. [19].
In this section, we will explain why the observed dinucleotide frequencies and codon biases in biological sequences can be better modeled by Markov chains of order K > 0 than by Bernoulli chains with the same higher-order statistics. We will then generalize the Jensen-Shannon divergence, so that it can be used in entropic segmentation schemes to quantify the statistical difference between Markov chains of order K > 0. Finally, we discuss the added modeling complexities associated with using Markov-chain orders that vary from segment to segment, and change when segments are further divided.
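One natural way to quantify the difference between two order-K Markov chains, assumed here for illustration and not quoted from the definition used in this paper, is a log-likelihood ratio: the maximized log-likelihood of modeling the left and right subsequences separately, minus that of modeling them with a single pooled chain. For K = 0 this quantity equals N times the usual Jensen-Shannon divergence of the 1-mer distributions. The function names below are hypothetical, and the sketch ignores the K transitions that straddle the cut.

```python
import math
from collections import Counter

def transition_counts(seq, K):
    """Count (context, next symbol) pairs, where the context is the preceding K symbols."""
    return Counter((seq[i - K:i], seq[i]) for i in range(K, len(seq)))

def max_log_likelihood(counts):
    """Maximized log-likelihood sum_{t,s} f_ts * log(f_ts / f_t) of an order-K chain,
    using the maximum-likelihood transition probabilities f_ts / f_t."""
    context_totals = Counter()
    for (context, _), f in counts.items():
        context_totals[context] += f
    return sum(f * math.log(f / context_totals[context])
               for (context, _), f in counts.items())

def markov_js_divergence(seq, cut, K):
    """Log-likelihood gain (in nats) from modeling seq[:cut] and seq[cut:] as two
    separate order-K Markov chains instead of one pooled chain."""
    left = transition_counts(seq[:cut], K)
    right = transition_counts(seq[cut:], K)
    pooled = left + right  # simplification: transitions across the cut are dropped
    return (max_log_likelihood(left) + max_log_likelihood(right)
            - max_log_likelihood(pooled))

# Two halves with identical 1-mer composition but different dinucleotide structure.
toy = "AC" * 100 + "AACC" * 50
order0 = markov_js_divergence(toy, 200, K=0)
order1 = markov_js_divergence(toy, 200, K=1)
```

In this toy case the K = 0 divergence vanishes because both halves contain equal numbers of A and C, while the K = 1 divergence is clearly positive, which is the kind of short-range correlation a Bernoulli-chain model cannot see.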
October 22, 2018 DRAFT
Given a sequence $\mathbf{x} = x_1 x_2 \cdots x_N$, where the symbols $x_i$ are drawn from an alphabet $\mathcal{S} = \{\alpha_s\}_{s=1}^{S}$ containing $S$ letters, we want to model $\mathbf{x}$ as being generated sequentially from a single stationary stochastic process. In the bioinformatics literature, $\mathbf{x}$ is usually modeled as a Bernoulli chain or as a Markov chain. For a Bernoulli chain, the $N$ symbols are obtained from $N$ independent trials, governed by t
…(Full text truncated)…