Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative in the form of mutual information (MI), which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, yielding thereby the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives, for animal mitochondrial DNA, estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes this issue. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated into existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments.


💡 Research Summary

The paper addresses a fundamental limitation of conventional sequence‑alignment‑based phylogenetic distance measures: the scoring schemes used by alignment programs are heuristic and cannot be interpreted as true evolutionary distances. Consequently, practitioners often resort to model‑dependent distances such as p‑distances, log‑det (paralinear) distances, or simplistic substitution models, all of which introduce bias when the underlying evolutionary process deviates from the assumed model.
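To make these baseline measures concrete, here is a minimal Python sketch (not from the paper; the function names and toy sequences are ours) that computes a p-distance and a simple log-det distance from a pair of already aligned sequences. The log-det form shown is the common -(1/4) ln det of the joint frequency matrix; paralinear variants additionally correct for the marginal base frequencies.

```python
import numpy as np

def p_distance(seq1, seq2):
    """p-distance: fraction of aligned, ungapped sites at which the sequences differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != '-' and b != '-']
    return sum(a != b for a, b in pairs) / len(pairs)

def logdet_distance(seq1, seq2, alphabet="ACGT"):
    """Log-det distance from the 4x4 joint frequency matrix of aligned site pairs."""
    idx = {c: i for i, c in enumerate(alphabet)}
    F = np.zeros((4, 4))
    for a, b in zip(seq1, seq2):
        if a in idx and b in idx:
            F[idx[a], idx[b]] += 1.0
    F /= F.sum()
    # Paralinear variants also subtract the log of the marginal frequencies.
    return -0.25 * np.log(np.linalg.det(F))

# Hypothetical toy alignment: 4 of 20 sites differ.
x = "AAAAACCCCCGGGGGTTTTT"
y = "AAAATCCCCAGGGGCTTTTG"
print(p_distance(x, y))        # 0.2
print(logdet_distance(x, y))   # positive divergence estimate
```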

To overcome this, the authors propose using mutual information (MI) as an objective, model‑independent similarity metric. MI quantifies the amount of shared information between two random variables and, in principle, provides a direct measure of sequence similarity that does not rely on any explicit evolutionary model. Two complementary theoretical frameworks are explored. The first is algorithmic (Kolmogorov) information theory: after obtaining a global pairwise alignment, the two sequences are concatenated and compressed with a standard lossless compressor (e.g., bzip2, PPM). The compressed length of the concatenated string, together with the compressed lengths of the individual sequences, yields an estimate of the Kolmogorov‑based MI. The second framework is classical Shannon information theory, where single‑letter frequencies and joint frequencies are used to compute entropies H(X), H(Y) and the joint entropy H(X,Y); MI follows from the identity I(X;Y)=H(X)+H(Y)−H(X,Y). Both approaches require only a standard alignment as input and no additional parameter tuning.
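The following is a compact sketch, under the assumption that x and y are the two aligned sequences given as plain strings, of how both estimators could be computed; bzip2 stands in for a generic lossless compressor, and all helper names are ours rather than the paper's. On short toy strings the compression-based estimate is dominated by compressor overhead and only becomes meaningful for genome-scale inputs.

```python
import bz2
import math
from collections import Counter

def compressed_len(s: str) -> int:
    """Byte length of the bzip2-compressed string, a proxy for Kolmogorov complexity C(s)."""
    return len(bz2.compress(s.encode()))

def algorithmic_mi(x: str, y: str) -> int:
    """Kolmogorov-style MI estimate: I(x:y) ~ C(x) + C(y) - C(xy)."""
    return compressed_len(x) + compressed_len(y) - compressed_len(x + y)

def shannon_mi(x: str, y: str) -> float:
    """Single-letter Shannon MI over aligned columns: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    def entropy(counts: Counter) -> float:
        n = sum(counts.values())
        return -sum(c / n * math.log2(c / n) for c in counts.values())
    return entropy(Counter(x)) + entropy(Counter(y)) - entropy(Counter(zip(x, y)))

# Usage on hypothetical toy sequences:
x = "ACGTACGTACGTACGTACGT"
y = "ACGTACGAACGTTCGTACGG"
print(algorithmic_mi(x, y), shannon_mi(x, y))
```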

A key insight of the study is that the widely used Normalized Compression Distance (NCD) suffers from non‑additivity, making it unsuitable as a phylogenetic metric because distances on a tree must be additive along branches. The authors therefore introduce a simple modification that restores additivity: they take the logarithm of the compressed concatenated length and subtract the average of the logarithms of the individual compressed lengths, i.e., d = log C(xy) − ½ [log C(x) + log C(y)], where C(·) denotes the compressed length used as a proxy for algorithmic complexity.
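Reading the modification literally from the description above, a minimal sketch contrasting the standard NCD with the additivity-restoring variant might look as follows; C(·) is again approximated by a compressed length, and the exact normalization and compressor used in the paper may differ.

```python
import bz2
import math

def C(s: str) -> int:
    """Compressed length as a proxy for algorithmic complexity."""
    return len(bz2.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    """Standard normalized compression distance; not additive along tree branches."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def additive_mi_distance(x: str, y: str) -> float:
    """Modification described above: d = log C(xy) - [log C(x) + log C(y)] / 2."""
    return math.log(C(x + y)) - 0.5 * (math.log(C(x)) + math.log(C(y)))
```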

