Information Distance: New Developments
In pattern recognition, learning, and data mining one obtains information from information-carrying objects. This involves an objective definition of the information in a single object, the information to go from one object to another object in a pair of objects, the information to go from one object to any other object in a multiple of objects, and the shared information between objects. This is called “information distance.” We survey a selection of new developments in information distance.
💡 Research Summary
The paper provides a comprehensive survey of recent advances in the theory and application of information distance, a concept that quantifies the minimal algorithmic information required to transform one object into another. Beginning with the foundational definition based on Kolmogorov complexity—D(x, y) = max{K(x|y), K(y|x)}—the authors discuss its mathematical properties such as symmetry, the triangle inequality, and its role as a true metric. Recognizing that raw information distance is sensitive to object size, the paper introduces the Normalized Information Distance (NID), defined as NID(x, y) = D(x, y) / max{K(x), K(y)}. NID yields a scale‑free similarity measure bounded between 0 and 1, making it suitable for comparing objects of disparate lengths and complexities.
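In display form, the two definitions the summary builds on are

```latex
D(x,y) \;=\; \max\{K(x \mid y),\, K(y \mid x)\},
\qquad
\mathrm{NID}(x,y) \;=\; \frac{D(x,y)}{\max\{K(x),\, K(y)\}} \;\in\; [0,1],
```

where K(x|y) is the conditional Kolmogorov complexity of x given y. Symmetry is immediate from the max, and the triangle inequality holds up to the additive logarithmic terms that are standard precision in this theory.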
Since Kolmogorov complexity is incomputable, practical implementations rely on compression algorithms to approximate K(·). The authors evaluate several compressors (LZ77, PPM, bzip2, etc.), analyzing how compression efficiency, memory usage, and runtime affect distance estimates across different data types such as text, images, and biological sequences. They demonstrate that the choice of compressor can significantly bias similarity scores, and they propose guidelines for selecting appropriate compressors based on domain characteristics.
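In practice, the compression-based approximation of NID is realized as the Normalized Compression Distance (NCD), which replaces the incomputable K(·) with the compressed length C(·) produced by a real compressor. A minimal sketch using Python's standard-library `zlib` (an LZ77-family compressor; PPM or bzip2 would slot in the same way, with the biases discussed above):

```python
import zlib


def approx_k(data: bytes) -> int:
    """Compressed length in bytes: a crude, compressor-dependent proxy for K(.)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, the computable stand-in for NID:

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    """
    cx, cy = approx_k(x), approx_k(y)
    cxy = approx_k(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Scores near 0 indicate near-duplicates and scores near 1 indicate unrelated objects; the quality of the estimate is bounded by how well the chosen compressor models the data, which is exactly the compressor-selection issue the survey raises.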
Beyond deterministic compression, the survey highlights probabilistic extensions. Expected Information Distance (EID) incorporates stochastic models—Markov chains, Bayesian networks—to capture average transformation costs under uncertainty. By leveraging Kullback‑Leibler divergence, EID provides robust similarity estimates in noisy environments, particularly for time‑series and structured data where deterministic compression may fail.
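The KL-divergence ingredient is standard; as an illustrative sketch for discrete distributions (the EID machinery itself is model-specific and not reproduced here):

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumes q_i > 0 wherever p_i > 0 (absolute continuity); terms with
    p_i == 0 contribute nothing, by the convention 0 * log 0 = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that KL divergence is nonnegative, zero only when P = Q, and asymmetric; probabilistic distance constructions built on it must account for that asymmetry.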
The paper also generalizes the pairwise notion to sets of objects. The Multiset Information Distance (MID) aggregates pairwise distances via averaging or taking the maximum, enabling direct use in clustering, hierarchical analysis, and dimensionality reduction. Empirical results show that MID‑based clustering outperforms classic methods such as k‑means and DBSCAN in preserving intrinsic data structure.
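Taking the summary's description at face value (aggregation of pairwise distances by average or maximum; the function names here are illustrative, not from the paper), a MID-style score over a multiset can be sketched as:

```python
from itertools import combinations
from statistics import mean


def multiset_distance(objects, pairwise, aggregate=max):
    """Aggregate pairwise distances over a multiset of objects.

    `pairwise` would typically be an NID approximation such as NCD;
    `aggregate` is max or mean, per the two schemes described above.
    """
    return aggregate(pairwise(a, b) for a, b in combinations(objects, 2))
```

The resulting scores can populate a distance matrix fed directly to hierarchical clustering or other distance-based methods, which is how the clustering experiments described above would consume them.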
Four application domains illustrate the practical impact of these developments. In genomics, NID‑based unsupervised clustering of DNA sequences achieves higher accuracy and lower computational cost than traditional BLAST pipelines. In computer vision, a hybrid approach combining compression‑based distances with deep feature embeddings yields state‑of‑the‑art image similarity scores. In natural language processing, compression distances correlate strongly with semantic similarity, and when integrated with topic models they improve document clustering. In cybersecurity, binary similarity measured via compression distance enables effective detection of malware variants.
Finally, the authors identify open challenges: refining approximations of Kolmogorov complexity, integrating compression and probabilistic models into a unified framework, scaling distance computation to high‑dimensional data, and establishing theoretical convergence guarantees for learning algorithms that employ information distance. By synthesizing mathematical rigor with empirical evidence, the paper positions information distance as a versatile tool bridging theory and practice across diverse fields.