Normalized Information Distance is Not Semicomputable
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the files involved, using a real-world compression program. This practical application is called 'normalized compression distance' and it is trivially computable. It is a parameter-free similarity measure based on compression, and is used in pattern recognition, data mining, phylogeny, clustering, and classification. The complexity properties of its theoretical precursor, the NID, have been open. We show that the NID is neither upper semicomputable nor lower semicomputable.
💡 Research Summary
The paper investigates the computability properties of the Normalized Information Distance (NID), a theoretically defined similarity metric based on Kolmogorov complexity. NID is defined as NID(x, y) = max{K(x | y), K(y | x)} / max{K(x), K(y)}, where K(·) denotes Kolmogorov complexity and K(· | ·) denotes conditional Kolmogorov complexity. While its practical counterpart, the Normalized Compression Distance (NCD), replaces K with the length of compressed files produced by real‑world compressors and is trivially computable, the computability status of NID itself has remained open.
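The NCD mentioned above is commonly defined (following Cilibrasi and Vitányi) as NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, where C(·) is the length of a file after compression. A minimal sketch, using Python's zlib as the stand-in compressor (the choice of compressor is an illustrative assumption, not from the paper):

```python
import os
import zlib

def c(data: bytes) -> int:
    """Compressed length of data under zlib at maximum compression level,
    standing in for the (uncomputable) Kolmogorov complexity K."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# A string is very close to itself, and far from unrelated random bytes.
text = b"the quick brown fox jumps over the lazy dog " * 20
noise = os.urandom(len(text))
print(ncd(text, text))   # small
print(ncd(text, noise))  # close to 1
```

Because zlib exploits the shared repetitions in `text + text`, the self-distance is near 0, while the incompressible `noise` forces the distance toward 1.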
The authors focus on semicomputability, i.e., whether a function can be approximated from above (upper‑semicomputable) or from below (lower‑semicomputable) by a computable, monotone sequence of rational numbers. It is well known that Kolmogorov complexity K is upper‑semicomputable but not lower‑semicomputable. The paper asks whether the ratio that defines NID inherits any of these properties.
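K itself is upper semicomputable because one can dovetail all programs on a universal machine and record the length of the shortest program found so far that prints x; these records form a computable nonincreasing sequence converging to K(x). A loose, hedged analogy with ordinary compressors: taking the running minimum over a growing family of compressors yields a computable nonincreasing sequence of upper bounds on description length (it converges only to the best compressor's output length, not to K):

```python
import bz2
import lzma
import zlib

def upper_bounds(x: bytes):
    """Yield a nonincreasing sequence of upper bounds on the description
    length of x, mimicking how an upper semicomputation refines its
    estimate from above. (Analogy only: a true upper semicomputation of K
    dovetails all programs of a universal machine, which this does not.)"""
    best = len(x)  # the trivial bound: x describes itself
    yield best
    for compress in (zlib.compress, bz2.compress, lzma.compress):
        best = min(best, len(compress(x)))
        yield best

bounds = list(upper_bounds(b"abracadabra" * 50))
print(bounds)  # each value is no larger than the previous one
```

The running minimum is what makes the sequence monotone; without it, a weaker compressor later in the list could push the estimate back up.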
First, the authors suppose, for contradiction, that NID is upper‑semicomputable. Under this hypothesis there would exist, for any pair (x, y), a computable nonincreasing sequence of rational approximations converging to the true NID value. Expanding the definition of NID, such a sequence would have to provide increasingly tight bounds relating the conditional complexities K(x | y) and K(y | x) to the unconditional complexities K(x) and K(y). The authors then give a diagonalisation argument: for any purported upper‑semicomputable approximation, one can effectively produce a pair of strings on which the approximation must eventually underestimate the true conditional complexity, contradicting the known non‑lower‑semicomputability of K. Consequently, NID cannot be upper‑semicomputable.
Second, the authors consider the possibility that NID is lower‑semicomputable. They examine random strings r₁ and r₂ of length n, for which K(r₁) ≈ K(r₂) ≈ n and likewise K(r₁ | r₂) ≈ K(r₂ | r₁) ≈ n. For such pairs the NID value is arbitrarily close to 1. If a computable increasing sequence of rational numbers converged to NID(r₁, r₂), that sequence would effectively yield lower bounds on K(r₁ | r₂) (or K(r₂ | r₁)). However, no computable process can produce unbounded lower bounds on the Kolmogorov complexity of random strings, since K is not lower‑semicomputable. The authors formalise this intuition via a reduction: a lower‑semicomputable NID would yield a lower‑semicomputable approximation to K, contradicting established results. Hence NID is not lower‑semicomputable either.
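This limiting behaviour for random strings is easy to observe with the practical NCD: independent random strings are incompressible for a real compressor, so C(r₁r₂) ≈ C(r₁) + C(r₂) and the distance comes out near 1. A hedged empirical sketch using zlib (it illustrates NCD, not NID itself, since K is uncomputable):

```python
import os
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor C."""
    c = lambda d: len(zlib.compress(d, 9))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# os.urandom output is incompressible for zlib, mimicking algorithmically
# random strings with K(r1) ~ K(r2) ~ n and K(r1 | r2) ~ n.
r1, r2 = os.urandom(4096), os.urandom(4096)
print(ncd(r1, r2))  # close to 1
```

Since zlib gains nothing from seeing r₁ before r₂, C(r₁r₂) is essentially the sum of the individual compressed lengths, which drives the ratio toward 1.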
Having ruled out both directions, the paper concludes that NID is not semicomputable at all; it is a fully non‑semicomputable function. This result clarifies the theoretical landscape: although NID possesses elegant metric properties and serves as a conceptual ideal for similarity measurement, it cannot be approximated by any computable monotone process. Consequently, all practical applications must rely on approximations such as NCD, which replace Kolmogorov complexity with real‑world compression lengths. The authors also discuss the broader implication that any distance function derived from Kolmogorov complexity by taking ratios or other algebraic combinations inherits the underlying non‑semicomputability, reinforcing the deep connection between algorithmic information theory and computational limits.