Nonapproximability of the Normalized Information Distance
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called 'normalized compression distance' and it is trivially computable. It is a parameter-free similarity measure based on compression, and is used in pattern recognition, data mining, phylogeny, clustering, and classification. The complexity properties of its theoretical precursor, the NID, have been open. We show that the NID is neither upper semicomputable nor lower semicomputable up to any reasonable precision.
💡 Research Summary
The paper investigates the computability properties of the Normalized Information Distance (NID), a theoretical similarity metric derived from Kolmogorov complexity. While the practical counterpart, Normalized Compression Distance (NCD), is trivially computable by compressing files with real‑world compressors, the computability status of NID itself has remained open. The authors prove that NID is neither upper semicomputable nor lower semicomputable to any non‑trivial precision.
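The computable surrogate mentioned above can be sketched in a few lines. The following uses the standard NCD formula, NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, with zlib standing in for "a real‑world compressor"; the choice of compressor and the sample strings are illustrative assumptions, not from the paper.

```python
import zlib

def C(data: bytes) -> int:
    # Proxy for Kolmogorov complexity: length of the zlib-compressed string.
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance, the computable surrogate for NID:
    # NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

doc1 = b"the quick brown fox jumps over the lazy dog " * 20
doc2 = doc1 + b"an extra closing sentence that differs slightly"
doc3 = bytes(range(256)) * 4  # structurally unrelated byte pattern

print(ncd(doc1, doc1))  # close to 0: identical inputs
print(ncd(doc1, doc2))  # small: heavy overlap
print(ncd(doc1, doc3))  # larger: little shared structure
```

Because real compressors only approximate Kolmogorov complexity, NCD values can stray slightly outside [0, 1]; the ordering of distances, not their absolute values, is what applications rely on.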
The authors begin by recalling that Kolmogorov complexity K(x) is upper semicomputable but not lower semicomputable, a classic result in algorithmic information theory. NID is defined as
NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.
Both numerator and denominator involve Kolmogorov complexities, making NID a ratio of two non‑computable quantities. The paper formalizes the notions of upper and lower semicomputability: a function f is upper semicomputable if there exists a computable, monotonically decreasing sequence of rational numbers converging to f from above; lower semicomputability is defined analogously from below.
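The shape of an upper semicomputation can be illustrated with a toy sketch. A genuine upper semicomputation of K(x) dovetails all programs and lowers the bound whenever a shorter program halts with output x; here, as an illustrative analogy (not the paper's construction), successive real‑world compressors play the role of successive stages, each allowed to lower the current bound but never raise it.

```python
import bz2
import lzma
import zlib

def upper_bounds(x: bytes):
    # Emit a non-increasing sequence of upper bounds on the complexity of x,
    # mirroring an upper semicomputation: each stage may lower the current
    # bound but never raises it.
    best = len(x) + 1  # trivial bound: "print the literal string"
    yield best
    for compress in (zlib.compress, bz2.compress, lzma.compress):
        best = min(best, len(compress(x)))
        yield best

bounds = list(upper_bounds(b"abab" * 200))
print(bounds)  # monotonically non-increasing by construction
```

The theorem says no analogous sequence, from above or below, can converge to NID(x, y) within any non‑trivial precision.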
The central theorem consists of two parts: (1) no algorithm can produce a decreasing sequence of rational approximations that converges from above to within any fixed ε > 0 of NID; (2) no algorithm can produce an increasing sequence that converges from below to within any fixed ε > 0 of NID. The proofs are by contradiction and rely heavily on the fact that K and K(·|·) are not lower semicomputable.
For the upper‑semicomputability claim, assume a computable decreasing sequence {g_i} of rationals with g_i ≥ NID(x, y) that converges to within ε of NID(x, y). Because the denominator max{K(x), K(y)} appears in NID, such a sequence, combined with the upper semicomputability of the Kolmogorov complexities involved, would yield lower bounds on K(x) and K(y), contradicting the known fact that K is not lower semicomputable. The argument uses a diagonalisation technique: by enumerating all candidate upper‑semicomputable approximators, one constructs a pair (x, y) on which every candidate fails to stay within the prescribed bound.
The lower‑semicomputability claim proceeds similarly. If a computable increasing sequence {h_i} converged to NID from below, then the numerator max{K(x|y), K(y|x)} could be approximated from below, yielding a lower‑semicomputable approximation of conditional Kolmogorov complexity, which is known to be impossible. Again, diagonalisation and an “information gap” argument are employed: the numerator and denominator depend on independent complexity terms, so accurate approximation of one forces accurate knowledge of the other, which cannot be achieved by any semicomputable process.
Consequences of the theorem are twofold. First, it provides a rigorous justification for the widespread use of NCD as a practical surrogate: NID cannot be directly computed or even approximated to any reasonable precision, so any algorithmic approach must rely on compression‑based heuristics. Second, it delineates a clear boundary between theoretical similarity measures that are mathematically elegant but computationally inaccessible, and those that are algorithmically tractable. The authors suggest future work on quantifying the gap between NCD and NID, exploring alternative metrics that retain some of NID’s universality while being semicomputable, and investigating whether restricted versions of NID (e.g., limited to specific classes of strings) might admit approximations.
In summary, the paper settles a long‑standing open problem by showing that the Normalized Information Distance is fundamentally non‑approximable: it is neither upper nor lower semicomputable within any fixed rational error bound. This result underscores the essential role of compression‑based approximations in real‑world applications and clarifies the theoretical limits of algorithmic information‑theoretic similarity measures.