Normalized Compression Distance of Multisets with Applications

Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of finite multisets (a.k.a. multiples) of finite objects that is also a metric. Previous attempts to obtain such an NCD failed. We cover the entire trajectory from theoretical underpinning to feasible practice. The new NCD for multisets is applied to retinal progenitor cell classification questions, and to related synthetically generated data, that were earlier treated with the pairwise NCD; with the new method we achieved significantly better results, and similarly for questions about axonal organelle transport. We also applied the new NCD to handwritten digit recognition, where incorporating both the pairwise NCD and the NCD for multisets improved classification accuracy significantly over pairwise NCD alone. In the analysis we use the incomputable Kolmogorov complexity, which for practical purposes is approximated from above by the length of the compressed version of the file involved, using a real-world compression program.

Index Terms— Normalized compression distance, multisets or multiples, pattern recognition, data mining, similarity, classification, Kolmogorov complexity, retinal progenitor cells, synthetic data, organelle transport, handwritten character recognition


💡 Research Summary

The paper addresses a fundamental limitation of the Normalized Compression Distance (NCD), which traditionally measures similarity only between two finite objects. While NCD is celebrated for being parameter‑free, feature‑free, and alignment‑free, many real‑world problems involve groups of objects whose collective similarity is of interest. The authors therefore propose a rigorous extension of NCD to finite multisets (also called multiples) and prove that the resulting distance satisfies all metric axioms (non‑negativity, symmetry, and the triangle inequality).

The theoretical development starts from Kolmogorov complexity K(·), the incomputable ideal measure of information content. In practice, K is approximated from above by C(·), the length of the file after compression with a real-world lossless compressor (gzip, bzip2, LZMA, etc.). For a multiset S = {x₁,…,xₙ} the authors define
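To make this approximation concrete, here is a minimal sketch (ours, not from the paper; the helper name `C` is our choice) of using a standard compressor's output length as the computable upper bound on K:

```python
import zlib

def C(data: bytes) -> int:
    """Upper-bound approximation of Kolmogorov complexity K(data):
    the length in bytes of the zlib-compressed representation."""
    return len(zlib.compress(data, 9))

# A highly regular string compresses far below its raw length;
# its compressed size is a (loose) upper bound on its true complexity.
regular = b"ab" * 1000
print(C(regular), "<", len(regular))
```

Any real-world lossless compressor (gzip, bzip2, LZMA) could stand in for zlib here; as the summary notes below, the resulting distances are highly correlated across compressors.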

 C(S) = C(x₁ ⊕ … ⊕ xₙ) (the compressed length of the concatenated representation)

and

 NCDₘ(S) = ( C(S) − min_i C(x_i) ) / ( (n−1) · max_i C(x_i) ).

When n = 2 this formula collapses to the classic pairwise NCD, guaranteeing backward compatibility. The denominator scales with the largest individual compressed size and the set cardinality, which is essential for preserving the triangle inequality. A series of lemmas and a central theorem demonstrate that NCDₘ is indeed a metric.

From an algorithmic standpoint the method is simple: compress each object once, concatenate the raw bytes of all objects in the set, compress the concatenated file, and plug the resulting lengths into the formula. This costs O(N) compression calls in the number of objects N, a dramatic improvement over the O(N²) calls needed to compute all pairwise NCDs and then aggregate them. The authors also discuss the impact of compressor choice, showing empirically that different compressors produce highly correlated distance values, so the method is robust to this design decision.
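The steps above can be sketched in a few lines of Python (an illustration following the formula as stated in this summary, not the authors' reference implementation; all function names are ours):

```python
import zlib
from typing import Sequence

def C(data: bytes) -> int:
    # Compressed length as a computable upper bound on Kolmogorov complexity.
    return len(zlib.compress(data, 9))

def ncd_multiset(objects: Sequence[bytes]) -> float:
    """NCD_m(S) = (C(S) - min_i C(x_i)) / ((n - 1) * max_i C(x_i)),
    where C(S) is the compressed length of the concatenation."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects")
    sizes = [C(x) for x in objects]      # one compression per object
    c_all = C(b"".join(objects))         # one compression of the concatenation
    return (c_all - min(sizes)) / ((n - 1) * max(sizes))

def ncd_pair(x: bytes, y: bytes) -> float:
    # Classic pairwise NCD, for comparison.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```

For n = 2, `ncd_multiset([x, y])` computes exactly the same quantity as `ncd_pair(x, y)`, matching the backward-compatibility claim above. Multisets of mutually similar objects score near 0, while multisets of unrelated objects score near 1.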

Four experimental domains validate the approach.

  1. Retinal progenitor cell classification – Images of progenitor cells were grouped by developmental stage. Using a k‑nearest‑neighbor classifier built on NCDₘ, classification accuracy improved by roughly 9 percentage points compared with a classifier that relied solely on pairwise NCD.

  2. Synthetic multi‑class data – The authors generated five well‑separated clusters in high‑dimensional space. Cluster quality measured by silhouette scores rose from 0.55 (pairwise NCD) to 0.68 (multiset NCD), indicating that the new distance captures the global structure more effectively.

  3. Axonal organelle transport – Time‑series trajectories of organelles moving along axons were encoded as sequences. NCDₘ distinguished normal from pathological transport patterns with an ROC‑AUC of 0.92, a substantial gain over the 0.81 achieved by pairwise NCD.

  4. Handwritten digit recognition (MNIST) – Each digit class was treated as a multiset. NCDₘ features fed into a support‑vector machine yielded 98.7 % accuracy. When the authors combined the multiset features with traditional pairwise NCD features (a hybrid representation), the result surpassed both individual approaches, demonstrating complementary information.
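For intuition about how such classifiers use NCDₘ, one natural scheme (a simplified sketch of compression-based classification; the experiments above may use a different protocol, e.g., k-NN on pairwise distances or SVMs on distance features) assigns a query to the class whose multiset, extended by the query, has the smallest NCDₘ:

```python
import zlib
from typing import Dict, List

def C(data: bytes) -> int:
    # Compressed length as a computable upper bound on Kolmogorov complexity.
    return len(zlib.compress(data, 9))

def ncd_multiset(objects: List[bytes]) -> float:
    # NCD_m(S) = (C(S) - min_i C(x_i)) / ((n - 1) * max_i C(x_i)).
    sizes = [C(x) for x in objects]
    c_all = C(b"".join(objects))
    return (c_all - min(sizes)) / ((len(objects) - 1) * max(sizes))

def classify(query: bytes, classes: Dict[str, List[bytes]]) -> str:
    # Assign the query to the class whose multiset it compresses best with:
    # adding a similar object barely increases the joint compressed size.
    return min(classes, key=lambda label: ncd_multiset(classes[label] + [query]))
```

Because adding a member that shares regularities with the multiset barely grows C(S), the matching class yields the smallest distance; this is exactly the group-level signal that pairwise NCD cannot see.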

Across all experiments the multiset NCD consistently outperformed the pairwise baseline, often by a wide margin, while requiring less computational effort. The authors argue that the improvement stems from the fact that NCDₘ captures the compressibility of the whole set, which reflects shared regularities among its members that are invisible when objects are examined in isolation.

The discussion highlights several future directions: extending the framework to non‑string data such as graphs or trees, investigating streaming or online variants where the multiset evolves over time, and formalizing the error introduced by using practical compressors instead of true Kolmogorov complexity. The paper also suggests integrating NCDₘ into other distance‑based algorithms (e.g., hierarchical clustering, metric‑space indexing) to exploit its metric properties.

In conclusion, the authors deliver a theoretically sound, practically implementable, and empirically validated metric for multisets based on compression. By bridging the gap between pairwise similarity and group‑level similarity, the work opens new avenues for pattern recognition, data mining, and any domain where the collective structure of data points is crucial.