Combinatorial information distance
Let $|A|$ denote the cardinality of a finite set $A$. For any real number $x$ define $t(x)=x$ if $x\geq1$ and $t(x)=1$ otherwise. For any finite sets $A,B$ let $\delta(A,B)=\log_{2}\bigl(t(|B\cap\bar{A}|\,|A|)\bigr)$. We define a new combinatorial distance $d(A,B)=\max\{\delta(A,B),\delta(B,A)\}$, which may be applied to measure the distance between binary strings of different lengths. The distance is based on a classical combinatorial notion of information introduced by Kolmogorov.

(This appears as Technical Report arXiv:0905.2386v4. A shorter version appears in the Proc. of the Mini-Conference on Applied Theoretical Computer Science (MATCOS-10), Slovenia, Oct. 13-14, 2010.)
💡 Research Summary
The paper introduces a novel combinatorial information distance defined on finite sets and demonstrates its applicability to binary strings of arbitrary lengths. For any finite sets A and B, the authors first define an auxiliary quantity δ(A,B)=log₂(t(|B∩Ā|·|A|)), where t(x)=x for x≥1 and t(x)=1 otherwise. This construction measures the amount of “information” that B contributes beyond what is already contained in A, following Kolmogorov’s classical combinatorial notion of information. Because δ is not symmetric, the final distance is defined as d(A,B)=max{δ(A,B),δ(B,A)}. The max operator guarantees symmetry, while the t‑function keeps the logarithm well defined and non‑negative when the product |B∩Ā|·|A| is zero.
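The definitions above transcribe directly into code. The following is a minimal sketch (the function names `t`, `delta`, and `d` mirror the paper's symbols but are otherwise our own); note that B ∩ Ā is simply the set difference B − A:

```python
import math

def t(x):
    """t(x) = x for x >= 1, and 1 otherwise."""
    return x if x >= 1 else 1

def delta(A, B):
    """delta(A, B) = log2(t(|B ∩ Ā| · |A|)); B - A is B ∩ Ā."""
    return math.log2(t(len(B - A) * len(A)))

def d(A, B):
    """d(A, B) = max{delta(A, B), delta(B, A)}."""
    return max(delta(A, B), delta(B, A))

A = {1, 2, 3}
B = {3, 4}
print(d(A, B))   # |A - B| = 2 and |B| = 2, so delta(B, A) = log2(4) = 2.0 dominates
print(d(A, A))   # the product is 0, t maps it to 1, so the self-distance is 0.0
```

The `t` clamp matters only when the product is zero, since cardinality products are integers and therefore either 0 or at least 1.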
The authors rigorously prove that d satisfies the four metric axioms: non‑negativity, identity of indiscernibles, symmetry, and the triangle inequality. The triangle inequality is established by bounding the product terms for three arbitrary sets A, B, and C and then applying the monotonicity of the logarithm together with the t‑function’s lower bound of 1. Consequently, d constitutes a true metric on the space of finite subsets of a universal set.
Computationally, d can be evaluated in linear time with respect to the sizes of the input sets, because only the cardinalities |A|, |B|, |B∩Ā|, and |A∩B̄| are required. The authors suggest representing a binary string s of length n as the set S={i | s_i=1}. Under this representation, strings of different lengths become sets of different cardinalities, and d(S₁,S₂) provides a distance that naturally incorporates length disparity without any padding or alignment.
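Under this encoding, strings of unequal lengths compare directly through their 1‑position sets. A self‑contained sketch (`string_to_set` is our name for the representation S = {i | s_i = 1}):

```python
import math

def t(x):
    return x if x >= 1 else 1

def d(A, B):
    delta = lambda X, Y: math.log2(t(len(Y - X) * len(X)))
    return max(delta(A, B), delta(B, A))

def string_to_set(s):
    """Represent a binary string s as the set of positions holding a 1."""
    return {i for i, bit in enumerate(s) if bit == "1"}

# Strings of different lengths compare directly, with no padding or alignment.
S1 = string_to_set("1011")       # {0, 2, 3}
S2 = string_to_set("10110010")   # {0, 2, 3, 6}
print(d(S1, S2))                 # log2(3) ≈ 1.585: S2 adds one 1-position beyond S1
```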
The paper contrasts d with several well‑known distances. The Hamming distance is limited to equal‑length strings and counts positional mismatches, whereas the Levenshtein (edit) distance handles insertions and deletions but incurs O(mn) time complexity. Normalized Compression Distance (NCD) approximates Kolmogorov complexity via real compressors, but its value depends on the chosen compression algorithm and can be unstable. In contrast, the combinatorial information distance is algorithm‑independent, has O(n) computational cost, and directly reflects the combinatorial information content defined by Kolmogorov.
However, the authors acknowledge limitations. Because d depends solely on the cardinalities of the sets and their differences, it ignores the ordering and structural arrangement of elements: any two strings whose 1‑position sets differ from a reference string by the same counts are equidistant from it, regardless of where the mismatches fall (e.g., “1010” and “1001” lie at the same distance from “1100”). Moreover, the t‑function’s flattening of values below 1 reduces sensitivity when the overlap is very small, which may be undesirable in applications requiring fine‑grained discrimination.
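The insensitivity to mismatch positions is easy to demonstrate with the hypothetical example above (a self‑contained sketch; all function names are ours):

```python
import math

def t(x):
    return x if x >= 1 else 1

def d(A, B):
    delta = lambda X, Y: math.log2(t(len(Y - X) * len(X)))
    return max(delta(A, B), delta(B, A))

def string_to_set(s):
    return {i for i, bit in enumerate(s) if bit == "1"}

ref = string_to_set("1100")   # {0, 1}
a = string_to_set("1010")     # {0, 2}: differs from ref at positions 1 and 2
b = string_to_set("1001")     # {0, 3}: differs from ref at positions 1 and 3
print(d(ref, a), d(ref, b))   # both 1.0: only the mismatch counts matter
```

Both comparisons involve one element gained and one lost against a reference of size two, so the products inside the logarithm agree and the distances coincide.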
Potential applications discussed include clustering of high‑dimensional binary feature vectors (e.g., document term presence, biomarker profiles), rapid similarity assessment of genomic data encoded as presence/absence of mutations, and anomaly detection in network traffic where packet headers are treated as binary feature sets.
Finally, the paper outlines avenues for future work. One direction is to replace the step‑function t with a smoother scaling function to retain sensitivity for tiny overlaps. Another is to incorporate sequential information by extending the definition to consider common subsequences or runs of bits, thereby bridging the gap between pure set‑based and edit‑based distances. A third prospect is to generalize the approach to multisets or non‑binary alphabets, enabling the metric to handle richer data types while preserving its combinatorial information foundation. In summary, the work provides a theoretically sound, computationally efficient metric that extends Kolmogorov’s combinatorial information concept to practical similarity measurement for binary data of varying lengths.