On Normalized Compression Distance and Large Malware

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD’s theoretical merit rely on certain theoretical properties of compression algorithms. However, we demonstrate that many popular compression algorithms don’t seem to satisfy these theoretical properties. We explore the relationship between some of these properties and file size, demonstrating that this theoretical problem is actually a practical problem for classifying malware with large file sizes, and we then introduce some variants of NCD that mitigate this problem.


💡 Research Summary

This paper critically examines the theoretical foundations of the Normalized Compression Distance (NCD) and its practical efficacy when applied to large files, specifically in the context of malware classification. NCD is a popular similarity metric that leverages standard compression algorithms to approximate the uncomputable Normalized Information Distance (NID). Its theoretical validity as a similarity measure hinges on the compression algorithm satisfying specific axioms of a “normal compressor”: idempotency, monotonicity, symmetry, and distributivity.
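As a rough illustration of the metric described above, the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) can be sketched in a few lines of Python. The choice of zlib and the helper names `csize` and `ncd` are assumptions made for this example, not code from the paper.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed length in bytes, standing in for C(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two inputs share most of their information, and values near 1 suggest they share little. With real compressors the result can fall slightly outside the [0, 1] range.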

The authors first demonstrate a significant gap between theory and practice. In experiments on standard compression corpora (Calgary, Canterbury, Silesia) with files up to 51 MB, they show that widely used compressors (bzip2, zlib, lzma, and PPMZ) severely violate the idempotency axiom, under which |C(XX)| should equal |C(X)| up to an additive O(log n) term, where n is the input length. The deviation grows with file size and far exceeds that O(log n) allowance. The violation is not merely theoretical: it translates into poor practical performance.
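The idempotency check can be reproduced in miniature. The sketch below measures |C(XX)| - |C(X)| for the three compressors available in Python's standard library. It uses pseudo-random bytes rather than the corpora named above, which is an assumption of this example, and the helper names are hypothetical. PPMZ is omitted because it has no standard-library binding.

```python
import bz2
import lzma
import random
import zlib

# Standard-library stand-ins for the compressors discussed above.
COMPRESSORS = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lzma.compress,
}


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Deterministic, essentially incompressible test data (sketch assumption)."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


def idempotency_gap(data: bytes, compress) -> int:
    """|C(XX)| - |C(X)|; an idempotent compressor keeps this near zero."""
    return len(compress(data + data)) - len(compress(data))


if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000, 1_000_000):
        data = pseudo_random_bytes(size)
        gaps = {name: idempotency_gap(data, c) for name, c in COMPRESSORS.items()}
        print(size, gaps)
```

On incompressible input the gap for a given compressor should stay small while X fits inside its window or block, and grow to roughly the size of X once it does not. The exact crossover depends on each compressor's settings.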

The practical impact is evaluated using a malware family classification task on large Android APK files (up to 15.4 MB) from the Android Malware Genome Project. Using a simple 1-Nearest Neighbor classifier with a single reference sample per family, classification accuracy varied drastically: lzma achieved 59.7%, PPMZ 44.4%, while bzip2 (29.8%) and zlib (33.3%) performed near random guessing. Restricting the experiment to smaller files (<200 KB) dramatically improved accuracy for bzip2 and lzma, confirming file size as a critical factor in NCD’s performance degradation.
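The classification setup described above can be sketched as a nearest-reference lookup. This is a minimal illustration assuming zlib as the compressor and in-memory byte strings as samples. The function name `classify_1nn` and the list-of-pairs reference format are hypothetical, not taken from the paper.

```python
import zlib
from typing import Iterable, Tuple


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance using zlib as the compressor."""
    cx, cy = (len(zlib.compress(b, 9)) for b in (x, y))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify_1nn(sample: bytes, references: Iterable[Tuple[str, bytes]]) -> str:
    """Return the family label of the reference with the smallest NCD to `sample`."""
    label, _ = min(references, key=lambda ref: ncd(sample, ref[1]))
    return label
```

Usage would look like `classify_1nn(apk_bytes, [("family_a", ref_a), ("family_b", ref_b)])`, where each reference holds the raw bytes of one labeled sample.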

The paper identifies the root cause as the memory limitations of compression algorithms (e.g., block size, dictionary size). When concatenating two large strings X and Y, the compressor often cannot effectively use information from X to compress Y (or vice versa) if the similar parts are too far apart within the data stream.
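The distance effect is easy to observe with zlib, whose DEFLATE format can only reference the previous 32 KiB of data. The sketch below places unrelated filler between two copies of the same chunk and measures how many bytes the repeat saves. The chunk size, seeds, and function names are assumptions made for illustration.

```python
import random
import zlib


def rand_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes used as incompressible test data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


def savings_from_repeat(gap: int, chunk_len: int = 4096) -> int:
    """Bytes saved by a repeated chunk when `gap` unrelated bytes separate the copies."""
    chunk = rand_bytes(chunk_len, seed=1)
    filler = rand_bytes(gap, seed=2)
    other = rand_bytes(chunk_len, seed=3)
    with_repeat = len(zlib.compress(chunk + filler + chunk, 9))
    without_repeat = len(zlib.compress(chunk + filler + other, 9))
    return without_repeat - with_repeat


if __name__ == "__main__":
    for gap in (1024, 16 * 1024, 64 * 1024):
        print(gap, savings_from_repeat(gap))
```

With a small gap the second copy should cost almost nothing, while a gap larger than the window should leave essentially no savings, because the compressor can no longer see the first copy.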

To mitigate this, the authors propose adaptive variants of NCD. They redefine the distance as NCD_C,J, which uses a combining function J(X,Y) instead of simple concatenation. Two novel combining functions are introduced:

  1. Interleaving (IL): Splits both files into fixed-size blocks and weaves them together (X1Y1X2Y2…).
  2. NCD-Shuffle (NS): Splits files into blocks, computes the NCD between all block pairs from X and Y, and reorders the blocks so that the most similar pairs are adjacent before final compression.
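The two combining functions listed above might be sketched as follows. This is a reading of the summary rather than the paper's implementation: the 4096-byte block size, the greedy pairing used in `shuffle_combine`, the choice of zlib, and all function names are assumptions.

```python
import zlib
from itertools import zip_longest
from typing import Callable, List

Combiner = Callable[[bytes, bytes], bytes]


def concat(x: bytes, y: bytes) -> bytes:
    """Plain concatenation, as used by standard NCD."""
    return x + y


def csize(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes, combine: Combiner = concat) -> float:
    """NCD with a pluggable combining function J(x, y)."""
    cx, cy = csize(x), csize(y)
    return (csize(combine(x, y)) - min(cx, cy)) / max(cx, cy)


def blocks(data: bytes, size: int) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def interleave(x: bytes, y: bytes, size: int = 4096) -> bytes:
    """Weave fixed-size blocks together: X1 Y1 X2 Y2 ..."""
    out = bytearray()
    for bx, by in zip_longest(blocks(x, size), blocks(y, size), fillvalue=b""):
        out += bx
        out += by
    return bytes(out)


def shuffle_combine(x: bytes, y: bytes, size: int = 4096) -> bytes:
    """Place each block of x next to its most similar unused block of y (greedy)."""
    xs, ys = blocks(x, size), blocks(y, size)
    unused = list(range(len(ys)))
    out = bytearray()
    for bx in xs:
        out += bx
        if unused:
            best = min(unused, key=lambda k: ncd(bx, ys[k]))
            unused.remove(best)
            out += ys[best]
    for k in unused:  # leftover blocks of y keep their original order
        out += ys[k]
    return bytes(out)
```

Calling `ncd(x, y, interleave)` or `ncd(x, y, shuffle_combine)` swaps the combining step while leaving the rest of the formula unchanged. The greedy pairing computes an NCD for many block pairs, so it is considerably more expensive than interleaving.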

Re-evaluating the malware classification task with these new methods showed substantial improvements. For instance, bzip2’s accuracy jumped from 29.8% to 52.2% with a single reference sample using interleaving, and from 55.2% to 75.2% with five reference samples. Significant gains were also observed for zlib. These results support the proposed techniques as practical workarounds for the limitations of standard NCD on large files, narrowing the gap between its information-theoretic promise and its real-world behavior.

