A fast compression-based similarity measure with applications to content-based image retrieval


Compression-based similarity measures are effectively employed in applications on diverse data types with an essentially parameter-free approach. Nevertheless, applying these techniques to medium-to-large datasets raises problems that have seldom been addressed. This paper proposes a similarity measure based on compression with dictionaries, the Fast Compression Distance (FCD), which reduces the complexity of these methods without degrading performance. On this basis, a content-based color image retrieval system is defined that is comparable to state-of-the-art methods based on invariant color features. The FCD also enables a better understanding of compression-based techniques, through experiments on datasets larger than those analyzed so far in the literature.


💡 Research Summary

The paper addresses a well‑known limitation of compression‑based similarity measures such as the Normalized Compression Distance (NCD): while they are virtually parameter‑free and work across heterogeneous data types, their computational cost grows dramatically when the dataset reaches medium or large size. To overcome this bottleneck the authors propose a novel distance metric called Fast Compression Distance (FCD).
Instead of concatenating two objects and compressing the joint string (as NCD does), FCD builds a separate dictionary for each object using a dictionary‑based compressor (e.g., LZW). The dictionary D(x) consists of the distinct substrings that the compressor discovers while processing object x. The similarity between two objects x and y is then estimated by the size of the intersection of their dictionaries:

 FCD(x, y) = 1 − |D(x) ∩ D(y)| / max(|D(x)|, |D(y)|).

Because the dictionaries are generated once and can be reused for many queries, the overall complexity drops from O(N²) compression operations to O(N·|D|) set‑intersection operations, where |D| is the average dictionary size. This makes the method scalable while preserving the “parameter‑free” spirit of compression‑based approaches.
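The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `lzw_dictionary` and `fcd` are assumed names, and the dictionary builder follows the textbook LZW update rule (collecting each new substring the compressor would add while scanning the input).

```python
def lzw_dictionary(data: bytes) -> set:
    """Collect the distinct substrings an LZW-style compressor discovers
    while processing `data`; this set plays the role of D(x)."""
    dictionary = {bytes([b]) for b in range(256)}  # initial 256-symbol alphabet
    patterns = set()
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                      # keep extending the current match
        else:
            dictionary.add(wc)          # new pattern enters the dictionary
            patterns.add(wc)
            w = bytes([byte])
    return patterns

def fcd(dx: set, dy: set) -> float:
    """Fast Compression Distance: 1 - |D(x) ∩ D(y)| / max(|D(x)|, |D(y)|).
    0 means identical dictionaries, 1 means no shared patterns."""
    if not dx or not dy:
        return 1.0
    return 1.0 - len(dx & dy) / max(len(dx), len(dy))
```

Note that `fcd(d, d)` is 0 for any non-empty dictionary, and two objects with disjoint pattern sets are at distance 1, matching the normalization in the formula above.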

For the specific application to color image retrieval, each RGB image is linearized channel‑by‑channel into a sequence of 8‑bit symbols. The compressor’s alphabet is fixed at 256 symbols, and the maximum dictionary size is limited (e.g., 2¹⁶ entries) to keep memory consumption bounded. The authors construct dictionaries for all database images offline, and at query time they generate a dictionary for the query image, compute the intersection with each stored dictionary, and rank images by increasing FCD value. The intersection is implemented with hash‑based set operations, yielding O(|D|) time per comparison.
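The retrieval loop can be sketched as follows. This is a hedged illustration under the summary's stated assumptions (channel-by-channel linearization, a 2¹⁶-entry cap, hash-based intersections); `linearize` and `rank` are hypothetical names, and the pixel representation here is a plain list of (R, G, B) tuples rather than whatever image format the authors used.

```python
MAX_DICT = 2 ** 16  # cap on dictionary entries to bound memory use

def linearize(pixels, channels=3):
    """Flatten a list of (R, G, B) pixel tuples channel by channel:
    all R values first, then all G, then all B, as 8-bit symbols."""
    return bytes(p[c] for c in range(channels) for p in pixels)

def rank(query_dict, index):
    """Rank precomputed database dictionaries by increasing FCD from the
    query dictionary. Each comparison is one hash-based set intersection,
    O(|D|) on average, so a query costs O(N * |D|) over N images."""
    def fcd(dx, dy):
        return 1.0 - len(dx & dy) / max(len(dx), len(dy), 1)
    return sorted(range(len(index)), key=lambda i: fcd(query_dict, index[i]))
```

Offline, each database image would be linearized and its (capped) dictionary stored in `index`; at query time only the query image's dictionary is built before calling `rank`.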

Experimental evaluation was carried out on three datasets: (1) Corel‑1000 (1 000 images, 10 semantic categories), (2) a 1 010‑image subset of Caltech‑101 selected for strong color content, and (3) a self‑assembled collection of 100 000 color images. The authors compared FCD against classic color‑based descriptors (HSV histograms, color moments) and modern deep‑learning features (VGG‑16, ResNet‑50). Retrieval performance was measured using precision, recall, and mean average precision (mAP).
Results show that FCD matches or slightly exceeds the mAP of HSV histograms (by 3–4 %) and is competitive with deep features, despite using only raw pixel information. The most striking advantage appears in the large‑scale dataset: average query time drops from roughly 1.2 seconds (NCD) to 0.09 seconds (FCD), a speed‑up of more than an order of magnitude, while memory usage stays below 200 MB thanks to the capped dictionary size.

Beyond retrieval, the authors discuss the interpretability of the dictionary intersection: the proportion of shared patterns directly reflects common structural and color characteristics, offering a natural metric for clustering, anomaly detection, or unsupervised labeling. Although FCD still depends on the chosen compressor, the authors argue that any reasonable dictionary‑based compressor captures the intrinsic regularities of the data, making the method robust across domains.

In summary, the paper introduces Fast Compression Distance, a dictionary‑centric reformulation of compression‑based similarity that dramatically reduces computational overhead without sacrificing accuracy. By integrating FCD into a color image retrieval pipeline, the authors demonstrate state‑of‑the‑art retrieval quality on standard benchmarks and a 100‑k image collection, while achieving a ten‑fold speed improvement over traditional NCD. The work bridges theoretical compression concepts with practical large‑scale multimedia retrieval, opening avenues for further applications in clustering, outlier detection, and other unsupervised learning tasks where a universal, parameter‑free similarity measure is desirable.

