EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data


Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity $\mathcal{O}(nd^{2})$, which limits their scalability for high-dimensional datasets. This paper proposes **EntroGD**, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity $\mathcal{O}(nd)$. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to $53.5\times$ compared to state-of-the-art GD compressors. Moreover, by enabling analytics with access to only $2.6\%$ of the original data volume, EntroGD accelerates clustering by up to $31.6\times$ with negligible loss in accuracy. Overall, EntroGD provides a scalable and system-efficient solution for direct analytics on compressed IoT data.


💡 Research Summary

The paper addresses the growing challenge of processing massive, high‑dimensional data streams generated by Internet‑of‑Things (IoT) devices and cyber‑physical systems under strict bandwidth, latency, and resource constraints. While lossless compression techniques such as Bzip2, LZ4, Snappy, zlib, and Zstd reduce data volume, they require full decompression before any analytics can be performed, adding undesirable overhead for edge and real‑time workloads. Generalized Deduplication (GD) has emerged as a promising alternative because it groups similar data chunks into “bases” and stores the remaining “deviations” separately, enabling random access and direct analytics on the compact base representation. However, state‑of‑the‑art GD algorithms, notably GreedyGD, suffer from two fundamental limitations: (1) the iterative bit‑selection process incurs quadratic time complexity O(nd²), where n is the number of samples and d the dimensionality, making them unsuitable for large‑scale IoT datasets; and (2) compression efficiency and analytical fidelity are tightly coupled, forcing a trade‑off that hampers scalability.

EntroGD (Entropy‑guided GD) is proposed to overcome these limitations by decoupling the two objectives into distinct stages and by replacing the costly iterative search with a simple entropy‑based ordering of bits. The first stage, “condensed sample generation,” selects a limited set of base bits (up to a user‑defined maximum mₘₐₓ) and constructs a small collection of condensed samples. For each base, the mean of its associated deviations is added to the base value, producing a representative sample sⱼ = bⱼ + (1/wⱼ)∑ₖ δⱼ,ₖ, where wⱼ is the number of deviations linked to that base. Each sample receives a weight wⱼ reflecting its contribution to the original data distribution. These weighted samples are appended to the original dataset, forming an extended dataset of size n + m, but because m ≪ n the overall storage overhead is negligible. Importantly, downstream analytics (e.g., clustering, anomaly detection) operate exclusively on the condensed samples and their weights, eliminating the need to access the full raw data.
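The condensed-sample construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `condensed_samples` and the input layout (a list of base vectors plus the deviations grouped per base) are assumptions for the sketch.

```python
import numpy as np

def condensed_samples(bases, deviations_per_base):
    """Build one weighted condensed sample per base (hypothetical helper).

    bases               : list of d-dimensional base vectors b_j
    deviations_per_base : for each base j, the deviations delta_{j,k} linked to it
    Returns (samples, weights) with s_j = b_j + mean_k(delta_{j,k}) and
    w_j = number of deviations associated with base j.
    """
    samples, weights = [], []
    for b, devs in zip(bases, deviations_per_base):
        devs = np.asarray(devs, dtype=float)
        w = len(devs)                                    # weight w_j
        s = np.asarray(b, dtype=float) + devs.mean(axis=0)  # s_j = b_j + mean deviation
        samples.append(s)
        weights.append(w)
    return np.array(samples), np.array(weights)
```

Downstream analytics would then see only these m weighted samples rather than the n raw records.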

The second stage performs compression on the extended dataset. EntroGD computes the entropy H(i) of every bit position i across all binary chunks (original and condensed). Bits are sorted in ascending order of entropy, and low‑entropy bits are added sequentially to the base‑bit set B. After each addition, the BASETREE data structure counts the number of unique bases n_b and evaluates the compressed size S using the formula S = n_b l_b + (n+m)(l_d + l_id) + m l_w + S_params. The process stops when S does not improve for τ consecutive bits (a plateau threshold, set to 10 in experiments). Because entropy values are computed once and never recomputed, the entire bit‑selection and compression pipeline runs in linear time O(nd).
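The entropy-guided selection loop can be sketched as follows. This is a simplified sketch under stated assumptions: a plain Python set stands in for the paper's BASETREE structure, the `S_params` term is omitted, and the chunk encoding (a 0/1 matrix of shape `(n_chunks, l_c)`) is an assumption of the sketch.

```python
import numpy as np

def entropy_bit_selection(bits, l_id=16, l_w=8, m=0, tau=10):
    """Entropy-guided base-bit selection (simplified sketch).

    bits : (n_chunks, l_c) array of 0/1 values, one row per binary chunk.
    Returns the selected base-bit positions B and the best size estimate S.
    """
    n_chunks, l_c = bits.shape
    p = bits.mean(axis=0)                        # P(bit i == 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    H = np.nan_to_num(H)                         # constant bits have H = 0
    order = np.argsort(H, kind="stable")         # ascending entropy, computed once

    B, best_S, best_B, stall = [], None, [], 0
    for i in order:
        B.append(i)
        l_b, l_d = len(B), l_c - len(B)
        n_b = len({tuple(row) for row in bits[:, B]})        # unique bases
        S = n_b * l_b + n_chunks * (l_d + l_id) + m * l_w    # S_params omitted
        if best_S is None or S < best_S:
            best_S, best_B, stall = S, list(B), 0
        else:
            stall += 1
            if stall >= tau:                     # plateau: stop after tau non-improving bits
                break
    return best_B, best_S
```

Because the entropies are computed once and the ordering never changes, each chunk is scanned a constant number of times per candidate bit, which is the source of the linear overall cost.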

Complexity analysis shows that entropy computation, condensed‑sample generation, and the final compression each require O(n l_c) operations, where l_c (the total number of bits per sample) is proportional to d (e.g., 32 d for 32‑bit data). Consequently, the overall runtime is O(nd), a dramatic reduction from the O(nd²) cost of GreedyGD.

The authors evaluate EntroGD on 18 diverse IoT datasets covering a range of sizes (0.4 KB to 5 GB), dimensionalities (4–17), and data types (integer, 32‑bit float). Baselines include GreedyGD, an enhanced GreedyGD+ (which stores deviation means), and several universal compressors (Bzip2, LZ4, Snappy, zlib, Zstd) configured for maximum compression. The primary analytic workload is k‑means clustering. Results demonstrate that EntroGD reduces configuration time by up to 53.5× relative to GreedyGD, while achieving compression ratios comparable to the best‑performing universal compressors. When clustering on compressed data, accessing only 2.6 % of the original volume, EntroGD speeds up the clustering process by up to 31.6× with less than 0.3 % loss in clustering quality (measured by silhouette score and cluster purity). GreedyGD+ narrows the accuracy gap by storing deviation means, but still cannot match EntroGD’s speed‑accuracy trade‑off because it retains the quadratic selection cost.
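The clustering workload above relies on treating each condensed sample as a point with multiplicity w_j. A minimal weighted Lloyd's iteration illustrates the idea; this is generic weighted k-means written for this sketch, not code from the paper, and the function name and parameters are assumptions.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Lloyd's k-means where sample x_j counts w_j times (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each condensed sample to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # weighted centroid update: w_j acts as the sample's multiplicity
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = np.average(X[mask], axis=0, weights=w[mask])
    return labels, centers
```

Running k-means on the m weighted condensed samples instead of the n raw records is what yields the reported speedups, since m is a small fraction of n.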

In summary, EntroGD introduces three key innovations: (1) an entropy‑guided, non‑iterative bit‑selection mechanism that brings GD configuration to linear time; (2) a condensed‑sample generation technique that preserves essential analytic information while keeping the additional data footprint minimal; and (3) a clear separation of compression and analytics objectives, allowing each to be optimized independently. This combination enables scalable, low‑latency analytics on compressed IoT streams, making it suitable for edge devices with limited compute and memory resources. The paper suggests future work on theoretical guarantees for entropy‑based bit ordering and extensions to privacy‑preserving distributed analytics.

