LogPrism: Unifying Structure and Variable Encoding for Effective Log Compression
In the field of log compression, the prevailing “parse-then-compress” paradigm fundamentally limits effectiveness by treating log parsing and compression as isolated objectives. While parsers prioritize semantic accuracy (i.e., event identification), they often obscure deep correlations between static templates and dynamic variables that are critical for storage efficiency. In this paper, we investigate this misalignment through a comprehensive empirical study and propose LogPrism, a framework that bridges the gap via unified redundancy encoding. Rather than relying on a rigid pre-parsing step, LogPrism dynamically integrates structural extraction with variable encoding by constructing a Unified Redundancy Tree (URT). This hierarchical approach effectively mines “structure+variable” co-occurrence patterns, capturing deep contextual redundancies while accelerating processing through pre-emptive pattern encoding. Extensive experiments on 16 benchmark datasets confirm that LogPrism establishes a new state-of-the-art. It achieves the highest compression ratio on 14 datasets, surpassing existing baselines by margins of 6.12% to 83.34%, while delivering superior throughput at 29.87 MB/s (1.68$\times$ to 43.04$\times$ faster than competitors). Moreover, when configured in single-archive mode to maximize global pattern discovery, LogPrism boosts its compression ratio by 273.27%, outperforming the best baseline by 19.39% with a 2.62$\times$ speed advantage.
💡 Research Summary
The paper “LogPrism: Unifying Structure and Variable Encoding for Effective Log Compression” challenges the dominant “parse‑then‑compress” pipeline that treats log parsing and compression as separate, sequential tasks. The authors first conduct a comprehensive empirical study involving nine state‑of‑the‑art log parsers (Drain, AEL, IPLoM, LFA, LogSig, MoLFI, SHISO, Spell, and the LogReducer parser) and four parser‑based compressors (Logzip, LogReducer, LogShrink, and Denum). Using 14 large‑scale benchmark datasets from LogHub (over 50 GB and 262 million log entries), they demonstrate that high parsing accuracy does not guarantee superior compression. Over‑generalized templates shift entropy to the variable stream, while overly specific templates increase dictionary overhead, both degrading compression ratios. Moreover, the conventional pipeline ignores two crucial sources of redundancy: (1) strong correlations between specific variable values and their templates, and (2) predictable co‑occurrence patterns among variables within a single log entry.
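The first ignored redundancy, the correlation between variable values and their templates, can be made concrete with a toy entropy calculation (the log events and values below are invented for illustration, not taken from the paper): encoding the template and the variable as one joint symbol never costs more bits per entry than encoding the two streams independently, and costs strictly fewer when they are correlated.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy stream: each template tends to co-occur with particular variable
# values (e.g. "open" events usually mention the same path).
pairs = [("open", "/var/log"), ("open", "/tmp"),
         ("close", "/var/log"), ("open", "/var/log")] * 250

templates = [t for t, _ in pairs]
variables = [v for _, v in pairs]

separate = entropy(templates) + entropy(variables)  # parse-then-compress view
joint = entropy(pairs)                              # unified view

print(f"H(T)+H(V) = {separate:.3f} bits/entry, H(T,V) = {joint:.3f} bits/entry")
```

Here the joint encoding needs 1.5 bits per entry versus about 1.62 for the two independent streams; the gap widens as the correlation between templates and their variable values grows, which is exactly the redundancy a parse-then-compress pipeline leaves on the table.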
To address these shortcomings, the authors introduce LogPrism, a framework built around a Unified Redundancy Tree (URT). The URT is constructed in three hierarchical stages. First, stable high‑frequency tokens are extracted to form a structural tree that captures the common skeleton of the logs. Second, variable sub‑trees are attached to this skeleton, enabling the mining of frequent “structure + variable” co‑occurrence patterns. By encoding such joint patterns as single symbols, LogPrism dramatically reduces the size of the template dictionary and the entropy of the variable stream. Third, the remaining high‑entropy “long‑tail” variables are processed by a specialized sorting and stream‑normalization pipeline, which efficiently compresses the residual data.
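The paper gives no pseudocode for URT construction, so the following is only a minimal sketch of the underlying idea under assumed names and thresholds: tokens above a frequency threshold act as stable structure, frequent variable values are folded into the pattern alongside them (so a “structure + variable” co-occurrence becomes a single dictionary symbol), and rare tokens fall through to a residual stream for the long-tail stage.

```python
from collections import Counter

def build_patterns(lines, min_support=0.5):
    """Sketch of 'structure + variable' pattern mining in the spirit of a
    Unified Redundancy Tree. Tokens seen in at least min_support of the
    lines are kept in place as stable structure (or frequent variable
    values); each resulting line pattern gets one dictionary symbol,
    and rare tokens go to a residual stream. Illustrative only -- the
    threshold and names are assumptions, not the paper's algorithm.
    """
    token_freq = Counter(tok for line in lines for tok in set(line.split()))
    threshold = min_support * len(lines)

    encoded, dictionary, residual = [], {}, []
    for line in lines:
        toks = line.split()
        # Keep frequent tokens in place; abstract rare ones to a slot.
        pattern = tuple(t if token_freq[t] >= threshold else "<*>"
                        for t in toks)
        if pattern not in dictionary:
            dictionary[pattern] = len(dictionary)
        encoded.append(dictionary[pattern])
        residual.extend(t for t in toks if token_freq[t] < threshold)
    return encoded, dictionary, residual
```

On logs such as `INFO open file_17`, `INFO open file_42`, `WARN retry file_17`, the frequent value `file_17` stays inside its pattern rather than being abstracted away, so the joint “structure + variable” symbol captures the co-occurrence; only rare values like `file_42` land in the residual stream that the long-tail stage would then sort and normalize.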
LogPrism also features a parallel‑aware architecture. In its default multi‑chunk mode, the dataset is partitioned into independent chunks that are processed concurrently, achieving a throughput of 29.87 MB/s (1.68× to 43.04× faster than the best competing compressors). In single‑archive mode, the entire log set is fed into a single URT, maximizing global pattern discovery. This mode boosts LogPrism’s average compression ratio by 273.27% over its multi‑chunk default and outperforms the strongest baseline (Denum) by 19.39% while remaining 2.62× faster than it; although slower than the multi‑chunk configuration, it is well suited to offline archival workloads.
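The two operating modes can be sketched with a stand‑in compressor (the sketch below uses `zlib` in place of per‑chunk URT construction; the chunking scheme and names are assumptions, not the paper's implementation): the multi‑chunk mode trades some compression ratio for parallel throughput because each chunk only sees local patterns, while the single‑archive mode lets one global model see all the data.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    """Stand-in for per-chunk URT construction plus entropy coding."""
    return zlib.compress(chunk, level=9)

def multi_chunk(data: bytes, n_chunks: int = 4) -> list[bytes]:
    """Default mode: independent chunks compressed concurrently.
    Patterns stay local to each chunk, trading ratio for throughput."""
    size = -(-len(data) // n_chunks)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:  # zlib releases the GIL
        return list(pool.map(compress_chunk, chunks))

def single_archive(data: bytes) -> bytes:
    """Single-archive mode: one global model sees all the data,
    maximizing cross-chunk pattern discovery at the cost of parallelism."""
    return compress_chunk(data)
```

Because `ThreadPoolExecutor.map` preserves input order, the chunk outputs can simply be decompressed and concatenated to recover the original stream, mirroring why the chunks can be processed fully independently.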
Experimental results show that LogPrism attains the highest compression ratio on 14 of the 16 benchmark datasets, with improvements ranging from 6.12% to 83.34% over the best existing methods. The framework also consistently outperforms Denum, a number‑centric compressor that relies on parser‑based preprocessing, confirming that unified redundancy encoding is beneficial even for numerically dominated logs.
The paper’s contributions are threefold: (1) the first large‑scale quantitative analysis of how parser choice impacts downstream compression, revealing a critical misalignment between semantic parsing goals and storage efficiency; (2) the introduction of the unified redundancy encoding paradigm and the concrete implementation of LogPrism, including the URT data structure and a parallel‑friendly processing pipeline; (3) an extensive evaluation that establishes new state‑of‑the‑art performance in both compression ratio and speed.
In conclusion, the authors argue that future log compression systems should be co‑designed with parsing, moving beyond the simplistic extraction of templates toward models that explicitly capture template‑variable and inter‑variable correlations. LogPrism serves as a proof‑of‑concept that such integration can dramatically lower storage costs while preserving, or even enhancing, the analytical value of archived logs.