A New Algorithm for Data Compression Optimization

People tend to store a lot of files in their storage. When the storage nears its limit, they try to reduce the size of those files to a minimum by using data compression software. In this paper we propose a new algorithm for data compression, called j-bit encoding (JBE). This algorithm manipulates each bit of data inside a file to minimize its size without losing any data after decoding, which classifies it as lossless compression. This basic algorithm is intended to be combined with other data compression algorithms to optimize the compression ratio. The performance of this algorithm is measured by comparing combinations of different data compression algorithms.

💡 Research Summary

The paper addresses the growing need for efficient lossless data compression as storage capacities become increasingly saturated. It begins by reviewing the limitations of conventional lossless compressors such as DEFLATE (LZ77‑based), Huffman coding, and Burrows‑Wheeler Transform (BWT), noting that these methods operate primarily at the byte level and therefore miss fine‑grained patterns that exist at the bit level. To bridge this gap, the authors introduce a novel preprocessing technique called J‑Bit Encoding (JBE).

JBE works by first converting the input file into a continuous bit stream and performing a statistical analysis of bit frequencies. The stream is then partitioned into two logical groups: a “core” bit set that contains the most frequent or structurally significant bits, and an “auxiliary” bit set that captures the remaining bits. The core set is reorganized into a form that is more amenable to compression (e.g., long runs of zeros or ones), while the auxiliary set is stored separately together with a compact metadata block that records the mapping needed for reconstruction. After this transformation, any standard lossless compressor, such as LZMA, BZIP2, or Zstandard, can be applied to the reorganized core stream. During decompression, the metadata is used to restore the original bit ordering, so the original file is recovered exactly.
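As a rough illustration of a reversible two-stream split with a small metadata block, the sketch below assumes a simple rule: nonzero bytes go to one stream, a one-bit-per-byte flag map goes to the other, and the original length serves as metadata. This rule and the names `split_streams` and `merge_streams` are assumptions made for illustration only; the summary does not specify the exact partitioning.

```python
def split_streams(data: bytes) -> tuple:
    """Split data into (original length, nonzero bytes, flag bits).

    Assumed rule (not taken from the summary): each input byte contributes
    one flag bit, 1 if the byte is nonzero and 0 otherwise, and only the
    nonzero bytes are kept in the value stream.
    """
    nonzero = bytearray()
    flags = bytearray((len(data) + 7) // 8)  # one bit per input byte
    for i, byte in enumerate(data):
        if byte:
            nonzero.append(byte)
            flags[i // 8] |= 0x80 >> (i % 8)  # most significant bit first
    return len(data), bytes(nonzero), bytes(flags)


def merge_streams(length: int, nonzero: bytes, flags: bytes) -> bytes:
    """Invert split_streams and rebuild the original bytes exactly."""
    out = bytearray(length)  # starts as all zero bytes
    values = iter(nonzero)
    for i in range(length):
        if flags[i // 8] & (0x80 >> (i % 8)):
            out[i] = next(values)
    return bytes(out)


sample = b"\x00\x00A\x00B\x00\x00\x00C"
length, nonzero, flags = split_streams(sample)
# length == 9, nonzero == b"ABC", flags == b"\x28\x80"
restored = merge_streams(length, nonzero, flags)
```

Under this assumed rule the value stream loses its zero bytes and the flag map tends to contain long runs of identical bits for sparse data, which is the kind of regularity a downstream compressor can exploit. The round trip `merge_streams(*split_streams(x)) == x` is what makes the transform lossless.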

Implementation details reveal that JBE relies heavily on 64‑bit word masking and SIMD instructions to achieve high‑throughput bit‑level manipulation. The metadata itself is encoded with a variable‑length code to keep its overhead minimal.
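To make the idea of 64-bit word masking concrete, the following sketch separates one 64-bit word into two groups of bits according to a mask and then recombines them. It is written in plain Python rather than with SIMD instructions, and the helper names `extract_bits` and `deposit_bits` as well as the example mask are hypothetical; they are software analogues of bit-gather and bit-scatter operations, not the authors' implementation.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so clamp to 64 bits


def extract_bits(word: int, mask: int) -> int:
    """Gather the bits of word selected by mask into the low bits of the result."""
    result = 0
    out_pos = 0
    while mask:
        lowest = mask & -mask  # isolate the lowest set bit of the mask
        if word & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1  # clear that bit and continue
    return result


def deposit_bits(value: int, mask: int) -> int:
    """Scatter the low bits of value back to the positions selected by mask."""
    result = 0
    while mask:
        lowest = mask & -mask
        if value & 1:
            result |= lowest
        value >>= 1
        mask &= mask - 1
    return result


word = 0x0123456789ABCDEF
core_mask = 0xFF00FF00FF00FF00          # hypothetical choice of "core" bits
aux_mask = ~core_mask & MASK64          # every remaining bit

core = extract_bits(word, core_mask)    # 0x014589CD
aux = extract_bits(word, aux_mask)      # 0x2367ABEF
rebuilt = deposit_bits(core, core_mask) | deposit_bits(aux, aux_mask)
```

Because the two masks are complementary, recombining the deposited groups reproduces the original word. A production implementation would process many words per step with vector instructions instead of looping over individual bits.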

The experimental evaluation follows a two‑fold methodology. First, the authors test a diverse corpus of files (plain text, CSV, PNG, BMP, executable binaries, etc.) using three configurations: (1) traditional compressors alone, (2) JBE alone, and (3) JBE as a preprocessing step followed by each traditional compressor. Second, they measure wall‑clock compression/decompression times and peak memory consumption to quantify the cost of the added preprocessing stage. Results show that for data with strong bit‑level regularities—particularly textual and CSV files—JBE combined with LZMA yields an additional 12‑18 % reduction in size compared with LZMA alone. Image files benefit modestly (5‑9 % gain), while already compressed or highly random data see negligible improvement. The preprocessing step adds roughly 7‑15 % extra runtime but does not significantly increase memory usage.
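A comparison of this kind can be approximated with a small harness. The sketch below uses Python's standard `lzma` module as the downstream compressor and takes the preprocessing step as an arbitrary callable; the function names `measure` and `extra_reduction_percent` are hypothetical, and the harness does not reproduce the paper's corpus or figures.

```python
import lzma
import time


def measure(data: bytes, preprocess=None) -> tuple:
    """Return (compressed size in bytes, wall-clock seconds) for one configuration.

    preprocess is an optional callable bytes -> bytes applied before LZMA;
    passing None measures the compressor alone.
    """
    start = time.perf_counter()
    payload = preprocess(data) if preprocess is not None else data
    compressed = lzma.compress(payload)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed


def extra_reduction_percent(baseline_size: int, combined_size: int) -> float:
    """Additional size reduction of the combined pipeline relative to the baseline."""
    return 100.0 * (baseline_size - combined_size) / baseline_size


data = b"id,value\n" + b"".join(b"%d,%d\n" % (i, i % 7) for i in range(2000))
baseline_size, baseline_time = measure(data)
# Identity preprocessing stands in for a real transform in this sketch.
combined_size, combined_time = measure(data, preprocess=lambda raw: raw)
gain = extra_reduction_percent(baseline_size, combined_size)
```

For example, a baseline of 1000 bytes and a combined result of 850 bytes corresponds to `extra_reduction_percent(1000, 850) == 15.0`, that is, a 15 % additional reduction. Timings from a single run are noisy, so repeated runs and a median are advisable before drawing conclusions about overhead.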

In the discussion, the authors acknowledge that the benefit of JBE is data‑dependent; files lacking exploitable bit‑level patterns do not profit from the technique, and the extra CPU cycles may be prohibitive in latency‑sensitive scenarios. Nevertheless, they argue that for archival storage, backup systems, or scientific data repositories—where compression ratio outweighs processing time—the method offers a valuable boost.

The conclusion positions JBE as a modular, plug‑in component that can be inserted into existing compression pipelines without altering the downstream algorithms. Future work is outlined as follows: (a) employing machine‑learning models to automatically discover optimal bit‑mapping strategies for a given dataset, (b) developing hardware‑accelerated implementations (e.g., FPGA or GPU) to mitigate the preprocessing overhead, and (c) standardizing a container format that encapsulates the JBE‑transformed core stream and its metadata for broader interoperability.