An Efficient Technique for Text Compression
Storing a word or a whole text segment requires considerable space: a character typically occupies 1 byte of memory. Compression is therefore important for data management, and for text data it must be lossless. We propose a lossless compression method for text data that works in two stages: first reduction, then compression. Reduction replaces words using a word lookup table rather than a traditional indexing system; compression is then performed with existing compression methods. The word lookup table is part of the operating system, which carries out the reduction by replacing each word with an address value. This method can substantially reduce the persistent memory required for text data. The first stage, using the word lookup table, produces a binary file containing the addresses. Because no compression algorithm is applied in the first stage, this file can be further compressed with popular compression algorithms, yielding a high overall compression ratio on purely English text.
💡 Research Summary
The paper proposes a two‑stage lossless compression scheme for textual data that aims to reduce the storage footprint of English‑language text. In the first stage, called “reduction,” the authors suggest replacing every word in a document with a fixed‑size address drawn from a word lookup table that resides in the operating system. This table is a static dictionary that maps each known word to a unique identifier; the identifier is stored using a predetermined number of bytes (typically 2–4). By substituting variable‑length character strings with these compact identifiers, the raw size of the text can be dramatically lowered, especially because the average English word occupies more bytes than the address representation. The output of this stage is a binary file consisting solely of address values; no traditional compression algorithm is applied at this point.
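The reduction stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny `LOOKUP` table, the 3-byte address width, and the big-endian encoding are all assumptions made here for demonstration (the paper suggests 2–4 bytes and leaves encoding details open).

```python
# Hypothetical word lookup table; in the paper this would be a large,
# OS-maintained dictionary covering the English vocabulary.
LOOKUP = {"the": 0, "quick": 1, "brown": 2, "fox": 3,
          "jumps": 4, "over": 5, "lazy": 6, "dog": 7}

ADDRESS_BYTES = 3  # assumed fixed-width identifier (paper suggests 2-4 bytes)

def reduce_text(text: str) -> bytes:
    """Replace each word with its fixed-width address (big-endian)."""
    out = bytearray()
    for word in text.lower().split():
        addr = LOOKUP[word]  # OOV handling is left open by the paper
        out += addr.to_bytes(ADDRESS_BYTES, "big")
    return bytes(out)

reduced = reduce_text("the quick brown fox jumps over the lazy dog")
# 9 words x 3 bytes = 27 bytes, versus 43 bytes of raw ASCII text
```

Note that with a 3-byte address the saving over plain ASCII depends on average word length; for very short words a fixed-width address can even expand the text.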
The second stage leverages any existing lossless compressor (e.g., gzip, bzip2, LZMA). Since the address file has a much lower entropy than the original character stream, standard compressors can achieve higher compression ratios on it. The authors argue that the combination of the two stages yields “a great deal of data compression” for purely English text.
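A toy round trip of the second stage, using `zlib` as a stand-in for the gzip/bzip2/LZMA compressors the summary mentions (the repetitive address stream below is fabricated for illustration):

```python
import zlib

# Fabricated stage-one output: a highly repetitive stream of 3-byte addresses.
addresses = bytes([0, 0, 0]) * 50 + bytes([0, 0, 7]) * 50

# Stage two: any off-the-shelf lossless compressor.
compressed = zlib.compress(addresses, level=9)
assert len(compressed) < len(addresses)  # redundancy is easy to exploit

restored = zlib.decompress(compressed)
assert restored == addresses  # lossless round trip
```

The low entropy of the address file is exactly what lets the generic compressor in stage two do well; the first stage performs no entropy coding of its own.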
While the concept is straightforward, several technical concerns arise. First, the size of the dictionary and the bit‑width of the addresses are tightly coupled. To accommodate one million distinct words, at least 20 bits (≈3 bytes) per address are required, which reduces the net gain compared to a smaller dictionary. The paper does not discuss how the dictionary size is chosen, nor does it provide an analysis of the memory overhead incurred by storing the lookup table in the OS. Second, handling out‑of‑vocabulary (OOV) words—new terminology, proper nouns, misspellings, or domain‑specific jargon—is not addressed. If such words are left uncompressed, they can dominate the address file and erode the benefits of the first stage. Third, the treatment of case sensitivity, punctuation, numbers, and other non‑alphabetic tokens is left unspecified, raising questions about the robustness of the mapping in real‑world texts.
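The coupling between dictionary size and address width is simple to quantify. A short calculation (mine, not the paper's) confirms the figures above:

```python
import math

def address_bytes(vocabulary_size: int) -> int:
    """Bytes needed for a fixed-width address covering the vocabulary."""
    bits = math.ceil(math.log2(vocabulary_size))
    return math.ceil(bits / 8)

# One million distinct words need 20 bits, i.e. 3 bytes per address:
assert address_bytes(1_000_000) == 3
# A 65,536-word dictionary fits in 2 bytes:
assert address_bytes(65_536) == 2
```

The jump from 2 to 3 bytes per word is a 50% increase in stage-one output size, which is why the choice of dictionary size deserves the analysis the paper omits.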
Embedding the dictionary in the operating system has both advantages and drawbacks. On the positive side, a system‑wide dictionary can be shared across applications, eliminating redundant storage and providing a uniform compression baseline for all text files on a machine. It also simplifies the decompression path because the same dictionary is guaranteed to be present on any system running the OS. On the downside, updating the dictionary (e.g., adding new words) becomes a system‑level operation that may break compatibility with files compressed using an older version of the dictionary. Versioning, backward compatibility, and the impact on multi‑language environments (where the dictionary would need to grow dramatically) are not explored.
From an algorithmic perspective, the first stage is essentially a static substitution coding. Its effectiveness depends on the frequency distribution of words relative to the address space. If the most common words are assigned short, densely packed addresses, the subsequent generic compressor can exploit the resulting redundancy even more effectively. Conversely, if the address space is sparsely populated or if the dictionary is oversized, the fixed‑length addresses may introduce unnecessary overhead that outweighs the gains from word substitution.
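One way to realize the "short, densely packed addresses for common words" idea is to rank the dictionary by corpus frequency before assigning identifiers. The snippet below is a speculative sketch of that policy (the corpus and ranking scheme are assumptions, not from the paper):

```python
from collections import Counter

# Hypothetical: build the lookup table from a sample corpus so that the
# most frequent words receive the lowest addresses.
corpus = "to be or not to be that is the question to be is to do".split()
ranked = [word for word, _ in Counter(corpus).most_common()]
lookup = {word: rank for rank, word in enumerate(ranked)}

addresses = [lookup[w] for w in corpus]
# Frequent words map to small identifiers, so the address stream is
# dominated by low values that a generic stage-two compressor models cheaply.
assert lookup["to"] == 0 and lookup["be"] == 1
```

Under this assignment the high-order bytes of most fixed-width addresses are zero, which is itself redundancy the second-stage compressor can exploit.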
The paper lacks empirical evaluation. No compression ratios, processing times, or memory consumption figures are presented, nor is there a comparison with established text compressors such as Huffman coding, LZW, PPM, or BWT‑based schemes. Without such data, it is impossible to quantify the claimed “great deal of data compression” or to assess whether the added complexity of maintaining an OS‑level dictionary is justified.
In summary, the authors introduce an intriguing two‑level approach: first replace words with OS‑managed dictionary addresses, then apply any off‑the‑shelf lossless compressor. The idea has merit, particularly for large homogeneous English corpora where a static dictionary can capture the majority of vocabulary. However, practical deployment would require careful design of dictionary size, address width, OOV handling, and version control, as well as thorough benchmarking against existing methods. If these issues are addressed, the technique could become a viable complement to current text compression tools, especially in environments where the OS can guarantee the presence of a shared, high‑quality word lookup table.