Study On Universal Lossless Data Compression by using Context Dependence Multilevel Pattern Matching Grammar Transform
In this paper, the context dependence multilevel pattern matching (CDMPM for short) grammar transform is proposed; based on this grammar transform, a universal lossless data compression algorithm, the CDMPM code, is then developed. Moreover, we obtain an upper bound on this algorithm's worst-case redundancy among all individual sequences of length n from a finite alphabet.
💡 Research Summary
The paper introduces a novel grammar-based transform called Context-Dependent Multilevel Pattern Matching (CDMPM) and builds a universal lossless compression algorithm, the CDMPM code, on top of it. The work is positioned as an extension of the earlier multilevel pattern matching (MPM) code and its side-information variant (CMPM). While MPM and CMPM already achieve a worst-case redundancy bounded by a term proportional to $1/\log n$, CDMPM adds context dependence to the multilevel pattern matching process, aiming to capture finer statistical regularities in the source data.
Core Construction
Given an input sequence $x$ of length $n$ over a finite alphabet $A$, two positive integers $I$ and $r$ define the $(I,r)$-grammar transform. The transform proceeds in several stages:
1. r-ary Decomposition: The length $n$ is expressed in base $r$; the string is partitioned into non-overlapping substrings whose lengths follow the digits of this expansion.
2. Multilevel Blocking: For each level $i$ (from 0 up to $I$) the current substring is divided into blocks of length $r^{I}$; each block is further split into sub-blocks of length $r^{i}$.
3. Context Construction: Each block is associated with a context consisting of a fixed initial sequence of length $r^{i}$ together with the context inherited from the previous level.
4. Label Assignment: Identical blocks receive the same integer label, distinct blocks receive distinct labels, and labels are assigned sequentially starting from 1.
5. Special Symbol 's': Within each context-dependent subsequence, the first occurrence of a block is marked by the special symbol 's'; subsequent occurrences are labeled by the number of distinct blocks seen up to that point.
Level 1 skips steps 4 and 5, using the raw context directly. The result of the transform is a collection of labeled sequences $\{T_i\}$ together with their associated context sequences $\{C_i\}$.
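The decomposition and labeling stages above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names are hypothetical, the partition assumes one substring of length $d_k r^k$ per nonzero base-$r$ digit $d_k$ (most significant digit first), and the 's' rule assumes that a repeated block is labeled by the 1-based order in which it first appeared within its context.

```python
def rary_digits(n, r):
    """Base-r digits of n, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, r)
        digits.append(d)
    return digits[::-1]


def partition_by_digits(x, r):
    """Split x into non-overlapping substrings following the base-r digits of len(x).

    Assumption: a digit d with weight r**k contributes one substring of length d * r**k.
    """
    digits = rary_digits(len(x), r)
    top = len(digits) - 1
    parts, pos = [], 0
    for k, d in enumerate(digits):
        length = d * r ** (top - k)
        if length:
            parts.append(x[pos:pos + length])
            pos += length
    return parts


def label_blocks(blocks):
    """Identical blocks share a label; labels start at 1 in order of first appearance."""
    labels = {}
    out = []
    for b in blocks:
        if b not in labels:
            labels[b] = len(labels) + 1
        out.append(labels[b])
    return out


def label_with_context(blocks, contexts):
    """Per context: first occurrence of a block -> 's', repeats -> first-appearance index.

    Assumed reading of the 's' rule; the index is counted separately in each context.
    """
    seen = {}  # context -> {block: 1-based first-appearance index}
    out = []
    for b, c in zip(blocks, contexts):
        table = seen.setdefault(c, {})
        if b in table:
            out.append(table[b])
        else:
            table[b] = len(table) + 1
            out.append("s")
    return out
```

For example, `partition_by_digits("abcdefghijk", 2)` yields `["abcdefgh", "ij", "k"]`, since 11 is 1011 in base 2, and `label_with_context(["ab", "cd", "ab", "cd", "ab"], ["x", "x", "x", "y", "y"])` yields `["s", "s", 1, "s", "s"]`, because the block counts restart in context `"y"`.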
Arithmetic Encoding
The CDMPM code encodes the labeled sequences using an adaptive arithmetic coder. For each context $\gamma$ and symbol $\beta$, a counter $C_{\gamma}(\beta)$ is maintained. When a symbol is processed, its probability is estimated from the current counter values for its context.
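One common way to turn per-context counters into probabilities is a frequency-count estimate: the counter of the symbol divided by the sum of all counters in the same context, followed by an increment of the coded symbol's counter. The sketch below assumes that rule and assumes every counter starts at 1; neither detail is confirmed by this summary, and the class and function names are hypothetical. Instead of running a real arithmetic coder, it accumulates the ideal code length $-\log_2 p$ that such a coder would approach.

```python
import math
from collections import defaultdict


class AdaptiveContextModel:
    """Per-context symbol counters, used as an adaptive probability estimate."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Assumption: counters start at 1 so that no symbol ever has probability zero.
        self.counts = defaultdict(lambda: {s: 1 for s in self.alphabet})

    def probability(self, context, symbol):
        """Counter of `symbol` divided by the sum of counters in `context`."""
        table = self.counts[context]
        return table[symbol] / sum(table.values())

    def update(self, context, symbol):
        """Increment the counter after the symbol has been coded."""
        self.counts[context][symbol] += 1


def ideal_code_length(symbols, contexts, alphabet):
    """Sum of -log2(p) over the sequence, in bits."""
    model = AdaptiveContextModel(alphabet)
    bits = 0.0
    for sym, ctx in zip(symbols, contexts):
        bits += -math.log2(model.probability(ctx, sym))
        model.update(ctx, sym)
    return bits
```

With the alphabet `"ab"`, coding `"a"` twice in the same context costs 1 bit for the first symbol (probability 1/2) and about 0.585 bits for the second (probability 2/3), which shows how repeated symbols in a context become cheaper as the counters adapt.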