Enumeration of sequences with large alphabets

This study focuses on efficient schemes for enumerative coding of $\sigma$–ary sequences, mainly borrowing ideas from Oktem & Astola’s \cite{Oktem99} hierarchical enumerative coding and Schalkwijk’s \cite{Schalkwijk72} asymptotically optimal combinatorial code on binary sequences. By observing that the number of distinct $\sigma$–dimensional vectors having an inner sum of $n$, where the value in each dimension is in the range $[0 \ldots n]$, is $K(\sigma,n) = \sum_{i=0}^{\sigma-1} {{n-1} \choose {\sigma-1-i}} {{\sigma} \choose {i}}$, we propose representing the $C$ vector via enumeration and present the algorithms necessary to perform this task. We prove that $\log K(\sigma,n)$ requires approximately $(\sigma-1) \log (\sigma-1)$ fewer bits than the naive $(\sigma-1)\lceil \log (n+1) \rceil$ representation for relatively large $n$, and examine the results experimentally for varying alphabet sizes. We extend the basic scheme for the enumerative coding of $\sigma$–ary sequences by introducing a new method for large alphabets. Experiments on DNA sequences show that the newly introduced technique is superior to the basic scheme.
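The counting formula above can be checked numerically. The following is a minimal Python sketch, not the authors' algorithm: `K` evaluates the sum from the abstract (assuming $n \ge 1$), `K_brute` counts the vectors exhaustively, and `rank`/`unrank` implement a generic lexicographic enumeration of such vectors. All function names are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import comb, ceil, log2


def K(sigma: int, n: int) -> int:
    """Evaluate sum_{i=0}^{sigma-1} C(n-1, sigma-1-i) * C(sigma, i); assumes n >= 1."""
    return sum(comb(n - 1, sigma - 1 - i) * comb(sigma, i) for i in range(sigma))


def K_brute(sigma: int, n: int) -> int:
    """Count sigma-dimensional vectors over [0..n] whose entries sum to n (small inputs only)."""
    return sum(1 for v in product(range(n + 1), repeat=sigma) if sum(v) == n)


def count(parts: int, total: int) -> int:
    """Number of non-negative integer vectors of length `parts` (>= 1) summing to `total`."""
    return comb(total + parts - 1, parts - 1)


def rank(vec):
    """Lexicographic index of `vec` among all vectors with the same length and sum."""
    remaining = sum(vec)
    index = 0
    for pos, c in enumerate(vec[:-1]):
        parts_left = len(vec) - pos - 1
        # Skip every vector whose entry at `pos` is smaller than c.
        for smaller in range(c):
            index += count(parts_left, remaining - smaller)
        remaining -= c
    return index


def unrank(index: int, sigma: int, n: int):
    """Inverse of `rank`: rebuild the vector from its index (0 <= index < K(sigma, n))."""
    vec, remaining = [], n
    for pos in range(sigma - 1):
        parts_left = sigma - pos - 1
        c = 0
        while True:
            block = count(parts_left, remaining - c)
            if index < block:
                break
            index -= block
            c += 1
        vec.append(c)
        remaining -= c
    vec.append(remaining)  # the last entry is forced by the sum constraint
    return vec


sigma, n = 4, 10
enumerative_bits = ceil(log2(K(sigma, n)))
naive_bits = (sigma - 1) * ceil(log2(n + 1))
print(K(sigma, n), enumerative_bits, naive_bits)
```

In this sketch the sum agrees with the closed form $\binom{n+\sigma-1}{\sigma-1}$ (by Vandermonde's identity), and an index in $[0, K(\sigma,n))$ is enough to recover the vector, which is what makes the $\lceil \log K(\sigma,n) \rceil$-bit representation possible.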


💡 Research Summary

The paper addresses the problem of efficient enumerative coding of σ‑ary sequences, where the alphabet size σ may be large. Building on Oktem & Astola’s hierarchical enumerative coding (originally for binary sequences) and Schalkwijk’s asymptotically optimal combinatorial code, the authors generalize these ideas to arbitrary σ. The central combinatorial observation is that the number of σ‑dimensional integer vectors C = (c₁,…,c_σ) with each component in