There is a large literature devoted to the problem of finding an optimal (min-cost) prefix-free code with an unequal letter-cost encoding alphabet of size $t$. While there is no known polynomial-time algorithm for solving it optimally, there are many good heuristics, all of which approximate the optimum to within an additive error. The additive error in these algorithms usually depends linearly upon the largest encoding letter size. This paper was motivated by the problem of finding optimal codes when the encoding alphabet is infinite. Because the largest letter cost is infinite, the previous analyses could give infinite error bounds. We provide a new algorithm that works with infinite encoding alphabets. When restricted to the finite-alphabet case, our algorithm often provides better error bounds than the best previously known ones.
Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_t\}$ be an encoding alphabet. Word $w \in \Sigma^*$ is a prefix of word $w' \in \Sigma^*$ if $w' = wu$ where $u \in \Sigma^*$ is a non-empty word. A code over $\Sigma$ is a collection of words $C = \{w_1, \ldots, w_n\}$. Code $C$ is prefix-free if, for all $i \neq j$, $w_i$ is not a prefix of $w_j$. See Figure 1.
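To make the definition concrete, the following small Python check (our own illustration, not part of the paper's algorithms; the function name `is_prefix_free` is ours) tests prefix-freeness. It relies on the fact that, after lexicographic sorting, any codeword that is a prefix of another codeword must be a prefix of its immediate successor.

```python
def is_prefix_free(code):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(code)
    # After sorting, a prefix violation must occur between neighbours.
    return all(not words[k + 1].startswith(words[k])
               for k in range(len(words) - 1))

print(is_prefix_free(["aaa", "aab", "ab", "b"]))     # True  (Figure 1, left)
print(is_prefix_free(["aaa", "aab", "ab", "aaba"]))  # False ("aab" prefixes "aaba")
```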
Let $\mathrm{cost}(w)$ be the length of $w$, i.e., the number of characters in $w$. Given a set of associated probabilities $p_1, p_2, \ldots, p_n \ge 0$ with $\sum_i p_i = 1$, the cost of the code is $\mathrm{Cost}(C) = \sum_{i=1}^{n} \mathrm{cost}(w_i)\, p_i$. The prefix coding problem, sometimes known as the Huffman encoding problem, is to find a prefix-free code over $\Sigma$ of minimum cost. This problem is very well studied and has a well-known $O(tn \log n)$-time greedy algorithm due to Huffman [14] ($O(tn)$-time if the $p_i$ are sorted in non-decreasing order). Alphabetic coding is the same problem with the additional constraint that the codewords must be chosen in increasing alphabetic order (with respect to the words to be encoded). This corresponds, for example, to the problem of constructing search trees, optimal with respect to average search time, for items with the given access probabilities or frequencies. Such a code can be constructed in $O(tn^3)$ time [16].

Figure 1: In this example $\Sigma = \{a, b\}$ with $\mathrm{cost}(a) = 1$ and $\mathrm{cost}(b) = 3$. The code on the left, $\{aaa, aab, ab, b\}$, is prefix-free. The code on the right, $\{aaa, aab, ab, aaba\}$, is not prefix-free because $aab$ is a prefix of $aaba$. The second row of each table contains the costs of the codewords:

    x        aaa  aab  ab  b        x        aaa  aab  ab  aaba
    cost(x)   3    5   4   3        cost(x)   3    5   4    6

Figure 2: Two min-cost prefix-free codes for probabilities $2/6, 2/6, 1/6, 1/6$ and their tree representations. The code on the left is optimal for $c_1 = c_2 = 1$, while the code on the right, the prefix-free code from Figure 1, is optimal for $c_1 = 1$, $c_2 = 3$.
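As a concrete check of Figure 2's left-hand code (an illustration we add, not the paper's code; `huffman_cost` is our own name), the classical heap-based Huffman merge computes the optimal expected length for the unit-letter-cost case $c_1 = c_2 = 1$; the expected codeword length equals the sum of the merged weights.

```python
import heapq
from fractions import Fraction

def huffman_cost(probs):
    """Expected length of an optimal binary prefix-free code
    (classical Huffman: t = 2 and c1 = c2 = 1)."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b              # each merge contributes its combined weight
        heapq.heappush(heap, a + b)
    return total

p = [Fraction(2, 6), Fraction(2, 6), Fraction(1, 6), Fraction(1, 6)]
print(huffman_cost(p))  # 2, the cost of the code {00, 01, 10, 11} in Figure 2
```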
One well-studied generalization of the problem is to let the encoding letters have different costs. That is, let $\sigma_i \in \Sigma$ have associated cost $c_i$. The cost of codeword $w = \sigma_{i_1} \sigma_{i_2} \cdots \sigma_{i_l}$ will be $\mathrm{cost}(w) = \sum_{k=1}^{l} c_{i_k}$, i.e., the sum of the costs of its letters (rather than the length of the codeword), with the cost of the code still being defined as $\mathrm{Cost}(C) = \sum_{i=1}^{n} \mathrm{cost}(w_i)\, p_i$ with this new cost function.
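Concretely, under the letter costs $C = (1, 3)$ of Figure 1, these definitions can be evaluated directly. The sketch below (our own illustration; `word_cost` and `code_cost` are names we introduce) reproduces the codeword costs in Figure 1 and the cost of Figure 2's right-hand code.

```python
from fractions import Fraction

def word_cost(w, letter_cost):
    """cost(w): sum of the costs of the letters of w."""
    return sum(letter_cost[ch] for ch in w)

def code_cost(code, probs, letter_cost):
    """Cost(C) = sum_i cost(w_i) * p_i."""
    return sum(word_cost(w, letter_cost) * p for w, p in zip(code, probs))

letter_cost = {"a": 1, "b": 3}
probs = [Fraction(2, 6), Fraction(2, 6), Fraction(1, 6), Fraction(1, 6)]
print([word_cost(w, letter_cost) for w in ["aaa", "aab", "ab", "b"]])  # [3, 5, 4, 3]
print(code_cost(["aaa", "aab", "ab", "b"], probs, letter_cost))        # 23/6
```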
The existing, large literature on the problem of finding a minimal-cost prefix-free code when the $c_i$ are no longer equal, which will be surveyed below, assumes that $\Sigma$ is a finite alphabet, i.e., that $t = |\Sigma| < \infty$. The original motivation of this paper was to address the problem when $\Sigma$ is unbounded, which, as will briefly be described in Section 3, models certain types of language restrictions on prefix-free codes and the imposition of different cost metrics on search trees. The tools developed, though, turn out to provide improved approximation bounds for many of the finite cases as well. More specifically, it was known [20, 23] that $\frac{1}{c} H(p_1, \ldots, p_n) \le OPT$, where $H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$ is the entropy of the distribution, $c$ is the unique positive root of the characteristic equation $1 = \sum_{i=1}^{t} 2^{-c c_i}$, and $OPT$ is the minimum cost of any prefix-free code for those $p_i$. Note that in this paper, $\log x$ will always denote $\log_2 x$.
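As a sanity check on these definitions (our own numerical sketch, not the paper's algorithm; `characteristic_root` and `entropy` are names we introduce), the root $c$ can be found by bisection, since $g(c) = \sum_i 2^{-c c_i}$ is strictly decreasing for positive costs, and the entropy lower bound can then be evaluated.

```python
import math

def characteristic_root(costs, tol=1e-12):
    """Unique positive root c of 1 = sum_i 2^(-c * c_i) (assumes t >= 2)."""
    g = lambda c: sum(2.0 ** (-c * ci) for ci in costs)
    lo, hi = 0.0, 1.0
    while g(hi) > 1.0:          # g(0) = t > 1; enlarge hi until g(hi) <= 1
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if g(mid) > 1.0 else (lo, mid)  # g is decreasing
    return (lo + hi) / 2.0

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

c = characteristic_root([1, 3])                # C = (1, 3): c ~ 0.5515
print(entropy([2/6, 2/6, 1/6, 1/6]) / c)       # ~ 3.478 <= OPT = 23/6 ~ 3.833
```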
The known efficient algorithms create a code $T$ that satisfies

$$C(T) \le \frac{1}{c} H(p_1, \ldots, p_n) + f(C) \qquad (1)$$

where $C(T)$ is the cost of code $T$, $C = (c_1, c_2, \ldots, c_t)$, and $f(C)$ is some function of the letter costs $C$, with the actual value of $f(C)$ depending upon the particular algorithm. Since $\frac{1}{c} H(p_1, \ldots, p_n) \le OPT$, code $T$ has an additive error of at most $f(C)$ from $OPT$. The $f(C)$ corresponding to the different algorithms shared an almost linear dependence upon the value $c_t = \max(C)$, the largest letter cost. They therefore cannot be used for infinite $C$. In this paper we present a new algorithmic variation (all algorithms for this problem start with the same splitting procedure, so they are all, in some sense, variations of each other) with a new analysis:
• (Theorems 2 and 3) For finite C we derive new additive error bounds f(C) which, in many cases, are much better than the old ones.
• For infinite C, if f(C) is bounded, then we can still give a bound of type (1). For example, if $c_m = 1 + \lfloor \frac{m-1}{2} \rfloor$, i.e., exactly two letters of each cost $i$, $i = 1, 2, 3, \ldots$, then we can show that $f(C)$ is a bounded constant; more generally, the additive error can be expressed as $f(C, \epsilon)$, where $f(C, \epsilon)$ is some constant based only on $C$ and $\epsilon$ (a worked computation of $c$ for this example follows the list).
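To illustrate why the characteristic equation remains meaningful for infinite $C$ (a worked check we add here, consistent with the definitions above), note that for $c_m = 1 + \lfloor \frac{m-1}{2} \rfloor$ the characteristic sum is a geometric series:

$$1 = \sum_{m=1}^{\infty} 2^{-c\, c_m} = \sum_{i=1}^{\infty} 2 \cdot 2^{-c i} = \frac{2 \cdot 2^{-c}}{1 - 2^{-c}} \;\Longrightarrow\; 2^{-c} = \frac{1}{3} \;\Longrightarrow\; c = \log 3 \approx 1.585,$$

so the root $c$ is well defined even though the alphabet is infinite.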
We now provide some more history and motivation. For a simple example, refer to Figure 2. Both codes have minimum cost for the frequencies $(p_1, p_2, p_3, p_4) = (1/3, 1/3, 1/6, 1/6)$, but under different letter costs. The code $\{00, 01, 10, 11\}$ has minimum cost for the standard Huffman problem, in which $\Sigma = \{0, 1\}$ and $c_1 = c_2 = 1$, i.e., the cost of a word is the number of bits it contains. The code $\{aaa, aab, ab, b\}$ has minimum cost for the alphabet $\Sigma = \{a, b\}$ in which the cost of an "a" is 1 and the cost of a "b" is 3, i.e., $C = (1, 3)$.
The unequal letter cost coding problem was originally motivated by coding problems in which different characters have different transmission times or storage costs.