Twenty Questions Games Always End With Yes
Huffman coding is often presented as the optimal solution to Twenty Questions. However, a caveat is that Twenty Questions games always end with a reply of “Yes,” whereas Huffman codewords need not obey this constraint. We bring resolution to this issue, and prove that the average number of questions still lies between H(X) and H(X)+1.
Twenty Questions is a classic parlour game involving an answerer and a questioner. The questioner must guess what object the answerer is thinking of, but is only allowed to ask questions whose answers are either "Yes" or "No". Popular initial questions include: "Is it an animal? Is it a vegetable? Is it a mineral?" The name of the game arises from the fact that if one bit of information could be acquired from each question, then twenty questions can distinguish between 2^20 different objects, which should be more than sufficient.
Courses in information theory often cast Huffman coding as the optimal approach to Twenty Questions. Given the set of possible objects and their probabilities, the questioner associates a Huffman codeword with each object, and then asks about each bit of the codeword of the object the answerer is thinking of. The average number of questions is the Huffman tree’s average depth, which is no less than H(X) and less than H(X) + 1, where X is the random variable indicating which of n objects the answerer is thinking of.
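As a quick illustration (a minimal sketch, not code from the paper), the following Python builds a Huffman code for a hypothetical four-object distribution and checks that its average depth lies in [H(X), H(X) + 1):

```python
import heapq
import math

def huffman_code(probs):
    """Build a Huffman code; returns a dict mapping symbol index -> codeword."""
    # Heap entries: (probability, unique tiebreak, {symbol: codeword-so-far}).
    heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tiebreak = len(probs)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least likely subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

probs = [0.4, 0.3, 0.2, 0.1]           # hypothetical example distribution
code = huffman_code(probs)
L = sum(p * len(code[i]) for i, p in enumerate(probs))   # average depth
H = -sum(p * math.log2(p) for p in probs)                # entropy H(X)
assert H <= L < H + 1
```

Note that for this distribution the Huffman codewords are 0, 10, 110, 111, so half of them end in "0" and violate the terminating yes constraint discussed below.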
However, upon further thought, there is a disparity between Huffman coding and how Twenty Questions games are played. Namely, real-world Twenty Questions games always terminate with the questioner pinpointing a specific object (e.g., “Is it a tank?” [1]), to which the answerer replies, “Yes!” In terms of source coding, this is equivalent to enforcing what we call the terminating yes constraint: all codewords must terminate with “1”. Yet Huffman codes do not satisfy this constraint! In short, Huffman trees determine X, but do not specify X.
In this paper, we first provide an example showing that simply appending branches to a Huffman tree may not produce the optimal Twenty Questions tree. We then prove that even under the terminating yes constraint, the average number of questions lies strictly between H(X) and H(X) + 1.
Since Huffman coding solves Twenty Questions without a terminating yes, a natural idea is to first produce the Huffman tree, and then append branches to it so the terminating yes constraint is satisfied. Call the result an augmented Huffman tree. In the following example, we show that augmented Huffman trees may not be optimal Twenty Questions trees.
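The naive augmentation can be sketched in a few lines of Python (the codewords below are a hypothetical Huffman code for four objects): appending a final "1" to every codeword that ends in "0" corresponds to adding the dashed branches.

```python
def augment_codewords(code):
    """Naively append one final question ('1') to every codeword ending in '0',
    so that the last answer in every game is 'Yes'."""
    return {sym: (w + "1" if w.endswith("0") else w) for sym, w in code.items()}

# A hypothetical Huffman code for four objects with decreasing probabilities.
code = {"x1": "0", "x2": "10", "x3": "110", "x4": "111"}
augmented = augment_codewords(code)
assert all(w.endswith("1") for w in augmented.values())
```

Each appended "1" lengthens one codeword by exactly one question, which is why the choice of which sibling ends in "0" matters for the average depth.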
Suppose there are only four objects the answerer could be thinking of. Denote them by x1, x2, x3, x4, with corresponding probabilities p1 ≥ p2 ≥ p3 ≥ p4. Figure 1 shows the only two four-leaf questioning trees possible up to graph isomorphism, where the dashed edges have been added to accommodate the terminating yes constraint. Although there are many possible assignments of objects to leaves, the assignments shown in Figure 1 minimize the average number of questions for each tree shape. One naturally imagines that the choice of a questioning tree should depend on the probability distribution. For instance, if the probabilities are close to uniform, we would guess that the balanced tree is better. However, if we let Q1 and Q2 denote the average number of questions used by the unary and balanced trees, respectively, then

Q1 = p1 + 2 p2 + 3 p3 + 4 p4,    Q2 = 2 p1 + 2 p2 + 3 p3 + 3 p4,
and the difference is

Q2 - Q1 = p1 - p4 ≥ 0,
with equality if and only if the distribution is uniform. Apparently the unary tree dominates the balanced tree, regardless of the probabilities! We think this makes for a good bar bet.
This example demonstrates that augmenting a Huffman tree does not necessarily produce the optimal Twenty Questions tree. For example, if the probabilities were (3/10, 3/10, 2/10, 2/10), then the augmented Huffman tree would be the balanced tree, although the unary tree is better. In fact, among all distributions for which the Huffman algorithm produces a balanced tree, the maximum difference in the average number of questions required by the balanced and unary trees approaches 1/3, and is approached by the distribution (1/3 - ε, 1/3 - ε, 1/3 - ε, 3ε) as ε → 0.
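The comparison can be checked numerically. In the sketch below (illustrative only; the leaf depths 1, 2, 3, 4 and 2, 2, 3, 3 are read off the unary and balanced trees of Figure 1), we verify that the difference equals p1 - p4 and that the gap approaches 1/3 on the near-uniform three-mass distribution:

```python
def avg_questions(depths, probs):
    """Average number of questions for leaves at the given depths."""
    return sum(d * p for d, p in zip(depths, probs))

unary = (1, 2, 3, 4)      # unary tree: ask "Is it x1?", "Is it x2?", ...
balanced = (2, 2, 3, 3)   # balanced tree with dashed edges appended

p = (0.3, 0.3, 0.2, 0.2)  # Huffman would build the balanced tree here
Q1 = avg_questions(unary, p)
Q2 = avg_questions(balanced, p)
assert abs((Q2 - Q1) - (p[0] - p[3])) < 1e-12   # difference is p1 - p4

eps = 1e-4                # near (1/3, 1/3, 1/3, 0) the gap approaches 1/3
p = (1/3 - eps, 1/3 - eps, 1/3 - eps, 3 * eps)
gap = avg_questions(balanced, p) - avg_questions(unary, p)
assert abs(gap - (1/3 - 4 * eps)) < 1e-12
```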
Let LH be the average depth of the Huffman tree, and let Lyes be the average depth of the optimal Twenty Questions tree. In this section, we prove

H(X) < Lyes < H(X) + 1.
Note that these are the same bounds satisfied by LH, except that the lower bound is now strict. We first require two lemmas.
Lemma 3.1 (Half-Bit Lemma): A binary tree that does not satisfy the terminating yes constraint can be modified to satisfy it while adding no more than 1/2 to the average depth.
Proof: Let T be a tree that does not satisfy the terminating yes constraint. By appending a branch to every leaf whose codeword ends with 0, we can construct an augmented tree T′ that does satisfy it. (This forces all leaves to sway in the same direction.) To minimize the increase in average depth, interchange siblings in T as necessary so that the lower-probability sibling is always the one that receives the appended branch. Each appended branch then adds the smaller probability of its sibling pair to the average depth, so the total increase is at most half the total probability. Consequently, if the average depth of T is L, the average depth of T′ will be no more than L + 1/2.

Lemma 3.2 (Gallager’s Redundancy Bound): For all finite distributions, LH - H(X) ≤ p1 + σ, where p1 is the largest probability, and σ := 1 - log2 e + log2(log2 e) ≈ 0.086.
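The lemma’s construction can be sketched recursively (an illustrative Python fragment; the nested-pair tree representation is our own assumption, not the paper’s). Under our reading of the sibling interchanges: a leaf whose sibling is an internal node is moved to the "1" side at no cost, and in a leaf–leaf pair the lower-probability leaf receives the appended branch, so each pair contributes at most half its combined probability.

```python
def min_increase(tree):
    """Added average depth needed to satisfy the terminating yes constraint
    after interchanging siblings as in the Half-Bit Lemma.
    A tree is either a float (leaf probability) or a (left, right) pair."""
    left, right = tree
    l_leaf = isinstance(left, float)
    r_leaf = isinstance(right, float)
    if l_leaf and r_leaf:
        return min(left, right)      # lower-probability leaf takes the branch
    if l_leaf:
        return min_increase(right)   # lone leaf goes on the '1' side: no cost
    if r_leaf:
        return min_increase(left)
    return min_increase(left) + min_increase(right)

# Huffman tree for probabilities (0.4, 0.3, 0.2, 0.1): increase is only 0.1.
assert min_increase((0.4, (0.3, (0.2, 0.1)))) <= 0.5
# A balanced tree over a uniform distribution attains the worst case of 1/2.
assert min_increase(((0.25, 0.25), (0.25, 0.25))) == 0.5
```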
Proof: See Gallager [3].