An Optimal Labeling Scheme for Ancestry Queries

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

An ancestry labeling scheme assigns labels (bit strings) to the nodes of rooted trees such that ancestry queries between any two nodes in a tree can be answered merely by looking at their corresponding labels. The quality of an ancestry labeling scheme is measured by its label size, that is the maximal number of bits in a label of a tree node. In addition to its theoretical appeal, the design of efficient ancestry labeling schemes is motivated by applications in web search engines. For this purpose, even small improvements in the label size are important. In fact, the literature about this topic is interested in the exact label size rather than just its order of magnitude. As a result, following the proposal of a simple interval-based ancestry scheme with label size $2\log_2 n$ bits (Kannan et al., STOC ‘88), a considerable amount of work was devoted to improve the bound on the size of a label. The current state of the art upper bound is $\log_2 n + O(\sqrt{\log n})$ bits (Abiteboul et al., SODA ‘02) which is still far from the known $\log_2 n + \Omega(\log\log n)$ bits lower bound (Alstrup et al., SODA ‘03). In this paper we close the gap between the known lower and upper bounds, by constructing an ancestry labeling scheme with label size $\log_2 n + O(\log\log n)$ bits. In addition to the optimal label size, our scheme assigns the labels in linear time and can support any ancestry query in constant time.

💡 Research Summary

The paper addresses the classic problem of ancestry labeling: assigning a compact bit‑string label to each node of a rooted tree so that, given only the two labels of nodes u and v, one can decide whether u is an ancestor of v. The quality of a scheme is measured by the maximum label length over all nodes in all n‑node trees. Early work introduced a simple interval‑based scheme that uses 2 log n bits (Kannan et al., 1988). Subsequent research gradually reduced the label size, reaching the best known upper bound of log n + O(√log n) bits (Abiteboul et al., 2002) while a lower bound of log n + Ω(log log n) bits was proved (Alstrup et al., 2003). The gap between these bounds had remained open.
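
To make the starting point concrete, here is a minimal Python sketch of the classic interval-based idea behind the 2 log n bound: each node stores its DFS number and the largest DFS number in its subtree, and u is an ancestor of v exactly when v's DFS number falls inside u's interval. The tree representation (a dict from node to child list), the function names, and the example tree are assumptions made for illustration; this is not the scheme proposed in the paper.

```python
from typing import Dict, List, Tuple

Tree = Dict[int, List[int]]  # node -> list of children (hypothetical representation)


def interval_labels(children: Tree, root: int) -> Dict[int, Tuple[int, int]]:
    """Label each node with (pre, last): its DFS number and the largest DFS number in its subtree."""
    labels: Dict[int, Tuple[int, int]] = {}
    pre: Dict[int, int] = {}
    counter = 0
    # Iterative DFS; each stack entry is (node, index of the next child to visit).
    stack = [(root, 0)]
    while stack:
        node, i = stack.pop()
        if i == 0:
            pre[node] = counter
            counter += 1
        kids = children.get(node, [])
        if i < len(kids):
            stack.append((node, i + 1))
            stack.append((kids[i], 0))
        else:
            # Every descendant has been numbered, so counter - 1 is the subtree maximum.
            labels[node] = (pre[node], counter - 1)
    return labels


def is_ancestor(label_u: Tuple[int, int], label_v: Tuple[int, int]) -> bool:
    """True if u is an ancestor of v (a node counts as its own ancestor here)."""
    return label_u[0] <= label_v[0] <= label_u[1]


# Example tree: 0 has children 1 and 2; 1 has children 3 and 4; 2 has child 5.
EXAMPLE_TREE: Tree = {0: [1, 2], 1: [3, 4], 2: [5]}
EXAMPLE_LABELS = interval_labels(EXAMPLE_TREE, 0)
```

Both endpoints are integers below n, so each takes about log n bits, which is where the 2 log n label size comes from. The later schemes aim to avoid paying for the second endpoint in full.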

The authors close this gap by presenting a labeling scheme whose labels have size log n + O(log log n) bits, which matches the lower bound up to the constant hidden in the additive log log n term. Their construction proceeds in three conceptual steps:

Heavy‑Light Decomposition and Supervisors – For each internal node, the child with the largest subtree weight is designated “heavy”; all other children are “light”. The root is considered light. For any node u, sup(u) is defined as the deepest light node on the path from u to the root (if u itself is light, sup(u)=u). Every node is thus associated with a light supervisor: either the node itself or its nearest light ancestor, reached by following heavy edges upward.
Interval Assignment with Local Partial Order – The algorithm performs a depth‑first search that visits light children before heavy ones, assigning DFS numbers 0…n‑1. Each node u receives an interval I(u) based on its DFS number and the maximal DFS number in its subtree. Additionally, the interval I(sup(u)) of its supervisor is stored. The intervals are required to satisfy two local partial‑order conditions:
- (lpo1) I(u) is contained in the intersection of I(sup(u)) and I(sup(parent(u))) for non‑root nodes.
- (lpo2) For every “local quasi‑ancestor” (nodes on the path from parent(u) to sup(parent(u)) plus their light children, excluding nodes with larger DFS numbers) the interval of that quasi‑ancestor precedes I(u) in the order ≺ (which means its right endpoint is smaller than the left endpoint of I(u)).
These conditions guarantee a well‑nested family of intervals that can be encoded efficiently.
Decoding via Two Simple Tests – Given labels of u and v, the decoder checks:
- d1: I(v) ⊂ I(sup(u)) (i.e., v’s interval is strictly inside the supervisor interval of u);
- d2: I(u) ≺ I(v) or I(u)=I(sup(u)).
The authors prove that both tests hold simultaneously if and only if u is an ancestor of v. Ancestry queries are therefore answered with a constant number of integer comparisons and bit shifts.
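
The decomposition and visiting order described in the first two steps can be sketched in Python as follows. The sketch computes heavy children, the supervisor sup(u) of every node, and plain DFS intervals from a traversal that visits light children before the heavy child. The tree representation, function names, tie-breaking rule, and example tree are assumptions for illustration. In particular, these plain DFS intervals are not the adjusted intervals satisfying lpo1 and lpo2, so the tests d1 and d2 should not be expected to work on them.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[int, List[int]]  # node -> list of children (hypothetical representation)


def decompose(children: Tree, root: int):
    """Return (heavy, sup): the heavy child and the supervisor of every node."""
    parent: Dict[int, Optional[int]] = {root: None}
    order: List[int] = []  # parents always appear before their children
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for child in children.get(node, []):
            parent[child] = node
            stack.append(child)

    size: Dict[int, int] = {}
    for node in reversed(order):  # children are processed before their parent
        size[node] = 1 + sum(size[c] for c in children.get(node, []))

    heavy: Dict[int, Optional[int]] = {}
    sup: Dict[int, int] = {}
    for node in order:
        kids = children.get(node, [])
        # Ties go to the first child listed (an arbitrary choice made for this sketch).
        heavy[node] = max(kids, key=lambda c: size[c]) if kids else None
        p = parent[node]
        is_light = p is None or heavy[p] != node
        # A light node supervises itself; a heavy node inherits its parent's supervisor.
        sup[node] = node if is_light else sup[p]
    return heavy, sup


def light_first_intervals(children: Tree, root: int,
                          heavy: Dict[int, Optional[int]]) -> Dict[int, Tuple[int, int]]:
    """Number nodes by a DFS that visits light children first; return (pre, last) per node."""
    # Stable sort: light children (key False) keep their order, the heavy child goes last.
    ordered = {u: sorted(kids, key=lambda c: c == heavy[u]) for u, kids in children.items()}
    intervals: Dict[int, Tuple[int, int]] = {}
    pre: Dict[int, int] = {}
    counter = 0
    stack = [(root, 0)]
    while stack:
        node, i = stack.pop()
        if i == 0:
            pre[node] = counter
            counter += 1
        kids = ordered.get(node, [])
        if i < len(kids):
            stack.append((node, i + 1))
            stack.append((kids[i], 0))
        else:
            intervals[node] = (pre[node], counter - 1)
    return intervals


# Example: 0 -> {1, 2}, 1 -> {3, 4}, 2 -> {5}, 3 -> {6}.
HL_TREE: Tree = {0: [1, 2], 1: [3, 4], 2: [5], 3: [6]}
HL_HEAVY, HL_SUP = decompose(HL_TREE, 0)
HL_INTERVALS = light_first_intervals(HL_TREE, 0, HL_HEAVY)
```

In the example, nodes 1, 3, and 6 form the heavy path hanging from the root, so all three have supervisor 0, while the light nodes 2 and 4 supervise themselves.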

Label size analysis – The set of possible intervals is kept small by carefully bounding the number of distinct interval lengths; the authors show that only O(n / log n) different intervals are needed. Encoding an interval therefore requires at most log n + O(1) bits, and the extra information needed to identify the supervisor adds only O(log log n) bits. Consequently each label occupies log n + O(log log n) bits.

Complexity – The labeling algorithm runs in linear time: a single DFS plus constant‑time operations per node. The decoder uses only basic RAM operations (addition, subtraction, shifts, and comparisons) and thus answers queries in O(1) time on a word‑RAM where word size is Ω(log n).
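
As a rough illustration of what constant-time decoding with shifts and comparisons looks like, the Python sketch below packs an interval into a single integer and tests containment after unpacking it. The bit layout (`lo` in the high bits, `hi` in the low `bits` bits) and the function names are assumptions chosen for simplicity; this is not the encoding used in the paper, which is more compact.

```python
from typing import Tuple


def pack(lo: int, hi: int, bits: int) -> int:
    """Pack an interval into one integer: lo in the high bits, hi in the low `bits` bits."""
    return (lo << bits) | hi


def unpack(label: int, bits: int) -> Tuple[int, int]:
    """Recover (lo, hi) from a packed label using one shift and one mask."""
    mask = (1 << bits) - 1
    return label >> bits, label & mask


def contains(label_u: int, label_v: int, bits: int) -> bool:
    """True if the interval packed in label_v lies inside the one packed in label_u."""
    lo_u, hi_u = unpack(label_u, bits)
    lo_v, hi_v = unpack(label_v, bits)
    return lo_u <= lo_v and hi_v <= hi_u


# With 3 bits per endpoint, the intervals (1, 3) and (2, 2) each fit in a 6-bit label.
OUTER = pack(1, 3, 3)
INNER = pack(2, 2, 3)
```

Each call performs a fixed number of shifts, masks, and comparisons, so the cost does not depend on n as long as a label fits in a machine word.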

Impact – The scheme matches the theoretical lower bound up to the constant in the additive log log n term, providing essentially optimal ancestry labeling for arbitrary rooted trees. Although the improvement affects only the lower‑order additive term, from O(√log n) to O(log log n) bits, it matters for applications such as XML indexing, where millions of node labels must reside in main memory. Smaller labels reduce memory footprint, improve cache behavior, and can lead to faster query processing in practice.

In summary, the paper delivers a near‑optimal ancestry labeling scheme with label size log n + Θ(log log n) bits, linear‑time construction, and constant‑time queries, thereby resolving a long‑standing open problem in the field of informative labeling.
