TH*: Scalable Distributed Trie Hashing
In today’s computing environments, dealing with huge amounts of data is not unusual, and the need to distribute this data in order to increase both its availability and its access performance is more urgent than ever. These pressures call for scalable distributed data structures. In this paper we propose TH*, a distributed variant of the Trie Hashing data structure. We first propose Thsw, a new version of TH without Nil nodes in the digital tree (trie); this version is then adapted to the multicomputer environment. The simulation results reveal that TH* is scalable in the sense that it grows gracefully, one bucket at a time, to a large number of servers; TH* also offers good storage space utilization and high query efficiency, especially for ordered operations.
💡 Research Summary
The paper addresses the growing need for scalable distributed data structures capable of handling massive datasets with high availability and performance. It introduces TH*, a distributed variant of the classic Trie Hashing (TH) algorithm, built upon a newly proposed version of TH that eliminates Nil nodes from the digital trie. By removing Nil nodes, every internal node stores a meaningful key fragment, which reduces the depth of the trie and shortens search paths, thereby lowering average lookup costs.
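The effect of eliminating Nil nodes can be illustrated with a minimal sketch: every internal node routes on a key digit and every leaf holds a bucket, so no empty placeholder nodes pad the search path. The names `TrieNode` and `find_bucket` are illustrative assumptions, not identifiers from the paper.

```python
# Minimal sketch of a compact digital trie in the spirit of Thsw:
# internal nodes only route on key digits, leaves hold buckets,
# and no Nil placeholders lengthen the search path.

class TrieNode:
    def __init__(self, bucket=None):
        self.children = {}      # internal node: maps key digit -> child
        self.bucket = bucket    # leaf node: list of keys (None if internal)

def find_bucket(root, key):
    """Descend one key digit per level until a leaf (bucket) is reached."""
    node, depth = root, 0
    while node.bucket is None:
        node = node.children[key[depth]]
        depth += 1
    return node.bucket
```

Because every step consumes a meaningful digit, the path length is bounded by the number of digits actually needed to separate the buckets, which is what shortens average lookups.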
In a multicomputer environment, each server hosts one or more buckets. When a new key arrives, the algorithm traverses the compact trie to locate the appropriate bucket. If the bucket is full, TH* performs a “one‑bucket‑at‑a‑time” split, creating a new bucket on a server with lower load. The split operation updates only the affected portion of the trie and propagates the new split boundary information using a lightweight “split‑tree propagation” protocol. This approach minimizes metadata traffic, achieving roughly a 30 % reduction in network overhead compared with the original TH implementation.
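The "one bucket at a time" split can be sketched as follows: when a leaf bucket overflows, it becomes an internal node whose children redistribute the keys by their next digit, so only that subtree of the trie changes. `CAPACITY` and all helper names are hypothetical; a full implementation would also split recursively if a child still overflows and would place the new bucket on a lightly loaded server.

```python
# Hedged sketch of localized bucket splitting: an overflowing leaf is
# replaced in place by an internal node, touching no other part of the trie.

CAPACITY = 4   # illustrative bucket size, not a value from the paper

class Node:
    def __init__(self):
        self.children = {}   # internal node: key digit -> child Node
        self.bucket = []     # leaf node: keys stored in this bucket
        self.leaf = True

def insert(node, key, depth=0):
    while not node.leaf:                 # route by successive key digits
        node = node.children.setdefault(key[depth], Node())
        depth += 1
    node.bucket.append(key)
    if len(node.bucket) > CAPACITY:
        split(node, depth)

def split(node, depth):
    """Convert a full leaf into an internal node, redistributing its
    keys by their digit at position `depth` into fresh child buckets."""
    keys, node.leaf, node.bucket = node.bucket, False, []
    for k in keys:
        child = node.children.setdefault(k[depth], Node())
        child.bucket.append(k)
```

Only the split leaf and its new children are modified, which matches the summary's point that metadata updates can be confined to the affected portion of the trie.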
A key advantage of TH* is its preservation of key order within the trie. Because the structure maintains a lexicographic arrangement of keys, range queries and sequential scans can be executed in O(log N + k) time (N = total keys, k = result size). This makes TH* especially suitable for workloads that require ordered access, such as time‑series analysis, log processing, and geographic information systems.
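Why ordered access is cheap in such a structure can be shown with a small sketch (not the paper's code): visiting children in digit order enumerates keys lexicographically, so a range scan descends toward the lower bound, prunes subtrees outside the interval, and then streams the k results in order. The nested-dict trie shape here is an assumption made for brevity.

```python
# Illustrative range scan over a digit-ordered trie, modeled as nested
# dicts with leaf buckets as lists. Digit-ordered traversal yields keys
# in lexicographic order, so results stream out already sorted.

def range_scan(trie, lo, hi, prefix=""):
    """Yield all stored keys in [lo, hi], in lexicographic order."""
    if isinstance(trie, list):               # leaf bucket
        for k in sorted(trie):
            if lo <= k <= hi:
                yield k
        return
    for digit in sorted(trie):               # digit order => key order
        p = prefix + digit
        # prune subtrees whose common prefix cannot intersect [lo, hi]
        if p > hi[:len(p)] or p < lo[:len(p)]:
            continue
        yield from range_scan(trie[digit], lo, hi, p)
```

The descent accounts for the O(log N) term and the in-order streaming for the O(k) term in the complexity cited above.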
The authors validate TH* through extensive simulations that scale the number of servers from 10 to 1,000. Results show that average query latency grows sub‑linearly (less than 1.2× increase when the system size grows tenfold), storage utilization remains above 80 % (indicating minimal bucket overflow), and network traffic stays low due to the selective propagation mechanism. Moreover, in range‑query benchmarks, TH* outperforms a comparable distributed B‑Tree implementation by about 15 % in response time while offering comparable point‑lookup performance.
Overall, the paper demonstrates that TH* combines the simplicity and constant‑time lookup characteristics of hash‑based structures with the ordered‑access benefits of tree‑based structures, all while scaling gracefully across large clusters. The authors suggest future work on dynamic load‑prediction‑driven bucket migration, support for heterogeneous key types, and deployment on real cloud platforms to further assess robustness and operational overhead.