Evaluation of a Simple, Scalable, Parallel Best-First Search Strategy

Large-scale, parallel clusters composed of commodity processors are increasingly available, enabling the use of vast processing capabilities and distributed RAM to solve hard search problems. We investigate Hash-Distributed A* (HDA*), a simple approach to parallel best-first search that asynchronously distributes and schedules work among processors based on a hash function of the search state. We use this approach to parallelize the A* algorithm in an optimal sequential version of the Fast Downward planner, as well as a 24-puzzle solver. The scaling behavior of HDA* is evaluated experimentally on a shared-memory multicore machine with 8 cores, a cluster of commodity machines using up to 64 cores, and large-scale high-performance clusters using up to 2400 processors. We show that this approach scales well, allowing the effective utilization of large amounts of distributed memory to optimally solve problems which require terabytes of RAM. We also compare HDA* to Transposition-table Driven Scheduling (TDS), a hash-based parallelization of IDA*, and show that, in planning, HDA* significantly outperforms TDS. A simple hybrid that combines HDA* and TDS to exploit the strengths of both algorithms is proposed and evaluated.


💡 Research Summary

The paper presents a thorough evaluation of a simple yet highly scalable parallel best‑first search strategy called Hash‑Distributed A* (HDA*). HDA* distributes search states among processors using a hash function applied to each state; the hash determines the owning processor, which then stores the state in its local open and closed lists. Because the distribution is asynchronous and fully decentralized, there is no central scheduler or master node that could become a bottleneck. The authors integrate HDA* into an optimal sequential version of the Fast Downward planner and into a 24‑puzzle solver, thereby testing the approach on both planning and classic combinatorial problems.
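The core rule described above, that a state's hash alone determines which processor owns it and maintains its open and closed lists, can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation; the class and method names (`HDAStarWorker`, `owner`, `receive`) and the choice of `crc32` as the hash are assumptions made for the example.

```python
import heapq
from zlib import crc32

class HDAStarWorker:
    """Illustrative sketch of one HDA* worker. Each worker keeps a local
    open list (priority queue on f = g + h) and closed list; a state's
    hash value alone decides which worker owns it."""

    def __init__(self, worker_id, num_workers):
        self.worker_id = worker_id
        self.num_workers = num_workers
        self.open = []    # local priority queue of (f, g, state)
        self.closed = {}  # state -> best g-value seen so far

    def owner(self, state):
        # The hash function fully determines ownership, so no central
        # scheduler or master node is needed (crc32 is an assumption here).
        return crc32(repr(state).encode()) % self.num_workers

    def receive(self, state, g, f):
        # Duplicate detection is purely local: only the owner ever sees
        # this state, so its own closed list is authoritative for it.
        if self.owner(state) != self.worker_id:
            raise ValueError("state routed to wrong worker")
        if state in self.closed and self.closed[state] <= g:
            return  # already reached with an equal or better g; discard
        self.closed[state] = g
        heapq.heappush(self.open, (f, g, state))
```

In the full algorithm, a worker expanding a state sends each successor asynchronously to that successor's owner rather than calling `receive` directly; the asynchrony is what avoids synchronization bottlenecks.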

Experiments are conducted at three hardware scales. On an 8-core shared-memory machine, HDA* expands nodes in nearly the same order as sequential A* while incurring negligible overhead, demonstrating that the parallelization does not compromise optimality. On a commodity cluster of up to 64 cores, runtime decreases almost linearly with the number of cores, and memory consumption per node stays within a few tens of gigabytes, showing effective load balancing. The most striking results come from a high-performance cluster with up to 2,400 processors: HDA* successfully solves instances that require terabytes of RAM, far beyond the capacity of any single machine, by spreading the closed list across many nodes.

For comparison, the authors implement Transposition‑table Driven Scheduling (TDS), a hash‑based parallelization of IDA*. Although TDS also uses a hash to assign work, its underlying depth‑first, iterative‑deepening nature leads to more frequent synchronization and higher communication costs. Empirical data reveal that, on the same planning benchmarks, HDA* outperforms TDS by a factor of three on average, especially in domains where hash collisions are rare. Moreover, HDA*’s memory usage is more efficient because each processor maintains its own closed list without the need for a global transposition table.
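The transposition table that TDS relies on serves the same duplicate-detection role as HDA*'s local closed lists, but within an iterative-deepening search. A rough sketch of that pruning rule, with the function name and table layout assumed for illustration rather than taken from the authors' code:

```python
def tds_prune(transposition_table, state, g):
    """Illustrative TDS-style duplicate pruning: each worker keeps a
    transposition table for the states it owns, and a node is pruned if
    it was already reached with an equal or shorter path during the
    current IDA* iteration. Not the authors' implementation."""
    best = transposition_table.get(state)
    if best is not None and best <= g:
        return True  # prune: state already seen at equal or lower cost
    transposition_table[state] = g  # record the improved path cost
    return False
```

Because IDA* revisits the same shallow states on every iteration, this table is consulted far more often than an A* closed list, which is one source of the extra communication the summary mentions.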

Recognizing that each algorithm has distinct strengths, the authors propose a hybrid scheme that starts the search with HDA* to exploit rapid state propagation and distributed memory, then switches to TDS once the search depth reaches a predefined threshold. This dynamic hand-off yields an additional 10–15% speed-up on several benchmarks, confirming that the two methods can complement each other.

The paper’s contributions are threefold: (1) it demonstrates that a straightforward hash-based, asynchronous work distribution can achieve near-linear scaling from a few cores to thousands, (2) it validates the approach on problems demanding terabyte-scale memory, thereby opening new possibilities for solving previously intractable planning tasks, and (3) it provides a rigorous empirical comparison with a leading parallel IDA* technique and introduces a hybrid model that combines the strengths of both.

The authors also discuss limitations. The quality of the hash function is critical; poor hash distribution leads to load imbalance and increased communication. Network latency becomes a factor on very large clusters, potentially eroding the benefits of asynchrony. Future work is suggested in the areas of adaptive hash re‑balancing, learned hash functions that minimize collisions, and compressed message passing to further reduce overhead. Overall, the study establishes hash‑based distribution as a practical, robust foundation for parallel best‑first search on modern large‑scale computing platforms.