Monte Carlo Search Algorithm Discovery for One Player Games
Much current research in AI and games is being devoted to Monte Carlo search (MCS) algorithms. While the quest for a single unified MCS algorithm that would perform well on all problems is of major interest for AI, practitioners often know in advance the problem they want to solve, and spend plenty of time exploiting this knowledge to customize their MCS algorithm in a problem-driven way. We propose an MCS algorithm discovery scheme to perform this in an automatic and reproducible way. We first introduce a grammar over MCS algorithms that enables inducing a rich space of candidate algorithms. Afterwards, we search in this space for the algorithm that performs best on average for a given distribution of training problems. We rely on multi-armed bandits to approximately solve this optimization problem. The experiments, conducted on three different domains, show that our approach enables discovering algorithms that outperform several well-known MCS algorithms such as Upper Confidence bounds applied to Trees and Nested Monte Carlo search. We also show that the discovered algorithms are generally quite robust with respect to changes in the distribution over the training problems.
💡 Research Summary
The paper addresses a fundamental tension in Monte‑Carlo Search (MCS) research: while the community seeks a single, universally strong algorithm, practitioners often know the specifics of the problem they wish to solve and therefore craft bespoke MCS variants. To bridge this gap, the authors propose an automated discovery framework that can generate, evaluate, and select the most effective MCS algorithm for a given distribution of training problems.
The methodology consists of two main components. First, the authors define a formal grammar that captures the building blocks of any MCS algorithm. These blocks include core operations such as selection, simulation (rollout), backup, restart, and parameter adaptation. By treating each operation as a token and specifying composition rules in a context‑free grammar, they create a combinatorial space of candidate algorithms that ranges from well‑known methods (e.g., UCT, Nested Monte‑Carlo, RAVE) to entirely novel hybrids. The grammar is deliberately expressive: it allows recursion, conditional execution, and nesting, thereby supporting sophisticated strategies such as adaptive depth control or dynamic policy switching.
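To make the grammar idea concrete, here is a minimal sketch of how a context-free grammar over MCS building blocks can induce a space of candidate algorithms. The production names (`simulate`, `repeat`, `select`, `step`) and the depth-bounded expansion are illustrative placeholders, not the paper's exact grammar:

```python
import random

# Toy context-free grammar over MCS building blocks. Each production composes
# a search strategy from simpler ones; recursion yields nested hybrids.
GRAMMAR = {
    "Search": [
        ("simulate",),          # plain random rollout
        ("repeat", "Search"),   # repeat a sub-search under a budget
        ("select", "Search"),   # bandit-style selection, then recurse
        ("step", "Search"),     # commit to the best move, then recurse
    ],
}

def expand(symbol, max_depth, rng):
    """Randomly expand a nonterminal into a concrete algorithm (a nested tuple)."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token
    # Force the terminal production once the depth budget is exhausted.
    choices = GRAMMAR[symbol] if max_depth > 0 else [("simulate",)]
    production = rng.choice(choices)
    return tuple(expand(s, max_depth - 1, rng) for s in production)

rng = random.Random(0)
candidates = {expand("Search", max_depth=3, rng=rng) for _ in range(50)}
# Each candidate is a nested tuple, e.g. ('select', ('repeat', ('simulate',)))
```

Enumerating or sampling such expansions up to a bounded depth is one way to realize the combinatorial space the grammar defines, with known methods like nested search appearing as particular nestings of the same tokens.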
Second, the framework searches this space using a multi‑armed bandit (MAB) approach. Each candidate algorithm is modeled as an arm; the reward for pulling an arm is the performance (win rate, score, or any domain‑specific metric) obtained when the algorithm is run on a sampled training instance. The authors employ the Upper Confidence Bound (UCB) algorithm to balance exploration of untested candidates with exploitation of those that have already shown promise. This bandit‑driven search proceeds iteratively: a batch of training problems is sampled, each candidate is evaluated on a subset, the UCB values are updated, and the next batch of evaluations is allocated preferentially to high‑UCB arms. By limiting the total number of simulations (the computational budget), the process yields a near‑optimal algorithm without exhaustive enumeration.
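The bandit-driven selection loop can be sketched as follows. This is a generic UCB1 allocation over candidate algorithms under a fixed evaluation budget, a simplified stand-in for the paper's procedure; the `evaluate` callback and the toy candidates are assumptions for illustration:

```python
import math
import random

def ucb_search(candidates, evaluate, budget, c=2.0):
    """Allocate `budget` evaluations over candidates with UCB1.

    `evaluate(candidate)` returns a stochastic reward in [0, 1], e.g. the
    candidate algorithm's score on one sampled training problem. Returns the
    candidate with the best empirical mean reward.
    """
    counts = [0] * len(candidates)
    sums = [0.0] * len(candidates)
    for t in range(budget):
        if t < len(candidates):
            i = t  # play every arm once first
        else:
            # Pick the arm maximizing mean reward plus exploration bonus.
            i = max(range(len(candidates)),
                    key=lambda k: sums[k] / counts[k]
                    + math.sqrt(c * math.log(t) / counts[k]))
        sums[i] += evaluate(candidates[i])
        counts[i] += 1
    return candidates[max(range(len(candidates)),
                          key=lambda k: sums[k] / counts[k])]

# Toy usage: three "algorithms" with hidden Bernoulli mean rewards.
rng = random.Random(0)
means = {"weak": 0.2, "ok": 0.5, "strong": 0.8}
best = ucb_search(list(means),
                  lambda a: float(rng.random() < means[a]),
                  budget=3000)
# With this budget, "strong" is identified with high probability.
```

The key property this illustrates is that weak candidates are abandoned after few evaluations while promising ones receive most of the budget, which is what makes the search tractable without exhaustively evaluating every algorithm in the grammar-induced space.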
Empirical validation is performed on three distinct one‑player domains: (1) Sokoban, a classic push‑box puzzle with a large discrete state space; (2) Hex, a connection board game adapted to a single‑player evaluation setting; and (3) Sliding‑Tile puzzles with continuous‑like dynamics. For each domain, the authors generate roughly one thousand training instances drawn from a uniform distribution over problem parameters. The discovered algorithms are then benchmarked against a suite of strong baselines, including Upper Confidence bounds applied to Trees (UCT), Nested Monte‑Carlo (NMC), and Rapid Action Value Estimation (RAVE).
Results consistently show that the automatically discovered algorithms outperform the baselines. In most cases the improvement ranges from 5 % to 12 % in average win‑rate or final score, with the most pronounced gains occurring early in the search horizon—i.e., the discovered strategies allocate simulation effort more efficiently during the initial phases of the game. Moreover, the authors test robustness by altering the distribution of training problems (e.g., shifting difficulty levels or board sizes). The performance degradation of the discovered algorithms is modest, indicating that the bandit‑guided search does not over‑fit to a narrow training set but rather finds generally robust policies.
Key contributions of the work are:
- A grammar‑based representation of MCS algorithms that makes the design space explicit, reproducible, and extensible.
- A bandit‑driven optimization loop that efficiently navigates this space under realistic computational constraints.
- Comprehensive experimental evidence across heterogeneous domains, demonstrating both superior performance and resilience to distributional shifts.
The paper also discusses limitations and future directions. The candidate space, while formally defined, can still explode combinatorially; careful grammar design is required to keep the search tractable. The initial exploration phase of the bandit algorithm may be costly when the number of arms is very large, suggesting the need for pre‑filtering or hierarchical bandit structures. Additionally, integrating meta‑learning techniques to warm‑start the bandit with prior knowledge from related domains could further accelerate discovery.
In summary, this research introduces a practical, automated pipeline for tailoring Monte‑Carlo Search algorithms to specific problem families. By marrying an expressive grammatical formalism with a principled multi‑armed bandit optimizer, the authors provide a scalable alternative to manual algorithm engineering, opening the door to rapid, data‑driven development of high‑performing game‑playing and optimization agents.