Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal and intended to facilitate further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.
💡 Research Summary
Monte Carlo Tree Search (MCTS) has become a cornerstone of modern reinforcement learning, especially after the success of AlphaZero‑style algorithms that combine planning with deep learning. The core of MCTS is the tree policy, traditionally based on the Upper Confidence Bound applied to trees (UCT). AlphaZero introduced a prior term into the UCT formula, resulting in the PUCT algorithm, which empirically improves exploration by biasing the search toward moves that a neural network predicts to be promising. Although many stronger, theoretically‑grounded UCB variants exist—such as variance‑aware UCB‑V—integrating them with a prior has been difficult because PUCT was derived heuristically rather than from a principled optimization framework.
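The PUCT selection rule described above can be sketched in a few lines. This is a minimal illustration of the AlphaZero-style score, not code from the paper; the function name and the constant `c_puct` are illustrative:

```python
import numpy as np

def puct_scores(q, prior, visit_counts, c_puct=1.25):
    """AlphaZero-style PUCT: value estimate plus a prior-weighted
    exploration bonus that decays with each child's visit count."""
    total_visits = visit_counts.sum()
    exploration = c_puct * prior * np.sqrt(total_visits) / (1.0 + visit_counts)
    return q + exploration

# The tree policy descends to the child with the highest PUCT score.
q = np.array([0.1, 0.3, 0.2])        # value estimates per child
prior = np.array([0.5, 0.3, 0.2])    # network policy prior per child
visits = np.array([10, 5, 1])        # visit counts per child
best_child = int(np.argmax(puct_scores(q, prior, visits)))
```

Note how the prior scales the exploration bonus: rarely visited children that the network considers promising receive a large bonus, which is the mechanism credited with AlphaZero's exploration efficiency.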
Recent work reframed MCTS as a regularized policy optimization (RPO) problem, showing that PUCT can be interpreted as the solution of a constrained optimization where the prior acts as a regularizer. Building on this insight, the authors propose Inverse‑RPO, a systematic methodology that starts from any prior‑free UCB formula, writes its confidence bound as a Lagrangian term, and then injects a prior by adding a logarithmic regularizer. This process yields a new family of prior‑based tree policies that retain the theoretical properties of the original UCB while benefiting from the guidance of a neural prior.
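The RPO view can be made concrete with a small numeric sketch. Assuming a KL-regularized objective of the form max_pi <pi, q> - lam * KL(pi || prior) (a standard choice in this literature; the exact regularizer used by the paper may differ), the optimum has a closed form as a prior-weighted softmax:

```python
import numpy as np

def rpo_policy(q, prior, lam):
    """Solve max_pi <pi, q> - lam * KL(pi || prior) in closed form.
    The optimizer is pi_a proportional to prior_a * exp(q_a / lam):
    large lam pulls pi toward the prior, small lam toward argmax q."""
    logits = np.log(prior) + q / lam
    logits -= logits.max()  # subtract max for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()
```

Inverse-RPO runs this correspondence in the other direction: given a prior-free confidence bound, it recovers a regularized objective whose solution reintroduces the prior, yielding a prior-based tree policy with the original bound's structure.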
Applying Inverse‑RPO to the variance‑aware UCB‑V produces two novel tree policies, the first of which adds a log‑prior term to the UCB‑V bound.
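For reference, the prior-free UCB-V bound that serves as the starting point (due to Audibert et al.) replaces UCB1's fixed-width bonus with an empirical-Bernstein bonus that shrinks for low-variance arms. The sketch below shows that standard prior-free bound, not the paper's prior-based variants:

```python
import numpy as np

def ucb_v(mean, var, n, t, b=1.0):
    """Prior-free UCB-V (empirical Bernstein bound).
    mean, var: empirical mean and variance of the arm's rewards
    n: arm pull count, t: total pull count, b: reward range.
    The variance term makes the bonus adaptive: a low-variance arm
    needs fewer pulls before its bound tightens."""
    log_t = np.log(t)
    return mean + np.sqrt(2.0 * var * log_t / n) + 3.0 * b * log_t / n
```

Between two arms with equal means, the lower-variance arm gets the smaller exploration bonus, which is the property the paper's prior-based policies carry over into the search tree.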