Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal; it is intended to facilitate further research on principled prior-based UCTs. Code: https://github.com/Max-We/inverse-rpo.
The combination of reinforcement learning (RL) with Monte Carlo Tree Search (MCTS) has led to major advances in artificial intelligence. Starting with AlphaGo (Silver et al., 2016), and subsequently generalized by AlphaZero (Silver et al., 2018) and MuZero (Schrittwieser et al., 2020), this line of work has achieved superhuman performance across domains requiring long-horizon reasoning and complex decision-making. These results underscore the power of integrating learning with search-based planning, and they motivate ongoing efforts to develop more efficient and broadly applicable variants of MCTS and AlphaZero-style methods.
A central component of MCTS is the tree policy, which balances exploration and exploitation to minimize regret. Before AlphaZero, such policies were derived from upper confidence bounds (UCBs) such as UCB1 (Auer et al., 2002), giving rise to the well-studied family of UCT algorithms, which apply UCBs to tree search. Over time, many variants beyond UCB1, including UCB-V, Bayesian UCT, and UCB1-Uniform/Power (Audibert et al., 2009; Tesauro et al., 2012; Asai and Wissow, 2024), have been explored and shown to have a significant effect on MCTS performance. With the AlphaZero family of algorithms, UCB1 was extended by incorporating a prior term estimated by a neural network, yielding PUCT. This prior-based extension of UCB1 greatly improved search efficiency in both small and large action spaces (Wu et al., 2023) and has since become the de facto standard tree policy. However, extending this prior-based approach to other UCBs has proven difficult. While the authors claim that PUCT is a variant of PUCB (Rosin, 2011), which is itself an extension of UCB1 with contextual information, a complete proof was never presented. Indeed, the concrete form of PUCT deviates from UCB1 and PUCB by introducing a heuristic decay of the exploration term, and it is generally assumed to have been derived empirically rather than from formal guarantees.¹ We hypothesize that the extension of other UCBs to prior-based UCTs in the context of MCTS, although promising in theory, has remained underexplored for that reason.
Table 1: Four prior-based UCT rules arranged by base UCB (columns) and heuristic form (rows). The heuristic form of the UCTs is described in Section 2.1. Our contributions are marked with *.
                 UCB1     UCB-V
canonical form   UCT-P    UCT-V-P*
heuristic form   PUCT     PUCT-V*
Recent work has reinterpreted MCTS as regularized policy optimization (RPO), showing that PUCT can be viewed as tracking the solution to a specific RPO.
Our key insight is that this perspective not only provides a retrospective understanding of the form of prior-based UCTs, as previously described for PUCT (Grill et al., 2020), but also supplies the theoretical foundation needed to systematically derive any prior-based UCT directly from a prior-free UCB by expressing it as an RPO. Building on this insight, we continue the study of prior-based UCTs beyond PUCT by extending other, potentially stronger, UCB-based policies with prior terms. More concretely, we make the following key contributions:
Inverse-RPO. We introduce Inverse-RPO, a principled, step-by-step method that transforms a UCB into its prior-based counterpart. Unlike prior work that starts from an already prior-based selector such as PUCT (Grill et al., 2020), our method derives a prior-based selector systematically from its prior-free base form (e.g., UCB1). While prior work provides the formal framework linking MCTS and UCTs to RPO (Grill et al., 2020), we rearrange and slightly extend this approach into an easy-to-follow methodology, enabling researchers to apply it directly to their UCB of choice in future work.
Variance-Aware Prior-Based UCTs.
To explore prior-based UCTs beyond PUCT, we instantiate Inverse-RPO on the variance-aware UCB-V to obtain two prior-based tree policies (see Table 1): (i) UCT-V-P, a principled RPO-derived variant; and (ii) PUCT-V, a heuristic analogue aligned with the practical form of PUCT. As experimental baselines, we compare these derived tree policies against PUCT (the de facto choice in the AlphaZero family of algorithms), while also benchmarking against UCT-P (Grill et al., 2020), which can be viewed as a prior-based UCB1 without the heuristic alterations introduced with PUCT.
Empirical Validation and Implementation. Across a range of benchmark domains, we show that our variance-aware prior-based UCT-V-P and PUCT-V consistently match or outperform UCT-P and PUCT, respectively, indicating that the benefits of replacing UCB1 with stronger UCBs such as UCB-V extend naturally to prior-based MCTS as used in the AlphaZero family of algorithms. We further propose an efficient implementation strategy for variance-aware MCTS, demonstrating that the derived UCT-V-P and PUCT-V can be deployed in practice as easily as the commonly used PUCT, with no extra computational overhead.
Before presenting our methodology, we briefly review the key background concepts and notation needed throughout the paper. We begin with Monte Carlo Tree Search (MCTS) and its standard UCT formulation, followed by the regularized policy optimization (RPO) perspective that provides the foundation for our derivations.
Monte Carlo Tree Search (MCTS) is a widely used planning algorithm that incrementally builds a search tree through repeated simulations (see Appendix B). During search, a tree policy based on an upper confidence bound (UCB) balances exploration and exploitation (Kocsis and Szepesvári, 2006).² When UCB1 is applied to trees, this yields the classical upper confidence bound for trees (UCT1) (Kocsis and Szepesvári, 2006):

q_a + c · √(log N / n_a).   (1)
Here q_a is the empirical action value, n_a its visit count, and N = Σ_b n_b the total visits at the node. UCT1 is provably optimal in the sense that it achieves the correct exploration-exploitation trade-off and converges to the optimal policy as the number of visits grows. Throughout this work, we add 1 to the visit count n_a, without loss of generality, to avoid division by zero and to simplify the subsequent analysis.
² Notation: (1) We use UCB/UCT in upright font as generic descriptors for the family of upper confidence bound rules (UCT denotes a UCB applied to trees). (2) Concrete algorithms/instantiations are written in italics (e.g., UCB-V, PUCT). (3) The canonical Hoeffding-based forms are written UCB1/UCT1 to distinguish them from the generic descriptors in (1). A suffix "-P" indicates a prior-based extension (e.g., UCT-P, PUCT-V).
The action selection rule used in AlphaZero, commonly referred to as PUCT (Silver et al., 2017), was introduced later. It augments UCB1 with the policy prior π_θ(a), which is approximated by a neural network:

q_a + c · π_θ(a) · √N / (1 + n_a).   (2)
PUCT Heuristic Exploration Decay. Besides the prior term, PUCT (2) departs from the principled UCB1 rule by adopting a different exploration bonus that scales only with the square root of the total visit count N, rather than with √(log N). Formally, this amounts to replacing the exploration term √(log N / n_a) of UCB1 with the decayed term √N / (1 + n_a).
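To make the contrast concrete, the following minimal Python sketch computes the two scores side by side. It is illustrative only: the function names and the exploration constant c are ours, not part of any library API.

```python
import numpy as np

def uct1_score(q, n, N, c=1.0):
    """Principled UCB1-style score: bonus scales with sqrt(log N / n_a)."""
    return q + c * np.sqrt(np.log(N) / (1.0 + n))

def puct_score(q, n, N, prior, c=1.0):
    """AlphaZero-style PUCT: prior-weighted bonus with the heuristic
    decay sqrt(N) / (1 + n_a) in place of sqrt(log N / n_a)."""
    return q + c * prior * np.sqrt(N) / (1.0 + n)

# Example: three actions at one node.
q = np.array([0.1, 0.5, 0.3])       # empirical action values
n = np.array([10, 2, 4])            # visit counts
prior = np.array([0.2, 0.5, 0.3])   # policy prior pi_theta(a)
N = n.sum()
best = int(np.argmax(puct_score(q, n, N, prior)))
```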
Later, Grill et al. (2020) proposed a principled variant, UCT-P, which, similar to PUCT, extends UCB1 by incorporating the policy prior but without the heuristic exploration decay.
By formalizing MCTS as a regularized policy optimization (RPO) problem, they showed that UCT-P directly expresses an RPO and that even PUCT can be cast within this framework, thus providing a theoretical justification in hindsight for its heuristic form.
Many machine-learning problems have been expressed as convex optimization problems (Bubeck, 2015), such as Support Vector Machines (SVMs) (Schölkopf and Smola) or Trust Region Policy Optimization (TRPO) (Schulman et al., 2017). Equivalently, reinforcement learning (RL) can be interpreted as a convex optimization problem by expressing it as an RPO,

y⋆ = argmax_{y ∈ S} [ q⊤y − λ_N · R(π_θ, y) ],
where y is a distribution over actions, q the corresponding q-values, and R : S² → ℝ a divergence-based convex regularizer that keeps y close to the prior policy π_θ (Neu et al., 2017; Geist et al., 2019; Grill et al., 2020).
In particular, UCT-P (3), the prior-based extension of (1), corresponds to the solution of an RPO with the Hellinger distance:

ȳ_UCT-P = argmax_{y ∈ S} [ q⊤y − λ_N · D_H(π_θ, y) ],   (5)

where A denotes the action set and S is the |A|-dimensional probability simplex.
Similarly, they showed that PUCT (2) expresses the solution to an RPO with the reverse-KL distance:

ȳ_PUCT = argmax_{y ∈ S} [ q⊤y − λ_N · D_KL(π_θ, y) ].   (6)
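For the reverse-KL case this RPO has a simple closed-form solution, as shown by Grill et al. (2020). We restate it here as a sketch because it makes explicit how the prior reweights the search policy; α is a normalizing constant not introduced elsewhere in this text.

```latex
\[
\bar{y}(a) \;=\; \lambda_N \, \frac{\pi_\theta(a)}{\alpha - q_a},
\qquad \alpha \ \text{chosen such that} \ \sum_{a} \bar{y}(a) = 1 .
\]
```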
From this RPO perspective, UCT-P (3) and PUCT (2) can be recovered by considering the optimal action of the respective RPOs and evaluating the marginal one-step gain when selecting action a. Following prior work, we keep the notation ∂/∂n_a; operationally, this denotes the change along the coupled MCTS update in which both n_a and the total count N = Σ_b n_b increase by one.

Inverse-RPO derives a prior-based UCT from a prior-free UCB in four steps (a generic template is sketched after the list):

1. Factorize the UCT bonus. Decompose the exploration term into a global scale Φ(N) and a monotone shape function h of the empirical visit probability π(a).

2. Define a separable f-regularizer. Select a convex generator f such that f′(r) = −h(r), yielding a prior-free RPO.

3. Lift the regularizer with a prior. Note that the prior-free RPO corresponds to the special case of an implicit prior-based RPO with uniform prior; generalize it by replacing the separable f-regularizer with a Csiszár f-divergence D_f(π_θ, y), thereby obtaining an explicit prior-based RPO.

4. Recover the prior-based UCT rule. Take the marginal gain with respect to n_a to derive the prior-based UCT selector.
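In symbols, the four steps correspond to the following template. This is a sketch in our notation, with Φ, h, f, and λ_N as introduced above; D_f denotes the standard Csiszár f-divergence generated by f.

```latex
% Step 1: factorize the prior-free bonus into a global scale and a shape
\[ S_a(q, n, N) = q_a + \Phi(N)\, h\big(\pi(a)\big) \]
% Step 2: choose a convex generator f with f'(r) = -h(r); the prior-free RPO
% uses the separable regularizer \sum_a f(y_a)
% Step 3: lift to the Csiszar f-divergence with prior pi_theta
\[ D_f(\pi_\theta, y) = \sum_a \pi_\theta(a)\, f\!\left(\frac{y_a}{\pi_\theta(a)}\right) \]
% Step 4: recover the prior-based selector from the marginal one-step gain
\[ a^\star = \arg\max_a \; \frac{\partial}{\partial n_a}\Big[\, q^\top \pi - \lambda_N\, D_f(\pi_\theta, \pi) \,\Big] \]
```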
For demonstration, we now apply the Inverse-RPO pipeline to the classical UCT1 score (1) and obtain the prior-based rule UCT-P in (3). The same steps extend to other UCT-style scores (see Section 4 for UCT-V). Throughout, we write S_a(q, n, N) for a UCT-style selector and π(a) for the empirical visit distribution.
Using this notation, the UCT1 score (cf. Eq. (1)) can be rewritten in terms of π(a) and N, yielding Eq. (10).
Factorize the UCT bonus.
We decompose the exploration term into a global scale Φ(N) and a monotone shape function h of the empirical visit probability π(a). This separates the dependence on N and n_a and sets up the correspondence h = −f′ used by the RPO derivation.
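For concreteness, one admissible factorization of the UCT1 bonus is shown below. It is a sketch that assumes the convention n_a → 1 + n_a from above and the visit distribution π(a) = (1 + n_a)/(|A| + N); the paper's exact normalization may differ.

```latex
\[
c\,\sqrt{\frac{\log N}{1+n_a}}
\;=\;
\underbrace{c\,\sqrt{\frac{\log N}{|\mathcal{A}|+N}}}_{\Phi(N)}
\;\cdot\;
\underbrace{\frac{1}{\sqrt{\pi(a)}}}_{h(\pi(a))}
\]
```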
Define a separable f -regularizer.
Choose a convex generator whose (negative) derivative is h:
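One admissible choice, shown here as a sketch, assumes the shape h(r) = 1/√r obtained in the factorization sketch above; the constants in the paper's Eq. (12) may differ.

```latex
\[
f_H(r) = 2\,\big(1 - \sqrt{r}\,\big),
\qquad
-f_H'(r) = \frac{1}{\sqrt{r}} = h(r),
\qquad
f_H(1) = 0,\quad f_H''(r) = \tfrac{1}{2}\, r^{-3/2} > 0 .
\]
```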
In this case, f_H is the Hellinger function, which is convex and satisfies f_H(1) = 0. This yields an RPO with a separable f-regularizer. Taking the marginal one-step gain with respect to n_a recovers the UCT1 scoring rule, matching (11).
Lift the regularizer with a prior.
We now lift the separable f -regularizer to the Csiszár f -divergence form with a prior π θ :
Utilizing the previously defined convex generator (12), D_H is a Hellinger-type f-divergence. Using this divergence, the prior-based RPO objective L_UCT-P and the corresponding greedy expansion rule a⋆_UCT-P are identical to the ones presented by Grill et al. (2020) (cf. (16)).
Recover the prior-based UCT rule.
Solving the derivative condition in a⋆_UCT-P and substituting f_H′(r) = −h(r) yields the UCT-P selection rule. This rule coincides with the formulation of Grill et al. (2020) and can be interpreted as the prior-based analogue of the classical, prior-free UCT selection rule.
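For reference, carrying the Hellinger lift through under the same assumptions as in the sketches above gives a selector of the following form, with constants absorbed into c. This is our reading of Eq. (3), not a verbatim restatement.

```latex
\[
S_a^{\text{UCT-P}}(q, n, N) \;=\; q_a + c \sqrt{\frac{\pi_\theta(a)\,\log N}{1 + n_a}} .
\]
```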
4 UCT-V-P and PUCT-V: Variance-Aware Prior-based UCTs
Our aim is to go beyond UCB1, studying alternative base UCBs with tighter confidence bonuses and deriving their prior-based counterparts via the Inverse-RPO pipeline. A natural candidate is UCB-V, which augments the exploration bonus with an empirical-variance term and is obtained from a Bernstein-type concentration inequality (in contrast to the Hoeffding inequality underlying UCB1) (Audibert et al., 2009).
Under the same bounded-reward assumption, this yields variance-adaptive bonuses and correspondingly tighter instance-dependent guarantees than UCB1, without changing the problem setting. The variance-aware UCB-V applied to MCTS (Audibert et al., 2009; Wissow and Asai, 2024) is

q_a + c_1 · σ_a · √(log N / (1 + n_a)) + c_2 · log N / (1 + n_a),
where σ_a is the empirical reward standard deviation for action a, consistent with earlier notation. We set c_1 = √2 and c_2 = 3, so that the above expression is algebraically identical to the definition of Audibert et al. (2009), with the constants absorbed into c_1 and c_2.
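For reference, substituting c_1 = √2 and c_2 = 3 into the bonus above makes the correspondence with the Bernstein-type bound explicit. This is a sketch that assumes rewards bounded in [0, 1] and the exploration function set to log N.

```latex
\[
q_a + c_1\, \sigma_a \sqrt{\frac{\log N}{1+n_a}} + c_2\, \frac{\log N}{1+n_a}
\;=\;
q_a + \sqrt{\frac{2\,\sigma_a^2 \log N}{1+n_a}} + \frac{3 \log N}{1+n_a}
\qquad (c_1 = \sqrt{2},\; c_2 = 3).
\]
```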
Analogous to the PUCT exploration-decay heuristic (see Section 2.1), we introduce a heuristic variant, UCT-V-H, which rewrites the exploration bonus as shown in (20). This heuristic form is introduced to make the comparison with PUCT meaningful; without it, we could only compare against the principled baseline UCT-P.
We apply the Inverse-RPO pipeline to obtain variance-aware, prior-based counterparts of UCT-V and its heuristic decay UCT-V-H. Specifically, the pipeline yields (i) UCT-style selection rules that can be used as drop-in replacements for PUCT/UCT-P during tree traversal and (ii) corresponding RPO objectives that mirror the selection rules in the optimization view of MCTS.

Three observations are worth noting. (i) The prior enters the exploration bonus as π_θ(a), reweighting both the variance and bias terms of UCB-V. (ii) The placement of the prior inside a square root for UCT-V-P follows from the divergences used in the Inverse-RPO lift (Hellinger vs. reverse-KL) and is reflected in the RPO objectives below. (iii) For a uniform prior, both selectors reduce to their prior-free forms.
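Below is a minimal NumPy sketch of the two selectors as we read them from Appendix C (Eqs. (21) and (22)). The function names are ours, sigma denotes the per-action empirical standard deviation tracked during backpropagation, and the defaults c1 = √2, c2 = 3 follow the UCB-V convention above.

```python
import numpy as np

def uct_v_p_score(q, n, N, prior, sigma, c1=np.sqrt(2.0), c2=3.0):
    """Canonical variant: the prior enters the variance bonus under the
    square root (Hellinger lift) and the bias bonus linearly (reverse-KL lift)."""
    variance_bonus = c1 * sigma * np.sqrt(prior * np.log(N) / (1.0 + n))
    bias_bonus = c2 * prior * np.log(N) / (1.0 + n)
    return q + variance_bonus + bias_bonus

def puct_v_score(q, n, N, prior, sigma, c1=np.sqrt(2.0), c2=3.0):
    """Heuristic variant aligned with PUCT: the variance bonus uses the
    sqrt(N)/(1+n) decay, with the prior outside the square root."""
    variance_bonus = c1 * prior * sigma * np.sqrt(N) / (1.0 + n)
    bias_bonus = c2 * prior * np.log(N) / (1.0 + n)
    return q + variance_bonus + bias_bonus
```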
Result: Variance-aware prior-based RPO targets.

UCT-V-P:

L_UCT-V-P(y) = q⊤y − λ_N^(1) · D_H(π_θ, y) − λ_N^(2) · D_KL(π_θ, y).   (23)

PUCT-V (heuristic prior-based variant):

L_PUCT-V(y) = q⊤y − λ_N^(1) · D_KL(π_θ, y) − λ_N^(2) · D_KL(π_θ, y).   (25)

Derivations: see Appendix C.
(i) In contrast to the UCT-P (5) and PUCT (6) optimization targets, which use a single regularizer term with one weight λ_N, our variance-aware contributions use two regularizer terms with distinct weights: a variance-term weight λ_N^(1) and a bias-term weight λ_N^(2). (ii) As a result of the heuristic form of UCT-V-H, in line with PUCT, the two variance-aware objectives are identical in their second regularizer term and differ only in the first regularizer and its weight λ_N^(1).
Our experimental aim is twofold: (i) to implement the new variance-aware UCT policies PUCT-V and UCT-V-P introduced in Section 4; and (ii) to evaluate their performance relative to the classical prior-based baselines PUCT and UCT-P. We first describe the implementation details of the variance-aware extensions before turning to empirical comparisons.
We provide a variance-aware MCTS implementation by extending the mctx library (DeepMind et al., 2020; https://github.com/google-deepmind/mctx). Enabling UCT-V-style rules requires propagating both empirical means and variances from a leaf to the root. To this end, we adopt Welford's online update (see Algorithm 1), which is numerically stable and adds only a constant-time, constant-memory augmentation to the standard mean backpropagation (Welford, 1962). Concretely, each node stores (n, µ, σ²) instead of (n, µ), where n is the visit count. The control flow and backward pass remain identical to standard mean backpropagation, with the starred (⋆) lines denoting the added variance-tracking updates. During the selection phase, we also incorporate the proposed PUCT-V and UCT-V-P rules.
In the AlphaZero framework, a neural network is trained to approximate both the value function and the empirical visit distribution produced by MCTS. For our purposes, no additional variance head is required and the empirical variance from the tree search is sufficient.
Algorithm 1: Variance-aware single-node update. Input: parent stats (n, µ, σ²); discounted value v = r + γ · v_child. Complexity: each update requires O(1) arithmetic operations and O(1) memory.
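A minimal Python sketch of the single-node update in Algorithm 1 (scalar form; the actual mctx extension operates on batched JAX arrays), assuming each node stores (n, µ, σ²) and receives the discounted value v = r + γ · v_child:

```python
def update_node(n, mu, sigma2, v):
    """Welford-style online update of (count, mean, variance) with one sample v."""
    n_new = n + 1
    delta = v - mu
    mu_new = mu + delta / n_new                                # standard mean backprop
    sigma2_new = (n * sigma2 + delta * (v - mu_new)) / n_new   # (*) variance tracking
    return n_new, mu_new, sigma2_new
```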
Overall, adapting MCTS to be variance-aware and to use the proposed selection rules requires only three lines of code, excluding the additional variance field in the data structures.
We evaluate on the MinAtar suite (Young and Tian, 2019), a widely used benchmark offering stochastic and deterministic Atari-style environments that preserve the core dynamics of the original games while being computationally efficient.⁴ We access MinAtar through the PGX interface (Koyamada et al., 2023), which provides JAX-compatible environments and an open-source AlphaZero training script that we adapt for our experiments. The search/training pipeline is kept fixed across selectors to ensure a controlled comparison.
Unless otherwise noted, we run N_sim = 64 simulations per move to generate training data. Evaluation is conducted at regular intervals in batches of 256 trajectories per seed; with at least three seeds, the per-checkpoint estimates are sufficiently stable for meaningful comparisons. We adopt the network and optimization settings summarized in Table 2, holding hyperparameters constant across all methods to isolate the effect of the selection rule. Finally, we evaluate the learned policy head without search to assess representation and policy quality directly and to avoid confounding from test-time MCTS.
Observations. Empirically, the measured wall-clock time per training step and per evaluation is essentially identical across selectors, indicating that the proposed variance-aware MCTS and selection rules incur no additional compute overhead. Figure 2 reports the average return of the trained policy head under all benchmarked selection rules. We compare UCT-V-P to UCT-P (heuristic-free) and PUCT-V to PUCT (heuristic-based). Across all environments, the variance-aware selectors match or exceed their variance-unaware baselines. In particular, UCT-V-P consistently outperforms UCT-P, showing that variance adjustment alone can substantially improve exploration. For the heuristic-based variants, PUCT-V surpasses PUCT on the stochastic games Asterix and Seaquest, and performs comparably on deterministic ones. Overall, variance-aware selection rules with priors yield consistent improvements, especially in stochastic settings, with negligible computational overhead and only minor modifications to MCTS.
AlphaZero family and prior-based tree policies.
Planning with MCTS coupled to learned function approximators became prominent with AlphaGo (Silver et al., 2016) and was iterated upon by AlphaZero (Silver et al., 2018) and MuZero (Schrittwieser et al., 2020). Furthermore, Stochastic MuZero (Antonoglou et al., 2022) handles stochastic dynamics while retaining PUCT, whereas Gumbel MuZero (Danihelka et al., 2022) adopts a Gumbel-based policy-improvement objective explicitly cast as regularized policy optimization (RPO). A unifying ingredient in these systems is a prior-based tree policy that injects a policy prior into the exploration bonus. Empirically, PUCT (and close relatives) has become the de facto choice across domains (Kemmerling et al., 2024).
UCT family and stronger UCB bonuses. Beyond UCB1, theoretically grounded UCT variants continue to be proposed (Browne et al., 2012). Among such developments, variance-aware Bernstein bonuses offer tighter instance-dependent guarantees under bounded rewards, which is why we select UCB-V (Audibert et al., 2009) as our base. Recent work explores alternative distributional assumptions (e.g., Gaussian and extreme-value regimes) with tailored regret analyses for classical planning (Wissow and Asai, 2024; Asai and Wissow, 2024). Notably, these methods are not prior-based by construction, so systematic prior-based extensions remain largely missing in the literature.
Bayesian MCTS. Variance-aware and uncertainty-quantifying approaches to MCTS are active research directions. Bayesian variants (Bayes-UCT1/2) maintain posteriors over node values and act via uncertainty bands (Tesauro et al., 2012); recent work explores richer uncertainty models and online inference (Greshler et al., 2024; Chen et al., 2025). While compelling, these methods typically introduce additional modelling choices, extra hyperparameters, and nontrivial bookkeeping. Our proposed variance-aware prior-based tree policies based on UCB-V likewise bring (frequentist) uncertainty quantification into the selection rule, yet integrate as drop-in replacements in the widely adopted AlphaZero-style MCTS with minimal changes.
Regularized policy optimization (RPO) and MCTS. Regularization-based views of RL connect policy improvement to convex programs with divergence penalties (Neu et al., 2017; Geist et al., 2019). Grill et al. (2020) brought this perspective to MCTS, thereby providing a retrospective theoretical understanding for prior-based tree policies such as PUCT. Follow-up analyses developed regret bounds for RPO-guided MCTS and studied entropy-based regularizers and backup operators (Dam et al., 2021). Complementing entropy-centric analyses, we focus on UCT-style bonuses by deriving variance-aware, prior-based selectors with matching RPO objectives (Eqs. 23 and 25).
In this paper, we (1) proposed Inverse-RPO, a principled framework to derive prior-based UCTs from their prior-free base forms, and (2) instantiated this framework by deriving two prior-based versions of UCB-V.
The resulting variance-aware prior-based tree policies, UCT-V-P and PUCT-V, leverage variance estimates to improve search efficiency and outperform the existing prior-based tree policies UCT-P and PUCT across multiple benchmarks, with minimal implementation overhead.
Beyond the empirical results, our derivations of UCT-V-P and PUCT-V via the Inverse-RPO pipeline yield two RPO objectives that can be used as policy-training targets when casting MCTS as an optimization problem in future work. Another avenue for future work is to augment the network with a learned variance head, placed alongside the standard value and policy heads in the AlphaZero family, to refine search-based variance estimates and further improve the stability and performance of variance-aware prior-based UCTs. Finally, we invite the community to revisit the well-grounded UCB literature through this lens and make principled use of its depth by systematically deriving as yet underexplored prior-based UCTs.
C.1 UCT-V-P (canonical variant; derivation)

Factorize the UCT bonus. For the canonical variant, the bonus factorizes into variance and bias shape functions h_H and h_KL, with scaling terms λ_N^(UCT-V-1) and λ_N^(UCT-V-2).
Define a separable f -regularizer. Choose convex generators whose (negative) derivatives match h H and h KL :
This yields the RPO with a separable f-regularizer, whose marginal-gain rule in n_a recovers (27).
Lift the regularizer with a prior. Lifting to the Csiszár form with prior π_θ, the prior-based objective is exactly the form stated in the main text:
(cf. (23))  L_UCT-V-P(y) = q⊤y − λ_N^(UCT-V-1) · D_H(π_θ, y) − λ_N^(UCT-V-2) · D_KL(π_θ, y).
Recover the prior-based UCT rule. Taking the directional derivative in n_a yields the greedy expansion rule reported in the main text:
(cf. (21))  S_a^UCT-V-P(q, n, N) = q_a + c_1 · σ_a · √(π_θ(a) · log N / (1 + n_a)) + c_2 · π_θ(a) · log N / (1 + n_a).
C.2 PUCT-V (heuristic variant; derivation)
Factorize the UCT bonus. For the heuristic variant (20), the bonus factorizes with shape functions h_H(r, σ) = σ/r and h_KL(r) = 1/r and scaling terms λ_N^(UCT-V-H-1) and λ_N^(UCT-V-H-2).
Define a separable f -regularizer. Choose convex generators with (negative) derivatives h H and h KL :
This yields the RPO with a separable f-regularizer, whose marginal-gain rule recovers (32).
Lift the regularizer with a prior. Lifting to Csiszár forms with prior π_θ gives the prior-based objective reported in the main text:
(cf. (25))  L_PUCT-V(y) = q⊤y − λ_N^(UCT-V-H-1) · D_KL(π_θ, y) − λ_N^(UCT-V-H-2) · D_KL(π_θ, y).
Recover the prior-based UCT rule. Taking the directional derivative in n_a yields the selection rule as stated:

(cf. (22))  S_a^PUCT-V(q, n, N) = q_a + c_1 · π_θ(a) · σ_a · √N / (1 + n_a) + c_2 · π_θ(a) · log N / (1 + n_a).
In our experiments we used the hyperparameters in Table 2 consistently across all benchmarks.
See the discussion by Grill et al. (2020) or the historical context in a Google Groups thread.
We exclude the freeway environment, as all evaluated algorithms consistently fail to achieve learning progress there.