Gaussian Process Bandits for Tree Search: Theory and Application to Planning in Discounted MDPs
We motivate and analyse a new Tree Search algorithm, GPTS, based on recent theoretical advances in the use of Gaussian Processes for Bandit problems. We consider tree paths as arms and we assume the target/reward function is drawn from a GP distribution. The posterior mean and variance, after observing data, are used to define confidence intervals for the function values, and we sequentially play arms with the highest upper confidence bounds. We give an efficient implementation of GPTS and we adapt previous regret bounds by determining the decay rate of the eigenvalues of the kernel matrix on the whole set of tree paths. We consider two kernels in the feature space of binary vectors indexed by the nodes of the tree: linear and Gaussian. The regret grows as the square root of the number of iterations T, up to a logarithmic factor, with a constant that improves with larger Gaussian kernel widths. We focus on practical values of T, smaller than the number of arms. Finally, we apply GPTS to Open Loop Planning in discounted Markov Decision Processes by modelling the reward as a discounted sum of independent Gaussian Processes. We report regret bounds similar to those of the OLOP algorithm.
💡 Research Summary
The paper introduces GPTS, a novel tree‑search algorithm that leverages recent advances in Gaussian‑process (GP) bandits. The core idea is to treat every root‑to‑leaf path in a search tree as an arm of a stochastic bandit problem and to assume that the unknown reward function defined over these paths is a sample from a GP. After each observation, the posterior mean μ(x) and variance σ²(x) for any path x are computed; an upper confidence bound (UCB) is formed as μ(x)+β_tσ(x), where β_t controls the exploration‑exploitation trade‑off. The algorithm sequentially selects the path with the highest UCB, observes its reward, and updates the GP posterior.
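As a concrete illustration, here is a minimal sketch of that GP-UCB selection loop over the paths of a toy depth-3 binary tree. The heap-style node indexing, the linear kernel, and the values of the noise level and β are illustrative assumptions for this sketch, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def path_features(depth=3):
    """Encode each root-to-leaf path as a binary vector indexed by tree nodes
    (heap layout: children of node i are 2i+1 and 2i+2 -- an assumption)."""
    feats = []
    for leaf in range(2 ** depth):
        v = np.zeros(2 ** (depth + 1) - 1)
        node = 0
        v[node] = 1.0
        for l in range(depth):
            bit = (leaf >> (depth - 1 - l)) & 1
            node = 2 * node + 1 + bit
            v[node] = 1.0
        feats.append(v)
    return np.array(feats)

X = path_features()                # 8 paths x 15 node indicators
K = X @ X.T                        # linear kernel on path vectors

# Draw a reward function from the GP prior, as the model assumes.
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))

noise, beta = 0.1, 2.0             # illustrative values
picked, ys = [], []
for t in range(20):
    if picked:
        Kxx = K[np.ix_(picked, picked)] + noise ** 2 * np.eye(len(picked))
        Ks = K[:, picked]
        mu = Ks @ np.linalg.solve(Kxx, ys)
        var = np.diag(K) - np.einsum('ij,ij->i', Ks @ np.linalg.inv(Kxx), Ks)
    else:                          # no data yet: prior mean and variance
        mu, var = np.zeros(len(X)), np.diag(K).copy()
    ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
    a = int(np.argmax(ucb))        # play the arm with the highest UCB
    picked.append(a)
    ys.append(f[a] + noise * rng.standard_normal())
```

After each pull the posterior tightens around the observed paths, so the UCB rule automatically shifts from exploring high-variance paths to exploiting high-mean ones.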
Two kernels are examined in the binary‑vector feature space that encodes the presence of nodes along a path: a linear kernel k_lin(x,x′)=xᵀx′ and a radial‑basis‑function (RBF) kernel k_rbf(x,x′)=exp(−‖x−x′‖²/(2σ²)). The linear kernel corresponds to a simple additive model over nodes, while the RBF kernel captures similarity between paths that share many nodes, with the bandwidth σ governing how quickly correlation decays with Hamming distance.
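To make the geometry concrete, a small sketch with two hypothetical sibling paths in a depth-2 tree (7 nodes in heap order): because the feature vectors are binary, ‖x−x′‖² equals the Hamming distance, i.e. the number of nodes lying on exactly one of the two paths.

```python
import numpy as np

def k_lin(x, xp):
    return float(x @ xp)                       # counts shared nodes

def k_rbf(x, xp, sigma=1.0):
    return float(np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2)))

# Two sibling paths in a depth-2 binary tree (heap node order):
x  = np.array([1, 1, 0, 1, 0, 0, 0], dtype=float)  # root -> left -> left child
xp = np.array([1, 1, 0, 0, 1, 0, 0], dtype=float)  # root -> left -> right child
# Here ||x - xp||^2 = 2 (the two leaves), so k_lin(x, xp) = 2 shared nodes
# and k_rbf(x, xp) = exp(-2 / (2 * sigma^2)); a larger sigma keeps paths
# correlated at larger Hamming distances.
```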
A major technical contribution is the analysis of the eigenvalue decay of the kernel matrix K_ℱ constructed over the entire set ℱ of possible paths. For the linear kernel, eigenvalues decay polynomially (λ_i = O(i^{−2})), whereas for the RBF kernel they decay exponentially (λ_i = O(exp(−c i^{1/d}))), where d is the tree depth. This rapid decay yields a bounded information gain γ_T = O(log T). Plugging γ_T into the standard GP‑UCB regret bound gives R_T = O(√(T β_T γ_T)) = O(C(σ) · √T · log T), where the constant C(σ) shrinks as the kernel width σ grows. Consequently, the regret grows only as the square root of the number of iterations T (up to logarithmic factors), even though the total number of arms (paths) is exponential in the tree depth.
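These spectral quantities can be probed numerically on a small tree. The check below is purely illustrative (not the paper's proof), and the depth, kernel width, and noise level are arbitrary assumptions: it builds both kernel matrices over all paths and computes their eigenvalue spectra together with the log-determinant information-gain quantity.

```python
import numpy as np
from itertools import product

def all_path_vectors(depth=4):
    """Binary node-indicator vectors for all root-to-leaf paths
    (heap node layout, an assumption of this sketch)."""
    feats = []
    for bits in product([0, 1], repeat=depth):
        v = np.zeros(2 ** (depth + 1) - 1)
        node = 0
        v[node] = 1.0
        for b in bits:
            node = 2 * node + 1 + b
            v[node] = 1.0
        feats.append(v)
    return np.array(feats)

X = all_path_vectors()                         # 16 paths x 31 nodes
K_lin = X @ X.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq_dists / (2 * 2.0 ** 2))     # sigma = 2, illustrative

def info_gain(K, noise=0.1):
    # Information-gain-style quantity: 1/2 * log det(I + noise^-2 K),
    # evaluated here on the full arm set as an illustration.
    return 0.5 * np.linalg.slogdet(np.eye(len(K)) + K / noise ** 2)[1]

eig_lin = np.sort(np.linalg.eigvalsh(K_lin))[::-1]   # descending spectrum
eig_rbf = np.sort(np.linalg.eigvalsh(K_rbf))[::-1]
```

Inspecting `eig_lin` and `eig_rbf` on such toy trees gives a quick sanity check of how concentrated each spectrum is, which is the quantity driving the information-gain bound.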
From an implementation standpoint, GPTS exploits the hierarchical structure of the tree. Each node stores sufficient statistics (e.g., count of visits, sum of observed rewards) so that updating the posterior after selecting a new path requires only processing the nodes belonging to that path. This yields an update cost proportional to the tree depth rather than the total number of paths, making the algorithm scalable to very large trees.
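A sketch of that bookkeeping idea follows. The class name and heap-index keys are hypothetical, and the statistics are simplified to visit counts and reward sums; the point is only that an update touches the depth+1 nodes of the played path, never the full set of paths.

```python
from collections import defaultdict

class PathStats:
    """Per-node sufficient statistics for tree-path bandits (sketch)."""

    def __init__(self):
        self.count = defaultdict(int)    # visits per node
        self.total = defaultdict(float)  # summed reward per node

    def update(self, path_nodes, reward):
        # O(depth) work: only the nodes on the selected path are touched.
        for node in path_nodes:
            self.count[node] += 1
            self.total[node] += reward

    def mean(self, node):
        return self.total[node] / self.count[node] if self.count[node] else 0.0
```

Usage: after playing the path through nodes `[0, 1, 3]` with reward 1.0 and the path `[0, 2, 5]` with reward 2.0, the shared root aggregates both observations while each subtree keeps only its own.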
The authors further apply GPTS to open‑loop planning in discounted Markov Decision Processes (MDPs). A planning horizon H defines a tree in which each level corresponds to a decision step; the immediate reward at step t is modelled as an independent GP r_t. The discounted return G = Σ_{t=0}^{H−1} γ^t r_t is therefore a linear combination of independent GPs and remains a GP. By treating each possible action sequence as a path, GPTS can be used to select the sequence with the highest UCB. The resulting regret bound is similar to that of the Open‑Loop Optimistic Planning (OLOP) algorithm, i.e., O(√T · log T), while offering a more principled Bayesian treatment of uncertainty.
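A hedged sketch of the covariance this model induces (the per-step prefix-indicator kernel below is an illustrative assumption, not the paper's choice): because the per-step GPs are independent, the covariance of the discounted return is the discount-weighted sum of per-step covariances, with the step-t term scaled by γ^{2t}.

```python
import numpy as np
from itertools import product

gamma, H = 0.9, 3
actions = list(product([0, 1], repeat=H))   # all open-loop action sequences

def step_kernel(a, b, t):
    # Assumption: the step-t rewards of two sequences are perfectly
    # correlated if the sequences agree on their first t+1 actions
    # (same open-loop prefix), and independent otherwise.
    return 1.0 if a[:t + 1] == b[:t + 1] else 0.0

def return_kernel(a, b):
    # Cov(G(a), G(b)) = sum_t gamma^(2t) * k_t(a, b): independent per-step
    # GPs add, and the gamma^t scaling of r_t squares in the covariance.
    return sum(gamma ** (2 * t) * step_kernel(a, b, t) for t in range(H))

K = np.array([[return_kernel(a, b) for b in actions] for a in actions])
```

Each prefix-indicator term is positive semi-definite (a block of ones per shared prefix), so the weighted sum `K` is a valid GP covariance over action sequences.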
Empirical results (as reported) confirm three key observations: (1) larger RBF bandwidths lead to smaller empirical regret, aligning with the theoretical constant improvement; (2) GPTS performs well even when the budget T is far smaller than the total number of paths, demonstrating effective exploitation of shared sub‑structures; (3) in discounted MDP planning, GPTS attains performance comparable to or better than OLOP across a range of benchmark problems.
In summary, the paper makes a solid contribution by marrying GP‑based bandit theory with tree‑structured search. It provides rigorous regret analysis based on eigenvalue decay, offers practical algorithms with linear‑in‑depth computational complexity, and extends the methodology to planning in stochastic control settings. The work opens avenues for further research on richer kernels, adaptive bandwidth selection, and integration with Monte‑Carlo tree search frameworks.