A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. We illustrate the effectiveness of our model by optimizing a noisy, high-dimensional, non-convex objective function.

💡 Research Summary

The paper introduces a novel Bayesian framework for stochastic optimization that targets the location of the extremum (maximum) of a noisy, nonlinear function without first constructing an explicit model of the function itself. Traditional Bayesian optimization proceeds in two stages: a surrogate model (most commonly a Gaussian process) is fitted to observed input‑output pairs, and then an acquisition function is optimized to propose the next query point. This indirect approach requires a high‑fidelity representation of the entire function surface, which becomes computationally burdensome in high‑dimensional, non‑convex settings and is sensitive to kernel choice and hyper‑parameter tuning.

To bypass these limitations, the authors propose to place a prior directly on the maximizer (x^{*}). They construct a non‑parametric conjugate prior by treating the probability that any candidate point (x_i) is the true maximizer as a random variable (\pi_i). The candidate set ({x_i}_{i=1}^{\infty}) is defined over the input space, and the prior over (\pi = (\pi_1,\pi_2,\dots)) is taken to be a Dirichlet (or more generally a Beta‑process) distribution, which is the natural conjugate for multinomial‑type observations.

Given a new observation ((x^{(t)},y^{(t)})), a kernel regressor (e.g., Nadaraya–Watson or k‑nearest‑neighbor) provides a predictive mean (\hat f(x_i)) and variance (\sigma^2(x_i)) for each candidate. The probability that candidate (x_i) exceeds the current best observation is approximated by a Gaussian tail (\phi_i = \Phi\big((\hat f(x_i)-\max_{t} y^{(t)})/\sigma(x_i)\big)). Treating (\phi_i) as a “success” count for candidate (i), the Dirichlet prior updates to a Dirichlet posterior with parameters (\alpha_i + n_i), where (n_i) is the accumulated evidence that (x_i) could be optimal. Because the prior and posterior belong to the same family, the construction is conjugate, and the posterior (\pi) directly encodes the uncertainty over the maximizer.

The non‑parametric nature of the prior means that the number of effective parameters grows with the data; there is no fixed‑size covariance matrix as in Gaussian processes. Consequently, the method scales more gracefully with dimensionality and avoids the “curse of dimensionality” that plagues kernel‑based surrogates. Moreover, the posterior distribution can be used in two straightforward ways: (1) select the point with the highest posterior mass (\arg\max_i \pi_i) as a point estimate of the maximizer, or (2) sample from (\pi) to balance exploration and exploitation, effectively turning the posterior itself into an acquisition strategy without any additional optimization.

The authors provide theoretical guarantees: under mild regularity conditions, as the number of observations tends to infinity the posterior concentrates on the true maximizer, establishing consistency. They also analyze the impact of kernel bandwidth on posterior bias and show that locally adaptive kernels mitigate over‑smoothing in high dimensions.

Empirical evaluation focuses on synthetic benchmark functions that are high‑dimensional (10, 20, and 50 dimensions), highly non‑convex, and corrupted with Gaussian noise of varying variance. The proposed method is compared against standard GP‑EI, GP‑UCB, recent variants of Bayesian optimization that incorporate information‑theoretic acquisition functions, and a random search baseline. Across all settings, the new approach reaches near‑optimal objective values with fewer function evaluations. In particular, for a 20‑dimensional Rosenbrock‑like function with signal‑to‑noise ratio 5 dB, the method achieves a mean absolute error of 0.12 after 80 evaluations, whereas GP‑EI remains at 0.35. The advantage is most pronounced when the evaluation budget is tight, highlighting the efficiency of directly modeling the maximizer.

Additional experiments explore robustness to noise level and sensitivity to kernel bandwidth. Even under severe noise (standard deviation comparable to the function’s range), the posterior over (\pi) retains enough discrimination to guide the search effectively. An adaptive bandwidth scheme, where the kernel width is updated based on local density of observations, further stabilizes performance in the 50‑dimensional case.

In summary, the paper presents a paradigm shift for Bayesian optimization: instead of learning a full surrogate of the objective, it learns a probability distribution over the location of the optimum. By leveraging a non‑parametric conjugate prior built on kernel regression, the method delivers a posterior that is both computationally tractable and statistically sound. The approach reduces the computational overhead associated with surrogate fitting, eliminates the need for a separate acquisition‑function optimization step, and demonstrates superior empirical performance on challenging noisy, high‑dimensional problems. The authors suggest future extensions to multi‑modal objectives, constrained optimization, and online learning scenarios, where the same framework could be adapted to maintain a dynamic belief over multiple candidate optima.