Universal priors: solving empirical Bayes via Bayesian inference and pretraining

Reading time: 6 minutes

📝 Original Info

  • Title: Universal priors: solving empirical Bayes via Bayesian inference and pretraining
  • ArXiv ID: 2602.15136
  • Date: 2026-02-16
  • Authors: Not provided in the source data.

📝 Abstract

We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.

💡 Deep Analysis

📄 Full Content

Consider the following empirical Bayes task in the Poisson model: let $\theta_1, \dots, \theta_n$ be i.i.d. draws from some unknown prior $G_0$ supported on $[0, A]$, and let the observations $X^n = (X_1, \dots, X_n)$ be conditionally independent with $X_i \sim \mathrm{Poi}(\theta_i)$ given $\theta^n$. Here we assume knowledge of $A$, but impose no condition on the prior $G_0$ (not even continuity or smoothness). The target of empirical Bayes is to propose an estimator $\widehat{\theta}^n = \widehat{\theta}^n(X^n)$ that is nearly optimal on every problem instance, i.e., achieves performance competitive with the Bayes estimator equipped with oracle knowledge of $G_0$. The standard notion used in empirical Bayes to quantify estimator performance is the regret, defined as the excess MSE over the Bayes risk:

$$\mathrm{Regret}(\widehat{\theta}^n) = \frac{1}{n}\Big(\mathbb{E}\,\big\|\widehat{\theta}^n(X^n) - \theta^n\big\|_2^2 - \mathbb{E}\,\big\|\theta_{G_0}(X^n) - \theta^n\big\|_2^2\Big),$$

where $\theta_{G_0}(X_i) = \mathbb{E}_{G_0}[\theta_i \mid X_i]$ is the Bayes estimator (posterior mean) with knowledge of $G_0$, and $\theta_{G_0}(X^n) = (\theta_{G_0}(X_1), \dots, \theta_{G_0}(X_n))$. A small regret $\mathrm{Regret}(\widehat{\theta}^n) = o(1)$ means that the Bayes risk can be asymptotically attained by the estimator $\widehat{\theta}^n$, which has no knowledge of $G_0$. Compared with classical statistical estimators (like the MLE), empirical Bayes estimators usually enjoy much better empirical performance due to instance-wise guarantees and implicit adaptation to the prior structure [Rob51, Rob56, JZ09, HNWSW25]. There exist several ways to solve empirical Bayes problems in the literature. Specializing to the Poisson empirical Bayes model, the earliest example is Robbins's estimator based on f-modeling (i.e., mimicking the form of the Bayes estimator $\theta_{G_0}(X)$). A numerically more stable approach is g-modeling, which learns a prior from data and uses the Bayes estimator under the learned prior. A notable example of learning the prior is the nonparametric MLE (NPMLE), as well as a broader class of minimum-distance estimators [JPW25, VKWL21]. Finally, a more modern approach is empirical risk minimization (ERM), which minimizes a properly constructed loss function evaluated on $X^n$ over a suitably chosen function class [BZ22, JPTW23].

The pretraining strategy analyzed in this paper (Algorithm 1) instead fits a single transformer $T_\zeta$ across $M$ synthetic training batches $(X^{n,(m)}, \theta^{n,(m)})_{m=1}^{M}$ by empirical risk minimization,

$$\widehat{\zeta} \in \operatorname*{arg\,min}_{\zeta}\; \frac{1}{M}\sum_{m=1}^{M}\big\|\theta^{n,(m)} - T_{\zeta}(X^{n,(m)})\big\|_2^2, \qquad \text{(ERM)}$$

and returns the trained transformer as the final estimator $\widehat{\theta}^n = T_{\widehat{\zeta}}$.
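To make the Poisson-EB setup concrete, here is a minimal NumPy sketch (illustrative, not from the paper): Robbins's f-modeling estimator next to the oracle Bayes estimator, on data simulated from a hypothetical two-point prior. The empirical gap between their mean squared errors is the per-observation regret described above.

```python
import numpy as np

def robbins(x):
    """Robbins's f-modeling estimator for the Poisson model:
    theta_hat(X_i) = (X_i + 1) * N(X_i + 1) / N(X_i),
    where N(v) counts how many observations equal v."""
    x = np.asarray(x)
    counts = np.bincount(x, minlength=x.max() + 2)
    return (x + 1) * counts[x + 1] / counts[x]  # counts[x] >= 1: X_i was observed

def bayes_poisson(x, atoms, weights):
    """Oracle Bayes estimator E_G[theta | X] under a discrete prior
    G = sum_j weights[j] * delta_{atoms[j]}; the x! factor cancels in the ratio."""
    x = np.asarray(x)[:, None]
    lik = weights[None, :] * np.exp(-atoms[None, :]) * atoms[None, :] ** x
    return (lik * atoms[None, :]).sum(axis=1) / lik.sum(axis=1)

# Simulated instance with a hypothetical prior G_0 = 0.5*delta_1 + 0.5*delta_4.
rng = np.random.default_rng(0)
atoms, weights = np.array([1.0, 4.0]), np.array([0.5, 0.5])
theta = rng.choice(atoms, size=5000, p=weights)
x = rng.poisson(theta)
mse = lambda est: np.mean((est - theta) ** 2)
regret = mse(robbins(x)) - mse(bayes_poisson(x, atoms, weights))  # excess MSE per observation
```

Robbins's estimator is unbiased in form but noisy at rare counts, which is exactly why g-modeling and pretrained estimators tend to outperform it in practice.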

The recent work [TJP25] proposes to solve empirical Bayes problems via a different strategy: using a pretrained estimator. Unlike the previous ERM approach, which trains a separate model for each sample $X^n$, a pretrained model learns a function $\widehat{\theta}^n(\cdot)$ from a large pool of properly generated training data and enables extremely fast computation at test time (namely, applying $\widehat{\theta}^n(\cdot)$ directly to $X^n$). This strategy is inspired by recent successes of TabPFN [HMEH23, HMP+25], which similarly applies a single pretrained model across diverse categorical datasets. It thereby achieves the cost-amortization objective in a similar spirit to amortized inference (see, for example, [ZMSDH25]): after pretraining on synthetic data, a transformer can perform quick inference on millions of batches of new observations. As shown via experimental results in [TJP25] for Poisson-EB, a well-trained transformer indeed achieves better performance than the state-of-the-art NPMLE-based estimator at only a tiny fraction of the inference time.

In this paper, we further investigate the approach of [TJP25] and address the question of why such a pretrained estimator can solve empirical Bayes problems. Specifically, we explain why a pretrained Bayes estimator trained under a prespecified training distribution can adapt to all possible test distributions; we refer to such training distributions as universal priors. One might expect that universal priors must be carefully engineered to achieve this ambitious goal; interestingly, this turns out not to be the case. As a concrete illustration, Algorithm 1 presents a remarkably simple training prior: in each training batch, it selects $k = O(\log n)$ locations uniformly from $[0, A]$ and assigns them a weight vector drawn uniformly from the probability simplex. In the same batch, the parameters $\theta^n$ are then drawn i.i.d. from the pmf $\sum_{j=1}^{k} w_j \delta_{\lambda_j}$. The performance of the resulting pretrained estimator $\widehat{\theta}^n$ is summarized in the following result.
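The batch-generation step just described can be sketched as follows. This is an illustrative reading of Algorithm 1, not the paper's code: it assumes the uniform weight vector means a flat Dirichlet draw (uniform on the simplex) and takes $k = \lceil c_0 \log n \rceil$; the constant $c_0$ and function names are placeholders.

```python
import numpy as np

def sample_training_batch(n, A=1.0, c0=2.0, rng=None):
    """One synthetic pretraining batch (sketch of Algorithm 1):
    draw k = O(log n) support points uniformly on [0, A], a weight
    vector uniformly on the probability simplex, then theta_1..theta_n
    i.i.d. from G = sum_j w_j * delta_{lambda_j}, and X_i ~ Poi(theta_i)."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = max(1, int(np.ceil(c0 * np.log(n))))   # k = O(log n) atoms
    lam = rng.uniform(0.0, A, size=k)          # support points lambda_j
    w = rng.dirichlet(np.ones(k))              # uniform draw from the simplex
    theta = rng.choice(lam, size=n, p=w)       # theta^n i.i.d. from the random pmf G
    x = rng.poisson(theta)                     # observations X_i | theta_i ~ Poi(theta_i)
    return x, theta
```

Repeating this sampler $M$ times produces the pairs $(X^{n,(m)}, \theta^{n,(m)})$ on which the transformer is fit by squared-error regression.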

Theorem 1.1. Let $M \to \infty$, and assume that the training procedure finds the global minimizer of (ERM). Then for a large enough hyperparameter $c_0 > 0$, the pretrained estimator $\widehat{\theta}^n$ in Algorithm 1 satisfies, uniformly over all priors $G_0$ supported on $[0, A]$,

$$\mathrm{Regret}(\widehat{\theta}^n) \le \frac{C \cdot \mathrm{polylog}(n)}{n} = \widetilde{O}\Big(\frac{1}{n}\Big),$$

where $C = C(A, c_0)$ is a constant depending only on $A$ and $c_0$.

We make a few remarks on Theorem 1.1.

• The regret bound in Theorem 1.1 is near-optimal: for $A = \Theta(1)$, it was shown in [PW21, Theorem 2] that the minimax regret is $\Theta\big(\frac{1}{n}\big(\frac{\log n}{\log\log n}\big)^2\big)$. Therefore, Theorem 1.1 shows that although the pretrained estimator $\widehat{\theta}^n$ is constructed in a Bayesian framework, it achieves near-optimal frequentist guarantees (off by a $\log n$ factor) even in the worst case.

• The universal prior constructed in Algorithm 1, albeit remarkably simple, is a high-dimensional mixture of i.i.d. priors. Mathematically, let $G = \sum_{j=1}^{k} w_j \delta_{\lambda_j}$ be the random pmf used in Algorithm 1, the f…
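Since any realization of $G$ is a finite pmf, the Bayes estimator under it is available in closed form, and the classical Poisson identity $\theta_G(x) = (x+1)\,f_G(x+1)/f_G(x)$, where $f_G$ is the marginal pmf of $X$, connects the g-modeling and f-modeling views discussed earlier. A small numerical check of this identity, with hypothetical atoms and weights chosen purely for illustration:

```python
import numpy as np
from math import factorial

def marginal_pmf(x, atoms, weights):
    """f_G(x) = sum_j w_j * Poi(x; lambda_j) for a discrete prior G."""
    return sum(w * np.exp(-a) * a**x / factorial(x)
               for a, w in zip(atoms, weights))

def posterior_mean(x, atoms, weights):
    """E_G[theta | X = x] via the Poisson identity
    theta_G(x) = (x + 1) * f_G(x + 1) / f_G(x)."""
    return (x + 1) * marginal_pmf(x + 1, atoms, weights) / marginal_pmf(x, atoms, weights)

def posterior_mean_direct(x, atoms, weights):
    """The same quantity computed directly from Bayes' rule;
    the common x! factor cancels between numerator and denominator."""
    post = [w * np.exp(-a) * a**x for a, w in zip(atoms, weights)]
    return sum(p * a for p, a in zip(post, atoms)) / sum(post)
```

The two routes agree for every count $x$, which is the algebraic fact behind Robbins's f-modeling form of the Bayes estimator.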


This content is AI-processed based on open access ArXiv data.
