
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.22471
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Modern sequence models often appear to behave as Bayesian learners, but it remains unclear whether this reflects genuine probabilistic inference or task-specific heuristics. We introduce Bayesian wind tunnels: controlled environments where the true posterior is known in closed form and memorization is provably impossible, designed to resolve this question empirically. In these settings, small transformers reproduce exact Bayesian posteriors for filtering and hypothesis elimination with 10⁻³ to 10⁻⁴ bit accuracy, while capacity-matched MLPs fail by orders of magnitude. To understand which architectural ingredients enable exact inference, we decompose Bayesian computation into three inference primitives: (i) belief accumulation, integrating evidence into a running posterior; (ii) belief transport, propagating beliefs forward through stochastic dynamics; and (iii) random-access binding, retrieving stored hypotheses by content rather than position. Different tasks demand different subsets of these primitives, and different architectures can realize different subsets. Comparing Transformers, Mamba, LSTMs, and MLPs across bijection learning, HMM filtering, and associative recall, we find that Transformers realize all three primitives; Mamba realizes accumulation and transport but struggles with random-access binding; LSTMs realize only accumulation (of static sufficient statistics); and MLPs realize none. Geometric diagnostics reveal orthogonal key bases, low-dimensional value manifolds parameterized by posterior entropy, and, in Mamba, five discrete clusters corresponding to HMM hidden states. These results demonstrate that Bayesian computation is not monolithic: its realizability depends on the inference primitives a task demands and the architectural mechanisms available to implement them. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

* Currently at Google DeepMind. Work performed while at Dream Sports.

Full Content

Can transformers perform exact Bayesian inference (filtering and hypothesis elimination), or do they merely approximate it through pattern matching? Natural language offers no ground-truth posterior against which to verify predictions, and modern LLMs are too large and too entangled with their data to separate genuine probabilistic computation from memorization. Even when models behave in a Bayesian manner, there is no direct way to confirm that the internal computation matches Bayes' rule.

Our approach. We replace unverifiable natural data with Bayesian wind tunnels: controlled prediction tasks where (1) the analytic posterior is known exactly at each step, (2) the hypothesis space is so large that memorization is computationally infeasible, and (3) in-context prediction requires genuine probabilistic inference. This converts a qualitative question ("does it do Bayes?") into a quantitative test: does the model's predictive entropy match the analytic posterior entropy position by position?

Four wind tunnels.

We study four settings:

• Bijection learning: a discrete hypothesis-elimination problem with a closed-form posterior.

• Hidden Markov Models (HMMs): a sequential, stochastic inference problem requiring recursive updates.

• Bayesian regression: a continuous inference problem with a closed-form Gaussian posterior over linear weights.

• Associative recall: a content-based retrieval task testing the binding primitive.

To understand which architectural ingredients enable Bayesian inference, we decompose it into three inference primitives:

(1) Belief accumulation: integrating evidence into a running posterior (e.g., updating P(θ | x_{1:t}) as observations arrive).

(2) Belief transport: propagating beliefs forward through stochastic dynamics (e.g., HMM filtering where hidden states evolve).

(3) Random-access binding: retrieving stored hypotheses by content rather than position (e.g., recalling a target given a probe cue).

Different tasks demand different subsets of these primitives, and different architectures can realize different subsets. We test four architectures: Transformers, Mamba (a selective state-space model), LSTMs, and MLPs. Figure 1 summarizes which architectures realize which primitives. Findings.

Transformers realize all three primitives and succeed on all tasks. Mamba realizes accumulation and transport, achieving state-of-the-art on HMM filtering, but struggles with random-access binding in associative recall. LSTMs realize only accumulation of static sufficient statistics: they succeed on belief revision (where the statistic is fixed-dimensional) but fail when the sufficient statistic itself evolves under dynamics or must be indexed by content. MLPs realize none and fail uniformly. Geometric diagnostics reveal orthogonal key axes, progressive alignment, and a low-dimensional value manifold parameterized by entropy. On HMM tracking, Mamba's final layer organizes into five clusters, one per hidden state, showing that the model discovers the corner geometry of the belief simplex.

Contribution. This paper provides the first empirical proof that neural sequence models can realize exact Bayesian posteriors, introduces Bayesian wind tunnels as a tool for probing algorithmic reasoning in verifiable settings, and establishes a taxonomy of inference primitives that explains which architectures succeed on which tasks. The primitives framework unifies prior observations: content-based routing enables accumulation and transport; attention's random-access capability additionally enables binding.

Clarification on "Bayesian inference."

We do not claim a Bayesian posterior over network weights; we show that the learned predictor implements the Bayesian posterior predictive over latent task variables: the filtering posterior over hidden states in HMMs, or the elimination posterior over bijections. This is a statement about the input-output function the transformer computes, not about weight-space uncertainty. Specifically, we demonstrate exact Bayesian inference in tasks whose posteriors factorize sequentially over the input (filtering and elimination), not general Bayesian model selection.

Roadmap: this paper as Lemma 1. This paper establishes the existence and internal geometry of exact Bayesian inference in transformers under verifiable conditions. It is the first of three papers forming a unified argument:

Paper II shows that this geometry arises generically from gradient dynamics under cross-entropy training, explaining why transformers learn Bayesian structure. Paper III shows how these primitives compose in partially observed settings closer to natural language. Together, the trilogy characterizes when, why, and how neural sequence models implement probabilistic reasoning.

Structural Theorem (Inference Primitives and Architectural Realizability). Bayesian inference in sequential prediction decomposes into three primitives:

(1) belief accumulation over fixed latent hypotheses,

(2) belief transport under latent state dynamics, and

(3) random-access binding between beliefs and past observations. Neural sequence architectures differ not in whether they can approximate Bayesian inference, but in which primitives they can realize:

• Recurrent architectures (LSTMs) implement accumulation of static sufficient statistics, but fail when inference requires transport of probability mass under nontrivial dynamics.

• State-space models (Mamba) implement accumulation and transport, but fail when inference requires random-access binding to arbitrary past observations.

• Attention-based transformers implement all three primitives by externalizing belief as a geometric, addressable representation rather than compressing it into fixed-size state.

The dominance of transformers in reasoning tasks arises not from scale alone, but from primitive completeness: they are the minimal architecture realizing the full set of inference primitives.

Cross-entropy training on contextual prediction tasks has a well-known population optimum: the Bayesian posterior predictive distribution. This section formalizes that connection. The theory establishes what the learned function should be in the infinite-data, infinite-capacity limit; the empirical sections evaluate which architectures can approximate it in finite settings.

Consider a family of tasks indexed by a latent parameter θ ∼ π(θ). For each task:

• inputs x are drawn from some distribution (possibly adversarial or chosen by the experimenter),

• labels are drawn according to y ∼ p(y | x, θ),

• the model observes a context c = {(x_i, y_i)}_{i=1}^{k} and must predict y for a new query input.

We train a model q(y | x, c) by minimizing the population cross-entropy

L(q) = E_{θ∼π} E_{(x, c, y) | θ} [ −log q(y | x, c) ].   (1)

2.2 Cross-entropy minimizes to the Bayesian posterior predictive

Theorem 1 (Population optimum of cross-entropy). The minimizer of (1) is the Bayesian posterior predictive distribution

q*(y | x, c) = p(y | x, c) = ∫ p(y | x, θ) π(θ | c) dθ,

where π(θ | c) ∝ π(θ) Π_{i=1}^{k} p(y_i | x_i, θ) is the posterior over the latent task parameter given the context.

Remark 1. This result is architecture-agnostic: it defines the Bayes-optimal function but not whether any particular architecture can represent or learn it. Our experiments address this realizability question directly.
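As a concrete illustration of the target function in Theorem 1, the following NumPy sketch (our own, not the paper's code; a finite hypothesis set and generic variable names are assumed) computes the posterior over θ given a context and the resulting posterior predictive at a query input, which is exactly the function a cross-entropy-optimal predictor should implement.

```python
import numpy as np

def posterior_predictive(prior, lik, context, x_query):
    """Bayes posterior predictive for a finite hypothesis set.

    prior   : (H,) prior pi(theta) over hypotheses
    lik     : (H, X, Y) array with lik[h, x, y] = p(y | x, theta_h)
    context : iterable of observed (x, y) pairs
    x_query : query input index

    Returns the (Y,) vector p(y | x_query, context), the population
    optimum of cross-entropy training identified by Theorem 1.
    """
    log_post = np.log(prior)
    for x, y in context:
        log_post = log_post + np.log(lik[:, x, y])   # accumulate evidence from the context
    log_post -= log_post.max()                       # stabilize before exponentiating
    post = np.exp(log_post)
    post /= post.sum()                               # pi(theta | context)
    return post @ lik[:, x_query, :]                 # sum_theta p(y | x, theta) pi(theta | c)

# Toy check with two biased coins: a context of two heads shifts the predictive.
prior = np.array([0.5, 0.5])
lik = np.array([[[0.8, 0.2]],                        # hypothesis 1: p(heads) = 0.2
                [[0.2, 0.8]]])                       # hypothesis 2: p(heads) = 0.8
print(posterior_predictive(prior, lik, context=[(0, 1), (0, 1)], x_query=0))
```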

In the bijection task, each θ is a bijection π : {1, . . . , V} → {1, . . . , V}. A training sequence reveals k − 1 input-output pairs. Let O_{k−1} be the set of outputs already observed. Because each input appears at most once per sequence, the current query x_k has never been seen before, so Bayes' rule reduces to a uniform distribution over the unseen outputs:

P(π(x_k) = v | context) = 1 / (V − k + 1) for v ∉ O_{k−1}, and 0 otherwise.

Hence the analytic posterior entropy is

H_Bayes(k) = log₂(V − k + 1),

producing a monotone staircase that shrinks each time a new mapping is revealed. This closed-form posterior allows direct, position-by-position comparison between model entropy and Bayesian entropy; memorization is computationally infeasible because the hypothesis space size V! is enormous.

In the HMM task, each θ consists of:

• a transition matrix T over S hidden states,

• an emission matrix E over V observation symbols, and

• an initial state distribution π₀.

After observing o_{1:t}, the true Bayesian posterior over hidden states is given by the forward algorithm:

α_t(s) ∝ E_{s, o_t} Σ_{s′} T_{s′, s} α_{t−1}(s′),   normalized so that Σ_s α_t(s) = 1.

Since models predict the next observation o_{t+1} rather than hidden states, we evaluate the predictive entropy

H_Bayes(t) = −Σ_o p(o_{t+1} = o | o_{1:t}) log₂ p(o_{t+1} = o | o_{1:t}),

where

p(o_{t+1} = o | o_{1:t}) = Σ_{s′} E_{s′, o} Σ_s T_{s, s′} α_t(s).

Because every training sequence is generated from a freshly sampled (T, E), the hypothesis space is massive and memorization is computationally infeasible. The model must learn to (i) parse the header encoding T and E, and (ii) implement a recursive Bayesian update.

In the regression task, each θ is a weight vector w ∈ ℝ^d with prior w ∼ N(0, I). Given context observations (x_i, y_i) where y_i = w⊤x_i + ε_i with ε_i ∼ N(0, σ²), the posterior over w is Gaussian with closed-form mean and covariance. The posterior predictive at any query point x is also Gaussian, enabling exact computation of the Bayesian predictive distribution for comparison.

In the associative recall task, each sequence contains N cue-target pairs (c_i, t_i) followed by probe cues. Unlike the other tasks, which test belief accumulation or transport, associative recall tests the binding primitive: the model must store all pairs and retrieve by content when the probe arrives. Success requires content-based routing (identifying which stored pair matches the probe) rather than sequential Bayesian updating.

The theoretical results above imply a practical diagnostic: a model that achieves the correct posterior entropy at every position is functionally Bayesian; it produces predictions with the same uncertainty profile as the exact posterior. Combined with the cross-entropy training objective (whose unique population minimizer is the Bayesian posterior predictive), low entropy calibration error provides strong evidence for Bayesian computation.

Beyond entropy: full distributional verification. Entropy matching is necessary but not sufficient for distributional convergence in general, since different distributions can share the same entropy. In our wind tunnels, however, entropy serves as a diagnostic sufficient statistic: the structured nature of the tasks (discrete elimination over bijections, recursive filtering over HMM states) strongly constrains the space of distributions achieving a given entropy. To confirm full distributional equivalence, we directly verify via KL divergence that our trained transformers match the complete Bayesian posterior (Table 1). Throughout this paper we report entropy MAE as our primary metric because it provides an interpretable, bit-level measure that generalizes across tasks. Evaluating the entropy calibration error

MAE = (1/L) Σ_{t=1}^{L} | H_model(t) − H_Bayes(t) |,

where L is the number of supervised prediction positions,

therefore provides a direct, bit-level measure of Bayesian correctness, independent of accuracy or perplexity. In later sections we show that transformers achieve near-perfect calibration, while matched MLPs do not.

Because each training instance uses a fresh bijection or a fresh HMM, memorization is infeasible; the model must perform genuine in-context inference.
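A minimal sketch of the calibration diagnostics above (our code; `model_probs` and `bayes_probs` are assumed to hold per-position predictive distributions from the model and from the analytic posterior):

```python
import numpy as np

def entropy_bits(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -(p * np.log2(p)).sum(axis=-1)

def calibration_report(model_probs, bayes_probs, eps=1e-12):
    """model_probs, bayes_probs: (L, V) arrays, one predictive distribution per
    supervised position. Returns entropy MAE (bits), mean KL(model || Bayes)
    in nats, and mean total variation distance."""
    mae = np.abs(entropy_bits(model_probs) - entropy_bits(bayes_probs)).mean()
    m = np.clip(model_probs, eps, 1.0)
    b = np.clip(bayes_probs, eps, 1.0)
    kl = (m * (np.log(m) - np.log(b))).sum(axis=-1).mean()
    tvd = 0.5 * np.abs(model_probs - bayes_probs).sum(axis=-1).mean()
    return {"entropy_mae_bits": mae, "kl_nats": kl, "tvd": tvd}
```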

Each sequence is derived from a new random bijection π : {1, . . . , V} → {1, . . . , V} with V = 20. At position k, the model has observed k − 1 distinct input-output pairs and must predict π(x_k).

Because inputs never repeat, the Bayes-optimal posterior over π(x_k) is uniform over the V − k + 1 unseen values.

Bayesian ground truth. Let O_{k−1} be the set of observed outputs. Then

P(π(x_k) = v | context) = 1 / (V − k + 1) for v ∉ O_{k−1}, and 0 otherwise,

with entropy H_Bayes(k) = log₂(V − k + 1).

Evaluation. We compute MAE over a held-out set of 2,000 bijections. Because 20! ≈ 2.4 × 10¹⁸ possible bijections exist and training uses only 10⁵ samples, no bijection is seen twice; the task enforces true hypothesis elimination.
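The sketch below (ours; V = 20 as in the text, variable names assumed) generates one bijection wind-tunnel instance and checks the analytic staircase entropy against which the model is scored:

```python
import numpy as np

def bijection_instance(V=20, seed=0):
    """A fresh bijection pi over {0..V-1} and a random order of query inputs."""
    rng = np.random.default_rng(seed)
    return rng.permutation(V), rng.permutation(V)

def elimination_posterior(V, observed_outputs):
    """Uniform over the outputs not yet ruled out (zero on observed outputs)."""
    post = np.ones(V)
    post[list(observed_outputs)] = 0.0
    return post / post.sum()

pi, order = bijection_instance()
seen = set()
for k, x in enumerate(order, start=1):
    post = elimination_posterior(len(pi), seen)
    h_bayes = -(post[post > 0] * np.log2(post[post > 0])).sum()
    assert np.isclose(h_bayes, np.log2(len(pi) - k + 1))   # the entropy staircase
    seen.add(int(pi[x]))                                   # reveal pi(x): one output eliminated
```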

The second wind tunnel probes a qualitatively different inferential structure: recursive belief updating. Each sequence is derived from a fresh HMM with S = 5 hidden states and V = 5 observation symbols. Transition rows and emission rows are drawn independently from a symmetric Dirichlet distribution with all concentration parameters equal to 1 (i.e., Dirichlet(1, 1, 1, 1, 1)), ensuring diverse and non-degenerate dynamics. The initial state distribution π₀ is also drawn from Dirichlet(1, 1, 1, 1, 1).

Each sequence consists of:

• a 10-token header encoding the flattened T and E, and

• K observation-prediction pairs, each consisting of:

the observed symbol o_t, and a supervised prediction at that same position for p(s_t | o_{1:t}).

Bayesian ground truth: forward algorithm. For each HMM and for each time t we compute

α_t(s) ∝ E_{s, o_t} Σ_{s′} T_{s′, s} α_{t−1}(s′),

normalized to Σ_s α_t(s) = 1. The exact posterior entropy is −Σ_s α_t(s) log₂ α_t(s). This tests whether the model has learned a position-independent recursive algorithm or has merely memorized a finite-horizon computation.
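The following NumPy sketch (our code; standard forward filtering, with S = V = 5 and Dirichlet(1, ..., 1) rows as in the text) computes the ground-truth quantities described above: the filtered posterior α_t and the next-observation predictive entropy used for scoring:

```python
import numpy as np

def forward_filter(T, E, pi0, obs):
    """Exact HMM filtering plus next-observation predictive entropies (bits).
    T  : (S, S) row-stochastic transitions, T[i, j] = p(s_{t+1}=j | s_t=i)
    E  : (S, V) row-stochastic emissions,  E[s, o] = p(o | s)
    pi0: (S,) initial state distribution
    obs: sequence of observed symbol indices o_1..o_K."""
    alphas, entropies = [], []
    belief = pi0.copy()                      # p(s_t | o_{1:t-1})
    for t, o in enumerate(obs):
        if t > 0:
            belief = T.T @ alphas[-1]        # belief transport through the dynamics
        alpha = belief * E[:, o]             # belief accumulation: condition on o_t
        alpha /= alpha.sum()
        alphas.append(alpha)
        pred = (T.T @ alpha) @ E             # p(o_{t+1} = . | o_{1:t})
        entropies.append(-(pred * np.log2(pred + 1e-12)).sum())
    return np.array(alphas), np.array(entropies)

# A fresh HMM and trajectory, as in the wind tunnel.
rng = np.random.default_rng(0)
S = V = 5
T = rng.dirichlet(np.ones(S), size=S)
E = rng.dirichlet(np.ones(V), size=S)
pi0 = rng.dirichlet(np.ones(S))
s = rng.choice(S, p=pi0)
obs = []
for _ in range(20):
    obs.append(rng.choice(V, p=E[s]))        # emit from the current hidden state
    s = rng.choice(S, p=T[s])                # then transition
alphas, H_bayes = forward_filter(T, E, pi0, obs)
```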

Why memorization is computationally infeasible. Each sequence uses new T, E matrices and new stochastic emission trajectories. The space of possible HMMs exceeds 10⁴⁰ even under coarse discretization, ensuring that learned behavior cannot rely on recall of any particular HMM.

To isolate the random-access binding primitive, we use a standard associative recall task: the model must retrieve a target given a probe cue from a set of cue-target pairs presented in context.

Task format. Each sequence contains N cue-target pairs (c_i, t_i) followed by Q probe cues. Cues and targets are drawn uniformly from disjoint vocabularies of size 256 each (total vocabulary 522 including special tokens). The model must predict the associated target for each probe. With N = 16 pairs and Q = 3 probes, sequence length is 137 tokens.

• Training: 50,000 sequences, 30 epochs

• Models: Transformer, Mamba, LSTM with d_model = 128 and comparable parameter counts (∼570k-660k)

• Metric: exact-match accuracy on probe predictions

Why this tests binding. Unlike bijection (where the hypothesis space shrinks predictably) or HMM (where beliefs evolve through dynamics), associative recall requires late, content-dependent retrieval: the model cannot know which cue-target pairs matter until the probe arrives. This requires storing all pairs and retrieving by content, i.e., the binding primitive.
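A sketch of the data format under the stated sizes (disjoint cue and target vocabularies of 256, N = 16, Q = 3); the token-id layout is our own simplification and omits the special tokens and prediction slots that bring the paper's sequences to 137 tokens:

```python
import numpy as np

def recall_instance(n_pairs=16, n_probes=3, vocab=256, seed=0):
    """One associative-recall instance: cue-target pairs, then probe cues.
    Cues use ids [0, vocab); targets use ids [vocab, 2 * vocab). The label for
    each probe is the target bound to that cue earlier in the sequence."""
    rng = np.random.default_rng(seed)
    cues = rng.choice(vocab, size=n_pairs, replace=False)
    targets = vocab + rng.choice(vocab, size=n_pairs, replace=False)
    probe_idx = rng.choice(n_pairs, size=n_probes, replace=False)
    context = np.stack([cues, targets], axis=1).reshape(-1)   # c1 t1 c2 t2 ...
    tokens = np.concatenate([context, cues[probe_idx]])
    labels = targets[probe_idx]               # retrieval is by content, not position
    return tokens, labels

tokens, labels = recall_instance()
# Exact-match accuracy is then mean(model_predictions_at_probes == labels).
```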

Results preview. At N = 16 pairs: Transformer achieves 100% accuracy (by epoch 12); Mamba achieves 97.8% (epoch 30); LSTM achieves 0.5% (random chance). The 2.5× longer training for Mamba and its imperfect accuracy suggest that selective SSM routing can approximate but not fully implement random-access binding. Full results appear in Section 4.6.

To test continuous latent variables, we use multivariate linear regression with a Gaussian prior.

Task format. Each sequence derives from a fresh linear regression with weights w ∼ N(0, I_d) where d = 3. Observations follow y_i = w⊤x_i + ε_i with ε_i ∼ N(0, 0.25) and inputs x_i ∼ N(0, I_d). The model observes k = 6 context pairs before predicting at a query point drawn from the same input distribution. Outputs are discretized into 41 bins uniformly spaced over [−5, 5], covering > 99.9% of the predictive mass given the prior and noise scales.

Bayesian ground truth. The posterior over w given the context is Gaussian with closed-form mean and covariance, yielding a Gaussian predictive distribution at any query point. This enables exact computation of the Bayesian predictive for comparison.
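A NumPy/SciPy sketch of this ground truth (standard conjugate formulas; the helper names and the use of `scipy.stats.norm` for binning are our own choices):

```python
import numpy as np
from scipy.stats import norm

def gaussian_posterior_predictive(X, y, x_query, sigma2=0.25):
    """Bayesian linear regression with prior w ~ N(0, I) and noise variance sigma2.
    Returns the mean and variance of the Gaussian predictive at x_query."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(np.eye(d) + X.T @ X / sigma2)   # posterior covariance
    mu_n = Sigma_n @ X.T @ y / sigma2                       # posterior mean
    return x_query @ mu_n, sigma2 + x_query @ Sigma_n @ x_query

def discretize_predictive(mean, var, n_bins=41, lo=-5.0, hi=5.0):
    """Bin the Gaussian predictive into the 41 uniform bins over [-5, 5]."""
    edges = np.linspace(lo, hi, n_bins + 1)
    probs = np.diff(norm.cdf(edges, loc=mean, scale=np.sqrt(var)))
    return probs / probs.sum()

# One instance: d = 3, k = 6 context pairs, sigma^2 = 0.25.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(6, 3))
y = X @ w + rng.normal(scale=0.5, size=6)
bayes_bins = discretize_predictive(*gaussian_posterior_predictive(X, y, rng.normal(size=3)))
```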

Why this tests continuous inference. Unlike bijection (discrete hypotheses) or HMM (discrete states), regression requires inference over a continuous parameter space. The Gaussian structure ensures the posterior remains tractable while testing whether transformers can calibrate uncertainty in continuous settings.

Transformers. We use small but realistic transformer stacks:

• Bijection transformer (2.67M): 6 layers, 6 heads, d_model = 192, d_ffn = 768.

• HMM transformer (2.68M): 9 layers, 8 heads, d_model = 256, d_ffn = 1024.

Both use learned token embeddings, learned absolute positional embeddings, pre-norm residual blocks, and standard multi-head self-attention.

Mamba (selective state-space model). To test whether attention specifically is required, or whether content-based routing more generally suffices, we train Mamba models [8]. Mamba replaces attention with a selective state-space mechanism: input-dependent matrices (Δ, B, C) gate the recurrent state update. This provides content-based routing without explicit query-key matching.

LSTM baselines. To test whether recurrence alone suffices, we train LSTMs:

• Bijection LSTM (4.77M): 9 layers, hidden dimension 256.

• HMM LSTM (4.77M): 9 layers, hidden dimension 256.

LSTMs have recurrent state but use fixed gating: forget, input, and output gates do not depend on content relationships across sequence positions. This tests whether recurrence without content-based routing can implement Bayesian inference.

Capacity-matched MLP baselines. To isolate the role of sequence structure entirely, we train MLPs with 18-20 layers, width 384-400, residual connections, and layer normalization (parameter counts match transformers within 1%). The MLP receives the entire context sequence (all previous tokens concatenated) as input at each prediction position, but processes this flattened input without any attention or recurrence, testing whether feedforward computation over concatenated context suffices for Bayesian inference. We chose this design to give the MLP maximal access to context information; stronger attention-free baselines (e.g., Perceiver-style pooling, permutation-invariant aggregation) could narrow the gap but would reintroduce forms of content-based routing.

Training is identical across architectures for each task.

Optimization. AdamW with β₁ = 0.9, β₂ = 0.999, weight decay 0.01, gradient clipping at 1.0. Batch size is 64 for all tasks.

Learning rates and training steps.

• Bijections: constant 10⁻³ for 150k steps.

• HMMs: 3 × 10⁻⁴ with a 1000-step warmup and cosine decay for 100k steps.

• Regression: 5 × 10⁻⁴ for 50k steps.

Data sampling. Every batch draws fresh bijections or fresh HMMs; sequences never repeat.

Teacher forcing. Cross-entropy loss is applied at each supervised prediction position.
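A hedged PyTorch-style sketch of this shared protocol (optimizer settings from the text; `model`, the batch layout, and the supervision mask are placeholders, and the warmup-plus-cosine schedule shown is the HMM variant):

```python
import math
import torch
import torch.nn.functional as F

def make_optimizer(model, lr=3e-4, warmup=1000, total_steps=100_000):
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            betas=(0.9, 0.999), weight_decay=0.01)
    def schedule(step):                       # linear warmup, then cosine decay
        if step < warmup:
            return step / max(1, warmup)
        t = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * t))
    return opt, torch.optim.lr_scheduler.LambdaLR(opt, schedule)

def train_step(model, opt, sched, batch):
    """One step: fresh sequences every batch, cross-entropy at every supervised
    prediction position (teacher forcing), gradient clipping at 1.0."""
    tokens, targets, mask = batch             # mask marks the supervised positions
    logits = model(tokens)                    # (B, L, V)
    loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    loss = (loss * mask).sum() / mask.sum()
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    sched.step()
    return loss.item()
```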

Ablation stability. Layer-wise and head-wise ablations are reported as averages over three random seeds; the HMM length-generalization results are also evaluated across multiple seeds to ensure robustness.

We evaluate whether transformers lie on the analytic Bayesian manifold using two behavioral tests:

(1) Pointwise calibration: does H_model(t) match H_Bayes(t) at every position?

(2) Generalization: does the learned computation extend to unseen bijections, unseen HMMs, and longer sequences?

We present results for bijections and HMMs in parallel, followed by MLP controls and multi-seed robustness.

A 2.67M-parameter transformer converges to the analytic posterior with near machine precision. Figure 3 shows the predictive entropy

H_model(k) = −Σ_v q(v | context) log₂ q(v | context)

overlaid on H_Bayes(k) = log₂(V − k + 1). The curves coincide across all positions, including late steps where only 2-4 hypotheses remain. Quantitatively, the transformer achieves MAE = 3 × 10⁻³ bits, averaged over 2,000 held-out bijections. This error is smaller than single-precision numerical noise in the analytic posterior. All analytic Bayes computations use double-precision arithmetic, so reported errors are meaningful. We verify full distributional agreement via KL and TVD: across all positions, KL(model∥Bayes) < 0.01 nats and TVD < 3%, with per-position analysis confirming agreement holds across the full entropy range (high, moderate, and low entropy positions).

Per-sequence evidence. Aggregate calibration could hide averaging artifacts. Figure 4 plots eight individual entropy trajectories. Each displays the characteristic staircase pattern: entropy drops discretely whenever a new input-output pair eliminates hypotheses, and collapses to near zero when an input repeats and the mapping is known. The model performs stepwise Bayesian elimination, reproducing the curve sequence by sequence rather than merely matching it in expectation.

Inside-model consistency. Layer-wise ablations (Figure 5) show that removing any block increases error by more than an order of magnitude, confirming a deeply compositional computation. Head-wise ablations (Figure 6) identify a single Layer 0 hypothesis-frame head whose removal is uniquely destructive, consistent with the geometric analysis in Section 5.

The 2.68M-parameter transformer also learns the forward algorithm for HMM inference.

Within training horizon (K=20). At t ≤ 20, model entropy tracks the exact forward-recursion entropy with MAE = 7.5 × 10⁻⁵ bits.

The two curves are visually indistinguishable (Figure 7). (Note: The architecture comparison in Table 2 uses a smaller parameter-matched model at 2.7M for fair cross-architecture comparison, yielding MAE = 0.049 bits; the result here uses the task-optimized 2.68M transformer.)

Beyond training horizon (K=30, K=50). To test algorithmic generalization, we roll the model out to 1.5× and 2.5× training length. The transformer stays close to the analytic posterior:

Errors increase smoothly with t, with no discontinuity at t = 20 (the training boundary). This is strong evidence of a position-independent recursive algorithm rather than a finite-horizon memorized computation.

Per-position calibration. Figure 8 shows absolute error |H_model(t) − H_Bayes(t)|. Three patterns emerge:

(1) early positions are slightly noisier (uncertain initial state);

(2) mid-sequence positions achieve near-zero error at all lengths;

(3) late positions degrade smoothly with sequence length, consistent with accumulated numerical drift.

Per-sequence dynamics. Figure 9 shows the model tracking sequence-specific fluctuations: entropy dips when emissions strongly identify states and rises when observations are ambiguous. The transformer captures these dynamics exactly.

Semantic invariance under hidden-state relabeling. Hidden-state indices are purely symbolic: permuting the labels corresponds to the same latent process. We sample a random permutation σ of {1, . . . , S} and apply it to the HMM parameters by permuting rows and columns of T (i.e., T′_{σ(i), σ(j)} = T_{i, j}) and permuting rows of E (i.e., E′_{σ(i), o} = E_{i, o}). We then recompute the analytic posterior under (T′, E′) and evaluate the model on sequences generated from the permuted HMM. If the model implements Bayesian filtering rather than associating meaning with specific state IDs, its entropy calibration should be unchanged up to numerical noise. Figure 10 shows that MAE before vs. after permutation lies on the diagonal, with ΔMAE concentrated near zero.
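A sketch of this check (our code; it reuses the `forward_filter` helper from the HMM sketch above, and the permutation convention follows the definitions in the text):

```python
import numpy as np

def permute_hmm(T, E, pi0, sigma):
    """Relabel hidden states: T'[sigma(i), sigma(j)] = T[i, j],
    E'[sigma(i), o] = E[i, o], pi0'[sigma(i)] = pi0[i]."""
    Tp, Ep, pp = np.empty_like(T), np.empty_like(E), np.empty_like(pi0)
    Tp[np.ix_(sigma, sigma)] = T
    Ep[sigma, :] = E
    pp[sigma] = pi0
    return Tp, Ep, pp

rng = np.random.default_rng(1)
S, V = 5, 5
T = rng.dirichlet(np.ones(S), size=S)
E = rng.dirichlet(np.ones(V), size=S)
pi0 = rng.dirichlet(np.ones(S))
sigma = rng.permutation(S)
obs = rng.integers(0, V, size=20)

_, H = forward_filter(T, E, pi0, obs)                       # analytic entropies
_, H_perm = forward_filter(*permute_hmm(T, E, pi0, sigma), obs)
assert np.allclose(H, H_perm)    # relabeling leaves the predictive entropy unchanged
```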

To identify which components support stable rollout, we train a variant transformer in which attention is disabled in the top two layers but FFNs and residuals remain intact. The no-late-attention model fits the training horizon reasonably well (1.57 × 10⁻³ bits), but breaks down under rollout:

The degradation factor grows from 21× (at K = 20) to 62× (at K = 50), demonstrating that late-layer attention is not required for fitting K = 20 but is essential for stable long-horizon Bayesian updates (Figure 11).

The associative recall task tests whether architectures can retrieve stored information by content: the binding primitive. Unlike the bijection or HMM tasks, where the relevant context is determined by position or dynamics, here the model must identify which cue-target pair to retrieve based on a probe that arrives only at test time. The transformer achieves 100% accuracy on held-out sequences, demonstrating perfect content-based retrieval. This requires the attention mechanism: the probe cue must match against stored cues to route information from the corresponding target. MLPs did not succeed on this task under our training protocol (accuracy at chance), as they process each position independently without cross-token interaction.

To test continuous latent variables, we evaluate on multivariate linear regression with a Gaussian prior. The transformer (857k params, 4 layers) predicts discretized outputs over 41 bins, achieving KL = 0.034 nats to the closed-form Bayesian predictive; the capacity-matched MLP achieves KL = 0.22 nats, roughly 6× worse. The gap is smaller than for the discrete tasks, but the transformer still substantially outperforms on uncertainty calibration, demonstrating that the Bayesian inference capability extends to continuous parameter spaces.

To understand which architectural ingredients enable Bayesian inference, we decompose it into three inference primitives and test which architectures realize each:

• Belief accumulation: integrating evidence into a running posterior (tested by bijection elimination)

• Belief transport: propagating beliefs through stochastic dynamics (tested by HMM filtering)

• Random-access binding: retrieving stored hypotheses by content (tested by associative recall)

Table 2 summarizes results across all three tasks.

Transformers realize all three primitives. Transformers achieve near-exact Bayesian posteriors on bijection and HMM, and perfect accuracy on associative recall with 64 cue-target pairs. Attention provides all three capabilities: accumulation via the residual stream, transport via content-based routing, and binding via query-key matching.

Mamba realizes accumulation and transport, but struggles with binding. Mamba outperforms the transformer on HMM tracking (0.024 vs 0.049 bits MAE), demonstrating that its selective state-space mechanism excels at belief transport. However, on associative recall, Mamba reaches only 97.8% accuracy and requires 2.5× more training epochs than the transformer. This matches the finding of Jelassi et al. [10] that SSMs struggle with retrieval tasks: Mamba's selection mechanism implements content-based routing on transition dynamics, which enables transport but not direct random-access retrieval.

LSTMs realize only accumulation of static sufficient statistics. The LSTM achieves 0.009 bits on bijection, comparable to Transformer and Mamba, but fails on both HMM (0.416 bits) and associative recall (0.5%, random chance). The key distinction: bijection elimination admits a static sufficient statistic (the set of observed outputs, representable in fixed dimension) that the LSTM can track with its recurrent state. But HMM filtering requires a sufficient statistic that evolves under dynamics (the belief vector must be transported through the transition matrix at each step), and associative recall requires the statistic to be indexed by content. Under our training protocol, the LSTM's fixed gating accumulated evidence but did not learn to implement the content-dependent operations required for transport or binding.

MLPs realize no primitives. Without sequence structure, MLPs collapse to marginal predictions and fail uniformly.

The primitives taxonomy predicts performance. This decomposition explains the full pattern of results: each architecture succeeds precisely on tasks requiring only the primitives it can implement. The taxonomy also resolves apparent contradictions in the literature: Mamba beats transformers on tasks dominated by transport (HMM) but loses on tasks requiring binding (recall).

To ensure that Bayesian tracking is not an artifact of initialization or optimization noise, we repeated the transformer HMM experiments across five independent random seeds. Per-position error curves for all seeds (Figure 13) nearly overlap at K = 20, K = 30, and K = 50. The seed-to-seed variability is negligible compared to the gap between architectures, confirming that the learned Bayesian algorithm is robust to initialization and training noise. The Mamba and LSTM HMM results in Table 2 are 5-seed averages with standard deviations of 0.009 and 0.003 bits respectively, confirming that the large performance gaps between architectures (e.g., LSTM's 0.411 vs Mamba's 0.024 bits on HMM) reflect genuine architectural differences rather than seed-to-seed variation.

The behavioral results in Section 4 demonstrate that small transformers track analytic Bayesian posteriors with sub-bit precision across two distinct wind-tunnel tasks. We now examine how this computation is implemented internally. Evidence from ablations, QK geometry, probe dynamics, and training trajectories reveals a consistent architectural mechanism: transformers perform Bayesian inference by constructing a representational frame, executing sequential eliminations within that frame, and progressively refining posterior precision across layers.

The computation begins with a structural operation: Layer 0 attention constructs the hypothesis space in which all subsequent inference takes place. Keys at this layer form an approximately orthogonal basis over input tokens (Figure 16), providing a coordinate system over which posterior mass can be represented and manipulated. We measure orthogonality via the mean absolute off-diagonal cosine similarity between key vectors. Across 5 seeds, the bijection model achieves 0.052 ± 0.004 versus 0.082 ± 0.003 for random vectors in d = 192 dimensions, a 37% reduction (p < 0.001, paired t-test). The HMM model shows similar structure: 0.061 ± 0.006 vs. 0.079 ± 0.002 for the random baseline. Head-wise ablations confirm the indispensability of this step. A single Layer 0 "hypothesis-frame head" dominates the layer's contribution (Figure 6), and ablating this head alone severely disrupts calibration. Here "hypothesis-frame head" means the head whose keys span the near-orthogonal basis over hypothesis tokens and whose values instantiate the corresponding per-hypothesis slots in the residual stream. No other attention head exhibits comparable sensitivity. This identifies a structural bottleneck: forming the hypothesis frame is a prerequisite for any later Bayesian computation. Once established, this frame remains stable through training. Attention maps at Layer 0 change little across checkpoints, even as the value manifold and calibration improve substantially. The model therefore learns the geometry of the inference problem early, and subsequently refines numerical precision within this fixed frame.
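A sketch of this diagnostic (our code; `keys` would be the key vectors extracted from the Layer 0 head, here replaced by a random stand-in):

```python
import numpy as np

def mean_offdiag_abs_cosine(K):
    """K: (n, d) key vectors. Mean absolute off-diagonal cosine similarity;
    0 would indicate a perfectly orthogonal hypothesis basis."""
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    C = Kn @ Kn.T
    off_diag = ~np.eye(len(K), dtype=bool)
    return np.abs(C[off_diag]).mean()

# Random-direction baseline in d = 192; the magnitude scales roughly as 1/sqrt(d).
rng = np.random.default_rng(0)
keys = rng.normal(size=(20, 192))            # stand-in for extracted key vectors
print(mean_offdiag_abs_cosine(keys))
```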

With the hypothesis frame in place, the middle layers perform a layer-by-layer process that mirrors Bayesian elimination.

Progressive QK sharpening. As depth increases, queries align more strongly with the subset of keys consistent with the observed evidence (Figure 17). Early layers attend broadly; deeper layers concentrate attention almost exclusively on the feasible hypotheses. This geometric focusing parallels analytic Bayesian conditioning, where inconsistent hypotheses receive vanishing weight.

Hierarchical compositionality. Layer-wise ablations (Figure 5) show that removing any single layer (attention + FFN, as implemented) increases calibration error by more than an order of magnitude. This demonstrates that the computation is not shallow or redundant. Each layer provides a distinct and non-interchangeable refinement step, forming a sequential, compositional realization of Bayesian updates. Together, these observations indicate that transformers implement Bayesian elimination not via a single transformation, but through a depth-wise sequence of projections and refinements within the Layer 0 frame.

Across all depths, attention serves a consistent geometric role: it retrieves the components of the belief state relevant for the next update. Three observations support this routing interpretation:

• Orthogonal keys (Figure 16) provide a basis for content-addressable lookup of hypotheses.

• Sharpened QK alignment across depth (Figure 17) routes residual-stream information toward the feasible hypothesis subspace.

• Stable routing during late refinement (Figures 18 and 19) shows that once the frame is correct, attention maps change minimally even as calibration improves.

Routing is also essential for maintaining stable recursive inference. In the HMM task, disabling attention only in the top two layers leaves performance within the training horizon largely intact, but long-horizon inference collapses (Figure 11). Thus attention is required both for forming the initial hypothesis frame and for sustaining stable belief updates under extended rollout.

After routing stabilizes, the final layers refine the precision of the posterior representation. Figures 18 and 19 show that:

• At intermediate checkpoints, value representations of low-entropy states are nearly collapsed and cannot reliably encode distinctions among small remaining hypothesis sets.

• By the final checkpoint, these states lie along a smooth one-dimensional manifold parameterized by posterior entropy.

This geometric unfurling enables fine-grained encoding of posterior confidence and accounts for late-position improvements in calibration. This refinement occurs while attention maps remain nearly unchanged, producing a clear frame-precision dissociation: attention defines where information flows, while downstream transformations refine how precisely beliefs are encoded.

Across both wind tunnels, the evidence aligns into a three-stage mechanism (Figure 20):

(1) Foundational binding (Layer 0). Construct an orthogonal hypothesis frame. (Key geometry; catastrophic Layer 0 head ablations.)

(2) Progressive elimination (middle layers). Sequentially suppress inconsistent hypotheses through sharpening QK alignment. (Layer-wise compositionality; geometric focusing.)

(3) Precision refinement (late layers). Encode posterior entropy on a smooth value manifold while keeping routing fixed. (Value-manifold unfurling; frame-precision dissociation.)

This structure mirrors the analytic decomposition of Bayesian conditioning: define a hypothesis space, update beliefs with evidence, and refine confidence as uncertainty decreases.

While the three-stage mechanism above is specific to transformer attention, Mamba achieves comparable HMM performance (0.024 vs 0.049 bits MAE) through qualitatively different mechanisms. Analyzing Mamba's internal representations reveals how selective state-space dynamics implement belief transport without attention.

Five-cluster geometry emerges from state selection. Mamba's final-layer representations organize into five discrete clusters corresponding to the five HMM hidden states (Figure 2), with within-cluster variation encoding posterior entropy. This is the same corner geometry of the belief simplex that transformers learn, achieved through input-dependent state selection rather than query-key matching.

Distributed representations enable belief transport. Layer-wise analysis (Figure 21) reveals a key difference between Mamba and LSTM:

• Mamba: final-layer representations achieve R² = 0.40 for entropy prediction, with PC1 explaining only 21.9% of variance; the model maintains a distributed, multi-dimensional belief representation.

• LSTM: final-layer representations achieve R² = 0.004 (near random), with PC1 explaining 92.3% of variance; the model collapses to a one-dimensional manifold uncorrelated with true posterior entropy.

This explains the primitives difference: under our training protocol, LSTM’s fixed gating did not maintain the multi-dimensional structure required for belief transport, while Mamba’s selection mechanism preserves it.

Layer-wise entropy encoding. Unlike transformers, where entropy correlation increases sharply in later layers, Mamba shows moderate correlation (|r| ≈ 0.11) across middle layers with gradual improvement in linear predictability (R²) from 0.025 at Layer 1 to 0.40 at Layer 9. This suggests Mamba accumulates information more gradually through its recurrent state rather than in discrete elimination steps.
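A sketch of these layer-wise diagnostics (our choices: an ordinary least-squares probe for R² and PCA via SVD for the PC1 variance ratio; `H` holds one layer's per-position activations and `h_bayes` the analytic posterior entropies):

```python
import numpy as np

def entropy_probe_r2(H, h_bayes):
    """R^2 of a linear probe predicting analytic posterior entropy
    from layer activations H of shape (n_positions, d)."""
    X = np.concatenate([H, np.ones((len(H), 1))], axis=1)   # add a bias column
    coef, *_ = np.linalg.lstsq(X, h_bayes, rcond=None)
    resid = h_bayes - X @ coef
    return 1.0 - (resid ** 2).sum() / ((h_bayes - h_bayes.mean()) ** 2).sum()

def pc1_variance_ratio(H):
    """Fraction of variance captured by the first principal component;
    values near 1 indicate collapse to a one-dimensional manifold."""
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()
```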

Convergence to Bayesian geometry. The fact that different mechanisms (attention's query-key matching and Mamba's selective state updates) converge to similar geometric solutions, namely the corner geometry of the belief simplex, suggests that Bayesian geometry may be a universal attractor for architectures with content-based routing. This provides a geometric explanation for why both architectures succeed at belief transport despite using fundamentally different computational primitives.

These empirical observations match predictions from recent analyses of gradient dynamics, which show that attention scores tend to stabilize once the correct routing structure has formed, while value and residual representations continue to refine precision. The observed stability of attention maps together with the unfolding of the value manifold provides direct evidence for this differential convergence of routing and precision.

The wind-tunnel experiments demonstrate that small transformers, trained with standard optimization and without architectural modifications, implement Bayesian inference with striking fidelity. In this section we discuss the broader implications of these results for interpretability, architectural necessity, and the connection between controlled wind tunnels and the behavior of large language models.

Across the bijection and HMM settings, the internal geometry uncovered in Section 5 reveals a consistent computational pattern. Transformers realize Bayesian conditioning through a stacked sequence of geometric operations:

(1) Foundational binding (Layer 0). Orthogonal keys create a hypothesis frame. The catastrophic effect of ablating the Layer 0 hypothesis-frame head (Figure 6) demonstrates that this frame is structurally indispensable.

(2) Progressive elimination (middle layers). QK alignment sharpens across depth (Figure 17), mirroring the multiplicative suppression of ruled-out hypotheses in analytic Bayesian updates. Layer-wise ablations (Figure 5) show that each layer contributes a non-interchangeable refinement step.

(3) Precision refinement (late layers). Once routing stabilizes, value representations unfold into a low-dimensional manifold parameterized by posterior entropy (Figure 18), improving calibration particularly at late positions (Figure 19).

This frame-precision dissociation reflects a division of labor: attention establishes where information flows, while subsequent transformations refine the numerical precision of the belief.

This hierarchy parallels Bayes’ rule: define a hypothesis space, integrate evidence, and refine the posterior. The transformer implements these steps using attention geometry and residual-stream representations.

A central conclusion from the ablation studies is that depth is not redundant. In both wind tunnels, removing any individual layer increases calibration error by more than an order of magnitude (Figure 5). This shows that Bayesian reasoning is expressed as a sequence of compositional projections, each layer refining the belief state in a way that cannot be collapsed into a single transformation. This stands in contrast to wide, shallow architectures: even with comparable parameter counts and identical training, MLPs fail to perform hypothesis elimination or state tracking (Section 4.6). Bayesian inference requires hierarchical refinement, and transformers supply the appropriate inductive bias through depth and residual composition.

While the wind tunnels are deliberately simplified, they capture the essential structure of probabilistic inference: integrating evidence over time to update latent beliefs. Large language models operate in a far more complex setting, with high-dimensional latent spaces and ambiguous, multi-modal evidence. Yet the geometric ingredients observed here (orthogonal hypothesis axes, depth-wise refinement, and stable routing) are structural rather than task-specific. The results therefore suggest that the probabilistic behaviors exhibited by LLMs may arise not only from scale or data richness but also from architectural geometry. Wind tunnels provide a verifiable lower bound: they show that transformers can implement Bayesian inference exactly when the posterior is known.

The four-architecture comparison (Table 2) reveals that Bayesian inference is not monolithic: different tasks demand different inference primitives, and different architectures realize different subsets.

Why Mamba beats the Transformer on HMM but loses on recall. Mamba's selective state-space mechanism implements content-based routing on transition dynamics: the input-dependent matrices (Δ, B, C) control what information propagates forward through the recurrence. This is ideal for belief transport (tracking how a posterior evolves through stochastic dynamics), which explains Mamba's superiority on HMM (0.024 vs 0.049 bits). But binding requires a different operation: given a query, retrieve the associated value from memory. Attention provides this directly via query-key matching with O(1) access to any position. Mamba must simulate retrieval through its recurrent state, which is slower and less precise. This explains the 2.5× longer training and the 97.8% vs 100% accuracy gap on associative recall.

Why LSTM succeeds on bijection but fails everywhere else. The LSTM's gates depend only on (h_{t−1}, x_t); under our training protocol, they did not learn to perform content-based matching across positions. This suffices for accumulating static sufficient statistics (bijection admits a fixed-dimensional statistic: the set of observed outputs), but fails when:

• The statistic must evolve under dynamics (transport): the LSTM compresses HMM representations to a 1D manifold uncorrelated with the true 5D belief state; it did not transport the belief vector through the transition matrix.

• The statistic must be indexed by content (binding): the LSTM achieves 0.5% on recall (random chance); it did not retrieve by content.

Implications. The primitives framework provides a principled basis for architecture selection: match the architecture’s capabilities to the task’s primitive requirements. It also explains why attention remains essential for tasks requiring flexible retrieval, even as alternatives like Mamba excel at sequence modeling.

The wind tunnels establish a principled baseline for mechanistic reasoning in transformers. If a model cannot implement Bayes in a setting with a closed-form posterior and impossible memorization, it offers little evidence of genuine inference capability in natural language. Conversely, the fact that small, verifiable transformers succeed here, with interpretable geometric mechanisms, suggests that similar structures may underpin reasoning in large models. This provides a concrete research direction: search for the same geometric signatures in frontier LLMs. The diagnostics used here (key orthogonality, QK sharpening, value-manifold structure, and routing stability) offer testable predictions for analyzing pretrained language models.

7 Related Work

A long line of work interprets neural networks through a Bayesian lens, from classical analyses of predictive uncertainty [11,14] to variational or stochastic approximations of posterior inference [3,7]. Recent papers argue that, in large-data limits, minimizing cross-entropy implicitly targets the Bayesian posterior predictive [18,19]. These results concern what training should produce at the population level. Our contribution is complementary: a controlled setting in which the true posterior is known, memorization is computationally infeasible, and one can directly test whether a finite transformer actually realizes this Bayesian computation.

Transformers have been shown to perform algorithmic tasks in context, including arithmetic [6], synthetic induction [5], and more general pattern extrapolation [2,15]. Behaviorally, these models often resemble Bayesian learners, an observation formalized by recent explanatory theories [18,19]. Müller et al. [12] provide evidence that high-capacity transformers often mimic the Bayesian predictor during in-context learning, and Reuter et al. [17] demonstrate that transformers can perform full Bayesian inference for statistical models including generalized linear models and latent factor models, achieving results comparable to expensive exact methods. However, prior work cannot distinguish true Bayesian computation from learned heuristics or memorized templates, because the ground-truth posterior is unknown for natural language tasks. Furthermore, these studies focus on whether transformers can implement Bayesian inference, not why they succeed where other architectures fail. Our wind-tunnel methodology addresses both gaps: by constructing tasks with closed-form analytic posteriors and combinatorially large hypothesis spaces, we obtain a direct pointwise comparison between model predictions and Bayes' rule. By comparing multiple architectures, we identify the inference primitives that explain success and failure. This moves the discussion from correlation to mechanism, and from existence to architectural characterization.

Mechanistic studies of transformers have revealed specialized attention heads for induction, copying, and retrieval [4,13]. Other work has examined QKV spaces, circuit decomposition, and sparse structures that arise during training [15]. These studies provide qualitative and circuit-level insight into model behaviors. Our contribution is to link these geometric structures directly to Bayesian inference in a setting where the posterior is known. We show that keys form near-orthogonal hypothesis axes, queries sharpen onto feasible hypotheses across depth, and value representations unfurl into a one-dimensional entropy manifold. This connects mechanistic interpretability to probabilistic computation in a rigorous way: the internal geometry needed for Bayesian reasoning becomes directly visible.

Alternative sequence models (state-space architectures [8,9], convolutional variants [16], and recurrent networks) often match transformers in perplexity on natural text. But perplexity conflates modeling and inference capability. Our results provide a finer test: whether an architecture can reproduce an analytic Bayesian posterior under strict non-memorization constraints. Recent work has identified a fundamental limitation of state-space models: Jelassi et al. [10] show that SSMs struggle with copying and retrieval tasks because they compress the input into a fixed-size latent state. Transformers with 10× fewer parameters outperform Mamba on retrieval, and Mamba requires 100× more training data to learn simple copying. This limitation stems from the recurrent structure itself: information must be explicitly stored in state rather than retrieved on demand. Our primitives framework provides a unifying explanation for these findings. We decompose Bayesian inference into three primitives: belief accumulation (integrating evidence), belief transport (propagating beliefs through dynamics), and random-access binding (retrieving hypotheses by content). The copying/retrieval limitation identified by Jelassi et al. [10] corresponds precisely to our binding primitive. Mamba's selective state-space mechanism [8], in which the transition matrices Δ, B, C become input-dependent, implements content-based routing on the transition dynamics, enabling accumulation and transport. But this is fundamentally different from attention's ability to directly retrieve arbitrary past positions via query-key matching. This explains our empirical pattern: Mamba outperforms transformers on HMM filtering (0.027 vs 0.049 bits MAE), a task dominated by belief transport, because its selection mechanism is optimized for controlling what information propagates forward. But Mamba is slower on associative recall (97.8% vs 100%, requiring 2.5× more epochs), a task requiring random-access binding. LSTMs fail on both because their gates depend only on (h_{t−1}, x_t); they did not learn content-based matching across positions under our training protocol. The primitives taxonomy thus predicts which architecture will excel on which task, resolving apparent contradictions in the literature. Recent work shows SSM performance depends strongly on optimization: mimetic initialization can yield attention-like behavior in Mamba, improving copying and autoregressive tasks. Our primitives conclusions characterize what each architecture achieved under standard training, not absolute capability limits. With specialized initialization or broader hyperparameter sweeps, the gaps we observe might narrow, though the qualitative pattern (Mamba excelling at transport, struggling with binding) appears robust across our experiments.

Finally, concurrent work analyzes the gradient dynamics that create these structures during training [1]. They show that attention and value updates follow coupled laws that produce a stable routing frame and a progressively refined value manifold. Our empirical findings align with this picture: attention stabilizes early, while value vectors continue to encode the posterior with increasing resolution. Together, these perspectives connect the optimization trajectory to the geometric structure that implements Bayesian inference.

Our experiments are intentionally small-scale: they use controlled Bayesian wind tunnels with analytic posteriors, modest vocabulary sizes, and transformers with 2-3M parameters. This regime is what makes mechanistic verification possible, but it naturally abstracts away from the full complexity of natural-language inference. Several limitations therefore remain, which point directly toward future extensions.

Scale and richness of inference tasks. Bijections and HMMs capture essential elements of Bayesian computation (discrete elimination and recursive state tracking) but represent only a narrow slice of the inference problems encountered by large language models. Future wind tunnels could incorporate richer latent-variable structures, including Kalman filtering, hierarchical Bayesian models, or causal graphical models, all of which have closed-form posteriors and allow precise verification.

Dimensionality of hypothesis spaces. Although the hypothesis spaces in both tasks are large enough to prevent memorization, their representational dimensionality is modest (e.g., five hidden states in HMMs). Larger systems with high-dimensional latent variables would test whether the geometric mechanisms we observe (orthogonal hypothesis axes, progressive Q-K sharpening, and value-manifold refinement) scale smoothly with dimensionality.

Connection to large pretrained models. Our geometric diagnostics (key orthogonality, score-gradient structure, value manifolds) are testable predictions for frontier LLMs. Whether similar Bayesian manifolds arise in large models trained on natural text remains an open question. Applying these tools directly to pretrained transformer layers is a natural next step and may reveal how approximate Bayesian structure manifests in more complex settings.

Architectural generality. The experiments here use standard transformers. It remains unclear whether alternative architectures (state-space models, deep MLPs with more sophisticated gating, or hybrid recurrent-attention systems) can form comparable Bayesian manifolds. Wind-tunnel evaluations could provide a principled benchmark for comparing architectures in terms of inference fidelity rather than perplexity alone.

Training dynamics and phase transitions. A notable empirical phenomenon is the frame-precision dissociation: attention maps stabilize early while value manifolds continue to unfurl and refine posterior precision. A systematic study of these phases (how early the frame forms, how quickly precision improves, and how these dynamics depend on depth, width, and data complexity) could lead to a more general theory of representation formation in transformers.

Towards natural-language wind tunnels. Ultimately, we aim to understand how the exact Bayesian reasoning demonstrated here relates to the approximate reasoning observed in natural language tasks. Wind tunnels provide a lower bound: they establish that transformers can implement Bayesian updates when the problem is well specified. The next challenge is to design controlled tasks embedded within naturalistic language data that preserve analytic structure while introducing real-world ambiguity.

We introduced Bayesian wind tunnels (controlled experimental settings with analytic posteriors and combinatorially large hypothesis spaces) to test whether neural sequence models genuinely implement Bayesian inference rather than merely mimicking it. Across multiple inference problems, small transformers converge to the exact Bayesian posterior with sub-bit calibration error, even at sequence lengths well beyond those seen in training. The key insight is that Bayesian inference is not monolithic. We decompose it into three inference primitives (belief accumulation, belief transport, and random-access binding) and show that different architectures realize different subsets:

• Transformers realize all three primitives and succeed uniformly.

• Mamba realizes accumulation and transport, achieving state-of-the-art on HMM filtering, but struggles with random-access binding.

• LSTMs realize only accumulation of static sufficient statistics: they succeed on belief revision (where the statistic is fixed-dimensional) but fail when the statistic must evolve under dynamics or be indexed by content.

• MLPs realize none and fail uniformly.

Geometric diagnostics reveal how these primitives are implemented. Keys form an approximately orthogonal basis over hypotheses; queries progressively align with the feasible region of that basis; and value vectors organize along a low-dimensional manifold parameterized by posterior entropy. On HMM tracking, Mamba's representations organize into five discrete clusters, one per hidden state, showing that the model discovers the corner geometry of the belief simplex. Training sculpts this manifold: attention patterns stabilize early, while value representations continue refining posterior precision, a frame-precision dissociation predicted by concurrent gradient-dynamics analysis. The wind-tunnel regime is intentionally simplified, but it establishes a clear lower bound: if a model cannot implement Bayes in settings where the posterior is known and memorization is impossible, it cannot do so in natural language. Conversely, our results show that content-based routing is sufficient for exact Bayesian inference when the task demands only the primitives that architecture can realize. This provides a principled foundation for studying approximate reasoning in larger models and offers concrete, testable predictions (orthogonal hypothesis axes, progressive Q-K sharpening, and value-manifold structure) for analyzing pretrained LLMs. The primitives framework explains why transformers succeed: they furnish the architectural mechanisms for all three inference primitives. Attention provides random-access binding through query-key matching; content-based routing enables belief transport; and the residual stream accumulates evidence across positions. Understanding how these primitives scale to real-world language, and whether new architectures can realize them more efficiently, remains an important direction for future work.

