Grid-World Representations in Transformers Reflect Predictive Geometry

Authors: Sasha Brenner, Thomas R. Knösche, Nico Scherf

Sasha Brenner
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Leipzig University, Germany
brenner@cbs.mpg.de

Thomas R. Knösche*
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
knoesche@cbs.mpg.de

Nico Scherf*
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig, Germany
nscherf@cbs.mpg.de

ABSTRACT

Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. To understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process depends solely on a sufficient vector determined by the walker's position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world's geometry. We train decoder-only transformers on prefixes sampled from the exact distribution of these walks and compare their hidden activations to the analytically derived sufficient vectors. Across models and layers, the learned representations align strongly with the ground-truth predictive vectors and are often low-dimensional. This provides a concrete example in which world-model-like representations can be directly traced back to the predictive geometry of the data itself.
Although demonstrated in a simplified toy system, the analysis suggests that geometric representations supporting optimal prediction may provide a useful lens for studying how neural networks internalize grammatical and other structural constraints.

1 INTRODUCTION

Large language models (LLMs) have proven to be extremely powerful and scalable models, but they are trained on a famously simple objective function: probability estimation of the next token in a sequence (Radford et al., 2019; Brown et al., 2020). While being fundamentally probabilistic, these models must simultaneously learn strict syntax and subtle semantics, and their ability to do this scales with the number of parameters and the amount of training data (Kaplan et al., 2020). This leads to the question of how natural grammar is internally encoded in the models, and whether these representations can even be interpreted. Indeed, plenty of research has tried to answer these questions since the early days of transformer-based LLMs, providing interesting insights (López-Otal et al., 2025; Ferrando et al., 2024; Rogers et al., 2021). Notably, much of this prior work indicates that very similar grammar-like representations often emerge during learning for different models (Manning et al., 2020; Diego-Simón et al., 2024; Liu et al., 2019), which suggests that such structures may be implicit in the data distribution and that learning them could facilitate optimization. While it seems obvious that grammatical rules are implicit in any language dataset, placing that statement on solid theoretical grounds could be more challenging. A quantitative framework of natural grammar is desirable in the study of neural network interpretability, because it could provide formal objects amenable to existing techniques, including those from representational analysis (Sucholutsky et al., 2023).

* These authors contributed equally.
Since frontier LLMs are based on neural network architectures trained on information-theoretic objectives, a reasonable theoretical foundation should be able to describe grammatical structures in terms of probability distributions and feature spaces. The foundations for such a framework indeed exist; a long tradition in natural language processing (NLP) and computational linguistics focuses on the distributional aspects of linguistic elements (Sahlgren, 2008; Boleda, 2020; Turney and Pantel, 2010), and it laid part of the conceptual groundwork for the Transformer architecture behind most frontier LLMs (Vaswani et al., 2017).

Related to the question of how neural networks represent grammar is the concept of world models. Auto-regressive foundation models have been shown to encode geometric information about the semantic content of the token sequences being parsed; in other words, they seem to model the world their inputs refer to (Gurnee and Tegmark, 2024; Karvonen, 2024; Ivanitskiy et al., 2023). This is consistent with proposals attributing LLMs' success in in-context learning to latent-structure inference (Xie et al., 2022), echoing the idea of world models in the field of reinforcement learning, where agents can benefit from compressed representations of the complex environments they interact with (Ha and Schmidhuber, 2018; Hafner et al., 2025; Hansen et al., 2024). The presumed reason is that only a small set of latent variables in these environments are actually relevant for optimal prediction and control, and identifying them greatly reduces costs. This hints at the close relationship between world modeling and prediction, which is the basic language-modeling objective.

Here, we use a deliberately minimal toy setting to explore how grammar-like constraints can be represented in sequence models through the lens of optimal prediction and computational mechanics, following a recent line of interpretability research (Shai et al., 2024; Piotrowski et al., 2025; Riechers et al., 2025b; Shai et al., 2026). This framework infers the computational properties of stochastic processes by keeping track of their causal states, sets of observation sequences which are exclusively defined by their predictions of the future (Crutchfield and Young, 1989; Shalizi and Crutchfield, 2001). Every stochastic process is therefore uniquely characterized by the topology and geometry of its causal-state network. Since LLMs are models of stochastic processes, linguistic structure may be reflected in topological and geometric regularities of the data distribution. If such regularities exist, neural networks trained for next-token prediction may learn internal representations that capture them, particularly when doing so improves predictive accuracy and robustness. Although instantiating this as a concrete hypothesis for natural language is an ambitious goal, we may study smaller toy settings as a start to collect insights, a strategy that has already been shown to be fruitful in this context (Piotrowski et al., 2025). We therefore focus on a set of toy regular grammars generated by the movements of a random walker on a square lattice with a single constraint: it must, after a fixed number of time steps, arrive at some predetermined location on the grid. While vastly simpler than a full-fledged natural language, this grid walker has the advantage of providing a highly interpretable ground truth with analytic solutions, which we believe is key at this stage. Additionally, the random generation of sequences in these grammars is a non-stationary and non-ergodic process, like real language generation is. This is a relevant generalization with respect to prior systems studied under this framework. We computed the exact joint probability distributions of sequences for a few grid walkers, each with its own characteristic length and endpoint.
Decoder-only transformers were then trained on samples taken from the exact marginal distributions of prefix sequences. We consistently found low-dimensional representations that closely align with probability vectors of prediction, similarly to what was reported in Shai et al. (2024) for other types of hidden Markov models (HMMs). These representations emerged in most internal layers of the trained transformers, but their geometries varied slightly across different layers and layer types. While such representations are extracted in a principled way from predictive distributions, they suggestively resemble a grid-world model. Our simple toy model thus illustrates a concrete mechanism through which grammar and world models can directly emerge from the geometry of probability distributions. While this observation does not establish that the same mechanism applies to natural language, it provides a tractable example in which the relationship can be analyzed exactly.

The remainder of this paper is organized as follows. Section 2 reviews prior work on predictive and belief-state representations, grammatical structure in language models, and emergent world models. Section 3 introduces the constrained grid-walker process and presents the analytical quantities used for training and evaluation. Section 4 reports representational-alignment results across layers and model configurations, along with their interpretation in terms of predictive geometry. Section 5 discusses limitations and open challenges, and Section 6 concludes.

2 RELATED WORK

2.1 BELIEF-STATE REPRESENTATIONS IN NEURAL NETWORKS

We largely base ourselves on a novel line of work that places large-language-model (LLM) interpretability under the lens of optimal prediction (Shai et al., 2024; Piotrowski et al., 2025; Riechers et al., 2025b; Riechers et al., 2025a).
Specifically relevant to it is a theoretical framework known as computational mechanics, which studies dynamical and stochastic systems by identifying the minimal information about the past needed for optimal prediction of the future. In particular, when the latent generative mechanism is a homogeneous hidden Markov model (HMM), this minimally sufficient predictive information coincides with the belief state, i.e., the probability distribution over the hidden states of the HMM. In Shai et al. (2024), the authors trained transformers on sequences generated by a set of small but highly unpredictable edge-emitting HMMs. They found parsimonious linear mappings from the layer representations to the HMM's belief states within the hidden-state simplex. The authors' explanation is that belief-state representations might allow models to perform belief updates at each successive timestep of the input sequences. Indeed, a concrete mechanistic implementation of belief updating was proposed in the context of self-attention in a later paper (Piotrowski et al., 2025), and evidence for it was provided. Evidence for representational alignment with belief states has also been reported for different variants of RNN architectures (Riechers et al., 2025b). Furthermore, a more general set of latent generative models was tested in Riechers et al. (2025b): generalized hidden Markov models (GHMMs). These include HMMs, but also other non-Markovian systems that nevertheless still enjoy a Markov-style update rule over some set of generalized belief states, which do not necessarily correspond to probability distributions over states (but are related to them). In all studied examples, however, the processes were stationary and ergodic.
Since we are interested in formal and natural languages, which are generally neither ergodic nor stationary, studying predictive-state representations on such systems in a simple case like our grid walker is an important first step in this direction. Moreover, most belief states analyzed by the authors were restricted to a flat manifold (although it often featured fractal structure within). In contrast, the grid walker's predictive states are contained in a curved manifold within the probability simplex. There is, in addition, a formal distinction to be made between predictive and belief states, which is briefly discussed in Section 4.4.

2.2 GRAMMAR AND STRUCTURE IN LARGE LANGUAGE MODELS

There is extensive work analyzing how grammatical structure is reflected in neural language models trained on natural text (Manning et al., 2020; López-Otal et al., 2025). A common approach is to test whether syntactic information is recoverable from internal activations, using probing-style methods and geometric diagnostics (Peters et al., 2018; Liu et al., 2019; Tenney et al., 2019; Hewitt and Manning, 2019; Wu et al., 2020; Coenen et al., 2019; Diego-Simón et al., 2024). Other work inspects model components more directly by studying attention patterns and head specialization (Clark et al., 2019; Htut et al., 2019; Vig and Belinkov, 2019; Voita et al., 2019). To move beyond correlational evidence, interventionist and causal analyses attempt to test whether hypothesized syntactic variables are actually used, and to localize intermediate computations (Tucker et al., 2021; Finlayson et al., 2021; Lepori et al., 2024). While this literature provides many useful techniques, the combination of massive models and the complexity of natural language makes it difficult to isolate mechanisms cleanly.
Our goal is therefore not to compete on linguistic coverage, but to study a small but controlled setting where the relevant predictive structure is explicit, allowing a sharp test of the link between optimal prediction, representation geometry, and grammatical constraints.

2.3 EMERGENT WORLD MODELS

Recent literature asks whether next-token predictors learn internal representations that function as world models. Early evidence in a controlled setting came from a GPT model trained on Othello move sequences, which developed an emergent internal representation of the board state that could be used to control model outputs via interventions (Li et al., 2023). Similar findings were subsequently reported for chess-playing language models (Karvonen, 2024). In large-scale LLMs, representations have been shown to encode linear structure for space and time (Gurnee and Tegmark, 2024). Complementary evidence comes from maze-solving tasks, where transformers develop structured internal representations of topology and paths (Ivanitskiy et al., 2023). Across all these settings, however, establishing why the next-token prediction objective should give rise to world-model geometry has proven elusive: the complexity of the tasks makes it difficult to trace learned representations back analytically to the statistical structure of the data.

3 METHODS

We work in a setting where the ground truth is available exactly: constrained random walkers on a 2-dimensional square lattice; walkers for short. For each walker configuration, we compute the full next-step probability distribution analytically, train decoder-only transformers from scratch on prefix sequence samples drawn from these distributions with a next-token cross-entropy loss, and then evaluate the learned hidden states by fitting linear probes to the analytically derived sufficient predictive vectors across layers.
Most of the code and hyperparameter choices are based on the public repository used in Riechers et al. (2025b). Below is a detailed explanation of the methods.

3.1 THE GRID WALKER

We use the grid walker as a deliberately minimal toy model to assess whether and how simple grammatical rules can be recovered from the internal representations of neural networks and interpreted as vectors for optimal prediction. In this subsection, we show how the walker's predictive structure can be derived analytically and summarized by low-dimensional sufficient statistics.

Grid walkers are stochastic systems that move on an infinite 2D square lattice isomorphic to Z². At each discrete timestep t, the walker can move either left (L), right (R), down (D), or up (U). Each of these actions corresponds to a displacement vector in the set

E = {Δ_L, Δ_R, Δ_D, Δ_U},   (1)

where Δ_L = (−1, 0), Δ_R = (1, 0), Δ_D = (0, −1), and Δ_U = (0, 1).

[Figure 1: four panels of 2D scatter plots with axes PC1/PC2, points colored by path length (1 to 7). (A) Looper ground-truth predictive vectors; (B) looper layer activations; (C) shifted-endpoint ground truth; (D) shifted-endpoint layer activations.] Figure 1: Transformer layer representations closely resemble the ground-truth predictive vectors of Equation 4 (as computed in Appendix A.3). The left column corresponds to the ground-truth principal components (PCs), and the right column shows the PCs of the last layer norm representation, after Procrustes alignment to the ground-truth vectors. Each point represents a distinct sequence of movements on the square lattice, colored by the number of steps (path length). These representations could be interpreted as grid-world models, and they simply correspond to minimal and sufficient vectors for prediction.

More specifically, we study walkers with a single constraint: they must reach some endpoint p after a fixed number of timesteps T.
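In code, the walker's alphabet and constraint are tiny. A minimal sketch of the update rule and the membership check, our own illustration rather than the authors' released code (function names are ours):

```python
# Displacement vectors for the four moves, as in Equation 1.
MOVES = {"L": (-1, 0), "R": (1, 0), "D": (0, -1), "U": (0, 1)}

def is_valid(path, p, T):
    """A string over {L, R, D, U} satisfies the walker's constraint iff it
    has length T and the summed displacements land exactly on endpoint p."""
    if len(path) != T:
        return False
    x, y = 0, 0
    for c in path:
        dx, dy = MOVES[c]
        x, y = x + dx, y + dy          # position update x_{t+1} = x_t + Δ_t
    return (x, y) == p
```

For instance, `is_valid("RRLL", (0, 0), 4)` holds, while `"RRLU"` of the same length ends at (1, 1) and is rejected.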
Thus, each walker is uniquely and fully characterized by the number T and the position vector p; we refer to paths of length T that reach p as valid paths. The finite set of all valid paths constitutes a regular language with vocabulary {L, R, D, U}. Its rules can be algebraically expressed in a simple way: keep a position vector x_t := x(s_t) = (x, y) associated to s_t and perform the update x_{t+1} = x_t + Δ_t at each timestep t, where c_t ∈ {L, R, D, U} and Δ_t := Δ(c_t) ∈ E is a displacement vector from Equation 1; the string s_T ∈ {L, R, D, U}^T is then accepted if and only if x_T = p. Note that x_t = Σ_{t′=1}^{t} Δ_{t′}.

In order to make these systems stochastic, we define the symbol random variables C_t and the strings S_t = Σ_{t′=1}^{t} C_{t′} ∈ {L, R, D, U}^t, where the sum of symbols or strings indicates concatenation. Furthermore, we only consider walkers for which, given the endpoint, all paths of length T leading to it are equally likely. Symbolically, S_T ~ Unif{s_T : x(s_T) = p}. Partial walks of length t < T, by contrast, are not uniformly distributed: their probability is proportional to the number of valid completions of the prefix, a quantity formally defined and approximated in Appendix A.1. The expression for the next-step probability is particularly relevant considering the predictive nature of the neural network models under scrutiny (see Appendix A.2 for its derivation),

Pr(c_{t+1} | s_t; p) ≈ exp(−v_t · Δ_{t+1}) / Σ_{Δ′_{t+1} ∈ E} exp(−v_t · Δ′_{t+1}),   (2)

v_t := q( (x_t − p) / (T − t − 1) ),   (3)

where q is a non-linear, real-analytic vector function (see Appendix A.3). Equation 2 is the Softmax function applied to the logit vector −D v_t, where D is a 4 × 2 matrix with the elements of E as rows. This expression is valid for t ≪ T. Analyticity of q entails that the logit vectors −D v_t lie on some curved 2-dimensional manifold.
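Because the probability of a prefix is proportional to its number of valid completions, the exact next-step distribution that Equation 2 approximates can be computed by direct path counting. A minimal sketch with memoized counting, our own illustration (names are ours, not the paper's code):

```python
from functools import lru_cache

MOVES = {"L": (-1, 0), "R": (1, 0), "D": (0, -1), "U": (0, 1)}

@lru_cache(maxsize=None)
def n_paths(dx, dy, n):
    """Number of length-n lattice walks with net displacement (dx, dy)."""
    if n == 0:
        return 1 if (dx, dy) == (0, 0) else 0
    if abs(dx) + abs(dy) > n:          # target unreachable in n steps
        return 0
    return sum(n_paths(dx - mx, dy - my, n - 1) for mx, my in MOVES.values())

def next_step_probs(x, p, t, T):
    """Exact Pr(c_{t+1} | position x at time t; endpoint p, horizon T):
    each move's probability is proportional to the number of valid
    completions of the resulting prefix."""
    counts = {c: n_paths(p[0] - x[0] - mx, p[1] - x[1] - my, T - t - 1)
              for c, (mx, my) in MOVES.items()}
    z = sum(counts.values())
    return {c: k / z for c, k in counts.items()}
```

For a looper sitting at the origin with two steps left, all four moves have exactly one completion each, so the distribution is uniform; with the endpoint exactly one Manhattan distance per remaining step away, the distribution degenerates onto the single feasible move.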
In the diffusive regime ∥x_t − p∥ ≪ T − t, q can be linearized and yields

Pr(c_{t+1} | s_t; p) |_diffusive ≈ exp(−2 (x_t − p)/(T − t − 1) · Δ_{t+1}) / Σ_{Δ′_{t+1} ∈ E} exp(−2 (x_t − p)/(T − t − 1) · Δ′_{t+1}),   (4)

where the logit vectors now lie on a plane. Equation 2 tells us that knowledge of v_t, a single 2D state vector given by Equation 3, is sufficient to make any future predictions of the system. For this reason, we refer to v_t as the sufficient vector. In practice, we compute it by inverting Equation 2, as explained in Appendix A.3. In Section 4, we show evidence that Transformers tend to autonomously find this representation during training.

3.2 DATA GENERATION

In practice, the neural networks that we use never sample full length-T sequences; rather, we train the models on length-τ prefixes (with fixed τ = 8) drawn from the marginal distribution over prefixes induced by the full conditioned process (Equation 5 in Appendix A.1). In order to compute the full distribution, we use a dynamic programming algorithm (more details in Appendix A.1). We trained six identical transformer neural networks, each on one of the six distinct grid walkers listed in Table 1. Three time horizons T were chosen for walkers looping back to the origin, to test three different regimes of τ/T: a small, medium, and big loop relative to the prefix length; these correspond to walkers 1, 2, and 3 in Table 1, and we will refer to them as loopers. Three additional walkers with shifted endpoint p = (4, 0) were set up for the same three values of T, corresponding to walkers 4, 5, and 6.

3.3 SAMPLING, TRAINING AND VALIDATION

Samples for training are drawn with replacement from the marginal distribution in batches.
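One equivalent way to draw such prefixes, without tabulating the marginal explicitly, is ancestral sampling with the exact next-step conditionals; the paper instead tabulates the marginal by dynamic programming, so the sketch below is our own illustration of the same distribution (names are ours):

```python
import random
from functools import lru_cache

MOVES = {"L": (-1, 0), "R": (1, 0), "D": (0, -1), "U": (0, 1)}

@lru_cache(maxsize=None)
def n_paths(dx, dy, n):
    # Number of length-n lattice walks with net displacement (dx, dy).
    if n == 0:
        return 1 if (dx, dy) == (0, 0) else 0
    if abs(dx) + abs(dy) > n:
        return 0
    return sum(n_paths(dx - mx, dy - my, n - 1) for mx, my in MOVES.values())

def sample_prefix(p, T, tau, rng=random.Random(0)):
    """Draw a length-tau prefix from the marginal induced by the
    conditioned walk: at each step, a move's probability is proportional
    to the number of valid completions after taking it."""
    x, prefix = (0, 0), []
    for t in range(tau):
        symbols = list(MOVES)
        weights = [n_paths(p[0] - x[0] - MOVES[c][0],
                           p[1] - x[1] - MOVES[c][1], T - t - 1)
                   for c in symbols]
        c = rng.choices(symbols, weights=weights)[0]
        x = (x[0] + MOVES[c][0], x[1] + MOVES[c][1])
        prefix.append(c)
    return "".join(prefix)
```

Every sampled prefix is guaranteed extendable to a valid path, since moves with zero completions receive zero weight.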
As prediction objective, we use a next-token cross-entropy loss over all positions of the input sequences, averaged over the minibatch, with the first loss term corresponding to predicting s_2 given s_1 (no beginning-of-sequence token). All models are optimized using the Adam algorithm during training. We refer to Appendix A.6 for a comprehensive list of the hyperparameters used.

Validation uses the exact expected cross-entropy loss under the ground-truth distribution. The per-token loss is calculated as in training, but instead of the sample mean of a minibatch, we compute a weighted average over the entire set of possible prefixes at every validation round, where each partial sequence's per-token loss is weighted by its exact ground-truth probability. Importantly, this validation paradigm does not measure generalization, but controls for the global optimization problem. Final validation losses of the trained models are shown in Table 1 with respect to a baseline theoretical loss given by the ground-truth conditional entropy of an observation given a history.

Table 1: Walkers used for training. The last column shows the mean excess validation loss with respect to the theoretical lower bound after 20 000 epochs. The loss lower bound for the n-th token is its conditional entropy given all previous n − 1 observations according to the exact probability distribution of the system. The excess loss shown in this table is an average over all sequences of length between 2 and 8, each with its own baseline units. All values are close to zero, confirming that every model has learned to predict the sequence distribution to near-theoretical-optimal accuracy.

INDEX   ENDPOINT p   HORIZON T   MEAN VAL LOSS
1       (0, 0)       20          4.5 · 10⁻⁶
2       (0, 0)       200         2.6 · 10⁻⁷
3       (0, 0)       1000        5.5 · 10⁻⁵
4       (4, 0)       20          5.9 · 10⁻⁶
5       (4, 0)       200         < 10⁻⁸
6       (4, 0)       1000        < 10⁻⁸

3.4 REPRESENTATIONAL ANALYSIS

We wish to evaluate the degree of alignment between the transformer layer representations and the sufficient vectors given by Equation 3. In order to do this, we rely on two separate quantities: the coefficient of determination (R²) of a least-squares affine map between the two representations, and the linear centered kernel alignment (lCKA), a widely used representational similarity metric (Kornblith et al., 2019).

We compute an affine map from activations to ground-truth targets. This is obtained after a weighted least-squares minimization. Explicitly, given all prefixes s_t of length t ≤ τ, their respective layer-ℓ activations a_ℓ(s_t) ∈ R^{d_ℓ} and mean-centered log-probability vectors ℓ̄_t ∈ R⁴ (Appendix A.3),

Ŵ_ℓ, b̂_ℓ = argmin_{W_ℓ, b_ℓ} Σ_{t=1}^{τ} Σ_{s_t} [1 / n(s_t, t)] ∥ W_ℓ a_ℓ(s_t) + b_ℓ − ℓ̄_t ∥²,

where the matrix W_ℓ ∈ R^{4×d_ℓ}, together with the vector b_ℓ ∈ R⁴, are the parameters of the affine transformation, and n(s_t, t) designates the degeneracy of s_t, i.e., the total number of length-t sequences that are equivalent to s_t in the sense that they are also characterized by the triplet (x_t, t); we apply the weights w(s_t) = n(s_t, t)⁻¹ to more accurately reflect the global geometry of the representation. To obtain an unbiased measure of alignment, we evaluate using a leave-one-group-out cross-validated weighted R²: for each equivalence class (x_t, t), the affine map is re-fitted on all remaining classes and applied to predict the held-out class. The residuals are then pooled across all groups to yield a single R² estimate that reflects out-of-sample generalization.

The second similarity coefficient that we use is the lCKA. It essentially computes the linear correlation between the centered covariance matrices of both representations.
Explicitly,

lCKA = ∥ A_ℓᵀ L_t ∥²_F / ( ∥ L_tᵀ L_t ∥_F · ∥ A_ℓᵀ A_ℓ ∥_F ),

where A_ℓ is the matrix with the weighted-centered and scaled versions of the vectors a_ℓ(s_t) as rows, and each row represents a distinct prefix s_t; analogously, L_t is the matrix with the weighted-centered and scaled versions of the vectors ℓ̄_t as rows. The weights used for centering and scaling are the inverse degeneracies w(s_t) described above. The main difference between these two metrics, apart from the fact that held-out test sets are used in the R² computations, is that the R² coefficient ignores the neural network's feature-space variance altogether, whereas lCKA filters out directions of low variance in feature space (Kornblith et al., 2019).

Figure 2: Linear similarity metrics between layer activations and ground-truth predictive vectors. (A) and (D): Both R² and lCKA are high and tend to increase with layer depth, with LayerNorms generally exhibiting stronger similarity. Moreover, walkers with longer time horizons T have worse similarity scores given an endpoint, while loopers' scores are higher in comparison to shifted-endpoint walkers' ones for a given time horizon. In these plots, lines corresponding to loopers are solid, while those associated with shifted-endpoint walkers are dashed. (B-C): The R² value for the T = 20 looper is shown as a function of the training step in (B), and it is compared with the excess validation loss from its corresponding timestep in (C). (E-F): The lCKA value for the same T = 20 looper is also shown as a function of training step (E) and compared to the loss (F).

4 RESULTS AND DISCUSSION

We trained six identical transformers on six different grid walkers respectively, as shown in Table 1 and detailed in Section 3.2. In Section 3.1, we showed that knowledge of a 2-dimensional state vector v_t (Equation 3) is sufficient to make predictions.
We hypothesized that the network finds this representation of the input. Figure 1 shows the first two principal components of both the ground-truth sufficient vectors and the final LayerNorm activations for walkers 1 and 4 of Table 1, where the resemblance becomes evident.

4.1 NETWORK ACTIVATIONS MATCH PREDICTIVE GEOMETRIES

We compare layer activations and ground-truth predictive vectors using two metrics: the coefficient of determination (R²) of the linear mapping and the linear centered kernel alignment (lCKA), both described in Section 3.4. They are shown in Figure 2 for every walker and for each layer of the residual stream and pre-attention LayerNorms. The linear agreement between most layer representations and the ground-truth predictive vector v_t was high. LayerNorm representations, however, showed a visibly higher lCKA with respect to the ground truth, whereas residual-stream vectors had poorer linear readability. R² did not vary as much as the lCKA, suggesting that the former overestimates alignment.

Although one might reasonably anticipate a strong match between pre-readout activations and the vectors v_t, there was no guarantee that such alignment would also occur for the other LayerNorm representations. To be precise, final LayerNorm alignment makes sense because the transformer output logits are fitted to the ground-truth logits during training; but these two are just affine-transformed versions of, respectively, the final LayerNorm activations and the predictive vectors v_t, and under this light their fit is foreseeable. In contrast, early and intermediate activations are less obviously constrained by the output layer; they are not strictly required to represent this predictive geometry linearly, but they seem to do so, at least to some extent.
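For concreteness, both similarity metrics of Section 3.4 reduce to a few lines of linear algebra. The sketch below is our own illustration of the described procedure, with a weighted affine probe, a pooled leave-one-group-out R², and the weighted lCKA (function names are ours):

```python
import numpy as np

def fit_affine(A, Y, w):
    """Weighted least squares for Y ≈ A @ W.T + b (bias absorbed as a column).
    A: (n, d) activations; Y: (n, k) targets; w: (n,) sample weights."""
    X = np.hstack([A, np.ones((len(A), 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(sw * X, sw * Y, rcond=None)
    return coef                                  # shape (d + 1, k)

def logo_r2(A, Y, w, groups):
    """Leave-one-group-out cross-validated weighted R^2, residuals pooled."""
    mu = np.average(Y, axis=0, weights=w)
    resid = total = 0.0
    for g in np.unique(groups):
        tr, te = groups != g, groups == g
        coef = fit_affine(A[tr], Y[tr], w[tr])   # re-fit without group g
        pred = np.hstack([A[te], np.ones((te.sum(), 1))]) @ coef
        resid += np.sum(w[te][:, None] * (Y[te] - pred) ** 2)
        total += np.sum(w[te][:, None] * (Y[te] - mu) ** 2)
    return 1.0 - resid / total

def lcka(A, L, w):
    """Linear CKA between weighted-centered, scaled representations."""
    w = np.asarray(w, float) / np.sum(w)
    sw = np.sqrt(w)[:, None]
    Ac = (A - np.average(A, axis=0, weights=w)) * sw
    Lc = (L - np.average(L, axis=0, weights=w)) * sw
    num = np.linalg.norm(Ac.T @ Lc, "fro") ** 2
    den = np.linalg.norm(Lc.T @ Lc, "fro") * np.linalg.norm(Ac.T @ Ac, "fro")
    return num / den
```

As a sanity check, an exactly affine relationship between activations and targets yields a cross-validated R² of 1, and lCKA is invariant under orthogonal transformations of either representation.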
Figure 3: For the T = 20 looper, the number of principal components required to explain 99% of the variance converges to 2 for the final LayerNorm's activations, but it rebounds for all previous hidden layers after around 10⁴ training steps.

To analyze the intrinsic dimensionality of the representations, we perform principal component analysis (PCA) on the layer activations at each training checkpoint, treating each prefix in the dataset as a sample in the analysis. As in the computation of the alignment measures, we apply the same inverse-degeneracy weights to the samples (see Section 3.4). In Figure 3, we show the number of principal components required to explain 99% of the variance of each layer across training for the T = 20 looper. The final LayerNorm converges to a 2-dimensional representation (Figure 1B), even though the activation space as a whole is 128-dimensional. While Equation 4 does seem to suggest that logits at the final layer should be constrained to a 2-dimensional space, the Softmax function actually has a gauge freedom: any arbitrary function of x_t can be added to the logits without changing the next-step distribution. In other words, the network did not strictly need to find such a flat representation. It is not rare, however, for networks to find low-dimensional representations, even nonlinear ones (Ansuini et al., 2019). Curiously, the number of dimensions spanned by the rest of the layer representations seems to converge first, but then rebounds slightly halfway through training.

4.2 PREDICTIVE GEOMETRY AS A WORLD MODEL

In Section 4.1, we argued that the alignment between the last LayerNorm activations and the predictive vectors v_t defined by Equation 3 is rather unsurprising from a purely mathematical perspective. This seemingly trivial result, however, might hint at a valuable insight.
One may imagine the scenario of encountering the internal representations of Figures 1B and 1D without any knowledge of the predictive geometry, i.e., Figures 1A and 1C. At first glance, the emergent representations of the model would seem to replicate the known structure of the grid world that the data is generated from. Since world models can improve prediction and control in many different machine learning settings, this alone could explain away the benefits of such representations. In more complex contexts, finding such a world representation sometimes constitutes a result of its own. The simple grid walkers we study illustrate a way in which world models can be directly extracted from the predictive structure of the system. Indeed, the vectors defined by Equation 3 can be thought of as world locations, corresponding to the relative position of the walker on the grid, divided by the time left. In fact, these vectors condense the spatial and temporal aspects of the world in a way that is sufficient for prediction. Thus, these world-model-like representations can be putatively traced back to a specific interaction between data, architecture, and training: the minimal energy-based formulation of the data's predictive geometry relies on position vectors, and the network itself processes vectors optimally with respect to a prediction objective. In other words, the world model is already implicit in the predictive structure of the system; the network simply has the right architecture and training objective to capture it. The grid walker thus serves as a proof of concept: a setting where the emergence of world-model representations from predictive geometry can be verified analytically.

4.3 PREDICTIVE GEOMETRY AS GRAMMAR

The step emissions of the grid walkers produce length-T sequences of a regular language (Section 3.1).
The formulation of such a formal language, however, is not enough to describe the walker: its stochastic nature provides an additional important ingredient. Indeed, the uniform distribution over valid paths carries information not about validity but about occurrence; one may venture here a loose analogy with semantics, where the difference between sentences such as "I'm feeling blue" and "I'm feeling green" is not about syntactic correctness, but about context-dependent usage frequency. Notably, contemporary frontier LLMs must model both deterministic/formal/syntactic rules and stochastic/informal/semantic ones based on a training objective that is solely probabilistic.

In our simple lattice model, a single parameter controls the degree of stochasticity of the next step: the remaining time T − t − 1. It plays an explicit role as the temperature in the Gibbs measure over the next steps, Equation 4. As t → T − 1, the distribution becomes deterministic.¹ Moreover, since the logits attain their maximum when the displacement Δ ∈ {Δ_L, Δ_R, Δ_D, Δ_U} is antiparallel to the vector x_t − p, the next-step distribution faithfully expresses the formal grammatical rule of the language: arriving at p after T timesteps. Hence, the sufficiently predictive vectors of Equation 3 learned by the transformer encode the temperature, and therefore the stochasticity, in their length; the angle of these vectors, in turn, is crucial to model the formal rule. This geometric structure is thus sufficient to describe both the stochastic and the formal aspects of the simple lattice language under consideration. This perspective again appears trivial in this case, but we believe that similar geometric structures might underlie more complex languages as well; it might be a matter of understanding how to decompose them into simpler parts.

4.4 BELIEF STATES OF THE GRID WALKER

Whereas the previous work discussed in Section 2.1 focused on ergodic HMMs and their belief states, here we focus on predictive states instead. The two are, however, related. It can be shown that belief states can always be linearly mapped to predictive states, but not the other way around (see Riechers et al. (2025b) and this paper's Appendix A.4). This means that the dimensionality of the belief-state space is an upper bound on that of the predictive states, even if one takes infinite future horizons into account. Our time-inhomogeneous HMM, however, has a somewhat uninteresting belief-state space: the transitions between states are deterministic given observations (more about this in Section 5). This gives a very loose dimension upper bound, as we know that the intrinsic dimensionality of the predictive states is two.² Thus, our example shows how exploring the geometric structure of predictive states, rather than belief states, can be more informative in some settings.

¹ We ignore the regime assumptions of the approximation in this analysis because it holds on a qualitative level.
² The upper bound of the 4-dimensional next-step predictive state we use is already 3, but the 2-dimensional manifold structure of predictive states holds even for infinite time horizons.

5 LIMITATIONS AND FUTURE CHALLENGES

Two important caveats should be immediately recognized. Firstly, there is the issue of scalability and generalization: our toy system is deliberately simple, which makes this kind of analytical examination possible, whereas in most systems such an analysis does not appear to be feasible. Despite these challenges, it is known that neural networks often find these world-like representations without direct supervision, even in more complex settings (Gurnee and Tegmark (2024); Karvonen (2024); Ivanitskiy et al. (2023)).
In order to trace these representations back to the system's generative or predictive structure, a decomposition method is necessary to break complex structures into tractable ones. There are preliminary findings suggesting that the theory of Markov processes, optimal prediction and computational mechanics might provide the right theoretical framework for this task (Shai et al. (2026)).

Secondly, an important distinction needs to be made between generation and prediction, which are confounded in our specific setting. In general, generation relies on state-dependent computation, but this state might be hidden from an external observer that wishes to make observable predictions. However, since the initial and final states are always the same for each trained network, and since the underlying model has deterministic state transitions given a walk step (L, R, D or U), there is no relevant information about the states that is hidden from the observer (the transformer in our case). Therefore, an optimal predictive model coincides with the most parsimonious generative model. Studying an example in which states are fundamentally hidden, such as in Shai et al. (2024), might provide new insights.

Moreover, the assumption of finite HMMs as generative models might be too restrictive if the goal is to apply these frameworks to languages higher in the Chomsky hierarchy, such as context-free grammars (CFGs), let alone natural language itself. Predictive states, unlike belief states, are not restricted to HMMs, so it may be useful to better understand the relationship between these and other objects such as probabilistic CFGs. We postpone a detailed investigation of this significant issue to future work.

Another important aspect that warrants further investigation is the absence of an explicit assessment of generalization.
In particular, it would be valuable to examine how representational alignment with predictive vectors relates to, and potentially predicts, the model's ability to generalize beyond the training distribution.

6 CONCLUSIONS

After training transformer neural networks to predict the movements of the constrained 2d-lattice walkers described in Section 3.1, we found that their internal representations resemble world models for these walkers, and that they strongly match the 2-dimensional sufficient vectors for prediction, i.e., the position on the grid divided by the time left. This exhibits a very simple scenario in which the relationship between the world model and the stochastic structure of the system can be traced back analytically. It also supports a hypothesis outlined in Shai et al. (2024), which posits that artificial neural networks trained on next-token prediction internally represent belief states in order to perform Bayesian belief updates in context.

The full sequences of steps that these walkers produce are strings in a simple finite regular language, so this system also serves as an elementary example of how grammatical rules can be encoded by a neural network under the same principle of optimal prediction: an alternative but complementary perspective to that of world models. The network not only encodes the distance to the enforced endpoint on the grid, thus keeping track of syntax, but it also stores information about the remaining time, thereby keeping track of stochastic freedom, which decreases as time runs out and the walker must approach the goal.

These results encourage a deeper investigation into more complex languages, possibly starting with context-free grammars.
Ultimately, the goal is to investigate how universal the alignment is between the geometry of internal representations in predictive neural networks and predictive/belief states, with the hope that we can someday rely on this putative natural alignment for interpretability and steering.

REFERENCES

A. Ansuini, A. Laio, J. H. Macke, and D. Zoccolan. Intrinsic dimension of data representations in deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/cfcce0621b49c983991ead4c3d4d3b6b-Paper.pdf.

G. Boleda. Distributional Semantics and Linguistic Theory. Annual Review of Linguistics, 6(1):213–234, 2020. ISSN 2333-9683. doi: 10.1146/annurev-linguistics-011619-030303.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT's attention. In T. Linzen, G. Chrupała, Y. Belinkov, and D. Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, Aug. 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4828. URL https://aclanthology.org/W19-4828/.

A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, and M. Wattenberg. Visualizing and Measuring the Geometry of BERT, 2019.

J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, Jul 1989. doi: 10.1103/PhysRevLett.63.105. URL https://link.aps.org/doi/10.1103/PhysRevLett.63.105.

P. Diego-Simón, S. D'Ascoli, E. Chemla, Y. Lakretz, and J.-R. King. A polar coordinate system represents syntax in large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 105375–105396. Curran Associates, Inc., 2024. doi: 10.52202/079017-3344. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/be36e50757bf9cd280aa74f89a7d1c23-Paper-Conference.pdf.

J. Ferrando, G. Sarti, A. Bisazza, and M. R. Costa-jussà. A primer on the inner workings of transformer-based language models. CoRR, abs/2405.00208, 2024. URL https://doi.org/10.48550/arXiv.2405.00208.

M. Finlayson, A. Mueller, S. Gehrmann, S. Shieber, T. Linzen, and Y. Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1828–1843, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.144. URL https://aclanthology.org/2021.acl-long.144/.

W. Gurnee and M. Tegmark. Language models represent space and time. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 2483–2503, 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/file/0a6059857ae5c82ea9726ee9282a7145-Paper-Conference.pdf.

D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf.

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. Nature, 640(8059):647–653, 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2.

N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. doi: 10.48550/arXiv.2310.16828.

J. Hewitt and C. D. Manning. A structural probe for finding syntax in word representations. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/N19-1419/.

P. M. Htut, J. Phang, S. Bordia, and S. R. Bowman. Do Attention Heads in BERT track syntactic dependencies?, 2019.

M. Ivanitskiy, A. F. Spies, T. Räuker, G. Corlouer, C. Mathwin, L. Quirke, C. Rager, R. Shah, D. Valentine, C. D. Behn, K. Inoue, and S. W. Fung. Linearly structured world representations in maze-solving transformers. In UniReps: the First Workshop on Unifying Representations in Neural Models, 2023. URL https://openreview.net/forum?id=pZakRK1QHU.

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling Laws for Neural Language Models, 2020.

A. Karvonen. Emergent world models and latent variable estimation in chess-playing language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=9n2s8l7XoQ.

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of Neural Network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, 2019. doi: 10.48550/ARXIV.1905.00414.

M. A. Lepori, T. Serre, and E. Pavlick. Uncovering Intermediate Variables in Transformers using Circuit Probing. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=gUNeyiLNxr.

K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations, 2023.

N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. doi: 10.18653/v1/N19-1112.

M. López-Otal, J. Gracia, J. Bernad, C. Bobed, L. Pitarch-Ballesteros, and E. Anglés-Herrero. Linguistic Interpretability of Transformer-based Language Models: A systematic review, 2025.

C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 2020. doi: 10.1073/pnas.1907367117.

N. Nanda and J. Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

M. E. Peters, M. Neumann, L. Zettlemoyer, and W.-t. Yih. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. doi: 10.18653/v1/D18-1179.

M. Piotrowski, P. M. Riechers, D. Filan, and A. S. Shai. Constrained belief updates explain geometric structures in transformer representations. In Forty-second International Conference on Machine Learning, 2025. doi: 10.48550/ARXIV.2502.01954.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8):9, 2019.

P. M. Riechers, H. R. Bigelow, E. A. Alt, and A. Shai. Next-token pretraining implies in-context learning, 2025a.

P. M. Riechers, T. J. Elliott, and A. S. Shai. Neural networks leverage nominally quantum and post-quantum representations, 2025b.

A. Rogers, O. Kovaleva, and A. Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00349.

M. Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–53, 2008.

A. Shai, L. Amdahl-Culleton, C. L. Christensen, H. R. Bigelow, F. E. Rosas, A. B. Boyd, E. A. Alt, K. J. Ray, and P. M. Riechers. Transformers learn factored representations, 2026.

A. S. Shai, S. E. Marzen, L. Teixeira, A. G. Oldenziel, and P. M. Riechers. Transformers represent belief state geometry in their residual stream. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. doi: 10.48550/ARXIV.2405.15943. URL https://openreview.net/forum?id=YIB7REL8UC.

C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104(3):817–879, 2001. ISSN 1572-9613. doi: 10.1023/A:1010388907793.

I. Sucholutsky, L. Muttenthaler, A. Weller, A. Peng, A. Bobu, B. Kim, B. C. Love, C. J. Cueva, E. Grant, I. Groen, J. Achterberg, J. B. Tenenbaum, K. M. Collins, K. L. Hermann, K. Oktar, K. Greff, M. N. Hebart, N. Cloos, N. Kriegeskorte, N. Jacoby, Q. Zhang, R. Marjieh, R. Geirhos, S. Chen, S. Kornblith, S. Rane, T. Konkle, T. P. O'Connell, T. Unterthiner, A. K. Lampinen, K.-R. Müller, M. Toneva, and T. L. Griffiths. Getting aligned on representational alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=Hiq7lUh4Yn.

I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. doi: 10.18653/v1/P19-1452.

M. Tucker, P. Qian, and R. Levy. What if this modified that? Syntactic interventions with counterfactual embeddings. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. doi: 10.18653/v1/2021.findings-acl.76.

P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010. ISSN 1076-9757. doi: 10.1613/jair.2934.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

J. Vig and Y. Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019. doi: 10.18653/v1/W19-4808.

E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. doi: 10.18653/v1/P19-1580.

Z. Wu, Y. Chen, B. Kao, and Q. Liu. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.383.

S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An Explanation of In-context Learning as Implicit Bayesian Inference, 2022.

A APPENDIX

A.1 COMPUTATION OF PROBABILITY DISTRIBUTIONS

All probability distributions used for training and validation are computed exactly via a dynamic-programming recursion, without any sampling or approximation.

Marginal distribution of prefixes. We consider walkers for which all complete paths of length T ending at p are equally likely: S_T ∼ Unif{s_T : x(s_T) = p}. The marginal distribution of a length-t prefix s_t is obtained by summing over all valid completions,

\[
\Pr(S_t = s_t \mid p) = \sum_{s_{t+1 \to T}} \Pr(S_T = s_t + s_{t+1 \to T} \mid p)\, \mathbf{1}_{x(s_T) = p}, \tag{5}
\]

where s_{t_1 \to t_2} denotes a substring from position t_1 to t_2 and \mathbf{1} is the indicator function. Since Pr(S_T | p) is uniform, this simplifies to a ratio of path counts,

\[
\Pr(S_t = s_t \mid p) = \frac{N_{p}(x(s_t))}{N_{p,T}}, \tag{6}
\]

where N_p(x(s_t)) counts valid paths with prefix s_t ending at p, and N_{p,T} is the total count. Unlike the full-path distribution, this marginal is not uniform.

Green's function of the 2D lattice walk. Let G_n(r) denote the number of paths of exactly n steps on Z² starting at the origin and ending at displacement r = (r_x, r_y).
It satisfies the four-neighbor convolution recursion over E = {Δ_L, Δ_R, Δ_D, Δ_U} (Equation 1),

\[
G_n(r_x, r_y) = \sum_{\Delta \in \mathcal{E}} G_{n-1}(r_x - \Delta_x,\, r_y - \Delta_y), \tag{7}
\]

with the initial condition G_0(\mathbf{0}) = 1 and G_0(r) = 0 for r ≠ 0. Since only positions with the same parity as n are reachable in exactly n steps (G_n(r) = 0 whenever r_x + r_y \not\equiv n \pmod{2}), the nonzero entries at step n fit in an (n+1) × (n+1) triangular array.

Rotated coordinate system. The recursion is implemented in a rotated basis

\[
(i, j) = \left( \frac{r_x + r_y + n}{2},\; \frac{r_x - r_y + n}{2} \right),
\]

which maps the four-neighbor update to four corner additions on a square array and enforces the parity constraint as an integrality condition on (i, j). All arithmetic is performed in log-space for numerical stability.

Completion counts. The number of valid completions of a partial walk ending at position x_t after t steps, given endpoint p and total horizon T, equals

\[
N_{p}(x_t) = G_{T-t}(p - x_t).
\]

The total number of valid paths is N_{p,T} = G_T(p), and the prefix probability follows from Equation 6.

Next-step probabilities. The conditional next-step probability is obtained exactly as the ratio

\[
\Pr(c_{t+1} \mid s_t; p) = \frac{G_{T-t-1}(p - x_t - \Delta_{c_{t+1}})}{G_{T-t}(p - x_t)}, \tag{8}
\]

and is well-defined because G_{T-t}(p - x_t) = \sum_{\Delta \in \mathcal{E}} G_{T-t-1}(p - x_t - \Delta) by the recursion of Equation 7, so the four numerator terms automatically sum to the denominator. In log-space this ratio is a simple subtraction.

Precomputation. The recursion is run once forward from G_0 to G_T, storing every level. For each valid prefix s_t (indexed by its equivalence class (x_t, t)), Equation 8 gives an exact probability table over the four next symbols. The full joint distribution over length-τ prefixes is then assembled by multiplying consecutive next-step factors, and these tables are stored in memory for use as training batches and as validation targets.

A.2 APPROXIMATING PATH COUNT FOR GRID WALKER

We derive the Gaussian approximation to G_n(x) that underpins the closed-form expressions in the main text.

Fourier representation. At each time step, the walker takes one of the four displacements Δ ∈ E. The characteristic function (Fourier transform) of a single uniform step is the structure function

\[
\lambda(\mathbf{k}) = \sum_{\Delta \in \mathcal{E}} e^{i \mathbf{k} \cdot \Delta} = e^{i k_x} + e^{-i k_x} + e^{i k_y} + e^{-i k_y} = 2\cos k_x + 2\cos k_y, \tag{9}
\]

where k = (k_x, k_y) is the wave vector conjugate to the lattice position. Since the n steps of an unconstrained walk are independent and identically distributed, the characteristic function of the total displacement x_n = \sum_{t'=1}^{n} \Delta_{t'} is λ(k)^n (by the convolution theorem for sums of independent random variables). Inverting the Fourier transform recovers the number of walks ending at (x, y),

\[
G_n(x, y) = \frac{1}{(2\pi)^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} e^{i(k_x x + k_y y)} \left( 2\cos k_x + 2\cos k_y \right)^n dk_x\, dk_y, \tag{10}
\]

where the integration is over the first Brillouin zone [−π, π]² of the square lattice. Note that λ(0) = 4, so the Fourier transform evaluated at k = 0 gives \sum_{(x,y)} G_n(x, y) = 4^n, as expected (the total number of unrestricted n-step walks). This integral representation is the starting point for the saddle-point approximation below.

Saddle-point evaluation. For large n and modest ∥x∥, the integral is dominated by a saddle near k = 0. Expanding the log-characteristic function,

\[
\ln\left( 2\cos k_x + 2\cos k_y \right) = \ln 4 - \frac{k_x^2 + k_y^2}{4} + O\!\left( k_x^4 + k_y^4 \right),
\]

and substituting into Equation 10,

\[
G_n(x, y) \approx \frac{4^n}{(2\pi)^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i(k_x x + k_y y)}\, e^{-n(k_x^2 + k_y^2)/4}\, dk_x\, dk_y.
\]

Each factor is a standard Gaussian Fourier transform, \int e^{i k_x x} e^{-n k_x^2/4}\, dk_x = 2\sqrt{\pi/n}\, e^{-x^2/n}, giving

\[
G_n(x, y) \approx \frac{4^n}{\pi n} \exp\!\left( -\frac{x^2 + y^2}{n} \right). \tag{11}
\]

Since N_p(x_t) = G_{T-t}(p - x_t) (the walker has T − t steps remaining to reach p from x_t), substituting n → T − t and x → p − x_t yields

\[
N_{p}(x_t) \approx 4^{T-t} \left( \pi (T-t) \right)^{-1} \exp\!\left( -\|x_t - p\|^2 / (T-t) \right).
\]

Derivation of the next-step formula. Using Equation 11 in Equation 8 and writing r = x_t − p,

\[
\Pr(c_{t+1} \mid s_t; p) \propto G_{T-t-1}(r + \Delta_{c_{t+1}}) \propto e^{-\|r + \Delta_{c_{t+1}}\|^2 / (T-t-1)}.
\]

Expanding the square,

\[
-\frac{\|r + \Delta\|^2}{T-t-1} = -\frac{\|r\|^2}{T-t-1} + \frac{2(p - x_t)\cdot\Delta}{T-t-1} - \frac{1}{T-t-1},
\]

the terms ∥r∥² and 1 are constant with respect to c_{t+1} and cancel under normalization, so

\[
\Pr(c_{t+1} \mid s_t; p) \approx \frac{\exp\!\left( -\frac{2(x_t - p)}{T-t-1} \cdot \Delta_{c_{t+1}} \right)}{\sum_{\Delta' \in \mathcal{E}} \exp\!\left( -\frac{2(x_t - p)}{T-t-1} \cdot \Delta' \right)},
\]

which is Equation (6) of the main text. The approximation is valid whenever ∥x_t − p∥ ≪ T − t − 1 (the diffusive regime).

General (non-diffusive) saddle point. The preceding Gaussian approximation is valid only in the diffusive regime ∥x_t − p∥ ≪ T − t − 1. When the walker's displacement from the endpoint is comparable to the remaining horizon, the quadratic expansion of ln λ(k) around k = 0 is no longer adequate, and one must retain the full structure function λ(k) = 2 cos k_x + 2 cos k_y from Equation 9. Return to the exact Fourier integral of Equation 10 with n = T − t − 1 and displacement r_t = p − x_t − Δ_{c_{t+1}}:

\[
G_{T-t-1}(r_t) = \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} e^{-i \mathbf{k} \cdot r_t}\, \lambda(\mathbf{k})^{T-t-1}\, d^2\mathbf{k}.
\]

For large T − t − 1, the integrand is sharply peaked and the integral can be evaluated by the method of steepest descents (saddle-point approximation). The exponent to be extremized is

\[
\Phi(\mathbf{k}) := -i \mathbf{k} \cdot r_t + (T-t-1) \ln \lambda(\mathbf{k}).
\]

Setting ∇_k Φ = 0 yields the saddle-point condition

\[
\frac{r_t}{T-t-1} = -\,i\, \frac{\nabla_{\mathbf{k}} \lambda(\mathbf{k})}{\lambda(\mathbf{k})} \bigg|_{\mathbf{k} = \mathbf{k}^*}.
\]

Since ∇_k λ = (−2 sin k_x, −2 sin k_y), the right-hand side is purely imaginary when k is real, whereas the left-hand side is real.
The saddle therefore lies on the imaginary axis: we set k* = i q with q ∈ R². Substituting cos(i q_c) = cosh q_c and sin(i q_c) = i sinh q_c, the saddle-point equation becomes the transcendental system

\[
u_c = \frac{\sinh q_c}{\cosh q_x + \cosh q_y}, \quad c \in \{x, y\}, \qquad u_t := \frac{x_t - p}{T - t - 1}, \tag{12}
\]

where we used r_t = p − x_t, so that −r_t/(T − t − 1) = u_t. Note that the sign of u_t points from the endpoint towards the current position, encoding how far the walker still needs to travel relative to the time available.

At the saddle, the leading-order evaluation of G_{T−t−1} gives a pre-exponential factor (from the Hessian determinant) times exp[q · r_t + (T − t − 1) ln λ(i q)]. When forming the next-step probability ratio G_{T−t−1}(p − x_t − Δ_{c_{t+1}}) / G_{T−t}(p − x_t), the pre-exponential factors and all terms independent of c_{t+1} cancel under normalization. The surviving dependence on the step direction comes from the shift r_t → r_t − Δ_{c_{t+1}}, which contributes a factor e^{−q · Δ_{c_{t+1}}}. Identifying v_t ≡ q(u_t), we obtain

\[
\Pr(c_{t+1} \mid s_t; p) \approx \frac{e^{-v_t \cdot \Delta_{c_{t+1}}}}{\sum_{\Delta' \in \mathcal{E}} e^{-v_t \cdot \Delta'}},
\]

which is the Softmax formula of Equation (5) in the main text, now valid beyond the diffusive regime. The nonlinear map q defined by the implicit relation of Equation 12 is the one introduced in Section 3.1, so v_t = q((x_t − p)/(T − t − 1)). In practice, q is evaluated by numerically inverting Equation 12 for each value of u_t.

Finally, one can verify that the diffusive limit is correctly recovered. When ∥u_t∥ ≪ 1, we have ∥q∥ ≪ 1 as well, so the hyperbolic functions can be linearized: sinh q_c ≈ q_c and cosh q_c ≈ 1 + q_c²/2 ≈ 1. Equation 12 then reduces to u_c ≈ q_c/(1 + 1) = q_c/2, giving q ≈ 2 u_t and thus v_t ≈ 2(x_t − p)/(T − t − 1), which recovers the linear formula of Equation (6).

A.3 COMPUTING THE SUFFICIENT VECTOR

Formal definition.
The sufficient vector v_t ∈ R² is defined implicitly by the requirement that the Softmax formula

\[
\Pr(c \mid s_t; p) = \frac{e^{-v_t \cdot \Delta_c}}{\sum_{\Delta' \in \mathcal{E}} e^{-v_t \cdot \Delta'}} \tag{13}
\]

holds for all four symbols c ∈ {L, R, D, U}. Exactly this relation is guaranteed by the saddle-point analysis described in Appendix A.2, with v_t = q((x_t − p)/(T − t − 1)).

Inversion from log-probabilities. Taking logarithms of Equation 13 gives log Pr(c | s_t; p) = −v_t · Δ_c − log Z(v_t), so the vector of log-probabilities ℓ ∈ R⁴ with entries ℓ_c = log Pr(c | s_t; p) satisfies ℓ = −D v_t + λ 1 for some scalar λ, where D ∈ R^{4×2} is the matrix whose rows are the displacement vectors Δ_L, Δ_R, Δ_D, Δ_U, and 1 is the all-ones vector. The columns of D are orthogonal and satisfy DᵀD = 2 I₂, so the minimum-norm least-squares solution is

\[
v_t = -\frac{1}{2} D^{\top} \ell = \frac{1}{2} \begin{pmatrix} \log p_L - \log p_R \\ \log p_D - \log p_U \end{pmatrix}, \tag{14}
\]

where p_c = Pr(c | s_t; p). The additive constant λ drops out because Dᵀ1 = 0 (the rows of D sum to zero), so Equation 14 is independent of the gauge choice for the log-probabilities.

Four-dimensional representation. In the representational-alignment analysis of Section 3.4, we use the mean-centered log-probability vector \bar{\ell} = \ell - \langle \ell \rangle \mathbf{1} (where ⟨ℓ⟩ is the mean of the four entries) as the regression target. This four-dimensional vector encodes exactly the same information as v_t, since \bar{\ell} = -D v_t (using Dᵀ1 = 0), and the two representations are interconvertible via Equation 14 and its inverse. The centered representation is convenient for the regression because it removes the uninformative overall offset.

Numerical computation. In practice, the exact log-probabilities ℓ are obtained from the Green's function via Equation 8: for each equivalence class (x_t, t), we compute G_{T−t−1}(p − x_t − Δ_c) for all four Δ_c in log-space, normalize by their log-sum-exp, and store the result as the target for training and evaluation.
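This numerical pipeline can be sketched end to end. Below is a minimal version with plain integer counts rather than the rotated log-space array (the function names are ours, not the paper's):

```python
import math
from collections import Counter

# Unit displacements for the four step symbols.
STEPS = {"L": (-1, 0), "R": (1, 0), "D": (0, -1), "U": (0, 1)}

def greens_function(n_max):
    """G[n][(rx, ry)]: number of n-step lattice paths with displacement
    (rx, ry), built with the four-neighbor recursion of Equation 7."""
    G = [Counter({(0, 0): 1})]
    for _ in range(n_max):
        nxt = Counter()
        for (rx, ry), cnt in G[-1].items():
            for dx, dy in STEPS.values():
                nxt[(rx + dx, ry + dy)] += cnt
        G.append(nxt)
    return G

def log_probs(x, p, t, T, G):
    """Exact next-step log-probabilities (Equation 8), normalized by
    log-sum-exp; by the recursion, the four terms sum to G[T-t]."""
    logs = {c: math.log(G[T - t - 1][(p[0] - x[0] - dx, p[1] - x[1] - dy)])
            for c, (dx, dy) in STEPS.items()}
    lse = math.log(sum(math.exp(v) for v in logs.values()))
    return {c: v - lse for c, v in logs.items()}

def sufficient_vector(ell):
    """Equation 14: v_t = ½ (log p_L − log p_R, log p_D − log p_U)."""
    return (0.5 * (ell["L"] - ell["R"]), 0.5 * (ell["D"] - ell["U"]))

G = greens_function(10)
# Walker at (2, 1) at t = 3 (parity-consistent), endpoint p = (0, 0), T = 10:
# the sufficient vector points from p towards x_t, larger in x than in y.
ell = log_probs(x=(2, 1), p=(0, 0), t=3, T=10, G=G)
print(tuple(round(v, 3) for v in sufficient_vector(ell)))
```

The production code described above works in log-space throughout; this sketch can afford exact integers because the horizons are tiny.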
The 2D form of Equation 14 is used for visualization and interpretability, while the 4D form is used for regression.

Analyticity of q. The map q is the inverse of F : R² → D, where D = {u : |u_x| + |u_y| < 1} is the physically admissible domain (a lattice walk cannot displace faster than one unit per step on average, so |u_x| + |u_y| < 1 always holds), and F is defined componentwise by Equation 12: F_c(q) = sinh q_c / (cosh q_x + cosh q_y). Since F is a ratio of hyperbolic functions, it is real-analytic on R². Its 2 × 2 Jacobian has entries

\[
\frac{\partial F_x}{\partial q_x} = \frac{1 + \cosh q_x \cosh q_y}{(\cosh q_x + \cosh q_y)^2}, \qquad \frac{\partial F_x}{\partial q_y} = -\frac{\sinh q_x \sinh q_y}{(\cosh q_x + \cosh q_y)^2},
\]

and analogously with x ↔ y for ∂F_y. A short computation gives

\[
\det J_F(q) = \frac{1}{(\cosh q_x + \cosh q_y)^2} > 0 \quad \forall\, q \in \mathbb{R}^2.
\]

Since J_F is nowhere singular and F is real-analytic, the analytic inverse function theorem guarantees that q = F⁻¹ : D → R² is also real-analytic, and in particular C^∞, on its entire domain.

A.4 MAPPING BETWEEN BELIEF AND PREDICTIVE STATES

Here we make precise the distinction between belief states and predictive states for the grid walker, and explain why the latter offer a more informative low-dimensional characterization.

Belief states. For a general hidden Markov model (HMM), the belief state at time t given observation history s_t is the posterior distribution over hidden states, μ_t = Pr(X_t = · | s_t). In our grid walker, the hidden state at time t is the position X_t = x_t ∈ Z². Since each step deterministically updates the position (x_t = \sum_{t'=1}^{t} \Delta_{c_{t'}} is fully determined by the observation sequence), the belief state is

\[
\mu_t(x) = \Pr(X_t = x \mid s_t) = \delta_{x,\, x(s_t)},
\]

a point mass concentrated at the current position x(s_t). The belief-state space at depth t therefore consists of all positions reachable in t steps, a set of size O(t²) that grows without bound with t.
Predictive states. The predictive state at time $t$, by contrast, is the distribution over future observations: $\psi_t = \Pr(S_{t+1:T} = \cdot \mid s_t; p)$. Two histories $s_t$ and $s'_t$ are causally equivalent if $\psi_t = \psi'_t$, i.e., if they produce identical predictive distributions for all futures. From Equation 8, the predictive distribution depends on $s_t$ only through the pair $(x(s_t), t)$: all histories that share the same position at the same time step are causally equivalent. The space of predictive states is therefore parameterized by $(x_t, t)$, and via the continuous sufficient vector $v_t = q\big((x_t - p)/(T - t - 1)\big)$, it is embedded in a two-dimensional manifold inside $\Delta^3$ (the 3-simplex of probability vectors over four symbols).

Linear map from belief states to predictive states. For any HMM, there exists a linear map $\Phi$ such that $\psi_t = \Phi \mu_t$. This map is constructed as follows. Let $\mathcal{X}$ be the set of possible positions at time $t$; then

$$\Pr(c \mid s_t; p) = \sum_{x \in \mathcal{X}} \mu_t(x) \Pr(c \mid X_t = x; p) = \Pr(c \mid X_t = x(s_t); p),$$

where the last equality holds because $\mu_t$ is a point mass. In general, even when $\mu_t$ is spread over several states, the map $\mu_t \mapsto \Pr(\cdot \mid s_t; p)$ is linear in the belief state. The inverse map does not exist in general, since multiple belief states may yield the same predictive state. In our setting the forward map is trivial to evaluate: because $\mu_t$ has a single nonzero component, applying $\Phi$ amounts to consulting one row of the matrix $[\Pr(c \mid X_t = x; p)]_{x, c}$.

Dimension comparison. The belief-state space at depth $t$ is $(|\mathcal{X}_t| - 1)$-dimensional (a simplex over the reachable positions), with $|\mathcal{X}_t| = O(t^2)$. The predictive-state space is two-dimensional, as argued in Appendix A.5. The inequality $\dim(\mathcal{P}) \le \dim(\mathcal{B})$ predicted by the general theory holds, but in a degenerate way: the bound is so loose ($2$ vs. $O(t^2)$) that it is largely uninformative here.
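The linearity of $\Phi$ and its point-mass specialization can be made concrete in a few lines. In the sketch below, the position set and the conditional table $M$ are illustrative placeholders, not values derived from the paper:

```python
# A minimal sketch of the linear map Phi from belief states to predictive states.
# The positions and the conditional table M[x][c] = Pr(c | X_t = x) are
# illustrative placeholders only.
positions = [(0, 0), (1, 0), (0, 1)]
M = [
    [0.25, 0.25, 0.25, 0.25],   # Pr(L), Pr(R), Pr(D), Pr(U) at (0, 0)
    [0.40, 0.10, 0.25, 0.25],   # at (1, 0)
    [0.25, 0.25, 0.40, 0.10],   # at (0, 1)
]

def predictive_from_belief(mu):
    """psi = Phi(mu): a matrix-vector product, hence linear in the belief mu."""
    return [sum(mu[i] * M[i][c] for i in range(len(mu))) for c in range(4)]

mu = [0.0, 1.0, 0.0]            # point-mass belief at position (1, 0)
psi = predictive_from_belief(mu)
# For a point mass, psi is exactly the corresponding row of M.
```

For a spread-out belief the same function returns a convex combination of rows, which is the general HMM case; the grid walker only ever exercises the point-mass special case.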
Working directly with predictive states rather than belief states therefore yields a far more compact and geometrically meaningful characterization of the walker's computational structure: even though the belief-state update is trivial (a deterministic point-mass shift), the predictive-state geometry captures the relevant stochastic and spatial structure of the process.

A.5 Dimensionality of the Prediction Manifold

Fix $t$ and consider the next-step map $f : \mathbb{R}^2 \to \Delta^3$ defined by $f(x_t) := \Pr(\cdot \mid s_t)$, where we analytically extend the closed-form expressions to continuous $x_t$. Whenever $f$ is differentiable, the rank theorem implies that the image of $f$ is locally contained in a submanifold of dimension at most $\operatorname{rank}(Df) \le 2$, with possible lower-dimensional degeneracies at points where $\operatorname{rank}(Df) < 2$.

In statistical mechanics, the softmax distribution is referred to as the Boltzmann distribution. In the present setting, the "microstates" are the four possible next-step symbols $c_{t+1} \in \{L, R, D, U\}$ (equivalently, their displacements $\Delta_{t+1} \in \mathcal{E}$), and their corresponding "energies" can be identified (up to an additive constant and overall scale) with a quantity linear in the current position $x_t$; concretely, Equation 4 implies

$$\Pr(c_{t+1} \mid s_t; p) \propto \exp\!\big({-\beta_t \, \epsilon(c_{t+1}; s_t, p)}\big), \qquad \beta_t := \frac{1}{T - t - 1}, \qquad \epsilon(c_{t+1}; s_t, p) := 2 (x_t - p) \cdot \Delta_{t+1},$$

so that decreasing "energy" corresponds to choosing a step whose displacement points towards the endpoint (i.e., towards reducing $\|x_t - p\|$). The prefactor $\beta_t$ plays the role of an inverse temperature that increases as $t \to T - 1$, reflecting the fact that randomness must decrease as the path approaches its fixed horizon; correspondingly, the approximation becomes singular at $t = T - 1$, where one may replace the softmax by a hard maximization.
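The Boltzmann form above is easy to evaluate directly. The following sketch (with an illustrative position, endpoint, and horizon, not values from the paper) shows the inverse-temperature effect: the same energies give a broad distribution early in the walk and a sharply peaked one near the horizon:

```python
import math

def boltzmann_step_probs(x, p, t, T):
    """Next-step distribution in Boltzmann form, valid for t < T - 1:
    Pr(c) proportional to exp(-beta_t * eps(c)), with beta_t = 1/(T - t - 1)
    and eps(c) = 2 (x - p) . Delta_c."""
    disp = {"L": (-1, 0), "R": (1, 0), "D": (0, -1), "U": (0, 1)}
    beta = 1.0 / (T - t - 1)
    energies = {c: 2.0 * ((x[0] - p[0]) * d[0] + (x[1] - p[1]) * d[1])
                for c, d in disp.items()}
    m = min(energies.values())                 # shift for numerical stability
    w = {c: math.exp(-beta * (e - m)) for c, e in energies.items()}
    z = sum(w.values())
    return {c: wi / z for c, wi in w.items()}

# As t approaches T - 1, beta_t grows and the distribution concentrates on the
# step pointing toward the endpoint p (here "R", since p lies to the right).
early = boltzmann_step_probs(x=(0, 0), p=(5, 0), t=0, T=40)
late = boltzmann_step_probs(x=(0, 0), p=(5, 0), t=37, T=40)
```

At $t = T - 1$ itself, $\beta_t$ diverges and the softmax degenerates into a hard argmin over the energies, matching the singular-limit remark above.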
Finally, since the logit vector equals $-D v_t$ up to an additive gauge constant (with $v_t$ a function of $x_t$), the effective logit (and thus probability) vectors necessarily lie in a two-dimensional subspace determined by the rank of $D$, and the position $x_t$ acts as a sufficient statistic for predicting future steps (including multi-step predictive distributions) given the endpoint.

A.6 Hyperparameters

All hyperparameters were taken directly from Riechers et al. (2025b). All models are instances of the HookedTransformer architecture from the TransformerLens library (Nanda and Bloom, 2022): a decoder-only transformer with $L = 4$ layers, $H = 4$ attention heads, head dimension $d_{\text{head}} = 32$ (model dimension $d_{\text{model}} = 128$), MLP width $d_{\text{mlp}} = 512$ with ReLU activations, and LayerNorm normalization. The context window is fixed to $\tau = 8$ tokens. Training ran for 20,000 epochs of 200 minibatches of size 256. Networks were trained with PyTorch's Adam optimizer with default betas, eps, and weight-decay parameters ($(0.9, 0.999)$, $10^{-8}$, and $0$, respectively) and an initial learning rate of $5 \cdot 10^{-5}$. This learning rate was adjusted by PyTorch's scheduler optim.lr_scheduler.ReduceLROnPlateau, with parameters factor=0.5, patience=1000, cooldown=200, threshold=1e-6.
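The configuration above can be sketched as follows, assuming the TransformerLens and PyTorch packages are installed. The vocabulary size is an assumption (one token per step symbol), since it is not specified in this section:

```python
# Config sketch of the setup described above; d_vocab=4 is an assumption
# (one token per step symbol L, R, D, U), not stated in this section.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=4, n_heads=4, d_head=32, d_model=128, d_mlp=512,
    act_fn="relu", normalization_type="LN",
    n_ctx=8,        # context window tau = 8 tokens
    d_vocab=4,      # assumed: four step symbols
)
model = HookedTransformer(cfg)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=1000, cooldown=200, threshold=1e-6)
# Training loop (20,000 epochs of 200 minibatches of size 256) omitted.
```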
