Transformers learn variable-order Markov chains in-context
We study transformers’ in-context learning of variable-order Markov chains (VOMCs), focusing on finite-sample accuracy as the number of in-context examples increases. Compared to fixed-order Markov chains (FOMCs), learning VOMCs is substantially more challenging due to the additional structural learning component. The problem is naturally suited to a Bayesian formulation, where the context-tree weighting (CTW) algorithm, originally developed in the information theory community for universal data compression, provides an optimal solution. Empirically, we find that single-layer transformers fail to learn VOMCs in context, whereas transformers with two or more layers can succeed, with additional layers yielding modest but noticeable improvements. In contrast to prior results on FOMCs, attention-only networks appear insufficient for VOMCs. To explain these findings, we provide explicit transformer constructions: one with $D+2$ layers that can exactly implement CTW for VOMCs of maximum order $D$, and a simplified two-layer construction that uses partial information for approximate blending, shedding light on why two-layer transformers can perform well.
💡 Research Summary
This paper investigates how transformer models perform in‑context learning (ICL) of variable‑order Markov chains (VOMCs), a class of stochastic processes where the next symbol depends on a suffix of variable length rather than a fixed order. Compared to fixed‑order Markov chains (FOMCs), VOMCs require simultaneous learning of both the underlying context‑tree structure and the conditional probability distributions at its leaves, making the task substantially harder. The authors frame the problem in a Bayesian setting and identify the Context‑Tree Weighting (CTW) algorithm—originally devised for universal data compression—as the Bayes‑optimal solution for ICL‑VOMC.
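The CTW mixture referred to above admits a compact recursive description: each node of the context tree mixes its own Krichevsky–Trofimov (KT) estimate with the product of its children's weighted probabilities, $P_w(s) = \tfrac12 P_e(s) + \tfrac12 \prod_a P_w(as)$. The following is an illustrative Python sketch over a ternary alphabet, not the paper's implementation; the half/half mixing weight, the KT estimator, and treating the first $D$ symbols as a fixed initial context are standard CTW conventions assumed here.

```python
ALPHABET = 3  # ternary alphabet, matching the paper's setup

def kt_estimate(counts):
    """KT block probability of the symbol counts at one tree node,
    expanded as a product of sequential add-1/2 predictive probabilities."""
    p, seen, total = 1.0, [0] * ALPHABET, 0
    for sym, c in enumerate(counts):
        for _ in range(c):
            p *= (seen[sym] + 0.5) / (total + ALPHABET / 2)
            seen[sym] += 1
            total += 1
    return p

def collect_counts(seq, max_depth):
    """Symbol counts following every observed context of length <= max_depth.
    The first max_depth symbols serve as a fixed initial context, so each
    counted position contributes to every depth along its context path."""
    counts = {}
    for t in range(max_depth, len(seq)):
        for d in range(max_depth + 1):
            ctx = tuple(seq[t - d:t])  # the d most recent symbols before t
            counts.setdefault(ctx, [0] * ALPHABET)[seq[t]] += 1
    return counts

def ctw_prob(counts, ctx, max_depth):
    """Weighted probability P_w at node ctx: the CTW half/half mixture of
    the node's own KT estimate and the product over its children."""
    pe = kt_estimate(counts.get(ctx, [0] * ALPHABET))
    if len(ctx) == max_depth:
        return pe  # leaf node: no deeper structure to mix over
    prod = 1.0
    for a in range(ALPHABET):
        child = (a,) + ctx  # extend the context one symbol further back
        if child in counts:  # unseen children contribute a factor of 1
            prod *= ctw_prob(counts, child, max_depth)
    return 0.5 * pe + 0.5 * prod
```

The next-symbol predictive distribution follows as the ratio `ctw_prob(seq + [a]) / ctw_prob(seq)`, and these ratios sum to one over the alphabet, which is what makes CTW a valid coding distribution and the Bayes-optimal reference the paper evaluates against.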
Empirically, the study trains transformers of varying depth (1–4 layers) and also attention‑only variants on synthetic sequences generated from random context trees with alphabet size three, context window N = 1536, and maximum tree depth D = 5. Performance is measured by the cumulative cross‑entropy loss (equivalently, compression rate) across the entire context window. Results show that single‑layer transformers and attention‑only networks perform near a unigram baseline, failing to capture the variable‑order dependencies. In contrast, transformers with two or more layers closely track the optimal CTW loss, with modest improvements as depth increases and a saturation effect around four layers. Traditional language‑model smoothing techniques such as Kneser‑Ney and PPM, which essentially operate as high‑order FOMC estimators, fall significantly short, especially toward the end of the context window where their fallback mechanisms dominate.
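The equivalence between cumulative cross-entropy and compression rate is what lets transformers, CTW, and the smoothing baselines be compared on a single axis: log-loss in bits equals the code length an arithmetic coder driven by the same predictions would pay. A minimal sketch of the metric (the function name and interface are illustrative, not from the paper):

```python
import math

def cumulative_log_loss(predictive_probs, targets):
    """Average negative log2 probability the model assigned to each observed
    next symbol, i.e., bits per symbol -- equivalently, the compression rate
    of a coder built on these predictions."""
    total = 0.0
    for probs, y in zip(predictive_probs, targets):
        total += -math.log2(probs[y])
    return total / len(targets)
```

Averaging this loss over the full context window, as the paper does, captures both the early structure-learning phase (where losses are high for every method) and the late phase, where a well-calibrated learner should approach the source entropy.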
To explain these observations, the authors construct explicit transformer architectures. First, a (D + 2)‑layer transformer is presented that exactly implements the recursive CTW computation: lower layers aggregate suffix counts, intermediate layers perform the required blending of leaf probabilities, and the top layer outputs the CTW‑optimal next‑token distribution. Crucially, feed‑forward (FF) sub‑layers are needed to carry out the blending step, highlighting why attention‑only models cannot succeed on VOMCs. Second, a simplified two‑layer construction is introduced: the first layer supplies raw or partially aggregated count information, while the second layer’s FF network learns to approximate the CTW blending weights. Experiments confirm that this shallow model attains performance virtually indistinguishable from full CTW, indicating that precise recursion is not necessary—approximate statistics suffice.
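The two-layer picture can be made concrete with a rough sketch: one pass gathers, for each suffix length, how often each symbol followed the current context, and a fixed-weight mixture of the resulting per-depth predictives stands in for the blending the second layer's FF network learns. The function names and the uniform weights below are illustrative assumptions, not the authors' construction.

```python
ALPHABET = 3  # ternary alphabet, as in the paper's experiments

def suffix_count_features(seq, t, max_depth):
    """For position t, count how often each symbol followed each suffix
    length d of the current context earlier in the sequence -- the kind of
    partially aggregated statistic the first layer is argued to supply."""
    feats = []
    for d in range(max_depth + 1):
        ctx = tuple(seq[t - d:t]) if d <= t else None
        row = [0] * ALPHABET
        if ctx is not None:
            for s in range(d, t):
                if tuple(seq[s - d:s]) == ctx:  # earlier match of this suffix
                    row[seq[s]] += 1
        feats.append(row)
    return feats

def blended_prediction(feats, weights):
    """Mix per-depth add-1/2-smoothed predictives with fixed weights -- a
    crude stand-in for the learned FF blending (weights should sum to 1)."""
    probs = [0.0] * ALPHABET
    for w, row in zip(weights, feats):
        n = sum(row)
        for a in range(ALPHABET):
            probs[a] += w * (row[a] + 0.5) / (n + ALPHABET / 2)
    return probs
```

Even this crude mixture interpolates between shallow and deep context statistics; the paper's point is that a trained FF layer can learn a far better, data-dependent version of these blending weights from coarse counts alone.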
The paper’s contributions are threefold: (1) it provides the first systematic study of finite‑sample ICL performance on VOMCs, demonstrating that transformers can learn variable‑order dependencies when equipped with sufficient depth; (2) it offers a constructive proof that a transformer with D + 2 layers can implement the Bayes‑optimal CTW algorithm, thereby establishing a concrete link between modern neural architectures and classical information‑theoretic methods; (3) it elucidates why two‑layer transformers already perform well, attributing the effect to the ability of a single FF layer to blend coarse count statistics. By positioning VOMCs as a clean, mathematically tractable benchmark with known optimal performance, the work isolates the mechanisms of in‑context learning from confounding factors present in real‑world language tasks, and opens avenues for designing more efficient transformer‑based learners for structured probabilistic models.