How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs
Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), plain data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
💡 Research Summary
This paper addresses a gap in the theoretical understanding of in‑context learning (ICL) by analyzing a realistic transformer architecture that includes a two‑layer nonlinear MLP head and is pretrained on a mixture of heterogeneous data sources. The authors consider S different sources, each characterized by its own input mean/covariance, task vector distribution, nonlinear response function, and additive Gaussian noise. Within each context the task vector is shared, but it is resampled across contexts, forcing the model to infer the underlying task from the few demonstration pairs.
The model consists of a single block of linear attention followed by an MLP head. Training is split into two stages: (i) a single gradient‑descent step on the first MLP layer (parameter matrix F) while keeping the second‑layer weights w fixed, and (ii) ridge regression on w using a fresh set of contexts. This two‑phase scheme preserves meaningful feature learning while remaining analytically tractable.
The analysis is performed under a proportional high‑dimensional limit: the input dimension d, context length ℓ, number of training contexts n, and hidden dimension k all tend to infinity with fixed ratios ℓ/d, n/d², and k/n. Additional assumptions include bounded spectral norms of the input covariances, a step‑size η that grows slower than d², and a low‑rank structure for the covariance of the vectorized attention output that is identical across sources.
The main theoretical contribution is an asymptotic equivalence theorem: under the stated assumptions, the ICL mean‑squared error of the transformer‑MLP model converges to that of a finite‑degree polynomial predictor whose degree is determined by the order of the activation’s Hermite expansion. The proof leverages Gaussian universality results for two‑layer networks, a Taylor expansion of the single‑step update, and orthogonal polynomial theory to replace the original heterogeneous data distribution with an equivalent Gaussian model that matches first‑ and second‑order moments.
From this equivalence several insights follow. First, nonlinear activations (ReLU, GELU, tanh, etc.) provide higher‑order polynomial terms that substantially reduce ICL error on nonlinear tasks compared with a linear MLP baseline. Empirically, the authors observe 15‑35 % error reductions across a range of hidden sizes and activation functions. Second, the quality of each data source matters: sources with low label noise, structured input and task covariances, and a non‑isotropic spectrum (i.e., a few dominant eigenvalues) are identified as “high‑quality.” When the mixture proportion ρᵢ of high‑quality sources exceeds roughly 30 %, the ICL performance improves sharply, confirming that the model benefits from well‑behaved data. Conversely, an over‑representation of noisy, isotropic sources degrades performance because the first MLP layer learns features that overfit noise.
A further contribution is the characterization of when genuine feature learning occurs. The authors show that if the task covariance Σ_ξ,ₛ is essentially spherical (all eigenvalues equal), the first‑layer update cannot extract informative nonlinear features; the model collapses to estimating a simple mean. Only when Σ_ξ,ₛ exhibits sufficient anisotropy does the single‑step gradient produce a meaningful nonlinear embedding that the second‑layer ridge regression can exploit.
The theoretical predictions are validated through extensive simulations. Synthetic experiments vary d, ℓ, n, k, activation functions, and source statistics, confirming the O(d⁻¹) convergence of the empirical ICL error to the polynomial‑model prediction. Mixing experiments systematically vary the proportion of a high‑quality source against two lower‑quality sources, reproducing the predicted error‑vs‑mixing curves. Finally, a real‑world multilingual sentiment‑analysis task treats each language as a separate source. Pretraining on a mixture of English, German, French (high‑quality) and several low‑resource languages (low‑quality) yields a 4.2 % absolute gain in average accuracy over a baseline trained on a single language, and even the low‑resource languages benefit from the transfer of structured features learned from the high‑quality data.
In summary, the paper demonstrates that (1) incorporating a nonlinear MLP head fundamentally enhances a transformer’s ICL capability on nonlinear tasks, (2) the ICL performance can be precisely captured by an equivalent polynomial model in the high‑dimensional limit, and (3) the composition of the pretraining data—specifically noise level, covariance structure, and task anisotropy—critically determines whether the model learns useful features. These results provide concrete guidance for architecture design (use nonlinear MLPs, allocate sufficient hidden width) and data collection (prioritize low‑noise, structured sources) when building foundation models intended for strong in‑context learning. Future work may extend the analysis to multi‑block transformers, softmax attention, and adaptive context lengths.
Comments & Academic Discussion
Loading comments...
Leave a Comment