Is In-Context Learning Learning?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL fits the definition of learning; however, its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL, ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that, empirically, ICL is limited in its ability to learn and generalise to unseen tasks. Namely, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism for learning and suggests limited all-purpose generalisability.


💡 Research Summary

The paper asks whether in‑context learning (ICL) performed by large language models (LLMs) truly constitutes learning. The authors first reformulate the classic Probably Approximately Correct (PAC) framework to treat ICL as a learning process: an LLM must achieve bounded error on examples drawn from a training distribution P and maintain that bound on any new distribution Q, despite not updating its weights. This theoretical framing shows that ICL can be described as learning, but it leaves open how the model actually acquires task knowledge from the prompt alone.
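The two-distribution criterion described above can be sketched in PAC-style notation. The symbols here (ε, δ, P, Q) follow the summary's description and are illustrative, not necessarily the paper's exact formulation:

```latex
% Sketch of a PAC-style criterion for ICL (notation assumed, not the paper's exact statement).
% f_S is the model's in-context predictor induced by a prompt S of n exemplars drawn from P;
% the error bound must hold both on P and on a shifted test distribution Q.
\Pr_{S \sim P^n}\!\left[ \operatorname{err}_P(f_S) \le \epsilon \;\wedge\; \operatorname{err}_Q(f_S) \le \epsilon \right] \ge 1 - \delta,
\qquad \operatorname{err}_D(f_S) = \Pr_{x \sim D}\!\left[ f_S(x) \ne y(x) \right].
```

The distinctive feature relative to classical PAC learning is that f_S arises purely from conditioning on the prompt S, with no weight updates.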

To answer this, the authors conduct a massive empirical study involving four state‑of‑the‑art LLMs and nine formal tasks (parity checking, Hamiltonian‑cycle verification, stack manipulation, etc.). They generate 1.89 million predictions, varying the number of exemplars (from one to several hundred), the prompt style (plain, chain‑of‑thought, automated prompt optimisation), and the distribution of training versus test data (label proportion, positional ordering, out‑of‑distribution shifts). They also replace natural‑language instructions with random alphabet strings to force the model to learn purely from the pattern of examples.
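As an illustration of this setup, a few-shot prompt for the parity task might be assembled as follows. This is a minimal sketch: the exemplar format, string length, and absence of an instruction line are assumptions, not the paper's exact protocol.

```python
import random

def parity_label(bits: str) -> str:
    """Label a bit string by the parity of its count of 1s."""
    return "even" if bits.count("1") % 2 == 0 else "odd"

def build_parity_prompt(n_shots: int, length: int = 8, seed: int = 0) -> str:
    """Assemble a few-shot parity prompt: n_shots labelled exemplars, then an unlabelled query."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_shots):
        bits = "".join(rng.choice("01") for _ in range(length))
        lines.append(f"Input: {bits}\nOutput: {parity_label(bits)}")
    query = "".join(rng.choice("01") for _ in range(length))
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

print(build_parity_prompt(n_shots=4))
```

Sweeping `n_shots` while holding the query distribution fixed (or shifting it) mirrors the exemplar-count and distribution-shift axes of the study.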

Key findings: (1) As the number of exemplars grows (≈50–100), accuracy gaps between models vanish and all prompting strategies converge, indicating that performance is driven more by the regularities encoded in the prompt than by the model's intrinsic capabilities. (2) ICL is robust to changes in exemplar distribution (label balance, ordering) but highly brittle to distributional shifts between training and test sets, especially when using chain‑of‑thought or automated prompt optimisation. (3) Tasks that appear similar can exhibit up to a 31% difference in accuracy, and traditional baselines such as decision trees or k‑nearest neighbours outperform ICL on half of the evaluated tasks. Consequently, the claim that a few shots suffice for learning is challenged: peak performance often requires dozens to hundreds of examples, and the gains are not consistent across tasks or prompt styles.
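To make the baseline comparison concrete, the kind of classical baseline mentioned above can be run on the same exemplar/test protocol an ICL run would use. The sketch below implements a stdlib-only k-nearest-neighbours classifier over Hamming distance on the parity task; the paper's actual baseline configurations, feature encodings, and sample sizes are assumptions here.

```python
import random
from collections import Counter

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query: str, k: int = 3) -> str:
    """Classify `query` by majority vote among the k nearest training exemplars."""
    nearest = sorted(train, key=lambda ex: hamming(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def parity_label(bits: str) -> str:
    return "even" if bits.count("1") % 2 == 0 else "odd"

# Evaluate the baseline on a held-out test set drawn from the same distribution.
rng = random.Random(0)
draw = lambda: "".join(rng.choice("01") for _ in range(8))
train = [(s, parity_label(s)) for s in (draw() for _ in range(100))]
test = [(s, parity_label(s)) for s in (draw() for _ in range(200))]
accuracy = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(f"k-NN accuracy on parity: {accuracy:.2f}")
```

Note that parity is deliberately hard for distance-based methods (flipping one bit flips the label), so accuracy here varies by task, which is exactly the task-to-task inconsistency the findings describe.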

The authors conclude that while ICL fits a formal definition of learning, it relies on an “ad‑hoc” encoding of statistical regularities from the prompt rather than genuine feature‑level generalisation. This makes ICL a fragile mechanism for cross‑task generalisation, limited by the representativeness of the prompt data. Future work should investigate how intrinsic knowledge and prompt design interact in natural‑language settings and seek more robust learning paradigms beyond the current ICL paradigm.

