CL-bench: A Benchmark for Context Learning
Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what was acquired during pre-training to reason about and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but that has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed so that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context: new domain-specific knowledge, rule systems, complex procedures, and laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks, which primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns from instructions and demonstrations. Our evaluation of ten frontier LMs finds that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex, context-dependent tasks. CL-bench represents a step toward building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
💡 Research Summary
The paper introduces CL‑bench, a comprehensive benchmark designed to evaluate a language model’s ability to learn from provided context—a capability the authors term “context learning.” While modern LMs excel at reasoning over prompts using pre‑trained knowledge, real‑world tasks often require the model to ingest novel, domain‑specific information that is absent from its training data, such as new regulations, product manuals, experimental laws, or bespoke rule systems. CL‑bench consists of 500 richly crafted contexts, 1,899 individual tasks, and 31,607 verification rubrics, all authored and vetted by domain experts. Each context contains up to 12 tasks (3.8 on average) and an average of 63.2 rubrics, with many tasks presented sequentially across multiple interaction turns, thereby mimicking realistic multi‑step workflows.
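The hierarchy described above (contexts containing tasks, each verified by rubrics, with tasks arriving over interaction turns) can be pictured as a simple data model. This is an illustrative sketch only; the field names and schema are assumptions, not the benchmark's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # One verifiable criterion a model response must satisfy (illustrative).
    description: str

@dataclass
class Task:
    # A task posed against a context; `turn` marks its position when tasks
    # are presented sequentially across multiple interaction turns.
    prompt: str
    turn: int
    rubrics: list[Rubric] = field(default_factory=list)

@dataclass
class Context:
    # A long, expert-authored document plus its associated tasks.
    text: str
    category: str  # e.g. "Rule System Application" (hypothetical label value)
    tasks: list[Task] = field(default_factory=list)

def benchmark_stats(contexts: list[Context]) -> dict:
    """Aggregate counts in the style of the figures quoted above."""
    n_tasks = sum(len(c.tasks) for c in contexts)
    n_rubrics = sum(len(t.rubrics) for c in contexts for t in c.tasks)
    return {
        "contexts": len(contexts),
        "tasks": n_tasks,
        "rubrics": n_rubrics,
        "tasks_per_context": n_tasks / len(contexts) if contexts else 0.0,
    }
```

On the full benchmark such a function would report 500 contexts, 1,899 tasks, and 31,607 rubrics, matching the headline statistics.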
The benchmark’s taxonomy divides contexts into four high‑level categories—Domain Knowledge Reasoning, Rule System Application, Procedural Task Execution, and Empirical Discovery & Simulation—and further into 18 sub‑categories covering fields such as finance, healthcare, law, game mechanics, programming syntax, technical standards, operational procedures, and simulation environments. Context lengths range from 8 K to 65 K tokens, ensuring that models must handle long‑range dependencies. Crucially, the knowledge embedded in each context is deliberately “contamination‑free”: it is either fictional, a modification of existing material, or sourced from niche, emerging domains, guaranteeing that no pre‑training data contains the required information.
For evaluation, the authors built an automatic rubric‑based scoring system and tested ten frontier LMs. Across the entire benchmark, the average solve rate was only 17.2%; the best model, GPT‑5.1, achieved 23.7%. Performance varied markedly across categories: tasks that required inducing laws from extensive experimental data or simulating complex sandbox environments saw an average success rate of just 11.8%. Error analysis revealed three dominant failure modes: (1) outright neglect of the supplied context, (2) insufficient long‑context reasoning, and (3) poor instruction‑following, especially when the context contradicted pre‑trained knowledge.
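A minimal sketch of how rubric‑based scores might be aggregated into the solve rates quoted above, assuming a task counts as solved only when every one of its rubrics passes (the paper's exact aggregation rule may differ, and `judge` here is a stand‑in for whatever automatic rubric checker is used):

```python
from typing import Callable

def task_solved(response: str, rubrics: list[str],
                judge: Callable[[str, str], bool]) -> bool:
    # Assumed rule: a task is solved only if the response satisfies every
    # rubric; partial-credit schemes are equally plausible in practice.
    return all(judge(response, rubric) for rubric in rubrics)

def solve_rate(results: list[bool]) -> float:
    # Fraction of tasks solved across the benchmark; a value of 0.172
    # would correspond to the 17.2% average reported above.
    return sum(results) / len(results) if results else 0.0
```

With a toy substring judge, `solve_rate([True, False, False, False])` evaluates to `0.25`; in the real system the judge would be a rubric‑grading model or verifier rather than a string check.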
The authors argue that existing benchmarks—whether focused on in‑context learning, retrieval‑augmented generation, or long‑document QA—do not isolate the ability to acquire and apply new knowledge from context. CL‑bench fills this gap and serves as a diagnostic tool for future research. They propose several avenues to improve context learning: developing more efficient long‑range attention or memory mechanisms, designing training objectives that explicitly reward the incorporation of novel context information, employing meta‑learning or reinforcement‑learning strategies for multi‑step task chains, and creating conflict‑resolution frameworks to handle discrepancies between pre‑trained knowledge and context‑provided facts.
In summary, CL‑bench demonstrates that despite impressive capabilities on traditional benchmarks, current LMs remain far from human‑like context learning. By providing a rigorous, real‑world testbed, the benchmark aims to steer the community toward models that can dynamically ingest, retain, and reason over new information, thereby bridging the gap between laboratory performance and practical, intelligent deployment.