LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth
Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as “context rot”. Existing long-context benchmarks primarily focus on single-step settings that evaluate a model’s ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent’s context length. This design enables LOCA-bench to extend the context length arbitrarily, in a controlled way, while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench
💡 Research Summary
LOCA‑bench is a newly introduced benchmark designed to evaluate large language model (LLM) agents under conditions of extreme and controllable context growth. Unlike prior long‑context benchmarks that focus on a single retrieval step from a static long text, LOCA‑bench requires agents to start with a brief task description and limited knowledge of an environment, then iteratively explore that environment using a suite of tools (e.g., email, Canvas, BigQuery, spreadsheets). As the environment description length (EDL) – measured in tokens – is systematically increased (8 K to 256 K tokens), the underlying task semantics remain unchanged, allowing researchers to isolate the effect of context size on performance.
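The core idea of scaling EDL while holding task semantics fixed can be sketched as padding the environment with task-irrelevant distractor records until a target token budget is reached. This is only an illustrative sketch, not the paper's generator: the record format, the `expand_environment` function, and the 4-characters-per-token estimate are all assumptions.

```python
import random

def expand_environment(relevant_records, target_tokens, seed=0):
    """Hypothetical sketch of template-driven EDL control: pad the environment
    with task-irrelevant distractor records until the description reaches a
    target token budget, leaving the task-relevant records (and hence the
    task semantics) untouched."""
    rng = random.Random(seed)  # per-level random seed for reproducibility
    records = list(relevant_records)

    def tokens(recs):
        return sum(len(r) for r in recs) // 4  # rough 4-chars-per-token estimate

    i = 0
    while tokens(records) < target_tokens:
        records.append(f"distractor email {i}: status update #{rng.randint(0, 9999)}")
        i += 1
    rng.shuffle(records)  # interleave distractors with the relevant records
    return records
```

Because only distractor records are added, a rule-based checker that inspects the final environment state can score the 8 K-token and 256 K-token variants of a task with the same script.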
The benchmark comprises 15 seed tasks adapted from the Toolathlon suite, each instantiated across seven EDL levels with five random seeds per level, yielding 525 evaluation instances. Environments are generated automatically from hand‑crafted templates and injected into local mock servers that faithfully emulate real‑world APIs while avoiding authentication and rate‑limit issues. For each task, a rule‑based script checks the final environment state to produce a binary success signal, ensuring reproducible and verifiable evaluation.
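The instance count and the binary success signal described above can be sketched in a few lines; the task identifiers, the exact EDL levels, and the `check_success` rule are illustrative assumptions, not the benchmark's implementation.

```python
from itertools import product

# 15 seed tasks x 7 EDL levels x 5 random seeds = 525 evaluation instances.
SEED_TASKS = [f"task_{i:02d}" for i in range(15)]          # placeholder task IDs
EDL_LEVELS = [8_000, 16_000, 32_000, 48_000, 64_000,
              128_000, 256_000]                            # assumed token budgets
RANDOM_SEEDS = range(5)

instances = [
    {"task": task, "edl_tokens": edl, "seed": seed}
    for task, edl, seed in product(SEED_TASKS, EDL_LEVELS, RANDOM_SEEDS)
]
print(len(instances))  # 525

def check_success(final_state: dict, expected: dict) -> bool:
    """Hypothetical rule-based check: succeed iff every expected key/value
    is present in the final environment state."""
    return all(final_state.get(k) == v for k, v in expected.items())
```

Checking only the final environment state, rather than the agent's transcript, is what makes the signal reproducible: any action sequence that leaves the environment in the expected state counts as a success.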
Experiments cover both frontier proprietary models (Claude‑4.5‑Opus, GPT‑5.2‑Medium, Gemini‑3‑Flash) and strong open‑source models (DeepSeek‑V3.2‑Thinking, MiniMax‑M2.1, GLM‑4.7, Kimi‑K2‑Thinking). All models are run at their maximum supported context windows (Claude 200 K, GPT 400 K, Gemini 1.05 M tokens). Results show a sharp decline in accuracy as EDL grows: most models drop from >70 % at 8 K tokens to below 30 % at 128 K–256 K tokens, with the gap between proprietary and open‑source models widening dramatically.
A central contribution of the paper is the systematic evaluation of context‑management strategies. Four engineering techniques are integrated into the evaluation scaffold: (i) programmatic tool calling, (ii) context awareness, (iii) tool‑result clearing, and (iv) removal of intermediate “thinking” content. Applying these strategies yields substantial gains; for example, Gemini‑3‑Flash improves from 21 % to 49 % accuracy at 128 K tokens when all strategies are combined. The analysis identifies four primary failure modes under long context: (1) complex retrieval and joint reasoning overload, (2) forgetting earlier instructions, (3) reduced exploration propensity, and (4) increased hallucination.
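The tool-result-clearing idea (strategy iii) can be sketched as a transformation over the agent's message history. The message schema, the stub text, the token estimate, and the `keep_last` heuristic below are all assumptions for illustration, not the scaffold's actual implementation.

```python
def estimate_tokens(messages):
    """Rough token estimate, assuming ~4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def clear_old_tool_results(messages, budget_tokens=100_000, keep_last=3):
    """Hypothetical sketch of tool-result clearing: once the transcript
    exceeds a token budget, replace the bodies of older tool results with a
    short stub, so the agent keeps the record of which calls it made but
    sheds their bulky outputs. The most recent `keep_last` results survive."""
    if estimate_tokens(messages) <= budget_tokens:
        return messages
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_clear = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[tool result cleared to save context]"}
        if i in to_clear else m
        for i, m in enumerate(messages)
    ]
```

The other strategies fit the same pattern of transcript rewriting: removing intermediate "thinking" content drops a different message type, while programmatic tool calling and context awareness change what the agent emits rather than what is retained.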
The authors also highlight the practical infrastructure: local mock servers for 280 tools, a template‑driven environment generator, and an open‑source toolkit built on the GEM framework that allows researchers to add new tasks, extend context lengths, and plug in additional context‑engineering methods.
In summary, LOCA‑bench provides a controlled, extensible platform to study “context rot” in agentic settings, demonstrates that advanced context‑management can partially mitigate performance loss, and opens avenues for future work on memory‑augmented architectures, automated context compression, and human‑in‑the‑loop error correction to further improve long‑context agent reliability.