CUBE: A Standard for Unifying Agent Benchmarks



Alexandre Lacoste 1, Nicolas Gontier 1, Oleh Shliazhko 1, Aman Jaiswal 1,3, Kusha Sareen 1,6,7, Shailesh Nanisetty 1, Joan Cabezas, Manuel Del Verme 2, Omar G. Younis 2, Simone Baratta 2, Matteo Avalle 2, Imene Kerboua 7, Xing Han Lù 6,7, Elron Bandel 4, Michal Shmueli-Scheuer 4, Asaf Yehudai 4, Leshem Choshen 4, Jonathan Lebensold 5, Sean Hughes 1, Massimo Caccia 1, Alexandre Drouin 1,7, Siva Reddy 6,7, Tao Yu 8, Yu Su 9, Graham Neubig 10, Dawn Song 11

1 ServiceNow AI Research  2 Silverstream.ai  3 Dalhousie  4 IBM Research  5 Jetty  6 McGill  7 Mila  8 HKU  9 OSU  10 CMU  11 UC Berkeley

Correspondence to: Alexandre Lacoste <alexandre.lacoste@servicenow.com>, Nicolas Gontier <nicolas.gontier@servicenow.com>.

Preprint. March 18, 2026.

Abstract

The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.

1. Introduction

The field of Artificial Intelligence (AI) is experiencing a remarkable surge in benchmark development for AI agents. The research community has created an impressive ecosystem of complex, interactive environments designed to test the limits of autonomous agents (Yehudai et al., 2025; Mohammadi et al., 2025). This diversity presents tremendous opportunities: researchers can now evaluate agents across broad task distributions and leverage large-scale training on varied environments. However, realizing this potential requires solving a critical integration challenge. The effort required to incorporate these diverse benchmarks into evaluation and training pipelines has become a significant bottleneck. In practice, this limits which labs can afford to evaluate across many benchmarks, and it shapes what research gets done as a result (Bandel et al., 2026a).

As we move toward generalist agents, the need for a general framework supporting both evaluation and post-training on diverse task distributions becomes critical. We seek agents that reason, use tools (Xu et al., 2025; Shen, 2024), and navigate unseen environments. With proper tools and a consistent interface, a generalist agent should handle any benchmark (Bandel et al., 2026c). Yet integrating new benchmarks remains time-consuming, forcing researchers to act more like systems engineers than AI scientists.

To address this integration tax, several platforms have emerged, such as NeMo Gym (NVIDIA, 2025), Harbor (Shaw, 2025), HAL (Kapoor et al., 2025), and OpenEnv (Meta PyTorch Team and Hugging Face, 2025), among others. These platforms integrate existing benchmarks and provide tooling to author new environments, but each proposes its own environment interface. As a result, maintainers must build connectors to move benchmarks across them, a growing burden as benchmark production accelerates.
Recognizing this, the community has also begun proposing open benchmark interface standards: the Agentified Agent Assessment (AAA) paradigm in AgentBeats (AgentBeats Team and Berkeley RDI, 2025) and CUBE (this paper) are concurrent efforts sharing this motivation, though differing in scope and design choices, as discussed in Section 4.

Competition between platforms is a valuable market force and drives innovation. They should compete on features, usability, scalability, and other value metrics. Yet without a common benchmark interface, competition skews toward the size of each platform's integration catalog instead.

The Core Position: The community needs a standard that allows practitioners to wrap agentic benchmarks once and have them work everywhere, for evaluation and for training, at scale. Any platform that implements the standard would instantly gain access to every compliant benchmark. This would eliminate redundant integration work and let us focus on understanding and improving agent behavior. By moving toward a universal standard, we can turn the current landscape of isolated silos into a thriving, interoperable ecosystem.

The importance of multi-benchmarking cannot be overstated. There are currently over 300 agentic benchmarks available, many of which are highly innovative but remain largely unknown because they are too difficult to set up. By simplifying and scaling cross-benchmark evaluation, we can gain a much deeper understanding of agent capabilities across a vast distribution of tasks. Given the rapid rise of coding agents (Yang et al., 2024; Wang et al., 2024b; Jimenez et al., 2024; Deng et al., 2025) and the increasing interest in post-training and RL on diverse tasks (Khatri et al., 2025; Park et al., 2025), we forecast this number to double by the end of 2026.
Without a standard, the field is destined for unmanageable fragmentation as benchmark production outstrips integration capacity.

2. The Current Landscape and its Challenges

To understand why a standard is necessary, consider the engineer bridging a benchmark and a generalist agent, translating raw implementations into agent-compatible interfaces through technical hurdles rarely discussed in papers.

Unlike static datasets, agentic benchmarks require live, interactive environments (Trivedi et al., 2024; Merrill et al., 2026) with diverse infrastructure needs spanning web navigation (Drouin et al., 2024; Zhou et al., 2024; Koh et al., 2024), software engineering (Jimenez et al., 2024; Li et al., 2025), operating systems (Xie et al., 2024; Bonatti et al., 2025), and mobile devices (Rawles et al., 2025; Chen et al., 2025). Some require shared servers common to all tasks, preventing simple per-task deployment. To make this concrete, Table 1 summarizes the shape and challenges of four popular benchmarks.

Furthermore, deployment on high-performance computing platforms often runs into port configuration conflicts. Default ports are frequently blocked, requiring custom reconfiguration for every single deployment. Resource scaling also varies wildly, from lightweight scripts to heavy simulation environments that demand massive amounts of RAM and disk I/O. The result is an environment where reproducibility is hampered by the sheer complexity of the underlying stack.

The diversity of these requirements creates what we call the Integration Tax. When a researcher wants to evaluate an agent on five different benchmarks, they often have to write five unique "drivers" or "wrappers." If they decide to switch from one evaluation framework to another, they must often start this work from scratch. This N-to-M mapping of agents to benchmarks is a massive waste of human capital.

3. The CUBE Standard Proposal

We propose CUBE (Common Unified Benchmark Environments), a protocol standard designed to unify the ML community by establishing a universal interface between benchmarks and evaluation frameworks.[1] The core insight is simple: if we define a consistent API contract, any CUBE-compliant benchmark becomes immediately usable by any CUBE-compliant platform, whether for evaluation or post-training, with minimal custom integration work. Most importantly, we communicate this early to allow the community to steer the standard before it reaches a more rigid state.

For generality and clarity, we present CUBE's interface using RPC (Remote Procedure Call) notation, but the standard supports both a Python and an RPC interface (see Sec. 3.5 and Fig. 1). The RPC layer enables process isolation and cross-language communication, while the direct Python interface avoids cross-process serialization overhead and may provide a necessary speedup for some post-training applications.

Four-layer Schema: To achieve a standard where a generalist agent could interact with a new benchmark with minimal to no human involvement, we are mindful of four different levels of interaction:

1. Task level: We define how agents interact with individual task instances, how they observe state, execute actions, and receive feedback (Sec. 3.1).
2. Benchmark level: We specify how evaluation harnesses discover available tasks and spawn new instances (Sec. 3.2).
3. Package level: We standardize installation and parallelization across compute infrastructures (Sec. 3.3).
4. Registry level: We provide a centralized metadata catalog for discovery and filtering (Sec. 3.4).

Each layer can be accessed either via direct Python calls for same-process execution or via RPC for distributed and cross-platform scenarios.

[1] Reference implementation: github.com/The-AI-Alliance/cube-standard (Lacoste et al., 2026)

Table 1. A comparative analysis of the infrastructure and operational requirements across current agentic benchmarks. The lack of uniformity in how these environments are hosted, controlled, and reset creates significant overhead for cross-benchmark evaluation.

Environment Type
- WebArena (Zhou et al., 2024): Simulated Web. A private micro-internet (GitLab, Reddit clone, etc.).
- SWE-Bench (Jimenez et al., 2024): Coding Workspace. A specific software repository and dev toolchain.
- OSWorld (Xie et al., 2024): Desktop OS. A full graphical operating system (Ubuntu/Windows).
- GAIA (Mialon et al., 2024): The Real World. Live internet access and local document sets.

Hosting Format
- WebArena: Benchmark-level VM. One shared world per task run; high AWS costs.
- SWE-Bench: Task-level Container. Lightweight, ephemeral Docker images per task.
- OSWorld: Benchmark-level VM. OS VM with task-level snapshots.
- GAIA: Static Files. Task files on HF Hub; users provide tool implementations.

Action Space (Tooling)
- WebArena: Flexible/Provided. Playwright basics provided, but usually swapped for custom abstractions (e.g., BrowserGym).
- SWE-Bench: Fixed Shell. Standard terminal-based actions (Bash/Git).
- OSWorld: Native GUI. Hard-coded mouse/keyboard coordinate actions.
- GAIA: None (BYOT). Users must build tools (Search, PDF reader) from scratch.

Integration Effort
- WebArena: High. Requires solving rigid networking (port mapping) and building a perception bridge to the HTML.
- SWE-Bench: Moderate. Standard Docker orchestration; agents speak terminal natively.
- OSWorld: Moderate. Authors provide the VM/API, but vision-to-action grounding is a high hurdle.
- GAIA: High. The researcher must provide a broad range of tools.

Scalability Bottleneck
- WebArena: State Reset Latency. VMs require high resources; the benchmark recommends sequential evaluation of all tasks.
- SWE-Bench: Disk I/O Churn. Constant building/pulling of unique task images saturates the filesystem.
- OSWorld: RAM & Snapshots. 20 GB+ memory per agent and heavy disk I/O for state resets.
- GAIA: API & Rate Limits. Capped by external search quotas and LLM token costs.

This separation of concerns is deliberate. A benchmark author implements their environment once in Python, exposing the required methods as a standard class. The CUBE framework automatically provides the RPC wrapper, meaning the benchmark becomes immediately usable both locally and remotely without additional work. A platform developer can point their harness at any CUBE benchmark and immediately begin evaluation, choosing between local Python instantiation for performance or an RPC connection for flexibility. A researcher can filter and install benchmarks based on their available compute resources, without reading documentation or reverse-engineering setup scripts.

3.1. Task-Level Interface

The task-level interface defines how agents interact with individual task instances. At its core, an agent needs to observe the environment's state, execute actions, and receive feedback on its progress. The Gym interface (Towers et al., 2024; Brockman et al., 2016) established this pattern for RL, but modern agent benchmarks introduce new requirements that demand extensions to the traditional blocking step function.

The Async Problem: Consider an agent navigating a research task that requires web search. When the agent calls a search API, waiting three seconds for results while blocking all other operations is inefficient. The agent should be able to plan its next move, process partial results, or manage multiple concurrent tool calls. Benchmarks like ARE (Froger et al., 2025) and GAIA-2 (Lab et al., 2026) explicitly require this asynchronous capability, where agents coordinate multiple long-running operations without blocking. The traditional Gym step function, designed for synchronous state transitions, cannot support these patterns.
MCP + Gym Fusion: CUBE addresses this by building on the Model Context Protocol (MCP), which already defines a non-blocking tools/call API for asynchronous action execution. MCP also handles automatic action-space discovery through tools/list (Gym also provides action_space, but MCP's format is more aligned with LLM expectations), eliminating the need for agents to know tool signatures in advance. We augment MCP with Gym-style evaluation semantics: cube/evaluate returns reward and termination status, cube/reset reinitializes tasks with optional seeding, and cube/close handles cleanup. This fusion creates a superset where both established APIs can be recovered, and a new async-Gym API supports non-blocking interaction patterns.

Tool Configuration: Many benchmarks are designed agnostically to the specific tools used for task completion. For example, WebArena (Zhou et al., 2024) defines the environment, task descriptions, and evaluation functions, but leaves browser interaction mechanisms to the agent designer. This design choice enables comparative evaluation of different browser automation tools. However, without specified tools, benchmarks cannot expose Gym or MCP interfaces. CUBE resolves this by requiring benchmark authors to wrap their environments with default tools, enabling immediate usability. To preserve tool variability for research, benchmarks accept a tool_config parameter at initialization (see Section 3.3), allowing researchers to substitute alternative tool implementations without modifying the benchmark code.

Benchmarks provide domain-appropriate tools matched to their environment types. Browser-based benchmarks ship with web automation tools, coding benchmarks provide shell access, and GUI benchmarks offer mouse and keyboard control.
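The fused task-level surface can be pictured with a toy in-process task. The sketch below is illustrative only: the EchoTask class, its echo tool, and the Python method names stand in for the tools/list, tools/call, cube/reset, cube/step, and cube/evaluate semantics described above; they are not the normative CUBE base classes.

```python
# A minimal sketch of the MCP + Gym fusion, assuming a hypothetical
# in-process task object. Method names mirror the RPC endpoints.

class EchoTask:
    """Toy task: the agent must call the 'echo' tool with the target word."""

    def __init__(self, target="cube"):
        self.target = target
        self.last_call = None

    # --- MCP side: self-describing, discoverable action space ---
    def tools_list(self):
        # Mirrors MCP tools/list: agents need no prior tool signatures.
        return [{"name": "echo", "args": {"text": "string"}}]

    def tools_call(self, name, args):
        # Mirrors MCP tools/call; a real implementation could return a
        # handle immediately and resolve asynchronously.
        self.last_call = (name, args)
        return {"content": [args.get("text", "")], "isError": False}

    # --- Gym side: evaluation semantics layered on top ---
    def reset(self, seed=None):
        # Mirrors cube/reset with optional seeding (unused in this toy).
        self.last_call = None
        obs = {"instruction": f"Call echo with the word '{self.target}'."}
        return obs, {}

    def evaluate(self):
        # Mirrors cube/evaluate: reward and termination from env state.
        success = (self.last_call is not None
                   and self.last_call[0] == "echo"
                   and self.last_call[1].get("text") == self.target)
        return {"obs": {}, "reward": 1.0 if success else 0.0,
                "terminated": success, "truncated": False, "info": {}}

    def step(self, action):
        # Mirrors cube/step: convenience = tools/call + cube/evaluate.
        self.tools_call(action["name"], action["args"])
        return self.evaluate()


task = EchoTask()
obs, info = task.reset(seed=0)
result = task.step({"name": "echo", "args": {"text": "cube"}})
```

Note how the pure-MCP agent (tools_list/tools_call) and the pure-Gym harness (reset/step/evaluate) each see a familiar API on the same object, which is the "superset" property claimed above.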
Researchers can reconfigure these tools at benchmark initialization through the tool_config parameter (see Section 3.3), enabling experiments with different tool implementations without modifying the benchmark itself.

Privileged Information for Evaluation: Beyond basic task execution, CUBE supports scalable evaluation infrastructure through privileged information. As task repositories expand, manual analysis of agent behavior becomes infeasible, leading practitioners to rely on automated judge-based evaluation for large-scale failure analysis. However, LLM-based judges suffer from well-documented limitations (Lù et al., 2025) and frequently misidentify failure root causes. To enhance judge accuracy, we standardize the communication of privileged information at both the task and step levels through the info field. This information may include the evaluation function's source code, ground-truth answers, or concise summaries of the environment's internal state. Benchmark designers curate this optional field to facilitate more accurate failure diagnosis. Beyond evaluation, privileged information enables privileged policy distillation during training, where a student policy learns from a teacher with access to additional context, and it aids in identifying benchmarks with erroneous evaluation logic or ambiguous task specifications.

Table 2 details the complete API. The MCP methods handle action execution and tool discovery. The CUBE-specific methods extend this with evaluation semantics and privileged information access.

Figure 1. Task-level diagram for CUBE's API. Left: Separation between tasks and tools, and the possibility to reconfigure the tools; only the Gym API is exposed. Right: By implementing base classes, CUBE's implementation can automatically expose an RPC layer, including the well-known MCP API; on the client side, the Gym API is exposed.

3.2. Benchmark-Level Interface

The benchmark-level interface (also available as both Python methods and RPC endpoints) manages shared infrastructure and orchestration. While the task-level interface handles individual agent-environment interactions, many benchmarks require shared infrastructure that spans multiple tasks. WebArena (Zhou et al., 2024), for instance, deploys a persistent set of web services (GitLab, e-commerce sites, forums) that form a coherent "micro-internet". OSWorld (Xie et al., 2024) maintains a full desktop operating system with pre-installed applications. These benchmark-level resources are expensive to initialize and are designed to be reused across many task instances. The benchmark-level interface exists to manage this shared infrastructure and to provide discovery and orchestration capabilities. Table 3 specifies the required methods.

The cube/info endpoint returns metadata about the benchmark, including its name, version, and resource requirements. The cube/tasks method lists available tasks, supporting pagination and filtering for benchmarks with large task sets.

Task instantiation occurs through cube/spawn, which accepts a task identifier and an optional random seed. The seed parameter is required for benchmarks with stochastic task generation or variable initial states. For instance, a benchmark might generate synthetic websites with randomized layouts, requiring seed control for reproducibility. Upon spawning, the benchmark returns a URL endpoint for the new task instance. This endpoint exposes the task-level API described in Section 3.1. The separation between benchmark and task endpoints enables efficient resource sharing.
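The discovery-and-orchestration flow can be sketched as a small in-memory object. Everything here is hypothetical: the ToyBenchmark class, its task names, and the cube:// endpoint scheme are stand-ins for the cube/info, cube/tasks, cube/spawn, cube/status, and cube/shutdown endpoints of Table 3.

```python
# Illustrative sketch of the benchmark-level orchestration API, assuming a
# made-up ToyBenchmark and an invented cube:// endpoint format.
import itertools

class ToyBenchmark:
    def __init__(self):
        self._ids = itertools.count(1)
        self.active = {}  # session id -> spawned task state

    def info(self):
        # cube/info: benchmark metadata and resource requirements.
        return {"name": "toy-bench", "version": "0.1.0", "ram_gb": 1}

    def tasks(self, filter=None):
        # cube/tasks: enumerate tasks with optional filtering
        # (pagination elided in this toy).
        all_tasks = ["task-easy", "task-hard"]
        return [t for t in all_tasks if filter is None or filter in t]

    def spawn(self, task_id, seed=None):
        # cube/spawn: create an isolated instance and return its endpoint,
        # which would expose the task-level API of Section 3.1.
        session = next(self._ids)
        self.active[session] = {"task_id": task_id, "seed": seed}
        return f"cube://localhost/{session}"

    def status(self):
        # cube/status: health of running instances.
        return [{"session": s, "task_id": v["task_id"], "healthy": True}
                for s, v in self.active.items()]

    def shutdown(self, session_id=None):
        # cube/shutdown: terminate one instance, or all when omitted.
        if session_id is None:
            self.active.clear()
        else:
            self.active.pop(session_id, None)


bench = ToyBenchmark()
endpoint = bench.spawn("task-easy", seed=42)
```

A harness would loop over tasks(), spawn() each with a seed, connect agents to the returned endpoints, and call shutdown() when done; the shared infrastructure lives in the benchmark object, while per-session state stays isolated.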
A single benchmark server can manage dozens of concurrent task instances, each with its own isolated state, while sharing the underlying infrastructure. The task-level API naturally supports asynchronous execution patterns and multi-agent scenarios: multiple agents can interact with the same task instance or with separate instances simultaneously, with the benchmark coordinating state updates and turn-taking as needed.

The cube/status method provides health monitoring for active tasks, reporting resource usage and connection status. Finally, cube/shutdown handles cleanup, accepting an optional session identifier to terminate specific tasks or, if omitted, shutting down all active instances.

3.3. Package-Level Standard

The main responsibility of the package-level standard is to expose a hook for starting the RPC server (if requested) and initializing the common resources. After package installation, users or harnesses can use the following Python API:

import my_cube

benchmark = my_cube.Benchmark()
benchmark.start(available_ports, tool_config)

A corresponding command-line interface is also exposed for non-Python access.

Table 2. Task-Level API. Methods available as both Python class methods (e.g., task.reset()) and RPC endpoints (e.g., cube/reset). The first part is the well-known MCP protocol. The second part adds observation, step, and evaluation functions aligned with the Gym API.

| Method | Namespace | For | Arguments | Returns | Description |
|---|---|---|---|---|---|
| tools/list | MCP | Agent | none | Tool[] | List actions |
| tools/call | MCP | Agent | { name, args } | { content[], isError } | Execute action |
| resources/list | MCP | Agent | none | Resource[] | List resources |
| resources/read | MCP | Agent | { uri } | { contents[] } | Read obs/task |
| cube/evaluate | CUBE | Harness | none | { obs, reward, terminated, truncated, info } | Eval state |
| cube/reset | CUBE | Harness | { seed? } | { obs, info } | Reset task |
| cube/step | CUBE | Agent | { action } | { obs, reward, terminated, truncated, info } | Execute action and evaluate |
| cube/close | CUBE | Harness | none | void | Cleanup |
| cube/privileged_info | CUBE | Harness | none | String | Privileged context for judge-based evaluation |

Table 3. Benchmark-Level API. Methods available as both Python class methods and RPC endpoints for discovering benchmarks, listing tasks, and orchestrating task instances. Each spawned task exposes its own Task-Level API endpoint.

| Method | Namespace | For | Arguments | Returns | Description |
|---|---|---|---|---|---|
| cube/info | CUBE | Harness | none | BenchmarkInfo | Benchmark metadata |
| cube/tasks | CUBE | Harness | { filter? } | TaskList | List available tasks |
| cube/spawn | CUBE | Harness | { task_id, seed? } | url | Start task, return endpoint |
| cube/status | CUBE | Harness | none | TaskStatus[] | Health of running tasks |
| cube/shutdown | CUBE | Harness | { session_id? } | void | Cleanup all or specific |

Separating What from How: CUBE separates what a benchmark requires from how those resources are provisioned. Benchmark authors declare resource requirements through typed configuration objects (VMConfig, ContainerConfig), while harness operators supply a matching backend that handles provisioning. Pluggable backends cover the full range of compute environments, from local development to cloud containers, major cloud VM providers, and SLURM-based HPC clusters. Switching from local to cloud requires changing a single backend configuration object, with no benchmark code changes. CUBE also distinguishes benchmark-level shared resources (e.g., the persistent web server that WebArena shares across all task instances) from task-level resources (e.g., a per-task container).
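The "what vs. how" split can be sketched with two declaration types and two interchangeable provisioners. VMConfig and ContainerConfig are names from the text above; the provision() protocol, the backend classes, and the returned handle strings are our own illustration, not the CUBE reference implementation.

```python
# Sketch of declared requirements ("what") vs. pluggable provisioning
# ("how"), assuming invented backend classes and handle formats.
from dataclasses import dataclass

@dataclass
class ContainerConfig:          # "what": a per-task container requirement
    image: str
    ram_gb: int = 2

@dataclass
class VMConfig:                 # "what": a shared virtual machine
    image: str
    ram_gb: int = 16

class LocalBackend:
    # "how": one possible provisioner for local development.
    def provision(self, cfg):
        kind = "container" if isinstance(cfg, ContainerConfig) else "vm"
        return f"local-{kind}:{cfg.image}"

class CloudBackend:
    # "how": a drop-in replacement; the benchmark's declaration is
    # untouched when the operator swaps backends.
    def provision(self, cfg):
        kind = "container" if isinstance(cfg, ContainerConfig) else "vm"
        return f"cloud-{kind}:{cfg.image}"


requirement = ContainerConfig(image="swe-bench-task", ram_gb=4)
local_handle = LocalBackend().provision(requirement)
cloud_handle = CloudBackend().provision(requirement)
```

The key property is that the benchmark only ever constructs a config object; moving from a laptop to a cluster means handing the same object to a different backend.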
Benchmark authors initialize shared resources once in Benchmark.setup(), making them available through a RuntimeContext passed to each task.

Debug Tasks and Debug Agent: To ensure correctness and enable continuous integration, every CUBE package must expose two additional elements at the Python level: get_debug_task_configs(), which returns a small set of representative task configurations with known correct behavior, and make_debug_agent(task_id), which returns a scripted agent guaranteed to solve a given debug task. These primitives allow any consumer of the benchmark, or the CUBE compliance suite itself, to run a full episode end-to-end and assert that the reward reaches 1.0, without requiring a live language model. This makes CUBE benchmarks testable in standard CI pipelines.

Stress Testing and Compliance: Beyond basic correctness, CUBE defines a stress test suite that validates benchmark behavior under parallel load and resource constraints. This includes verifying that task resets are idempotent, that concurrent task instances remain isolated, and that resource usage stays within declared bounds. A benchmark that passes the stress suite earns a compliance badge visible in the registry, giving platform developers confidence before large-scale integration.

3.4. CUBE Registry

The CUBE Registry serves as a centralized discovery mechanism for available benchmarks. Without a registry, researchers must rely on word-of-mouth, social media announcements, or manual literature searches to find relevant evaluation environments. This creates a significant barrier for newer benchmarks, which may remain unknown despite their technical merit simply because they lack visibility in the community. Table 4 specifies the metadata required for each registered benchmark.
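The debug-task mechanism can be illustrated with a trivially scriptable task. The two function names come from the text above; the task contents, the agent internals, and the run_episode() helper are invented for illustration.

```python
# Sketch of how get_debug_task_configs / make_debug_agent enable CI
# without a live LLM. Task and episode internals are hypothetical.

def get_debug_task_configs():
    # A small set of representative tasks with known-correct behavior.
    return [{"task_id": "debug-add", "question": (2, 3), "answer": 5}]

def make_debug_agent(task_id):
    # Returns a scripted policy guaranteed to solve the given debug task.
    assert task_id == "debug-add"
    return lambda obs: obs["a"] + obs["b"]

def run_episode(config, agent):
    # Stand-in for a full reset/step/evaluate loop over the task.
    obs = {"a": config["question"][0], "b": config["question"][1]}
    action = agent(obs)
    return 1.0 if action == config["answer"] else 0.0

# The compliance suite (or any CI pipeline) can then assert full success:
for cfg in get_debug_task_configs():
    agent = make_debug_agent(cfg["task_id"])
    assert run_episode(cfg, agent) == 1.0
```

Because the scripted agent is deterministic and model-free, this check is cheap enough to run on every commit.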
Beyond basic identification fields like name and version, the registry captures critical operational information that enables automated filtering. The runtime field specifies the deployment model (Docker, Apptainer, VM, or live internet access), allowing researchers to immediately exclude benchmarks incompatible with their infrastructure. The hardware object details resource requirements, preventing researchers from attempting to run memory-intensive benchmarks on constrained systems.

The registry also addresses legal and compliance concerns that often block benchmark adoption. The package_license and benchmark_license fields distinguish between the wrapper code license and the underlying task data license, as these often differ. The content_notice field warns about special considerations, such as benchmarks containing cloned websites or copyrighted materials.

Crucially, the registry does not host benchmark code or data. It simply indexes metadata and points to standard distribution platforms like PyPI via the package field. This design keeps the registry lightweight and ensures that benchmark authors retain full control over their distributions. To ensure quality, registering a benchmark triggers a GitHub job that verifies compliance.

This automated discovery mechanism democratizes benchmark visibility. A graduate student publishing a novel benchmark can register it once and immediately make it discoverable to the entire community, without requiring social media threads, blog posts, or conference presentations to gain adoption. This enables broader evaluation across diverse benchmark suites (Liu et al., 2023; Ma et al., 2024; Wang et al., 2024a) and tool-use scenarios (Wang et al., 2025; Lei et al., 2025; Luo et al., 2025).

3.5. Python-First Design with RPC Fallback

CUBE supports both local (same-process) and remote (RPC) execution through a unified interface.
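Automated filtering over the registry metadata can be sketched directly against the field shapes of Table 4. The entries below are fabricated example records, not real registry contents, and the compatible() helper is our own illustration.

```python
# Sketch of registry-driven filtering, assuming fabricated entries that
# follow the runtime/hardware field shapes of Table 4.

registry = [
    {"id": "toy-web-bench", "runtime": "docker",
     "hardware": {"ram_gb": 8, "gpu": False, "disk_gb": 20}},
    {"id": "toy-os-bench", "runtime": "vm",
     "hardware": {"ram_gb": 24, "gpu": False, "disk_gb": 100}},
    {"id": "toy-live-bench", "runtime": "live",
     "hardware": {"ram_gb": 2, "gpu": False, "disk_gb": 5}},
]

def compatible(entry, max_ram_gb, allowed_runtimes):
    # Exclude benchmarks the local infrastructure cannot run, using only
    # declared metadata (no documentation reading required).
    return (entry["hardware"]["ram_gb"] <= max_ram_gb
            and entry["runtime"] in allowed_runtimes)

# A researcher on a 16 GB machine with Docker but no VM support or
# internet egress keeps only the first entry:
runnable = [e["id"] for e in registry
            if compatible(e, max_ram_gb=16, allowed_runtimes={"docker"})]
```

The same predicate style extends naturally to the license, compliance, and content_notice fields, e.g. excluding benchmarks whose data license forbids commercial use.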
Benchmark authors implement Python classes; CUBE auto-generates RPC servers exposing identical methods over HTTP. Switching modes requires only connection changes.

Local execution eliminates serialization, reducing latency for high-frequency RL loops and enabling unified debugging. Remote execution handles: (1) non-containerizable production benchmarks, (2) cross-platform scenarios, (3) multi-language ecosystems, and (4) fault isolation.

3.6. Adoption Strategy

CUBE faces a classic two-sided adoption challenge: platforms will hesitate to implement the standard without a critical mass of compliant benchmarks, and benchmark authors will hesitate to wrap their environments without platform demand. We break this deadlock by recruiting an initial consortium of early platform supporters who are committed to implementing reference connectors, while simultaneously wrapping a high-value corpus that provides those platforms with immediate utility.

Call for Collaboration: The authorship of this proposal spans major technology companies, academic laboratories, and startups, organizations that independently converged on the Integration Tax as a bottleneck to their own research. This convergence is itself evidence that the problem demands a coordinated response rather than competing proprietary solutions.

Initial Implementation: We propose to deliver a reference architecture of the CUBE standard as a starting point for community feedback, alongside an initial corpus of wrapped benchmarks spanning web navigation, software engineering, and desktop environments. By building reference connectors for platforms such as NVIDIA NeMo Gym and OpenEnv, we ensure that this corpus is immediately consumable in existing training and evaluation infrastructure. This dual initiative of benchmarks and connectors is designed to demonstrate the standard's value and enable immediate collaboration with the broader community.
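The local/RPC duality can be sketched with a toy environment reachable through two transports. The Counter class, the RpcProxy, and the connect() helper are all hypothetical; JSON round-tripping stands in for CUBE's real HTTP serialization layer.

```python
# Sketch of the Python-first / RPC-fallback duality, assuming an invented
# toy environment and a JSON stand-in for the wire transport.
import json

class Counter:
    """Toy 'benchmark' with one method, implemented once in Python."""
    def __init__(self):
        self.n = 0

    def step(self):
        self.n += 1
        return {"obs": self.n}

class RpcProxy:
    """Stand-in for CUBE's auto-generated RPC layer: same method surface,
    but every call round-trips through a serialized wire format."""
    def __init__(self, backend):
        self.backend = backend

    def step(self):
        # JSON round-trip mimics cross-process serialization overhead,
        # which local in-process execution avoids entirely.
        wire = json.dumps(self.backend.step())
        return json.loads(wire)

def connect(local=True):
    # Switching modes changes only how we connect, never the call sites.
    return Counter() if local else RpcProxy(Counter())


local_obs = connect(local=True).step()["obs"]
remote_obs = connect(local=False).step()["obs"]
```

Calling code is identical in both modes, which is what lets a harness choose in-process execution for low-latency RL loops and RPC for isolation without touching benchmark logic.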
Direct Benchmark Cr eator Outreach W e will directly engage authors of recently published benchmarks, of fering integration support and highlighting the viability benefits of registry inclusion. For benchmark authors, CUBE solves the discov ery problem: wrap once, and gain immediate access to the ecosystems of e very training and evolution platform with CUBE-support. W e will provide inte gration templates and registry submission assistance to minimize barriers, targeting critical mass by the end of 2026. 6 CUBE: A Standard for Unifying Agent Benchmarks T able 4. Registry Fields. Each registered benchmark exposes this metadata for discov ery , installation, and compliance verification. Field T ype Description id string Unique identifier (e.g., webarena-verified ) name string Human-readable name version string Semantic version (e.g., 1.2.0 ) authors string[] Package authors paper string? Related paper URL (if any) package string PyPI package name for pip install benchmark license string Benchmark data/tasks license (e.g., CC-BY-NC-4.0 ) content notice string? Copyright warning (e.g., "Contains cloned websites" ) compliance string[] Compliance badges (e.g., ["no-docker-root", "task-isolated"] ) runtime enum docker | apptainer | vm | docker-root | docker-in-docker | live hardware object { ram gb, gpu, disk gb } task count int Number of tasks in benchmark T able 5. Comparison of CUBE with existing agent benchmark frame works, based on features and documentation at the time of writing. 
| Feature | CUBE | NeMo Gym | AgentBeats | OpenEnv | Harbor |
|---|---|---|---|---|---|
| Primary Focus | Protocol standard for wrapping benchmarks once, usable for evaluation, RL training, and data generation | Infrastructure to develop environments and scale rollout collection, battle-tested in Nemotron 3 | Evaluation orchestration; benchmarks become judge agents assessing subject agents via A2A and MCP | Framework for creating and sharing new RL environments; standardizes the environment side of post-training | Evaluation and RL rollout framework; adapts existing benchmarks into a standard container-based format |
| Coverage | 9 CUBEs (early stage); wraps any benchmark type, including shared-infrastructure and VM-based ones | 40+ environments across math, coding, tool use, and safety, plus integrations with other environment libraries and benchmarks | 250+ benchmarks across 17 domains, covering both static datasets (GAIA, SWE-Bench) and interactive environments | 30+ environments across coding, games, web, and simulation; designed for new environment creation | 46+ adapters for established benchmarks (SWE-Bench, Terminal-Bench, GPQA, ARC-AGI-2) with parity validation |
| Agent Interface | MCP `tools/call` for actions; Gym-style `cube/reset`, `cube/step`, `cube/evaluate` | Tools contract through the OpenAI Responses API spec; flexible agent scaffolding, easy to plug into any environment; Gym-like APIs implementable through HTTP servers | A2A for task delegation; MCP for tool access; judge + subject + delegator roles | HTTP Gym API (`reset`/`step`/`state`); MCP tool-calling interface for agent-environment interaction | Wraps full coding agents (Claude Code, OpenHands, etc.); MCP servers configurable per task; ATIF trajectory format |
| RL Training | Gym-compatible `evaluate`/`reset`/`step`; Python in-process for low-latency loops | Ray-powered rollout; NeMo RL, OpenRLHF, Unsloth integrations | Gym-to-MCP bridge available; primarily an evaluation platform | TRL, TorchForge, Unsloth, SkyRL integrations; composable Rubric reward system | QueueOrchestrator for dynamic RL loops; ATIF captures token IDs and logprobs |
| Adding a New Benchmark | Implement a Python class once; works across all CUBE-compatible platforms via thin connectors | Supports integrating benchmarks and creating new environments via separation of concerns between the Agent Harness and Environment Resources | Implement a judge agent (A2A + MCP) | Design a Docker environment with FastAPI and MCP tools; supports both new environments and wrappers around existing frameworks | Write `task.toml` + `instruction.md` + `Dockerfile` + `test.sh`; Harbor-specific |
| Scale & Deployment | CUBE-harness uses Ray for parallel rollout; benchmarks declare resource requirements (what), while pluggable backends dispatch to local or cloud infrastructure (how) | Async-first, server-based design; Ray for thousands of parallel rollouts | Five operation modes from local development to hosted deployment; GitHub Actions CI used for competition leaderboards | Docker locally; HuggingFace Spaces for sharing; Kubernetes for scaling | Docker, Daytona, Modal, E2B, GKE; QueueOrchestrator for parallel trials |
| Registry & Discovery | Structured metadata: licenses, compliance badges, hardware requirements, runtime type | Structured metadata: licenses, environment profiling, description, value; environments on GitHub, datasets on HuggingFace | Registry for judge and subject agents; leaderboard and web UI | HuggingFace Hub with `from_hub()` discovery; tool registry planned | harborframework.com; `dataset@version` versioning; 46+ curated adapters |

4. Related Work

The problem of benchmark fragmentation and the need for unified evaluation infrastructure have been recognized by several communities, leading to a variety of platforms that attempt to address different aspects of this challenge (Bandel et al., 2026b).
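Before surveying these platforms in detail, it helps to make the proposed contract concrete. The sketch below shows what a minimal CUBE-style benchmark might look like in Python, pairing the Gym-compatible `reset`/`step`/`evaluate` semantics listed for CUBE in Table 5 with registry metadata drawn from the fields of Table 4. All class names, method signatures, and metadata values here are illustrative assumptions, not the normative interface: CUBE is a draft standard, and the actual API is subject to community revision.

```python
from dataclasses import dataclass

# Registry metadata mirroring the fields of Table 4 (all values illustrative).
REGISTRY_ENTRY = {
    "id": "toy-echo-bench",           # unique identifier
    "name": "Toy Echo Benchmark",
    "version": "0.1.0",               # semantic version
    "authors": ["example-author"],
    "paper": None,                    # optional related-paper URL
    "package": "toy-echo-bench",      # PyPI package name
    "license": "CC-BY-NC-4.0",
    "content_notice": None,
    "compliance": ["task-isolated"],
    "runtime": "docker",
    "hardware": {"ram_gb": 1, "gpu": False, "disk_gb": 1},
    "task_count": 1,
}


@dataclass
class StepResult:
    observation: str
    done: bool


class ToyEchoBenchmark:
    """Hypothetical CUBE-style benchmark: the agent must repeat the prompt.

    The method names follow the Gym-compatible reset/step/evaluate contract
    listed for CUBE in Table 5; an MCP server would expose them as the
    cube/reset, cube/step, and cube/evaluate methods.
    """

    def reset(self, task_id: str) -> str:
        # Initialize an episode and return the initial observation.
        self._target = f"echo:{task_id}"
        self._last_action = None
        return f"Please repeat exactly: {self._target}"

    def step(self, action: str) -> StepResult:
        # A one-shot task: record the agent's action and end the episode.
        self._last_action = action
        return StepResult(observation="submitted", done=True)

    def evaluate(self) -> float:
        # Reward 1.0 if the agent echoed the target string, else 0.0.
        return 1.0 if self._last_action == self._target else 0.0


if __name__ == "__main__":
    bench = ToyEchoBenchmark()
    bench.reset("task-1")
    bench.step("echo:task-1")
    print(bench.evaluate())  # prints 1.0
```

The point of the sketch is its size: a platform-side connector needs only these three calls, plus the registry metadata for discovery and provisioning, to run evaluation or collect RL rollouts, which is the "wrap once, use everywhere" property the standard targets.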
We organize our discussion by first examining domain-specific platforms that unify benchmarks within a particular task category, then surveying broader platforms that attempt cross-domain coverage, before presenting a detailed comparison of how these approaches relate to CUBE.

4.1. Domain-Specific Unification Efforts

Several platforms have emerged to standardize evaluation within specific task categories. BrowserGym (de Chezelles et al., 2025) provides a unified Gym-like environment for web agent research, integrating multiple web agent benchmarks under a common observation and action space. This ecosystem has significantly reduced fragmentation within the web agent community and demonstrated the value of standardized interfaces. However, BrowserGym is inherently limited to browser-based tasks and does not extend to other agent modalities such as terminal environments, desktop GUI control, or multi-modal reasoning tasks. Similarly, CUA-Bench (Team, 2025) and related computer-use benchmarks focus specifically on desktop GUI interactions, while coding agent benchmarks like SWE-bench (Jimenez et al., 2024) maintain their own evaluation harnesses. Each of these efforts represents valuable progress within its respective domain, but the proliferation of domain-specific standards compounds rather than resolves the overall fragmentation problem.

4.2. Broader Platform Efforts

Several platforms provide infrastructure for agent evaluation or training with some level of generality.

NeMo Gym. NVIDIA's NeMo Gym (NVIDIA, 2025) provides a high-performance suite of RL training environments for domain-specific LLM tasks, including mathematics, science, coding, and tool use.
Rather than a specific environment interface, it introduces a highly scalable architecture that enforces a separation of concerns between the Agent Scaffolding/Harness, the Environment Resources, and the LLM Model APIs. By decoupling these components, NeMo Gym allows for independent scaling of compute-intensive models and complex environment simulations. NeMo Gym's contribution to standardization is mainly at the training-harness layer, making it complementary to CUBE's focus on benchmark packaging and cross-platform portability.

AgentBeats. AgentBeats (AgentBeats Team and Berkeley RDI, 2025) proposes an Agentified Agent Assessment (AAA) paradigm in which benchmarks are realized as judge agents that evaluate subject agents through standardized A2A and MCP protocols, reducing N x M agent-benchmark integrations to N + M protocol-level ones. The platform supports five operation modes from local development to hosted deployment, and hosts a public registry of assessments spanning coding, web, and multi-agent domains. These layers are composable: an AgentBeats judge agent could consume a CUBE-compliant benchmark through a thin connector, combining CUBE's portable infrastructure packaging with AgentBeats' evaluation protocol.

OpenEnv. Meta and Hugging Face's OpenEnv (Meta PyTorch Team and Hugging Face, 2025) provides a Gymnasium-style framework for creating and sharing RL training environments via a centralized HuggingFace Hub, with native integrations into TRL, TorchForge, Unsloth, SkyRL, and other training frameworks. An MCP tool-calling interface is available for agent-environment interaction, and a composable Rubric reward system is in active development. OpenEnv covers both new environments and wrappers around existing frameworks, including BrowserGym benchmarks (WebArena, VisualWebArena, WorkArena).
Benchmarks requiring shared infrastructure must be provisioned externally by the user via environment variables; OpenEnv provides no lifecycle management for shared services or VM snapshots.

Harbor. Harbor (Shaw, 2025) emerged from Terminal-Bench as a framework for evaluating agents and generating RL rollouts in container environments, with cloud deployment via Daytona, Modal, E2B, GKE, and Runloop. The framework integrates benchmarks through adapters validated via parity experiments, and introduces the ATIF trajectory format capturing token IDs, logprobs, and tool definitions for RL and SFT pipelines; a QueueOrchestrator supports dynamic parallel rollout loops. MCP servers are configurable per task, and Harbor is designed to wrap full coding agents (Claude Code, OpenHands, Codex CLI, and others) rather than exposing a raw execution API. The per-task container model does not include lifecycle management for persistent shared infrastructure across tasks, which is reflected in the current adapter catalog not including benchmarks such as WebArena or OSWorld.

HAL. The Holistic Agent Leaderboard (HAL) (Kapoor et al., 2025) from Princeton provides cost-controlled benchmarking across coding, web, science, and customer service domains. Its three-dimensional analysis (models, scaffolds, benchmarks) and LLM-aided log inspection offer valuable insights. However, it serves as an evaluation leaderboard rather than a training infrastructure, without Gym-compatible semantics, MCP-native tools, or support for RL rollout generation.

Exgentic. Exgentic, introduced in General Agent Evaluation, is a practical framework for evaluating general agents across heterogeneous benchmarks through a Unified Protocol that mediates between agent interfaces and benchmark protocols.
It is designed so that any supported agent can be run on any supported benchmark task while preserving native agent and benchmark behavior through external adaptors rather than intrusive modifications. The framework emphasizes scalable, reproducible evaluation, with support for parallel execution, isolated runs, standardized trajectories and cost reports, and an open general-agent leaderboard (Bandel et al., 2026a;b;c). While Exgentic focuses on translating between heterogeneous existing protocols to enable unified evaluation, it does not primarily aim to specify the benchmark interface standard that should be adopted going forward; this is the layer CUBE targets.

4.3. Comparison Summary

Table 5 presents a detailed comparison of these platforms across key dimensions. The fundamental insight is that existing platforms have evolved from specific niches: NeMo Gym from RL training, AgentBeats from competition infrastructure, OpenEnv from the HuggingFace ecosystem, Harbor from SWE evaluation, and HAL from academic benchmarking. Each serves its origin community well, and together they address the agentic stack from complementary angles. The benchmark packaging and infrastructure lifecycle layer (how benchmarks declare resource requirements, manage shared services across tasks, and expose a portable interface independent of any particular harness) is the specific gap CUBE addresses: by defining a minimal interface contract that any platform can implement, CUBE enables benchmarks to be wrapped once and used everywhere, regardless of whether the downstream application is evaluation, training, or data generation.

5. Alternative Views

The Status Quo: Let Market Forces Decide. The most common alternative maintains the current competitive landscape, letting natural selection determine dominance. However, platforms are already fragmenting by focus (evaluation vs. training, domain-specific vs. general), suggesting that no single winner will emerge but rather multiple platforms adopted by research area. As noted in Section 1, without a shared interface, this competition skews toward integration catalog size, distracting from innovation while producing a divided ecosystem where platform choice determines benchmark access rather than technical merit. The "wait for a winner" strategy may simply result in permanent fragmentation organized by subfield rather than true consolidation.

Lighter-Weight Alternatives. Rather than a comprehensive standard, the community could adopt lighter-weight solutions such as converter libraries that translate between existing platform formats, or middleware layers that provide adapters without requiring changes to underlying benchmarks. Alternatively, the focus could shift from integration to curation, where the community selects a small canonical set of benchmarks that provides sufficient coverage of agent capabilities, eliminating the integration scaling problem by simply reducing the number of targets. This approach acknowledges that not all benchmarks need to be equally accessible and that research progress may be better served by depth on a few well-understood tasks than by breadth across hundreds of environments. Yet converter libraries still require someone to write N-to-M translation layers, merely shifting the integration burden rather than eliminating it, and they introduce additional failure modes and maintenance overhead. Benchmark curation, while valuable, conflicts with the goal of building generalist agents that should succeed across diverse task distributions. Restricting evaluation to a small set risks overfitting our agent designs to those specific environments, and history suggests that canonical benchmark sets calcify and become divorced from real-world performance as the field advances.
Alternative Technical Designs. CUBE's specific design choices are debatable. Critics might prefer explicit async primitives as first-class features, a simpler standard with fewer abstraction layers, pure Gym without MCP, or message-passing over RPC. The benchmark/task interface separation and centralized registry add complexity that may be unnecessary for some use cases. These concerns are valid. However, the key insight is not that CUBE's design is optimal, but that some standard is necessary. Building on established protocols like MCP and Gym minimizes the learning curve. The design should evolve through community feedback, but waiting for perfection ensures we never escape current fragmentation.

6. The Path Forward: A Call to Action

The transition to a standard is never easy. It requires a collective agreement to prioritize interoperability over individual framework growth. However, the current trajectory of agent research is leading toward a fragmentation that will eventually stifle progress. We are spending too much time on DevOps and not enough time on AI.

We call on the authors of new evaluation platforms and benchmarks to join this effort. By adopting a standard like CUBE, we can ensure that every new benchmark is immediately available to the entire research community. We can create a world where a new breakthrough in agent architecture can be tested against hundreds of diverse environments within hours, rather than weeks.

The draft proposal presented here is just the beginning. We need a community-driven process to refine the API, define compliance levels, and build the registry. We invite researchers, developers, and platform owners to contribute to this discussion. The goal is shared infrastructure that lowers the barrier for everyone, not a mandate on how to do research.

References

AgentBeats Team and Berkeley RDI. AgentBeats: Towards agentified agent assessment (AAA). https://agentbeats.org, 2025. Platform for standardized agent evaluation using the Agentified Agent Assessment paradigm.

Bandel, E., Yehudai, A., Eden, L., Sagron, Y., Perlitz, Y., Venezian, E., Razinkov, N., Ergas, N., Ifergan, S. S., Shlomov, S., Jacovi, M., Choshen, L., Ein-Dor, L., Katz, Y., and Shmueli-Scheuer, M. General agent evaluation, 2026a. arXiv:2602.22953.

Bandel, E., Yehudai, A., Eden, L., Sagron, Y., Perlitz, Y., Venezian, E., Razinkov, N., Ergas, N., Shachor Ifergan, S., Shlomov, S., Jacovi, M., Choshen, L., Ein-Dor, L., Katz, Y., and Shmueli-Scheuer, M. General agent evaluation. In ICLR Blogposts, 2026b. URL https://iclr-blogposts.github.io/2026/blog/2026/general-agent-evaluation/.

Bandel, E., Yehudai, A., Lacoste, A., Ghosh, A., Neubig, G., Mitchell, M., Shmueli-Scheuer, M., and Choshen, L. Position: Agentic systems should be general. In ICLR 2026 Workshop on Agents in the Wild, 2026c. URL https://openreview.net/forum?id=CbJpizP0vJ.

Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Lu, Y., Wagle, J., Koishida, K., Bucker, A., Jang, L. K., and Hui, Z. Windows Agent Arena: Evaluating multi-modal OS agents at scale. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=W9s817KqYf.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. ArXiv, abs/1606.01540, 2016. URL https://api.semanticscholar.org/CorpusID:16099293.

Chen, J., Yuen, D., Xie, B., Yang, Y., Chen, G., Wu, Z., Yixing, L., Zhou, X., Liu, W., Wang, S., Zhou, K., Shao, R., Nie, L., Wang, Y., Hao, J., Wang, J., and Shao, K. SPA-Bench: A comprehensive benchmark for smartphone agent evaluation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OZbFRNhpwr.
de Chezelles, T. L. S., Gasse, M., Lacoste, A., Caccia, M., Drouin, A., Boisvert, L., Thakkar, M., Marty, T., Assouel, R., Shayegan, S. O., Jang, L. K., Lù, X. H., Yoran, O., Kong, D., Xu, F. F., Reddy, S., Neubig, G., Cappart, Q., Salakhutdinov, R., and Chapados, N. The BrowserGym ecosystem for web agent research. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=5298fKGmv3. Expert Certification.

Deng, X., Da, J., Pan, E., He, Y., Ide, C., Garg, K., Lauffer, N., Park, A., Pasari, N., Rane, C., Sampath, K., Krishnan, M., Kundurthy, S., Hendryx, S. M., Wang, Z., Zhang, C. B. C., Jacobson, N., Liu, B., and Kenstler, B. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? ArXiv, abs/2509.16941, 2025. URL https://api.semanticscholar.org/CorpusID:281421060.

Drouin, A., Gasse, M., Caccia, M., Laradji, I. H., Del Verme, M., Marty, T., Vazquez, D., Chapados, N., and Lacoste, A. WorkArena: How capable are web agents at solving common knowledge work tasks? In International Conference on Machine Learning, pp. 11642–11662. PMLR, 2024.

Froger, R., Do, V., Garreau, E., Gaya, J.-B., Laurençon, H., Lecanu, M., Malkan, K., Mekala, D., Mialon, G., Ménard, P., Bertran, G. M.-T., Piterbarg, U., Rita, M., Rusakov, A., Scialom, T., Wang, M., et al. ARE: Scaling up agent environments and evaluations, 2025. URL https://arxiv.org/abs/2509.17158.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Gaunt, K., Juefe-Xu, F., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations (ICLR), 2024.

Kapoor, S., Stroebl, B., Kirgis, P., Nadgir, N., Siegel, Z. S., Wei, B., Xue, T., Chen, Z., Chen, F., Utpala, S., Ndzomga, F., Oruganty, D., Luskin, S., Liu, K., Yu, B., Arora, A., Hahm, D., Trivedi, H., Sun, H., Lee, J., Jin, T., Mai, Y., Zhou, Y., Zhu, Y., Bommasani, R., Kang, D., Song, D., Henderson, P., Su, Y., Liang, P., and Narayanan, A. Holistic agent leaderboard: The missing infrastructure for AI agent evaluation, 2025. URL https://arxiv.org/abs/2510.11977.

Khatri, D., Madaan, L., Tiwari, R., Bansal, R., Duvvuri, S. S., Zaheer, M., Dhillon, I. S., Brandfonbrener, D., and Agarwal, R. The art of scaling reinforcement learning compute for LLMs. ArXiv, abs/2510.13786, 2025. URL https://api.semanticscholar.org/CorpusID:282102889.

Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. ArXiv, abs/2401.13649, 2024. URL https://api.semanticscholar.org/CorpusID:267199749.

Lab, M. A. et al. Gaia2: Benchmarking LLM agents on dynamic and asynchronous environments. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=9gw03JpKK4.

Lacoste, A. et al. CUBE: Common unified benchmark environments – reference implementation. https://github.com/The-AI-Alliance/cube-standard, 2026.

Lei, F., Yang, Y., Sun, W., and Lin, D. MCPVerse: An expansive, real-world benchmark for agentic tool use. ArXiv, abs/2508.16260, 2025. URL https://api.semanticscholar.org/CorpusID:280709049.

Li, B., Wu, W., Tang, Z., Shi, L., Yang, J., Li, J., Yao, S., Qian, C., Hui, B., Zhang, Q., Yu, Z., Du, H., Yang, P., Lin, D., Peng, C., and Chen, K. Prompting large language models to tackle the full software development lifecycle: A case study. In COLING, pp. 7511–7531, 2025. URL https://aclanthology.org/2025.coling-main.502/.
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Shen, S., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J. AgentBench: Evaluating LLMs as agents. ArXiv, abs/2308.03688, 2023. URL https://api.semanticscholar.org/CorpusID:260682249.

Luo, Z., Shen, Z., Yang, W., Zhao, Z., Jwalapuram, P., Saha, A., Sahoo, D., Savarese, S., Xiong, C., and Li, J. MCP-Universe: Benchmarking large language models with real-world model context protocol servers. ArXiv, abs/2508.14704, 2025. URL https://api.semanticscholar.org/CorpusID:280691759.

Lù, X. H., Kazemnejad, A., Meade, N., Patel, A., Shin, D., Zambrano, A., Stańczak, K., Shaw, P., Pal, C. J., and Reddy, S. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories, 2025. URL https://arxiv.org/abs/2504.08942.

Ma, C., Zhang, J., Zhu, Z., Yang, C., Yang, Y., Jin, Y., Lan, Z., Kong, L., and He, J. AgentBoard: An analytical evaluation board of multi-turn LLM agents. ArXiv, abs/2401.13178, 2024. URL https://api.semanticscholar.org/CorpusID:267199917.

Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., Shi, L., Shin, J. Y., Walshe, T., Buchanan, E. K., Shen, J., Ye, G., Lin, H., Poulos, J., Wang, M., Nezhurina, M., Jitsev, J., Lu, D., Mastromichalakis, O. M., Xu, Z., Chen, Z., Liu, Y., Zhang, R., Chen, L. L., Kashyap, A., Uslu, J.-L., Li, J., Wu, J., Yan, M., Bian, S., Sharma, V., Sun, K., Dillmann, S., Anand, A., Lanpouthakoun, A., Koopah, B., Hu, C., Guha, E., Dreiman, G. H. S., Zhu, J., Krauth, K., Zhong, L., Muennighoff, N., Amanfu, R., Tan, S., Pimpalgaonkar, S., Aggarwal, T., Lin, X., Lan, X., Zhao, X., Liang, Y., Wang, Y., Wang, Z., Zhou, C., Heineman, D., Liu, H., Trivedi, H., Yang, J., Lin, J., Shetty, M., Yang, M., Omi, N., Raoof, N., Li, S., Zhuo, T. Y., Lin, W., Dai, Y., Wang, Y., Chai, W., Zhou, S., Wahdany, D., She, Z., Hu, J., Dong, Z., Zhu, Y., Cui, S., Saiyed, A., Kolbeinsson, A., Hu, J., Rytting, C. M., Marten, R., Wang, Y., Dimakis, A., Konwinski, A., and Schmidt, L. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026.

Meta PyTorch Team and Hugging Face. OpenEnv: An interface library for RL post training with environments, 2025. URL https://github.com/meta-pytorch/OpenEnv. GitHub repository.

Mialon, G., Fourrier, C., Peladere, G., Goulian, T., Wolf, T., Joulin, A., and Scialom, T. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

Mohammadi, M., Li, Y., Lo, J., and Yip, W. Evaluation and benchmarking of LLM agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025. URL https://api.semanticscholar.org/CorpusID:280337602.

NVIDIA. NeMo Gym: An open source library for scaling reinforcement learning environments for LLM. https://github.com/NVIDIA-NeMo/Gym, 2025. GitHub repository.

Park, C., Han, S., Guo, X., Ozdaglar, A. E., Zhang, K., and Kim, J.-K. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Annual Meeting of the Association for Computational Linguistics, 2025. URL https://api.semanticscholar.org/CorpusID:276580906.

Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W. E., Li, W., Campbell-Ajala, F., Toyama, D. K., Berry, R. J., Tyamagundlu, D., Lillicrap, T. P., and Riva, O. AndroidWorld: A dynamic benchmarking environment for autonomous agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=il5yUQsrjC.

Shaw, A. Harbor Framework, November 2025. URL https://github.com/laude-institute/harbor.

Shen, Z. LLM with tools: A survey. ArXiv, abs/2409.18807, 2024. URL https://api.semanticscholar.org/CorpusID:272968969.

Team, C. A. CUA-Bench: Technical report. 2025. URL https://cuabench.ai.

Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.

Trivedi, H., Khot, T., Hartmann, M., Manku, R., Dong, V., Li, E., Gupta, S., Sabharwal, A., and Balasubramanian, N. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

Wang, J., Ma, Z., Li, Y., Zhang, S., Chen, C., Chen, K., and Le, X. GTA: A benchmark for general tool agents. ArXiv, abs/2407.08713, 2024a. URL https://api.semanticscholar.org/CorpusID:271097480.

Wang, X., Wang, Z., and team, O. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024b.

Wang, Z., Chang, Q., Patel, H., Biju, S., Wu, C.-E., Liu, Q., Ding, A., Rezazadeh, A., Shah, A., Bao, Y., and Siow, E. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. ArXiv, abs/2508.20453, 2025. URL https://api.semanticscholar.org/CorpusID:280949860.

Xie, T., Zhang, F., Chen, Z., Ye, Z., Xia, S., Liu, W., Liang, Z., Shang, M., Miao, S., Cheng, D., et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real world computer systems. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.

Xu, W., Huang, C., Gao, S., and Shang, S. LLM-based agents for tool learning: A survey. Data Science and Engineering, 10:533–563, 2025. URL https://api.semanticscholar.org/CorpusID:279637071.

Yang, J., Jimenez, C. E., et al. SWE-bench Verified: Enhancing software engineering evaluation with human-annotated unit tests. OpenAI Blog / Technical Report, 2024. URL https://openai.com/index/introducing-swe-bench-verified/.

Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., and Shmueli-Scheuer, M. Survey on evaluation of LLM-based agents, 2025. URL https://arxiv.org/abs/2503.16416.

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., and Neubig, G. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx.
