Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented capability. However, existing agent benchmarks are being rapidly saturated by newly developed agents, making it difficult to meaningfully evaluate agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new, more difficult task while recording validatable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which produces task-evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, where proposals are conceptualized into feasible problem candidates and agents then explore them freely while recording their execution trajectories; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by validatable and reproducible trajectories. Experiments on the GAIA benchmark demonstrate that TRACE consistently raises task complexity while improving the reliability of correctness through validatable execution trajectories. In addition, the framework adapts successfully to reasoning datasets, represented by AIME-2024. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
💡 Research Summary
The paper addresses the rapid saturation of existing agent benchmarks as large language models (LLMs) and sophisticated agent systems achieve near‑human performance on tasks such as GAIA. To keep evaluation meaningful, the authors propose TRACE (Trajectory‑based Validated‑by‑Reproducing Agent‑benchmark Complexity Evolution), a framework that automatically evolves benchmark tasks at test time while recording the full execution trajectory of the agent.
TRACE treats an agent’s workflow as a directed acyclic graph (DAG) in which each node Sᵢ consists of (cᵢ₋₁, rᵢ, aᵢ, oᵢ): the prior context, the internal reasoning state, the external action (e.g., a tool call or code snippet), and the environment’s observation. A trajectory τ = ⟨S₁,…,S_T⟩ therefore captures every reasoning step, tool use, and piece of environment feedback. Unlike static benchmarks that verify only the final answer o_T, TRACE validates the entire τ for logical coherence and reproducibility, making the trajectory a first‑class artifact.
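The step tuple and trajectory described above can be sketched as simple Python data structures. All names here (`Step`, `Trajectory`, the coherence check) are illustrative, not the paper's actual implementation; the point is the contrast between checking only the final observation o_T and checking the whole trajectory:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One trajectory node S_i = (c_{i-1}, r_i, a_i, o_i). Field names are illustrative."""
    context: str      # c_{i-1}: accumulated prior context
    reasoning: str    # r_i: internal reasoning state
    action: str       # a_i: external action (tool call, code snippet)
    observation: str  # o_i: environment feedback

@dataclass
class Trajectory:
    """tau = <S_1, ..., S_T>."""
    steps: List[Step]

    @property
    def final_answer(self) -> str:
        # What a static benchmark would check: only the final observation o_T.
        return self.steps[-1].observation

    def is_coherent(self) -> bool:
        # A toy stand-in for TRACE's trajectory-level validation:
        # each step's context must extend the previous step's context.
        return all(
            prev.context in curr.context
            for prev, curr in zip(self.steps, self.steps[1:])
        )
```

A validator built this way can reject a trajectory whose final answer happens to be correct but whose intermediate steps do not connect, which a final-answer-only check cannot.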
The framework proceeds in three stages:
- Evolutionary Proposal Mining – The Evolutionary Proposer analyses a seed task and its solution trajectory to identify the primary capability being tested (planning, reasoning, tool use) and the bottleneck where solvers tend to fail. It then generates diverse evolution proposals, e.g., extending a single‑hop factual query into a multi‑hop cross‑domain reasoning chain.
- Problem Construction & Free Exploration – The Exploration Executor receives a proposal, instantiates it into a concrete problem, and conducts test‑time exploration in real web or API environments. Rather than solving the task, the executor builds a new execution trajectory τ′ that incorporates the proposed modifications (additional constraints, new tool calls, domain shifts). The resulting evolved task is reconstructed from τ′ and is intrinsically more complex.
- Multi‑Level Validation – The Trajectory Validator re‑executes τ′ to ensure deterministic reproduction, checks logical consistency of the step sequence, and enforces safety constraints. Only (evolved problem, validated trajectory) pairs that pass all checks are added to the benchmark.
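The three stages compose into a simple generate-then-filter loop. The sketch below is a minimal control-flow outline under the assumption that the Proposer, Executor, and Validator are supplied as callables; in the actual framework each would be an LLM-driven agent component, which is abstracted away here:

```python
def evolve_benchmark(seed_tasks, propose, explore, validate):
    """Sketch of the three-stage TRACE loop.

    propose(task)            -> list of evolution proposals   (Stage 1)
    explore(proposal)        -> (evolved_task, trajectory)    (Stage 2)
    validate(task, traj)     -> bool, multi-level checks      (Stage 3)

    All three callables are caller-supplied stand-ins for the
    Evolutionary Proposer, Exploration Executor, and Trajectory Validator.
    """
    evolved = []
    for task in seed_tasks:
        for proposal in propose(task):
            new_task, trajectory = explore(proposal)
            # Only (evolved problem, validated trajectory) pairs survive.
            if validate(new_task, trajectory):
                evolved.append((new_task, trajectory))
    return evolved
```

Because validation gates admission, the evolved benchmark only ever contains tasks that ship with a reproducible solution trajectory.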
Experiments on the GAIA benchmark and the AIME‑2024 reasoning dataset demonstrate that each evolution round raises task difficulty as measured by a bottleneck‑aware difficulty judge. Model performance (Pass@1) declines steadily across rounds, confirming that agents are indeed facing harder problems. Notably, the framework exhibits a “Seed‑to‑Spark” phenomenon: evolved tasks can migrate into entirely new capability domains (e.g., from information retrieval to mathematics or coding), greatly expanding benchmark diversity.
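The Pass@1 trend described above is straightforward to compute. The helper and the per-round numbers below are purely hypothetical toy data to show the declining pattern, not results from the paper:

```python
def pass_at_1(first_attempt_correct):
    """Pass@1: fraction of tasks solved on the first attempt.
    `first_attempt_correct` is a list of booleans, one entry per task."""
    return sum(first_attempt_correct) / len(first_attempt_correct)

# Hypothetical first-attempt outcomes over three evolution rounds:
# as tasks get harder, fewer first attempts succeed.
rounds = [
    [True, True, True, False],    # round 1
    [True, True, False, False],   # round 2
    [True, False, False, False],  # round 3
]
scores = [pass_at_1(r) for r in rounds]  # monotonically non-increasing
```

A monotone decline in such a series across evolution rounds is the signal that each round genuinely increased difficulty.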
Key contributions are: (1) introducing execution trajectories as verifiable artifacts for benchmark evolution, (2) leveraging agent‑driven exploration rather than rule‑based perturbations to increase procedural and logical complexity, and (3) providing a scalable, self‑evolving evaluation pipeline that mitigates benchmark saturation. Limitations include dependence on live web environments (raising reproducibility costs), potential generation of overly difficult or unrealistic tasks, and the computational overhead of multi‑level validation. Future work suggests building cost‑effective simulation sandboxes, automated difficulty calibration, and human‑in‑the‑loop quality assurance.
Overall, TRACE marks a paradigm shift from static, manually curated benchmarks to dynamic, self‑evolving evaluation systems, offering a sustainable runway for continual advancement of autonomous agents.