Agentic Test-Time Scaling for WebAgents
Test-time scaling has become a standard way to improve the performance and reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents and find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but can also mistakenly overrule high-consensus decisions. We show that uncertainty statistics derived from the agent’s own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate extra compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over ReAct while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
💡 Research Summary
This paper investigates how to allocate test‑time compute for large‑language‑model (LLM) based web agents that must perform multi‑step tool‑using tasks. While test‑time scaling—generating multiple samples and aggregating them—has proven effective for single‑shot reasoning, naïvely applying the same strategy to agents leads to two problems: (1) many steps are trivial, so uniformly sampling many candidate actions wastes tokens; (2) on difficult steps the candidate actions are highly diverse, making simple majority voting unreliable.
The authors first evaluate a baseline that samples N candidates at every step and selects the most frequent (majority voting). Experiments on WebArena‑Lite (165 tasks) and GoBrowse (341 tasks) show diminishing returns: increasing N from 1 to 10 improves success only modestly, while doubling the token budget yields less than 0.3 % absolute gain.
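The uniform-scaling baseline amounts to sampling N candidate actions per step and keeping the mode. A minimal sketch of that aggregation (the action strings are illustrative, not the benchmarks' actual action space):

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent candidate action (ties go to the earliest seen)."""
    return Counter(candidates).most_common(1)[0][0]

# Example: N = 5 sampled actions for a single step
actions = [
    "click(#search)",
    "click(#search)",
    "type(#q, 'gpu')",
    "click(#search)",
    "scroll(down)",
]
print(majority_vote(actions))  # -> click(#search)
```

Running this at every step with a large N is exactly the regime where the paper observes saturation: most steps would have picked the same action with N = 1.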
Next they introduce an “Arbiter” model that receives the current observation and the set of candidate actions, then reasons over them to pick the best action. The Arbiter improves performance over majority voting by about 1–2 percentage points, but it can also overthink: when the vote distribution is already highly consensual, the Arbiter may overturn the correct choice, hurting reliability.
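A hedged sketch of what an Arbiter call might look like. Here `llm` is a hypothetical callable (prompt in, text out) standing in for whatever judge model is used, and the prompt format is invented for illustration; the paper's actual prompt is not specified here.

```python
def arbiter_choose(observation, candidates, llm):
    """Ask an LLM judge to pick among candidate actions.

    `llm` is a hypothetical callable (prompt -> str); the prompt
    format below is illustrative, not the authors' exact template.
    """
    unique = list(dict.fromkeys(candidates))  # preserve order, drop exact duplicates
    listing = "\n".join(f"{i}: {a}" for i, a in enumerate(unique))
    prompt = (
        f"Observation:\n{observation}\n\n"
        f"Candidate actions:\n{listing}\n\n"
        f"Reply with only the index of the best action."
    )
    reply = llm(prompt)
    try:
        idx = int(reply.strip())
    except ValueError:
        idx = 0  # unparseable reply: fall back to the first candidate
    return unique[idx] if 0 <= idx < len(unique) else unique[0]
```

Note that this judge sees only the candidates, not their vote counts, which is one way the overthinking failure mode can arise: a near-unanimous vote carries no extra weight in its decision.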
The key insight is that the distribution of votes itself provides a useful uncertainty signal. The authors compute two statistics per step: the entropy Hₜ of the vote distribution and the margin Δₜ between the top‑1 and top‑2 vote counts. When these statistics signal high uncertainty (entropy above a pre‑defined threshold, or margin below one), they invoke the Arbiter; otherwise they fall back to cheap majority voting. This dynamic policy, named Confidence‑Aware Test‑Time Scaling (CATTS), concentrates extra compute only on contentious steps.
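The gating rule can be sketched directly from the two statistics. The thresholds `h_max` and `m_min` below are illustrative placeholders, not the paper's tuned values, and `arbiter` stands in for the LLM judge:

```python
import math
from collections import Counter

def vote_stats(candidates):
    """Entropy (natural log) and normalized top-1/top-2 margin of the votes."""
    counts = Counter(candidates).most_common()
    n = len(candidates)
    probs = [c / n for _, c in counts]
    entropy = -sum(p * math.log(p) for p in probs)
    top1 = counts[0][1]
    top2 = counts[1][1] if len(counts) > 1 else 0
    margin = (top1 - top2) / n
    return entropy, margin

def catts_step(candidates, arbiter, h_max=0.8, m_min=0.3):
    """Invoke the arbiter only when the vote is contentious.

    h_max / m_min are illustrative thresholds, not the paper's values.
    """
    entropy, margin = vote_stats(candidates)
    if entropy > h_max or margin < m_min:
        return arbiter(candidates)  # contentious step: spend extra compute
    return Counter(candidates).most_common(1)[0][0]  # confident: cheap majority
```

On a unanimous vote the entropy is 0 and the margin is 1, so the Arbiter is never consulted and the step costs nothing beyond the base samples; a near-tie trips either condition and escalates.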
CATTS achieves up to a 9.1 % absolute increase in success rate over the standard ReAct baseline while using up to 2.3 × fewer tokens compared with uniform scaling. It also reduces latency because the Arbiter is called only on a small subset of steps. Additional contributions include a lightweight semantic‑deduplication step that clusters semantically identical actions before voting, and an analysis of how varying N (candidate count) and K (number of Arbiter calls) affects the trade‑off between cost and performance.
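The deduplication idea can be illustrated with a simple string canonicalization before counting votes. The paper's actual semantic clustering is presumably richer than this; the normalization below (case folding, whitespace and quote collapsing) is only a stand-in:

```python
import re
from collections import Counter

def normalize(action):
    """Canonicalize an action string so trivially different phrasings vote together.

    This normalization is an illustrative stand-in for the paper's
    semantic deduplication, not its actual method.
    """
    a = action.strip().lower()
    a = re.sub(r"\s+", " ", a)   # collapse runs of whitespace
    a = a.replace('"', "'")      # unify quote style
    return a

def clustered_votes(candidates):
    """Vote counts per cluster of semantically identical actions."""
    return Counter(normalize(a) for a in candidates)
```

Without such clustering, two samples like `Click(#go)` and `click(#go) ` would split the vote and artificially inflate the measured uncertainty, so deduplication directly sharpens the entropy and margin signals that CATTS gates on.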
Overall, the work demonstrates that adaptive, uncertainty‑driven allocation of test‑time resources is far more efficient for agentic, long‑horizon tasks than blanket scaling, offering a practical, interpretable solution for real‑world deployment of LLM‑driven web agents.