Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent’s current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.
💡 Research Summary
The paper tackles the problem of enabling computer‑use agents (CUAs) to continually adapt to real‑world digital environments without any human‑generated training data. Existing large‑language‑model‑based agents achieve high scores on static benchmarks, yet they quickly degrade when deployed to the vast and ever‑changing landscape of desktop and web applications. To bridge this gap, the authors introduce ACuRL (Autonomous Curriculum Reinforcement Learning), a fully self‑supervised framework that combines autonomous environment exploration, a curriculum‑driven task generator, and a novel automatic evaluator called CUAJudge.
Core workflow
- Environment exploration & context collection – The agent first crawls the web to gather diverse user‑generated artifacts (documents, emails, slides, etc.) that serve as realistic starting points. It then freely interacts with the target application, recording sequences of observations (screenshots, UI text) and actions (clicks, keystrokes). These trajectories (τ_exp) and the context‑review trajectories (τ_ctx) constitute the knowledge base for later stages.
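The recorded trajectories can be pictured as simple observation-action sequences. The sketch below is an illustrative data layout only; the field names (`screenshot_path`, `ui_text`, `action`) and the `Step`/`Trajectory` classes are our assumptions, not structures from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One observation-action pair recorded during exploration."""
    screenshot_path: str   # saved screen capture for this step
    ui_text: str           # on-screen / accessibility-tree text dump
    action: str            # e.g. "click(312, 480)" or "type('hello')"

@dataclass
class Trajectory:
    """A sequence of steps from one episode (tau_exp or tau_ctx)."""
    kind: str                      # "exploration" or "context_review"
    steps: list = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)
```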
- Iterative curriculum reinforcement learning – Training proceeds in N iterations. Each iteration consists of a fixed number of policy‑optimization steps followed by a capability evaluation. For every task k in the current set T⁽ⁿ⁾, the agent performs m roll‑outs and computes the mean success rate s_k⁽ⁿ⁾. Based on s_k⁽ⁿ⁾, tasks are classified as Easy, Medium, or Hard using two thresholds (δ_high, δ_low).
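The bucketing step can be sketched in a few lines. The threshold values below (0.8 and 0.2) are illustrative defaults we chose; the summary does not specify what δ_high and δ_low are set to.

```python
def mean_success_rate(rollout_outcomes):
    """s_k: fraction of successful roll-outs among the m attempts (1 = success)."""
    return sum(rollout_outcomes) / len(rollout_outcomes)

def classify_task(success_rate, delta_high=0.8, delta_low=0.2):
    """Bucket a task by its mean success rate s_k using the two thresholds."""
    if success_rate >= delta_high:
        return "easy"
    if success_rate <= delta_low:
        return "hard"
    return "medium"
```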
- Curriculum task generator – Using τ_exp, τ_ctx, the previous task set, and the success‑rate feedback, the generator synthesizes a new task set T⁽ⁿ⁺¹⁾.
- Easy tasks are made harder by adding extra skill requirements or extending the horizon.
- Medium tasks keep core skills but vary the surrounding scenario or persona to increase diversity.
- Hard tasks are decomposed into meaningful subtasks; each sub‑skill becomes a stand‑alone task, providing intermediate reward signals.
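The three rules above amount to a simple dispatch over the success-rate buckets. In this sketch, `generate(directive, task)` is a hypothetical stand-in for the LLM-based task generator, and the threshold defaults are illustrative assumptions; note that only the "decompose" branch yields multiple tasks.

```python
def next_task_set(tasks_with_rates, generate, delta_high=0.8, delta_low=0.2):
    """Build T^(n+1) from T^(n) given each task's mean success rate."""
    new_tasks = []
    for task, rate in tasks_with_rates:
        if rate >= delta_high:
            # Easy: add skill requirements or extend the horizon
            new_tasks.append(generate("harden", task))
        elif rate <= delta_low:
            # Hard: decompose into subtasks, one task per sub-skill
            new_tasks.extend(generate("decompose", task))
        else:
            # Medium: keep core skills, vary scenario or persona
            new_tasks.append(generate("diversify", task))
    return new_tasks
```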
The first iteration (Iteration 0) starts from a synthetic task pool derived solely from the collected experiences, eliminating the need for any human‑provided seed tasks. Human validation of 144 sampled tasks shows a 94 % validity rate, confirming that the generator produces executable, sensible tasks.
- Automatic evaluator (CUAJudge) – CUAJudge determines task success by comparing the initial and final environment states and by verifying key evidence (screenshots, action logs). It achieves 93 % agreement with human judgments, delivering reliable reward signals for RL without manual supervision.
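The evaluator's logic can be approximated as a state diff plus an evidence check. This is a minimal sketch of the idea, not CUAJudge's actual interface: the dictionary-based state snapshots and the `llm_verdict` callback are our assumptions, and treating "no state change" as failure is a simplification.

```python
def judge(task_goal, init_state, final_state, evidence, llm_verdict):
    """Decide task success by diffing environment states and verifying evidence.

    `llm_verdict(goal, changed, evidence)` stands in for an LLM-backed check
    that the observed changes and key evidence actually satisfy the goal.
    """
    changed = {k: (init_state.get(k), v)
               for k, v in final_state.items()
               if init_state.get(k) != v}
    if not changed:
        # Nothing changed between snapshots; this sketch treats that as failure.
        return False
    return llm_verdict(task_goal, changed, evidence)
```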
System engineering – To scale RL to realistic environments, the authors design a lightweight unified environment management protocol that supports batched creation/deletion, asynchronous pre‑loading, and fault tolerance. This dramatically speeds up training throughput.
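The throughput benefit comes from keeping fresh environments ready before roll-out workers ask for them. The pool below is a minimal illustration of asynchronous pre-loading with basic fault tolerance, written against Python's standard library; the class and method names are ours, not the paper's protocol.

```python
import queue
import threading

class EnvPool:
    """Pre-loads environments on a background thread so workers never wait."""

    def __init__(self, make_env, size=4):
        self._make_env = make_env
        self._ready = queue.Queue(maxsize=size)
        self._stop = threading.Event()
        self._loader = threading.Thread(target=self._fill, daemon=True)
        self._loader.start()

    def _fill(self):
        while not self._stop.is_set():
            try:
                env = self._make_env()
            except Exception:
                continue            # fault tolerance: retry failed creations
            self._ready.put(env)    # blocks once the pool is full

    def acquire(self):
        return self._ready.get()    # returns a pre-loaded environment

    def close(self):
        self._stop.set()
```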
Empirical evaluation – Experiments span six representative environments (web browsers, presentation software, spreadsheets, email clients, file explorers, code editors). Two continual‑learning scenarios are examined:
- Intra‑environment: tasks evolve within a single application, requiring the agent to acquire new skills while retaining earlier ones.
- Cross‑environment: tasks span multiple heterogeneous applications, testing knowledge transfer and forgetting mitigation.
Across both settings, ACuRL yields 4 %–22 % absolute performance gains over strong baselines, while preserving or even improving performance on previously learned environments (minimal catastrophic forgetting). Parameter‑change analysis reveals that only about 20 % of the model’s weights are substantially updated during continual learning. Updates concentrate in the higher layers of the language model backbone and diminish with depth in the vision encoder, explaining why the agent can adapt to new UI dynamics while keeping prior knowledge intact.
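The sparsity analysis boils down to counting how many weights moved by more than a small threshold between the pre- and post-adaptation checkpoints. The sketch below shows the idea on flat weight lists; the tolerance value is an illustrative assumption, not the criterion used in the paper.

```python
def updated_fraction(before, after, tol=1e-3):
    """Fraction of weights that changed by more than `tol` between checkpoints."""
    changed = sum(abs(b - a) > tol for b, a in zip(before, after))
    return changed / len(before)
```

Applied per layer, the same count would expose the reported pattern: large fractions in the higher language-model layers, shrinking with depth in the vision encoder.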
Insights and limitations – The framework demonstrates that (1) high‑quality, environment‑grounded tasks can be generated autonomously from raw interaction data, (2) reliable reward signals can be obtained without human annotation, and (3) sparse, targeted parameter updates enable stable continual learning. However, the quality of the initial exploration data is crucial; insufficiently diverse exploration may lead to low‑quality tasks. Moreover, CUAJudge currently relies on visual state differences and may struggle with non‑visual feedback (e.g., background processes, log files). Future work could incorporate curiosity‑driven or meta‑exploration strategies to improve data efficiency and extend the evaluator to multimodal evidence beyond screenshots.
Conclusion – ACuRL provides a practical, fully autonomous pipeline for continual learning of computer‑use agents in dynamic digital environments. By eliminating the need for human‑labeled data and by coupling curriculum generation with a robust automatic evaluator, the approach achieves scalable, data‑efficient adaptation while mitigating catastrophic forgetting. This opens the door to deploying self‑improving agents in real‑world settings such as enterprise software automation, personalized desktop assistants, and adaptive UI testing.