FaraGen, a Large-Scale Synthetic Data Generation System for Computer Use Agents, and Fara-7B, a Small High-Performance Model

Reading time: 6 minutes

📝 Abstract

Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench-our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.

📄 Content

Large Language Models (LLMs) are rapidly evolving from conversational tools into general-purpose agents capable of acting on behalf of users. Among the emerging agentic capabilities, Computer Use Agents (CUAs) that can perceive and take actions on the user's computer stand out for their immediate potential (Anthropic, 2024; DeepMind, 2025; OpenAI, 2025c). They can navigate websites, fill forms, retrieve information, and generally improve productivity. A capable CUA can reduce tedious multi-step tasks to a single natural-language instruction, paving the way for ubiquitous personal digital assistants. However, the transition from "chat" to "agency" is stifled by a data bottleneck. Training a CUA model requires human-computer interaction data that reflects how humans plan and execute tasks on a computer: where to click, how to interpret visual state, how to recover from errors, and how to accomplish goals using noisy and ever-changing GUIs. While the internet provides a near-infinite corpus of text training data for chat LLMs, no comparable data exists for CUAs. Collecting such data with human annotators can be prohibitively expensive and slow. Synthetic data generation offers an alternative, but it brings its own challenges given the lack of strong pre-existing CUA models, and programmatic alternatives are brittle in the face of the ambiguity and dynamic nature of the open web.

To bridge this gap, we introduce FaraGen, a scalable synthetic data generation engine for CUA, designed specifically for web-based tasks. It employs a collaborative multi-agent architecture that simulates the full lifecycle of digital workflows. FaraGen orchestrates three specialized components to simultaneously maximize the quality, quantity, and diversity of generated trajectories:

• Task Proposal: Analyzes diverse live websites to produce realistic, human-relevant tasks.

• Task Solving: Employs agents to collaboratively attempt the proposed tasks, generating a broad collection of candidate trajectories. A user simulator agent provides feedback or follow-up tasks to increase trajectory complexity and realism.

• Trajectory Verification: Serves as an automated quality assurance layer. We use LLM verifiers to validate trajectory outcomes against the original intent, filtering out hallucinations or execution errors to ensure high data fidelity.
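The three components above form a propose → solve → verify loop. The following is a minimal sketch of that control flow, assuming stubbed-out stand-ins for the LLM-driven pieces; the function names and rollout logic are illustrative, not FaraGen's actual implementation:

```python
"""Toy sketch of the FaraGen propose -> solve -> verify loop.
All function bodies are illustrative stand-ins for LLM-driven components."""

from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)
    succeeded: bool = False


def propose_tasks(seed_url: str) -> list[str]:
    # Task Proposal: in FaraGen an LLM inspects the live page; stubbed here.
    return [
        f"Find the return policy on {seed_url}",
        f"Add the cheapest item on {seed_url} to the cart",
    ]


def solve(task: str, attempt: int) -> Trajectory:
    # Task Solving: a browsing agent attempts the task; we fake a rollout.
    traj = Trajectory(task=task, steps=[f"click #{attempt}", "type query", "submit"])
    traj.succeeded = attempt == 0  # pretend the first attempt happens to succeed
    return traj


def verify(traj: Trajectory) -> bool:
    # Trajectory Verification: LLM judges outcome against intent; stubbed as a flag.
    return traj.succeeded


def generate(seed_url: str, attempts_per_task: int = 3) -> list[Trajectory]:
    verified = []
    for task in propose_tasks(seed_url):
        for attempt in range(attempts_per_task):
            traj = solve(task, attempt)
            if verify(traj):  # keep only trajectories that pass verification
                verified.append(traj)
    return verified


data = generate("example-shop.com")
```

The key design point is that verification acts as a filter between generation and the training set, so solver mistakes cost compute but never contaminate the data.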

This closed-loop system allows FaraGen to generate verified web trajectories for roughly $1 per completed task, enabling large-scale dataset creation at a cost previously infeasible for CUA research. Our resulting data covers a wide range of modern website layouts, realistic user intents, dynamic content, and multi-turn reasoning, all essential ingredients for robust agentic behavior.
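The per-trajectory cost is simply total rollout spend divided by verified output. A back-of-envelope calculation, using assumed numbers that are illustrative rather than figures reported by the paper:

```python
# Cost per verified trajectory = (attempt cost x attempts) / verified yield.
# All three inputs below are illustrative assumptions, not reported figures.
cost_per_attempt = 0.25   # dollars of LLM + browsing compute per solver rollout
attempts_per_task = 8     # candidate rollouts generated per proposed task
verified_per_task = 2     # rollouts that pass all verifiers

cost_per_verified = cost_per_attempt * attempts_per_task / verified_per_task
print(cost_per_verified)  # → 1.0
```

This makes the economics explicit: the headline cost is driven as much by verification yield as by raw inference price, so better verifiers directly cheapen the dataset.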

Fara-7B breaks ground on a new Pareto frontier (see Figure 1), showing that high-quality synthetic data can unlock agentic capabilities in even small models.

A small native CUA model. We use FaraGen to generate a dataset of 145K trajectories, spanning multiple task segments like shopping, searching for information, and making reservations. Using this data, we train a compact CUA model specialized for web-based computer use, Fara-7B. The web as it stands today is optimized for human consumption, and we believe navigating it as humans do will lead to the best results. As a result, Fara-7B adopts a “pixel-in, action-out” formulation: it perceives the computer screen directly through raw screenshots, formulates intermediate reasoning steps, and predicts atomic actions at a low-level interface (clicks, scrolls, keystrokes). This avoids dependence on brittle DOM parsing, and is consistent with recent findings that vision-centric CUAs exhibit stronger cross-site generalization (Yutori, 2025).
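The "pixel-in, action-out" formulation described above can be pictured as a perceive-reason-act step over a small atomic action space. The sketch below assumes a hypothetical action schema and helper names; Fara-7B's real interface may differ:

```python
"""Sketch of one 'pixel-in, action-out' agent step. The Action schema and
predict_step stand-in are hypothetical, not Fara-7B's actual interface."""

from dataclasses import dataclass


@dataclass
class Action:
    kind: str      # one of: "click" | "scroll" | "type" | "stop"
    x: int = 0     # predicted screen coordinates, used for clicks
    y: int = 0
    text: str = "" # keystrokes, used for "type"


def predict_step(screenshot_png: bytes) -> tuple[str, Action]:
    """Stand-in for the model: raw screenshot in, reasoning plus one atomic
    action out. A real model would ground coordinates in the pixels."""
    reasoning = "The search box appears near the top; click it, then type."
    return reasoning, Action(kind="click", x=512, y=88)


def execute(action: Action) -> None:
    # A real harness would drive the browser/OS here; we only log the command.
    print(f"{action.kind} at ({action.x}, {action.y})")


thought, act = predict_step(b"\x89PNG...")
execute(act)  # prints: click at (512, 88)
```

Because the model only ever sees pixels and emits coordinates, the same loop works on any site regardless of its DOM, which is the cross-site generalization benefit the text alludes to.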

We evaluate Fara-7B across existing web-based CUA benchmarks as well as a new benchmark we introduce, WebTailBench, which is designed to cover real-world web tasks often under-represented in current metrics. As illustrated in Figure 1, Fara-7B not only achieves state-of-the-art results for a model of its size, but is also competitive with much larger frontier models.

Figure 3: FaraGen - Distributional differences between two publicly available sources of seed URLs: Tranco and Clueweb22. We find that Clueweb22 is a more valuable source of task data because it contains a lower fraction of corporate landing pages, which tend to offer a narrower scope of actionable tasks.

FaraGen comprises three stages: Task Proposal, which generates tasks reflecting what users would employ a CUA for; Task Solving, which extends the Magentic-One and Magentic-UI agents (Fourney et al., 2024; Mozannar et al., 2025) to solve tasks; and Trajectory Verification, which filters for candidate trajectories that successfully completed the task.

We generate a broad set of synthetic tasks with the primary objective of reflecting the distribution of tasks users commonly perform on the web, targeting two broad categories: information-seeking questions (e.g., looking up product specifications, finding event details) and a

This content is AI-processed based on ArXiv data.
