AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios
AI agents are increasingly able to handle tasks of growing duration and complexity, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, general users' perception of these advanced capabilities remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently covering the diversity of agentic tasks that span the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, which tests whether general users can complete a diverse array of daily tasks using natural-language instructions and AI agents. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. Benchmarking four leading general AI agents, we find that agent products built on LLM APIs and the ChatGPT agent trained with agent RL occupy the first tier together. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to build cutting-edge agent products.
💡 Research Summary
AgentIF‑OneDay is a newly introduced benchmark that evaluates whether general‑purpose AI agents can help everyday users complete a wide variety of daily tasks through natural‑language instructions. The authors argue that existing instruction‑following and agent benchmarks focus primarily on increasing task difficulty or on narrow vertical domains, leaving a gap in assessing the diversity of tasks that ordinary people actually encounter in work, life, and learning. To fill this gap, the paper presents a dataset of 104 tasks (totaling 767 scoring points) organized around three user‑centric categories: Open Workflow Execution, Latent Instruction Inference, and Iterative Refinement.
Open Workflow Execution tasks require agents to follow a detailed, step‑by‑step procedure supplied by the user (e.g., steps ①‑⑤). Success depends on long‑context handling, avoidance of “instruction forgetting,” and suppression of hallucinations. Latent Instruction Inference tasks hide the core instruction inside attachments; the agent must extract implicit rules, constraints, or pricing logic from PDFs, spreadsheets, or other files and apply them to a new decision‑making problem. Iterative Refinement tasks simulate multi‑turn collaboration: the agent receives an existing artefact (e.g., an SVG floor plan) together with a set of constraints in an Excel file and must modify the artefact while preserving readability and walkability.
Evaluation is performed with instance‑level rubrics that separate bonus points (for meeting key requirements) from penalty points (for critical errors). Each rubric item is binary (satisfied / not satisfied), and the final normalized score for a problem is computed as the net bonus‑penalty divided by the maximum possible points, clamped at zero. The overall benchmark score is the mean of these normalized scores across all problems.
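The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration of the described formula, not the authors' code; the function and variable names are my own:

```python
def problem_score(bonus_earned: float, penalty_incurred: float,
                  max_bonus: float) -> float:
    """Normalized score for one problem: net bonus minus penalty,
    divided by the maximum attainable bonus points, clamped at zero."""
    net = bonus_earned - penalty_incurred
    return max(0.0, net / max_bonus)

def benchmark_score(problems: list[tuple[float, float, float]]) -> float:
    """Overall benchmark score: mean of normalized per-problem scores.
    Each tuple is (bonus_earned, penalty_incurred, max_bonus)."""
    scores = [problem_score(b, p, m) for b, p, m in problems]
    return sum(scores) / len(scores)
```

For example, a problem earning 5 of 5 bonus points with 1 penalty point scores 0.8, while one earning 2 bonus points against 4 penalty points is clamped to 0.0 rather than going negative.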
A key contribution is the automated evaluation pipeline that uses a large multimodal language model (Gemini‑3‑Pro) as the “judge.” The pipeline incorporates visual parsing for PPT and HTML files, vision‑language models for image‑based verification, and web search for time‑sensitive factual checks. By aligning LLM‑based verification with human scoring, the authors achieve an 80.1 % agreement rate, demonstrating that LLM‑as‑judge can reliably replace human annotators for large‑scale benchmarking.
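Since each rubric item is binary, the reported 80.1% agreement is most naturally read as the fraction of verdicts on which the LLM judge and human annotators coincide. A hedged sketch of that computation (illustrative names, assuming one verdict per rubric item):

```python
def agreement_rate(llm_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of rubric items where the LLM judge's satisfied/not-satisfied
    verdict matches the human annotator's verdict for the same item."""
    if len(llm_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must align one-to-one")
    matches = sum(a == b for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)
```

Agreement on binary items treats false positives and false negatives symmetrically; a fuller analysis might also report per-category agreement or chance-corrected measures, but the paper reports the single aggregate rate.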
Task generation combines human‑annotated seed tasks with a “File‑centered Automated Agentic Pipeline.” The authors employ a ChatGPT‑based agent to collect information‑dense attachments (e.g., PDFs, spreadsheets) that are rich in question‑generation potential. These attachments are then combined with extracted workflow frameworks to synthesize logically coherent evaluation items. This methodology is extensible, allowing future expansion to new domains or additional task types.
The benchmark was used to evaluate four leading AI agents, including API‑based agents and ChatGPT agents trained with reinforcement learning for tool use. All four agents ranked in the top tier, indicating that contemporary LLM APIs and open‑source models have already internalized substantial agentic capabilities. However, performance varied across categories: agents performed well on straightforward workflow execution but showed notable gaps on latent instruction inference and iterative refinement, especially when complex multimodal reasoning or constraint optimization was required.
In summary, AgentIF‑OneDay provides the first large‑scale, multimodal benchmark that measures end‑to‑end task completion for general‑purpose AI agents in realistic daily scenarios. Its three‑category design captures explicit procedural execution, implicit reasoning from attachments, and multi‑turn refinement, reflecting how users actually interact with AI assistants. The instance‑level rubric and LLM‑as‑judge pipeline dramatically reduce human evaluation costs while maintaining high reliability. The benchmark not only reveals current strengths and weaknesses of commercial agents but also offers a valuable dataset for future reinforcement‑learning‑based instruction‑following research. By focusing on real‑world utility rather than abstract difficulty, AgentIF‑OneDay sets a new standard for evaluating the practical value of AI agents to everyday users.