A11y-CUA Dataset: Characterizing the Accessibility Gap in Computer Use Agents
Computer Use Agents (CUAs) operate interfaces by pointing, clicking, and typing – mirroring the interactions of sighted users (SUs), who can therefore monitor CUAs and share control. CUAs do not reflect the interactions of blind and low-vision users (BLVUs), who rely on assistive technology (AT); BLVUs thus cannot easily collaborate with CUAs. To characterize this accessibility gap, we present A11y-CUA, a dataset of BLVUs and SUs performing 60 everyday tasks, comprising 40.4 hours of interaction and 158,325 events. Our analysis of the collected interaction traces quantitatively confirms distinct interaction styles between the SU and BLVU groups (mouse- vs. keyboard-dominant) and demonstrates interaction diversity within each group (e.g., sequential vs. shortcut navigation for BLVUs). We then compare the collected traces to state-of-the-art CUAs under default and AT conditions (keyboard-only, magnifier). The default CUA executed 78.33% of tasks successfully, but under the AT conditions its performance dropped to 41.67% (keyboard-only) and 28.33% (magnifier), and its behavior did not reflect the nuances of real AT use. With the open A11y-CUA dataset, we aim to promote collaborative and accessible CUAs for everyone.
💡 Research Summary
The paper introduces A11y‑CUA, a multimodal dataset designed to expose the accessibility gap of modern Computer Use Agents (CUAs). While CUAs such as OpenAI’s Operator, Anthropic’s Computer Use Tool, Microsoft Copilot, and DeepMind’s Project Astra operate by interpreting screen pixels and performing low‑level actions (clicks, scrolls, typing), they inherently assume a sighted user model. Consequently, blind and low‑vision users (BLVUs) who rely on assistive technologies (AT) like screen readers or magnifiers cannot monitor or collaborate with these agents.
To quantify this gap, the authors recruited eight BLVUs and eight sighted users (SUs) to complete 60 everyday computing tasks spanning web browsing, media playback, workflow automation, document editing, and system operations. The tasks were carefully scripted with contextual instructions and explicit end‑states, allowing objective success measurement. A custom computer‑use recorder captured a dense, time‑synchronized stream of data: screen video, system audio, OS‑level input events (keystrokes, mouse clicks, scrolls), window and element metadata, accessibility settings, periodic UI Automation snapshots, and, for web tasks, full DOM and accessibility‑tree dumps. In total the dataset comprises 40.4 hours of interaction and 158,325 logged events.
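A recorder like the one described must merge several per-modality streams (keyboard, mouse, UIA snapshots, DOM dumps) into one time-ordered trace. The following sketch is illustrative only: the `Event` field names and stream structure are assumptions for the example, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical record shape for one logged event; field names are
# illustrative assumptions, not A11y-CUA's actual schema.
@dataclass(order=True)
class Event:
    timestamp_ms: int
    source: str = field(compare=False)    # e.g. "keyboard", "mouse", "uia", "dom"
    payload: dict = field(compare=False, default_factory=dict)

def merge_streams(*streams):
    """Merge per-modality event streams into one trace ordered by timestamp."""
    return sorted(ev for stream in streams for ev in stream)

# Toy example: two short streams interleave by time.
keyboard = [Event(120, "keyboard", {"key": "Tab"}),
            Event(480, "keyboard", {"key": "Enter"})]
mouse = [Event(300, "mouse", {"button": "left", "x": 412, "y": 96})]
trace = merge_streams(keyboard, mouse)
print([ev.source for ev in trace])  # → ['keyboard', 'mouse', 'keyboard']
```

Sorting only on the timestamp (via `order=True` with `compare=False` on the other fields) keeps the merge stable across heterogeneous payloads.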
Analysis of the traces reveals clear between‑group differences: SUs are mouse‑dominant (≈62% of events are mouse actions) and complete tasks with fewer steps, while BLVUs are keyboard‑dominant (≈78% of events are keystrokes) and often employ longer, sequential navigation sequences. Within each group, substantial variability exists; participants use multiple strategies (e.g., context‑menu vs. shortcuts for SUs, sequential vs. shortcut navigation for BLVUs) and switch tactics when an approach fails. This intra‑group diversity challenges prior assumptions that BLVU interaction is uniformly slower or less efficient.
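The mouse- vs. keyboard-dominance figures above amount to per-modality event shares over a trace. A minimal sketch of that computation, using a toy trace with made-up proportions (not the dataset's values):

```python
from collections import Counter

def modality_shares(events):
    """Return the fraction of events belonging to each input modality."""
    counts = Counter(kind for kind, _ in events)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

# Toy keyboard-dominant session, illustrative of a BLVU-style trace.
trace = [("keyboard", "Tab")] * 78 + [("mouse", "click")] * 22
shares = modality_shares(trace)
print(shares["keyboard"])  # → 0.78
```

Applied per participant, such shares make group-level contrasts (and the within-group variability the paper highlights) directly comparable across sessions of different lengths.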
The authors then evaluate two state‑of‑the‑art CUAs: Anthropic’s Claude Sonnet 4.5 (closed) and Qwen3‑VL‑32B‑Instruct (open). Each agent is run under three conditions: (1) default visual mode, (2) keyboard‑only mode (simulating screen‑reader input constraints), and (3) a 150% magnified viewport (simulating low‑vision magnification). Sonnet achieves 78.33% task success in the default condition, but performance drops sharply to 41.67% with keyboard‑only and 28.33% with magnification. Qwen3 manages only 20% success in the default setting and fails completely (0%) under both AT conditions. Error analysis shows that agents struggle with (i) perceiving UI elements without visual cues, (ii) aligning their internal action plans with the linear, text‑based navigation order of screen readers, and (iii) correctly translating coordinates under magnification.
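The third failure mode, mis-translating coordinates under magnification, comes down to a simple mapping the agents get wrong: a point seen in a zoomed viewport must be scaled back and offset to logical screen coordinates before a click is issued. A minimal sketch under the simplifying assumption that the magnifier applies a uniform zoom from a known viewport origin (the function and its parameters are illustrative, not the paper's implementation):

```python
def magnified_to_logical(x, y, zoom=1.5, viewport_origin=(0, 0)):
    """
    Map a point observed in a magnified viewport back to logical screen
    coordinates. `viewport_origin` is the logical position shown at the
    magnifier's top-left corner; a uniform zoom factor is assumed.
    """
    ox, oy = viewport_origin
    return (ox + x / zoom, oy + y / zoom)

# A button rendered at (600, 300) inside a 150% magnified view whose
# top-left corner shows logical point (100, 50):
print(magnified_to_logical(600, 300, zoom=1.5, viewport_origin=(100, 50)))
# → (500.0, 250.0)
```

An agent that clicks at the raw magnified coordinates (600, 300) instead of the translated (500, 250) misses the target entirely, which is consistent with the sharp success-rate drop reported under the magnifier condition.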
To conceptualize the observed shortcomings, the paper proposes a three‑dimensional “accessibility gap” model: perception (visual‑only input), cognitive (misalignment of intent inference under AT constraints), and action (inability to generate keyboard‑oriented or magnified interactions). The authors argue that closing this gap requires redesigning CUAs to ingest accessibility metadata (ARIA labels, screen‑reader announcements), to reason about text‑based navigation, and to output AT‑compatible commands.
Beyond analysis, the authors release the full A11y‑CUA dataset and the open‑source recorder, enabling reproducible research and benchmarking of future agents. They claim three primary contributions: (1) a novel dataset that captures both sighted and BLVU interaction traces across a broad task set, (2) a detailed comparative study of interaction styles highlighting both between‑group and within‑group variability, and (3) an empirical evaluation of current CUAs under realistic AT constraints, exposing substantial perception, cognitive, and action gaps.
Limitations include the modest participant count (16) and focus on a Windows desktop environment, which may limit generalization to other operating systems or mobile platforms. The evaluated agents are primarily vision‑centric; future work should explore agents that directly interface with screen‑reader APIs or that are trained on multimodal AT data. Expanding the diversity of AT configurations (different screen readers, magnification levels) and investigating collaborative “in‑the‑loop” workflows between humans and agents are identified as promising directions.
In sum, A11y‑CUA provides the first large‑scale, multimodal benchmark for assessing the accessibility of computer‑use agents, laying groundwork for the development of agents that can serve all users, regardless of visual ability.