TaskAudit: Detecting Functiona11ity Errors in Mobile Apps via Agentic Task Execution
Accessibility checkers support accessible app development, and their use is encouraged by accessibility best practices. However, most current checkers evaluate static or mechanically generated contexts, failing to capture common accessibility errors that impact mobile app functionality. In this work, we define functiona11ity errors (a blend of “functionality” and “accessibility”) as accessibility barriers that manifest only through interaction. We introduce TaskAudit, which comprises three components: a Task Generator that constructs interactive tasks from app screens, a Task Executor that uses agents with a screen reader proxy to perform these tasks, and an Accessibility Analyzer that detects and reports accessibility errors by examining interaction traces. Our evaluation on real-world apps shows that TaskAudit detects 48 functiona11ity errors across 54 app screens, compared to between 4 and 20 for existing checkers. Our analysis demonstrates common error patterns that TaskAudit can detect beyond those covered by prior work, including label-functionality mismatch, cluttered navigation, and inappropriate feedback.
💡 Research Summary
The paper introduces TaskAudit, a novel system for automatically detecting “functiona11ity” errors in mobile applications—accessibility barriers that only become apparent during interaction with a screen reader. Traditional accessibility checkers (e.g., Google Accessibility Scanner, Axe, Groundhog) operate on static UI snapshots or perform mechanical crawling, which limits them to detecting surface‑level issues such as missing labels, insufficient contrast, or unclickable elements. They miss errors that arise only when a user actually navigates the app with a screen reader, such as mismatched labels, lack of feedback after activation, or navigation structures that are inefficient for blind users.
Problem Definition and Taxonomy
The authors define functiona11ity errors as accessibility problems that manifest only through interaction. They categorize them into five groups:
- Locatability – elements that cannot receive focus via a screen reader.
- Actionability – focusable elements that cannot be activated (e.g., double‑tap does nothing).
- Label – semantic mismatch between an element’s spoken label and its actual function.
- Feedback – missing or inappropriate auditory feedback after an action.
- Navigation – overly cluttered or poorly ordered focusable elements that impede efficient traversal.
Each category maps to specific WCAG 2.1 success criteria (e.g., 2.1.1 Keyboard, 4.1.2 Name, Role, Value, 3.2.2 On Input, 1.3.1 Info and Relationships).
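As a concrete illustration, the taxonomy and its WCAG mapping could be encoded as a simple lookup table. Note that the summary lists the WCAG criteria only as examples; the one-to-one pairing below is an assumption for illustration, and the identifier names are not from TaskAudit:

```python
# Illustrative encoding of the five functiona11ity error categories.
# The category-to-criterion pairing is assumed, not taken from the paper.
FUNCTIONA11ITY_CATEGORIES = {
    "locatability": {
        "description": "element cannot receive screen-reader focus",
        "wcag": ["2.1.1 Keyboard"],
    },
    "actionability": {
        "description": "focusable element cannot be activated",
        "wcag": ["2.1.1 Keyboard"],
    },
    "label": {
        "description": "spoken label mismatches the element's actual function",
        "wcag": ["4.1.2 Name, Role, Value"],
    },
    "feedback": {
        "description": "missing or inappropriate feedback after an action",
        "wcag": ["3.2.2 On Input"],
    },
    "navigation": {
        "description": "cluttered or poorly ordered focus traversal",
        "wcag": ["1.3.1 Info and Relationships"],
    },
}

def wcag_criteria(category: str) -> list[str]:
    """Return the WCAG 2.1 success criteria paired with an error category."""
    return FUNCTIONA11ITY_CATEGORIES[category]["wcag"]
```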
System Architecture
TaskAudit consists of three tightly coupled components:
- Task Generator – Takes a screenshot and the app’s view hierarchy, then uses a large language model (LLM) together with UI‑parsing models to synthesize concrete, screen‑reader‑compatible tasks (e.g., “enter ‘Los Angeles, CA’ into the destination field”, “double‑tap the swap button”). In a crowd‑labeled benchmark, the generator covered 69.4 % of human‑identified tasks.
- Task Executor – Deploys a fleet of LLM‑driven agents that interact with the app through a screen‑reader proxy (e.g., Android TalkBack). The agents interpret the generated task description, issue the appropriate gestures, and continuously parse the screen reader’s spoken output. When the app is error‑free, the agents successfully complete 96.0 % of within‑screen tasks, demonstrating that the agentic approach can reliably emulate a real user’s experience without visual cues.
- Accessibility Analyzer – Consumes the interaction trace (focus changes, spoken feedback, UI state transitions) and applies rule‑based heuristics aligned with the five error categories. For example, if a double‑tap yields no spoken response, a Feedback error is logged; if a button’s spoken label (“Explore”) does not correspond to its intended function (search), a Label error is reported.
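A minimal sketch of such rule-based trace analysis, assuming a simplified trace schema: the `TraceStep` fields and `analyze_trace` function below are hypothetical stand-ins for illustration, not TaskAudit's actual interface.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    """One entry in an interaction trace (simplified, illustrative schema)."""
    action: str            # e.g. "focus", "double_tap"
    element_label: str     # label spoken by the screen reader, "" if none
    spoken_feedback: str   # announcement after the action, "" if silent
    screen_changed: bool   # did the UI state transition?

def analyze_trace(trace: list[TraceStep]) -> list[str]:
    """Apply two heuristics in the spirit of the examples above."""
    errors = []
    for step in trace:
        # Feedback heuristic: an activation with no spoken response and no
        # visible state change suggests a Feedback error.
        if (step.action == "double_tap"
                and not step.spoken_feedback
                and not step.screen_changed):
            errors.append(
                f"Feedback error: activating '{step.element_label}' produced no response")
        # Label heuristic (simplistic): a focused element that announces no
        # name leaves the screen-reader user without information.
        if step.action == "focus" and not step.element_label:
            errors.append("Label error: focused element announces no name")
    return errors
```

Real label-function mismatch detection would additionally need semantic comparison (e.g., an LLM judging whether “Explore” matches a search action), which the silent-activation rule above does not cover.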
Evaluation
The authors address three research questions:
RQ1 – Task Identification: The generator’s 69.4 % coverage shows that LLMs can infer realistic tasks from raw UI data.
RQ2 – Agentic Execution: In a controlled set of apps without known accessibility defects, agents achieved a 96 % success rate, outperforming mechanical crawlers that often produce false positives due to blind clicking.
RQ3 – Error Detection: On 54 real‑world app screens (covering 78 potential functiona11ity errors), TaskAudit detected 48 errors. By contrast, existing static checkers reported only 4–20 errors on the same screens, indicating a 2.4‑ to 12‑fold improvement in detection coverage.
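The agentic execution evaluated in RQ2 can be sketched as a sense–decide–act loop. The `proxy` and `agent` interfaces, the gesture names, and the toy classes below are assumptions for illustration, not TaskAudit's real API:

```python
def run_task(task: str, proxy, agent, max_steps: int = 20):
    """Sketch of an agentic execution loop: the agent hears what the screen
    reader announces, chooses a gesture, the proxy performs it, and the loop
    ends when the agent declares the task done or the step budget runs out."""
    trace = []
    announcement = proxy.current_announcement()
    for _ in range(max_steps):
        gesture = agent.next_gesture(task, announcement, trace)
        if gesture == "done":
            return trace, True
        announcement = proxy.perform(gesture)  # e.g. swipe_right, double_tap
        trace.append((gesture, announcement))
    return trace, False

class ScriptedAgent:
    """Toy agent for demonstration: follows a fixed gesture script."""
    def __init__(self, script):
        self._script = iter(script)
    def next_gesture(self, task, announcement, trace):
        return next(self._script, "done")

class EchoProxy:
    """Toy screen-reader proxy: announces the gesture it performed."""
    def current_announcement(self):
        return "screen loaded"
    def perform(self, gesture):
        return f"performed {gesture}"
```

In TaskAudit the decision step is LLM-driven and the proxy wraps a real screen reader such as TalkBack; the resulting trace is what the Accessibility Analyzer consumes.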
A qualitative analysis of the discovered errors revealed that label‑function mismatches were the most frequent, followed by navigation clutter and missing feedback. The paper also discusses false‑positive cases arising from ambiguous screen‑reader output and suggests future refinements such as multi‑agent voting or richer semantic parsing.
Contributions
- Introduction of the functiona11ity error concept and a five‑category taxonomy grounded in WCAG.
- Design and implementation of TaskAudit, the first system that automatically generates, executes, and analyzes screen‑reader‑based tasks using LLM‑driven agents.
- Empirical evidence that agentic execution dramatically expands accessibility coverage beyond static analysis, detecting 48 errors on a realistic dataset of real‑world app screens.
- A detailed error‑type analysis that surfaces systematic design flaws (e.g., misleading labels, navigation overload) that are invisible to current tooling.
Limitations and Future Work
TaskAudit currently focuses exclusively on screen‑reader interactions; it does not handle pure touch gestures, voice‑only commands, or complex gesture‑based navigation. The quality of task generation depends on prompt engineering and the underlying LLM size, which may affect scalability. The authors propose extending the framework with multimodal vision‑language models to capture visual feedback, integrating the system into CI pipelines for continuous accessibility regression testing, and exploring lightweight agent architectures for on‑device deployment.
Conclusion
TaskAudit bridges a critical gap between static accessibility linting and real‑world user experience by leveraging LLM‑driven agents and a screen‑reader proxy to surface interaction‑dependent accessibility defects. Its substantial improvement over existing tools suggests a promising direction for future automated accessibility evaluation, potentially setting a new standard for mobile app development workflows.