Making Videos Accessible for Blind and Low Vision Users Using a Multimodal Agent Video Player
Video content remains largely inaccessible to blind and low-vision (BLV) users. To address this, we introduce a prototype that leverages a multimodal agent, powered by a novel conversational architecture built on a multimodal large language model (MLLM), to provide BLV users with an interactive, accessible video experience. This Multimodal Agent Video Player (MAVP) demonstrates that an interactive accessibility mode can be added to a video through multilayered prompt orchestration. We describe a user-centered design process, spanning 18 sessions with BLV users, which showed that BLV users want not just accessibility features but independence and personal agency over the viewing experience. In a qualitative study with an additional 8 BLV participants, we found that the MAVP's conversational dialogue offers BLV users a sense of personal agency, fostering collaboration and trust. Even when the system hallucinates, meta-conversational dialogue about the AI's limitations can repair that trust.
💡 Research Summary
This paper presents the Multimodal Agent Video Player (MAVP), a prototype that equips a video player with a conversational multimodal agent—named “Blue”—powered by a state‑of‑the‑art multimodal large language model (MLLM). The system transforms video consumption for blind and low‑vision (BLV) users from a passive listening experience into an interactive dialogue. MAVP automatically generates audio descriptions (AD) on the fly, but unlike traditional pre‑recorded AD, it can adapt the level of detail (brief, medium, or detailed) based on user commands, answer on‑demand questions about characters, objects, and settings, and even provide contextual information beyond the video (e.g., product availability).
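The adjustable description granularity described above could be implemented as a simple mapping from spoken commands to prompt templates. The function name, templates, and matching heuristic below are illustrative assumptions, not the paper's published prompts:

```python
# Sketch: mapping spoken detail-level commands to AD prompt templates.
# All names and templates here are illustrative assumptions; the paper
# does not publish its actual prompts.

DETAIL_PROMPTS = {
    "brief": "In one short sentence, describe the key visual action in this frame.",
    "medium": "In two or three sentences, describe the people, actions, and setting.",
    "detailed": "Describe this frame thoroughly: people, clothing, objects, "
                "setting, on-screen text, and spatial layout.",
}

def build_ad_prompt(user_command: str, default: str = "medium") -> str:
    """Pick an audio-description prompt based on the user's spoken command."""
    for level in DETAIL_PROMPTS:
        if level in user_command.lower():
            return DETAIL_PROMPTS[level]
    return DETAIL_PROMPTS[default]
```

A command such as "give me the detailed version" would select the detailed template, while a plain question falls back to the default level.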
The technical core consists of three layers: (1) a Retrieval‑Augmented Generation (RAG) module that indexes the entire video and enables rapid frame‑level retrieval; (2) a prompt‑orchestration engine that translates spoken user intents into multimodal prompts for the LLM; and (3) a high‑quality text‑to‑speech (TTS) pipeline that renders the LLM’s textual output as natural‑sounding speech. By chaining these components, the system can answer queries such as “What is the main character wearing?” or “Where can I buy the painting in the background?” within seconds, while also allowing full playback control via voice (e.g., “skip to the scene where the dress changes”).
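The three-layer flow above can be sketched as a thin pipeline. Every class and function below is a hypothetical stand-in (the paper does not publish its code); the toy word-overlap retrieval only illustrates how frame retrieval, prompt orchestration, and TTS might be chained:

```python
# Hypothetical sketch of the three-layer pipeline; not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Frame:
    timestamp: float
    caption: str  # pre-indexed visual description of the frame

def retrieve_frames(index: List[Frame], query: str, k: int = 3) -> List[Frame]:
    """Layer 1 (RAG, toy version): rank indexed frames by caption word overlap."""
    words = set(query.lower().split())
    scored = sorted(index, key=lambda f: -len(words & set(f.caption.lower().split())))
    return scored[:k]

def build_prompt(query: str, frames: List[Frame]) -> str:
    """Layer 2: orchestrate the user's intent and retrieved context into one prompt."""
    context = "\n".join(f"[{f.timestamp:.1f}s] {f.caption}" for f in frames)
    return f"Answer for a blind viewer, using only this context:\n{context}\nQuestion: {query}"

def answer_query(query: str, index: List[Frame],
                 llm: Callable[[str], str], tts: Callable[[str], None]) -> str:
    """Chain the layers: retrieve -> prompt -> LLM -> speech."""
    text = llm(build_prompt(query, retrieve_frames(index, query)))
    tts(text)  # layer 3: render the textual answer as speech
    return text
```

In a real system the retrieval layer would use embedding similarity over the full video index and the `llm`/`tts` callables would wrap an MLLM and a speech service; the chaining structure is the point of the sketch.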
The design process was deeply user‑centered. Eighteen co‑design sessions with fifteen BLV participants informed the iterative refinement of prompts, interaction flows, and voice feedback. Early trials revealed that off‑the‑shelf LLM prompts assumed sighted users and produced confusing or irrelevant responses. Through multi‑layered, meta‑prompt engineering, the team achieved responses that were both accurate and phrased in a way that BLV users found trustworthy.
A qualitative evaluation with eight participants (seven BLV, one sighted) highlighted three key findings. First, participants valued autonomy and personal agency above mere feature availability; the ability to ask follow‑up questions and adjust description granularity gave them a sense of control. Second, the conversational modality produced higher comprehension and engagement than static AD, especially for complex or instructional videos where visual details are critical. Third, when the LLM generated hallucinations (incorrect information), the system’s meta‑conversational behavior—explicitly acknowledging uncertainty, offering to re‑search, and explaining its limitations—helped restore trust. This “trust‑repair” mechanism suggests a promising design pattern for future AI‑driven accessibility tools.
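The trust-repair pattern reported in the third finding could be approximated by a wrapper that, when the model signals low confidence, acknowledges uncertainty and offers to re-check instead of asserting a possibly hallucinated detail. The threshold and phrasing below are assumptions for illustration, not the paper's design:

```python
# Sketch of the "trust-repair" pattern: hedge low-confidence answers and offer
# to re-examine, rather than stating possibly hallucinated details as fact.
# The 0.6 threshold and all wording are illustrative assumptions.

def respond_with_trust_repair(answer: str, confidence: float,
                              threshold: float = 0.6) -> str:
    if confidence >= threshold:
        return answer
    return (f"I'm not fully certain, but I believe: {answer}. "
            "As an AI, I can misread visual details. "
            "Would you like me to re-examine that part of the video?")
```

The design choice mirrors the finding: explicitly surfacing the AI's limitations, rather than hiding them, is what restores user trust after an error.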
The authors position MAVP as a proof‑of‑concept that merges automatic AD generation, interactive Q&A, and meta‑dialogue into a single, seamless experience. They argue that accessibility should be reconceptualized as an interactive partnership rather than a static set of features. Limitations include reliance on cloud‑based LLMs (requiring stable internet and high‑end hardware), potential privacy concerns around video content processing, and the need for broader testing across diverse video genres and longer durations.
In conclusion, the paper demonstrates that multimodal LLMs can power a conversational video player that significantly enhances independence, agency, and trust for BLV users. By coupling real‑time description generation with user‑driven dialogue and transparent AI behavior, MAVP charts a new direction for next‑generation accessibility solutions that can be extended to streaming platforms, educational content, and live broadcasts.