How well can VLMs rate audio descriptions: A multi-dimensional quantitative assessment framework
Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow that uses Item Response Theory to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning is less reliable and actionable than that of human raters. These insights point to the potential of hybrid evaluation systems that pair VLMs with human oversight, offering a path toward scalable AD quality control.
💡 Research Summary
Digital video is now a primary medium for communication, education, and entertainment, yet blind and low‑vision (BLV) audiences are excluded without audio description (AD). While crowdsourced platforms and vision‑language models (VLMs) have increased the volume of AD, systematic quality control remains lacking. Existing evaluations rely on short‑clip guidelines or surface‑level NLP similarity metrics, which do not capture the nuanced requirements of full‑length, uninterrupted video.
This paper addresses two research questions: (1) What constitutes AD quality in real‑world practice, and how can professional guidelines be operationalized into a systematic, multi‑dimensional assessment framework? (2) Can state‑of‑the‑art VLMs approximate expert judgments and serve as scalable evaluators when human review is costly?
To answer RQ1, the authors consulted accessibility specialists, blind consultants, and professional AD controllers. They extended established content dimensions (accuracy, relevance, clarity, objectivity, descriptiveness) with two formatting dimensions (timing and delivery method), resulting in a seven-dimensional framework designed specifically for uninterrupted videos between 1:30 and 5:05 in length.
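For concreteness, such a rubric could be encoded as a simple rating schema. The sketch below is an assumption about how that might look in code; the 1–5 scale and the guiding questions are illustrative, not the authors' wording.

```python
# Hypothetical encoding of the seven-dimension rubric; the 1-5 scale and the
# guiding questions below are illustrative assumptions, not the authors' wording.
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    name: str
    kind: str      # "content" or "formatting"
    question: str  # guiding question shown to a rater

FRAMEWORK = [
    Dimension("accuracy",        "content",    "Does the AD describe what actually happens on screen?"),
    Dimension("relevance",       "content",    "Does the AD prioritize the most important visual information?"),
    Dimension("clarity",         "content",    "Is the AD easy to understand on first listen?"),
    Dimension("objectivity",     "content",    "Does the AD avoid interpretation and opinion?"),
    Dimension("descriptiveness", "content",    "Is the AD specific and vivid rather than generic?"),
    Dimension("timing",          "formatting", "Is each description aligned with the relevant visual moment?"),
    Dimension("delivery",        "formatting", "Is the pacing natural and non-intrusive across the full video?"),
]

SCALE = range(1, 6)  # assumed 5-point rating applied to every dimension
```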
For RQ2, a corpus of ten full-length videos was assembled, each paired with one human-authored AD and three VLM-generated ADs, yielding 40 descriptions. Three expert raters, blinded to the source of each description, produced ground-truth scores across all seven dimensions. Four human raters with accessibility experience and eight contemporary VLMs then applied the same framework; the VLM prompts required both per-dimension scores and brief rationales.
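A hedged sketch of what such a rating prompt and output contract might look like; the prompt wording, JSON schema, and function names are assumptions rather than the paper's actual protocol.

```python
# Illustrative prompt/output contract for VLM raters; the wording, JSON schema,
# and function names are assumptions, not the paper's actual protocol.
import json

def build_rating_prompt(video_summary: str, ad_text: str, dimensions: list[str]) -> str:
    dims = "\n".join(f"- {d}: integer score 1-5 plus a one-sentence rationale" for d in dimensions)
    return (
        "You are evaluating an audio description (AD) for a full-length video.\n\n"
        f"Video summary:\n{video_summary}\n\n"
        f"Audio description:\n{ad_text}\n\n"
        "Rate the AD on each dimension below and reply with JSON of the form\n"
        '{"<dimension>": {"score": <int>, "rationale": "<one sentence>"}, ...}\n\n'
        f"{dims}"
    )

def parse_rating(raw_response: str) -> dict[str, tuple[int, str]]:
    """Turn the model's JSON reply into {dimension: (score, rationale)}."""
    data = json.loads(raw_response)
    return {d: (int(v["score"]), v["rationale"]) for d, v in data.items()}
```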
The authors introduce Item Response Theory (IRT) to jointly model rater ability and item (description) difficulty on a common latent scale. IRT enables a more nuanced comparison than simple correlation, revealing where VLMs succeed or fail relative to experts. Results show that VLMs achieve high overall alignment with experts on content dimensions (average Pearson ≈ 0.78), but their ability scores (≈ 0.71) are lower than those of human raters (≈ 0.85). The largest gaps appear in the formatting dimensions (timing and delivery), where VLMs frequently miss temporal alignment and produce less natural pacing. Moreover, VLM-generated rationales tend to be superficial keyword listings, lacking the contextual depth that human raters provide.
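To make the IRT idea concrete, the sketch below fits a minimal Rasch-style (1PL) model that places rater ability and description difficulty on one latent scale. It binarizes "agreement with the expert ground truth," which is a simplification for illustration; the paper's actual model and estimation procedure may differ.

```python
# Minimal Rasch-style (1PL) sketch: rater "ability" and item "difficulty" on a
# shared latent scale. Agreement with expert ground truth is binarized here,
# which is a simplifying assumption, not necessarily the paper's model.
import numpy as np
from scipy.optimize import minimize

def fit_rasch(agree: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """agree: (n_raters, n_items) 0/1 matrix of expert-agreement outcomes."""
    n_raters, n_items = agree.shape

    def neg_log_lik(params):
        ability = params[:n_raters]          # one theta per rater
        difficulty = params[n_raters:]       # one b per description
        logits = ability[:, None] - difficulty[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))
        eps = 1e-9
        return -np.sum(agree * np.log(p + eps) + (1 - agree) * np.log(1 - p + eps))

    res = minimize(neg_log_lik, np.zeros(n_raters + n_items), method="L-BFGS-B")
    ability, difficulty = res.x[:n_raters], res.x[n_raters:]
    shift = difficulty.mean()                # anchor the scale: mean difficulty = 0
    return ability - shift, difficulty - shift

# Toy usage: 12 raters (humans + VLMs) x 40 descriptions with random agreement data.
rng = np.random.default_rng(0)
toy_agreement = (rng.random((12, 40)) > 0.3).astype(float)
abilities, difficulties = fit_rasch(toy_agreement)
```

On real data, the fitted abilities place human and VLM raters on the same scale, which is what supports comparisons like the ≈ 0.85 versus ≈ 0.71 contrast reported above.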
The paper’s contributions are threefold: (1) a novel, multi‑dimensional AD quality framework that incorporates formatting considerations absent from prior work; (2) an IRT‑based methodological pipeline that can be reused for other multimodal evaluation tasks; (3) empirical evidence that VLMs can serve as efficient scorers for certain dimensions but still require human oversight for nuanced, actionable feedback. The authors discuss design implications for hybrid evaluation systems that combine VLM efficiency with human diagnostic value, and outline future work such as expanding the dataset, refining VLM prompts for better temporal reasoning, and extending IRT to multi‑dimensional item parameters.
In summary, while VLMs show promise as scalable evaluators of AD quality, they currently fall short of expert-level reasoning and formatting awareness. A hybrid human‑VLM workflow appears to be the most viable path toward sustainable, large‑scale AD quality assurance.