E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
E-commerce short videos represent a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, Perception & Cognition and Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
💡 Research Summary
The paper addresses the emerging challenge of understanding short e‑commerce advertisement videos, which are characterized by rapid editing, dense on‑screen text, and continuous speech, creating a highly multimodal and information‑dense signal that current video‑question‑answering (VideoQA) models struggle with. To quantify this “high information density,” the authors propose a three‑pronged multimodal density assessment framework: Visual dynamic density (Vden) measures the rate of semantic change across frames using DINOv3‑Base features and an exponentially decayed similarity metric; Audio density (Aden) counts ASR‑derived words per second; Textual density (Oden) counts OCR‑derived words per second. Compared with mainstream benchmarks (VideoMME‑short, MVBench, ActivityNetQA, etc.), E‑VAds exhibits substantially higher values (Vden ≈ 60.44, Aden ≈ 5.08, Oden ≈ 18.66), especially a four‑to‑five‑fold increase in textual density, confirming that e‑commerce short videos present a qualitatively harder domain.
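The three densities above can be sketched concretely. Aden and Oden are simple word-per-second rates; for Vden, the summary only states that semantic change across frames is measured with DINOv3 features and an exponentially decayed similarity metric, so the exact aggregation below (pairwise dissimilarity weighted by decayed frame distance) is one plausible reading, not the paper's formula, and the `decay` constant is a hypothetical parameter:

```python
import numpy as np

def audio_density(asr_words: list, duration_s: float) -> float:
    # Aden: ASR-derived words per second.
    return len(asr_words) / duration_s

def text_density(ocr_words: list, duration_s: float) -> float:
    # Oden: OCR-derived words per second.
    return len(ocr_words) / duration_s

def visual_density(frame_feats: np.ndarray, decay: float = 0.5) -> float:
    """Vden sketch (assumed formulation): accumulate semantic change
    (1 - cosine similarity) between frame pairs, down-weighted by an
    exponential decay over their temporal distance, normalized per frame.
    frame_feats: (n_frames, dim) array of, e.g., DINOv3-Base embeddings."""
    n = len(frame_feats)
    # L2-normalize so dot products become cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            sim = float(feats[i] @ feats[j])
            total += np.exp(-decay * (j - i)) * (1.0 - sim)
    return total / n
```

Under this reading, a static video (identical frames) scores zero visual density, while rapid semantic cuts drive the score up, matching the qualitative gap the paper reports between E-VAds and mainstream benchmarks.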
Data collection starts from billions of Taobao ad videos. An automated filtering pipeline removes low‑quality, overly short, or non‑commercial clips, leaving ~30 k candidates. A dynamic, sigmoid‑based sampling function balances category distribution, resulting in 3,961 high‑quality videos covering a wide range of product categories. Each video is aligned at one‑second granularity with ASR transcripts, OCR text, visual frames, and metadata, forming a structured multimodal context.
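The summary does not give the sigmoid sampler's exact form, but the idea of balancing category distribution by down-weighting overrepresented categories can be sketched as follows; `pivot` and `temp` are illustrative hyperparameters, not values from the paper:

```python
import math

def keep_probability(category_count: int, pivot: float = 500.0,
                     temp: float = 100.0) -> float:
    """Hypothetical sigmoid-based sampling function: videos from rare
    categories are kept with probability near 1, while categories whose
    count far exceeds `pivot` are sampled with probability near 0,
    flattening the long-tailed category distribution."""
    return 1.0 / (1.0 + math.exp((category_count - pivot) / temp))
```

The sigmoid's smooth roll-off avoids a hard per-category cap: moderately popular categories are thinned gradually rather than truncated at a fixed count.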
To generate the benchmark, the authors design a multi‑agent annotation system. Three roles—question generator, answer validator, and evidence extractor—operate sequentially, each powered by large language models (LLMs) with task‑specific prompts. Human experts perform a final review to ensure consistency and reduce subjectivity. The resulting dataset, E‑VAds, contains 19,785 open‑ended QA pairs organized into two dimensions (Perception & Cognition, Reasoning) and five task types: Basic Perception, Cross‑Modal Detection, Marketing Logic, Consumer Insight, and Regulatory Compliance. Each QA is accompanied by explicit visual, OCR, and ASR evidence, encouraging models to produce evidence‑grounded answers.
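The sequential three-role pipeline can be outlined as below; the role interfaces and the `QAItem` container are illustrative, since the summary specifies only the roles, their order, and that each wraps an LLM with a task-specific prompt:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QAItem:
    question: str = ""
    answer: str = ""
    evidence: dict = field(default_factory=dict)  # visual / OCR / ASR evidence

def annotate(video_ctx: dict, generator, validator, extractor) -> Optional[QAItem]:
    """Sketch of the three-role annotation chain. Each callable is assumed
    to wrap an LLM call with its task-specific prompt (interfaces are
    hypothetical). Items rejected by the validator are dropped; surviving
    items go on to human expert review."""
    item = QAItem()
    item.question = generator(video_ctx)                    # 1. question generator
    ok, item.answer = validator(video_ctx, item.question)   # 2. answer validator
    if not ok:
        return None
    item.evidence = extractor(video_ctx, item)              # 3. evidence extractor
    return item
```

Attaching explicit evidence at generation time is what lets each QA pair ship with the visual, OCR, and ASR grounding the benchmark requires.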
For modeling, the paper introduces E‑VAds‑R1, an RL‑based reasoning architecture built on top of existing video‑LLM backbones (e.g., Flamingo‑Video, InternVL). The key innovation is the Multi‑Grained Reward design (MG‑GRPO). Rewards are decomposed into low‑granularity signals (e.g., answer correctness) that guide early exploration, and high‑granularity signals (e.g., multimodal evidence alignment, commercial intent accuracy) that provide a non‑linear incentive for expert‑level performance. This hierarchical reward scheme enables the model to achieve a 109.2 % relative improvement in commercial intent reasoning while being trained on only a few hundred annotated examples. Empirical results show consistent gains across all five tasks, with average absolute improvements of 12–18 % over strong baselines in metrics such as accuracy, BLEU, ROUGE, and METEOR; the largest gains appear in Marketing Logic and Consumer Insight, where commercial reasoning is most demanding.
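The hierarchical reward idea admits a minimal sketch. The decomposition below (a smooth coarse term plus a fine term raised to a power so it only pays off near perfect scores) illustrates the "smooth guidance early, non-linear incentive late" behavior; the term names, weights, and exponent are assumptions, not the paper's exact MG-GRPO formulation:

```python
def mg_reward(ans_score: float, evidence_score: float,
              gamma: float = 4.0, w_fine: float = 0.5) -> float:
    """Multi-grained reward sketch. `ans_score` (answer correctness) and
    `evidence_score` (multimodal evidence alignment) are assumed to lie
    in [0, 1]. The coarse term gives a dense, linear signal that guides
    early exploration; the fine term grows as the gamma-th power of the
    joint score, so reward climbs sharply only near expert-level precision."""
    coarse = ans_score                              # low-granularity signal
    fine = (ans_score * evidence_score) ** gamma    # high-granularity signal
    return (1 - w_fine) * coarse + w_fine * fine
```

With this shape, the marginal reward for improving from 0.8 to 0.9 exceeds that for improving from 0.5 to 0.6, which is exactly the non-linear incentive toward expert-level outputs the paper attributes to MG-GRPO.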
The related‑work discussion positions E‑VAds as the first benchmark focusing on short, conversion‑oriented e‑commerce videos, contrasting it with prior VideoQA datasets (NextQA, EgoSchema) that target narrative or instructional content, and with advertising benchmarks (AdsQA, VideoAds) that deal with longer brand‑centric ads. The paper also surveys recent multimodal LLMs (LLaVA, Flamingo, InternVL, QwenVL) and reinforcement‑learning‑for‑reasoning approaches (RLHF, ReAd‑R), highlighting that none of these adequately address the dense multimodal and intent‑driven nature of e‑commerce short videos.
In conclusion, the study makes three major contributions: (1) a quantitative multimodal density framework that objectively characterizes the difficulty of e‑commerce short videos; (2) a scalable, semi‑automated pipeline that produces a large, high‑quality, evidence‑rich QA benchmark; and (3) an RL‑based reasoning model with multi‑grained rewards that dramatically improves commercial intent inference with minimal supervision. The authors suggest future directions including real‑time inference optimization, integration of user interaction logs for causal reasoning, multilingual expansion, and the development of ethical evaluation metrics for persuasive content.