OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object’s state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle to produce accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
💡 Research Summary
The paper introduces OSCBench, a dedicated benchmark for evaluating how well text‑to‑video (T2V) generation models realize object state changes (OSC) explicitly described in prompts. While recent T2V models have achieved impressive visual fidelity and temporal coherence, existing evaluation suites focus on overall perceptual quality, text‑video alignment, or physical plausibility, leaving OSC—an essential aspect of action understanding—largely unexamined.
Dataset and Abstraction
OSCBench builds on the HowToChange dataset, which contains 20 fine‑grained cooking actions and 134 objects extracted from instructional videos. The authors first use large language models (GPT‑5.2 and Gemini‑3) to propose groupings, then refine them through a human‑in‑the‑loop process, resulting in 9 high‑level action categories (e.g., cutting, heating) and 8 major object categories with 28 sub‑categories (e.g., root vegetables). This abstraction reduces long‑tail bias and enables systematic scenario creation.
Scenario Design
Three complementary scenario families are defined:
- Regular – 108 common action‑object pairs (e.g., slicing lemon) each instantiated with eight concrete prompts, covering typical cooking transformations.
- Novel – 20 deliberately rare but feasible combinations (e.g., peeling berries) that require semantic reasoning rather than memorization.
- Compositional – 12 sequences of two actions applied to the same object (e.g., peel then slice a pear), testing temporal consistency of intermediate and final states.
Overall, OSCBench comprises 1,120 prompts averaging 9.2 words, with a structured template <subject><action><object><scene>. A subset of prompts is simplified to <action><object> to isolate pure OSC reasoning.
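The structured template above can be illustrated with a small sketch. The function name, field values, and example prompts below are hypothetical placeholders for illustration, not entries from the released benchmark:

```python
# Minimal sketch of OSCBench-style prompt construction from the
# <subject><action><object><scene> template. All example values are
# illustrative placeholders, not actual benchmark prompts.

def build_prompt(subject, action, obj, scene=None):
    """Fill the structured template; pass scene=None for the simplified
    <action><object> variant used to isolate pure OSC reasoning."""
    if scene is None:
        return f"{action} {obj}"
    return f"{subject} {action} {obj} {scene}"

full = build_prompt("a chef", "slices", "a lemon", "on a wooden cutting board")
simplified = build_prompt(None, "slicing", "a lemon")
```

The two-variant design lets the same action-object pair be tested with and without scene context, so a drop in OSC accuracy can be attributed to the transformation itself rather than to scene complexity.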
Evaluation Protocol
The benchmark employs a hybrid evaluation:
- Human Study – Six domain experts rate generated videos on four dimensions (semantic adherence, OSC accuracy, scene alignment, perceptual quality) using a 5‑point Likert scale.
- Automatic Scoring – State‑of‑the‑art multimodal large language models (MLLMs) are prompted with a Chain‑of‑Thought (CoT) framework that guides them through criteria grounding, evidence extraction, and justification. For OSC, the model must verify the progression “initial → intermediate → final” state across frames.
Correlation analysis shows a Pearson coefficient of 0.78 between human and MLLM scores, indicating that the automated pipeline reliably mirrors human judgments while drastically reducing annotation cost.
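The agreement statistic is a standard Pearson correlation over paired per-video scores. A minimal self-contained sketch, using made-up illustrative scores rather than the paper's data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-video scores: human Likert ratings vs. MLLM ratings.
human = [5, 3, 4, 2, 1, 4]
mllm = [4, 3, 5, 2, 1, 3]
r = pearson(human, mllm)  # analogous to the reported human-MLLM agreement
```

A coefficient near 1.0 indicates the automatic scorer ranks videos much as humans do; the reported 0.78 sits in the range usually taken to justify substituting MLLM scores for human annotation at scale.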
Experimental Findings
Six representative T2V systems are benchmarked: four open‑source models (Open‑Sora‑2.0, HunyuanVideo, HunyuanVideo‑1.5, Wan‑2.2) and two proprietary models (Kling‑2.5‑Turbo, VEO‑3.1‑Fast). Across all models:
- Semantic alignment is high (≈80 % of prompts receive strong scores).
- OSC accuracy is low, averaging only 35 % correct state changes.
- Performance degrades sharply for novel (≈20 % correct) and compositional (≈15 % correct) scenarios, revealing poor generalization and limited temporal reasoning.
Error analysis highlights frequent issues: objects remaining in the original state, partially completed transformations, or implausible intermediate appearances.
Implications and Future Directions
The results expose OSC as a critical bottleneck in current T2V research. Models excel at generating visually plausible motion but lack explicit mechanisms to map actions to object state transitions and to maintain those transitions over time. The authors suggest integrating structured state representations, action‑conditioned diffusion priors, or graph‑based reasoning modules to address this gap.
OSCBench itself serves as a diagnostic tool: its balanced scenario set, multi‑level difficulty, and validated automatic scoring make it suitable for large‑scale benchmarking and for guiding the development of “state‑aware” video generators. Moreover, the CoT‑based MLLM evaluation framework can be adapted to other video benchmarks, offering a scalable alternative to costly human studies.
In summary, OSCBench fills a notable void in the evaluation landscape by focusing on object state changes, demonstrates that even cutting‑edge T2V models fall short on this dimension, and provides a concrete roadmap for future research aiming at more faithful, action‑grounded video synthesis.