LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
Reading time: 1 minute
📝 Original Info
- Title: LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
- ArXiv ID: 2510.26412
- Date: 2025-10-30
- Authors: Not listed in the processed metadata (refer to the original PDF for author names and affiliations).
📝 Abstract
Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 13 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation.
💡 Deep Analysis
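The abstract names five evaluation dimensions for LoCoT2V-Eval but this digest does not include the paper's scoring or aggregation rules. The sketch below is therefore only an illustration of how per-dimension scores for one LVG model might be combined into a single number; the equal weights, the `aggregate` helper, and the example scores are assumptions, not the authors' method.

```python
# Illustrative sketch only: the actual LoCoT2V-Eval scoring and aggregation
# are not specified in this digest. Dimension names follow the abstract;
# equal weights and the aggregate() helper are hypothetical placeholders.

from typing import Dict

DIMENSION_WEIGHTS: Dict[str, float] = {
    "perceptual_quality": 0.2,
    "text_video_alignment": 0.2,
    "temporal_quality": 0.2,
    "dynamic_quality": 0.2,
    "herd": 0.2,  # Human Expectation Realization Degree
}

def aggregate(scores: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores in [0, 1] (hypothetical scheme)."""
    total_weight = sum(DIMENSION_WEIGHTS[d] for d in scores)
    return sum(DIMENSION_WEIGHTS[d] * s for d, s in scores.items()) / total_weight

if __name__ == "__main__":
    # Example scores (made up) mirroring the reported trend: strong perceptual
    # quality, weaker fine-grained alignment and character-related consistency.
    example = {
        "perceptual_quality": 0.85,
        "text_video_alignment": 0.45,
        "temporal_quality": 0.70,
        "dynamic_quality": 0.60,
        "herd": 0.50,
    }
    print(f"Aggregate score: {aggregate(example):.3f}")
```

In practice, such an aggregate would hide exactly the disparity the benchmark highlights, which is why the paper reports the dimensions separately rather than only a single overall score.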
📄 Full Content
Reference
This content is AI-processed based on open access ArXiv data.