Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?
State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
💡 Research Summary
The paper tackles the pressing problem of evaluating text‑conditioned audio‑video generation, a task that has become increasingly relevant with the emergence of high‑fidelity models such as OpenAI’s Sora 2 and Google DeepMind’s Veo 3. Human evaluation, while reliable, is expensive, slow, and difficult to scale; traditional automatic metrics (e.g., Fréchet Video Distance, CLAP, FA‑VD) each focus on a single modality or a pair of modalities, ignore the textual prompt, and provide little interpretability. To address these gaps, the authors propose “Omni‑Judge,” a systematic evaluation framework that leverages omni‑modal large language models (omni‑LLMs) capable of jointly processing text, image, audio, and video.
Dataset and Generation
The authors curate a benchmark of 300 real‑user prompts from the VidProM dataset, ensuring diversity across genres, subjects, visual styles, camera techniques, and sound specifications. For each prompt, they generate two audio‑video clips using Sora 2 and Veo 3, resulting in 600 samples.
Human Ground‑Truth
Six Ph.D. students rate each clip on nine dimensions: video quality, audio quality, audio‑text alignment, video‑text alignment, audio‑video alignment, audio‑video‑text coherence, audio‑video synchronization, video aesthetic, and audio aesthetic. Scores are on a 1‑5 Likert scale. The human results reveal that Veo 3 generally produces sharper, more stable video and higher aesthetic scores, while both models struggle with temporal synchronization between sound and motion.
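The per‑clip ground truth used in later correlation analysis can be obtained by averaging the six raters' scores per metric. The sketch below illustrates this on synthetic data; the paper does not specify its exact aggregation scheme, so the mean‑over‑raters approach, array shapes, and variable names here are assumptions:

```python
import numpy as np

# Hypothetical ratings: 4 clips x 6 raters x 9 metrics, integer scores in 1..5.
# (Real data would come from the human annotation described above.)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(4, 6, 9))

# Per-clip, per-metric ground truth: mean over the six raters.
ground_truth = ratings.mean(axis=1)  # shape (4, 9), values stay within 1..5
```

Averaging over raters smooths out individual disagreement while keeping one scalar target per clip and metric.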
Omni‑Judge Design
Omni‑Judge prompts an omni‑LLM (e.g., GPT‑4o, Gemini 2.5) with a set of metric‑specific instructions. The model receives the textual prompt together with the generated video and audio, then produces a numeric score for each metric and a natural‑language justification using chain‑of‑thought reasoning. This approach enables joint assessment of all three modalities, semantic reasoning, and interpretable feedback.
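A minimal sketch of this judging loop, assuming a metric‑specific prompt template and a `Score: <n>` answer convention (the template wording, metric keys, and parsing function are hypothetical; the paper's actual instructions are not reproduced here, and the call to the omni‑LLM itself is omitted):

```python
import re

# Illustrative metric-specific instructions (not the paper's actual wording).
METRIC_INSTRUCTIONS = {
    "audio_text_alignment": "Rate 1-5 how well the audio matches the prompt.",
    "video_text_alignment": "Rate 1-5 how well the video matches the prompt.",
}

def build_judge_prompt(metric: str, user_prompt: str) -> str:
    """Compose a metric-specific judging prompt to send alongside the
    generated audio and video (structure is illustrative)."""
    return (
        f"{METRIC_INSTRUCTIONS[metric]}\n"
        f"Generation prompt: {user_prompt}\n"
        "Think step by step, then end your answer with 'Score: <1-5>'."
    )

def parse_score(reply: str) -> int:
    """Extract the numeric score from the model's free-form, chain-of-thought
    style reply."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError("no score found in reply")
    return int(match.group(1))

prompt = build_judge_prompt("audio_text_alignment", "a thunderstorm over a desert")
# A hypothetical model reply, parsed into a numeric score:
score = parse_score("The rain sounds fit the storm. Score: 4")
```

Keeping the score format machine‑parseable while letting the model reason freely beforehand is what makes both the numeric metric and the natural‑language justification available from one call.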
Results
Correlation analysis shows that Omni‑Judge’s scores align with human judgments at least as well as, and often better than, traditional metrics. Notably, for semantic metrics (audio‑text, video‑text, and tri‑modal coherence) the Spearman correlation reaches 0.78‑0.82, surpassing FVD (≈0.58) and CLAP (≈0.62). However, for high‑FPS perceptual metrics such as video quality (color consistency, frame flicker) and audio‑video synchronization, the correlation drops to around 0.55, reflecting the current temporal resolution limits of omni‑LLMs.
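Spearman correlation, the statistic reported above, is the Pearson correlation of the rank‑transformed scores. A minimal tie‑free implementation (the sample data is illustrative, not from the paper):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Ties are not handled (tied values would need averaged ranks);
    sufficient for illustration on tie-free data."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical per-clip scores: human means vs. judge outputs.
human = np.array([4.2, 3.1, 2.5, 4.8, 3.9])
judge = np.array([4.1, 3.0, 2.2, 4.9, 3.8])  # same ordering as human
rho = spearman(human, judge)  # 1.0: identical rankings
```

Because Spearman only compares rankings, a judge that is systematically harsher or more lenient than humans can still correlate perfectly, which is the appropriate notion of alignment for relative quality judgments.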
Interpretability and Downstream Use
Omni‑Judge provides explicit natural‑language explanations (e.g., “the background music does not match the desert scene” or “lip movements are missing despite spoken dialogue”). The authors demonstrate that these explanations can be fed back into the generation pipeline to adjust prompts or fine‑tune the generative model, leading to measurable improvements in subsequent evaluations.
Limitations and Future Work
The study highlights two main limitations: (1) omni‑LLMs lack fine‑grained temporal modeling needed for high‑frame‑rate video and precise audio‑video sync; (2) the evaluation relies on a relatively small set of prompts and human raters, which may limit generalizability. Future directions include developing omni‑LLMs with higher temporal resolution (e.g., video‑specific transformers), expanding the benchmark with more diverse and longer prompts, and continuously fine‑tuning the evaluator on human‑annotated data to close the remaining performance gap.
Conclusion
Omni‑Judge demonstrates that omni‑modal LLMs can serve as unified, human‑aligned judges for text‑conditioned audio‑video generation, excelling in semantic alignment and offering interpretable feedback, while still lagging on pure perceptual fidelity at high temporal frequencies. The work opens a promising avenue toward scalable, explainable, and multimodal evaluation frameworks that could become standard tools for the next generation of generative AI systems.