AutoMV: An Automatic Multi-Agent System for Music Video Generation


Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lacking temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes such as structure, vocal tracks, and time-aligned lyrics, and assembles these features as contextual inputs for subsequent agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call an image generator for keyframes and different video generators for “story” or “singer” scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. We applied this benchmark to compare commercial products, AutoMV, and human-directed MVs using expert human raters: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.


💡 Research Summary

The paper “AutoMV: An Automatic Multi-Agent System for Music Video Generation” addresses the significant challenges in automatically generating full-length music videos (MVs) directly from a song, a task known as Music-to-Video (M2V) generation. Existing methods typically produce short, disjointed clips that fail to align with the musical structure, beats, or lyrics and lack the long-term temporal consistency required for a coherent narrative. The authors propose AutoMV, a novel, training-free multi-agent pipeline designed to overcome these limitations.

AutoMV operates through a structured, collaborative workflow involving several specialized AI agents. The process begins with Music-Aware Preprocessing, where tools like Qwen2.5-Omni (for music captioning), SongFormer (for structure segmentation), htdemucs (for vocal/accompaniment separation), and Whisper (for lyric transcription with timestamps) are employed. These tools extract high-level attributes such as genre, mood, song sections (intro, verse, chorus), and time-aligned lyrics, constructing a rich contextual foundation for subsequent agents.
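The preprocessing stage described above can be sketched as a function that assembles one shared context object for the downstream agents. This is an illustrative sketch only: the tool wrappers below (`describe_music`, `segment_structure`, `separate_vocals`, `transcribe_lyrics`) are hypothetical stand-ins for calls to Qwen2.5-Omni, SongFormer, htdemucs, and Whisper, stubbed here with placeholder outputs, and `MusicContext` is not the paper's actual schema.

```python
from dataclasses import dataclass

# Stub tool wrappers: a real system would invoke the respective models.
def describe_music(path):     return "upbeat pop, bright mood"
def segment_structure(path):  return [("intro", 0.0, 12.0), ("verse", 12.0, 40.0), ("chorus", 40.0, 62.0)]
def separate_vocals(path):    return path.replace(".wav", "_vocals.wav")
def transcribe_lyrics(path):  return [(12.5, 15.0, "first line of the verse")]

@dataclass
class MusicContext:
    caption: str        # genre/mood description from the captioner
    sections: list      # (label, start_s, end_s) structure segments
    vocal_track: str    # path to the separated vocal stem
    lyrics: list        # (start_s, end_s, text) timestamped lyric lines

def preprocess(song_path: str) -> MusicContext:
    """Run all music tools once and bundle their outputs as agent context."""
    vocals = separate_vocals(song_path)
    return MusicContext(
        caption=describe_music(song_path),
        sections=segment_structure(song_path),
        vocal_track=vocals,
        lyrics=transcribe_lyrics(vocals),
    )
```

Bundling the outputs into one object means every later agent (screenwriter, director, verifier) reads from the same ground truth about the song.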

Next, the Screenwriter and Director Agents (powered by Gemini and Doubao APIs) interpret this multimodal context. The Screenwriter agent segments the song temporally based on lyrics and structure, generates a narrative scenario description for each segment, and creates detailed character profiles (appearance, attire). These profiles are stored in a shared external “Character Bank” to ensure identity consistency across different scenes and shots. The Director agent then uses this script to specify camera instructions and generate prompts for keyframe images.
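The shared "Character Bank" can be pictured as a small registry that agents write to once and read from many times, so every shot's prompt describes each character identically. The class and field names below are illustrative, not the paper's implementation; the first-registration-wins policy is an assumed consistency rule.

```python
class CharacterBank:
    """Hypothetical shared store of character profiles for prompt consistency."""

    def __init__(self):
        self._profiles = {}

    def register(self, name, appearance, attire):
        # First registration wins: later scenes reuse the original profile,
        # so a character cannot drift in look between shots.
        self._profiles.setdefault(name, {"appearance": appearance, "attire": attire})

    def prompt_snippet(self, name):
        """Return a reusable description fragment for image/video prompts."""
        p = self._profiles[name]
        return f"{name}: {p['appearance']}, wearing {p['attire']}"
```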

The Video Generation (Renderer) stage follows, where the system adaptively chooses between different generation backends based on scene type. For narrative “story” scenes, it utilizes text-to-video or image-to-video models. For “singer” scenes featuring vocal performance, it opts for speech-to-video models capable of lip-syncing to the isolated vocal track. This specialization ensures appropriate visual treatment for different parts of the song.
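The backend routing described above reduces to a simple rule: lip-synced performance shots need a speech-to-video model driven by the isolated vocals, while narrative shots use image- or text-conditioned generation. A minimal sketch, with the routing condition inferred from the summary rather than taken from the paper's code:

```python
def pick_backend(scene_type: str, has_keyframe: bool) -> str:
    """Choose a generation backend per scene (illustrative routing rule).

    "singer" scenes require lip sync to the separated vocal track, so they
    go to a speech-to-video model; "story" scenes use image-to-video when a
    director-approved keyframe exists, otherwise plain text-to-video.
    """
    if scene_type == "singer":
        return "speech-to-video"
    return "image-to-video" if has_keyframe else "text-to-video"
```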

A critical innovation is the inclusion of a Verifier Agent (also based on Gemini). This agent evaluates each generated video clip against the original script instructions, checking for alignment, physical feasibility, and overall quality. Clips that fail to meet a certain threshold are sent back for regeneration, creating an iterative feedback loop that significantly enhances the final output’s coherence and quality. All verified clips are finally compiled in sequence to produce the complete music video.
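The verifier's feedback loop can be sketched as a bounded retry: render, score against the script, and regenerate until the score clears a threshold or a retry budget runs out. The threshold and retry count below are illustrative assumptions; `render` and `verify` are stand-ins for the video generator and the Verifier Agent.

```python
def generate_with_verifier(render, verify, threshold=0.7, max_retries=3):
    """Regenerate a clip until the verifier's score clears the threshold.

    render   -- callable returning a candidate clip (generator stand-in)
    verify   -- callable scoring a clip against the script (verifier stand-in)
    Returns the first passing clip, or the best-scoring attempt as a fallback.
    """
    best_clip, best_score = None, float("-inf")
    for _ in range(max_retries):
        clip = render()
        score = verify(clip)
        if score >= threshold:
            return clip, score
        if score > best_score:                # keep the best failure seen so far
            best_clip, best_score = clip, score
    return best_clip, best_score
```

Keeping the best failed attempt as a fallback ensures the pipeline always emits a clip for every segment, even when no candidate passes within the budget.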

To rigorously evaluate M2V generation—a domain lacking standardized metrics—the authors introduce a comprehensive new benchmark. This benchmark comprises four high-level categories: Music Content Alignment (beat, lyric, structure sync), Technical Quality (video quality, temporal consistency, physical plausibility), Post-production (editing, transitions, visual effects), and Artistic Merit (creativity, storytelling, aesthetics), broken down into twelve fine-grained criteria. Using this framework, expert human raters evaluated commercial M2V tools (Pika, Runway, etc.), AutoMV, and professionally human-directed MVs.
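Scoring under this benchmark amounts to rating the twelve criteria and rolling them up into the four categories. A minimal sketch of that aggregation: the category names come from the paper, the per-category criterion lists follow the groupings in this summary, and simple averaging is an assumed scheme rather than the paper's stated protocol.

```python
# Criterion-to-category grouping as paraphrased in the summary above.
CATEGORIES = {
    "Music Content Alignment": ["beat sync", "lyric sync", "structure sync"],
    "Technical Quality": ["video quality", "temporal consistency", "physical plausibility"],
    "Post-production": ["editing", "transitions", "visual effects"],
    "Artistic Merit": ["creativity", "storytelling", "aesthetics"],
}

def category_scores(criterion_scores: dict) -> dict:
    """Average fine-grained ratings (e.g. 1-5) into the four category scores."""
    return {
        cat: sum(criterion_scores[c] for c in crits) / len(crits)
        for cat, crits in CATEGORIES.items()
    }
```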

The experimental results demonstrate that AutoMV significantly outperforms all current commercial baselines across all four evaluation categories. Furthermore, AutoMV’s scores narrow the gap to professionally produced human-directed MVs, indicating a substantial advancement in the quality of AI-generated long-form music videos. The paper also explores the use of Large Multimodal Models (LMMs) like GPT-4o and Gemini as automatic judges. While these show promise and some correlation with human scores, they still lag behind expert human judgment, identifying an area for future improvement.

In summary, the paper’s contributions are threefold: (1) the introduction of AutoMV, the first open-source, multi-agent pipeline for generating coherent, full-length music videos directly from audio; (2) the proposal of the first dedicated benchmark for evaluating long-form M2V generation; and (3) an extensive ablation study validating the importance of its core components—music preprocessing, the character bank, and the verifier agent. AutoMV represents a significant step towards democratizing high-quality music video production, potentially lowering barriers for independent artists and creators.

