Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
Open-domain video generation requires not only visual realism within individual clips but also narrative coherence and cross-shot consistency throughout long sequences. While recent text-to-video (T2V) diffusion models [19,21,36,39,42] excel at generating short, semantically aligned clips, they remain fragile when tasked with producing extended multi-shot stories involving recurring characters, evolving scenes, and rhythmically structured pacing. The root of this limitation lies in how the generation process is formulated [4,10,30,31].
Most existing pipelines treat video generation as a static, open-loop mapping. Given a script decomposed into $N$ shots, they independently generate each shot $i$ using a T2V model $F$ with textual prompt $p_i$, which can be summarized as $V = \bigoplus_{i=1}^{N} F(p_i)$, where $\bigoplus$ denotes concatenation. This paradigm is inherently stateless: the generation of $s_j$ is conditionally independent of $s_i$ for $i \neq j$. Without any persistent memory or feedback, such models cannot recall an entity's exact visual identity from one shot and reproduce it consistently in subsequent scenes. Consequently, long-form videos suffer from identity drift, scene discontinuity, and unstable visual style, issues that break narrative immersion and limit creative control [2,3,19,32,40,41].
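A minimal sketch of this stateless, open-loop baseline is shown below; the `t2v_model` callable and shot prompts are hypothetical placeholders rather than an actual API.

```python
# Hypothetical open-loop baseline (illustrative names only): each shot is
# generated independently from its own prompt and the clips are concatenated.
from typing import Any, Callable, List

def generate_open_loop(shot_prompts: List[str],
                       t2v_model: Callable[[str], Any]) -> List[Any]:
    """V = F(p_1) ++ ... ++ F(p_N): no state is shared between shots, so a
    recurring entity's appearance in shot j cannot depend on shot i."""
    return [t2v_model(p) for p in shot_prompts]  # stateless: no cross-shot conditioning
```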
We argue that coherent video generation should not be a static mapping $F$ but a dynamic, feedback-driven reasoning process. To this end, we introduce CoAgent, a Collaborative Planning and Consistency Agent that reformulates video synthesis as a closed-loop, multi-agent process of planning, generation, verification, and refinement. Unlike prior prompt-based or globally conditioned approaches, CoAgent explicitly maintains state across shots through structured memory and inter-agent collaboration, enabling both continuity and controllability [4,37,38,44].
At the core of CoAgent lies a Storyboard Planner ($A_{\text{plan}}$) that decomposes a high-level user concept $P_{\text{idea}}$ into a structured shot plan $S = \{S_1, \ldots, S_N\}$ describing entities, spatial-temporal relations, and pacing intent. To preserve identities and context across shots, we introduce a Global Context Manager (GCM), an explicit cross-shot memory $M_{\text{GCM}}$ that registers visual representations of key entities $e_k$ such as characters or props. Conditioned on this memory, the Synthesis Module ($A_{\text{synth}}$) renders each shot as $s_i = A_{\text{synth}}(S_i, M_{\text{GCM}}, s_{i-1})$, retrieving entity appearance from $M_{\text{GCM}}$ and optionally referencing the previous shot $s_{i-1}$ (for example, under ff2v or flf2v modes) to maintain temporal smoothness under the guidance of a Visual Consistency Controller. This conditioning transforms coherence from a fragile prompt heuristic into an explicit, state-aware mechanism.
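The following is a minimal sketch of this memory-conditioned loop, assuming illustrative interfaces; `GlobalContextManager`, `a_synth`, and the plan's `entity_names` field are stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' released code): each shot is rendered
# from its plan S_i, the global entity memory M_GCM, and optionally the
# previous shot s_{i-1} for temporal smoothness.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class GlobalContextManager:
    """M_GCM: explicit cross-shot memory of entity appearance references."""
    entities: Dict[str, Any] = field(default_factory=dict)

    def register(self, name: str, appearance: Any) -> None:
        self.entities[name] = appearance            # e.g., embedding or reference frame

    def retrieve(self, names: List[str]) -> Dict[str, Any]:
        return {n: self.entities[n] for n in names if n in self.entities}

def synthesize_shots(shot_plans: List[Any], a_synth, gcm: GlobalContextManager) -> List[Any]:
    shots, prev = [], None
    for plan in shot_plans:
        refs = gcm.retrieve(plan.entity_names)      # identity conditioning from M_GCM
        shot = a_synth(plan, refs, prev_shot=prev)  # s_i = A_synth(S_i, M_GCM, s_{i-1})
        shots.append(shot)
        prev = shot                                 # previous shot enables ff2v / flf2v modes
    return shots
```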
To close the loop, a Verifier Agent ($A_{\text{verify}}$) employs a vision-language model to assess each synthesized shot and produces a verification signal $V_i = A_{\text{verify}}(s_i, S_i, M_{\text{GCM}})$, quantifying both shot fidelity and cross-shot consistency. If $V_i < \tau$, indicating inconsistency, CoAgent triggers selective regeneration, refining $S_i$ or adjusting the synthesis mode within $A_{\text{synth}}$, thereby instituting a self-correcting feedback loop. A pacing-aware editor subsequently reconciles rhythm and transitions, ensuring that the final video aligns with the intended narrative tempo and mood.
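A hedged sketch of this verify-and-regenerate loop is given below, reusing the illustrative interfaces from the previous sketch; `a_verify`, the threshold `tau`, the retry budget, and the `plan.refined` helper are assumptions for exposition, not the paper's exact mechanism.

```python
# Hedged sketch of the closed loop: a_verify scores a shot against its plan and
# M_GCM; shots scoring below tau are selectively regenerated from a refined plan.
def generate_with_verification(shot_plans, a_synth, a_verify, gcm,
                               tau: float = 0.8, max_retries: int = 2):
    shots, prev = [], None
    for plan in shot_plans:
        shot = None
        for _ in range(max_retries + 1):
            refs = gcm.retrieve(plan.entity_names)
            shot = a_synth(plan, refs, prev_shot=prev)
            score = a_verify(shot, plan, gcm)        # V_i: shot fidelity + cross-shot consistency
            if score >= tau:
                break                                # shot accepted
            plan = plan.refined(feedback=score)      # hypothetical helper: refine S_i or switch mode
        shots.append(shot)
        prev = shot
    return shots
```

The bounded retry budget is one way to realize the controllable trade-off between generation efficiency and visual fidelity described in the contributions.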
In summary, CoAgent transforms video generation from a stateless, open-loop pipeline into a stateful, agentic framework. By coupling explicit global memory with verification-driven feedback, it shifts the responsibility for coherence from brittle prompt engineering to structured, collaborative reasoning. Our main contributions are summarized as follows:
1. We propose CoAgent, a collaborative closed-loop paradigm that unifies planning, synthesis, and verification for multi-shot video generation.
2. We introduce an explicit entity-level memory module $M_{\text{GCM}}$ that preserves identity and appearance consistency across shots through structured retrieval and conditioning.
3. We develop a verifier-guided adaptive synthesis strategy that achieves a controllable trade-off between generation efficiency and visual fidelity via selective regeneration.
Our work, CoAgent, addresses the challenge of long-form, coherent video generation by integrating narrative planning, stateful memory, and closed-loop verification. Our contributions are thus situated at the intersection of three primary research thrusts: (1) spatiotemporal consistency preservation, (2) high-level narrative and compositional planning, and (3) agentic, feedback-driven generative systems.
A significant body of work focuses on mitigating the "visual failure" component of the narrative gap, such as identity drift and scene incoherence.
Identity Preservation. Maintaining character identity is a critical sub-problem. A popular approach involves using tuning-free adapters [13,20,28,34,35] to inject identity features, extracted from reference images via encoders such as CLIP or ArcFace, into the cross-attention layers of a pretrained diffusion backbone.
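As a rough illustration of how such adapters operate, the sketch below shows an IP-Adapter-style decoupled cross-attention; it is a generic example under our own naming, not a reproduction of any specific cited method.

```python
# Illustrative identity injection: features from a reference encoder (e.g., CLIP
# or ArcFace) are attended to via a second cross-attention branch alongside the
# text branch, without fine-tuning the backbone.
import torch
import torch.nn as nn

class IdentityCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, id_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_id = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_id = nn.Linear(id_dim, dim)   # project identity features into model width

    def forward(self, hidden: torch.Tensor, text_ctx: torch.Tensor,
                id_feats: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        h_text, _ = self.attn_text(hidden, text_ctx, text_ctx)   # standard text cross-attention
        id_ctx = self.proj_id(id_feats)
        h_id, _ = self.attn_id(hidden, id_ctx, id_ctx)           # identity cross-attention
        return hidden + h_text + scale * h_id                    # scale trades off identity strength
```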