Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Reading time: 4 minutes
...

📝 Original Info

  • Title: Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
  • ArXiv ID: 2511.17844
  • Date: 2025-11-21
  • Authors: Shihan Cheng¹, Nilesh Kulkarni², David Hyde¹, Dmitriy Smirnov² (¹Vanderbilt University, ²Netflix)

📝 Abstract

Figure 1. Our "Less is More" framework for data-efficient controllable generation. A T2V backbone, fine-tuned solely on a sparse, low-fidelity synthetic dataset (left), learns to generalize to complex physical controls. This enables precise, high-fidelity manipulation of shutter speed (motion blur), aperture (bokeh), and color temperature during real-world inference (right), driven by a continuous control.

💡 Deep Analysis

📄 Full Content

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng¹, Nilesh Kulkarni², David Hyde¹, Dmitriy Smirnov²
¹Vanderbilt University  ²Netflix
{shihan.cheng, david.hyde.1}@vanderbilt.edu  {nkulkarni, dimas}@netflix.com

[Figure 1: panel labels — Shutter, Aperture, Temperature, each at Low / Medium / High; Low-Fidelity Synthetic Training (left) vs. High-Fidelity Controllable Generation (right). Caption as given above.]

Abstract

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

1. Introduction

Recent advances in generative AI using diffusion models have enabled unprecedented levels of quality in video generation. The primary foundational models for video generation are text-to-video (T2V), such as [28, 43], where users describe their desired creations in natural language. Due to the limitations of controllability via text, significant effort has been put into creating methods that accept other input modalities, such as images, keyframes, depth maps, bounding boxes, pose skeletons, driver videos, camera trajectories, and so on [1, 3, 15, 43]. However, achieving consistent, reliable, and intuitive fine-grained control over all aspects of the output video remains a challenge. In this work, we tackle the problem of adding a control mechanism over specific low-dimensional physical or optical properties, such as camera intrinsics, via a simple synthetic data generation and model fine-tuning framework.

A common pattern that has emerged in methods for producing specialized generative video models is starting with a large "foundation" model, trained on huge amounts of video data, and fine-tuning it on a carefully crafted, task-specific smaller dataset [28]. Such datasets help the model focus on a particular character identity, artistic style, or specialized effect. The success of such approaches hints at the fact that the initial pre-training equips the model with many useful priors implicit in its latent representation, which can be explicitly "coaxed" out during post-training. Our approach is the first that aims to enable conditioning on camera effects (e.g., shutter speed, focal length) in pre-trained video generative models for consistent generation. While the quality of the post-training data is indeed critical, we argue and demonstrate that, surprisingly, having data that are perfectly photorealistic and representative of the output domain can be not only unnecessary but even detrimental for certain specializations.
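To make the idea of "sparse, low-fidelity synthetic data" concrete, here is a minimal, self-contained sketch of how such training clips could be rendered: a bright square moving over a noisy background, with each camera parameter in [0, 1] mapped to a cheap image operation (sub-frame averaging for shutter/motion blur, a box blur on the background for aperture/bokeh, and per-channel gains for color temperature). The function name, parameter ranges, and these specific mappings are illustrative assumptions, not the paper's actual data pipeline.

```python
import numpy as np

def _blur_rows(img, k):
    """1-D horizontal box blur (a cheap stand-in for background bokeh)."""
    if k <= 1:
        return img
    kernel = np.ones(k) / k
    out = np.empty_like(img)
    for c in range(img.shape[-1]):
        out[..., c] = np.apply_along_axis(
            lambda r: np.convolve(r, kernel, mode="same"), 1, img[..., c])
    return out

def render_clip(shutter, aperture, temperature, n_frames=16, size=64, seed=0):
    """Render a deliberately low-fidelity clip: a bright square moving over a
    noisy background, with each control in [0, 1] mapped to a simple image op.
    All mappings here are illustrative assumptions, not the paper's pipeline."""
    rng = np.random.default_rng(seed)
    # Aperture -> blur the background only, keeping the foreground subject sharp.
    background = _blur_rows(rng.uniform(0.2, 0.5, (size, size, 3)),
                            max(1, int(aperture * 9)))
    frames = []
    for t in range(n_frames):
        x = int(t / n_frames * (size - 12))          # subject position this frame
        n_sub = max(1, int(shutter * 8))             # shutter -> motion-blur samples
        acc = np.zeros_like(background)
        for s in range(n_sub):                       # average sub-frame exposures
            f = background.copy()
            xs = min(size - 12, x + s)
            f[26:38, xs:xs + 12] = 1.0               # bright foreground square
            acc += f
        frame = acc / n_sub
        frame[..., 0] *= 1.0 + 0.3 * (temperature - 0.5)   # temperature -> warm gain
        frame[..., 2] *= 1.0 - 0.3 * (temperature - 0.5)   # temperature -> cool gain
        frames.append(np.clip(frame, 0.0, 1.0))
    return np.stack(frames)                          # (n_frames, size, size, 3)

clip = render_clip(shutter=0.8, aperture=0.2, temperature=0.7)
```

The point of data like this is not realism: each clip isolates one unambiguous, continuously parameterized effect, which is exactly the signal the paper argues a pre-trained backbone needs in order to expose its existing priors.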
This paper proposes an approach that allows the use of synthetic data to learn several conditioning effects. We contribute a novel joint training approach that factorizes adaptation: a standard low-rank adapter (LoRA) [13] encodes a minimal domain shift, while a disentangled cross-attention adapter learns the conditioning physical effect. Additionally, our paper's contributions include a formal analysis of why this data-efficient approach succeeds. We demonstrate that, counterintuitively, the success of small-data fine-tuning hinges on the simplicity of the synthetic data. We show that fine-tuning on photorealistic synthetic data, while seemingly a higher-fidelity choice, induces catastrophic forgetting by corrupting the backbone's pre-trained priors, leading to a "content collapse." To quantify this, we introduce a new evaluation framework that measures this generative drift and its impact on semantic fidelity. Using this framework, we show that a model trained on simple data retains its generative diversity and high semantic fidelity, while the model trained on complex data suffers a quantifiable and catastrophic collapse. Our work provides both a data-efficient method for controllable video generation and a formal methodology for diagnosing and preventing backbone corruption during adaptation.

2. Related Work

Text-to-Video Generation. Diffusion models [12, 36] have become the leading approach for video
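For intuition, the factorized adaptation described in the contributions above could be wired as in the following minimal PyTorch-style sketch: a LoRA wrapper around a frozen backbone projection (the "minimal domain shift") and a separate, zero-initialized cross-attention branch that attends to an embedding of the continuous camera parameter (the "conditioning physical effect"). The module names, dimensions, gating, and scalar-control embedding are our assumptions for illustration; the paper's exact adapter architecture may differ.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a low-rank residual update."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                 # start as an identity adaptation
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class ControlCrossAttention(nn.Module):
    """Separate cross-attention branch that attends to an embedding of the
    continuous camera parameter (e.g., shutter, aperture, or temperature)."""
    def __init__(self, dim: int, ctrl_dim: int = 128, heads: int = 8):
        super().__init__()
        self.ctrl_embed = nn.Sequential(               # scalar control -> one token
            nn.Linear(1, ctrl_dim), nn.SiLU(), nn.Linear(ctrl_dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # zero-init: no effect at start

    def forward(self, hidden, ctrl):                   # hidden: (B, N, D), ctrl: (B, 1)
        ctrl_tok = self.ctrl_embed(ctrl).unsqueeze(1)  # (B, 1, D)
        out, _ = self.attn(hidden, ctrl_tok, ctrl_tok)
        return hidden + torch.tanh(self.gate) * out    # gated residual injection

# Toy usage: one block's hidden states conditioned on a shutter value of 0.8.
hidden = torch.randn(2, 16, 320)
block_proj = LoRALinear(nn.Linear(320, 320))
ctrl_attn = ControlCrossAttention(320)
out = ctrl_attn(block_proj(hidden), torch.full((2, 1), 0.8))
```

Zero-initializing both the LoRA up-projection and the attention gate keeps the adapted model identical to the pre-trained backbone at the start of training, which is consistent with the paper's stated goal of preserving the backbone's priors while the control signal is learned.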

Reference

This content is AI-processed based on open access ArXiv data.
