📝 Original Info
- Title: Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
- ArXiv ID: 2511.17844
- Date: 2025-11-21
- Authors: Shihan Cheng¹, Nilesh Kulkarni², David Hyde¹, Dmitriy Smirnov² (¹Vanderbilt University, ²Netflix)
📝 Abstract
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic “real” data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
💡 Deep Analysis
📄 Full Content
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video
Generation
Shihan Cheng1
Nilesh Kulkarni2
David Hyde1
Dmitriy Smirnov2
1Vanderbilt University 2Netflix
{shihan.cheng, david.hyde.1}@vanderbilt.edu
{nkulkarni, dimas}@netflix.com
Figure 1. Our “Less is More” framework for data-efficient controllable generation. A T2V backbone, fine-tuned solely on a sparse, low-fidelity synthetic dataset (left), learns to generalize to complex physical controls. This enables precise, high-fidelity manipulation of shutter speed (motion blur), aperture (bokeh), and color temperature during real-world inference (right), driven by a continuous control. [Panels: Low-Fidelity Synthetic Training (left) vs. High-Fidelity Controllable Generation (right); Shutter, Aperture, and Temperature each shown at Low / Medium / High settings.]
Abstract
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic “real” data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
1. Introduction
Recent advances in generative AI using diffusion models have enabled unprecedented levels of quality in video generation. The primary foundational models for video generation are text-to-video (T2V) models, such as [28, 43], where users describe their desired creations in natural language. Due to the limitations of controllability via text, significant effort has been put into creating methods that accept other input modalities, such as images, keyframes, depth maps, bounding boxes, pose skeletons, driver videos, camera trajectories, and so on [1, 3, 15, 43]. However, achieving consistent, reliable, and intuitive fine-grained control over all aspects of the output video remains a challenge. In this work, we tackle the problem of adding a control mechanism over specific low-dimensional physical or optical properties, such as camera intrinsics, via a simple synthetic data generation and model fine-tuning framework.
A common pattern that has emerged in methods for producing specialized generative video models is starting with a large “foundation” model, trained on huge amounts of video data, and fine-tuning it on a carefully crafted, task-specific smaller dataset [28].
Such datasets help the model focus on a particular character identity, artistic style, or specialized effect. The success of such approaches hints at the fact that the initial pre-training equips the model with many useful priors implicit in its latent representation, which can be explicitly “coaxed” out during post-training. Ours is the first approach that aims to enable conditioning on camera effects (e.g., shutter speed or focal length) in pre-trained video generative models for consistent generation. While the quality of the post-training data is indeed critical, we argue and demonstrate that, surprisingly, having data that are perfectly photorealistic and representative of the output domain can be not only unnecessary but even detrimental for certain specializations.
This paper proposes an approach that allows the use of synthetic data to learn several conditioning effects. We contribute a novel joint training approach that factorizes adaptation: a standard low-rank adapter (LoRA) [13] encodes a minimal domain shift, while a disentangled cross-attention adapter learns the conditioning physical effect.
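To make the factorization concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a `LoRALinear` wrapper stands in for the low-rank adapter that absorbs the small domain shift, and a `ControlCrossAttentionAdapter` stands in for the disentangled cross-attention adapter that injects a continuous control value (e.g., a normalized shutter speed). Module names, dimensions, the zero-initialized gate, and the scalar-to-token embedding scheme are illustrative assumptions.

```python
# Minimal sketch of the factorized adaptation idea (illustrative assumptions,
# not the paper's code): a LoRA delta on a frozen projection absorbs the small
# domain shift, while a separate cross-attention adapter injects a continuous
# control value (e.g., shutter speed) into the hidden states.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class ControlCrossAttentionAdapter(nn.Module):
    """Injects a continuous control value via a small cross-attention block."""

    def __init__(self, hidden_dim: int, control_dim: int = 64, heads: int = 4):
        super().__init__()
        # Embed the scalar control (e.g., normalized shutter speed) into a token.
        self.control_embed = nn.Sequential(
            nn.Linear(1, control_dim), nn.SiLU(), nn.Linear(control_dim, hidden_dim)
        )
        self.attn = nn.MultiheadAttention(hidden_dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no effect at start

    def forward(self, hidden, control):
        # hidden: (B, T, D) video tokens; control: (B, 1) value in [0, 1]
        ctrl_tokens = self.control_embed(control).unsqueeze(1)   # (B, 1, D)
        attended, _ = self.attn(hidden, ctrl_tokens, ctrl_tokens)
        return hidden + self.gate * attended


if __name__ == "__main__":
    B, T, D = 2, 16, 128
    hidden = torch.randn(B, T, D)
    control = torch.rand(B, 1)                    # e.g., normalized shutter speed
    lora = LoRALinear(nn.Linear(D, D))
    adapter = ControlCrossAttentionAdapter(D)
    out = adapter(lora(hidden), control)
    print(out.shape)                              # torch.Size([2, 16, 128])
```

In this sketch, only the LoRA matrices, the control embedding, the cross-attention block, and the gate would be trainable, leaving the backbone weights untouched.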
Additionally, our paper’s contributions include a formal analysis of why this data-efficient approach succeeds. We demonstrate that, counterintuitively, the success of small-data fine-tuning hinges on the simplicity of the synthetic data. We show that fine-tuning on photorealistic synthetic data, while seemingly a higher-fidelity choice, induces catastrophic forgetting by corrupting the backbone’s pre-trained priors, leading to a “content collapse.” To quantify this, we introduce a new evaluation framework that measures this generative drift and its impact on semantic fidelity. Using this framework, we show that a model trained on simple data retains its generative diversity and high semantic fidelity, while the model trained on complex data suffers a quantifiable and catastrophic collapse. Our work provides both a data-efficient method for controllable video generation and a formal methodology for diagnosing and preventing backbone corruption during adaptation.
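This excerpt does not specify how generative drift and semantic fidelity are computed, so the sketch below only illustrates the general shape such a diagnostic could take: embeddings of base-model and fine-tuned-model outputs for the same prompts are compared to estimate drift, and prompt/output similarity stands in for semantic fidelity. The random stand-in embeddings, function names, and metric choices (cosine distance/similarity) are assumptions, not the paper's definitions.

```python
# Illustrative sketch (assumptions, not the paper's metrics) of quantifying
# generative drift and semantic fidelity when diagnosing backbone corruption.
# In practice the embeddings would come from frozen video/text encoders
# (e.g., CLIP-style features); here they are random stand-ins.
import torch
import torch.nn.functional as F


def generative_drift(base_emb: torch.Tensor, tuned_emb: torch.Tensor) -> float:
    """Mean cosine distance between base-model and fine-tuned-model outputs
    generated from the same prompts (higher = larger drift from the prior)."""
    return (1.0 - F.cosine_similarity(base_emb, tuned_emb, dim=-1)).mean().item()


def semantic_fidelity(text_emb: torch.Tensor, video_emb: torch.Tensor) -> float:
    """Mean prompt/video cosine similarity (higher = better prompt adherence)."""
    return F.cosine_similarity(text_emb, video_emb, dim=-1).mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    N, D = 8, 512                                  # prompt-matched samples, feature dim
    base = F.normalize(torch.randn(N, D), dim=-1)
    tuned = F.normalize(base + 0.1 * torch.randn(N, D), dim=-1)  # mild drift
    text = F.normalize(torch.randn(N, D), dim=-1)
    print(f"drift: {generative_drift(base, tuned):.3f}")
    print(f"semantic fidelity: {semantic_fidelity(text, tuned):.3f}")
```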
2. Related Work
Text-to-Video Generation. Diffusion models [12, 36] have
become the leading approach for video