TeleStyle: Content-Preserving Style Transfer in Images and Videos
Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model’s robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset covering a set of distinct styles and further synthesized triplets spanning thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at https://github.com/Tele-AI/TeleStyle.
💡 Research Summary
TeleStyle is a lightweight yet powerful framework for content‑preserving style transfer in both images and videos, built on top of Qwen‑Image‑Edit, a state‑of‑the‑art Diffusion Transformer (DiT) model. The authors identify a core difficulty: DiTs inherently entangle content and style features, making it hard to change the visual style without degrading the underlying structure. To overcome this, they adopt a data‑centric approach combined with a novel training curriculum.
First, they construct a hybrid triplet dataset. A curated “clean” set (≈300 k triplets) is assembled from high‑quality sources such as GPT‑4o‑generated pairs, LoRA‑augmented samples, and manually filtered data. Because this set covers only about 30 distinct style categories, they augment it with a large synthetic set (≈1 M triplets) generated automatically. The synthetic pipeline reverses a stylized target back to a photorealistic content reference using a FLUX‑based editor, extracts a style reference with a CDST model that leverages DINOv2 embeddings, and pairs the two using random prompts (excluding human subjects to avoid identity leakage). This yields diverse “style‑reference, content‑reference, target” triplets that span thousands of in‑the‑wild style categories.
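The synthetic pipeline described above can be sketched as pure orchestration logic. Everything here is a hypothetical stand‑in: the FLUX‑based de‑stylization editor and the CDST/DINOv2 style extractor are passed in as callables, and the `HUMAN_TERMS` filter is an illustrative guess at how human subjects might be excluded, not the report’s actual implementation:

```python
import random

# Illustrative word list for excluding human subjects from prompt pairing
# (assumption; the report does not specify its filtering mechanism).
HUMAN_TERMS = {"person", "man", "woman", "boy", "girl", "portrait", "face"}

def build_triplet(stylized_target, prompts, destylize, extract_style, rng=random):
    """Assemble one (style_ref, content_ref, target) triplet.

    destylize:     stand-in for the FLUX-based editor that reverses the
                   stylized target into a photorealistic content reference.
    extract_style: stand-in for the CDST extractor (DINOv2 embeddings in
                   the report) that pulls a style reference from the target.
    """
    content_ref = destylize(stylized_target)
    style_ref = extract_style(stylized_target)
    # Pair with a random prompt, skipping human subjects to avoid
    # identity leakage between content reference and target.
    safe_prompts = [p for p in prompts
                    if not (set(p.lower().split()) & HUMAN_TERMS)]
    prompt = rng.choice(safe_prompts)
    return {"style": style_ref, "content": content_ref,
            "target": stylized_target, "prompt": prompt}
```

Running this over ≈1 M stylized targets would yield the noisy synthetic set; the curated clean set follows the same triplet schema.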
Training directly on the mixed data would cause catastrophic forgetting and content degradation. Therefore, the authors propose a three‑stage Curriculum Continual Learning (CCL) strategy:
- Capability Activation – Train a LoRA‑augmented Qwen‑Image‑Edit (Q1) on the clean set D₁ to acquire basic content‑preserving style transfer.
- Content Fidelity Refinement – Re‑weight high‑fidelity samples within D₁ to form D₂, then fine‑tune Q1 into Q2, dramatically improving fine‑grained detail preservation (e.g., facial identity).
- Robust Generalization – Mix D₂ with a low‑ratio of synthetic data (≈10 %) to create D₃, and continue training Q2 into the final model Q3. This stage expands style coverage while retaining the content fidelity learned in stage 2.
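The three‑stage schedule above can be sketched as a data‑mixing curriculum. This is a minimal illustration, not the report’s training code: the `Stage` fields and the batch‑sampling helper are assumptions, and the only ratio taken from the report is the ≈10 % synthetic share in stage 3:

```python
import random
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    synthetic_ratio: float   # fraction of synthetic triplets per batch
    reweight_fidelity: bool  # up-weight high-fidelity samples (stage 2+)

# Three-stage Curriculum Continual Learning schedule (Q1 -> Q2 -> Q3).
CURRICULUM = [
    Stage("capability_activation", 0.0, False),  # Q1: clean set D1
    Stage("fidelity_refinement",   0.0, True),   # Q2: re-weighted D2
    Stage("robust_generalization", 0.1, True),   # Q3: D2 + ~10% synthetic (D3)
]

def sample_batch(clean, synthetic, stage, batch_size, rng):
    """Draw one batch honoring the stage's clean/synthetic mixing ratio."""
    n_syn = round(batch_size * stage.synthetic_ratio)
    batch = rng.sample(synthetic, n_syn) + rng.sample(clean, batch_size - n_syn)
    rng.shuffle(batch)
    return batch
```

Each stage would then fine‑tune the LoRA adapters of the previous stage’s checkpoint on batches drawn this way, which is what lets stage 3 expand style coverage without overwriting the fidelity learned in stage 2.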
Only the LoRA adapters are updated, keeping parameter overhead low while leveraging the full capacity of the underlying DiT. The loss is a flow‑matching objective that measures the L₂ distance between the predicted velocity field and the ground‑truth velocity (the direction from data to noise along the interpolation path), conditioned on style, content, and a standardized prompt template.
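The flow‑matching objective can be written out concretely. This is a generic sketch assuming the common linear interpolation path x_t = (1 − t)·x₀ + t·ε with velocity target ε − x₀; the `model` and `cond` signatures are placeholders for the LoRA‑adapted DiT and its style/content/prompt conditioning, not the report’s actual interface:

```python
import numpy as np

def flow_matching_loss(model, x0, noise, t, cond):
    """Flow-matching L2 loss under linear interpolation.

    x_t = (1 - t) * x0 + t * noise; the ground-truth velocity along this
    path is v = noise - x0. The model predicts the velocity given x_t,
    the timestep t, and the conditioning inputs (style reference,
    content reference, and prompt embedding in TeleStyle).
    """
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over latent dims
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    v_pred = model(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)
```

A model that exactly predicts the velocity field drives this loss to zero, which is the sense in which the objective supervises the denoising trajectory rather than the noise itself.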
For video stylization, TeleStyle introduces a first‑frame‑conditioned propagation module. A style reference image and the source video frames are encoded by separate Patch Embedders into token sequences Z_I and Z_V. These are concatenated channel‑wise with noisy latents and an empty text embedding, then processed by N DiT blocks. Crucially, the style token receives a temporal index of 0, acting as a fixed anchor, while video tokens retain their original frame indices. This positional encoding enables the model to propagate style consistently across time without explicit optical‑flow guidance or test‑time optimization. Training uses a flow‑matching loss on linearly interpolated stylized video frames.
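The temporal‑indexing scheme above can be illustrated with a small helper. Only the style tokens’ fixed index 0 comes from the description; the token shapes and the 1‑based frame indexing are assumptions made for this sketch:

```python
import numpy as np

def build_token_sequence(style_tokens, video_tokens):
    """Concatenate style and video tokens with temporal indices.

    style_tokens: (S, C) tokens from the style reference image (Z_I)
    video_tokens: (F, T, C) tokens, F frames of T tokens each (Z_V)

    The style tokens all receive temporal index 0, acting as a fixed
    anchor; video tokens keep per-frame indices (1..F here, an assumed
    convention) so the DiT blocks can propagate style across time.
    """
    F, T, C = video_tokens.shape
    tokens = np.concatenate([style_tokens, video_tokens.reshape(F * T, C)], axis=0)
    t_idx = np.concatenate([np.zeros(len(style_tokens), dtype=int),
                            np.repeat(np.arange(1, F + 1), T)])
    return tokens, t_idx
```

In the full module this flattened sequence would be concatenated channel‑wise with the noisy latents and an empty text embedding before entering the N DiT blocks; no optical flow or test‑time optimization is involved.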
Evaluation covers three metrics: Style Similarity (StyleID), Content Preservation Consistency (CPC at various thresholds), and Aesthetic Score. TeleStyle outperforms prior methods such as StyleShot, InstantStyle, OmniStyle, and DreamO, achieving the highest scores across all metrics (e.g., StyleID 0.577, CPC@0.5 0.441, Aesthetic 6.317). Video experiments demonstrate temporally coherent results even on challenging content like anime‑style sequences with large structural changes.
In summary, TeleStyle advances the field by (1) leveraging a curriculum‑driven, hybrid dataset to disentangle content and style within DiTs, (2) employing efficient LoRA fine‑tuning to keep training lightweight, and (3) extending the approach to video with a simple yet effective first‑frame conditioning scheme. The work provides a practical blueprint for adding high‑fidelity style transfer capabilities to large‑scale diffusion models without sacrificing content integrity or temporal consistency.