GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling) that simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: 6× to 65× fewer than existing VLM-based approaches.
💡 Research Summary
The paper introduces GT‑SVJ, a novel framework that repurposes state‑of‑the‑art video generative models as temporally‑aware reward models for aligning video generation with human preferences. Recognizing that modern video generators such as CogVideoX already learn rich spatiotemporal representations through causal self‑attention over latent video tokens, the authors reformulate these generators as energy‑based models (EBMs). In this formulation, low energy corresponds to high‑quality (human‑preferred) videos, while high energy signals degraded or undesirable content.
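The EBM view can be sketched as follows. This is a minimal, illustrative toy (the dimensions, the tiny MLP head, and the function names are ours, not the paper's): a scalar energy E(x) is computed from a pooled video feature, the implied density is p(x) ∝ exp(−E(x)), and because preference judgments only *compare* energies, the intractable partition function never needs to be evaluated.

```python
import numpy as np

# Toy energy head over a pooled 64-dim video feature (illustrative only).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 16)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0

def energy(h: np.ndarray) -> float:
    """Map a pooled feature vector h to a scalar energy E(x).
    Lower energy = the model considers the video more plausible."""
    z = np.maximum(h @ W1 + b1, 0.0)  # ReLU hidden layer
    return float(z @ w2 + b2)         # scalar energy

def prefer(h_a: np.ndarray, h_b: np.ndarray) -> bool:
    """Video A is judged better than B iff E(A) < E(B);
    the normalizing constant of p(x) cancels in the comparison."""
    return energy(h_a) < energy(h_b)
```

In the actual model the pooled feature would come from CogVideoX's transformer over latent video tokens rather than a random vector.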
Training proceeds in two stages. First, a discriminative EBM is built by attaching a lightweight MLP head to the last third of CogVideoX’s transformer layers, with LoRA adapters enabling low‑rank fine‑tuning. The model receives latent video representations from the pretrained VAE encoder, reducing pixel‑level redundancy and focusing learning on semantic appearance and motion. A contrastive loss (Eq. 4) drives the network to assign lower energy to real videos and higher energy to negative samples.
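A common logistic form of such a contrastive objective can be sketched as below; the paper's Eq. 4 may differ in detail, so treat this as an assumption about its general shape. The loss decreases as real videos receive lower energy than their synthetic negatives.

```python
import numpy as np

def contrastive_ebm_loss(e_real: np.ndarray, e_neg: np.ndarray) -> float:
    """Logistic contrastive loss over paired energies (a sketch, not Eq. 4
    verbatim): softplus(E(real) - E(neg)) tends to 0 when the real video's
    energy is pushed well below the negative's, and penalizes the reverse."""
    margin = e_real - e_neg
    # np.logaddexp(0, m) = log(1 + exp(m)) is a numerically stable softplus.
    return float(np.mean(np.logaddexp(0.0, margin)))
```

With this objective, only the LoRA adapters and the MLP head need gradients; the frozen generator backbone supplies the spatiotemporal features.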
Crucially, the authors design five controlled perturbations applied directly in latent space to generate challenging synthetic negatives: (1) frame shuffle (temporal order disruption), (2) frame drop (missing/duplicated frames), (3) noisy segment injection (localized Gaussian noise), (4) patch swap (spatial region exchange between two temporal slices), and (5) temporal slice swap (exchange of longer contiguous time blocks). These manipulations preserve overall visual fidelity while subtly breaking motion continuity, object trajectories, or long‑range dynamics, forcing the EBM to learn genuine spatiotemporal cues rather than trivial domain gaps between real and generated videos.
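The five perturbations above can be sketched directly on a latent tensor. The sketch below assumes a latent of shape (T, C, H, W) and hypothetical parameter choices (segment lengths, patch size, noise scale); the paper's exact settings may differ.

```python
import numpy as np

def frame_shuffle(z, rng):
    """(1) Disrupt temporal order by permuting frames."""
    return z[rng.permutation(z.shape[0])]

def frame_drop(z, rng):
    """(2) Drop one frame and duplicate its predecessor (length preserved)."""
    t = int(rng.integers(1, z.shape[0]))
    out = z.copy()
    out[t] = out[t - 1]
    return out

def noisy_segment(z, rng, sigma=0.5, length=2):
    """(3) Inject Gaussian noise into a contiguous temporal segment."""
    t0 = int(rng.integers(0, z.shape[0] - length + 1))
    out = z.copy()
    out[t0:t0 + length] += sigma * rng.normal(size=out[t0:t0 + length].shape)
    return out

def patch_swap(z, rng, size=4):
    """(4) Exchange a spatial patch between two different frames."""
    T, C, H, W = z.shape
    t1, t2 = rng.choice(T, size=2, replace=False)
    y = int(rng.integers(0, H - size + 1))
    x = int(rng.integers(0, W - size + 1))
    out = z.copy()
    tmp = out[t1, :, y:y + size, x:x + size].copy()
    out[t1, :, y:y + size, x:x + size] = out[t2, :, y:y + size, x:x + size]
    out[t2, :, y:y + size, x:x + size] = tmp
    return out

def temporal_slice_swap(z, rng, block=2):
    """(5) Exchange two longer contiguous time blocks (here: first and last)."""
    out = z.copy()
    out[:block], out[-block:] = z[-block:].copy(), z[:block].copy()
    return out
```

Each function returns a tensor of the original shape, so the perturbed latents can be fed to the EBM exactly like real ones; only motion continuity or long-range dynamics are broken.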
After the contrastive pre‑training, the model is fine‑tuned on a modest set of 30K human preference pairs using a Bradley‑Terry (or Bradley‑Terry‑with‑ties) likelihood objective, converting the energy scores into a scalar reward function r_φ(x). This reward can then be used for RLHF, Direct Preference Optimization, or other preference‑guided fine‑tuning of video generators.
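The Bradley‑Terry step can be written as a short loss: with reward r_φ(x) = −E(x), the probability that the preferred video wins is σ(r_win − r_lose), and fine‑tuning minimizes the negative log‑likelihood over annotated pairs. This sketch covers the plain (tie‑free) Bradley‑Terry case only.

```python
import numpy as np

def bradley_terry_loss(e_win: np.ndarray, e_lose: np.ndarray) -> float:
    """Bradley-Terry NLL over preference pairs, using reward r = -E.
    P(win > lose) = sigmoid(r_win - r_lose) = sigmoid(e_lose - e_win),
    so the NLL is softplus(-(e_lose - e_win)), stably via logaddexp."""
    d = e_lose - e_win
    return float(np.mean(np.logaddexp(0.0, -d)))
```

When the two videos' energies are equal the loss is log 2 (a coin-flip prediction), and it approaches 0 as the preferred video's energy drops well below the rejected one's.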
Empirical evaluation on two recent preference benchmarks—GenAI‑Bench and MonteBench—shows that GT‑SVJ outperforms prior Vision‑Language Model (VLM) based reward models by a large margin (≈25 % absolute gain on GenAI‑Bench, 3–8 % on MonteBench) while using 6× to 65× fewer human annotations. On VideoReward‑Bench, GT‑SVJ remains competitive, trailing the best VLM‑based method by only 4–7 %. Ablation studies confirm that each latent‑space perturbation contributes to robustness, and that LoRA‑based fine‑tuning yields strong performance with minimal extra parameters.
The paper also discusses limitations: the handcrafted perturbations may not capture all possible failure modes, especially in highly dynamic domains; EBM training still relies on approximations of the partition function, which can be computationally demanding; and the current approach focuses solely on video‑only preferences, leaving multimodal text‑video alignment for future work. Nonetheless, GT‑SVJ demonstrates that leveraging the intrinsic temporal understanding of video generative models, combined with self‑supervised contrastive learning, provides a data‑efficient, stable, and highly accurate reward model for video generation.