Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.

💡 Research Summary

The paper addresses a fundamental limitation of autoregressive (AR) video generation: the accumulation of errors over long horizons caused by inappropriate sampling strategies. While top‑k and nucleus (top‑p) sampling work well for large language models—where tokens have high semantic density and low redundancy—video tokens are fundamentally different. They exhibit low semantic density, high spatio‑temporal redundancy, and consequently very flat probability distributions (average top‑1 probability ≈ 0.2). This mismatch leads to two failure modes: in low‑uncertainty regions such as static backgrounds, a fixed‑size candidate pool injects unnecessary randomness, degrading structural integrity; in high‑uncertainty regions such as moving foreground objects, the same fixed pool is too small, causing brittle argmax flips that quickly propagate and corrupt subsequent frames.

The authors first conduct a systematic analysis that reveals three key observations: (1) video token distributions are inherently flat, making them highly sensitive to small logit perturbations; (2) token‑wise entropy correlates strongly with visual structure—low entropy aligns with edges, boundaries, and other deterministic patterns, while high entropy aligns with repetitive textures (sky, foliage, road); (3) during long‑horizon generation, an “entropy collapse” occurs: low‑entropy tokens dominate both temporally and spatially, causing texture wash‑out and over‑smoothed backgrounds.

To mitigate these issues, the paper proposes Entropy‑Guided k‑Guard (ENkG) sampling, an inference‑time, model‑agnostic algorithm that dynamically adjusts the size of the candidate set for each token based on its entropy. The procedure consists of three steps: (i) compute normalized entropy H for the token’s probability distribution; (ii) map H to a nucleus probability p_t,i via an affine transformation (α·H + β) followed by clipping between p_low and p_high; (iii) sort the token probabilities descendingly, find the smallest cutoff c such that the cumulative probability exceeds p_t,i, and set the candidate size k_i = max(c, k_g), where k_g is a minimal “guard” to avoid degenerate greedy decoding. The selected candidates are then sampled (uniformly or with temperature scaling). This “k‑guard” ensures that even low‑entropy tokens retain a minimal degree of stochasticity, preventing the deterministic drift that leads to entropy collapse, while high‑entropy tokens receive a larger pool to capture the inherent ambiguity.

Implementation details emphasize efficiency: sorting can be avoided by maintaining cumulative probability tables, and the mapping from entropy to k can be pre‑computed for a given vocabulary size, resulting in negligible overhead compared to standard top‑k/top‑p. The method requires no changes to the model architecture, training loss, or additional data.

Empirical evaluation spans several state‑of‑the‑art AR video generators (e.g., VideoGPT, VDM, VAViM) and benchmark datasets (UCF‑101, Kinetics‑600, DrivingWorld). Quantitative metrics include Fréchet Video Distance (FVD), Learned Perceptual Image Patch Similarity (LPIPS), and a Temporal Consistency Score. ENkG consistently outperforms static top‑k and top‑p across all metrics, achieving 10‑15 % lower FVD and modest LPIPS improvements (≈ 0.02). Human preference studies show a 78 % preference for ENkG‑generated videos, citing clearer object boundaries, preserved textures, and smoother motion. Qualitative visualizations demonstrate that ENkG prevents the progressive expansion of low‑entropy regions, maintaining fine details such as foliage and road markings over 30+ frames, whereas baseline methods exhibit texture wash‑out and background over‑smoothing.

The authors conclude that uncertainty‑aware inference, realized through entropy‑guided adaptive sampling, is a powerful and practical tool for improving long‑horizon video synthesis. ENkG’s training‑free nature makes it readily applicable to existing large‑scale video models without additional computational burden. Future work is suggested in extending the approach to multimodal conditioning (e.g., text‑to‑video), integrating alternative uncertainty estimators (Monte‑Carlo dropout, Bayesian variational inference), and exploring joint optimization of entropy‑to‑k mappings during fine‑tuning.

Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

💡 Research Summary

Comments & Academic Discussion

Leave a Comment