Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis
Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Because they fail to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure for capturing the evolution of modeling dynamics. We therefore propose NOVA, a training-free token reduction acceleration framework for VAR models based on entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by identifying, online, the inflection point of scale-entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.
💡 Research Summary
Visual Autoregressive (VAR) models have demonstrated impressive generation quality by predicting next‑scale residual features instead of raster‑scan tokens, yet their computational cost grows quadratically with token length, especially at high resolutions. Existing token‑reduction accelerators mitigate this cost by pruning tokens only in a heuristically chosen later stage, using fixed reduction ratios, and treating scales and layers independently. These design choices lead to three major drawbacks: (1) heuristic stage partition that can either degrade quality if applied too early or miss speed‑ups if applied too late; (2) non‑adaptive schedules that ignore per‑instance complexity; and (3) limited acceleration scope that leaves redundancy in earlier stages untouched.
The paper introduces NOVA, a training‑free acceleration framework that leverages entropy as a principled measure of predictive uncertainty. Entropy naturally captures how much information a token contributes to future predictions; high‑entropy tokens are uncertain and potentially influential, while low‑entropy tokens are already well‑determined. NOVA operates on two linked dimensions—scale and layer—and adapts token reduction dynamically for each.
Scale‑level adaptation. For every scale t, NOVA computes the predictive entropy of each token rₜ,i as Hₜ,i = –∑₍v₎ pₜ,i(v) log pₜ,i(v), where pₜ,i(v) is the model’s class distribution conditioned on all previous scales. The mean entropy ¯Hₜ = (1/Nₜ)∑ᵢ Hₜ,i is tracked across scales. NOVA identifies the inflection point where the growth of ¯Hₜ transitions from rapid increase to a slower, more stable phase. This point defines the activation scale t*; acceleration starts only after t*, preventing premature pruning that would harm quality. A scale‑linkage function then assigns a distinct reduction ratio ρₛ(t) to each subsequent scale, allowing aggressive pruning in later, low‑growth scales while preserving more tokens in earlier, high‑growth scales.
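The scale-level mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `token_entropy` follows the Hₜ,i formula above, while `find_activation_scale` uses a simple growth-slowdown heuristic (the increment in mean entropy dropping below a fraction of the previous increment) as a stand-in for the paper's inflection criterion; the threshold value is an assumption.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-token predictive entropy H_{t,i} = -sum_v p_{t,i}(v) log p_{t,i}(v).
    probs: (num_tokens, vocab_size), each row a distribution summing to 1."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def find_activation_scale(mean_entropies, ratio_thresh=0.5):
    """Return the first scale index t* where mean-entropy growth slows:
    the increment drops below `ratio_thresh` times the previous increment.
    A hypothetical stand-in for the paper's inflection-point detection."""
    deltas = np.diff(mean_entropies)
    for t in range(1, len(deltas)):
        if deltas[t - 1] > 0 and deltas[t] < ratio_thresh * deltas[t - 1]:
            return t + 1  # acceleration activates from this scale onward
    return len(mean_entropies)  # no inflection found: never accelerate

# Toy mean-entropy trajectory: rapid growth, then a plateau
mean_H = [0.5, 1.4, 2.2, 2.5, 2.6, 2.65]
t_star = find_activation_scale(mean_H)  # growth halves between scales 2 and 3
```

Running this on the toy trajectory places t* at the third scale, where the increment (0.3) first falls below half the preceding one (0.8), so pruning would only begin once entropy growth has visibly stabilized.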
Layer‑level adaptation. Within a given scale, different Transformer layers exhibit heterogeneous entropy patterns. Shallow layers often model low‑level textures and have lower average entropy, whereas middle and deep layers capture structural and semantic information with higher entropy. NOVA introduces a layer‑linkage ratio adjustment fₗ(l) that modulates the per‑layer reduction ratio based on each layer’s mean entropy ¯Hₗ. The final reduction ratio for a token at scale t and layer l becomes ρ(t,l) = ρₛ(t)·fₗ(l). This dual‑linkage scheme allocates computation where it matters most and discards redundant work where uncertainty is already low.
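The dual-linkage composition ρ(t,l) = ρₛ(t)·fₗ(l) can be illustrated with a small sketch. The specific functional forms here are assumptions for illustration (the paper does not pin them down in this summary): a linear ramp for the scale-linkage ratio after t*, and a layer factor that prunes more aggressively in layers whose mean entropy sits below the network average.

```python
def scale_ratio(t, t_star, rho_max=0.6, num_scales=10):
    """Scale-linkage rho_s(t): no pruning before the activation scale t*,
    then a ratio ramping linearly toward rho_max at the final scale.
    (Linear ramp and rho_max are illustrative assumptions.)"""
    if t < t_star:
        return 0.0
    return rho_max * (t - t_star + 1) / (num_scales - t_star + 1)

def layer_factor(layer_mean_H, all_layer_mean_H):
    """Layer-linkage f_l(l): prune more in layers whose mean entropy is
    below the average (their tokens are already well-determined)."""
    avg = sum(all_layer_mean_H) / len(all_layer_mean_H)
    return min(1.0, avg / max(layer_mean_H, 1e-12))

def reduction_ratio(t, l, t_star, layer_mean_H):
    """Final per-scale, per-layer ratio: rho(t, l) = rho_s(t) * f_l(l)."""
    return scale_ratio(t, t_star) * layer_factor(layer_mean_H[l], layer_mean_H)
```

With layer mean entropies [1.0, 2.0, 3.0] and t* = 5, the low-entropy shallow layer keeps the full scale ratio while the high-entropy deep layer is pruned at only two-thirds of it, matching the intuition that computation should be concentrated where uncertainty is high.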
Residual cache reuse. When tokens are pruned, NOVA does not discard their intermediate representations. Instead, it stores the residual cache from the previous scale and reuses it for any token that reappears at a finer scale. This eliminates duplicated forward passes and dramatically reduces both FLOPs and GPU memory consumption, especially for high‑resolution generation (e.g., 1024×1024 images).
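A minimal sketch of the cache-reuse idea, under simplified assumptions (the class name and tensor shapes are hypothetical, and the prior-scale residuals are taken as already upsampled to the current resolution): pruned positions are filled from the cached prior-scale residuals, so only the kept high-entropy tokens pay for a fresh forward pass.

```python
import numpy as np

class ResidualCache:
    """Sketch of residual cache reuse: store the prior scale's residual
    features and splice freshly computed features only at kept positions."""

    def __init__(self):
        self.prev = None  # residual features carried from the prior scale

    def merge(self, fresh: np.ndarray, keep_mask: np.ndarray,
              upsampled_prev: np.ndarray) -> np.ndarray:
        # fresh: (N, D) features recomputed for the kept tokens
        # keep_mask: (N,) boolean, True where a token survived pruning
        # upsampled_prev: (N, D) prior-scale residuals at current resolution
        out = upsampled_prev.copy()          # pruned tokens reuse the cache
        out[keep_mask] = fresh[keep_mask]    # kept tokens get fresh features
        self.prev = out                      # becomes the cache for scale t+1
        return out
```

The memory saving follows directly: the forward pass only materializes activations for the `keep_mask` subset, which matters most at the largest scales of 1024x1024 generation where tokens are most numerous.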
Theoretical justification. The authors derive an upper bound on future uncertainty using conditional entropy: H(Rₜ | R<ₜ) ≤ Σᵢ H(rₜ,i | R<ₜ). By retaining high‑entropy tokens, the bound remains tight, maximizing potential information gain for subsequent scales. Pruning low‑entropy tokens therefore reduces attention cost with minimal impact on the bound, providing a solid information‑theoretic foundation for the method.
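The inequality rests on the subadditivity of entropy (joint entropy never exceeds the sum of marginal entropies), which is easy to verify numerically on a toy joint distribution; the 2x2 distribution below is an illustrative example, not from the paper.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in nats of a flat probability vector."""
    p = p[p > 0]  # 0 * log 0 contributes nothing
    return float(-np.sum(p * np.log(p)))

# Toy joint distribution over two correlated tokens
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
H_joint = entropy(joint.ravel())      # H(r1, r2) ~ 1.194 nats
H_m1 = entropy(joint.sum(axis=1))     # H(r1) = log 2
H_m2 = entropy(joint.sum(axis=0))     # H(r2) = log 2
assert H_joint <= H_m1 + H_m2         # subadditivity holds
```

The gap between the two sides (here about 0.19 nats) is the tokens' mutual information; retaining the high-entropy tokens keeps the dominant marginal terms, so pruning the low-entropy ones changes the bound little.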
Empirical evaluation. NOVA is evaluated on state‑of‑the‑art VAR models Infinity‑2B and Infinity‑8B across benchmarks such as GenEval and ImageReward. Results show:
- 2.89× speed‑up on Infinity‑2B with only 0.01 % performance loss (measured by standard generation metrics).
- Latency reduction for Infinity‑8B from 1.51 s to 0.75 s, while achieving a higher human preference score than the unaccelerated model.
- Visualizations confirm that fine details and semantic consistency are preserved, and sometimes even improved due to reduced noise from low‑entropy token removal.
Limitations and future work. Computing token‑wise entropy requires the model’s probability distribution, adding a modest overhead that may be non‑trivial for very large batches or real‑time streaming. The current approach uses mean entropy per token; richer multivariate entropy estimations that capture spatial and channel correlations could further refine pruning decisions. Extending the dual‑linkage concept to other generative domains (video, 3D, multimodal) and integrating hardware‑aware cache management are promising directions.
In summary, NOVA presents a novel, theoretically grounded, and practically effective solution for accelerating visual autoregressive generation. By exploiting entropy to drive adaptive, scale‑ and layer‑aware token reduction and by reusing residual caches, it achieves substantial speed‑ups without sacrificing—and occasionally enhancing—generation quality, all without any additional training.