MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning


Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. More recent next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, further reducing modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both a small model trained from scratch and a large fine-tuned model, while reducing the average GPU memory footprint by 3.0×.


💡 Research Summary

The paper introduces MVAR (Markovian Visual AutoRegressive), a novel autoregressive framework for image generation that addresses the redundancy and memory inefficiency inherent in existing next‑scale prediction methods such as VAR. Conventional next‑scale models predict each higher‑resolution token map conditioned on all previously generated scales and tokens, leading to O(N²) attention complexity and large GPU memory footprints due to the need for a KV cache during inference.

MVAR tackles these issues by imposing two Markovian assumptions. First, the scale‑Markov assumption restricts the conditional distribution for scale ℓ to depend only on the immediately preceding scale ℓ‑1, i.e., p(rℓ | rℓ‑1) instead of p(rℓ | r₁,…,rℓ‑1). This reduces inter‑scale dependencies, enables a parallel training strategy across scales, and eliminates the need to retain all earlier scale representations in memory.
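The practical consequence of this factorization can be illustrated with a small sketch (function and variable names here are illustrative, not the authors' code): once each scale conditions only on its immediate predecessor, the (rℓ‑1, rℓ) pairs become independent training examples that can be processed in parallel rather than sequentially.

```python
# Sketch of the scale-Markov factorization. Under the full autoregressive
# factorization, scale l conditions on all earlier scales; under the
# scale-Markov assumption it conditions only on scale l-1.

def full_ar_context(scales, l):
    """Context for scale l without the Markov assumption: all earlier scales."""
    return scales[:l]

def scale_markov_context(scales, l):
    """Context under the scale-Markov assumption: only the adjacent scale l-1."""
    return [scales[l - 1]]

def training_pairs(scales):
    """Each (context, target) pair is independent of the others, so all
    scales can be trained in parallel instead of sequentially."""
    return [(scales[l - 1], scales[l]) for l in range(1, len(scales))]

scales = ["r1", "r2", "r3", "r4"]
print(full_ar_context(scales, 3))       # ['r1', 'r2', 'r3']
print(scale_markov_context(scales, 3))  # ['r3']
print(training_pairs(scales))           # [('r1', 'r2'), ('r2', 'r3'), ('r3', 'r4')]
```

Because no earlier scales need to be kept around as context, their representations can be freed from memory as soon as the next scale is produced.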

Second, the spatial‑Markov assumption limits each token’s attention to a local neighborhood of size k around the corresponding spatial location on the adjacent scale, rather than attending to every token across scales. Consequently, the attention computation drops from quadratic O(N²) to O(N k): linear in N, scaled only by the small window size k. This dramatically cuts both compute and memory usage, especially at fine resolutions where N is large.
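The saving is easy to verify by counting query–key pairs. The sketch below (illustrative only, not the paper's code) compares full cross-scale attention with a k × k neighborhood window, assuming border tokens simply see a clipped window:

```python
# Count attention pairs: full attention gives N * N pairs; a k x k local
# window gives at most N * k^2 pairs (fewer at image borders).

def full_attention_pairs(h, w):
    """Every one of the N = h*w queries attends to all N positions."""
    n = h * w
    return n * n

def local_attention_pairs(h, w, k):
    """Each query attends only to a k x k window centered at its own
    position, clipped at the borders of the token map."""
    half = k // 2
    total = 0
    for i in range(h):
        for j in range(w):
            rows = min(i + half, h - 1) - max(i - half, 0) + 1
            cols = min(j + half, w - 1) - max(j - half, 0) + 1
            total += rows * cols
    return total

h = w = 16  # a 16x16 token map, N = 256
print(full_attention_pairs(h, w))      # 65536  (N^2)
print(local_attention_pairs(h, w, 3))  # 2116   (<= N * k^2 = 2304)
```

At N = 256 with k = 3 the local window already does roughly 30× less work, and the gap widens quadratically as the token map grows.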

The overall architecture proceeds as follows: an input image is quantized into a hierarchy of residual token maps {r₁,…,r_L}. For each scale ℓ, a Transformer block predicts rℓ using only rℓ‑1 as context. Within the block, spatial‑Markov attention is applied: queries are formed from the current scale, while keys and values are extracted from a k × k window centered at the same spatial coordinates on the previous scale. The loss is the sum of cross‑entropy terms over all scales. Because the KV cache is unnecessary, inference can run without storing past activations, further reducing memory.
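The per-scale loss structure described above can be sketched end to end as follows. All names here are hypothetical stand-ins: the real model uses Transformer blocks with spatial-Markov attention and a VQ tokenizer, whereas this toy version substitutes a uniform predictive distribution purely to show how the summed cross-entropy is assembled.

```python
import math

def cross_entropy(pred_probs, target_index):
    """Cross-entropy for a single token: -log p(target)."""
    return -math.log(pred_probs[target_index])

def predict_scale(prev_scale_tokens, target_len):
    """Stand-in for the Transformer block: the real model would upsample
    prev_scale_tokens and apply spatial-Markov attention; here we just
    return a uniform distribution over a toy 4-token vocabulary for each
    target position (illustration only)."""
    vocab_size = 4
    return [[1.0 / vocab_size] * vocab_size for _ in range(target_len)]

def mvar_loss(scales):
    """Sum of cross-entropy terms over scales r2..rL, each predicted from
    only the adjacent preceding scale."""
    total = 0.0
    for l in range(1, len(scales)):
        preds = predict_scale(scales[l - 1], len(scales[l]))
        for probs, target in zip(preds, scales[l]):
            total += cross_entropy(probs, target)
    return total

scales = [[0], [1, 2], [3, 0, 1, 2]]  # toy residual token maps r1..r3
print(mvar_loss(scales))  # 6 target tokens x ln(4) ≈ 8.3178
```

Note that, unlike a standard autoregressive chain, no KV cache appears anywhere in the loop: each scale's keys and values come solely from the previous scale's tokens, which are discarded once the next scale is predicted.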

Experiments on ImageNet 256 × 256 evaluate two model sizes: a small model (~300 M parameters) trained from scratch and a large model (~1.2 B parameters) fine‑tuned from a pretrained VAR checkpoint. Results show that MVAR matches or surpasses VAR in FID and Inception Score (≈0.5–1.2% improvement) while cutting average GPU memory consumption by 3.0× during training and 4.2× during inference. Ablation studies confirm that the scale‑Markov trajectory and the spatial‑Markov attention contribute independently to the efficiency gains; removing either leads to higher memory use or degraded quality.

Key contributions are: (1) introducing a scale‑Markov trajectory that simplifies multi‑scale conditional likelihood and enables parallel training; (2) designing spatial‑Markov attention that approximates localized 2‑D operations, reducing attention complexity from O(N²) to O(N k); (3) demonstrating that these modifications yield state‑of‑the‑art performance with substantially lower resource requirements.

The authors suggest future extensions such as adaptive neighborhood sizes, applying the Markovian framework to video or 3‑D generation, and integrating MVAR with diffusion models for hybrid pipelines. MVAR thus represents a significant step toward scalable, memory‑efficient autoregressive visual generation.

