From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.


💡 Research Summary

The paper addresses a fundamental limitation of autoregressive (AR) image generators: once a token is predicted, it cannot be revised, leading to error accumulation and sub‑optimal visual quality. To overcome this, the authors propose TensorAR, a new AR paradigm that replaces next‑token prediction with next‑tensor prediction. Instead of emitting a single discrete token at each step, TensorAR predicts an overlapping window (tensor) of k adjacent image tokens. Because successive tensors overlap, later predictions can revise the content generated in earlier steps, providing an iterative refinement mechanism analogous to diffusion models while preserving the causal, left‑to‑right structure of AR decoding.
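The overlap-driven refinement described above can be sketched with simple step accounting. The function names, and the choice of a fixed stride smaller than the window size, are illustrative assumptions, not the paper's exact schedule; the point is only that interior token positions fall inside more than one window and so get predicted (and can be revised) more than once.

```python
def tensor_windows(num_tokens: int, k: int, stride: int):
    """Return (start, end) spans of overlapping length-k windows that slide
    left-to-right over the token sequence."""
    spans, start = [], 0
    while start < num_tokens:
        spans.append((start, min(start + k, num_tokens)))
        start += stride
    return spans

def refinement_counts(num_tokens: int, k: int, stride: int):
    """Count how many times each token position is predicted, i.e. how many
    chances later windows have to revise it."""
    counts = [0] * num_tokens
    for s, e in tensor_windows(num_tokens, k, stride):
        for i in range(s, e):
            counts[i] += 1
    return counts

# With k=4 and stride=2, every interior position is predicted twice:
print(refinement_counts(8, k=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2]
```

Vanilla next-token prediction is the degenerate case k = stride = 1, where every position is predicted exactly once and nothing can be revised.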

A key technical challenge is information leakage: when training on overlapping tensors, some tokens in the target tensor already appear in the input, allowing the model to cheat by copying them. The authors solve this by introducing a discrete tensor noising scheme inspired by discrete diffusion theory. During training, each token in the input tensors is perturbed with categorical noise according to a time‑dependent transition matrix, and the model is trained to denoise the noisy tensor back to the clean target using a weighted cross‑entropy loss. This forces the network to learn genuine conditional dependencies rather than memorizing overlaps.
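A minimal sketch of this noising step, assuming the simplest uniform transition matrix and a linear keep-probability schedule (the paper derives its transition matrices from discrete diffusion theory, so the exact schedule may differ): each input token either survives or is replaced by a uniformly sampled codebook index.

```python
import random

def keep_prob(t: int, T: int) -> float:
    """Assumed linear schedule: a token survives step t of T with
    probability 1 - t/T."""
    return 1.0 - t / T

def noise_tokens(tokens, codebook_size, p_keep, rng=random):
    """Perturb each token: keep it with probability p_keep, otherwise replace
    it with a uniformly sampled codebook index (codebook-indexed noise)."""
    return [t if rng.random() < p_keep else rng.randrange(codebook_size)
            for t in tokens]

clean = [5, 12, 3, 9]
noisy = noise_tokens(clean, codebook_size=16, p_keep=keep_prob(t=2, T=4))
```

During training, the model sees `noisy` in the overlapping part of the input tensor and is supervised (via the weighted cross-entropy loss) to recover `clean`, so copying the input no longer works as a shortcut.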

Architecturally, TensorAR is a plug‑and‑play module that can be attached to any transformer‑based AR model without altering the core architecture. Two lightweight wrappers are added: an input encoder (M_enc) that compresses k token embeddings into a single hidden state, and an output decoder (M_dec) that expands a hidden state back into k token logits. Both wrappers employ small “query” transformer blocks (Q_in and Q_out) and residual connections to leverage pretrained weights and ensure stable convergence. Consequently, existing models such as LlamaGEN, Open‑MAGVIT2, and RAR can be upgraded to TensorAR with minimal code changes.
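The shapes involved in the two wrappers can be sketched as follows. This is a deliberately simplified stand-in: the paper uses small query transformer blocks (Q_in and Q_out), which are approximated here by plain learned projections with residual connections, so only the interface (k embeddings in, one hidden state through the backbone, k logit vectors out) matches the description above.

```python
import numpy as np

class TensorARWrappers:
    """Illustrative numpy sketch of the M_enc / M_dec interface."""

    def __init__(self, k: int, d_model: int, vocab: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.k, self.d = k, d_model
        self.W_in = rng.standard_normal((k * d_model, d_model)) * 0.02   # M_enc
        self.W_out = rng.standard_normal((d_model, k * d_model)) * 0.02  # M_dec
        self.W_logit = rng.standard_normal((d_model, vocab)) * 0.02      # token head

    def encode(self, token_embs: np.ndarray) -> np.ndarray:
        """M_enc: compress k token embeddings (k, d) into one hidden state (d,).
        The residual mean keeps activations near the pretrained scale."""
        return token_embs.reshape(-1) @ self.W_in + token_embs.mean(axis=0)

    def decode(self, hidden: np.ndarray) -> np.ndarray:
        """M_dec: expand one backbone hidden state (d,) into k logit vectors
        (k, vocab), with a residual connection broadcast to every slot."""
        expanded = (hidden @ self.W_out).reshape(self.k, self.d) + hidden
        return expanded @ self.W_logit
```

Because the backbone still consumes and emits one hidden state per step, the pretrained transformer in between is untouched, which is what makes the wrappers plug-and-play.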

Extensive experiments cover both class‑conditional and text‑conditional generation across multiple model sizes (from 7B to 13B parameters). Evaluation metrics include Fréchet Inception Distance (FID) and GenEval score. TensorAR consistently improves FID by 5–12% and raises GenEval by 0.02–0.04 points over the vanilla AR baselines. Notably, even the relatively small Janus‑Pro‑7B model benefits from the refinement mechanism, achieving quality comparable to larger models. The paper also presents model‑size vs. quality curves, demonstrating a superior quality‑latency trade‑off: for a given inference latency, TensorAR delivers higher visual fidelity than standard AR.

The analysis highlights several practical considerations. The window size k and the noise schedule are hyper‑parameters that significantly affect performance; larger k yields more refinement steps but increases computational cost and memory usage. The method still relies on a discrete VQ‑AE tokenizer, so tokenization quality remains a bottleneck. Nevertheless, the approach shows that AR models can gain diffusion‑like refinement without abandoning their language‑model‑friendly training paradigm or requiring bidirectional attention.
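The cost side of that trade-off can be made concrete with simple step accounting. The stride parameter and the per-token cost model are assumptions for illustration (real overhead depends on the backbone and decoding implementation), but they show why larger k buys refinement passes at the price of extra predictions.

```python
import math

def schedule_cost(num_tokens: int, k: int, stride: int):
    """Decoding steps and total token predictions for a sliding-window
    schedule, compared against vanilla next-token AR decoding."""
    steps = math.ceil(num_tokens / stride)
    return {
        "steps": steps,
        "token_predictions": steps * k,
        "overhead_vs_vanilla": steps * k / num_tokens,
    }

# Vanilla AR on 256 tokens: 256 steps, 256 predictions, overhead 1.0.
print(schedule_cost(256, k=4, stride=2))
# {'steps': 128, 'token_predictions': 512, 'overhead_vs_vanilla': 2.0}
```

Under this accounting, doubling k at a fixed stride doubles the number of token predictions while leaving the step count unchanged, which matches the qualitative claim that larger windows cost more compute and memory per unit of refinement.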

In summary, TensorAR introduces a simple yet powerful modification to autoregressive image synthesis: by predicting overlapping tensors and training with discrete diffusion‑style noise, it enables iterative self‑refinement while keeping the original AR architecture intact. The results demonstrate consistent quality gains across diverse tasks and model scales, suggesting a promising direction for future work such as dynamic window sizing, multi‑scale refinement, and integration with non‑tokenized decoders.

