ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
Autoregressive and diffusion models have achieved remarkable progress in language modeling and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: training is as simple as applying a specially designed Skip-Causal Attention Mask to a standard diffusion transformer. During inference, the process alternates between diffusion denoising and autoregressive decoding, making full use of the KV-cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that it performs best among all autoregressive baselines of similar model scale on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT transfers to visual understanding tasks despite being trained with a generative objective. An analysis of the trade-off between autoregressive modeling and diffusion demonstrates ACDiT's potential for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.
💡 Research Summary
The paper introduces ACDiT (Autoregressive blockwise Conditional Diffusion Transformer), a novel architecture that unifies the strengths of autoregressive (AR) modeling and diffusion generative modeling for continuous visual data. Traditional AR approaches in vision rely on discrete tokenization (e.g., VQ‑VAE) and predict the next token sequentially, which incurs large vocabularies and information loss. Diffusion models, on the other hand, operate on the full continuous signal, achieving high fidelity but lacking a causal generation order and the ability to exploit KV‑caching for fast inference.
ACDiT resolves this tension by defining a “block” as the basic generation unit. A block can be a set of image patches, a group of video frames, or a subsequence of text tokens. Generation proceeds block‑wise: each block is generated by a conditional diffusion process that is conditioned on all previously generated (clean) blocks. Within a block the diffusion denoising is non‑causal, allowing the model to capture rich intra‑block dependencies; across blocks the process is strictly causal, preserving the AR property.
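The block-wise inference procedure described above can be sketched as a short loop. The snippet below is a minimal illustration, not the paper's actual sampler: `denoise_block` is a hypothetical stand-in for the conditional diffusion process, and the toy update simply pulls the noisy block toward its conditioning context.

```python
import numpy as np

def denoise_block(noisy_block, past_blocks, steps):
    """Toy stand-in for the conditional diffusion sampler: each step nudges
    the noisy block toward the mean of the past clean context.
    (Hypothetical placeholder, not the paper's actual denoiser.)"""
    if len(past_blocks):
        target = past_blocks.mean(axis=0)
    else:
        target = np.zeros_like(noisy_block)
    x = noisy_block
    for _ in range(steps):
        x = x + 0.5 * (target - x)  # one "denoising" update toward the context
    return x

def generate(num_blocks, block_dim, steps, rng):
    """Block-wise autoregressive generation: each block is produced by a
    diffusion process conditioned on all previously generated clean blocks."""
    blocks = []
    for _ in range(num_blocks):
        noise = rng.standard_normal(block_dim)  # each block starts from pure noise
        past = np.array(blocks) if blocks else np.empty((0, block_dim))
        clean = denoise_block(noise, past, steps)
        blocks.append(clean)  # the clean block joins the causal context
    return np.stack(blocks)

rng = np.random.default_rng(0)
out = generate(num_blocks=4, block_dim=8, steps=10, rng=rng)
print(out.shape)  # (4, 8)
```

Note how the denoising inside a block is free to use the whole block at once (non-causal), while the outer loop only ever conditions on blocks generated earlier (causal).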
Implementation is remarkably simple. Starting from a standard diffusion transformer such as DiT, the authors add a specially crafted Skip‑Causal Attention Mask. This mask forces the current noisy block to attend only to earlier clean blocks and to itself, while clean blocks attend causally to earlier clean blocks. Consequently, training reduces to a single loss identical to the usual diffusion objective, but with additional conditioning on past clean blocks (Equation 3). During inference the model alternates between denoising the current block (following the diffusion schedule) and appending the denoised block to the KV‑cache, enabling fast autoregressive generation without recomputing attention over past tokens.
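A mask with this structure can be built in a few lines. The sketch below assumes one possible training layout (all clean blocks first, then all noisy blocks, with `True` meaning attention is allowed); the paper's exact tensor layout may differ.

```python
import numpy as np

def skip_causal_mask(num_blocks, block_len):
    """Sketch of a Skip-Causal Attention Mask (assumed layout: clean blocks
    occupy the first half of the sequence, noisy blocks the second half).

    - clean block i attends to clean blocks 0..i (causal over clean context)
    - noisy block i attends to clean blocks 0..i-1 and to itself
    """
    n = num_blocks * block_len
    mask = np.zeros((2 * n, 2 * n), dtype=bool)
    for i in range(num_blocks):
        # clean queries: block-granular causal attention over clean keys
        q_clean = slice(i * block_len, (i + 1) * block_len)
        mask[q_clean, : (i + 1) * block_len] = True
        # noisy queries live in the second half of the sequence
        q_noisy = slice(n + i * block_len, n + (i + 1) * block_len)
        mask[q_noisy, : i * block_len] = True                          # earlier clean blocks
        mask[q_noisy, n + i * block_len : n + (i + 1) * block_len] = True  # itself
    return mask

m = skip_causal_mask(num_blocks=3, block_len=2)
print(m.shape)          # (12, 12)
print(m[6:8, :].sum())  # noisy block 0 sees only itself: 2 queries x 2 keys = 4
```

Passing such a boolean mask to a standard attention implementation is all that separates this training setup from an ordinary diffusion transformer.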
The authors articulate three design desiderata: (1) precise representation of past context, (2) balanced utilization of the full network capacity by both AR and diffusion components, and (3) direct access to the entire past sequence at each denoising step. They show that prior hybrid methods (Diffusion‑Forcing, MAR, Transfusion, etc.) violate one or more of these criteria, whereas ACDiT satisfies all.
Empirical evaluation spans image, video, and text generation. On image benchmarks (FFHQ, ImageNet) ACDiT outperforms autoregressive Vision Transformers of comparable size by 10–15% in FID while matching diffusion models in visual quality and being 20–30% faster for long sequences. For video (UCF‑101, Kinetics) the block‑wise diffusion enables generation of multi‑second clips with lower memory footprint and higher throughput than full‑sequence diffusion. In text generation, ACDiT surpasses discrete diffusion language models in perplexity and BLEU, demonstrating that the architecture is modality‑agnostic.
Beyond generation, the paper reports strong transfer to visual understanding tasks (image classification, object detection) despite the model being trained solely with a generative loss. The continuous, clean context learned by ACDiT avoids the bottleneck of a discrete token vocabulary, facilitating downstream fine‑tuning.
A detailed ablation studies the trade‑off between block size and diffusion timesteps. Larger blocks with fewer diffusion steps are more efficient for very long sequences (e.g., 30‑second videos), whereas smaller blocks with more steps yield higher fidelity for high‑resolution images. The authors provide practical guidelines for selecting these hyper‑parameters based on the target horizon and resolution.
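The trade-off can be made concrete with a back-of-the-envelope cost model. The numbers and the formula below are illustrative assumptions, not taken from the paper: each block costs `steps_per_block` forward passes, and with the KV-cache each pass only computes attention from the current block against the growing prefix.

```python
def generation_cost(seq_len, block_len, steps_per_block):
    """Rough, illustrative cost model (an assumption, not the paper's analysis):
    returns (sequential forward passes, total attention query-key pairs)."""
    num_blocks = seq_len // block_len
    passes = num_blocks * steps_per_block  # denoising passes run sequentially
    # attention work per block grows with the cached prefix length
    attn = sum(steps_per_block * block_len * ((i + 1) * block_len)
               for i in range(num_blocks))
    return passes, attn

# Same sequence length, same total attention budget order-of-magnitude:
# larger blocks with fewer steps need far fewer sequential passes.
print(generation_cost(seq_len=256, block_len=64, steps_per_block=25))   # (100, 1024000)
print(generation_cost(seq_len=256, block_len=16, steps_per_block=100))  # (1600, 3481600)
```

Under this simple model, the large-block configuration needs 16x fewer sequential passes, which is why coarse blocks suit long horizons, while small blocks trade latency for more denoising refinement per region.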
In summary, ACDiT presents a unified, block‑wise conditional diffusion framework that retains the causal generation order of autoregressive models while leveraging the expressive power of diffusion denoising. It achieves state‑of‑the‑art results across multiple visual modalities, offers fast KV‑cache‑enabled inference, and transfers effectively to discriminative tasks. The work opens avenues for unified multimodal world models and suggests that further scaling of block‑conditional diffusion could bridge the gap between generative and understanding AI.