Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Reading time: 5 minutes
...

📝 Original Info

  • Title: Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
  • ArXiv ID: 2512.14067
  • Date: 2025-12-16
  • Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

📝 Abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
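The block-wise attention pattern described above can be pictured as a mask that lets every token attend to all earlier blocks plus every token inside its own block. Below is a minimal PyTorch sketch of such a mask; the paper does not publish this code, and the function name, `block_size` value, and boolean mask convention are illustrative assumptions.

```python
import torch

def block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Sketch of a block-wise attention mask: causal across blocks,
    fully bidirectional within each block.
    mask[i, j] == True means query position i may attend to key position j."""
    # Block index of every position, e.g. block_size=4 -> [0, 0, 0, 0, 1, 1, 1, 1, ...]
    block_ids = torch.arange(seq_len) // block_size
    # A query attends to a key iff the key's block is not later than the query's block:
    # earlier blocks stay visible (causal across blocks), and all positions within
    # the same block can see each other (bidirectional inside a block).
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

if __name__ == "__main__":
    # 8 tokens in blocks of 4: the first 4 positions form a dense block,
    # and the last 4 positions can additionally attend to the entire first block.
    print(block_wise_attention_mask(seq_len=8, block_size=4).int())
```

Because earlier blocks are never revisited under this mask, their keys and values can be cached exactly as in AR decoding, which is the KV-caching benefit the abstract refers to.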

💡 Deep Analysis

Figure 1 | Benchmarking the accuracy–throughput trade-off

📄 Full Content

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu*, Lexington Whalen*¹, Zhifan Ye¹, Xin Dong, Shizhe Diao, Jingyu Liu², Chengyue Wu³, Hao Zhang, Enze Xie, Song Han⁴, Maksim Khadkevich, Jan Kautz, Yingyan (Celine) Lin¹, Pavlo Molchanov
(* Co-first author; additional affiliations: ¹Georgia Tech, ²University of Chicago, ³University of Hong Kong, ⁴MIT)

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation for higher throughput, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks and conditions each block on clean context while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than the fully bidirectional modeling used in prior work such as Dream, in addition to its known benefit of enabling KV caching, and leads to a win–win in accuracy and efficiency. Second, to mitigate the training–test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM model family, which outperforms state-of-the-art AR models and dLMs in accuracy–throughput trade-offs. For example, our Efficient-DLM 8B maintains comparable (slightly better) accuracy than Qwen3 8B and achieves +5.4%/+2.7% higher accuracy with 4.5×/2.7× higher throughput compared to Dream 7B and Qwen3 4B, respectively.

1. Introduction

The success of large language models (LLMs) has been largely driven by autoregressive (AR) modeling, where tokens are generated sequentially left to right. Despite strong benchmark performance, AR models are constrained by token-by-token decoding, which limits generation throughput, especially in memory-bounded scenarios (e.g., small batch sizes) where hardware utilization is low.

To overcome the sequential bottleneck of AR decoding, diffusion language models (dLMs) [1, 2, 3, 4] have recently emerged as an alternative paradigm. By leveraging iterative denoising steps, dLMs enable parallel, non-autoregressive generation and hold promise for higher throughput. However, despite their conceptual appeal, most existing dLMs have not delivered faster speed than AR models in practice [3, 4], due to their limited compatibility with key-value (KV) caching and the limited parallelism during decoding. Although pioneering works [5, 6, 7] demonstrate potential speed-ups on small-scale models (e.g., 110M parameters [5]) with limited downstream accuracy, successful scaling of dLMs to larger model sizes has been restricted by prohibitive training costs [8]. This is because AR models learn only left-to-right modeling, while dLMs learn all possible permutations [9], which is more difficult and requires longer training.

This work leverages pretrained AR models for initialization and systematically explores how to continuously pretrain them into dLMs that achieve high generation speed while preserving task accuracy. The key insight is that, with an appropriate training scheme in terms of attention patterns and objectives, pretrained AR models can be converted into faster dLMs that support parallel decoding with KV cache at low training cost (on the order of 10B tokens), and extended continuous training (on the order of 100B tokens) enables more aggressive parallel generation. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and proposing a continuous pretraining scheme that features a block-wise attention pattern [5] and position-dependent token masking. Specifically, our extensive study of attention patterns shows that a …
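To make the position-dependent masking objective concrete, here is a hedged PyTorch sketch of one way to assign higher masking probabilities to later tokens; the linear `p_min`-to-`p_max` schedule and the function name are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def position_dependent_mask(tokens: torch.Tensor, mask_token_id: int,
                            p_min: float = 0.1, p_max: float = 0.9) -> torch.Tensor:
    """Sketch of position-dependent token masking: later positions are masked
    with higher probability, mimicking the left-to-right order in which masks
    tend to remain unfilled at test time."""
    batch, seq_len = tokens.shape
    # Per-position masking probability, growing from p_min at the first position
    # to p_max at the last one (the linear schedule shape is an assumption).
    p_mask = torch.linspace(p_min, p_max, steps=seq_len)
    # Bernoulli draw per token: later tokens are more likely to be replaced
    # by the mask token, which the model then learns to denoise.
    masked = torch.rand(batch, seq_len) < p_mask
    return torch.where(masked, torch.full_like(tokens, mask_token_id), tokens)

if __name__ == "__main__":
    x = torch.randint(5, 1000, (2, 16))  # toy token ids (id 0 reserved for the mask)
    print(position_dependent_mask(x, mask_token_id=0))
```

In a masked-diffusion objective, the denoising loss is typically computed only on the masked positions, so a schedule like this shifts more of the training signal toward later tokens.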

Reference

This content is AI-processed based on open access ArXiv data.
