DiRL: An Efficient Post-Training Framework for Diffusion Language Models

Reading time: 5 minutes

📝 Abstract

Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.

📄 Content

Large Diffusion Language Models (dLLMs) have become a prominent research direction in NLP (Nie et al., 2025; Inception, 2025). The emergence of dLLMs such as LLaDA (Nie et al., 2025), Dream (Ye et al., 2025), Mercury (Inception, 2025) and Gemini Diffusion (Gemini, 2025), together with blockwise hybrids like SDAR (Cheng et al., 2025) that combine diffusion with traditional auto-regression (AR), confirms the scalability of this paradigm (Nie et al., 2024; Gong et al., 2024; Ni et al., 2025). Building on these results, a growing body of work now seeks to advance their performance in multi-modality (Yang et al., 2025; You et al., 2025), long-context modeling (Liu et al., 2025b; He et al., 2025) and inference efficiency (Wu et al., 2025a;b; Song et al., 2025). Although pre-training of dLLMs is now proven feasible, post-training of dLLMs, especially reinforcement learning (RL), remains underdeveloped, limiting dLLMs' performance on math tasks and their real-world deployment.

The difficulty of dLLM post-training, especially RL, lies in the fact that the logits, and the policy derived from them, cannot be computed exactly (Zhao et al., 2025a; Zhu et al., 2025b). In the original fully bidirectional dLLMs, the generation order is unconstrained, making teacher-forcing-style logit acquisition during SFT infeasible. Injecting uniform random noise into the output fails to reproduce the realistic inference step map, resulting in biased logits and a large mismatch between training and inference objectives (Zhao et al., 2025a; Wang et al., 2025b). Furthermore, in the RL stage, the absence of a KV cache further increases computational overhead (Liu et al.; Ma et al., 2025; Song et al., 2025). Most existing dLLM-based RL efforts lack an inference-engine backend, efficient training-inference co-design, and fast rollouts with online model updates, which prevents the practical adoption of mature RL algorithms such as GRPO (DeepSeek-AI, 2024; Guo et al., 2025). Blockwise dLLMs partially alleviate these issues by restricting generation within blocks, enabling exact logit computation through blockwise forward passes (Cheng et al., 2025; Wang et al., 2025b). However, they do not fully resolve the train-inference mismatch in post-training, nor do they address the efficiency and algorithmic challenges of dLLM RL. How to achieve consistent training and inference while enabling scalable RL for dLLMs remains underexplored. To fill this gap, we introduce our dLLM RL algorithm DiPO, together with the training framework DiRL, which enforces training-inference consistency and enables efficient rollouts and policy optimization for blockwise dLLMs. We further present the state-of-the-art dLLM DiRL-8B-Instruct, as shown in Figure 1, Figure 2 and Figure 3.
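GRPO, referenced above, scores each sampled completion relative to the other rollouts for the same prompt rather than against a learned value function. A minimal sketch of the standard group-relative advantage computation (this illustrates generic GRPO with scalar rewards, not the DiPO-specific machinery):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (one prompt, G samples)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one math prompt, binary correctness rewards:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the group's better completions without a critic.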

At the algorithm level, DiPO leverages the structural properties of blockwise dLLMs to achieve the first unbiased GRPO implementation for dLLMs through efficient, unbiased logit computation. At the framework level, DiRL supports the two-stage (SFT-RL) post-training of dLLMs, aligning training and inference objectives while surpassing existing methods in efficiency, as shown in Figure 2. Concretely, we integrate the efficient inference property of blockwise dLLMs with the efficient FlexAttention interface (Dong et al., 2024) and the LMDeploy framework (InternLM, 2023) to enable fast rollouts and online model updates in the API server. At the model level, based on high-quality math datasets, we train DiRL-8B-Instruct from SDAR-8B-Chat and achieve the best math-task performance among dLLMs, even outperforming comparable models in the widely adopted Qwen2.5 series of AR models (Qwen et al., 2024) on AIME24, AIME25 (MAA, 2024;2025) and OlympiadBench (He et al., 2024), as shown in Figure 1. Our contributions can be summarized as follows.

• DiRL, an efficient post-training framework for dLLMs that replaces offline model loading with inference-server-based rollouts and online policy updates, ensuring training-inference consistency, with training accelerated by FlexAttention.

• DiPO, the first unbiased GRPO implementation for dLLMs, leveraging the unbiased logit computation of blockwise dLLMs.

• DiRL-8B-Instruct, the state-of-the-art dLLM on math tasks, built on the above algorithmic and engineering improvements, as well as high-quality math data.

Blockwise Diffusion Language Models, as exemplified by BD3-LMs (Arriola et al., 2025) and SDAR (Cheng et al., 2025), primarily unify the global sequential dependency of AR models with the local parallel generation capability of dLLMs through a Semi-Autoregressive generation paradigm.
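The attention pattern underlying this semi-autoregressive paradigm reduces to a simple predicate: a query token may attend to a key token when the key's block precedes the query's block (causal across blocks) or the two share a block (bidirectional within a block). A minimal sketch of that predicate, in the same `(q_idx, kv_idx) -> bool` shape that FlexAttention-style mask callbacks use (the block size `B = 4` is an illustrative choice, not a setting from the paper):

```python
B = 4  # block size in tokens (illustrative)

def block_causal_mask(q_idx: int, kv_idx: int) -> bool:
    """True if query token q_idx may attend to key token kv_idx:
    fully bidirectional inside a block, causal across blocks."""
    return kv_idx // B <= q_idx // B
```

For example, token 5 (block 1) can attend to token 7 (also block 1, including positions "ahead" of it) but not to token 8 (block 2); this is exactly what permits exact teacher-forcing-style logit computation with one forward pass per block.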

In blockwise dLLMs, given a discrete sequence $x$, we partition it into $K$ non-overlapping text blocks, denoted as $x = (b_1, b_2, \ldots, b_K)$, where each block contains $B$ tokens. The joint probability of the sequence is then factorized into a product of blockwise conditionals,

$$p_\theta(x) = \prod_{k=1}^{K} p_\theta(b_k \mid b_{<k}),$$

where $b_{<k}$ denotes the historical context preceding the current block.

In contrast to the token-by-token generation of AR models, the intra-block conditional distribution $p_\theta(b_k \mid b_{<k})$ is modeled by a discrete diffusion process, so the tokens within a block are denoised in parallel rather than generated sequentially.
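The resulting decoding loop — autoregressive over blocks, iterative denoising within each block — can be sketched with a toy stand-in for the model. Here `denoise_step` is hypothetical and fills every masked position deterministically; a real dLLM would predict token distributions and commit only high-confidence positions per step:

```python
MASK = None  # placeholder for the [MASK] token

def denoise_step(context, block):
    """Hypothetical stand-in for the dLLM's per-step prediction: fill
    every masked position at once. A real model keeps only confident
    predictions each step and iterates until no masks remain."""
    return [len(context) + i if tok is MASK else tok
            for i, tok in enumerate(block)]

def generate(num_blocks: int, block_size: int):
    """Semi-autoregressive generation: causal across blocks,
    parallel intra-block denoising."""
    seq = []
    for _ in range(num_blocks):          # blocks decoded left to right
        block = [MASK] * block_size      # each block starts fully masked
        while MASK in block:             # denoise until block is complete
            block = denoise_step(seq, block)
        seq.extend(block)                # committed block joins the context
    return seq
```

The outer loop mirrors the factorization over $b_1, \ldots, b_K$ above, while the inner loop corresponds to the diffusion sampling of one block conditioned on $b_{<k}$.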
