dUltra: A Reinforcement-Learning-Based Accelerator for Masked Diffusion Language Models with Parallel Token Generation
📝 Abstract
Masked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. As a result, their sampling speeds are often comparable to AR + speculative decoding schemes, limiting their advantage over mainstream autoregressive approaches. Existing distillation-based accelerators (dParallel, d3LLM) finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model’s samples. We propose dUltra, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines, moving towards achieving “diffusion supremacy” over autoregressive models.

Preprint. Under review.
📄 Content
The great success of diffusion models and their inherently parallel generation make them attractive for language generation, and there has been an ongoing effort to transfer the diffusion paradigm to the natural language domain. Earlier attempts sampled in the token embedding space [Li et al., 2022, Gulrajani and Hashimoto, 2023, Gong et al., 2023] or in the latent space of a language autoencoder [Lovelace et al., 2023] using continuous diffusion. In a parallel line of work, discrete diffusion in the vocabulary space was studied in [Austin et al., 2021, Campbell et al., 2022, Meng et al., 2022, He et al., 2022, Wang and Cho, 2019]. Because they do not require an invertible token-wise embedding function, discrete diffusion language models generally outperform continuous diffusion language models when operating in a discrete state space with a countable number of states [Lou et al., 2023, Shi et al., 2024, Sahoo et al., 2024]. Moreover, the best performance is observed with an absorbing source distribution, which further restricts the possible sampling paths [Lou et al., 2023, Austin et al., 2021]. Diffusion models with an absorbing source distribution are also called masked diffusion language models (MDLMs) because they introduce a mask token. Furthermore, Ou et al. [2025] and Sahoo et al. [2024] have shown that the optimal solution minimizing the negative ELBO is independent of the time variable, motivating time-agnostic parameterization. Recent works have scaled time-agnostic MDLMs to show that their performance can rival that of autoregressive (AR) models of similar parameter count [Nie et al., 2025, Ye et al., 2025].
However, even the best open-source masked diffusion language models (MDLMs) [Nie et al., 2025, Ye et al., 2025] face the same slow sampling problem as continuous diffusion models. Apart from system-level optimizations such as approximate KV caching [Wu et al., 2025, Hu et al., 2025], attempts to accelerate MDLMs have focused on confidence- or entropy-aware sampling methods that unmask tokens in parallel [Ben-Hamu et al., 2025, Wu et al., 2025]. These deterministic parallel decoding strategies generally favor tokens with high confidence or low entropy within a single unmasking step, yielding 3-5x faster sampling. Despite this decent speed-up, the number of tokens decoded in parallel within a single denoising step remains low (∼4.5 tokens per step on GSM8K and ∼3.3 tokens per step on MATH500) [Wu et al., 2025]. Meanwhile, commercial diffusion language models [Google DeepMind, 2025, Khanna et al., 2025] operate at a throughput of ∼1000 tokens/s, significantly surpassing the state-of-the-art throughput of autoregressive models, which typically ranges from 100-300 tokens/s [Khanna et al., 2025]. Therefore, current open-source MDLMs have yet to fully exploit the parallel generation potential of diffusion language models and have not achieved “diffusion supremacy” (analogous to quantum supremacy) over autoregressive models.
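As a minimal sketch of the confidence-aware parallel decoding family described above: a single denoising step can greedily unmask every masked position whose top-1 probability clears a threshold (the function name, the exact threshold rule, and the fallback are illustrative assumptions, not the precise criterion of any cited method):

```python
import numpy as np

def parallel_unmask_step(logits, mask, tau=0.9):
    """One denoising step: unmask every masked position whose top-1
    probability exceeds a confidence threshold tau (sketch only)."""
    # Numerically stable softmax over the vocabulary dimension.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    conf = probs.max(-1)           # per-position top-1 confidence
    top = probs.argmax(-1)         # greedy token choice per position
    new_tokens = np.where(mask & (conf >= tau), top, -1)
    # Always unmask at least the single most confident masked position
    # so decoding makes progress even when nothing clears the threshold.
    if mask.any() and not ((conf >= tau) & mask).any():
        i = int(np.argmax(np.where(mask, conf, -np.inf)))
        new_tokens[i] = top[i]
    return new_tokens              # -1 means "still masked"
```

Lowering `tau` unmasks more tokens per step (more parallelism) at the cost of committing to lower-confidence, possibly mutually dependent tokens, which is exactly the trade-off the paragraph above describes.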
A fundamental observation is that when tokens are generated in parallel, MDLMs assume conditional independence between them given the masked input sequence (see also Section 2.2.2) due to the curse of dimensionality [Aaron, 2024]. If the tokens are actually dependent and ambiguous¹, the distribution actually sampled from will have unwanted modes, and the model may generate nonsensical text (Figure 1, first row, middle panel). Since acceleration methods aim to increase parallelism by decoding more tokens per step, they inevitably unmask some dependent tokens together, violating the independence assumption. Therefore, under the current paradigm of parallel decoding², any method that accelerates sampling implements some form of mode filtering: the accelerated sampling procedure tries to sample from a single mode that produces coherent text, out of all the modes that a slower model could sample from. Figure 1 illustrates this behavior.

¹ An ambiguous token is one for which multiple words in the vocabulary are predicted to have similarly high probability.
² Unless the curse of dimensionality is solved.
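The mode-mixing effect of the independence assumption can be made concrete with a toy two-token example (the vocabulary and probabilities are invented for illustration): if the true joint puts mass only on the coherent completions "New York" and "Los Angeles", the product of the per-position marginals still assigns probability to incoherent crosses like "New Angeles".

```python
# Toy joint over two adjacent tokens: only two coherent completions.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals, which is all an MDLM predicts for masked slots.
m1, m2 = {}, {}
for (a, b), p in joint.items():
    m1[a] = m1.get(a, 0.0) + p
    m2[b] = m2.get(b, 0.0) + p

# Product distribution implied by unmasking both positions in parallel.
product = {(a, b): pa * pb for a, pa in m1.items() for b, pb in m2.items()}

# Half of the product mass lands on pairs with zero joint probability,
# e.g. ("New", "Angeles") -- the "unwanted modes" in the text.
spurious = sum(p for pair, p in product.items() if pair not in joint)
```

Here `spurious` evaluates to 0.5: fully half of the parallel-decoding probability mass falls on text the sequential model would never produce, which is why accelerated samplers must implicitly filter down to a single coherent mode.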
Recent work [Qian et al., 2025, Chen et al., 2025] on accelerating diffusion LLM sampling with offline distillation demonstrates this principle as well. Specifically, d3LLM [Qian et al., 2025] distills from the sampling trajectory produced by the teacher model, effectively teaching the student model to sample from one of the high-probability teacher modes at each sampling step. Similarly, dParallel [Chen et al., 2025] proposes certainty-forcing distillation, which finetunes the model to be more confident on self-generated trajectories that are fixed per prompt. This effectively encourages the model to sample from one of the high-probability teacher modes marginalized by the prompts. However, this self-distillation approach faces two shortcomings. First, the quality of the training data cannot exceed the capability of the base model.
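For concreteness, the per-token Bernoulli unmasking planner and combined reward that the abstract attributes to dUltra could be sketched roughly as follows. All names, the sigmoid parameterization, and the reward weights `lam`/`mu` are assumptions made for illustration, not dUltra's actual implementation; what matters is that the planner's decisions have a tractable log-probability, which is what a GRPO-style policy-gradient update needs:

```python
import math, random

def plan_unmask(planner_logits, rng):
    """Sample one unmasking decision per masked token from independent
    Bernoullis and return the total log-probability of those decisions
    (the quantity a GRPO-style update would weight by the advantage)."""
    decisions, logp = [], 0.0
    for z in planner_logits:
        p = 1.0 / (1.0 + math.exp(-z))   # sigmoid: per-token unmask prob
        u = rng.random() < p             # Bernoulli sample
        decisions.append(u)
        logp += math.log(p if u else 1.0 - p)
    return decisions, logp

def reward(correct, distill_sim, num_steps, lam=0.1, mu=0.01):
    """Hypothetical scalar reward: verifiable correctness, plus a
    distillation similarity term, minus a per-step decoding cost."""
    return float(correct) + lam * distill_sim - mu * num_steps
```

A confident planner logit (large positive `z`) makes a token very likely to be unmasked this step; the `-mu * num_steps` term pressures the policy toward fewer, more parallel steps, while the correctness and distillation terms guard against the quality collapse that aggressive parallelism would otherwise cause.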