Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li 1*, Dilxat Muhtar 2 3 4*, Lu Yin 5*, Tianlong Chen 6, Shiwei Liu 2 3 4

Abstract

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical "fast" DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pre-training corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
1 The Hong Kong Polytechnic University, Hong Kong, China; 2 ELLIS Institute Tübingen, Tübingen, Germany; 3 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 4 Tübingen AI Center, Tübingen, Germany; 5 University of Surrey, Guildford, United Kingdom; 6 The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. Correspondence to: Lu Yin <l.yin@surrey.ac.uk>, Shiwei Liu <sliu@tue.ellis.eu>.

Preprint. February 27, 2026.

1. Introduction

Large language models (LLMs) have become a cornerstone of modern AI, yet their rapidly growing computational and environmental footprints raise pressing sustainability concerns (Patterson et al., 2021; Luccioni et al., 2023). This motivates renewed interest in alternative generation paradigms that can reduce inference latency and cost without sacrificing capability. Diffusion Language Models (DLMs) have recently emerged as a compelling candidate: by iteratively denoising a sequence, DLMs can in principle enable parallel token generation, offering a path toward faster, more efficient generation (Austin et al., 2021b; Lou et al., 2023; Shi et al., 2024b; Sahoo et al., 2024a; Nie et al., 2025b; Gong et al., 2024; Ye et al., 2025). When paired with established inference accelerators, such as KV caching (Ma et al., 2025; Wu et al., 2025; Liu et al., 2025) and speculative decoding (Christopher et al., 2025; Gao et al., 2025; Chen et al., 2026), DLM-based systems are often claimed to be substantially faster alternatives to standard autoregressive (AR) decoding. Yet, despite their promise, practical "fast" DLMs exhibit a striking and under-discussed behavior: many methods that aim for highly parallel decoding converge toward AR-like generation, where the effective reasoning trajectory proceeds largely from left to right (Nie et al., 2025b; Israel et al., 2025; Wu et al., 2025; Gong et al., 2025).
In other words, even when the model architecture permits bidirectional context and parallel refinement, the realized decoding dynamics can resemble a sequential construction of the output. This phenomenon makes real-world DLM usage more nuanced than the headline promise of "truly parallel decoding": speedups are often coupled to subtle quality trade-offs, and the conditions under which DLMs depart meaningfully from AR behavior remain unclear (Kang et al., 2025).

The payoff for achieving genuinely parallel (non-AR) decoding is substantial: AR-style decoding is fundamentally sequential, every token depends on the previous one, so generation latency scales roughly with output length. Although we can switch to fast parallel decoding in subsequent blocks after earlier blocks have largely converged, the need to wait for upstream stabilization introduces a sequential critical path, leading to extra latency and communication
cost (Wang et al., 2025; Fu et al., 2025). In contrast, truly non-AR parallel decoding is naturally compatible with distributed hardware: when dependencies across spans are weak, decoding can be distributed across devices, with only occasional synchronization to maintain global consistency.

In this work, we argue that one primary cause of this AR bias is a mismatch between the learning objective and the training data. Existing DLM pipelines blindly reuse training data originally designed for AR models, where reasoning trajectories are implicitly encoded as left-to-right progressions, e.g., next-token prediction-style ordering (Ye et al., 2025; Allal et al., 2025; Li et al., 2024), or sequential Chain-of-Thought (CoT) rationales (Zhao et al., 2025; Lambert et al., 2024). As a result, even if the diffusion process is nominally position-agnostic, the model can learn denoising strategies that preferentially reconstruct outputs in an AR-shaped manner. This "AR-shaped data" effect not only limits the extent to which DLMs can exploit genuine parallelism, but also complicates evaluation: a method may appear effective while largely reproducing an AR model's dynamics under a different wrapper. To test this conjecture, we conduct a systematic analysis of the decoding behavior of commonly used DLMs.

[Figure 1: four panels of decoding-dynamics plots, (a) LLaDA-8B (AO), (b) Dream-7B (AO), (c) Random, (d) Ours; each panel shows positions unmasked per step, tokens unmasked per step, the position distribution per step, and Local/Global ARness@k curves.]

Figure 1. Visualization of decoding dynamics. We plot the token position being unmasked (y-axis) against the decoding step (x-axis). (a, b) Despite using confidence-based Arbitrary Order (AO) decoding, standard DLMs (LLaDA and Dream) exhibit a strict linear diagonal pattern, revealing that their behavior collapses into autoregressive (left-to-right) generation. (c) Random decoding eliminates AR bias but lacks structure. (d) Our method (NAP) breaks the single-stream bottleneck, generating multiple reasoning trajectories simultaneously.
The main findings are summarized below.

I. Widely used training corpora are strongly sequential. We quantify the sequential dependency of datasets by measuring how strongly the token at one position is determined by its preceding context. We show a consistent trend: commonly used pre-training corpora (i.e., FineWeb (Penedo et al., 2024)) and long CoT reasoning datasets (i.e., Open-R1-Math (Team, 2025)) display strong sequence dependence.

II. DLM decoding remains largely autoregressive. Across widely used DLM families such as LLaDA (Nie et al., 2025c) and Dream (Ye et al., 2025), ARness remains high: the model still tends to "lock in" decisions in a quasi-left-to-right pattern, despite the nominally arbitrary decoding rules. Conversely, forcing genuinely low-ARness behavior, for instance by randomizing the update order aggressively, can reduce ARness but typically causes reasoning performance to collapse. Taken together, these results indicate a non-trivial tradeoff: in standard setups, either ARness stays high to maintain capacity, or lowering ARness breaks reasoning.

[Figure 2: bar charts of Pass@1 accuracy on GSM8K and MATH-500 for Dream-7B and LLaDA-8B under AR, AO, and Random decoding orders.]

Figure 2. Performance on GSM8K (left) and MATH-500 (right). Forcing low-ARness behavior (Random decoding) generally causes reasoning performance to collapse. Notably, for LLaDA, we employ a constrained block-wise decoding strategy to ensure generation validity. This preserves local structural integrity, resulting in Arbitrary Order (AO) decoding maintaining comparable performance, unlike the sharp drop observed in fully unstructured random decoding.

III. Training on long CoT data escalates ARness.
Continued post-training on standard long CoT datasets further increases ARness over time. While DLMs trained from scratch (e.g., LLaDA) tend to exhibit lower ARness than those adapted from pre-trained AR models (e.g., Dream), this gap gradually narrows with sustained CoT supervision. Intuitively, long CoT supervision provides an explicit step-by-step trajectory with a privileged ordering. Matching such training targets rewards the model for producing and stabilizing earlier tokens before later ones, thereby progressively shifting the learned decoding dynamics toward increasingly autoregressive behavior.

[Figure 3: SeqDep vs. token length, binned mean ±1 std, for (a) OpenR1-Math and (b) FineWeb.]

Figure 3. Sequential Dependence (SeqDep) Analysis on (a) OpenR1-Math and (b) FineWeb Datasets. The consistently high and rising SeqDep scores indicate that standard training corpora possess strong intrinsic sequentiality, driving models to internalize AR-like dependencies.

IV. Recent parallel fast-DLM methods gain speed by amplifying, not removing, AR-like generation. Despite being motivated by parallel decoding, many recent fast-DLM approaches achieve practical speedups by reinforcing an underlying autoregressive computation pattern. In particular, they rely on increasingly confident early predictions or staged block-wise updates that stabilize prefixes before allowing limited parallelism downstream. As a result, parallelism is effectively gated by an AR-like convergence order, and the achieved acceleration stems from exaggerating this sequential structure rather than eliminating it.
The above findings suggest that even though DLMs permit arbitrary decoding strategies, because they are trained on highly sequentially structured data, they tend to internalize an AR-like computational strategy. In other words, the training distribution teaches the model that reasoning is a chain with a privileged order, and changing the decoding procedure alone is often insufficient to undo this learned reliance. Addressing the issue therefore requires revisiting the data and supervision that shape the model's generation strategy in the first place.

To this end, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept approach that tackles the problem from a data and decoding co-design perspective. First, we curate supervision in which each example consists of multiple independent reasoning trajectories generated in parallel; this format de-emphasizes any privileged token order and is naturally compatible with denoising-style learning in DLMs. Second, we introduce a parallel-forced decoding strategy that explicitly encourages multi-token parallel updates across different reasoning traces, further steering generation away from AR-like critical paths. Together, these two components provide a simple and effective way to better align DLM behavior with truly parallel decoding. Across a range of math reasoning benchmarks, our results show that NAP, fine-tuned with 103K samples, consistently yields stronger performance under parallel decoding than the baseline trained on standard long CoT datasets. Moreover, the improvement becomes more pronounced as we increase the degree of parallelism, indicating that NAP is better aligned with non-AR decoding dynamics rather than relying on an implicit sequential critical path. Note that our goal is not to claim that NAP fully resolves the challenges of non-AR parallel decoding.
Rather, we use this small-scale, post-training-only result to show that revisiting data and supervision design is a promising direction for mitigating AR-like behavior in DLMs and moving toward genuinely non-autoregressive parallel generation. We hope our results motivate further work on data-centric approaches to unlock the full efficiency potential of DLMs.

2. Related Work

2.1. Diffusion Language Models

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021), best known for their success in image generation (Rombach et al., 2022; Nichol et al., 2022; Saharia et al., 2022), are increasingly studied as a non-autoregressive alternative for text generation. Bringing diffusion from continuous variables to discrete tokens can be formalized by treating the forward corruption as a Markov process over a finite vocabulary: D3PM (Austin et al., 2021a) instantiates this idea with discrete-time transition matrices, while subsequent work extends it to continuous time through CTMC formulations (Campbell et al., 2022). A particularly practical family is masked diffusion, which can be viewed as an absorbing-state construction in the D3PM lineage and operates directly in token space via random masking (Shi et al., 2024a). This paradigm has produced strong results across scales, from smaller models such as MDLM (Sahoo et al., 2024b) and RADD (Ou et al., 2025) to large systems like LLaDA (Nie et al., 2025a) and Dream (Ye et al., 2025). Beyond text-only settings, MMaDA (Yang et al.
, 2025) further generalizes large diffusion models to multimodal generation with a shared probabilistic view and modality-agnostic architecture, while the broader literature highlights potential benefits such as parallelizable decoding and flexible (non-left-to-right) generation orders that may be useful for complex reasoning.

2.2. Decoding Order and Sampling Schedules

A key degree of freedom in masked diffusion language models is the sampling path: which positions are updated (or committed) at each refinement step, and in what order. Rather than being a mere implementation detail, several works treat order as an explicit control knob for quality/efficiency trade-offs. P2 (Peng et al., 2025) casts order selection as a planning problem, where a separate planner chooses which tokens to denoise at each step, decoupling where/when to update from how to update. Prophet (Li et al., 2025) further leverages model confidence to commit early, switching from iterative refinement to one-shot completion when the top-2 gap indicates convergence. Order-awareness has also been pushed into training, e.g., by encouraging simpler and more coherent sampling paths (Zhu et al.). Meanwhile, Ni et al. (2026) caution that arbitrary-order flexibility can be a double-edged sword: models may preferentially resolve low-uncertainty tokens and bypass high-uncertainty branching points, collapsing the effective reasoning space, suggesting that constraining or regularizing generation order can sometimes improve reasoning.

3. Preliminaries

3.1. Diffusion Language Models

We consider diffusion language models (DLMs), in particular masked diffusion models (MDMs), which generate discrete token sequences by iteratively denoising a partially masked state. Let x denote the input prompt and let y_0 = (y_0^1, \ldots, y_0^L) \in V^L denote a clean output sequence of length L over vocabulary V.
MDMs define a forward masking process indexed by a continuous time variable t \in [0, 1], where t represents the masking ratio. Given y_0, the forward process independently masks each token with probability t:

q(y_t^i \mid y_0^i) = \begin{cases} \text{[MASK]}, & \text{with prob. } t, \\ y_0^i, & \text{with prob. } 1 - t, \end{cases} \quad (1)

and factorizes across positions as q(y_t \mid y_0) = \prod_{i=1}^{L} q(y_t^i \mid y_0^i). At t = 1, the sequence is fully masked; at t = 0, it remains unchanged.

3.2. Measuring Autoregressive Bias

To quantify how autoregressive-like a DLM decoding trajectory is, we adopt the ARness metrics proposed by Gong et al. (2025), which distinguish between global left-to-right bias and local sequential continuity. Let the decoding process be represented by a sequence of unmasked positions p = (p_1, p_2, \ldots, p_L), where p_c \in \{1, \ldots, L\} denotes the position index of the token committed at decoding step c. Let M_{c-1} be the set of masked positions just before step c.

Global ARness. This metric measures the tendency to prioritize unmasking the leftmost remaining tokens, capturing a front-to-back filling strategy. For a tolerance window k \geq 1, we define an indicator I_{\text{global}}(c, k) that is 1 if the chosen position p_c is among the k earliest positions in M_{c-1}:

I_{\text{global}}(c, k) = \begin{cases} 1, & \text{if } p_c \in \text{smallest-}k(M_{c-1}), \\ 0, & \text{otherwise.} \end{cases} \quad (2)

The Global ARness score is the average over the sequence:

\text{Global-ARness}@k = \frac{1}{L} \sum_{c=1}^{L} I_{\text{global}}(c, k) \in [0, 1]. \quad (3)

A score of 1.0 (at k = 1) indicates a strict autoregressive (left-to-right) generation order. Unless otherwise stated, we use Global-ARness@1 as the primary measure of ARness in our analysis, as it directly quantifies adherence to a causal generation order.
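Concretely, Global-ARness@k can be computed from the trace of committed positions alone. The following sketch is illustrative; the function name and trace format are ours, not from the paper's released code:

```python
def global_arness_at_k(unmask_order, k=1):
    """Global-ARness@k (Eqs. 2-3): the fraction of decoding steps whose
    committed position p_c lies among the k leftmost positions that were
    still masked just before that step (the set M_{c-1})."""
    masked = set(unmask_order)            # every position starts out masked
    hits = 0
    for p in unmask_order:                # p_c for c = 1..L
        smallest_k = sorted(masked)[:k]   # smallest-k(M_{c-1})
        if p in smallest_k:
            hits += 1
        masked.remove(p)                  # position p is now committed
    return hits / len(unmask_order)

# Strict left-to-right decoding is perfectly autoregressive at k = 1:
print(global_arness_at_k([0, 1, 2, 3], k=1))   # 1.0
# Right-to-left decoding scores low: only the final step commits the
# leftmost remaining mask.
print(global_arness_at_k([3, 2, 1, 0], k=1))   # 0.25
```

At k = 1 this reproduces the Global-ARness@1 score used as the primary measure of ARness in the analysis above.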
3.3. Measuring Sequential Dependence (SeqDep)

To quantify the intrinsic sequentiality of a dataset, we measure how much the prediction of a current text segment relies on its preceding generation history compared to relying solely on the initial prompt. Let x denote the input prompt. Suppose the corresponding output sequence is divided into N segments s = (s_1, \ldots, s_N). Using an external autoregressive scorer p_{\text{AR}} (e.g., a pretrained LLM), we define the Sequential Dependence (SeqDep) as the average log-probability gain provided by the prefix context:

\text{SeqDep}(x, s) = \frac{1}{N-1} \sum_{n=2}^{N} \left[ \log p_{\text{AR}}(s_n \mid x, s_{<n}) - \log p_{\text{AR}}(s_n \mid x) \right]. \quad (4)

[Figure 6: schematic of the decoding canvas, showing several masked thinking blocks explored in parallel (e.g., trying roots vs. factors) and a final summary block that commits the answer.]

Figure 6. Overview of the parallel-forced decoding framework. The model concurrently generates multiple independent reasoning paths within structured thinking blocks. These parallel trajectories are then synthesized into a result within a designated summary block.

5.3. Parallel-Forced Decoding

To enable the model to reason in parallel, we design a decoding canvas that spatially separates reasoning streams and enforce a structure-aware update schedule.

Decoding Canvas. We define a structured output format containing m independent reasoning blocks and one summary block:

Y = \left( B_1, R^{(1)}, B_2, R^{(2)}, \ldots, B_m, R^{(m)}, B_S, S \right), \quad (5)

where the B_j are fixed textual headers, the R^{(j)} are free-form reasoning contents for the j-th path, and S is a final summary containing the answer. Given a prompt x, we initialize a canvas of length L = \sum_j (|B_j| + L_j) + (|B_S| + L_S), where fixed headers are clamped and reasoning slots are initialized to [MASK].
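As an illustration, the canvas layout of Eq. (5) and one structure-aware update step over it can be sketched as follows. The header tags, the [MASK] string, and the confidence map are placeholder assumptions; in the real system the denoiser supplies per-token confidences:

```python
MASK = "[MASK]"

def build_canvas(m, path_len, summary_len):
    """Eq. (5): m clamped headers B_j, each followed by L_j [MASK] slots
    for R^(j), then a clamped summary header B_S and L_S slots for S.
    Header strings are illustrative placeholders."""
    tokens, block_of = [], []           # block_of[i]: which block owns slot i
    for j in range(m):
        tokens.append(f"<path-{j + 1}>")  # B_j, clamped
        block_of.append(None)
        tokens += [MASK] * path_len       # R^(j), free reasoning slots
        block_of += [j] * path_len
    tokens.append("<summary>")            # B_S, clamped
    block_of.append(None)
    tokens += [MASK] * summary_len        # S, free summary slots
    block_of += [m] * summary_len         # summary indexed as block m
    return tokens, block_of

def parallel_forced_step(tokens, block_of, confidence, budget, m):
    """One macro-parallel, micro-confidence update: split the unmasking
    budget evenly across the m reasoning blocks (macro level), and inside
    each block commit the highest-confidence masked slots, with no
    left-to-right constraint (micro level)."""
    per_block = max(1, budget // m)
    chosen = []
    for j in range(m):
        masked = [i for i, t in enumerate(tokens)
                  if t == MASK and block_of[i] == j]
        masked.sort(key=lambda i: confidence[i], reverse=True)
        chosen += masked[:per_block]
    return chosen

tokens, block_of = build_canvas(m=3, path_len=330, summary_len=32)
print(len(tokens))   # 1026 = 3 * (1 + 330) + (1 + 32)

conf = {i: 0.5 for i in range(len(tokens))}   # stand-in confidences
step = parallel_forced_step(tokens, block_of, conf, budget=9, m=3)
print(len(step))     # 9: three commits in each of the three blocks
```

The per-path budget of 330 tokens and the 32-token summary mirror the canvas budgets described in the experiments; the even macro-level split is what prevents any single block from racing ahead of the others.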
This layout effectively enforces conditional independence between R^{(i)} and R^{(j)} given the prompt, as there is no causal masking order between them in a bidirectional model.

Macro-Parallel, Micro-Confidence Updates. Standard arbitrary-order decoding often degenerates into globally sequential generation because the model preferentially resolves the immediate next tokens. NAP prevents this via a hierarchical schedule. At the macro level, we enforce strict parallelism: the unmasking budget is distributed across all m reasoning blocks {R^{(1)}, \ldots, R^{(m)}} at every step. This constraint prevents the model from stabilizing upstream paths before initiating downstream ones. At the micro level, within each individual block R^{(j)}, we apply a confidence-based strategy (i.e., masking low-confidence tokens). We do not enforce a left-to-right order locally; instead, tokens are committed based on their confidence scores. This combination ensures that the global process is parallel (evolving multiple trajectories simultaneously) while local generation retains the flexibility of non-autoregressive refinement.

6. Experiments

This section evaluates whether our decoding strategy can (i) improve reasoning performance over standard diffusion decoding rules, (ii) reshape the induced generation order as measured by ARness (Section 3.2), and (iii) mitigate order sensitivity in regimes where long-form rationales exhibit strong sequential dependence (Eq. (4)). Unless otherwise stated, all results use the same pretrained masked diffusion model and differ only in the decoding rule.

Evaluation protocol. We evaluate on a suite of reasoning benchmarks including GSM8K (Cobbe et al., 2021), MATH-500 (Lightman et al., 2023), and GPQA (Rein et al., 2024). Each example is prompted to produce a thinking path and a final answer in a fixed format; we extract answers with a deterministic parser and report accuracy.
Models and Training. We conduct experiments on two state-of-the-art diffusion language models: LLaDA-8B-Instruct (Nie et al., 2025c) and Dream-7B-Instruct (Ye et al., 2025). To validate our proposed method, we fine-tune these base models on the parallel reasoning dataset D_parallel curated via the pipeline described in Section 5.2. For a fair comparison, we also train a Long-CoT baseline on the same set of reasoning trajectories but serialized in the standard autoregressive format. Crucially, this baseline is evaluated using standard decoding, its optimal inference setting, rather than our parallel strategy, ensuring a strong and fair comparison. Both variants are trained using the standard masked diffusion objective for 3 epochs. We use the AdamW optimizer with a learning rate of 2e-6 and a global batch size of 256. All experiments are conducted on 8 NVIDIA A800 GPUs.

Table 3. Benchmark results on LLaDA-8B-Instruct and Dream-7B-Instruct under different step budgets. Tok/Step denotes the number of tokens decoded per decoding step; larger Tok/Step corresponds to higher decoding parallelism.

Benchmark  Steps  Tok/Step | LLaDA-8B  LLaDA-8B (Long-CoT)  NAP-LLaDA-8B | Dream-7B  Dream-7B (Long-CoT)  NAP-Dream-7B
GSM8K       256     4      |   46.4          54.1           56.1 (+2.0)  |   35.0          46.5          60.9 (+14.4)
GSM8K       336     3      |   54.4          60.9           63.3 (+2.4)  |   49.4          56.9          70.9 (+14.0)
GSM8K       512     2      |   62.0          82.0           82.6 (+0.6)  |   58.5          66.8          79.2 (+12.4)
GSM8K      1024     1      |   66.5          83.5           84.1 (+0.6)  |   68.9          78.0          83.6 (+5.6)
MATH-500    256     4      |   17.8          21.4           26.6 (+5.2)  |    8.8          16.2          23.8 (+7.6)
MATH-500    336     3      |   20.6          26.6           35.4 (+8.8)  |   11.4          25.6          31.4 (+5.8)
MATH-500    512     2      |   28.0          41.2           43.0 (+1.8)  |   20.8          40.0          43.0 (+3.0)
MATH-500   1024     1      |   30.4          45.0           47.0 (+2.0)  |   35.0          47.4          49.6 (+2.2)
GPQA        336     3      |   12.5          15.4           19.0 (+3.6)  |    5.8           7.3          10.5 (+3.2)
GPQA        512     2      |   18.8          21.2           25.9 (+4.7)  |   14.7          19.4          22.5 (+3.1)
GPQA       1024     1      |   20.8          23.0           28.6 (+5.6)  |   26.1          28.6          29.5 (+0.9)

Decoding baselines.
We compare several widely used unmasking rules under the common mask-and-predict framework. AR order commits the leftmost unresolved tokens at each step (a diffusion realization of left-to-right decoding). Arbitrary order (AO) commits the most confident positions. Random order (Rand) commits a uniformly random subset at each step, serving as a low-ARness control. Our method generates m multiple independent reasoning paths and a final summary on a structured canvas. To ensure a fair budget, our decoding uses the same total token cap L by allocating per-path budgets of 330 tokens and a summary budget of 32 tokens such that the overall canvas length matches the baseline. The summary block is the only region used for answer extraction and scoring.

6.1. Main Results

Table 3 summarizes the performance across three benchmarks. Across all benchmarks and step budgets, our method achieves higher accuracy than both the Base model and the Long-CoT baseline. For instance, on GSM8K with Dream-7B (1024 steps), NAP-Dream-7B reaches 83.6%, surpassing the Long-CoT model (78.0%) despite using the same amount of compute and training data. This suggests that organizing reasoning into parallel streams is a more effective supervision signal for DLMs than forcing a single long chain.

The most significant advantage of NAP appears in the low-step regime, e.g., 256 steps (4x parallelism), where the model must generate more than one token per forward pass. Standard Long-CoT models degrade sharply as parallelism increases: on Dream-7B/GSM8K, accuracy drops from 78.0% (1024 steps) to 46.5% (256 steps). This confirms that standard supervision creates a dependency on sequential stability; when forced to hurry, the reasoning collapses. In the same setting, NAP-Dream-7B maintains strong accuracy at 60.9%, compared to 46.5% for the Long-CoT baseline, thereby retaining substantially more capability.
Notably, the gap between NAP and Long-CoT widens as parallel decoding is made more aggressive, increasing from +5.6% at 1024 steps to +14.4% at 256 steps. This result validates our core hypothesis: by training on data that lacks a privileged order, the model learns to be less reliant on the immediate left-side context, enabling effective non-AR parallel decoding.

To further understand how NAP achieves these results, we analyze the relationship between performance and the sequential nature of generation (ARness). As shown in Figure 1, standard models (LLaDA/Dream) using Arbitrary Order (AO) decoding exhibit a strict diagonal pattern: even though they can decode anywhere, they effectively collapse into a left-to-right process (high ARness). In contrast, NAP (Figure 1(d)) displays distinct parallel bands, confirming that multiple reasoning trajectories are being generated simultaneously.

6.2. Ablation Studies

We investigate the individual contributions of the supervision data and the decoding strategy using Dream-7B on the GSM8K benchmark.

The Necessity of Data-Decoding Co-design. We first isolate the impact of our proposed decoding method versus the parallel-aligned data. As shown in Table 4, applying our Parallel-Forced Decoding strategy to a standard base model that has not been trained with our data leads to a larger performance drop than standard Arbitrary Order (AO) decoding. This suggests that without training support, the original Dream-7B struggles to handle the fragmented context of simultaneous generation. In addition, the decoding strategy becomes critical when parallelism is high: at the aggressive 256-step budget, our Parallel-Forced decoding outperforms AO (60.9% vs. 57.4%). This confirms that while the data provides the foundational reasoning capability, aligning the decoding strategy is essential to maintain robustness when forcing the model to generate multiple tokens in parallel.

Table 4. GSM8K accuracy using Dream-7B. Simply applying parallel decoding to a base model hurts performance; gains require aligned supervision.

Training Data       Decoding          256    512    1024
Base (Pretrained)   AO                35.0   58.5   68.9
Base (Pretrained)   Parallel-Forced   31.0   52.6   60.2
NAP (Ours)          AO                57.4   78.9   85.1
NAP (Ours)          Parallel-Forced   60.9   79.2   83.6

Impact of Parallel Width (m). We further analyze how the number of parallel reasoning paths affects performance while keeping the total token budget constant. As detailed in Table 5, increasing the number of reasoning paths from a single chain (m = 1) to three (m = 3) provides consistent accuracy gains across both model families. Specifically, NAP-Dream improves substantially from 75.4% to 83.6%, while NAP-LLaDA rises from 79.4% to 84.1%. This monotonic trend supports the view that NAP benefits from an "internal ensemble" effect, where the final summary block effectively aggregates insights from multiple diverse trajectories generated in parallel to derive a more robust answer.

Table 5. Accuracy on GSM8K with varying m. Total token budget is fixed.

Method       1 Path   2 Paths   3 Paths
NAP-Dream    75.4     78.9      83.6
NAP-LLaDA    79.4     82.6      84.1

Intrinsic Parallelism of Curated Data. To verify that our data curation pipeline effectively reduces the autoregressive bottleneck, we analyze the Sequential Dependence (SeqDep) of our constructed dataset D_parallel. As illustrated in Figure 7, the SeqDep score remains remarkably stable (mean ≈ 12) even as the sequence length grows from 500 to over 1000 tokens. Unlike standard long-chain reasoning (as shown in Section 4), where dependence often escalates with depth, our parallel-structured data maintains a consistent level of information density.
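The fixed-budget canvas underlying the varying-m comparison (m per-path blocks plus a single scored summary block, with the total token cap held constant) can be sketched as follows. The helper name and the even split across paths are assumptions for illustration; only the 330-token per-path and 32-token summary budgets come from the setup described above.

```python
def allocate_canvas(total_cap, m, summary_budget):
    """Split a fixed token cap into m per-path spans plus one summary span.

    Hypothetical helper: returns half-open (start, end) index ranges.
    Only the summary span is used for answer extraction and scoring.
    """
    per_path = (total_cap - summary_budget) // m
    spans, start = [], 0
    for _ in range(m):
        spans.append((start, start + per_path))
        start += per_path
    summary = (start, start + summary_budget)
    return spans, summary

# m = 3 paths of 330 tokens plus a 32-token summary under a 1022-token cap
paths, summary = allocate_canvas(330 * 3 + 32, m=3, summary_budget=32)
print(paths)    # [(0, 330), (330, 660), (660, 990)]
print(summary)  # (990, 1022)
```

Holding total_cap fixed while varying m, as in Table 5, trades per-path depth for path diversity.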
This "flat" dependency profile confirms that the reasoning trajectories within our data possess high conditional independence, providing the necessary learning signal for the model to perform effective parallel updates during inference.

Figure 7. SeqDep analysis on D_parallel. We visualize the Sequential Dependence (SeqDep) of our curated parallel reasoning data against token length. The green curve (binned mean, ±1 std) shows that SeqDep remains stable and relatively low across varying lengths.

7. Conclusion

In this work, we argue that the struggle of Diffusion Language Models (DLMs) to achieve genuine parallel decoding stems largely from the implicit sequentiality of standard training data. Our proposed method, NAP, demonstrates that aligning supervision with parallel decoding dynamics effectively mitigates this autoregressive collapse. By training on parallel reasoning trajectories and enforcing multi-stream updates, NAP decouples reasoning capability from sequential order, achieving superior performance in high-parallelism regimes while significantly reducing global ARness. These results suggest that unlocking the full potential of non-autoregressive generation requires moving beyond decoding heuristics to fundamentally rethink how we structure supervision for parallel reasoning.

Limitations. While NAP demonstrates the feasibility of aligning supervision with genuinely parallel decoding, our current implementation serves primarily as a proof-of-concept. The method is evaluated in a post-training setting on a relatively small scale (~100K samples). As scaling laws dictate much of DLMs' behavior, a broader pre-training phase with inherently non-sequential, parallel-structured data may be required to completely eliminate the autoregressive bottleneck.

References

Allal, L. B., Lozhkov, A., Bakouch, E., Blázquez, G. M., Penedo, G., Tunstall, L., Marafioti, A., Kydlíček, H., Lajarín, A. P., Srivastav, V., et al. SmolLM2: When smol goes big–data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025.

Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint, 2025.

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 17981–17993, 2021a.

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021b.

Campbell, A., Benton, J., Bortoli, V. D., Rainforth, T., Deligiannidis, G., and Doucet, A. A continuous time framework for discrete denoising models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

Chen, J., Liang, Y., and Liu, Z. Dflash: Block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036, 2026.

Christopher, J. K., Bartoldson, B. R., Ben-Nun, T., Cardei, M., Kailkhura, B., and Fioretto, F. Speculative diffusion decoding: Accelerating language generation through diffusion.
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 12042–12059, 2025.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Fu, H., Huang, B., Adams, V., Wang, C., Srinivasan, V., and Jiao, J. From bits to rounds: Parallel decoding with exploration for diffusion language models. arXiv preprint arXiv:2511.21103, 2025.

Gao, Y., Ji, Z., Wang, Y., Qi, B., Xu, H., and Zhang, L. Self speculative decoding for diffusion large language models. arXiv preprint arXiv:2510.04147, 2025.

Gong, S., Agarwal, S., Zhang, Y., Ye, J., Zheng, L., Li, M., An, C., Zhao, P., Bi, W., Han, J., et al. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint, 2024.

Gong, S., Zhang, R., Zheng, H., Gu, J., Jaitly, N., Kong, L., and Zhang, Y. DiffuCoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639, 2025.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Israel, D., den Broeck, G. V., and Grover, A. Accelerating diffusion LLMs via adaptive parallel decoding. CoRR, abs/2506.00413, 2025.

Kang, W., Galim, K., Oh, S., Lee, M., Zeng, Y., Zhang, S., Hooper, C., Hu, Y., Koo, H. I., Cho, N. I., et al. ParallelBench: Understanding the trade-offs of parallel decoding in diffusion LLMs. arXiv preprint arXiv:2510.04767, 2025.
Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S. Y., Bansal, H., Guha, E., Keh, S. S., Arora, K., et al. DataComp-LM: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37:14200–14282, 2024.

Li, P., Zhou, Y., Muhtar, D., Yin, L., Yan, S., Shen, L., Liang, Y., Vosoughi, S., and Liu, S. Diffusion language models know the answer before decoding. arXiv preprint arXiv:2508.19982, 2025.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

Liu, Z., Yang, Y., Zhang, Y., Chen, J., Zou, C., Wei, Q., Wang, S., and Zhang, L. dLLM-Cache: Accelerating diffusion large language models with adaptive caching. arXiv preprint arXiv:2506.06295, 2025.

Lou, A., Meng, C., and Ermon, S. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.

Luccioni, A. S., Viguier, S., and Ligozat, A.-L. Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253):1–15, 2023.

Ma, X., Yu, R., Fang, G., and Wang, X. dKV-Cache: The cache for diffusion language models, 2025. URL https://arxiv.org/abs/2505.15781.

Ni, Z., Wang, S., Yue, Y., Yu, T., Zhao, W., Hua, Y., Chen, T., Song, J., Yu, C., Zheng, B., et al. The flexibility trap: Rethinking the value of arbitrary order in diffusion language models. 2026.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML 2022), volume 162 of Proceedings of Machine Learning Research, pp. 16784–16804. PMLR, 2022.

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J., and Li, C. Large language diffusion models. CoRR, abs/2502.09992, 2025a.

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint, 2025b.

Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025c. doi: 10.48550/arXiv.2502.09992. URL https://arxiv.org/abs/2502.09992.

Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. In The Thirteenth International Conference on Learning Representations (ICLR 2025). OpenReview.net, 2025.

Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., and Dean, J. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.

Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024.

Peng, F. Z., Bezemek, Z., Patel, S., Rector-Brooks, J., Yao, S., Bose, A. J., Tong, A., and Chatterjee, P. Path planning for masked diffusion model sampling. arXiv preprint arXiv:2502.03540, 2025.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pp. 10674–10685. IEEE, 2022.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Lopes, R. G., Ayan, B. K., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

Sahoo, S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024a.

Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., and Kuleshov, V. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024b.

Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. K. Simplified and generalized masked diffusion for discrete data. In Advances in Neural Information Processing Systems 38 (NeurIPS 2024), 2024a.

Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. K. Simplified and generalized masked diffusion for discrete data. arXiv preprint, 2024b.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations (ICLR 2021). OpenReview.net, 2021.

Team, O. OpenR1-Math-220k: A large-scale math reasoning dataset. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025. Accessed 2025.

Wang, X., Xu, C., Jin, Y., Jin, J., Zhang, H., and Deng, Z. Diffusion LLMs can do faster-than-AR inference via discrete diffusion forcing. arXiv preprint, 2025.

Wen, H., Su, Y., Zhang, F., Liu, Y., Liu, Y., Zhang, Y.-Q., and Li, Y. ParaThinker: Native parallel thinking as a new paradigm to scale LLM test-time compute. arXiv preprint arXiv:2509.04475, 2025.

Wu, C., Zhang, H., Xue, S., Liu, Z., Diao, S., Zhu, L., Luo, P., Han, S., and Xie, E. Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. arXiv preprint, 2025.

Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. MMaDA: Multimodal large diffusion language models. CoRR, abs/2505.15809, 2025.

Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint, 2025.
Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scaling reasoning in diffusion large language models via reinforcement learning. CoRR, abs/2504.12216, 2025.

Zhu, Y., Chen, W., Kwok, J., and Zhao, Z. Spmdm: Enhancing masked diffusion models through simplifying sampling path. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.