Diffusion models over discrete spaces have recently shown striking empirical success, yet their theoretical foundations remain incomplete. In this paper, we study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation, with a focus on $\tau$-leaping-based samplers. We establish sharp convergence guarantees for attaining $\varepsilon$ accuracy in Kullback-Leibler (KL) divergence for both uniform and masking noising processes. For uniform discrete diffusion, we show that the $\tau$-leaping algorithm achieves an iteration complexity of order $\tilde O(d/\varepsilon)$, with $d$ the ambient dimension of the target distribution, eliminating linear dependence on the vocabulary size $S$ and improving existing bounds by a factor of $d$; moreover, we establish a matching algorithmic lower bound showing that linear dependence on the ambient dimension is unavoidable in general. For masking discrete diffusion, we introduce a modified $\tau$-leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity, termed the effective total correlation, which is bounded by $d \log S$ but can be sublinear or even constant for structured data. As a consequence, the sampler provably adapts to low-dimensional structure without prior knowledge or algorithmic modification, yielding sublinear convergence rates for various practical examples (such as hidden Markov models, image data, and random graphs). Our analysis requires no boundedness or smoothness assumptions on the score estimator beyond control of the score entropy loss.
Diffusion models have recently emerged as state-of-the-art approaches for high-fidelity image generation and video synthesis (Dhariwal and Nichol, 2021; Ho et al., 2020, 2022; Song and Ermon, 2019), and have already led to significant scientific advances in various domains, including climate modeling, protein structure prediction, and materials science (Li et al., 2024; Watson et al., 2023; Zeni et al., 2025). At their core, diffusion models are built upon two stochastic processes: a forward process that gradually corrupts the data distribution into pure noise, and a reverse process that generates samples by learning the logarithmic gradient of the perturbed marginals, commonly referred to as the score function.
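For concreteness, a standard continuous-space instance of this pair (with the Ornstein-Uhlenbeck forward process chosen purely for illustration) reads
\[
\mathrm{d}X_t = -X_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,
\qquad X_0 \sim p_{\mathrm{data}},
\]
whose time reversal, run from $t = T$ down to $t = 0$, is
\[
\mathrm{d}Y_t = \bigl[\,Y_t + 2\,\nabla \log p_{T-t}(Y_t)\,\bigr]\,\mathrm{d}t
+ \sqrt{2}\,\mathrm{d}\bar{W}_t,
\qquad Y_0 \sim \mathcal{N}(0, I_d),
\]
where $p_t$ denotes the marginal law of $X_t$; sampling thus reduces to estimating the score $\nabla \log p_t$.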
Despite their broad empirical success, diffusion models have been predominantly developed for continuous data. Their extension to discrete domains, such as natural language, graph-structured data, and categorical labels, has long remained challenging, although the idea was already discussed in Sohl-Dickstein et al. (2015). This perspective began to shift following the seminal work of Austin et al. (2021), which revealed the promise of diffusion-based approaches in discrete settings. Analogous to the continuous case, discrete diffusion models rely on a pair of noisy forward and reverse processes, with sampling achieved by learning appropriate ratios of distributions. Among recent developments (Bach and Saremi, 2025; Campbell et al., 2022; Ou et al., 2025; Sahoo et al., 2024), score-entropy discrete diffusion (SEDD) has demonstrated striking performance in text generation (Lou et al., 2024), challenging the long-standing dominance of autoregressive language models. In contrast to autoregressive approaches, diffusion-based language models are not constrained to a fixed generation order (such as left-to-right), and they naturally lend themselves to more flexible forms of controlled generation, including conditional and structured text synthesis.
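To make "learning ratios of distributions" concrete: in the discrete setting, the analogue of the score function is the collection of probability ratios between neighboring states, and SEDD fits an estimator $s_\theta$ to these ratios via the score entropy loss. A minimal statement of the objective, following Lou et al. (2024) with weighting and neighborhood conventions suppressed for readability, is
\[
s_t(x)_y \,\approx\, \frac{p_t(y)}{p_t(x)} \quad (y \neq x),
\qquad
\mathcal{L}_{\mathrm{SE}}(\theta)
= \mathbb{E}_{x \sim p_t} \sum_{y \neq x}
\Bigl( s_\theta(x)_y - \frac{p_t(y)}{p_t(x)} \log s_\theta(x)_y
+ K\Bigl(\frac{p_t(y)}{p_t(x)}\Bigr) \Bigr),
\]
where $K(a) = a \log a - a$ normalizes the loss so that it is nonnegative and vanishes exactly at $s_\theta(x)_y = p_t(y)/p_t(x)$.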
The promise of discrete diffusion models has spurred growing interest in their theoretical foundations. A particularly influential line of work formulates discrete diffusion through the lens of continuous-time Markov chains (CTMCs) (Campbell et al., 2022), in which the forward dynamics is governed by a carefully designed rate matrix, and the backward dynamics is approximated via a learned score function. Among the proposed constructions, two choices have emerged as especially prominent: the uniform rate matrix, which induces a uniform stationary distribution for the forward process, and the absorbing rate matrix, which yields a degenerate stationary distribution with an absorbing state. In practice, the performance of the resulting samplers depends sensitively on the choice of the rate matrix (Lou et al., 2024; von Rütte et al., 2025). Correspondingly, two parallel lines of work have sought to understand the sampling efficiency of discrete diffusion models, specifically, the number of steps required to produce sufficiently accurate samples, under these respective constructions. Representative results include Chen and Ying (2025); Liang et al. (2025c); Pham et al. (2025); Ren et al. (2025); Zhang et al. (2025) for uniform diffusion, and Chen et al. (2025); Conforti et al. (2025); Li and Cai (2025); Liang et al. (2025b); Park et al. (2025) for masking diffusion (also referred to as absorbing diffusion).
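To fix ideas about the sampler under study, the following is a minimal $\tau$-leaping sketch for the reverse CTMC. It is a simplified illustration rather than the paper's exact algorithm: it assumes a uniform forward rate matrix with the noise schedule absorbed into the step size, factorized per-coordinate updates, and a score oracle `score_fn`; all names are ours. The key fact it uses is the CTMC time-reversal identity, under which the reverse rate from $x$ to $y$ is $\check{Q}_t(x, y) = \frac{p_t(y)}{p_t(x)}\, Q_t(y, x)$, i.e., a forward rate reweighted by a score ratio.

```python
import numpy as np

def tau_leaping_step(x, score_fn, t, tau, rng):
    """One tau-leaping step of the reverse CTMC (uniform forward rates).

    x        : (d,) current state, entries in {0, ..., S-1}
    score_fn : oracle returning a (d, S) array of estimated ratios
               s_t(x)[i, y] ~= p_t(x with coordinate i set to y) / p_t(x)
    tau      : leap size; all rates are frozen over an interval of length tau
    """
    d = x.shape[0]
    rates = score_fn(x, t).copy()        # reverse jump rates via score ratios
    rates[np.arange(d), x] = 0.0         # no self-transitions
    jumps = rng.poisson(tau * rates)     # independent Poisson clocks per jump
    x_new = x.copy()
    for i in range(d):
        fired = np.flatnonzero(jumps[i])
        if fired.size > 0:
            # If several jumps fire in one coordinate within the leap,
            # keep one at random (a common tie-breaking convention).
            x_new[i] = rng.choice(fired)
    return x_new

# Illustrative usage with a trivial (uninformative) score oracle.
if __name__ == "__main__":
    d, S = 8, 5
    rng = np.random.default_rng(0)
    x = rng.integers(0, S, size=d)
    uniform_score = lambda xx, tt: np.ones((xx.shape[0], S))
    print(tau_leaping_step(x, uniform_score, t=1.0, tau=0.1, rng=rng))
```

Freezing the rates over each leap is what distinguishes $\tau$-leaping from exact CTMC simulation: each step costs one score evaluation regardless of how many jumps occur.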
Existing theoretical analyses for score-based discrete diffusions suggest that convergence rates typically scale at least linearly with both the vocabulary size $S$ and the ambient dimension $d$. Such scaling can quickly become prohibitive in applications; for instance, in GPT-2-based tasks, the vocabulary size is $S = 50{,}257$ and the dimension is $d \sim 10^2$ to $10^3$ (Lou et al., 2024). These considerations naturally motivate a fundamental question:
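To put these numbers in perspective, consider a back-of-the-envelope calculation (taking a product bound of order $Sd/\varepsilon$ as an illustrative stand-in for "linear in both"): with $S \approx 5 \times 10^4$ and $d \approx 10^3$, such a bound calls for on the order of $5 \times 10^7/\varepsilon$ iterations, whereas a rate of $\tilde O(d/\varepsilon)$, free of any $S$ dependence, requires only about $10^3/\varepsilon$.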
How efficient are discrete diffusion models? When is sublinear convergence possible?
To put our discussion in context, there has been substantial progress in understanding the sampling efficiency of continuous diffusion models. Seminal work by Chen et al. (2023b) characterizes the iteration complexity of the DDPM sampler under Lipschitz (or smoothness) assumptions on the score functions across all steps. Subsequent studies significantly relax these conditions and establish convergence guarantees for broader classes of continuous distributions (Benton et al., 2024; Chen et al., 2023a; Li et al., 2023). Nevertheless, it is now well understood that for general distributions, a linear dependence on the ambient dimension $d$ is unavoidable. By contrast, when the target distribution exhibits additional structure, such as Gaussian mixture models or support on low-dimensional manifolds, a growing body of work shows that popular samplers can adaptively exploit intrinsic low-dimensional geometry, achieving improved efficiency without explicit dimension reduction (see, e.g., Huang et al. (2024); Li et al. (2025); Li and Yan (2024)).