Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees

Daniil Dmitriev*†, Zhihan Huang*†, Yuting Wei†

February 17, 2026

*Equal contribution, alphabetical order.
†Department of Statistics and Data Science, the Wharton School, University of Pennsylvania; email: {daniild,zhihanh,ytwei}@wharton.upenn.edu

Abstract

Diffusion models over discrete spaces have recently shown striking empirical success, yet their theoretical foundations remain incomplete. In this paper, we study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation, with a focus on τ-leaping-based samplers. We establish sharp convergence guarantees for attaining ε accuracy in Kullback–Leibler (KL) divergence for both uniform and masking noising processes. For uniform discrete diffusion, we show that the τ-leaping algorithm achieves an iteration complexity of order $\widetilde{O}(d/\varepsilon)$, with d the ambient dimension of the target distribution, eliminating linear dependence on the vocabulary size S and improving existing bounds by a factor of d; moreover, we establish a matching algorithmic lower bound showing that linear dependence on the ambient dimension is unavoidable in general. For masking discrete diffusion, we introduce a modified τ-leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity, termed the effective total correlation, which is bounded by $d \log S$ but can be sublinear or even constant for structured data. As a consequence, the sampler provably adapts to low-dimensional structure without prior knowledge or algorithmic modification, yielding sublinear convergence rates for various practical examples (such as hidden Markov models, image data, and random graphs). Our analysis requires no boundedness or smoothness assumptions on the score estimator beyond control of the score entropy loss.

Contents

1 Introduction
  1.1 Sampling efficiency and adaptivity
  1.2 Our contributions
  1.3 Notation
2 Preliminaries of discrete diffusion models
  2.1 A continuous-time Markov chain formulation
  2.2 Score estimation
  2.3 Score-based sampling algorithms
3 Main results
  3.1 Uniform discrete diffusion
  3.2 Masking discrete diffusion
4 Discussion
A Examples of low intrinsic dimensions
  A.1 Details and formal results
  A.2 Proofs of results in Section A.1
B Technical preparations
  B.1 Score functions
  B.2 Technical lemmas
C Proofs of results in Section 3.1
  C.1 Proof of Theorem 1
  C.2 Proof of Corollary 1
  C.3 Proof of Theorem 2
  C.4 Efficient sampling for high-entropy distributions
D Proofs of results in Section 3.2
  D.1 Proof of Theorem 3
  D.2 Proof of Corollary 2
  D.3 τ-leaping for masking discrete diffusion
E Proofs of the main lemmas
  E.1 Characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$
  E.2 Proof of Lemma 8
  E.3 Proof of Lemma 9
  E.4 Proof of Lemma 10
  E.5 Proof of Lemma 11
  E.6 Proof of Lemma 12
  E.7 Proof of Lemma 14
F Proofs of the auxiliary lemmas
  F.1 Proof of Lemma 2
  F.2 Proof of Lemma 3
  F.3 Proof of Lemma 13
  F.4 Proof of Lemma 15

1 Introduction

Diffusion models have recently emerged as state-of-the-art approaches for high-fidelity image generation and video synthesis (Dhariwal and Nichol (2021); Ho et al.
(2020, 2022); Song and Ermon (2019)), and have already led to significant scientific advances in various domains, including climate modeling, protein structure prediction, and materials science (Li et al. (2024); Watson et al. (2023); Zeni et al. (2025)). At their core, diffusion models are built upon two stochastic processes: a forward process that gradually corrupts the data distribution into pure noise, and a reverse process that generates samples by learning the logarithmic gradient of the perturbed marginals, commonly referred to as the score function.

Despite their broad empirical success, diffusion models have been predominantly developed for continuous data. Their extension to discrete domains, such as natural language, graph-structured data, and categorical labels, has long remained challenging, although it was already discussed in Sohl-Dickstein et al. (2015). This perspective began to shift following the seminal work of Austin et al. (2021), which revealed the promise of diffusion-based approaches in discrete settings. Analogous to the continuous case, discrete diffusion models rely on a pair of noisy forward and reverse processes, with sampling achieved by learning appropriate ratios of distributions. Among recent developments (Bach and Saremi (2025); Campbell et al. (2022); Ou et al. (2025); Sahoo et al. (2024)), score-entropy discrete diffusion (SEDD) has demonstrated striking performance in text generation (Lou et al. (2024)), challenging the long-standing dominance of autoregressive language models. In contrast to autoregressive approaches, diffusion-based language models are not constrained to a fixed generation order (such as left-to-right), and they naturally lend themselves to more flexible forms of controlled generation, including conditional and structured text synthesis.

The promise of discrete diffusion models has spurred growing interest in their theoretical foundations.
A particularly influential line of work formulates discrete diffusion through the lens of continuous-time Markov chains (CTMCs) (Campbell et al., 2022), in which the forward dynamics is governed by a carefully designed rate matrix, and the backward dynamics is approximated via a learned score function. Among the proposed constructions, two choices have emerged as especially prominent: the uniform rate matrix, which induces a uniform stationary distribution for the forward process, and the absorbing rate matrix, which yields a degenerate stationary distribution with an absorbing state. In practice, the performance of the resulting samplers depends sensitively on the choice of the rate matrix (Lou et al. (2024); von Rütte et al. (2025)). Correspondingly, two parallel lines of work have sought to understand the sampling efficiency of discrete diffusion models (specifically, the number of steps required to produce sufficiently accurate samples) under these respective constructions. Representative results include Chen and Ying (2025); Liang et al. (2025c); Pham et al. (2025); Ren et al. (2025); Zhang et al. (2025) for uniform diffusion, and Chen et al. (2025); Conforti et al. (2025); Li and Cai (2025); Liang et al. (2025b); Park et al. (2025) for masking diffusion (also referred to as absorbing diffusion).

Existing theoretical analyses for score-based discrete diffusions suggest that convergence rates typically scale at least linearly with both the vocabulary size S and the ambient dimension d. Such scaling can quickly become prohibitive in applications; for instance, in GPT-2-based tasks, the vocabulary size is $S = 50{,}257$ and the dimension is $d = 10^2 \sim 10^3$ (Lou et al., 2024). These considerations naturally motivate a fundamental question:

How efficient are discrete diffusion models? When is sublinear convergence possible?
1.1 Sampling efficiency and adaptivity

To put our discussion in context, there has been substantial progress in understanding the sampling efficiency of continuous diffusion models. Seminal work by Chen et al. (2023b) characterizes the iteration complexity of the DDPM sampler under Lipschitz (or smoothness) assumptions on the score functions across all steps. Subsequent studies significantly relax these conditions and establish convergence guarantees for broader classes of continuous distributions (Benton et al., 2024; Chen et al., 2023a; Li et al., 2023). Nevertheless, it is now well understood that for general distributions, a linear dependence on the ambient dimension d is unavoidable. By contrast, when the target distribution exhibits additional structure, such as Gaussian mixture models or support on low-dimensional manifolds, a growing body of work shows that popular samplers can adaptively exploit intrinsic low-dimensional geometry, achieving improved efficiency without explicit dimension reduction (see, e.g., Huang et al. (2024); Li et al. (2025); Li and Yan (2024); Liang et al. (2025a)).

The landscape shifts considerably as we move to discrete diffusion models. Under the CTMC formulation, algorithms such as Gillespie's method and uniformization allow for exact simulation of the reverse process, free of discretization error (Chen and Ying, 2025; Gillespie, 1976; Van Dijk, 1992). However, these methods suffer from high computational costs in high-dimensional settings. Moreover, their convergence guarantees are inherently stochastic, as they depend on a random number of transitions. An alternative and widely adopted approach, particularly in diffusion-based language models, is provided by τ-leaping and its variants, including truncated τ-leaping (Campbell et al., 2022; Gillespie, 2001).
Originally developed in chemical kinetics, τ-leaping replaces sequential state transitions with parallel updates across coordinates, offering substantial computational gains in large systems. Yet our theoretical understanding of τ-leaping remains incomplete. Current state-of-the-art results exhibit at least linear dependence on the vocabulary size S, linear dependence on d for the absorbing case, and quadratic dependence on d for the uniform case (Conforti et al. (2025); Liang et al. (2025b,c)); see Table 1 for more details. It remains an open question whether these dependencies are fundamental information-theoretic barriers or merely analytical artifacts.

Furthermore, as in the continuous setting, an ideal sampling algorithm should automatically adjust to the intrinsic difficulty of the target distribution. For example, one would expect substantially faster convergence for Dirac delta measures or uniform target distributions, without prior knowledge of the structure or modifications to the algorithm. Existing analyses of τ-leaping do not illuminate whether such adaptivity is possible. More specifically, we aim to address the following question:

Can score-based samplers automatically adapt to structured target distributions?

1.2 Our contributions

The contributions of this work are centered on establishing sharp convergence guarantees for discrete diffusion models, bridging the gap between empirical success and theoretical understanding. Specifically, our contributions are mainly threefold:

• Optimal rates for uniform diffusion: We establish that for the uniform diffusion process, the τ-leaping sampler requires only $\widetilde{O}(d/\varepsilon)$ discretization steps to achieve an ε-error in KL divergence. This result significantly sharpens the previously best-known bound of $\widetilde{O}(d^2 S/\varepsilon)$ (Liang et al., 2025c), effectively removing a factor of d and the dependence on the vocabulary size S.
• Fundamental lower bounds: We demonstrate that the linear dependence on the dimension d is essentially unimprovable for the τ-leaping algorithm. Specifically, we show that under uniform diffusion, an $o(d)$ complexity bound is unattainable unless the target distribution is already proximal to the uniform measure. This result characterizes a fundamental price of sampling for informative distributions.

• Adaptivity for masking diffusion: For the masking diffusion process, we introduce a refined τ-leaping sampler whose complexity is governed by $\widetilde{O}(D/\varepsilon)$, where D is the effective total correlation, an information-theoretic measure of the target distribution's intrinsic complexity. Notably, while D is always bounded by the classical total correlation and the dual total correlation (and thus by $d \log S$), it can be sublinear or even $O(1)$ for highly structured data, allowing our sampler to adapt automatically to low-dimensional target distributions.

In contrast to prior work, our upper bounds do not require boundedness of the score estimator or any auxiliary regularity assumptions beyond control of the score entropy loss. The key technical ingredients include a Girsanov change-of-measure argument, combined with establishing the martingale properties of the sampling dynamics. The latter effectively separates the approximation error from the discretization error, allowing each to be analyzed independently. For the lower bound, we leverage a log-Sobolev inequality together with a strong data-processing inequality along the uniform noising process. To demonstrate the scope of our adaptivity results for masking discrete diffusion, we present several examples whose analysis requires careful constructions and delicate control of information-theoretic quantities, which may be of independent interest.

1.3 Notation

For a positive integer n, we define $[n] := \{1, \ldots
, n\}$ and let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix. Let $d > 0$ denote the number of dimensions, $S > 0$ the vocabulary size, and $T > 0$ the time horizon. Let MASK denote a special value outside of $[S]$. Let $\mathcal{X} := \mathcal{V}^d$ denote the domain, where, depending on the context, $\mathcal{V} := [S]$ or $\mathcal{V} := [S] \cup \{\mathrm{MASK}\}$. We denote the set of all distributions on $\mathcal{X}$ by $\mathcal{P}(\mathcal{X})$. Let H, KL, and I denote entropy, Kullback–Leibler (KL) divergence, and mutual information, respectively. Let $\delta_x$ denote the Dirac measure at point x. We adopt the standard asymptotic notation $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, $\lesssim$, and $\ll$. Additionally, $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ are defined analogously, except that the logarithmic dependency on d, S, 1/ε, and 1/δ is hidden. For a vector $x = (x_1, x_2, \ldots, x_d) \in \mathcal{X}$, $i \in [d]$, and $c \in \mathcal{V}$, we define the vectors $x_{-i} := (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d)$, and $x \oplus_i c,\, x \odot_i c \in \mathcal{X}$ as follows:

• $(x \oplus_i c)_j = x_j$ for all $j \neq i$, and $(x \oplus_i c)_i = (x_i + c) \bmod |\mathcal{V}|$,¹
• $(x \odot_i c)_j = x_j$ for all $j \neq i$, and $(x \odot_i c)_i = c$.

For $x, y \in \mathcal{X}$, we denote the Hamming distance by $d_H(x, y) := |\{i \in [d] : x_i \neq y_i\}|$, and for $x \in ([S] \cup \{\mathrm{MASK}\})^d$, we denote $m(x) := |\{i \in [d] : x_i = \mathrm{MASK}\}|$.

¹In this case, we assume that $\mathcal{V}$ has additive structure. We only apply this notation when $\mathcal{V} = [S]$. We use the convention that $0 \bmod S = S$.

| Paper | Noising process | Score Est. Assump. | No Early Stopping | Sampler | Iteration Complexity | Adaptation |
|---|---|---|---|---|---|---|
| Ren et al. (2025) | Uniform | Bounded | ✗ | τ-leaping | $d^2 S^2/\varepsilon$ | ✗ |
| Liang et al. (2025c) | Uniform | Bounded | ✗ | τ-leaping | $d^2 S/\varepsilon$ | ✗ |
| Our work, Theorems 1 & 2 | Uniform | No requirement | ✓ | τ-leaping | $d/\varepsilon$* | ✗ |
| Liang et al. (2025b) | Masking | Bounded | ✗ | τ-leaping | $dS/\varepsilon$ | ✗ |
| Conforti et al. (2025) | Masking | $\hat{s}_t \approx s_t$ | ✗ | DMPM | $dS/\varepsilon$ | ✗ |
| Our work, Theorem 3 | Masking | No requirement | ✓ | Algorithm 1 | $D/\varepsilon$ | ✓ |

Table 1: Comparison with prior work.
Logarithmic factors in the iteration complexity are omitted. Ren et al. (2025) and Liang et al. (2025b) describe bounds without early stopping under more stringent assumptions on the target distribution, the score function, or the score estimator. The bound in Conforti et al. (2025) depends on additional quantities involving the score estimator beyond Assumption 1, which are small whenever $s_t \approx \hat{s}_t$. The quantity D (defined in Eqn. (16)) is upper bounded by $d \log(S)$ and captures the intrinsic low-dimensional structure of the target distribution. The entry marked with * indicates sharp rates, with matching lower bounds established in Theorem 2.

2 Preliminaries of discrete diffusion models

2.1 A continuous-time Markov chain formulation

Our goal is to model d-dimensional discrete data $X_0 = (X_0^1, X_0^2, \ldots, X_0^d) \in [S]^d$. Let $q_{\mathrm{data}} = q_0$ denote the probability mass function (p.m.f.) of $X_0$ from which we aim to sample, and let $q_0^i$ be the marginal p.m.f. of the i-th coordinate. Analogous to continuous diffusion models, their discrete counterparts consist of a forward and a reverse process over the discrete space.

The forward process. We define a forward noising process that progressively transforms the data distribution $q_0$ into a distribution $q_T$ that is close to an easy-to-sample distribution. This process is modeled using a continuous-time Markov chain (CTMC).

Definition 1 (Continuous-time Markov chain). A CTMC with an initial distribution $q_0$ and rate matrices $(Q_t)_{t \in [0,T]}$ is a right-continuous stochastic process $(x_t)_{t \in [0,T]}$ such that

• $(x_t)_{t \in [0,T]}$ satisfies the Markov property: for any $0 \le s < t \le T$, the conditional distribution of $x_t$ given the history $\{x_u, u \le s\}$ depends only on $x_s$;
• for any $0 \le t < T$, the transition probabilities satisfy, as $\Delta t \to 0^+$:

$$\Pr(x_{t + \Delta t} = y \mid x_t = x) = \mathbb{1}\{x = y\} + Q_t(x, y)\,\Delta t + o(\Delta t). \tag{1}$$
Here, the rate matrices satisfy $Q_t(x, y) \ge 0$ for all $x \neq y \in \mathcal{X}$ and $Q_t(x, x) = -\sum_{y \neq x} Q_t(x, y)$.

We refer to Feinberg et al. (2014); Feller (1940) for a rigorous treatment of CTMCs. For a given $q_0$, the marginals $(q_t)_{t \in [0,T]}$ satisfying Eqn. (1) are the solutions to the Kolmogorov forward equation:

$$\frac{\mathrm{d} q_t}{\mathrm{d} t} = Q_t^\top q_t, \quad \text{for } 0 \le t \le T.$$

The reverse process. For such a CTMC, there exists a time-reversed process with an initial distribution $q_T$, rate matrices $(\overleftarrow{Q}_t)_{t \in [0,T]}$, and marginals $(\overleftarrow{q}_t)_{t \in [0,T]}$, such that $q_t \equiv \overleftarrow{q}_{T-t}$ for $t \in [0, T]$. The forward and reverse rate matrices are explicitly related (Campbell et al., 2022) by

$$\overleftarrow{Q}_t(x, y) = Q_{T-t}(y, x)\,\frac{q_{T-t}(y)}{q_{T-t}(x)}, \quad \text{for } x \neq y \in \mathcal{X} \text{ and } 0 \le t \le T. \tag{2}$$

In this paper, we focus on rate matrices that satisfy three conditions:

1. they are time-homogeneous, $Q_t \equiv Q$;
2. $Q(x, y) = 0$ whenever $d_H(x, y) \ge 2$;
3. if $d_H(x, y) = 1$ and $x_i \neq y_i$, then $Q(x, y) = Q^{\mathrm{tok}}(x_i, y_i)$, for some fixed matrix $Q^{\mathrm{tok}}$.

In particular, we consider two important instances of CTMCs that are widely adopted in practice, namely the uniform noising process and the masking (or absorbing) noising process, which are defined through the choice of $Q^{\mathrm{tok}}$:

• Uniform noising process: a CTMC is a uniform noising process if, for $a \neq b \in [S]$,

$$Q^{\mathrm{tok}}(a, b) = 1/S. \tag{3}$$

This CTMC converges to the uniform distribution on the domain $\mathcal{X} := [S]^d$ in the limit.

• Masking noising process: a CTMC on the domain $\mathcal{X} := ([S] \cup \{\mathrm{MASK}\})^d$ is a masking noising process if, for $a \neq b \in [S] \cup \{\mathrm{MASK}\}$,

$$Q^{\mathrm{tok}}(a, b) = \mathbb{1}\{a \neq \mathrm{MASK} \text{ and } b = \mathrm{MASK}\}. \tag{4}$$

The corresponding CTMC converges to the Dirac measure $(\delta_{\mathrm{MASK}})^{\otimes d}$ as $t \to \infty$. Note that we constrain the initial distribution $q_0$ to be supported on non-masked data, i.e., on $[S]^d$.
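For concreteness, both forward processes admit simple per-coordinate closed forms, obtained by exponentiating the rate matrices in (3) and (4): under uniform noising, each coordinate keeps its value with total probability $e^{-t} + (1 - e^{-t})/S$ and is otherwise uniform over $[S]$; under masking noising, each coordinate is absorbed into MASK with probability $1 - e^{-t}$. The sketch below samples $x_t \mid x_0$ coordinate-wise; the token encoding (0, ..., S−1 plus a sentinel MASK value) and function names are implementation choices for illustration, not the paper's notation.

```python
import numpy as np

MASK = -1  # sentinel token outside {0, ..., S-1}

def uniform_forward(x0, t, S, rng):
    """Sample x_t | x_0 under the uniform noising CTMC, Q_tok(a, b) = 1/S.
    With prob. e^{-t} a coordinate has seen no effective jump and keeps x0;
    otherwise it is uniform over the S tokens (so the overall probability of
    keeping the value is e^{-t} + (1 - e^{-t})/S)."""
    d = x0.shape[0]
    keep = rng.random(d) < np.exp(-t)        # no effective jump on this coordinate
    fresh = rng.integers(0, S, size=d)       # otherwise: uniform over {0, ..., S-1}
    return np.where(keep, x0, fresh)

def masking_forward(x0, t, rng):
    """Sample x_t | x_0 under the masking CTMC: each coordinate is absorbed
    into MASK at rate 1, i.e. masked with probability 1 - e^{-t}."""
    masked = rng.random(x0.shape[0]) < 1.0 - np.exp(-t)
    return np.where(masked, MASK, x0)
```

At $t = 0$ both maps return $x_0$ unchanged, and as $t \to \infty$ the samples converge to the uniform distribution and to the all-MASK state, respectively, matching the stationary distributions described above.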
2.2 Score estimation

Recall that the reverse process is a CTMC with rate matrices satisfying the relation (2), which is similar to the reverse process in the continuous case. The density ratio here generalizes the typical score function $\nabla_x \log q_t(x)$ in the continuous case and is often referred to as the (concrete) score function for discrete diffusion models (Meng et al., 2022). Formally, we define the score function $s_t(y, x)$ as

$$s_t(y, x) = \frac{q_t(y)}{q_t(x)}, \quad \text{for } x \neq y \in \mathcal{X}.$$

Score entropy loss. For both the uniform and masking noising processes, the marginals $(q_t)$, and consequently the score function, are intractable in general. In practice, one therefore resorts to an approximation $\hat{s}_t(y, x)$ of the true score function $s_t(y, x)$, which is learned from data sampled from the target distribution $q_0$. To evaluate the quality of the estimated score, a widely used loss function is the score entropy loss, originally introduced in Lou et al. (2024), which has since become the de facto standard for training score-based discrete diffusion models. This loss provides a principled objective for matching the approximate score $\hat{s}_t$ to the true score induced by the forward diffusion process. Specifically, for $t \ge 0$ and functions $\hat{s}, s : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, the score entropy loss $\mathcal{L}_{\mathrm{SE}}$ is defined as follows:

$$\mathcal{L}_{\mathrm{SE}}(t, \hat{s}, s) := \mathbb{E}_{x \sim q_t}\Big[\sum_{y \neq x} Q_t(y, x)\, s(y, x)\, D\big(\hat{s}(y, x), s(y, x)\big)\Big] \ge 0.$$

Here, for $a, b \ge 0$, the Bregman divergence for $\phi(a) = a - \log a$ is given by $D(a, b) := \frac{a}{b} - 1 - \log\frac{a}{b} \ge 0$.

In practice, to implement any sampling algorithm, one has to discretize the continuous dynamics and obtain score estimates at discrete time steps. Suppose that the score estimates $\hat{s}_{T-t}$ are obtained at discrete time points $0 \le t_0 < t_1 < \ldots < t_N \le T$. We make the following standard assumption regarding the score estimation errors.
[Figure 1 diagram labels: parallel updates (Euler method, Tweedie τ-leaping); CTMC-based (Gillespie's alg., uniformization, DMPM); intersection, i.e. τ-bridging (τ-leaping alg., truncated τ-leaping, Algorithm 1).]

Figure 1: Overview of score-based samplers. The left part comprises score-based samplers that allow parallel updates, defined as τ-leaping strategies in Lou et al. (2024). The right part consists of samplers that can be defined through the CTMC framework. At the intersection are τ-bridging strategies, defined in Eqn. (8).

Assumption 1 (Approximation error). Let $N > 0$ and $0 \le t_0 < t_1 < \ldots < t_N \le T$. We assume that

$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathcal{L}_{\mathrm{SE}}\big(T - t_k,\, \hat{s}_{T-t_k},\, s_{T-t_k}\big) \le \varepsilon_{\mathrm{score}}. \tag{5}$$

This assumption is concerned with the aggregated estimation errors over all N steps. Several works have constructed estimates that satisfy this assumption; examples include Benton et al. (2024); Lou et al. (2024); Ou et al. (2025).

2.3 Score-based sampling algorithms

Armed with the score estimates $(\hat{s}_{T-t})_{t \in \{t_0, \ldots, t_N\}}$, we aim to construct a generative model $\hat{q}_0$ that approximates the data distribution $q_0$. A natural approach proposed in Campbell et al. (2022) is to define a surrogate CTMC that starts from an easy-to-sample distribution $p_0 \approx q_T$ and approximates the backward dynamics in (2). Concretely, we define the time-inhomogeneous rate matrix

$$\hat{Q}_t(x, y) = Q_{T-t}(y, x)\, \hat{s}_{T-t}(y, x). \tag{6}$$

In practice, score estimates are only available on a fixed discretization $\tau = (t_0, \ldots, t_N)$, and extending these estimates to the full interval $[0, T]$ introduces discretization error.

τ-leaping algorithm. As mentioned above, a widely adopted sampler is the τ-leaping algorithm (Campbell et al., 2022), which approximates Eqn. (6) with multiple possible transitions within each discretization interval. Formally, for $k \in \{0, \ldots
, N-1\}$ and $t \in [t_k, t_{k+1})$, given $x_{t_k}$ and $\hat{s}_{T-t_k}$, τ-leaping obtains $x_{t_{k+1}}$ as a random vector whose coordinates are sampled independently via d one-dimensional CTMCs. For each $i \in [d]$, the initial distribution is $\delta_{x_{t_k}^i}$ and the rate matrices are given by²

$$\hat{Q}_t^i(a, b) = \hat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \oplus_i (b - a)\big), \quad \text{for } a \neq b \in \mathcal{V}. \tag{7}$$

The formulation in Eqn. (7) requires either an additive structure on the state space or the restriction that each coordinate undergoes at most one transition between discretization points. Existing analyses for uniform and masking diffusions (Campbell et al., 2022; Liang et al., 2025c) adopt the latter assumption. In Section 3.1, we explore the necessity of this requirement for the uniform noising process. Lou et al. (2024) generalizes τ-leaping by introducing a class of samplers termed τ-leaping strategies, which allow arbitrary transformations $x_{t_{k+1}}^i = f_k^i(\hat{s}_{T-t_k}, x_{t_k})$. Both the Euler method and Tweedie τ-leaping fall into this class. However, they remain challenging for direct theoretical analysis due to the absence of a CTMC structure.

²The algorithm admits an equivalent Poisson formulation, in which dS Poisson random variables corresponding to coordinate-value transitions are sampled and applied in parallel.

This paper: τ-bridging strategies. We introduce a structured class of samplers that generalizes the τ-leaping algorithm. We name this class of algorithms τ-bridging strategies; they retain the parallel updating structure while remaining analytically tractable. A τ-bridging strategy generates $x_{t_{k+1}}$ from $x_{t_k}$ by evolving d independent one-dimensional CTMCs on $[t_k, t_{k+1})$. For each coordinate $i \in [d]$, the chain is initialized at $\delta_{x_{t_k}^i}$ and has the rate matrix given by

$$\hat{Q}_t^i = G_t^i(\hat{s}_{T-t_k}, x_{t_k}), \tag{8}$$

for some mapping $G_t^i : \mathbb{R}_+^{\mathcal{X} \times \mathcal{X}} \times \mathcal{X} \to \mathbb{R}^{\mathcal{V} \times \mathcal{V}}$.
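To make the parallel-update mechanics concrete, the following is a minimal sketch of one discretization step in the at-most-one-transition regime mentioned above: each coordinate's rates are frozen at $t_k$, so once a coordinate jumps it cannot jump again, and its one-dimensional chain can be simulated exactly through its first jump time (a jump occurs before $t_{k+1}$ with probability $1 - e^{-\lambda_i h}$, with the target drawn proportionally to the rates). The rate table `s_hat` and the function name are illustrative stand-ins for the frozen score quantities appearing in Eqns. (7) and (8), not the paper's implementation.

```python
import numpy as np

def frozen_rate_step(x, s_hat, h, rng):
    """One parallel update x_{t_k} -> x_{t_{k+1}} with frozen per-coordinate rates.

    x     : (d,) current state (integer tokens)
    s_hat : (d, V) array; s_hat[i, b] is the frozen transition rate of
            coordinate i toward token b (the entry at b == x[i] is ignored)
    h     : step size t_{k+1} - t_k

    With at most one transition per coordinate, the frozen one-dimensional
    chain is simulated exactly via its first jump time Exp(lambda_i)."""
    d, V = s_hat.shape
    x_new = x.copy()
    for i in range(d):
        rates = s_hat[i].copy()
        rates[x[i]] = 0.0                          # no self-transition
        lam = rates.sum()                          # total jump intensity lambda_i
        if lam > 0 and rng.random() < 1.0 - np.exp(-lam * h):
            # first jump lands inside [t_k, t_{k+1}): pick target proportional to rates
            x_new[i] = rng.choice(V, p=rates / lam)
    return x_new
```

With all rates zero the state is unchanged, and concentrating a very large rate on a single token moves that coordinate there almost surely, which matches the intended behavior of a one-jump-per-coordinate update.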
Compared to general τ-leaping strategies, τ-bridging strategies restrict updates to CTMC-based transitions. This restriction preserves parallel coordinate updates while facilitating theoretical analysis. Figure 1 summarizes the relationships between these classes of sampling algorithms.

A representative instance of a τ-bridging sampler is the truncated τ-leaping sampler of Liang et al. (2025c). For $k \in [N]$ and $t \in [t_k, t_{k+1})$, the corresponding rate matrices take the form

$$G_t^i(\hat{s}_{T-t_k}, x_{t_k})(a, b) = \hat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \odot_i b\big)\, \mathbb{1}\{x_{t_k}^i = a\}, \quad \text{for } a \neq b \in \mathcal{V}. \tag{9}$$

The indicator $\mathbb{1}\{x_{t_k}^i = a\}$ enforces the constraint that at most one transition occurs per coordinate $i \in [d]$ within each discretization interval $[t_k, t_{k+1})$. In Section 3.2, we show that an instance of this scheme achieves sublinear complexity for the masking noising process under mild distributional assumptions. To the best of our knowledge, this is the first result establishing such a guarantee.

3 Main results

In this section, we characterize the sampling efficiency of τ-bridging strategies for both the uniform and masking noising processes. We develop sharp convergence guarantees and highlight cases where adaptivity is automatically achieved. We provide proof sketches for all results in this section, with full proofs deferred to the appendix.

3.1 Uniform discrete diffusion

3.1.1 A sharp convergence characterization

We begin with uniform discrete diffusion models, whose forward dynamics are given by the uniform noising process. We establish explicit sampling guarantees for the τ-leaping algorithm, measured in KL divergence. The proof is given in Appendix C.1.

Theorem 1. Let $q_{\mathrm{data}} = q_0$ be the data distribution on $\mathcal{X} := [S]^d$. For $0 = t_0 < t_1 < \ldots < t_N = T$, let $\Delta := \max_k \{t_{k+1} - t_k\} = O(1)$. Set $p_0 = \mathrm{Unif}(\mathcal{X})$.
Under Assumption 1, the τ-leaping algorithm initialized at $p_0$ generates a sample from $p_{\mathrm{output}} = p_T$ such that

$$\mathrm{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \lesssim \varepsilon_{\mathrm{score}} + e^{-T} d \log(S) + \Delta\, d \log(S/\Delta). \tag{10}$$

As expected, the KL divergence bound in Theorem 1 decomposes into three terms. The first term $\varepsilon_{\mathrm{score}}$ quantifies the quality of score estimation and captures the accumulation of estimation errors over the N discretization steps. The second term corresponds to the initialization error, arising from initializing the sampler with the uniform distribution $p_0$ instead of the true terminal distribution $q_T$; this term decays exponentially with the diffusion horizon T. Finally, the third term accounts for the discretization error incurred by approximating the continuous-time reverse process with a discrete-time τ-leaping scheme.

To further interpret Theorem 1 and place it in context with existing results, we highlight several of its salient features. First, the discretization error scales linearly with the dimension d and only logarithmically with the vocabulary size S. This matches the result obtained for the random walk model (Conforti et al., 2025) and reveals that the discretization error is insensitive to the distribution scale, as has been shown for continuous diffusion models (e.g., Huang et al. (2024)). Second, the theorem permits a flexible choice of step size schedules and does not require early stopping. In contrast to prior analyses that rely on carefully selected step sizes and introduce an early stopping time δ (where the algorithm outputs $p_{T-\delta}$ in place of $p_T$), the bound in Theorem 1 depends only on the maximum step size. Moreover, the same bound applies uniformly to early stopping variants: the right-hand side of (10) remains unchanged for any $\delta \ll 1$.
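As a quick sanity check on the scaling in (10) (purely illustrative arithmetic: constants and the $\varepsilon_{\mathrm{score}}$ term are ignored), a constant step size must satisfy $\Delta\, d \log(S/\Delta) \le \varepsilon$, which forces $\Delta$ on the order of $\varepsilon/(d \log S)$ and hence $N = T/\Delta$ on the order of $d/\varepsilon$ up to logarithmic factors. The helper below is a throwaway illustration, not part of the paper:

```python
import math

def steps_for_accuracy(d, S, eps, T):
    """Largest constant step size delta with delta * d * log(S/delta) <= eps
    (found by bisection on (0, 1]), and the implied iteration count
    N = ceil(T / delta).  Constants in the bound (10) are ignored."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * d * math.log(S / mid) <= eps:
            lo = mid          # feasible: discretization term is small enough
        else:
            hi = mid
    return lo, math.ceil(T / lo)
```

For GPT-2-scale parameters ($d = 1024$, $S = 50{,}257$) with $\varepsilon = 0.1$ and $T = \log(d \log(S)/\varepsilon)$, this yields $N$ on the order of $10^6$; doubling $d$ roughly doubles $N$, consistent with the $\widetilde{O}(d/\varepsilon)$ iteration complexity derived below.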
The only requirement we have on score estimation is Assumption 1, with no additional boundedness or regularity conditions (typically assumed in the existing literature). As a result, the theorem applies to a broad class of score estimation procedures commonly used in practice. We provide a sketch of its proof to illustrate the main proof ideas.

Proof sketch of Theorem 1. In view of the data-processing inequality and the chain rule for KL divergence, we upper bound the KL divergence between $q_0$ and $p_T$ by the KL divergence between the paths $q_{T-t_0, \ldots, T-t_N}$ and $p_{t_0, \ldots, t_N}$, which can be decomposed as

$$\mathrm{KL}(q_0 \,\|\, p_T) \le \mathrm{KL}\big(q_{T-t_0, \ldots, T-t_N} \,\|\, p_{t_0, \ldots, t_N}\big) = \mathrm{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}}\Big[\mathrm{KL}\big(\overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\big\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k})\big)\Big].$$

The first term is the initialization error, which can be upper bounded by the log-Sobolev inequality in Lemma 7. For the second term, we apply Girsanov's change-of-measure theorem for continuous-time Markov chains to obtain the following upper bound:

$$\frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c,\, x_t)\, D\big(\hat{s}_{T-t_k}(x_{t_k} \oplus_i c,\, x_{t_k}),\, s_{T-t}(x_t \oplus_i c,\, x_t)\big)\, \mathrm{d}t.$$

The details can be found around Eqn. (67).
To further control the right-hand side, we apply the law of cosines for Bregman divergences and derive that (with $\ell := t_k$)
\[
\begin{aligned}
&\sum_{i\in[d]}\sum_{c\in[S]} s_{T-t}(x_t \oplus_i c,\, x_t)\, D\big(\widehat{s}_{T-t_k}(x_{t_k}\oplus_i c,\, x_{t_k}),\; s_{T-t}(x_t\oplus_i c,\, x_t)\big) \\
&\quad= \underbrace{\sum_{y_\ell:\, d_H(y_\ell,x_\ell)=1} s_{T-\ell}(y_\ell, x_\ell)\, D\big(\widehat{s}_{T-\ell}(y_\ell, x_\ell),\; s_{T-\ell}(y_\ell, x_\ell)\big)}_{\text{controlled by Assumption 1}} \\
&\qquad+ \underbrace{\sum_{i\in[d]}\sum_{c\in[S]} \big(s_{T-\ell}(x_\ell\oplus_i c,\, x_\ell) - s_{T-t}(x_t\oplus_i c,\, x_t)\big)\log \widehat{s}_{T-\ell}(x_\ell\oplus_i c,\, x_\ell)}_{\text{expectation controlled by Lemma 8}} \\
&\qquad+ \sum_{y_t:\, d_H(y_t,x_t)=1}\big(-\log s_{T-t}(y_t, x_t)\big) \;-\; \sum_{y_\ell:\, d_H(y_\ell,x_\ell)=1}\big(-\log s_{T-\ell}(y_\ell, x_\ell)\big).
\end{aligned}
\]
The first term can be controlled by Assumption 1 after taking the expectation over $x_\ell \sim \overleftarrow{q}_\ell$ and integrating over time. The second term can be shown to be zero with the help of Lemma 8 after taking the expectation over $x_t \sim \overleftarrow{q}_{t|\ell}(\cdot\,|\,x_\ell)$. Thus, the problem boils down to upper bounding the third term above, whose properties are characterized in Lemma 10. After taking the expectation and integrating over time, we can upper bound the third term by $\Delta\, d \log(S/\Delta)$. Combining the bounds for all three terms completes the proof.

Next, we specialize Theorem 1 to a concrete choice of discretization schedule to derive the iteration complexity required to obtain an $\varepsilon$-accurate sampler in KL divergence. For a simple step size schedule, it turns out that $d/\varepsilon$ steps (up to logarithmic factors) suffice for convergence, significantly improving on the state-of-the-art complexity of $d^2 S/\varepsilon$ from Liang et al. (2025c). Refer to Appendix C.2 for the proof.

Corollary 1.
For the setting in Theorem 1 and $\varepsilon > 0$, the output of the τ-leaping algorithm with the constant step size schedule $t_{k+1} - t_k = T/N$ for $k \in [N-1]$ achieves
\[
\mathsf{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \;\lesssim\; \varepsilon_{\mathrm{score}} + \varepsilon,
\]
provided that the time horizon and iteration number satisfy
\[
T = \log\big(d\log(S)/\varepsilon\big) \qquad \text{and} \qquad N = \widetilde{O}\Big(\frac{d}{\varepsilon}\Big). \tag{11}
\]

Remark 1 (Step size schedule). In Corollary 1, we adopt the constant step size schedule for simplicity. This choice is optimal in the sense that it minimizes the worst-case upper bound for a fixed number of steps $N$, and it is also empirically effective (Campbell et al., 2022). However, other step size schedules commonly used in practice and theory achieve the same iteration complexity, including the exponential-then-constant schedule (defined as in Corollary 2 and used in Liang et al. (2025c)) and the log-linear schedule (Lou et al., 2024). In these works, early stopping is introduced to maintain numerical stability of score estimation during training and also to ensure a small discretization error. Our result shows that, under Assumption 1, early stopping is not necessary for a small discretization error.

3.1.2 A matching lower bound for τ-leaping

While Theorem 1 establishes an upper bound for the τ-leaping algorithm scaling nearly linearly with the dimension $d$ and logarithmically with the vocabulary size $S$, a fundamental question remains: is this dependence an intrinsic limit or merely a technical artifact? We show that the former is the case by establishing a matching lower bound.

We note that for target distributions sufficiently close to the uniform distribution, sampling can be achieved with very few steps, as the forward CTMC converges efficiently to its limit. To avoid these pathological instances, we restrict our focus to the class of distributions that remain sufficiently well separated from the uniform distribution.
Specifically, for any $\gamma \in [0,1]$, define the subset $\mathcal{P}_\gamma(\mathcal{X}) \subseteq \mathcal{P}(\mathcal{X})$ as
\[
\mathcal{P}_\gamma(\mathcal{X}) \;=\; \big\{\, q_0 \in \mathcal{P}(\mathcal{X}) \,:\, H(q_1) \le (1-\gamma)\cdot H(\mathrm{Unif}(\mathcal{X})) = (1-\gamma)\, d\log(S) \,\big\},
\]
where $q_1$ is the marginal distribution at $t=1$ of the uniform noising process initialized at $q_0$, $\mathrm{Unif}(\mathcal{X})$ is the uniform distribution on $\mathcal{X}$, and $H(\cdot)$ denotes the entropy of a distribution. Intuitively, for $\gamma \in (0,1)$, the class $\mathcal{P}_\gamma(\mathcal{X})$ imposes a structural constraint on the convergence of the forward process: it describes distributions that do not mix rapidly. In this sense, for $\gamma = O(1)$, $\mathcal{P}_\gamma(\mathcal{X})$ contains distributions that remain informative enough at time $t=1$ of the forward process. This covers most distributions of practical interest, since they carry nontrivial information characterized by relatively low entropy.

The following lower bound shows that, when sampling from a distribution in $\mathcal{P}_\gamma(\mathcal{X})$ with the τ-leaping algorithm, the iteration complexity bound in Corollary 1 cannot be improved beyond logarithmic factors. The proof is given in Appendix C.3.

Theorem 2. For any target distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$ and early stopping time $0 \le \delta \ll 1$, denote the path measure of the backward process by $Q = \{\overleftarrow{q}_t\}_{t\in[0,T-\delta]}$ and that of the sampling process by $P = \{p_t\}_{t\in[0,T-\delta]}$. Let $\gamma = \Omega(1)$. Then, for any step size schedule $0 = t_0 < t_1 < \dots < t_N = T-\delta$ with $\max_k \{t_{k+1}-t_k\} \le \frac{1}{2}$, the τ-leaping algorithm requires at least
\[
N = \Omega\big(d\log(S)\big) \tag{12}
\]
iterations to achieve $\mathsf{KL}(Q\,\|\,P) \le \varepsilon_{\mathrm{score}} + O(1)$.

We make several remarks concerning the nature and implications of our lower bound. Theorem 2 reveals that for informative target distributions in $\mathcal{P}_\gamma(\mathcal{X})$, ensuring that the KL divergence between the sampling process and the reverse process is small requires the number of steps to scale at least linearly with the dimension $d$; this cannot be avoided for general distributions.
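As a concrete illustration (ours, not from the paper) of membership in $\mathcal{P}_\gamma(\mathcal{X})$: under the standard per-coordinate uniform-noising kernel $q_{t|0}(y^i \mid x^i) = e^{-t}\mathbb{1}\{y^i = x^i\} + (1-e^{-t})/S$ (an assumption here, since the paper's forward dynamics are not reproduced above), a point mass $q_0 = \delta_x$ has a product marginal $q_1$, so $H(q_1) = d \cdot H(\text{one coordinate})$ and the resulting $\gamma$ is independent of $d$ and bounded away from zero.

```python
import math

# For q_0 a point mass on [S]^d under the assumed uniform-noising kernel,
# q_1 is a product measure; gamma = 1 - H(q_1) / (d log S) is dimension-free.
def gamma_of_dirac(S, t=1.0):
    stay = math.exp(-t) + (1 - math.exp(-t)) / S   # prob. of keeping the symbol
    move = (1 - math.exp(-t)) / S                  # prob. of each other symbol
    h = -stay * math.log(stay) - (S - 1) * move * math.log(move)
    return 1.0 - h / math.log(S)                   # H(q_1) = (1 - gamma) d log S

# gamma does not depend on d and stays bounded away from zero:
print(gamma_of_dirac(2), gamma_of_dirac(50_000))
```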
In addition, the lower bound is uniform over both early stopping schedules ($0 < \delta \ll 1$) and non-early-stopping schemes ($\delta = 0$).

This lower bound is algorithm-dependent: it relies on structural properties of the τ-leaping algorithm and therefore differs fundamentally from information-theoretic or minimax lower bounds. In principle, alternative sampling schemes may circumvent the linear dependence on $d$. Indeed, in Section 3.2, we show that a modified τ-leaping procedure achieves sublinear dependence on $d$ for structured target distributions under the masking noising process. Whether analogous improvements are possible for uniform discrete diffusion through modified algorithms remains an open question.

When the target distribution has high entropy, the lower bound need not apply. Indeed, when $q_{\mathrm{data}}$ satisfies $\mathsf{KL}(q_{\mathrm{data}} \,\|\, \mathrm{Unif}(\mathcal{X})) = o(d)$, one can show that $H(q_1) = \Theta(d\log S)$, and that a sample from a distribution with KL error at most $\varepsilon_{\mathrm{score}} + \varepsilon$ can be obtained using $N = o(d)$ steps. A precise formulation of this claim is given in Appendix C.4.

We remark that the quantity controlled in Theorem 2 is the KL divergence between two path measures, rather than the divergence between the terminal output distributions, which may appear weaker than the upper bound in Corollary 1. However, to the best of our knowledge, all existing upper-bound analyses for the KL divergence, including ours, proceed by first bounding the KL divergence between path measures and then invoking the data-processing inequality. Consequently, the lower bound applies to all current analysis techniques. In this sense, Theorem 2 establishes the optimality of the iteration complexity in Corollary 1 within the scope of existing analysis techniques. Finally, we provide a proof sketch of Theorem 2 to illustrate the main techniques.

Proof sketch of Theorem 2.
The proof is based on a refined analysis of the decay of the KL divergence along the forward process for any distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$. While we state our result for $\gamma = \Omega(1)$, the proof works for every $\gamma \in (0,1)$. It can be shown that the KL divergence along the forward process is a differentiable function of time $t$; denote its negative rate of change by $\varphi(t)$, i.e.,
\[
\varphi(t) \;=\; -\frac{\mathrm{d}}{\mathrm{d}t}\,\mathsf{KL}(q_t\,\|\,p_0) \;=\; \sum_{x,y:\, d_H(x,y)=1} q_t(x)\, s_t(y,x) \log\Big(\frac{s_t(y,x)}{s_t(x,y)}\Big),
\]
where $p_0 = \mathrm{Unif}(\mathcal{X})$ is the limit distribution of the forward noising process. First, we show that the condition $\mathsf{KL}(Q\,\|\,P) \le \varepsilon_{\mathrm{score}} + O(1)$, combined with the definition of $\mathcal{P}_\gamma(\mathcal{X})$, implies that $T > 1$ and the bound
\[
\sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}} \big(\varphi(T-t) - \varphi(T-t_k)\big)\,\mathrm{d}t \;=\; O(1). \tag{13}
\]
Furthermore, we can show that $\varphi(t)$ is a non-increasing and differentiable function of $t$. Thus, Eqn. (13) and the Newton–Leibniz formula lead to a stronger condition:
\[
\sum_{k=1}^{N-1} \inf_{t_k\le t\le t_{k+1}}\big(-\varphi'(T-t)\big)\cdot\frac{1}{2}(t_{k+1}-t_k)^2 \;\le\; \sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}}\int_{T-t}^{T-t_k} -\varphi'(u)\,\mathrm{d}u\,\mathrm{d}t \;=\; \sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}}\big(\varphi(T-t)-\varphi(T-t_k)\big)\,\mathrm{d}t \;=\; O(1). \tag{14}
\]
Next, we view the forward process as an $S$-ary symmetric channel (Makur and Polyanskiy, 2018) and apply the strong data-processing inequality to prove that for any $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$, the function $-\varphi'(t)$ is lower bounded by a quantity scaling with $\gamma\, d \log(S)$ for all $t \in (0,1)$. Since $\max_k\{t_{k+1}-t_k\}\le \frac{1}{2}$, we can choose a suitable $M$ with $1 < M < N$ and $T - t_M \in [\frac{1}{2}, 1]$. Combining this with Eqn. (14), we obtain
\[
\sum_{k=M}^{N-1}(t_{k+1}-t_k)^2 \;\lesssim\; \frac{1}{\gamma\, d \log(S)},
\]
which implies $N = \Omega(\gamma\, d \log(S))$ by the Cauchy–Schwarz inequality: indeed, $\big(\sum_{k\ge M}(t_{k+1}-t_k)\big)^2 \le N \sum_{k\ge M}(t_{k+1}-t_k)^2$, while $\sum_{k\ge M}(t_{k+1}-t_k) = T-\delta-t_M = \Omega(1)$.

3.2 Masking discrete diffusion

We now turn our attention to the masking noising process.
Our main result in this setting is an upper bound that depends intrinsically on the structural properties of the target distribution $q_{\mathrm{data}}$, rather than scaling with the ambient dimension $d$. This aligns with the intuition that for highly structured distributions, such as a sparse mixture of Dirac measures, a sensible sampler should converge at a sublinear scale, or perhaps even logarithmically in $d$.

3.2.1 Preliminaries

We begin by recalling two fundamental quantities from information theory: the total correlation and the dual total correlation. For a distribution $q$ over $[S]^d$ and $x \sim q$, the total correlation $C(q)$ and the dual total correlation $B(q)$ are defined as
\[
C(q) := \sum_{i=1}^d H(x^i) - H(x) \qquad \text{and} \qquad B(q) := H(x) - \sum_{i=1}^d H(x^i \mid x^{-i}). \tag{15}
\]
We now introduce a time-dependent quantity associated with the masking noising process. Consider a masking noising process defined by Eqn. (4) with marginals $(q_t)_{t\ge 0}$. For $x \in ([S]\cup\{\mathrm{MASK}\})^d$ and $i \ne j \in [d]$, let $x^{-(i,j)}$ denote the collection of all unmasked elements of $x$, excluding the $i$-th and $j$-th coordinates. We define the effective total correlation of the target distribution as
\[
D(q_0) := \int_0^\infty \min(1,t)\, I(t)\,\mathrm{d}t \qquad \text{with} \qquad I(t) := \sum_{i\ne j \in [d]} I\big(x_t^i;\, x_t^j \,\big|\, x_t^{-(i,j)}\big) \;\ge\; 0, \tag{16}
\]
where $I(A;B\mid C)$ denotes conditional mutual information and $x_t \sim q_t$. Lemma 16 shows that the total correlation and the dual total correlation can be expressed through $I(t)$ as
\[
B(q_0) = \int_0^\infty I(t)\,\mathrm{d}t \qquad \text{and} \qquad C(q_0) = \int_0^\infty (e^t - 1)\, I(t)\,\mathrm{d}t.
\]
Consequently, $D(q_0) \le \min\big(B(q_0), C(q_0)\big)$. The statement and proof of this result are given in Appendix E.1. Note that $B(q_0)$, $C(q_0)$, and hence $D(q_0)$ are all upper bounded by $d\log(S)$. Moreover, there exist distributions $q_0$ with $B(q_0) = O(1)$ while $C(q_0) = \Omega(d\log(S))$, and vice versa.
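Both quantities in Eqn. (15) are directly computable for small instances. The following sketch (our illustration, exponential in $d$, so only for toy examples) evaluates $C(q)$ and $B(q)$ by brute-force marginalization; for $d = 2$ both reduce to the mutual information between the two coordinates.

```python
import math

def entropy(pmf):
    # Shannon entropy (in nats) of a pmf given as {outcome: probability}.
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(q, coords):
    # Marginal pmf of the coordinates listed in `coords`.
    out = {}
    for x, p in q.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def total_correlations(q, d):
    # C(q) and B(q) as in Eqn. (15), using H(x^i | x^{-i}) = H(x) - H(x^{-i}).
    H = entropy(q)
    C = sum(entropy(marginal(q, [i])) for i in range(d)) - H
    B = H - sum(H - entropy(marginal(q, [j for j in range(d) if j != i]))
                for i in range(d))
    return C, B

# Two fully correlated uniform coordinates over [S]: for d = 2, both C and B
# reduce to the mutual information I(x^1; x^2) = log S.
S = 4
q_corr = {(a, a): 1.0 / S for a in range(S)}
C, B = total_correlations(q_corr, d=2)
```

For a product distribution, both quantities vanish, consistent with their role as dependence measures.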
We refer to Austin (2020) for a detailed study of the total correlation and the dual total correlation. Importantly, there are also natural distributions for which both $B(q_0)$ and $C(q_0)$ are of order $d$, while $D(q_0)$ remains small. See Proposition 5 for an example of such a distribution.

3.2.2 An adaptive characterization

Equipped with the above preliminaries, we present our main result on the masking noising process. The proof is given in Appendix D.1.

Theorem 3. Let $q_{\mathrm{data}} = q_0$ be the target distribution on $[S]^d$. For $0 = t_0 < t_1 < \dots < t_N = T$, let $h_k := t_{k+1} - t_k$ be the step sizes and assume that $\Delta := \max_k h_k = O(1)$. Let
\[
p_0 := \Big(\big(1 - e^{-T}\big)\,\delta_{\mathrm{MASK}} + S^{-1} e^{-T} \sum_{k=1}^S \delta_k\Big)^{\otimes d}.
\]
Under Assumption 1, Algorithm 1 initialized at $p_0$ produces a sample from $p_{\mathrm{output}} = p_T$ such that
\[
\mathsf{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \;\lesssim\; \varepsilon_{\mathrm{score}} \,+\, e^{-T} d\log(S) \,+\, \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \tag{17}
\]
A few remarks on the consequences and implications of Theorem 3 are in order. As in Theorem 1, the last term in the upper bound corresponds to the discretization error, here measured via the integrated mutual information defined in Eqn. (16). While the first two terms are generic, the third term governs the dependence on the dimension $d$ and reflects the information-theoretic properties of the target distribution. For structured distributions, our algorithm implicitly adapts to the underlying structure of the target distribution without requiring any prior knowledge of that structure or any modification to the algorithm itself.

Algorithm 1: Modified truncated τ-leaping

Input: initial distribution $p_0$; discretization steps $0 = t_0 < t_1 < \dots < t_N = T$; score estimate $\widehat{s}_{T-t}$ for $t \in \{t_0,\dots,t_{N-1}\}$.
Output: sample $\widehat{x} \in [S]^d$.
1: Sample $x_0$ from $p_0$
2: for $k = 0, \dots, N-1$ do
3:  for $i \in m(x_{t_k}) := \{\,i : x_{t_k}^i = \mathrm{MASK}\,\}$ do
4:   $\widehat{Q}_k^i(a) \leftarrow \widehat{s}_{T-t_k}\big(x_{t_k} \odot_i a,\, x_{t_k}\big)$ for $a \in [S]$
5:   $\widehat{Q}_k^i(\mathrm{MASK}) \leftarrow -\sum_{a\in[S]} \widehat{Q}_k^i(a)$
6:   if $k < N-1$ then
7:    $\Delta_k \leftarrow \big(e^{T-t_k}-1\big)\log\Big(\frac{e^T - e^{t_k}}{e^T - e^{t_{k+1}}}\Big)$
8:    $P_k \leftarrow \exp\big(\widehat{Q}_k^i(\mathrm{MASK})\,\Delta_k\big)$
9:   else
10:   $P_k \leftarrow 0$
11:  end if
12:  $x_{t_{k+1}}^i \leftarrow \begin{cases} \mathrm{MASK}, & \text{with probability } P_k, \\[2pt] a, & \text{with probability } \frac{\widehat{Q}_k^i(a)}{\sum_{b\in[S]}\widehat{Q}_k^i(b)}\,(1 - P_k), \text{ for } a \in [S]. \end{cases}$
13:  end for
14: end for
15: return $x_{t_N}$

In Appendix D.3, we analyze the performance of truncated τ-leaping as an alternative to Algorithm 1; it incurs an additional $d/N^2$ term in the upper bound of Eqn. (17), ignoring lower-order contributions. Although for structured target distributions the resulting iteration complexity already scales as $\sqrt{d}$ rather than $d$ (as in the existing literature), it does not fully adapt to the geometry of the target distribution. To provide some intuition, the standard (or truncated) τ-leaping algorithm informally satisfies, for $t \in [t_k, t_{k+1})$ (see Eqn. (9)),
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k}) \approx G_{t_k}^i(\widehat{s}_{T-t_k}, x_{t_k}), \qquad \text{and thus} \qquad \widehat{Q}_t \approx \overleftarrow{Q}_{t_k}, \tag{18}
\]
where we recall the mapping $G_t^i$ from Eqn. (8). That is, even when the score estimation is exact, $\widehat{s}_{T-t_k} \equiv s_{T-t_k}$, the τ-leaping algorithm introduces a mismatch between the surrogate and true rate matrices, since $s_{T-t_k} \not\equiv s_{T-t}$. Algorithm 1 corrects this discrepancy by enforcing
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k}) \approx G_t^i(s_{T-t}, x_{t_k}), \qquad \text{and thus} \qquad \widehat{Q}_t \approx \overleftarrow{Q}_t, \tag{19}
\]
through a rescaling of the score estimate:
\[
\widehat{s}_{T-t} \;=\; \frac{e^{T-t_k}-1}{e^{T-t}-1}\,\widehat{s}_{T-t_k}.
\]
Since this is a linear transformation of the score estimate, we can simulate the resulting dynamics while evaluating the score only at the discrete points $T-t_0, \dots, T-t_N$ (see Algorithm 1 and Lemma 13). This leads to a sharper upper bound in Theorem 3 relative to the analogous bound for truncated τ-leaping (Theorem 5; see also Remark 3).
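For concreteness, Algorithm 1 transcribes almost line by line into code. The sketch below is an illustrative implementation, not the authors' code: `score_hat` is a hypothetical user-supplied score estimate (in practice a trained network), and the loop mirrors the pseudocode above, including the quantity $\Delta_k$ and the forced unmasking at the final step.

```python
import math, random

MASK = -1  # sentinel standing in for the MASK symbol

def modified_tau_leaping(score_hat, S, d, ts, rng=random):
    """Sketch of Algorithm 1 (modified truncated tau-leaping).

    score_hat(x, k, i, a) is a stand-in for s_hat_{T-t_k}(x odot_i a, x):
    the estimated rate of unmasking coordinate i of x to symbol a at step k.
    ts = [t_0, ..., t_N] is the discretization grid with t_N = T.
    """
    T, N = ts[-1], len(ts) - 1
    # Initialization p_0: each coordinate is MASK w.p. 1 - e^{-T}, else uniform.
    x = [MASK if rng.random() < 1 - math.exp(-T) else rng.randrange(S)
         for _ in range(d)]
    for k in range(N):
        frozen = list(x)  # all masked coordinates update against x_{t_k}
        for i in range(d):
            if frozen[i] != MASK:
                continue
            Q = [score_hat(frozen, k, i, a) for a in range(S)]  # Q_hat^i_k(a)
            Q_mask = -sum(Q)                                    # Q_hat^i_k(MASK)
            if k < N - 1:
                delta_k = (math.exp(T - ts[k]) - 1) * math.log(
                    (math.exp(T) - math.exp(ts[k]))
                    / (math.exp(T) - math.exp(ts[k + 1])))
                P_k = math.exp(Q_mask * delta_k)  # prob. of staying masked
            else:
                P_k = 0.0                          # final step always unmasks
            if rng.random() < P_k:
                continue                           # coordinate stays MASK
            u, acc = rng.random() * sum(Q), 0.0    # unmask to a w.p. Q[a]/sum(Q)
            for a in range(S):
                acc += Q[a]
                if u <= acc:
                    x[i] = a
                    break
            else:
                x[i] = S - 1                       # guard against rounding
    return x
```

With exact scores for a given target, the returned samples approximate $q_{\mathrm{data}}$; since $P_k = 0$ at the last step, the output never contains MASK.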
Empirically, the benefit of rescaling the score function in masking discrete diffusion models has also been observed in prior work; see, for example, Lou et al. (2024) and Ou et al. (2025). Notably, our results are closely connected to an intriguing parallel line of work on masking diffusion models (Chen et al. (2025); Li and Cai (2025)), which focuses on the design of unmasking schedules without adopting a CTMC perspective. In particular, Chen et al. (2025) derives optimal unmasking schedules and discusses two representative instances in which the number of steps scales linearly with $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$, respectively. Their algorithms require an a priori estimate of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ or a doubling search procedure to calibrate the unmasking schedule, and they rely on a different sampling mechanism. The fact that our score-based sampler automatically exploits similar information-theoretic quantities without additional hyperparameters underscores both the fundamental nature of these quantities and the robustness of the CTMC framework.

Below we provide a proof sketch of Theorem 3, with details deferred to Appendix D.1.

Proof sketch of Theorem 3. First, Lemma 13 shows that Algorithm 1 outputs a sample from a CTMC with initial distribution $p_0$ and rate matrices
\[
\widehat{Q}_t(x,y) := \begin{cases}
\widehat{s}_{T-t_k}\big(x_{t_k} \odot_i y^i,\, x_{t_k}\big)\, \dfrac{e^{T-t_k}-1}{e^{T-t}-1}\, \mathbb{1}\{x^i = \mathrm{MASK}\}, & \text{if } d_H(x,y) = 1,\ x^i \ne y^i, \text{ and } x_{t_k}^i = \mathrm{MASK}, \\[4pt]
-\sum_{z \ne x} \widehat{Q}_t(x,z), & \text{if } y = x, \\[2pt]
0, & \text{otherwise}.
\end{cases} \tag{20}
\]
This corresponds to a τ-bridging strategy (Eqn. (8)) with the following function $G_t^i(\widehat{s}_{T-t_k}, x_{t_k})$:
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k})(a,b) \;=\; \frac{e^{T-t_k}-1}{e^{T-t}-1}\, \widehat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \odot_i b\big)\, \mathbb{1}\{x_{t_k}^i = a\} \qquad \text{for } a \ne b \in \mathcal{V}.
\]
By the data-processing inequality, we upper bound the KL divergence between $q_0$ and $p_T$ by the KL divergence between the paths $q_{T-t_0,\dots,T-t_N}$ and $p_{t_0,\dots,t_N}$. Next, we apply the Markov property of the paths along with Girsanov's change-of-measure theorem to upper bound $\mathsf{KL}(q_0\,\|\,p_T)$ by
\[
\mathsf{KL}(q_T\,\|\,p_0) \;+\; \sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k},x_t\sim \overleftarrow{q}_{t_k,t}}\Bigg[\sum_{y_t:\, Q(y_t,x_t)>0} s_{T-t}(y_t,x_t)\, D\Big(\frac{e^{T-t_k}-1}{e^{T-t}-1}\,\widehat{s}_{T-t_k}(y_t,x_t),\; s_{T-t}(y_t,x_t)\Big)\Bigg]\mathrm{d}t.
\]
The first term is the initialization error and is controlled by choosing the time horizon $T = \Omega\big(\log d + \log\log(\varepsilon^{-1} S)\big)$. For the second term, we apply the law of cosines for Bregman divergences and obtain (with $\ell := t_k$ and $y_\ell := x_\ell \odot_i c$, where $y_t = x_t \odot_i c$):
\[
\begin{aligned}
&s_{T-t}(y_t,x_t)\, D\Big(\frac{e^{T-\ell}-1}{e^{T-t}-1}\,\widehat{s}_{T-\ell}(y_t,x_t),\; s_{T-t}(y_t,x_t)\Big) \\
&\quad= \underbrace{\frac{e^{T-\ell}-1}{e^{T-t}-1}\, s_{T-\ell}(y_\ell,x_\ell)\, D\big(\widehat{s}_{T-\ell}(y_\ell,x_\ell),\; s_{T-\ell}(y_\ell,x_\ell)\big)}_{\text{controlled by Assumption 1}} \\
&\qquad+ \underbrace{\big(s_{T-t}(y_\ell,x_\ell) - s_{T-t}(y_t,x_t)\big)\log \frac{\widehat{s}_{T-\ell}(y_\ell,x_\ell)}{s_{T-\ell}(y_\ell,x_\ell)}}_{\text{expectation controlled by Lemma 14}} \;+\; s_{T-t}(y_t,x_t)\, D\big(s_{T-t}(y_\ell,x_\ell),\; s_{T-t}(y_t,x_t)\big).
\end{aligned}
\]
Similar to the proof for the uniform discrete diffusion model, the first term can be controlled by Assumption 1 after taking the expectation over $x_{t_k}\sim\overleftarrow{q}_{t_k}$ and integrating over time, and the second term can be proved to be zero by the martingale property from Lemma 14. Finally, using Dynkin's formula, we relate the third term to the effective total correlation $D(q_0)$.

Next, we derive iteration complexity guarantees for our algorithm under specific choices of step size schedules. The proof is given in Appendix D.2.

Corollary 2.
Consider the setting in Theorem 3 and let $T = \log\big(d\log(S)/\varepsilon\big)$. For a fixed $\varepsilon > 0$, the distribution $p_{\mathrm{output}}$ satisfies $\mathsf{KL}(q_{\mathrm{data}}\,\|\,p_{\mathrm{output}}) \lesssim \varepsilon_{\mathrm{score}} + \varepsilon$:

• under the constant step size schedule $t_k - t_{k-1} = T/N$ for all $k \in [N]$, provided
\[
N = \widetilde{O}\Big(\frac{B(q_{\mathrm{data}})}{\varepsilon}\Big);
\]
• under the exponential-then-constant step size schedule, with $t_{k+1} - t_k \le \kappa \min\big(1, T - t_{k+1}\big)$ for $k \in \{0,\dots,N-2\}$, $T - t_{N-1} = \varepsilon/(d\log(S))$, and $\kappa = N^{-1}\big(T + \log(\varepsilon^{-1} d\log(S))\big)$, provided
\[
N = \widetilde{O}\Big(\frac{D(q_{\mathrm{data}})}{\varepsilon}\Big) \;\le\; \widetilde{O}\Big(\frac{\min\{B(q_{\mathrm{data}}),\, C(q_{\mathrm{data}})\}}{\varepsilon}\Big).
\]

In words, Corollary 2 shows that the sampling complexity of Algorithm 1 required to obtain an $\varepsilon$-accurate distribution is governed by intrinsic complexity measures of the target distribution. Under the constant step size schedule, the iteration complexity is controlled by the dual total correlation of the target distribution, whereas under the exponential-then-constant schedule, the effective total correlation becomes the relevant quantity. For illustration, consider the following two simple examples.

• Consider first the uniform distribution on $[S]^d$. In this case, both complexity measures are independent of the ambient dimension $d$, so that
\[
N = \widetilde{O}\Big(\frac{1}{\varepsilon}\Big), \tag{21}
\]
reflecting the fact that it is exceptionally easy to sample from uniform distributions. While intuitive in hindsight, this phenomenon had not previously been formalized in the literature.

• As a second example, consider a mixture of two Dirac measures, $\frac{1}{2}\delta_{k_1} + \frac{1}{2}\delta_{k_2}$. A direct calculation shows that the dual total correlation is independent of $d$, so that
\[
N = \widetilde{O}\Big(\frac{1}{\varepsilon}\Big), \tag{22}
\]
indicating that such distributions are also handled automatically by our algorithm.
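The Dirac-mixture example is easy to check numerically. The sketch below (an illustration, not from the paper) computes $B$ and $C$ exactly for $q = \frac{1}{2}\delta_{k_1} + \frac{1}{2}\delta_{k_2}$ on $\{0,1\}^d$ with $k_1, k_2$ differing in every coordinate: $B(q) = \log 2$ for every $d$, while $C(q) = (d-1)\log 2$ grows linearly, so the constant-schedule complexity in Corollary 2 is dimension-free precisely because it is governed by $B$ rather than $C$.

```python
import math

# Exact B(q) and C(q) for q = 1/2 delta_{k1} + 1/2 delta_{k2} on {0,1}^d,
# with k1 = (0,...,0) and k2 = (1,...,1). Brute force over the 2-point support.
def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(q, coords):
    out = {}
    for x, p in q.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def B_and_C(d):
    q = {tuple([0] * d): 0.5, tuple([1] * d): 0.5}
    H = entropy(q)
    C = sum(entropy(marginal(q, [i])) for i in range(d)) - H
    B = H - sum(H - entropy(marginal(q, [j for j in range(d) if j != i]))
                for i in range(d))
    return B, C

# B stays at log 2 while C = (d - 1) log 2 grows with the dimension:
for d in (3, 6, 12):
    B, C = B_and_C(d)
```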
To further illustrate the implications of Theorem 3, we consider several representative distributions for which one or more of the quantities $B(q_{\mathrm{data}})$, $C(q_{\mathrm{data}})$, and $D(q_{\mathrm{data}})$ are small. Since the iteration complexity scales linearly with these quantities, our result shows that discrete diffusion models provably achieve efficient sampling in these settings. Appendix A develops these examples in detail and provides rigorous proofs of the stated claims.

• Hidden Markov models. Here, the observed variables correspond to words or tokens in a sentence, while the hidden states encode latent semantic topics. Under the natural assumption that topics evolve slowly, we show that $B(q_{\mathrm{data}})$ grows sublinearly with the sequence length.

• Low-dimensional structures. Motivated by image generation, when the discrete data arise from the quantization of a continuous distribution with intrinsic dimension $k$, the dual total correlation $B(q_{\mathrm{data}})$ scales linearly with $k$ rather than with the ambient dimension $d$.

• Random graph models. Such models define distributions over $d = \binom{n}{2}$ binary variables corresponding to the edges of a graph with $n$ vertices. Besides Erdős–Rényi random graphs, which have independent edges and are therefore easy to sample, we consider both sparse random regular graphs and stochastic block models. In these cases, $B(q_{\mathrm{data}})$ grows at most linearly (up to logarithmic factors) with $n$, rather than quadratically.

• Structure-with-noise distributions. Finally, we present an example in which both the total correlation $C(q_{\mathrm{data}})$ and the dual total correlation $B(q_{\mathrm{data}})$ are of order $d$, while the effective total correlation $D(q_{\mathrm{data}})$ remains of constant order. Such distributions are motivated by applications such as error-correcting codes and DNA sequences, where substantial noise may be present, yet the underlying signal is highly structured.
4 Discussion

In this work, we establish novel theoretical results for both uniform and masking discrete diffusions. For uniform diffusion models, we show that the τ-leaping algorithm requires $\widetilde{O}(d/\varepsilon)$ iterations to achieve $\varepsilon$ accuracy in KL divergence, improving on the prior bound $\widetilde{O}(d^2 S/\varepsilon)$. We further establish the first algorithmic lower bound for the τ-leaping sampler, which shows that our upper bound is unimprovable for a large class of distributions. For masking discrete diffusion, we derive an upper bound that captures the intrinsic complexity of the data distribution and can scale logarithmically with the ambient dimension. Importantly, our results for both models only require a small score estimation error and, in contrast to prior work, do not rely on early stopping or boundedness assumptions on the score estimator.

The improved bound for the masking noising process is achieved via a modification of the τ-leaping algorithm. This modification falls within a structured subclass of τ-leaping strategies that (i) allow for parallel coordinate updates, and thus sublinear rates, and (ii) preserve CTMC dynamics, which facilitates theoretical analysis. We hope that this perspective motivates the development of adaptive samplers for uniform discrete diffusion in future work.

Several other open questions remain. Understanding which noising mechanisms (masking, uniform, or others) are best suited to different classes of target distributions is an important direction for future work. Moreover, the problem of learning accurate score functions in discrete diffusion models remains largely unexplored and warrants further investigation.

Acknowledgements

This work is supported in part by the NSF grants CCF-2106778, CCF-2418156 and CAREER award DMS-2143215.

References

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993.

Austin, T. (2020). Multi-variate correlation and mixtures of product measures. Kybernetika, pages 459–499.

Bach, F. and Saremi, S. (2025). Sampling binary data by denoising through score functions. arXiv preprint arXiv:2502.00557.

Benton, J., Shi, Y., De Bortoli, V., Deligiannidis, G., and Doucet, A. (2024). From denoising diffusions to denoising Markov models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301.

Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., and Doucet, A. (2022). A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279.

Chen, H., Lee, H., and Lu, J. (2023a). Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR.

Chen, H. and Ying, L. (2025). Convergence analysis of discrete diffusion model: exact implementation through uniformization. Journal of Machine Learning, 4(2):108–127.

Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. (2023b). Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations.

Chen, S., Cong, K., and Li, J. (2025). Optimal inference schedules for masked diffusion models. arXiv preprint arXiv:2511.04647.

Conforti, G., Durmus, A., and Pham, L.-T.-N. (2025). Non-asymptotic convergence of discrete diffusion models: Masked and random walk dynamics. arXiv preprint arXiv:2512.00580.

Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.

Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794.

Feinberg, E. A., Mandava, M., and Shiryaev, A. N. (2014). On solutions of Kolmogorov's equations for nonhomogeneous jump Markov processes. Journal of Mathematical Analysis and Applications, 411(1):261–270.

Feller, W. (1940). On the integro-differential equations of purely discontinuous Markoff processes. Transactions of the American Mathematical Society, 48(3):488–515.

Gales, M. and Young, S. (2024). The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing, 1(3):195–304.

Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics, 22(4):403–434.

Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of Chemical Physics, 115(4):1716–1733.

Gorban, A. N. and Tyukin, I. Y. (2018). Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2118):20170237.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. (2022). Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646.

Huang, Z., Wei, Y., and Chen, Y. (2024). Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality. arXiv preprint arXiv:2410.18784.

Ingraham, J., Garg, V., Barzilay, R., and Jaakkola, T. (2019). Generative models for graph-based protein design. Advances in Neural Information Processing Systems, 32.

Li, G. and Cai, C. (2025). Breaking AR's sampling bottleneck: Provable acceleration via diffusion language models. Advances in Neural Information Processing Systems, 38.

Li, G., Cai, C., and Wei, Y. (2025). Dimension-free convergence of diffusion models for approximate Gaussian mixtures. arXiv preprint arXiv:2504.05300.

Li, G., Wei, Y., Chen, Y., and Chi, Y. (2023). Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv preprint arXiv:2306.09251.

Li, G. and Yan, Y. (2024). Adapting to unknown low-dimensional structures in score-based diffusion models. Advances in Neural Information Processing Systems, 37:126297–126331.

Li, L., Carver, R., Lopez-Gomez, I., Sha, F., and Anderson, J. (2024). Generative emulation of weather forecast ensembles with diffusion models. Science Advances, 10(13).

Liang, J., Huang, Z., and Chen, Y. (2025a). Low-dimensional adaptation of diffusion models: Convergence in total variation. arXiv preprint arXiv:2501.12982.

Liang, Y., Huang, R., Lai, L., Shroff, N., and Liang, Y. (2025b). Absorb and converge: Provable convergence guarantee for absorbing discrete diffusion models. Advances in Neural Information Processing Systems, 39.

Liang, Y., Liang, Y., Lai, L., and Shroff, N. (2025c). Discrete diffusion models: Novel analysis and new sampler guarantees. Advances in Neural Information Processing Systems, 39.

Liebenau, A. and Wormald, N. (2024). Asymptotic enumeration of graphs by degree sequence, and the degree sequence of a random graph. Journal of the European Mathematical Society, 26:1–40.

Lou, A., Meng, C., and Ermon, S. (2024). Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, pages 4735–4763. PMLR.

Makur, A. and Polyanskiy, Y. (2018). Comparison of channels: Criteria for domination by a symmetric channel. IEEE Transactions on Information Theory, 64(8):5704–5725.
Meng, C., Choi, K., Song, J., and Ermon, S. (2022). Concrete score matc hing: Generalized score matc hing for discrete data. A dvanc es in Neur al Information Pr o c essing Systems , 35:34532–34545. Mor, B., Garh wal, S., and Kumar, A. (2021). A systematic review of hidden mark ov mo dels and their applications. Ar chives of c omputational metho ds in engine ering , 28(3). Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. (2025). Y our absorbing discrete diffusion secretly mo dels the conditional distributions of clean data. In The Thirte enth International Confer enc e on L e arning R epr esentations . P ark, Y.-H., Lai, C.-H., Ha yak aw a, S., T akida, Y., and Mitsufuji, Y. (2025). Jump your steps: Optimizing sampling schedule of discrete diffusion mo dels. In The Thirte enth International Confer enc e on L e arning R epr esentations . Pham, L.-T.-N., Shariatian, D., Ocello, A., Conforti, G., and Durm us, A. O. (2025). Discrete mark o v probabilistic mo dels: An improv ed discrete score-based framew ork with sharp conv ergence b ounds under minimal assumptions. In International Confer enc e on Machine L e arning . P op e, P ., Zh u, C., Ab delk ader, A., Goldblum, M., and Goldstein, T. (2021). The in trinsic dimension of images and its impact on learning. In International Confer enc e on L e arning R epr esentations . Ren, Y., Chen, H., Rotsk off, G. M., and Ying, L. (2025). How discrete and contin uous diffusion meet: Comprehensiv e analysis of discrete diffusion models via a sto c hastic integral framework. In International Confer enc e on L e arning R epr esentations . Saho o, S., Arriola, M., Schiff, Y., Gok aslan, A., Marro quin, E., Chiu, J., Rush, A., and Kulesho v, V. (2024). Simple and effectiv e masked diffusion language mo dels. A dvanc es in Neur al Information Pr o c essing Systems , 37:130136–130184. Sohl-Dic kstein, J., W eiss, E., Mahesw aranathan, N., and Ganguli, S. (2015). 
Deep unsup ervised learning using nonequilibrium thermodynamics. In International Confer enc e on Machine L e arning , pages 2256– 2265. Song, Y. and Ermon, S. (2019). Generativ e modeling b y estimating gradien ts of the data distribution. A dvanc es in Neur al Information Pr o c essing Systems , 32. V an Dijk, N. M. (1992). Uniformization for nonhomogeneous marko v chains. Op er ations r ese ar ch letters , 12(5):283–291. v on Rütte, D., Fluri, J., P o oladzandi, O., Sc hölkopf, B., Hofmann, T., and Orvieto, A. (2025). Scaling b eha vior of discrete diffusion language models. arXiv pr eprint arXiv:2512.10858 . W atson, J. L., Juergens , D., Bennett, N. R., T ripp e, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. (2023). De nov o design of protein structure and function with rfdiffusion. Natur e , 620(7976):1089–1100. Xu, M., Y u, L., Song, Y., Shi, C., Ermon, S., and T ang, J. (2022). Geo diff: A geometric diffusion mo del for molecular conformation generation. In International Confer enc e on L e arning R epr esentations . 18 Zeni, C., Pinsler, R., Zügner, D., F owler, A., Horton, M., F u, X., W ang, Z., Shyshey a, A., Crabbé, J., Ueda, S., et al. (2025). A generativ e model for inorganic materials design. Natur e , 639(8055):624–632. Zhang, Z., Chen, Z., and Gu, Q. (2025). Conv ergence of score-based discrete diffusion mo dels: A discrete-time analysis. In International Confer enc e on L e arning R epr esentations . A Examples of lo w in trinsic dimensions A.1 Details and formal results In this section, we revisit the examples outlined in Section 3.2.2 and develop them in full detail. W e formalize the statements in this section, and provide rigorous pro ofs in Appendix A.2 . Hidden Marko v mo dels. A hidden Marko v mo del (HMM) consists of a latent Mark ov c hain whose states are observed only indirectly through noisy measurements. 
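Before turning to the formal definitions below, the latent-chain-plus-noisy-emission structure can be made concrete with a short simulation. The sketch below is purely illustrative (the helper names `sample_hmm` and `prop1_bound`, the symmetric flip-noise emission channel, and all parameter values are our own choices, not from the paper); it draws one lazy latent chain and evaluates the quantity $pd\log(|\mathcal{Z}|/p)$ that Proposition 1 in this section uses to bound the dual total correlation.

```python
import math
import random

def sample_hmm(d, num_states, stay_prob, flip_prob, rng):
    """Sample one (z, x) path from a lazy latent chain with noisy emissions.

    z_i stays at z_{i-1} with probability `stay_prob`; the observation x_i
    equals z_i except for an independent corruption (playing the role of eps_i).
    """
    z = [rng.randrange(num_states)]
    for _ in range(d - 1):
        z.append(z[-1] if rng.random() < stay_prob else rng.randrange(num_states))
    x = [zi if rng.random() > flip_prob else rng.randrange(num_states) for zi in z]
    return z, x

def prop1_bound(d, num_states, p):
    """The bound p * d * log(|Z| / p) from Proposition 1 (natural log)."""
    return p * d * math.log(num_states / p)

rng = random.Random(0)
d, num_states = 1000, 4
p = 2.0 / d                      # infrequent topic switches: p = Theta(1/d)
z, x = sample_hmm(d, num_states, stay_prob=1 - p, flip_prob=0.1, rng=rng)
switches = sum(zi != zj for zi, zj in zip(z[1:], z[:-1]))
bound = prop1_bound(d, num_states, p)
print(f"latent switches: {switches}, bound on B(q_data): {bound:.2f}")
```

With $p = \Theta(1/d)$ the bound grows only logarithmically in $d$, while the ambient quantity $d\log S$ grows linearly, mirroring the discussion that follows Proposition 1.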
Such models are widely used in natural language processing and pattern recognition (Gales and Young, 2024; Mor et al., 2021). In language modeling, for instance, the hidden states $z_i$ may encode the semantic topic or grammatical structure of the $i$-th token or word, while the observed variables $x_i$ represent the realized words or tokens. Formally, let $\{z_i\}_{i \in [d]}$ be a discrete-state Markov chain supported on $\mathcal{Z}$, and let $\{x_i\}_{i \in [d]}$ be observations generated according to $x_i = f_i(z_i, \varepsilon_i)$, where $\{\varepsilon_i\}_{i \in [d]}$ are i.i.d. noise variables independent of $\{z_i\}_{i \in [d]}$. When $z_i$ represents the semantic topic of the $i$-th paragraph in a document, it is natural to assume that topic transitions happen only infrequently; that is, $z_i = z_{i-1}$ with high probability for $i > 1$. Under this model, we establish the following proposition, whose proof is deferred to Section A.2.1.

Proposition 1. Consider the HMM described above. Suppose the transition probability of $\{z_i\}_{i \in [d]}$ satisfies $\Pr(z_i \neq z_{i-1}) \leq p$ for all $i \in \{2, \ldots, d\}$. Assume that $1/d \lesssim p \ll 1$. Then
$$B(q_{\mathrm{data}}) \leq pd \log\Big(\frac{|\mathcal{Z}|}{p}\Big). \tag{23}$$

To develop some intuition, consider generating a document with a constant number of paragraphs, where the transition probability scales as $p = \Theta(1/d)$. Suppose further that the latent space satisfies $\mathcal{Z} \subseteq [S]^k$ for some $k \ll d$, where $S$ denotes the vocabulary size. Then the above bound yields $B(q_{\mathrm{data}}) \lesssim k \log(Sd)$, which is substantially smaller than the ambient dimension $d \log(S)$. As such, with Theorem 3, the sampling complexity scales with the intrinsic topic dimension $k$ rather than the document length $d$.

Low-dimensional structures.
In image generation and other structured data settings, it is commonly assumed that the data lie on or near a low-dimensional manifold embedded in a high-dimensional ambient space, which is often referred to as the manifold hypothesis (Gorban and Tyukin, 2018; Pope et al., 2021). For example, natural images may be viewed as points on a manifold parameterized by a small number of underlying factors, such as lighting conditions, pose, and object identity.

In discrete settings, the notion of a manifold is not mathematically well defined. To capture low-dimensional structure, we instead model the data as arising from a continuous mapping from a latent representation into a high-dimensional observation space. For some latent continuous random variable $z$ supported on $\mathcal{Z} \subset \mathbb{R}^k$, consider a decoding procedure $f : [0,1]^k \to \mathbb{R}^d$ with
$$x_{\mathrm{con}} = f(z) + \varepsilon_{\mathrm{noise}},$$
for additive perturbations $\varepsilon_{\mathrm{noise}}$. Thus, the data lie close to the manifold $\{f(z) : z \in \mathcal{Z}\}$. The final discrete observation is obtained via a quantization operator $Q_S$, i.e., $x = Q_S(x_{\mathrm{con}}) \sim q_{\mathrm{data}}$.

To align the model with standard image processing pipelines, we work with the uniform lattice quantization function $Q_S : \mathbb{R}^d \to [S]^d$ defined coordinate-wise as $[Q_S(x)]_i = \mathrm{clip}(\lfloor x_i \rfloor, 0, S)$ for $i \in [d]$, where $\mathrm{clip}(x, a, b) := \min\{\max\{x, a\}, b\}$ is the clip function and $\lfloor \cdot \rfloor$ is the floor function. To ensure regularity of both the manifold and the induced data distribution, we focus on the case where $\mathcal{Z}$ is a compact set and $f$ is a Lipschitz function. The noise $\varepsilon_{\mathrm{noise}}$ is taken to be Gaussian for simplicity of analysis; the arguments extend readily to more general smooth noise distributions.

Proposition 2. Let $\mathcal{Z} \subset \mathbb{R}^k$ be compact with diameter $D$, and let $f : \mathcal{Z} \to \mathbb{R}^d$ be $L$-Lipschitz. Assume the noise satisfies $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, generated independently for each observation.
Then the resulting distribution satisfies
$$B(q_{\mathrm{data}}) \leq k \log\Big(2 + \frac{2DL}{\sigma}\Big). \tag{24}$$

In image generation, the “ideal image” $x_{\mathrm{con}}$ may be interpreted as the vector of continuous pixel intensities prior to quantization, while the observed image $x$ is obtained by applying pixel-wise quantization to $x_{\mathrm{con}}$. When $k \ll d$, the above bound yields $B(q_{\mathrm{data}}) = \widetilde{O}(k) = o(d)$, and hence we can efficiently sample such images despite the high dimensionality of the observation space.

Random graph models. Discrete diffusion models have also found applications in scientific domains such as molecular generation and protein design, where data are naturally represented as random graphs with fixed vertex sets and random edges (Ingraham et al., 2019; Xu et al., 2022). To make this concrete, we consider two widely studied random graph models on $n$ vertices, which can be viewed as discrete distributions over adjacency matrices of dimension $n^2$.

• Regular graphs: A $k$-regular graph is a graph in which each vertex has degree exactly $k$. Suppose we want to sample a random graph $G$ from some distribution supported on the set of $k$-regular graphs with $n$ vertices.

Proposition 3. For the sparse regular graph model, i.e., $k \leq n/\log(n)$, we have
$$B(G) \lesssim kn \log\Big(\frac{n}{k}\Big) = o(n^2). \tag{25}$$

• Stochastic block models: A stochastic block model (SBM) is a generative model for random graphs that captures community structure within networks. In an SBM, the $n$ vertices of the graph are partitioned into $r$ distinct communities or blocks, represented by latent variables $\{z_i\}_{i \in [n]}$ taking values in $[r]$. Conditioned on the latent labels, edges are generated independently. For two vertices $i, j \in [n]$, an edge is created with probability $p\,\mathbb{I}\{z_i = z_j\} + q\,\mathbb{I}\{z_i \neq z_j\}$, where $p, q \in [0,1]$ govern the within- and between-community connection probabilities, respectively.

Proposition 4.
Let $G$ be a random graph drawn from the above $r$-block SBM. Then $B(G) \leq n \log(r) = o(n^2)$.

For both random graph models, as the number of vertices $n$ grows large, the complexity satisfies $B(G) = o(n^2)$, which is strictly smaller than the ambient dimension $n^2$. This indicates that diffusion-based methods can sample efficiently from such graph distributions.

In fact, the analyses of Propositions 2 and 4 extend naturally to generalized random geometric graphs. Consider the following example. Let each vertex $i \in [n]$ be associated with a latent variable $z_i \in \mathcal{Z}$. For distinct vertices $i$ and $j$, an edge is placed independently with probability
$$\beta \exp\Big(-\frac{d(z_i, z_j)}{r_0}\Big),$$
where $\beta \in [0,1]$, $r_0 > 0$, and $d(\cdot, \cdot)$ is an appropriate metric on the latent space $\mathcal{Z}$.

• When the latent variables $\{z_i\}$ are discrete with $o(n)$ entropy, as is the case, for example, when they take values in a fixed-dimensional latent space, the dual total correlation of the resulting random graph is $o(n^2)$.

• For continuous latent variables, suppose $\mathcal{Z} = \mathbb{S}^{d_z - 1}$, the unit sphere in $\mathbb{R}^{d_z}$. Under some regularity conditions, the dual total correlation scales with $d_z \cdot n$, by a covering number argument analogous to that of Proposition 2. In particular, whenever $d_z = o(n)$, the complexity is again subquadratic, leading to sublinear (in $n^2$) convergence rates for diffusion-based sampling.

Structure-with-noise distributions. A prototypical example of a distribution with small dual total correlation $B(q_0)$ and large total correlation $C(q_0)$ is the mixture of two Dirac measures:
$$p_{\mathrm{m}} := \tfrac{1}{2}\delta_{\mathbf{0}} + \tfrac{1}{2}\delta_{\mathbf{1}},$$
where $\mathbf{0}$ and $\mathbf{1}$ are the vectors of all-zeros and all-ones, respectively. It can be easily computed that $B(p_{\mathrm{m}}) = \log(2)$, whereas $C(p_{\mathrm{m}}) = (d-1)\log(2)$. The opposite happens, for instance, for the following XOR distribution $p_{\mathrm{XOR}}$:
$$x_1, \ldots, x_{d-1} \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(1/2) \quad \text{and} \quad x_d = \sum_{i=1}^{d-1} x_i \bmod 2.$$
In this case, $B(p_{\mathrm{XOR}}) = (d-1)\log(2)$ and $C(p_{\mathrm{XOR}}) = \log(2)$.

Real-world data distributions can combine features of both extremes: a strong low-dimensional signal corrupted by weakly correlated noise. In such cases, both $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ can be large, while $D(q_{\mathrm{data}})$ remains small. To illustrate this phenomenon, consider the following entrywise mixture of the two preceding examples.

1. Fix a bi-partition $[d] = I_0 \sqcup I_1$ for non-empty index sets $I_0$ and $I_1$;
2. For all indices $i \in I_0$, set $x_i = b$ for $b \sim \mathrm{Bern}(1/2)$;
3. Among all indices $i \in I_1$, sample all but one $x_i \sim \mathrm{Bern}(1/2)$ independently;
4. For the last index $i^\star$, set $x_{i^\star} = \big(b + \sum_{i \neq i^\star} x_i\big) \bmod 2$.

Denote this distribution by $p_{\mathrm{ex}}$, and let $x = (x_1, \ldots, x_d) \sim p_{\mathrm{ex}}$.

Proposition 5. Suppose that $\min\{|I_0|, |I_1|\}/d = \Theta(1)$. The distribution $p_{\mathrm{ex}}$ satisfies
$$B(p_{\mathrm{ex}}) = \Theta(d), \quad C(p_{\mathrm{ex}}) = \Theta(d), \quad \text{and} \quad D(p_{\mathrm{ex}}) = O(1). \tag{26}$$

By Proposition 5, $p_{\mathrm{ex}}$, which can be viewed as a non-trivial mixing of $p_{\mathrm{m}}$ and $p_{\mathrm{XOR}}$, satisfies $D(p_{\mathrm{ex}}) \ll \min\{B(p_{\mathrm{ex}}), C(p_{\mathrm{ex}})\}$. This example highlights the fundamental role of the effective total correlation in characterizing sampling efficiency.

A.2 Proofs of results in Section A.1

Variants of the following lemma will be used repeatedly throughout this section. We state it here for convenience and to streamline the proofs.

Lemma 1. Consider any $d$-dimensional discrete random variable $X$ and any random variable $W$ such that $X_i \perp\!\!\!\perp X_{-i} \mid W$ for any $i \in [d]$, where $X = (X_1, \ldots, X_d)$ and $X_{-i}$ is the $(d-1)$-dimensional marginal of $X$ with the $i$-th coordinate excluded. Then, $B(X) \leq I(X; W)$. If $W$ is discrete, we additionally have $B(X) \leq H(W)$.

Proof of Lemma 1.
We first notice that for any random variable $W$ such that $X_i \perp\!\!\!\perp X_{-i} \mid W$ for any $i \in [d]$, we have
$$H(X_i \mid X_{-i}) \geq H(X_i \mid X_{-i}, W) = H(X_i \mid W),$$
where the inequality holds since conditioning reduces entropy, and the equality follows from the conditional independence. Recalling the definition of $B(\cdot)$, we obtain
$$B(X) = H(X) - \sum_{i=1}^d H(X_i \mid X_{-i}) \leq H(X) - \sum_{i=1}^d H(X_i \mid W).$$
Using the conditional independence condition again, we have
$$H(X \mid W) = H\big((X_1, \ldots, X_d) \mid W\big) = \sum_{i=1}^d H(X_i \mid W),$$
which implies
$$B(X) \leq H(X) - H(X \mid W) = I(X; W) \overset{(a)}{=} H(W) - H(W \mid X) \overset{(b)}{\leq} H(W),$$
where (a) and (b) apply when $W$ is a discrete random variable.

A.2.1 Proof of Proposition 1

For the hidden Markov structure of $\{(x_i, z_i)\}_{i \in [d]}$, it holds that $x_i \perp\!\!\!\perp x_j \mid (z_i, z_j)$, since $\varepsilon_i \perp\!\!\!\perp \varepsilon_j \mid (z_i, z_j)$. In view of Lemma 1 above, we can upper bound $B(q_{\mathrm{data}})$ by $H(z)$, the entropy of the latent Markov chain. By the chain rule of entropy and the Markov property, we have
$$B(q_{\mathrm{data}}) \leq H(z) = H(z_1) + \sum_{i=2}^d H\big(z_i \mid \{z_j\}_{j \in [i-1]}\big) = H(z_1) + \sum_{i=2}^d H(z_i \mid z_{i-1}).$$
When $\{z_i\}_{i \in [d]}$ is supported on a single point, we have $|\mathcal{Z}| = 1$ and $H(z) = 0$. When the state space satisfies $2 \leq |\mathcal{Z}| < \infty$, the maximum-entropy distribution is achieved when $z_1 \sim \mathrm{Unif}(\mathcal{Z})$ and $z_i \mid z_{i-1} \sim (1-p)\delta_{z_{i-1}} + p\,\mathrm{Unif}\big(\mathcal{Z} \setminus \{z_{i-1}\}\big)$. We obtain
$$H(z) \leq \log(|\mathcal{Z}|) + \sum_{i=2}^d \Big[-(1-p)\log(1-p) - (|\mathcal{Z}|-1)\cdot\frac{p}{|\mathcal{Z}|-1}\log\Big(\frac{p}{|\mathcal{Z}|-1}\Big)\Big] \overset{(a)}{\leq} \log(|\mathcal{Z}|) + (d-1)\Big(2p + p\log\Big(\frac{|\mathcal{Z}|}{p}\Big)\Big) \overset{(b)}{\leq} pd\log\Big(\frac{|\mathcal{Z}|}{p}\Big),$$
where in (a), we use $-\log(1-p) \leq 2p$, since $p \ll 1$; in (b), we use the condition $p \gtrsim 1/d$ and $|\mathcal{Z}|/p \geq 2/p \gg 1$. This completes the proof of the desired result.

A.2.2 Proof of Proposition 2

Write $\varepsilon_{\mathrm{noise}} = (\varepsilon^1_{\mathrm{noise}}, \ldots, \varepsilon^d_{\mathrm{noise}})$. Since $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, we have $\varepsilon^i_{\mathrm{noise}} \perp\!\!\!\perp \varepsilon^{-i}_{\mathrm{noise}}$ for any $i \in [d]$.
Processing through the decoder $f$, we have $[x_{\mathrm{con}}]_i = [f(z)]_i + \varepsilon^i_{\mathrm{noise}}$ for any $i \in [d]$, which leads to
$$[x_{\mathrm{con}}]_i \perp\!\!\!\perp [x_{\mathrm{con}}]_{-i} \mid z, \tag{27}$$
where $[x_{\mathrm{con}}]_{-i}$ is the $(d-1)$-dimensional marginal of $x_{\mathrm{con}}$ with the $i$-th coordinate excluded. Note that $Q_S$ is an entry-wise quantization, i.e., we can write $Q_S(x) = (\widetilde{Q}_S(x_1), \ldots, \widetilde{Q}_S(x_d))$ for an entry-wise deterministic quantization function $\widetilde{Q}_S : \mathbb{R} \to [S]$, and $x_i = \widetilde{Q}_S([x_{\mathrm{con}}]_i)$ by the generation process. Eqn. (27) therefore implies that for any $i \in [d]$, $x_i \perp\!\!\!\perp x_{-i} \mid z$. Applying Lemma 1, we obtain
$$B(q_{\mathrm{data}}) = B(x) \leq I(x; z) \leq I(x_{\mathrm{con}}; z), \tag{28}$$
where the last inequality follows from the data processing inequality for mutual information. In the remainder of the proof, we proceed to control $I(x_{\mathrm{con}}; z)$. Since $\varepsilon_{\mathrm{noise}}$ is independent noise, using the data processing inequality, we reach
$$I(x_{\mathrm{con}}; z) \leq I\big(f(z) + \varepsilon_{\mathrm{noise}}; f(z)\big) = I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big). \tag{29}$$
Without loss of generality, we assume $\mathcal{Z} \subseteq [0, D]^k$. Partition $[0, D]^k$ into hypercubes of side length $h_J = \sigma/L$, and write this partition as $\{C_1, \ldots, C_{\lceil D/h_J \rceil^k}\}$ such that
$$[0, D]^k \subseteq \bigsqcup_{i=1}^{\lceil D/h_J \rceil^k} C_i.$$
Define $J = J(z)$ to be the hypercube index $i(z)$ such that $z \in C_{i(z)}$, and $\mathcal{F}_J$ to be the $\sigma$-algebra generated by $J(z)$. By the chain rule and data processing inequality for mutual information, we have
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq I\big(J(z), f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) = I\big(J(z); f(z) + \varepsilon_{\mathrm{noise}}\big) + I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big) \leq k\log\Big(1 + \frac{D}{h_J}\Big) + I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big), \tag{30}$$
where in the last line, we use $I(J(z); f(z) + \varepsilon_{\mathrm{noise}}) \leq H(J(z)) \leq \log(|\mathrm{supp}(J(z))|)$. To upper bound the second term above, we introduce the following lemma on the Gaussian channel, whose proof is given in Section F.1.

Lemma 2.
For any random variable $W \in \mathbb{R}^d$ and independent noise $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, we have
$$I(W; W + \varepsilon_{\mathrm{noise}}) \leq \frac{\mathrm{Tr}\big(\mathrm{Var}[W]\big)}{2\sigma^2},$$
where $\mathrm{Tr}(\cdot)$ is the trace function.

In Lemma 2, taking $W \overset{d}{=} f(z) \mid \mathcal{F}_J$, we arrive at
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big) \leq \frac{\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big)}{2\sigma^2}. \tag{31}$$
To further control the right-hand side, direct calculations show
$$\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big) = \sum_{i=1}^d \mathrm{Var}\big[[f(z)]_i \mid \mathcal{F}_J\big] = \mathbb{E}\Big[\big\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\big\|_2^2 \,\Big|\, \mathcal{F}_J\Big]. \tag{32}$$
It is therefore sufficient to consider the quantity $\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\|_2^2$. We make the observation that
$$\big\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\big\|_2 \overset{(a)}{\leq} \sup_{w \in \mathrm{Conv}(f(C_{J(z)}))} \|f(z) - w\|_2 \overset{(b)}{=} \sup_{w \in f(C_{J(z)})} \|f(z) - w\|_2 \overset{(c)}{\leq} \|f\|_{\mathrm{Lip}} \cdot \sup_{z' \in C_{J(z)}} \|z - z'\|_2 \overset{(d)}{\leq} L\sqrt{k}\,h_J,$$
where $\|\cdot\|_2$ denotes the Euclidean norm in $\mathbb{R}^d$, and $\mathrm{Conv}(\cdot)$ denotes the convex hull of a given set. In (a), we use the fact that $\mathbb{E}[f(z) \mid \mathcal{F}_J] \in \mathrm{Conv}(f(C_{J(z)}))$; in (b), we use the fact that $f$ is continuous, hence $f(C_{J(z)})$ is bounded, together with the property of the convex hull that $\mathrm{diam}\big(\mathrm{Conv}(A)\big) = \mathrm{diam}(A)$ for any bounded subset $A \subseteq \mathbb{R}^d$; in (c), we recall the Lipschitz condition on $f$; in (d), we notice that $\mathrm{diam}(C_i) \leq \sqrt{k}\,h_J$ for any hypercube $C_i$. Putting the pieces together gives
$$\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big) \leq \big(L\sqrt{k}\,h_J\big)^2 = k\sigma^2. \tag{33}$$
Finally, plugging Eqns. (31) and (33) into Eqn. (30), we obtain
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq k\log\Big(1 + \frac{DL}{\sigma}\Big) + \frac{k}{2} \leq k\log\Big(2 + \frac{2DL}{\sigma}\Big).$$
Combining the above inequality with Eqns. (28) and (29), we conclude
$$B(q_{\mathrm{data}}) \leq I(x_{\mathrm{con}}; z) = I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq k\log\Big(2 + \frac{2DL}{\sigma}\Big).$$

A.2.3 Proof of Proposition 3

Define the set of all $k$-regular graphs with $n$ vertices as $\mathcal{G}_{n,k}$. Without loss of generality, we assume that $nk$ is even, as otherwise $\mathcal{G}_{n,k}$ is empty.
By a corollary of Liebenau and Wormald (2024, Theorem 1.4), we have the following asymptotic result:
$$|\mathcal{G}_{n,k}| = \Theta\bigg(\binom{n-1}{k}^n \binom{n(n-1)/2}{m}^{-1}\bigg), \quad \text{where } m = kn/2.$$
By Stirling's formula of the form $\log(a!) = a\log(a) - a + O(\log(a))$, we can compute that
$$\log(|\mathcal{G}_{n,k}|) \lesssim n\log\binom{n-1}{k} - \log\binom{n(n-1)/2}{m} = \frac{kn}{2}\log\Big(\frac{n-1-k}{k}\Big) + \frac{n(n-1)}{2}\log\Big(\frac{n-1}{n-1-k}\Big) \leq \frac{kn}{2}\log\Big(\frac{n}{k}\Big) + \frac{n^2}{2}\log\Big(1 + \frac{k}{n-1-k}\Big) \leq \frac{kn}{2}\log\Big(\frac{n}{k}\Big) + \frac{kn^2}{2(n-1-k)} \lesssim kn\log\Big(\frac{n}{k}\Big),$$
where in the last line, we invoke the condition that $k \leq n/\log(n) \ll n-1-k$. Recalling the definition of $B(\cdot)$, we conclude
$$B(G) \leq H(G) \leq \log(|\mathcal{G}_{n,k}|) \lesssim kn\log\Big(\frac{n}{k}\Big) = o(n^2).$$

A.2.4 Proof of Proposition 4

By the definition of the $r$-block SBM, the latent variable vector $(z_1, \ldots, z_n)$ is supported on $[r]^n$, which satisfies
$$H\big((z_1, \ldots, z_n)\big) \leq \log\big(|[r]^n|\big) = n\log(r).$$
Given the latent variables $(z_1, \ldots, z_n)$, the block structure is fixed and hence each edge is sampled independently from a Bernoulli distribution. Therefore, we have $e_{ij} \perp\!\!\!\perp e_{k\ell} \mid \{z_i\}_{i \in [n]}$ for any $i, j, k, \ell \in [n]$, where $e_{ij}$ and $e_{k\ell}$ are the indicator variables of the existence of edges between vertices $i, j$ and between vertices $k, \ell$. By Lemma 1, we conclude
$$B(G) \leq H\big((z_1, \ldots, z_n)\big) \leq n\log(r) \leq n\log(n) = o(n^2),$$
where we use the convention that the number of blocks satisfies $r \leq n$.

Remark 2. The setting of Proposition 4 can be viewed as a special case of the generalized random geometric graph model, in which the latent variable corresponds to the block index. More generally, the same conclusion holds under analogous assumptions, with essentially the same proof strategy.

A.2.5 Proof of Proposition 5

Let $r := |I_0|/d$ be the proportion of coordinates in $I_0$.
Throughout, we assume $\min\{r, 1-r\} = \Theta(1)$.

Step 1: Establish $B(p_{\mathrm{ex}}) = \Theta(d)$ and $C(p_{\mathrm{ex}}) = \Theta(d)$. For a random variable $x \sim p_{\mathrm{ex}}$, we shall demonstrate that
$$\sum_{i=1}^d H(x_i) = d\log(2), \quad \log(2)(|I_1| - 1) \leq H(x) \leq \log(2)|I_1|, \quad \text{and} \quad \sum_{i=1}^d H(x_i \mid x_{-i}) = 0. \tag{34}$$
Towards this goal, we make the observation that for any $i \in I_0$ or $i \in I_1 \setminus \{i^\star\}$, $x_i \sim \mathrm{Bern}(1/2)$ and hence $H(x_i) = \log(2)$. For $i = i^\star$, we assert that $x_{i^\star} \sim \mathrm{Bern}(1/2)$ as well. In fact, we have
$$\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 0 \bmod 2\bigg) = \mathbb{P}\Big(\mathrm{Bin}\big(|I_1| - 1, \tfrac{1}{2}\big) \equiv 0 \bmod 2\Big) = \frac{1}{2},$$
where in the last equality, we invoke the following lemma.

Lemma 3. For any $n \in \mathbb{N}^+$ and $X \sim \mathrm{Bin}(n, 1/2)$, we have $\mathbb{P}(X \equiv 0 \bmod 2) = \mathbb{P}(X \equiv 1 \bmod 2) = \frac{1}{2}$.

As a result, the distribution of $x_{i^\star}$ satisfies
$$\mathbb{P}(x_{i^\star} = 0) = \mathbb{P}(b = 0)\cdot\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 0 \bmod 2\bigg) + \mathbb{P}(b = 1)\cdot\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 1 \bmod 2\bigg) = \frac{1}{2},$$
which reveals that $x_{i^\star} \sim \mathrm{Bern}(1/2)$ and hence $H(x_{i^\star}) = \log(2)$. In conclusion, we obtain
$$\sum_{i=1}^d H(x_i) = \sum_{i \in [d] \setminus \{i^\star\}} H(x_i) + H(x_{i^\star}) = d\log(2). \tag{35}$$
To upper bound $H(x)$, we invoke the elementary bound of entropy by support size to get
$$H(x) \leq \log(|\mathrm{supp}(x)|) \leq \log\big(2 \cdot 2^{|I_1|-1}\big) = \log(2)|I_1|. \tag{36}$$
The lower bound can be obtained through
$$H(x) \geq H\big(\{x_i\}_{i \in I_1 \setminus \{i^\star\}}\big) = \log\big(2^{|I_1|-1}\big) = \log(2)(|I_1| - 1). \tag{37}$$
For any $i \in [d]$, when $x_{-i}$ is given, we can recover $x_i$ by first observing the value of $b$ from $x_j$ for any $j \in I_0$, then applying the formula
$$x_i = \Big(b + \sum_{k \in I_1 \setminus \{i\}} x_k\,\mathbb{I}\{i \in I_1\}\Big) \bmod 2.$$
Thus, $x_i \mid x_{-i}$ is always a Dirac measure, which leads to
$$\sum_{i=1}^d H(x_i \mid x_{-i}) = 0. \tag{38}$$
Combining Eqns. (35), (36), (37) and (38) proves Eqn. (34). Equipped with Eqn. (34), we are ready to bound $B(p_{\mathrm{ex}})$ and $C(p_{\mathrm{ex}})$.
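As a quick numerical sanity check (separate from the proof itself), both Lemma 3 and the three identities in Eqn. (34) can be confirmed by brute-force enumeration for a small instance. The sketch below uses a hypothetical partition $I_0 = \{0,1\}$, $I_1 = \{2,3,4\}$ with $d = 5$ and natural-log entropies; all function names are our own.

```python
import math
from itertools import product
from collections import Counter

def pmf_pex(I0, I1):
    """Enumerate the support of p_ex for the entrywise-mixture construction."""
    d = len(I0) + len(I1)
    i_star = I1[-1]                      # the parity coordinate i* in I_1
    free = [i for i in I1 if i != i_star]
    counts = Counter()
    for b in (0, 1):
        for bits in product((0, 1), repeat=len(free)):
            x = [0] * d
            for i in I0:
                x[i] = b
            for i, v in zip(free, bits):
                x[i] = v
            x[i_star] = (b + sum(x[i] for i in range(d) if i != i_star)) % 2
            counts[tuple(x)] += 1
    total = sum(counts.values())
    return {xx: c / total for xx, c in counts.items()}

def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(pmf, idx):
    out = Counter()
    for xx, p in pmf.items():
        out[tuple(xx[i] for i in idx)] += p
    return dict(out)

I0, I1 = [0, 1], [2, 3, 4]
d = 5
pmf = pmf_pex(I0, I1)
H_joint = entropy(pmf)                                    # H(x)
H_marg = sum(entropy(marginal(pmf, [i])) for i in range(d))  # sum_i H(x_i)
# H(x_i | x_{-i}) = H(x) - H(x_{-i})
H_cond = sum(H_joint - entropy(marginal(pmf, [j for j in range(d) if j != i]))
             for i in range(d))
print(H_joint / math.log(2), H_marg / math.log(2), H_cond)

# Lemma 3: P(Bin(n, 1/2) even) = 1/2, i.e. the even binomials sum to 2^(n-1)
assert all(sum(math.comb(n, k) for k in range(0, n + 1, 2)) == 2 ** (n - 1)
           for n in range(1, 9))
```

For this instance the enumeration gives $H(x) = 3\log 2 = \log(2)|I_1|$, $\sum_i H(x_i) = 5\log 2 = d\log 2$, and $\sum_i H(x_i \mid x_{-i}) = 0$, matching Eqn. (34).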
It can be easily seen that
$$B(p_{\mathrm{ex}}) = H(x) - \sum_{i=1}^d H(x_i \mid x_{-i}) \geq \log(2)\big((1-r)d - 1\big) = \Omega(d),$$
$$C(p_{\mathrm{ex}}) = \sum_{i=1}^d H(x_i) - H(x) \geq \log(2)(d - |I_1|) = \log(2)\,rd = \Omega(d).$$
For the reverse direction, we can prove the matching upper bounds similarly, which leads to $B(p_{\mathrm{ex}}) = \Theta(d)$ and $C(p_{\mathrm{ex}}) = \Theta(d)$.

Step 2: Show $D(p_{\mathrm{ex}}) = O(1)$. Recall the definition of $D(\cdot)$ in Eqn. (16):
$$D(p_{\mathrm{ex}}) := \int_0^\infty \min(1, t)\,\mathcal{I}(t)\,\mathrm{d}t \quad \text{with} \quad \mathcal{I}(t) := \sum_{i \neq j \in [d]} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) \geq 0.$$
To upper bound $D(p_{\mathrm{ex}})$, let us write
$$D(p_{\mathrm{ex}}) = \int_0^{1/d} t\,\mathcal{I}(t)\,\mathrm{d}t + \int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + \int_{\log(d)}^\infty \mathcal{I}(t)\,\mathrm{d}t.$$
By direct calculations, one has
$$\int_0^{1/d} t\,\mathcal{I}(t)\,\mathrm{d}t \leq \frac{1}{d}\int_0^{1/d} \mathcal{I}(t)\,\mathrm{d}t \leq \frac{B(p_{\mathrm{ex}})}{d} = \Theta(1), \qquad \int_{\log(d)}^\infty \mathcal{I}(t)\,\mathrm{d}t \leq \frac{1}{d-1}\int_{\log(d)}^\infty (e^t - 1)\,\mathcal{I}(t)\,\mathrm{d}t \leq \frac{C(p_{\mathrm{ex}})}{d-1} = \Theta(1).$$
Therefore, it obeys
$$D(p_{\mathrm{ex}}) = \int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + O(1).$$
To prove $D(p_{\mathrm{ex}}) = O(1)$, it suffices to show that
$$\int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{39}$$
In view of the definition of $\mathcal{I}(t)$, we can decompose it as
$$\mathcal{I}(t) = \bigg(\sum_{i,j \in I_0, i \neq j} + \sum_{i,j \in I_1, i \neq j} + \sum_{i \in I_0, j \in I_1} + \sum_{i \in I_1, j \in I_0}\bigg) I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) =: \mathcal{I}_1(t) + \mathcal{I}_2(t) + \mathcal{I}_3(t) + \mathcal{I}_4(t),$$
and we shall bound these four terms separately. Before diving into the proofs, we make the observation that the mutual information can be computed via
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H\big(x_t^i \mid x_t^{-(i,j)}\big) - H\big(x_t^i \mid x_t^{-i}\big). \tag{40}$$
To further compute each entropy term, let us introduce two quantities below:
$$H_t^1 = H\big(e^{-t}\delta_0 + (1 - e^{-t})\delta_{\mathrm{MASK}}\big) = H\big(e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}}\big) = te^{-t} - \log(1 - e^{-t})(1 - e^{-t}), \tag{41a}$$
$$H_t^2 = H\Big(\tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}}\Big) = (t + \log(2))e^{-t} - \log(1 - e^{-t})(1 - e^{-t}).
(41b)$$

We shall relate our quantities of interest to these terms below.

Case 1: $i, j \in I_0$, $i \neq j$. For any given $x_t^{-(i,j)}$, it always holds true that $\mathbb{P}(x_t^i = \mathrm{MASK}) = 1 - e^{-t}$, since the noising process is time-homogeneous and independent between coordinates. Recall the definition $m(x) = \{i \in [d] : x^i = \mathrm{MASK}\}$. Define the event $\mathcal{E}^{i,j}_{t,1} \in \mathcal{F}^{-(i,j)}_t$, where $\mathcal{F}^{-(i,j)}_t$ is the $\sigma$-algebra generated by $x_t^{-(i,j)}$, as follows:
$$\mathcal{E}^{i,j}_{t,1} := \bigg\{x_t^{-(i,j)} : \bigg(\bigvee_{k \in I_0 \setminus \{i,j\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\},$$
where $\wedge$ is the logical operator AND, and $\vee$ is the logical operator OR. By the construction of $p_{\mathrm{ex}}$, it can be checked that
$$\big(x_t^i \mid x_t^{-(i,j)} \in \mathcal{E}^{i,j}_{t,1}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-(i,j)} \in (\mathcal{E}^{i,j}_{t,1})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
where $\delta_{0/1}$ represents either $\delta_0$ or $\delta_1$. Therefore, by the definition of conditional entropy, we have
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^{i,j}_{t,1}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^{i,j}_{t,1})\big). \tag{42}$$
Define the event $\mathcal{E}^i_{t,1} \in \mathcal{F}^{-i}_t$, where $\mathcal{F}^{-i}_t$ is the $\sigma$-algebra generated by $x_t^{-i}$, as follows:
$$\mathcal{E}^i_{t,1} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}.$$
Then, it can be checked similarly that
$$\big(x_t^i \mid x_t^{-i} \in \mathcal{E}^i_{t,1}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-i} \in (\mathcal{E}^i_{t,1})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
which leads to
$$H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,1}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,1})\big). \tag{43}$$
Plugging Eqns. (42) and (43) into Eqn.
(40) gives that for any $i, j \in I_0$, $i \neq j$,
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H\big(x_t^i \mid x_t^{-(i,j)}\big) - H\big(x_t^i \mid x_t^{-i}\big) = \big(H_t^2 - H_t^1\big)\big(\mathbb{P}(\mathcal{E}^i_{t,1}) - \mathbb{P}(\mathcal{E}^{i,j}_{t,1})\big) = \log(2)\,e^{-2t}\big(1 - e^{-t}\big)^{|I_0|-2}\Big(1 - e^{-|I_1|t}\Big) = O\Big(e^{-2t}(1 - e^{-t})^{rd/2}\Big),$$
whose value is independent of the indices $i$ and $j$. Since $|\{i, j \in I_0 : i \neq j\}| = rd(rd - 1) = \Theta(d^2)$, the quantity $\mathcal{I}_1(t)$ satisfies
$$\mathcal{I}_1(t) = \sum_{i,j \in I_0, i \neq j} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\Big(d^2 e^{-2t}(1 - e^{-t})^{rd/2}\Big). \tag{44}$$

Case 2: $i, j \in I_1$, $i \neq j$. Following the proof strategy of Case 1, for any given $x_t^{-(i,j)}$, it holds that
$$x_t^i \mid x_t^{-(i,j)} \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
which implies that
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^2. \tag{45}$$
Define the event $\mathcal{E}^i_{t,2} \in \mathcal{F}^{-i}_t$ as follows:
$$\mathcal{E}^i_{t,2} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0} \{k \notin m(x_t)\}\bigg) \wedge \bigg(\bigwedge_{\ell \in I_1 \setminus \{i\}} \{\ell \in m(x_t)\}\bigg) = 1\bigg\},$$
which induces
$$\big(x_t^i \mid x_t^{-i} \in \mathcal{E}^i_{t,2}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-i} \in (\mathcal{E}^i_{t,2})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
and the conditional entropy formula
$$H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,2}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,2})\big). \tag{46}$$
Plugging Eqns. (45) and (46) into Eqn. (40) gives that for any $i, j \in I_1$, $i \neq j$,
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H_t^2 - H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,2}) - H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,2})\big) = \big(H_t^2 - H_t^1\big)\mathbb{P}(\mathcal{E}^i_{t,2}) = O\Big(e^{-(1-r)dt}\Big),$$
whose value is, again, independent of the indices $i$ and $j$. Since $|\{i, j \in I_1 : i \neq j\}| = (1-r)d\big((1-r)d - 1\big) = \Theta(d^2)$, we reach
$$\mathcal{I}_2(t) = \sum_{i,j \in I_1, i \neq j} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\big(d^2 e^{-(1-r)dt}\big). \tag{47}$$

Case 3: $i \in I_0$, $j \in I_1$.
Define the function $H_B(p) := -p\log(p) - (1-p)\log(1-p)$ to be the entropy of the distribution $\mathrm{Bern}(p)$. Following the proofs of the two cases above, let us define the events
$$\mathcal{E}^{i,j}_{t,3} := \bigg\{x_t^{-(i,j)} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1 \setminus \{j\}} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}; \qquad \mathcal{E}^i_{t,3} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}.$$
Similar calculations yield
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^{i,j}_{t,3}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^{i,j}_{t,3})\big); \qquad H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,3}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,3})\big).$$
Therefore, we obtain
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = \big(H_t^2 - H_t^1\big)\big(\mathbb{P}(\mathcal{E}^i_{t,3}) - \mathbb{P}(\mathcal{E}^{i,j}_{t,3})\big) = \log(2)\,e^{-|I_1|t}\big(1 - e^{-t}\big)^{|I_0|} = O\big(e^{-H_B(r)d}\big),$$
where the last equality is due to the fact that $e^{-|I_1|t}(1 - e^{-t})^{|I_0|}$ is maximized at $t = -\log(1-r)$. Finally, with $|\{i \in I_0, j \in I_1\}| = r(1-r)d^2$, we can bound
$$\mathcal{I}_3(t) = \sum_{i \in I_0, j \in I_1} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\big(d^2 e^{-H_B(r)d}\big). \tag{48}$$

Case 4: $i \in I_1$, $j \in I_0$. Notice that $\mathcal{I}_3(t)$ and $\mathcal{I}_4(t)$ are invariant under swapping $i$ and $j$. We can show in the same way as above that
$$\mathcal{I}_4(t) = O\big(d^2 e^{-H_B(r)d}\big). \tag{49}$$

Putting everything together. Combining Eqns. (44), (47), (48) and (49), we arrive at
$$\mathcal{I}(t) \lesssim d^2\Big(e^{-2t}(1 - e^{-t})^{rd/2} + e^{-(1-r)dt} + e^{-H_B(r)d}\Big). \tag{50}$$
We are now in a position to prove Eqn. (39). Let us begin with the integration over the time interval $t \in [1/d, 1]$. Direct calculation yields that $e^{-2t}(1 - e^{-t})^{rd/2}$ is maximized at $t^\star = \log(1 + \frac{rd}{4}) > 1$, which reveals that
$$d^2 e^{-2t}(1 - e^{-t})^{rd/2} \leq d^2 e^{-2t^\star} = d^2\Big(1 + \frac{rd}{4}\Big)^{-2} = O(1).
\tag{51}$$
For the term from $\mathcal{I}_2(t)$, we obtain
$$\int_{1/d}^1 t \cdot d^2 e^{-(1-r)dt}\,\mathrm{d}t \overset{(a)}{=} \int_1^d s e^{-(1-r)s}\,\mathrm{d}s \leq \int_0^\infty s e^{-(1-r)s}\,\mathrm{d}s = \frac{1}{(1-r)^2} = O(1), \tag{52}$$
where in (a), we use the change of variables $s = dt$. Similarly, we can show that
$$\int_{1/d}^1 t \cdot d^2 e^{-H_B(r)d}\,\mathrm{d}t \leq \int_0^1 t \cdot d^2 e^{-H_B(r)d}\,\mathrm{d}t = \frac{1}{2}d^2 e^{-H_B(r)d} = O(1), \tag{53}$$
where the condition $\min\{r, 1-r\} = \Theta(1)$ ensures $H_B(r) = \Theta(1)$. Taking Eqns. (50), (51), (52) and (53) collectively, we arrive at
$$\int_{1/d}^1 \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_{1/d}^1 t\,\mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{54}$$
Let us move on to the integration over the time interval $t \in [1, \log(d)]$. The integral computation yields that
$$\int_1^{\log(d)} d^2 e^{-2t}(1 - e^{-t})^{rd/2}\,\mathrm{d}t \overset{(a)}{=} \int_1^{d/e} s\Big(1 - \frac{s}{d}\Big)^{rd/2}\,\mathrm{d}s \overset{(b)}{\leq} \int_0^\infty s e^{-rs/2}\,\mathrm{d}s = O(1), \tag{55}$$
where in (a), we use the change of variables $s = de^{-t}$, and in (b), we use the inequality $(1-x)^{1/x} \leq e^{-1}$ for $x \in (0, 1]$. For the remaining terms, we have
$$\int_1^{\log(d)} d^2 e^{-(1-r)dt}\,\mathrm{d}t \leq \int_0^{\log(d)} d^2 e^{-(1-r)d}\,\mathrm{d}t = d^2\log(d)\,e^{-(1-r)d} = O(1); \tag{56}$$
$$\int_1^{\log(d)} d^2 e^{-H_B(r)d}\,\mathrm{d}t \leq d^2\log(d)\,e^{-H_B(r)d} = O(1), \tag{57}$$
where the condition $\min\{r, 1-r\} = \Theta(1)$ ensures $H_B(r) = \Theta(1)$. Now, combining Eqns. (50), (55), (56) and (57) yields
$$\int_1^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_1^{\log(d)} \mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{58}$$
Finally, equipped with Eqns. (54) and (58), we conclude
$$\int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_{1/d}^1 \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + \int_1^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = O(1),$$
which proves $D(p_{\mathrm{ex}}) = O(1)$.

B Technical preparations

B.1 Score functions

Below, we present an equivalent formulation of the score functions.

Proposition 6. Let $q_0$ be an initial distribution on $\mathcal{X}_0$. Let $x, y \in \mathcal{X}$ be such that $Q(y, x) > 0$. Then,

1.
for the uniform noising process,
$$s_t(y, x) = \frac{\mathbb{E}_{x_0 \sim q_0}\,\alpha_t^{d_H(y, x_0)}}{\mathbb{E}_{x_0 \sim q_0}\,\alpha_t^{d_H(x, x_0)}}, \tag{59}$$
where $\alpha_t := \frac{1 - e^{-t}}{1 + (S-1)e^{-t}}$.

2. for the masking noising process,
$$s_t(y, x) = \frac{1}{e^t - 1}\cdot\frac{q_0(y)}{q_0(x)}, \tag{60}$$
where for $x \in \mathcal{X} \setminus \mathcal{X}_0$, $q_0(x)$ is the marginal probability of the unmasked coordinates of $x$ under $q_0$.

Proof of Proposition 6. By the definition of the score function, one can write
$$s_t(y, x) = \frac{q_t(y)}{q_t(x)} = \frac{\sum_{x_0} q_{t|0}(y \mid x_0)\,q_0(x_0)}{\sum_{x_0} q_{t|0}(x \mid x_0)\,q_0(x_0)}.$$
For the uniform noising process, one can solve the Kolmogorov forward equation for every dimension. As a result, the transition can be written as
$$q_{t|0}(y \mid x_0) = \Big(\frac{1 - e^{-t}}{S}\Big)^{d_H(y, x_0)}\Big(\frac{1 + (S-1)e^{-t}}{S}\Big)^{d - d_H(y, x_0)} = \Big(\frac{1 + (S-1)e^{-t}}{S}\Big)^d \alpha_t^{d_H(y, x_0)},$$
which proves Eqn. (59). More details of this relation can be found in, e.g., Zhang et al. (2025, Proposition 1). For the masking noising process, for notational convenience, given any $x \in ([S] \cup \{\mathrm{MASK}\})^d$, define
$$m(x) := \{i \in [d] : x^i = \mathrm{MASK}\}. \tag{61}$$
In view of this piece of notation, as $\Pr(x_t^i = \mathrm{MASK}) = 1 - e^{-t}$ and the coordinates evolve independently, one can write
$$q_{t|0}(y \mid x_0) = (1 - e^{-t})^{|m(y)|}\,e^{-t(d - |m(y)|)}\,\mathbb{I}\{\text{for all } i \in [d],\ y^i \in \{x_0^i, \mathrm{MASK}\}\}.$$
As $Q(y, x) > 0$, it must be that $d_H(x, y) = 1$, and for the $i$ such that $x^i \neq y^i$, we have $x^i = \mathrm{MASK}$ and $y^i \neq \mathrm{MASK}$. This implies that $|m(x)| = |m(y)| + 1$, and we can write
$$\frac{\sum_{x_0} q_{t|0}(y \mid x_0)\,q_0(x_0)}{\sum_{x_0} q_{t|0}(x \mid x_0)\,q_0(x_0)} = \frac{e^{-t}}{1 - e^{-t}}\cdot\frac{\sum_{x_0} q_0(x_0)\,\mathbb{I}\{\text{for all } i \in [d],\ y^i \in \{x_0^i, \mathrm{MASK}\}\}}{\sum_{x_0} q_0(x_0)\,\mathbb{I}\{\text{for all } i \in [d],\ x^i \in \{x_0^i, \mathrm{MASK}\}\}} = \frac{1}{e^t - 1}\cdot\frac{q_0(y)}{q_0(x)}.$$

B.2 Technical lemmas

Lemma 4 (Chain rule of KL divergence).
For $N > 0$, let $a_{0:N}$ and $b_{0:N}$ be the joint distributions of two Markov processes. Then,
$$\mathsf{KL}(a_{0:N} \,\|\, b_{0:N}) = \mathsf{KL}(a_0 \,\|\, b_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x \sim a_k}\, \mathsf{KL}\big(a_{k+1|k}(\cdot \mid x) \,\|\, b_{k+1|k}(\cdot \mid x)\big).$$

Proof of Lemma 4. Invoking the definition of the KL divergence, direct calculation yields
$$\begin{aligned}
\mathsf{KL}(a_{0:N} \,\|\, b_{0:N}) &= \mathbb{E}_{x_{0:N} \sim a_{0:N}} \log \frac{a_{0:N}(x_{0:N})}{b_{0:N}(x_{0:N})} = \mathbb{E}_{x_{0:N} \sim a_{0:N}} \log\bigg( \frac{a_0(x_0)}{b_0(x_0)} \prod_{k=0}^{N-1} \frac{a_{k+1|k}(x_{k+1} \mid x_k)}{b_{k+1|k}(x_{k+1} \mid x_k)} \bigg) \\
&= \mathbb{E}_{x_0 \sim a_0} \log \frac{a_0(x_0)}{b_0(x_0)} + \sum_{k=0}^{N-1} \mathbb{E}_{x_k \sim a_k}\, \mathbb{E}_{x_{k+1} \sim a_{k+1|k}(\cdot \mid x_k)} \log \frac{a_{k+1|k}(x_{k+1} \mid x_k)}{b_{k+1|k}(x_{k+1} \mid x_k)} \\
&= \mathsf{KL}(a_0 \,\|\, b_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_k \sim a_k}\, \mathsf{KL}\big(a_{k+1|k}(\cdot \mid x_k) \,\|\, b_{k+1|k}(\cdot \mid x_k)\big).
\end{aligned}$$

Lemma 5 (Derivative of KL: an upper bound). Let $(q_t)$ and $(p_t)$ be the marginals of CTMCs with rate matrices $(Q_t)$ and $(\widehat{Q}_t)$, respectively, and let $\overleftarrow{q}_t \equiv q_{T-t}$ denote the marginals of the reverse process. Then, for any $t > t_k$ and any $z$,
$$\frac{\partial}{\partial t}\, \mathsf{KL}\big(\overleftarrow{q}_{t|t_k}(\cdot \mid z) \,\|\, p_{t|t_k}(\cdot \mid z)\big) \le \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}(\cdot \mid z)} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg].$$

Proof of Lemma 5. Let us omit the conditioning on $z$ for notational brevity. By direct calculation, one can write
$$A := \frac{\partial}{\partial t}\, \mathsf{KL}\big(\overleftarrow{q}_{t|t_k} \,\|\, p_{t|t_k}\big) = \sum_{x \in \mathcal{X}} \Big( \frac{\partial}{\partial t} \overleftarrow{q}_{t|t_k}(x) \Big) \log \frac{\overleftarrow{q}_{t|t_k}(x)}{p_{t|t_k}(x)} - \sum_{x \in \mathcal{X}} \overleftarrow{q}_{t|t_k}(x)\, \frac{\frac{\partial}{\partial t} p_{t|t_k}(x)}{p_{t|t_k}(x)}.$$
Recall the Kolmogorov equations:
$$\frac{\partial}{\partial t} \overleftarrow{q}_{t|t_k}(x) = \sum_{y \in \mathcal{X}} \overleftarrow{Q}_t(y, x)\, \overleftarrow{q}_{t|t_k}(y) \qquad \text{and} \qquad \frac{\partial}{\partial t} p_{t|t_k}(x) = \sum_{y \in \mathcal{X}} \widehat{Q}_t(y, x)\, p_{t|t_k}(y).$$
Putting the above together, we obtain (after relabeling $x$ and $y$)
$$\begin{aligned}
A &= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \in \mathcal{X}} \bigg[ \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(y)}{p_{t|t_k}(y)} - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \bigg\{ \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(y)}{p_{t|t_k}(y)} - \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(x)}{p_{t|t_k}(x)} - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] - \widehat{Q}_t(x, x) \bigg\} \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \bigg\{ \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \bigg( \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg) - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] - \widehat{Q}_t(x, x) \bigg\} \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \bigg( \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg) - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} + \widehat{Q}_t(x, y) \bigg],
\end{aligned} \tag{62}$$
where we invoke the property that $\overleftarrow{Q}_t(x, x) = -\sum_{y \ne x} \overleftarrow{Q}_t(x, y)$ and $\widehat{Q}_t(x, x) = -\sum_{y \ne x} \widehat{Q}_t(x, y)$.
Then, letting $C_{xy}$ be such that (recall that $z$ is fixed)
$$\frac{\overleftarrow{q}_{t|t_k}(y \mid z)}{\overleftarrow{q}_{t|t_k}(x \mid z)} = \overleftarrow{Q}_t(x, y)\, C_{xy}, \tag{63}$$
it satisfies that
$$A = \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} + \overleftarrow{Q}_t(x, y) \log\Big( \widehat{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \Big) - \widehat{Q}_t(x, y)\, \overleftarrow{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg]. \tag{64}$$
Finally, since $\log z \le z - 1$,
$$\begin{aligned}
A &\le \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} - \overleftarrow{Q}_t(x, y) + \overleftarrow{Q}_t(x, y)\, \widehat{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} - \widehat{Q}_t(x, y)\, \overleftarrow{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) - \overleftarrow{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} \bigg].
\end{aligned}$$

Lemma 6 (Itô's lemma for Poisson jump processes). For a Poisson jump process $\{x_t\}_{t \ge 0}$ with generator $\{\mathcal{L}_t\}_{t \ge 0}$ and rate matrix $\{R_t\}_{t \ge 0}$, Itô's formula can be written as
$$f(t, x_t) = f(0, x_0) + \int_0^t \big[ \partial_s f(s, x_{s^-}) + (\mathcal{L}_s f)(s, x_{s^-}) \big]\,ds + M_t, \tag{65}$$
where $x_{s^-} = \lim_{u \to s^-} x_u$, which exists for Lebesgue-almost every $s \in [0, t)$. The compensation process $\{M_u\}_{u \in [\ell, t]}$ is defined as
$$M_u = \sum_{y_s :\, y_s \ne x_s} \int_{\ell}^{u} \big( f(s, y_s) - f(s, x_s) \big)\big( dN_s^{x_s, y_s} - \lambda_s^{x_s, y_s}\,ds \big),$$
where $N_s^{x, y}$ is the counting process of jumps from $x$ to $y$ up to time $s$ and $\lambda_s^{x, y}$ is the $\mathcal{F}_{s^-}$-intensity of $N_s^{x, y}$, i.e., $\lambda_s^{x, y} = \mathbb{I}\{x_{s^-} = x\}\, R_s(x, y)$. See Conforti et al. (2025, Appendix A.5) for more details.
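Before moving to the proofs, the closed form in Proposition 6 admits a quick numerical sanity check. The sketch below (our own illustration on a toy state space; the helper names `q_t`, `score_formula` are ours, not from the paper) compares the defining marginal ratio $q_t(y)/q_t(x)$ of the uniform noising process against the expression in Eqn. (59) for a random $q_0$.

```python
import itertools
import math

import numpy as np

# Toy setting: vocabulary size S, dimension d, time t, random data distribution q0.
S, d, t = 3, 2, 0.7
states = list(itertools.product(range(S), repeat=d))
rng = np.random.default_rng(0)
q0 = rng.random(len(states))
q0 /= q0.sum()

# Per-coordinate transition kernel of the uniform noising process at time t:
# stay with prob (1+(S-1)e^{-t})/S, move to each other symbol with prob (1-e^{-t})/S.
e = math.exp(-t)
P = (1 - e) / S * np.ones((S, S)) + e * np.eye(S)

def q_t(x):
    """Marginal q_t(x) = sum_{x0} q0(x0) * prod_i P(x0_i, x_i)."""
    return sum(q0[k] * math.prod(P[x0[i], x[i]] for i in range(d))
               for k, x0 in enumerate(states))

def d_hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

def score_formula(y, x):
    """Eqn. (59): s_t(y,x) = E[alpha_t^{d_H(y,x0)}] / E[alpha_t^{d_H(x,x0)}]."""
    alpha = (1 - e) / (1 + (S - 1) * e)
    num = sum(q0[k] * alpha ** d_hamming(y, x0) for k, x0 in enumerate(states))
    den = sum(q0[k] * alpha ** d_hamming(x, x0) for k, x0 in enumerate(states))
    return num / den

# The score is the marginal ratio q_t(y)/q_t(x); check it against Eqn. (59)
# for every neighboring pair (Hamming distance 1).
for x in states:
    for y in states:
        if d_hamming(x, y) == 1:
            assert abs(q_t(y) / q_t(x) - score_formula(y, x)) < 1e-10
```

The agreement is exact up to floating point because the common prefactor $((1+(S-1)e^{-t})/S)^d$ in $q_{t|0}$ cancels in the ratio, which is precisely the content of the proof above.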
C Proofs of results in Section 3.1

C.1 Proof of Theorem 1

We first decompose the KL divergence between the output distribution $p_T$ and the target distribution $q_0$ as
$$\mathsf{KL}(q_0 \,\|\, p_T) \le \mathsf{KL}\big(q_{T-t_0, \ldots, T-t_N} \,\|\, p_{t_0, \ldots, t_N}\big) = \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big], \tag{66}$$
where the inequality follows from the data-processing inequality for the KL divergence and the equality follows from the chain rule for the KL divergence in Lemma 4. The first term is the initialization error, which can be upper bounded by the following lemma.

Lemma 7. For the uniform noising process and any initial distribution $q_0 \in \mathcal{P}(\mathcal{X})$, one has the same limit distribution $q_t \overset{d}{\to} p_0 = \mathrm{Unif}(\mathcal{X})$ as $t \to \infty$. Further, the modified log-Sobolev constant of $q_t$ satisfies $C_{\mathsf{LSI}} = 2$, which leads to
$$\mathsf{KL}(q_t \,\|\, p_0) \le e^{-t}\, \mathsf{KL}(q_0 \,\|\, p_0) \le e^{-t}\, d \log(S).$$

The proof of Lemma 7 can be found in previous works, e.g., Zhang et al. (2025, Proposition 2). Applying the lemma above together with Lemma 5 and Eqn. (7) to the second term in Eqn. (66), we obtain
$$\begin{aligned}
\mathsf{KL}(q_0 \,\|\, p_T) &\le \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \bigg[ \int_{t_k}^{t_{k+1}} \frac{\partial}{\partial t}\, \mathsf{KL}\big( \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t|t_k}(\cdot \mid x_{t_k}) \big)\,dt \bigg] \\
&\le e^{-T} d \log(S) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg]\,dt \\
&\le e^{-T} d \log(S) + \frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) \bigg]\,dt.
\end{aligned} \tag{67}$$
In the following, we focus on the quantity
$$\mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big).$$
(68)
For simplicity, we write $\ell := t_k$. Direct calculation yields the decomposition
$$\sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) = T_1^{t,\ell} + T_2^{t,\ell} + T_3^{t,\ell},$$
where
$$\begin{aligned}
T_1^{t,\ell} &:= \sum_{y_\ell :\, d_H(y_\ell, x_\ell) = 1} s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big), \\
T_2^{t,\ell} &:= \sum_{i \in [d]} \sum_{c \in [S]} \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell), \\
T_3^{t,\ell} &:= \sum_{y_t :\, d_H(y_t, x_t) = 1} h\big( s_{T-t}(y_t, x_t) \big) - \sum_{y_\ell :\, d_H(y_\ell, x_\ell) = 1} h\big( s_{T-\ell}(y_\ell, x_\ell) \big),
\end{aligned}$$
and $h(x) = x \log x - x + 1$. We proceed by bounding each term separately.

[Footnote 3: $C_{\mathsf{LSI}}$ is defined as the smallest number such that for any $q \in \mathcal{P}(\mathcal{X})$, $\mathsf{KL}(q \,\|\, \mathrm{Unif}(\mathcal{X})) \le (C_{\mathsf{LSI}}/2) \cdot \mathcal{E}(q, \log q)$, where $\mathcal{E}$ is the Dirichlet form associated with the uniform noising process, i.e., $\mathcal{E}(f, g) = -(2|\mathcal{X}|)^{-1} \sum_{x, y \in \mathcal{X}} (f(x) - f(y))(g(x) - g(y))\, Q(x, y)$.]

• For term $T_1^{t,\ell}$, notice that $T_1^{t,\ell}$ is independent of $t$. In view of the definition of the score entropy loss, we have
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_1^{t,\ell} \big] &= \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] \qquad (69) \\
&= S \cdot \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} Q_{T-\ell}(y_\ell, x_\ell)\, s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] = S \cdot \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}), \qquad (70)
\end{aligned}$$
where we use the fact that $Q_{T-t}(y, x) = S^{-1}$ for any $d_H(y, x) = 1$.

• For term $T_2^{t,\ell}$, we establish the following lemma, whose proof is provided in Section E.2.

Lemma 8. Consider the uniform noising process and let $0 \le \ell < t < T$.
Then, for any $c \in [S]$, $i \in [d]$ and $x_\ell \in \mathcal{X}$, it obeys
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \Big] = 0.$$

With Lemma 8, it is easily seen that
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_2^{t,\ell} \big] &= \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \bigg] \\
&= \sum_{i \in [d]} \sum_{c \in [S]} \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \Big] \bigg] = 0.
\end{aligned} \tag{71}$$

• For term $T_3^{t,\ell}$, we make the crucial observation that $\mathbb{E}_{x_t \sim \overleftarrow{q}_t}\big[ \sum_{y_t :\, d_H(y_t, x_t)=1} h(s_{T-t}(y_t, x_t)) \big]$ admits a simple representation, formalized in the following lemma.

Lemma 9. For any $t \in [0, T]$, we have
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} h\big( s_{T-t}(y_t, x_t) \big) \bigg] = \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_{T-t}(y_t, x_t) \bigg].$$

In view of this lemma, we can further express the term $T_3^{t,\ell}$ as
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_3^{t,\ell} \big] &= \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} h\big( s_{T-t}(y_t, x_t) \big) \bigg] - \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} h\big( s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] \qquad (72) \\
&= \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_{T-t}(y_t, x_t) \bigg] - \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} -\log s_{T-\ell}(y_\ell, x_\ell) \bigg]. \qquad (73)
\end{aligned}$$
Plugging Eqns. (70), (71), and (73) into Eqn.
(68), we end up with
$$\mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) = S \cdot \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}) + S \big( \varphi(T-t) - \varphi(T-\ell) \big),$$
where we define $\varphi(t)$ as
$$\varphi(t) := \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_t(y_t, x_t) \bigg]. \tag{74}$$
Returning to Eqn. (67), we conclude that
$$\begin{aligned}
\mathsf{KL}(q_0 \,\|\, p_T) &\le e^{-T} d \log(S) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \Big[ \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k}) + \big( \varphi(T-t) - \varphi(T-t_k) \big) \Big]\,dt \\
&\le \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt.
\end{aligned} \tag{75}$$
To establish Theorem 1, it remains to control the last term in Eqn. (75). First, by Jensen's inequality, $\varphi(t)$ is lower bounded by
$$\varphi(t) \ge -\frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \log\big( \mathbb{E}_{x_t \sim q_t}[ s_t(x_t \oplus_i c, x_t) ] \big) = 0. \tag{76}$$
For the upper bound, from the definition of $\varphi(t)$, it satisfies that
$$\varphi(t) \le \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \Big[ \big| \{ y_t : d_H(y_t, x_t) = 1 \} \big| \cdot \sup_{x, y :\, d_H(x, y)=1} \big| \log s_t(y, x) \big| \Big], \tag{77}$$
where $|\{ y_t : d_H(y_t, x_t) = 1 \}|$ denotes the cardinality of the set $\{ y_t : d_H(y_t, x_t) = 1 \}$, which equals $d(S-1)$ for any $x_t \in \mathcal{X}$. It therefore suffices to control the quantity $|\log s_t(y_t, x_t)|$, which is achieved through the following lemma.

Lemma 10. For any distribution $q_0$ on $\mathcal{X}$, let $q_t$ be the marginal distribution of the uniform noising process at time $t$. Then, for any $x, y \in \mathcal{X}$ such that $d_H(x, y) = 1$, it holds that
$$|\log s_t(y, x)| \lesssim \log(S) + \max\{ \log(t^{-1}), 0 \}.$$
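The two structural facts about $\varphi$ used in this argument — the lower bound $\varphi(t) \ge 0$ from Eqn. (76) and the monotonicity invoked just below via Lemma 12 — can be checked numerically on a toy instance where $q_t$ is computable exactly. The sketch below is our own illustration (the helpers `marginal`, `phi` are not from the paper); it evaluates $\varphi(t)$ from its definition in Eqn. (74) for a small uniform-noising chain.

```python
import itertools
import math

import numpy as np

S, d = 3, 2
states = list(itertools.product(range(S), repeat=d))
index = {x: k for k, x in enumerate(states)}
rng = np.random.default_rng(1)
q0 = rng.random(len(states))
q0 /= q0.sum()

def marginal(t):
    """Exact marginal q_t of the uniform noising process started from q0."""
    e = math.exp(-t)
    P = (1 - e) / S * np.ones((S, S)) + e * np.eye(S)  # per-coordinate kernel
    qt = np.zeros(len(states))
    for k, x0 in enumerate(states):
        for m, x in enumerate(states):
            qt[m] += q0[k] * math.prod(P[x0[i], x[i]] for i in range(d))
    return qt

def neighbors(x):
    """All y with Hamming distance 1 from x."""
    for i in range(d):
        for c in range(S):
            if c != x[i]:
                yield x[:i] + (c,) + x[i + 1:]

def phi(t):
    """phi(t) = (1/S) E_{x~q_t} sum_{y: d_H(y,x)=1} -log s_t(y,x), cf. Eqn. (74)."""
    qt = marginal(t)
    total = 0.0
    for m, x in enumerate(states):
        for y in neighbors(x):
            total += qt[m] * (math.log(qt[m]) - math.log(qt[index[y]]))
    return total / S

values = [phi(t) for t in (0.2, 0.5, 1.0, 2.0)]
assert all(v >= -1e-9 for v in values)                          # Eqn. (76): phi >= 0
assert all(a + 1e-9 >= b for a, b in zip(values, values[1:]))   # non-increasing in t
```

Both assertions pass for generic random $q_0$, in line with Eqn. (76) and the monotonicity statement of Lemma 12.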
As a consequence of Lemma 10, we arrive at
$$\varphi(t) \le \frac{d(S-1)}{S} \cdot \sup_{x, y :\, d_H(x, y)=1} |\log s_t(y, x)| \lesssim d\big( \log(S) + \max\{ \log(t^{-1}), 0 \} \big). \tag{78}$$
In addition, we make the observation in Lemma 12 that $\varphi(t)$ is a non-increasing function of $t$.

Now we are ready to combine everything and bound the last term of Eqn. (75). Define $\Delta = \max_k \{ t_{k+1} - t_k \}$, and choose $1 \le M \le N-1$ such that $T - t_M \in [\Delta, 2\Delta]$. Armed with Eqns. (76), (78) and the monotonicity of $\varphi(t)$, we obtain
$$\begin{aligned}
\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt &\le \int_{0}^{T-t_M} \varphi(t)\,dt + \sum_{k=0}^{M-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big)\,dt \\
&\le 2\Delta\, d\big( \log(S) + \log(2/\Delta) + 1 \big) + \Delta \sum_{k=0}^{N-2} \big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big) \\
&\le 2\Delta\, d\big( \log(S) + \log(2/\Delta) + 1 \big) + \Delta\, \varphi(\Delta) \lesssim \Delta\, d \log(S/\Delta).
\end{aligned}$$
Combining the inequality above with Eqn. (75) achieves
$$\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \Delta\, d \log(S/\Delta),$$
which completes the proof of Theorem 1.

C.2 Proof of Corollary 1

Choose time horizon $T = \log(d \log(S)/\varepsilon)$ and number of discretization steps
$$N = \Theta\bigg( \frac{d \log(S) \log^3(d \log(S)/\varepsilon)}{\varepsilon} \bigg) = \widetilde{\Theta}\Big( \frac{d}{\varepsilon} \Big).$$
Adopting the upper bound in Theorem 1 leads to
$$\mathsf{KL}(q_{\mathsf{data}} \,\|\, p_{\mathsf{output}}) = \mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \frac{T d}{N} \log\Big( \frac{S N}{T} \Big) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon + \frac{\varepsilon}{\log(S)\, T^2}\big( \log(S) + 3T \big) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon.$$

C.3 Proof of Theorem 2

Recall that the path measures of the backward process and the sampling process are denoted by $\mathcal{Q} \overset{d}{=} \{ \overleftarrow{q}_t \}_{t \in [0, T-\delta]}$ and $\mathcal{P} \overset{d}{=} \{ p_t \}_{t \in [0, T-\delta]}$, respectively. It can be checked that the path measure $\mathcal{Q}$ is absolutely continuous with respect to $\mathcal{P}$. By Girsanov's theorem for the backward process (e.g., Ren et al.
(2025, Corollary 3.4)), it satisfies
$$\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) = \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) \bigg]\,dt.$$
Following the same analysis as for Eqn. (68) in Appendix C.1, we arrive at
$$\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) = \varepsilon_{\mathsf{score}} + \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = \varepsilon_{\mathsf{score}} + \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt,$$
where the function $\varphi(\cdot)$ is defined as in Eqn. (74):
$$\varphi(t) := \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_t(y_t, x_t) \bigg].$$
Thus, to achieve $\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) \le \varepsilon_{\mathsf{score}} + O(1)$, we need to select $N$, $T$ and the step size schedule such that
$$\mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = O(1). \tag{79}$$

To understand the first term in Eqn. (79), let us first consider the case $T = 1$. By the assumption $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$, we are ensured that
$$\mathsf{KL}(q_1 \,\|\, p_0) = \sum_{x \in \mathcal{X}} q_1(x) \log q_1(x) - \sum_{x \in \mathcal{X}} q_1(x) \log p_0(x) \ge \gamma\, d \log(S) \gg 1.$$
Hence, $T$ must satisfy $T > 1$.

We then focus on the second term in Eqn. (79). We aim to show that the rate of change, i.e., $-\varphi'(t)$, is lower bounded as we come close to the target data distribution (i.e., $t \in [0, 1]$), which in turn leads to a lower bound on the difference $\varphi(T-t) - \varphi(T-t_k)$. We proceed with our analysis under an information-theoretic framework. For notational convenience, for every $i \in [d]$ and $c \in [S]$, define
$$\varphi_{i,c}(t) = \mathbb{E}_{x_t \sim q_t}\big[ -\log s_t(x_t \oplus_i c, x_t) \big] = \mathbb{E}_{x_t \sim q_t}\big[ -\log s_t(N^{i,c}(x_t), x_t) \big], \tag{80}$$
where the operator $N^{i,c} : \mathcal{X} \to \mathcal{X}$ is defined as $N^{i,c}(x) = x \oplus_i c$. It is easy to check that
$$\varphi(t) = \frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi_{i,c}(t).$$
Notice that $N^{i,c}$ is a bijection on $\mathcal{X}$. We define $N^{i,-c} := (N^{i,c})^{-1} = N^{i, S-c}$, where $(N^{i,c})^{-1}$ denotes the inverse of $N^{i,c}$. Since $\varphi(t)$ can be written as a linear combination of the $\varphi_{i,c}(t)$, it suffices to study the properties of the individual $\varphi_{i,c}(t)$ to characterize $\varphi(t)$. To begin with, the following lemma characterizes $\varphi(t)$ and $\varphi_{i,c}(t)$ as information-theoretic quantities.

Lemma 11. For $\varphi(t)$ and $\varphi_{i,c}(t)$ defined in Eqns. (74) and (80), we have
$$\varphi_{i,c}(t) = \mathsf{KL}\big( q_t \,\|\, (N^{i,-c})_\# q_t \big); \qquad \varphi(t) = -\frac{\partial}{\partial t}\, \mathsf{KL}(q_t \,\|\, p_0) = \sum_{i \in [d]} \sum_{c \in [S]} \mathsf{KL}\big( q_t \,\|\, (N^{i,c})_\# q_t \big),$$
where $(N^{i,c})_\#$ denotes the pushforward of $q_t$ under the operator $N^{i,c}$. Here, the pushforward measure satisfies, for any $x \in \mathcal{X}$,
$$(N^{i,-c})_\# q_t(x) = q_t\big( N^{i,c}(x) \big). \tag{81}$$

Lemma 11 allows us to write $\varphi_{i,c}(t)$ as the KL divergence between the marginal of the forward process and its pushforward under $N^{i,c}$. Viewing $N^{i,c}$ as an information channel, one can show that it belongs to a special family of channels, the $S$-ary symmetric channels (Makur and Polyanskiy, 2018), which satisfy a strong data processing inequality. Through this idea, we can prove the following lemma; the details are provided in Section E.6.

Lemma 12. For $t \in (0, T]$, $\varphi_{i,c}(t)$ is differentiable in $t$ and it holds that $-\varphi'_{i,c}(t) \ge \varphi_{i,c}(t)$.

Consequently, Lemma 12 leads to
$$-\varphi'(t) = -\frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi'_{i,c}(t) \ge \frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi_{i,c}(t) = \varphi(t).$$
Recalling the log-Sobolev inequality in Lemma 7, we have, for any target distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$ and $t \in (0, 1)$,
$$-\varphi'(t) \ge \varphi(t) \ge \mathsf{KL}(q_t \,\|\, p_0) \ge \mathsf{KL}(q_1 \,\|\, p_0) \ge \gamma\, d \log(S).$$
Equipped with the above relation, we are ready to control the second term in Eqn. (79). By the fundamental theorem of calculus, we obtain
$$\int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = \int_{t_k}^{t_{k+1}} \int_{T-t}^{T-t_k} -\varphi'(\tau)\,d\tau\,dt \gtrsim \int_{t_k}^{t_{k+1}} (t - t_k)\, \gamma\, d \log(S)\,dt = \frac{1}{2} (t_{k+1} - t_k)^2\, \gamma\, d \log(S).$$
Choose $M$ such that $T - t_M \in [\tfrac{1}{2}, 1]$; such an $M$ exists since $T > 1$ and $\max_k \{ t_{k+1} - t_k \} \le \tfrac{1}{2}$. It holds in this case that
$$O(1) = \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt \gtrsim \sum_{k=M}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt \gtrsim \sum_{k=M}^{N-1} (t_{k+1} - t_k)^2\, \gamma\, d \log(S). \tag{82}$$
By the Cauchy–Schwarz inequality, it is direct to show that
$$\sum_{k=M}^{N-1} (t_{k+1} - t_k)^2 \ge \frac{1}{N - M} \bigg( \sum_{k=M}^{N-1} (t_{k+1} - t_k) \bigg)^2 = \frac{(T - t_M - \delta)^2}{N - M} \gtrsim \frac{1}{N - M}, \tag{83}$$
where the last inequality uses the fact that $T - t_M \ge \tfrac{1}{2}$ and $\delta \ll 1$. Plugging Eqn. (83) into Eqn. (82) leads to
$$N \ge N - M = \Omega\big( \gamma\, d \log(S) \big) = \widetilde{\Omega}(d).$$

C.4 Efficient sampling for high-entropy distributions

In the discussion following Theorem 2, we pointed out that the τ-leaping algorithm can attain iteration complexity sublinear in $d$ for the uniform noising process when the target distribution is close to the uniform distribution on $\mathcal{X}$. We now state this result formally.

Theorem 4. Let $q_0 \in \mathcal{P}(\mathcal{X})$ denote the data distribution. Choose time points $0 = t_0 < t_1 < \ldots < t_N = T - \delta$ with an exponential-then-constant step size schedule, i.e., $t_{k+1} - t_k \le \kappa \min(1, T - t_{k+1})$ for $k = 0, \ldots, N-2$. Suppose $0 < \kappa < 0.9$. Then,
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \big( e^{-T} + \kappa \log(\delta^{-1}) \big) \cdot \mathsf{KL}\big( q_0 \,\|\, \mathrm{Unif}(\mathcal{X}) \big).$$

Theorem 4 reveals that, with an exponential-then-constant schedule and early stopping time $\delta$, the error upper bound depends only on the initial KL divergence $\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))$, which can potentially be small if $q_0$ is close to the forward limit distribution $\mathrm{Unif}(\mathcal{X})$.
To be more concrete, we can choose $T = \log\big( \mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))/\varepsilon \big)$, $\delta^{-1} = \mathrm{poly}(d)$ and $\kappa = e^{-T}/\log(d)$ to achieve
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon,$$
with iteration complexity
$$N = \widetilde{\Theta}\bigg( \frac{\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))}{\varepsilon} \bigg).$$
In particular, this bound is sublinear in $d$ when $\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X})) = o(d)$.

Proof of Theorem 4. The proof proceeds along the same lines as the proof of Theorem 1. Write $p_0 = \mathrm{Unif}(\mathcal{X})$ for the initial distribution of the sampling process. Following the proof of Eqn. (75), we bound
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{T-\delta}) \le \varepsilon_{\mathsf{score}} + e^{-T}\, \mathsf{KL}(q_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt, \tag{84}$$
where, as shown in Lemma 11, $\varphi(t) = -\partial_t \mathsf{KL}(q_t \,\|\, p_0) \ge 0$. By Lemma 12, $\varphi(t)$ is a non-increasing function of $t \in (0, T]$, which leads to
$$\varphi(t) \le \frac{1}{t} \int_0^t \varphi(s)\,ds = \frac{1}{t} \int_0^t -\partial_s \mathsf{KL}(q_s \,\|\, p_0)\,ds \le \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{t}. \tag{85}$$
Without loss of generality, choose $M$ with $1 \le M \le N-1$ such that $T - t_M = 1$; for $1 \le k < M$, we have $t_{k+1} - t_k = t_k - t_{k-1} = \kappa$. With Eqn. (85), it can be seen that
$$\begin{aligned}
\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt &\le \sum_{k=0}^{N-1} (t_{k+1} - t_k)\big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big) \\
&= (t_N - t_{N-1})\, \varphi(T-t_N) + \sum_{k=1}^{N-1} \big( (t_k - t_{k-1}) - (t_{k+1} - t_k) \big)\, \varphi(T-t_k) - (t_1 - t_0)\, \varphi(T) \\
&\overset{(a)}{\le} \frac{\kappa \delta}{1 - \kappa} \cdot \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{\delta} + \sum_{k=M}^{N-1} \frac{\kappa^2}{1 - \kappa}\, (T - t_k) \cdot \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{T - t_k} \\
&\lesssim (N - M)\, \kappa^2\, \mathsf{KL}(q_0 \,\|\, p_0) = \log_{(1-\kappa)}(\delta)\, \kappa^2\, \mathsf{KL}(q_0 \,\|\, p_0) \overset{(b)}{\le} \kappa \log(\delta^{-1})\, \mathsf{KL}(q_0 \,\|\, p_0),
\end{aligned}$$
where we apply Eqn. (85) in (a) and $\log(1 - \kappa) \le -\kappa$ in (b). Putting the above bound and Eqn. (84) together proves the desired result.

D Proofs of results in Section 3.2

D.1 Proof of Theorem 3

For each $t_k \in \{t_0, \ldots, t_N\}$, let $p_{t_k}$ denote the marginal distribution of $x_{t_k}$ in Algorithm 1.
Using the data-processing inequality $\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \le \mathsf{KL}(\overleftarrow{q}_{t_0, \ldots, t_N} \,\|\, p_{t_0, \ldots, t_N})$ and Lemma 4, we decompose the KL divergence between the target distribution $q_0 \equiv \overleftarrow{q}_T$ and the output distribution $p_T$ as follows:
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \le \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big]. \tag{86}$$
The first term, the initialization error, was bounded in Conforti et al. (2025); Liang et al. (2025b) as follows:
$$\mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) \le e^{-T}\, d\, (1 + \log S + T) \lesssim e^{-T}\, d \log S. \tag{87}$$
Next, we move on to controlling the second term. The following lemma states that for each $k$, conditioned on $x_{t_k}$, one can consider a CTMC on the interval $[t_k, t_{k+1}]$ whose marginal at time $t_{k+1}$ is $p_{t_{k+1}|t_k}(\cdot \mid x_{t_k})$. The proof is given in Section F.3.

Lemma 13. Fix $k = 0, \ldots, N-1$. Let $x_{t_k}$ and $x_{t_{k+1}}$ be as in Algorithm 1. Let $(y_t)_{t \in [t_k, t_{k+1}]}$ be a CTMC with $y_{t_k} = x_{t_k}$ and the following rate matrix:
$$\widehat{Q}_t(a, b) := \begin{cases} \widehat{s}_{T-t_k}(y_{t_k} \odot_i b^i,\, y_{t_k})\, \dfrac{e^{T-t_k} - 1}{e^{T-t} - 1}\, \mathbb{I}\{a^i = \texttt{MASK}\}, & \text{if } d_H(a, b) = 1,\ a^i \ne b^i, \text{ and } y_{t_k}^i = \texttt{MASK}, \\[4pt] -\sum_{c \ne a} \widehat{Q}_t(a, c), & \text{if } a = b, \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{88}$$
Then, $x_{t_{k+1}}$ has the same distribution as $y_{t_{k+1}}$.

Armed with this result, we rewrite the right-hand side of Eqn. (86) with the marginals $p_{t|t_k}(\cdot \mid x_{t_k})$ of this CTMC:
$$\begin{aligned}
\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) &\lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big] \qquad (89) \\
&= e^{-T}\, d \log S + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \bigg[ \int_{t_k}^{t_{k+1}} \frac{\partial}{\partial t}\, \mathsf{KL}\big( \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t|t_k}(\cdot \mid x_{t_k}) \big)\,dt \bigg]. \qquad (90)
\end{aligned}$$
To further control the second term above, we apply Lemma 5 with the rate matrices specified in Lemma 13.
We can write
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg]\,dt. \tag{91}$$
Fix $k \in \{0, \ldots, N-1\}$ and $t \in [t_k, t_{k+1})$, and let $\ell := t_k$. Invoking Eqn. (88) further leads to
$$\begin{aligned}
&\sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} \bigg[ \widehat{Q}_t(x_t, x_t \odot_i c) - \overleftarrow{Q}_t(x_t, x_t \odot_i c) + \overleftarrow{Q}_t(x_t, x_t \odot_i c) \log \frac{\overleftarrow{Q}_t(x_t, x_t \odot_i c)}{\widehat{Q}_t(x_t, x_t \odot_i c)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} \bigg[ \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) + s_{T-t}(x_t \odot_i c, x_t) \log \frac{s_{T-t}(x_t \odot_i c, x_t)}{\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\bigg( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\ s_{T-t}(x_t \odot_i c, x_t) \bigg).
\end{aligned} \tag{92}$$
To proceed, we make the observation that the Bregman divergence satisfies the following law of cosines:
$$D(\alpha, \gamma) = D(\alpha, \beta) + D(\beta, \gamma) + (\alpha - \beta)\, \frac{\beta - \gamma}{\beta \gamma}.$$
We apply this decomposition to each term of Eqn. (92) with
$$\alpha = \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell), \qquad \beta = \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell), \qquad \gamma = s_{T-t}(x_t \odot_i c, x_t).$$
In the following, we slightly abuse notation and write $x_t := (x_t \odot_i c, x_t)$ and $x_\ell := (x_\ell \odot_i c, x_\ell)$ for the argument pairs whenever $i \in m(x_t)$ and $c \in [S]$ are fixed. For fixed $i, c$, each term in Eqn. (92) can be decomposed as
$$s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) = s_{T-t}(x_t)\, D\big( \widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell) \big) + s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) + \frac{\widehat{s}_{T-\ell}(x_\ell) - s_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} \Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \Big).$$
Note that we simplified the first term using the scaling invariance $D(\alpha x, \alpha y) = D(x, y)$. Observing that $\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell} \equiv s_{T-t}$ by Eqn. (60), this can be rearranged as follows:
$$s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) = \underbrace{\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell) \big)}_{=:\, T_1} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big) \log \frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{=:\, T_2} + \underbrace{s_{T-t}(x_t)\, D\big( s_{T-t}(x_\ell),\, s_{T-t}(x_t) \big)}_{=:\, T_3}.$$
Taking the above collectively with Eqns. (92) and (91) leads to
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} \big( T_1 + T_2 + T_3 \big)\,dt.$$
Now, it suffices to control each term on the right, respectively.

• After taking a summation over $i \in m(x_t)$ and $c \in [S]$, we connect the first term, $T_1$, to the score entropy loss. To see this, direct calculation shows
$$\begin{aligned}
&\mathbb{E}_{x_t, x_\ell \sim \overleftarrow{q}_{t, \ell}} \sum_{i \in m(x_t)} \sum_{c \in [S]} \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\, s_{T-\ell}(x_\ell \odot_i c, x_\ell) \big) \\
&\qquad = \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \sum_{i \in m(x_\ell)} \sum_{c \in [S]} e^{t-\ell}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\, s_{T-\ell}(x_\ell \odot_i c, x_\ell) \big) = e^{t-\ell}\, \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}),
\end{aligned}$$
where in the second line we used $\Pr\big( x_t^i = \texttt{MASK} \mid x_\ell^i = \texttt{MASK} \big) = \frac{1 - e^{-(T-t)}}{1 - e^{-(T-\ell)}}$. Therefore, recalling that $\ell := t_k$,
$$\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} e^{t - t_k}\, \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k})\,dt = \sum_{k=0}^{N-1} \big( e^{t_{k+1} - t_k} - 1 \big)\, \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k}) \lesssim \varepsilon_{\mathsf{score}}, \tag{93}$$
where we used $\Delta = O(1)$ and Assumption 1 in the last inequality.

• To control the second term, $T_2$, the next lemma describes a martingale property of the score function.
The proof is given in Section E.7.

Lemma 14. Consider the masking noising process and let $0 \le \ell < t < T$. Then, for any $c \in \mathcal{V}$ and $i \in m(x_\ell)$,
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-t}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) \big)\, \mathbb{I}\{ i \in m(x_t) \} \Big] = 0.$$
In view of Lemma 14, we conclude that the second term, $T_2$, contributes zero after conditioning on $x_{t_k}$:
$$\sum_{i \in [d]} \sum_{c \in [S]} \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k})} \Big[ \mathbb{I}\{ i \in m(x_t) \}\, \big( s_{T-t}(x_{t_k} \odot_i c, x_{t_k}) - s_{T-t}(x_t \odot_i c, x_t) \big) \Big] = 0.$$

• Lastly, we move on to controlling the last term, $T_3$. Toward this goal, we introduce the following lemma, whose proof is provided in Section F.4.

Lemma 15. Let $0 \le \ell < t \le T$. Then, for $I(t)$ defined in Eqn. (16),
$$\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\big( s_{T-t}(x_\ell \odot_i c, x_\ell),\, s_{T-t}(x_t \odot_i c, x_t) \big) = \int_\ell^t e^{t-v}\, I(T-v)\,dv. \tag{94}$$

After the summation over $i \in m(x_t)$ and $c \in [S]$, we express the contribution of the term $T_3$ using Lemma 15 as follows:
$$\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\big( s_{T-t}(x_{t_k} \odot_i c, x_{t_k}),\, s_{T-t}(x_t \odot_i c, x_t) \big)\,dt = \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \int_{t_k}^{t} e^{t-v}\, I(T-v)\,dv\,dt \le \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt, \tag{95}$$
where we used $\Delta = O(1)$ and the non-negativity of conditional mutual information in the last inequality.

Collecting Eqns. (87), (93), and (95) proves
$$\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T}\, d \log S + \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt.$$

D.2 Proof of Corollary 2

We upper bound the last term of Eqn. (17) under the uniform and the exponential-then-constant step size schedules.
First, under the constant step size schedule, the quantity of interest satisfies
$$\sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt = \frac{T}{N} \int_0^T I(t)\,dt \le \frac{T}{N} \int_0^\infty I(t)\,dt = \frac{T}{N}\, B(q_0),$$
where the last step follows from Lemma 16. Therefore, as long as
$$N \ge \frac{T\, B(q_0)}{\varepsilon} = \widetilde{O}\Big( \frac{B(q_0)}{\varepsilon} \Big),$$
Eqn. (17) leads to $\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Next, under the exponential-then-constant step size schedule, we bound the last term of Eqn. (17) as follows:
$$\begin{aligned}
\sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt &= \frac{\varepsilon}{d \log(S)} \int_0^{\varepsilon/(d \log S)} I(t)\,dt + \sum_{k=0}^{N-2} \int_{T-t_{k+1}}^{T-t_k} (t_{k+1} - t_k)\, I(t)\,dt \\
&\le \frac{\varepsilon}{2 \log(S)} + \kappa \sum_{k=0}^{N-2} \int_{T-t_{k+1}}^{T-t_k} \min(1, T - t_{k+1})\, I(t)\,dt \le \varepsilon + \kappa \int_0^T \min(1, t)\, I(t)\,dt \le \varepsilon + \kappa\, D(q_0).
\end{aligned}$$
For $N > 0$, such a step size schedule is possible with $\kappa = O\big( \frac{T + \log(\varepsilon^{-1} d \log(S))}{N} \big)$. Thus, choosing
$$N \ge \frac{\big( T + \log(\varepsilon^{-1} d \log(S)) \big)\, D(q_0)}{\varepsilon} = \widetilde{O}\Big( \frac{D(q_0)}{\varepsilon} \Big)$$
gives $\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

D.3 τ-leaping for masking discrete diffusion

In this section, we prove the analogue of Theorem 3 for the truncated τ-leaping algorithm. Note that since applying multiple jumps to a single coordinate is ill-defined for the masking noising process (where should we transition if the τ-leaping algorithm requires two transitions $\texttt{MASK} \to 1$ at some coordinate?), we analyze the truncated version (Eqn. (9)) instead of the classical τ-leaping algorithm.

Theorem 5. Let $q_{\mathsf{data}} = q_0$ be the target distribution on $[S]^d$. Let $0 < \delta < T$ and $0 = t_0 < t_1 < \ldots < t_N = T - \delta$, such that $h_k := t_k - t_{k-1} \le \kappa \min(1, T - t_k)$ for $k \in [N]$, and $\kappa = O(1)$. Let
$$p_0 := \bigg( \big( 1 - e^{-T} \big)\, \delta_{\texttt{MASK}} + \frac{e^{-T}}{S} \sum_{k=1}^{S} \delta_k \bigg)^{\otimes d}.$$
Under Assumption 1, the truncated τ-leaping algorithm of Eqn.
(9) initialized at $p_0$ produces a sample from $p_{\mathsf{output}} := p_{T-\delta}$ such that
$$\mathsf{KL}(q_\delta \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + e^{-T}\, d \log(S) + \sum_{k=0}^{N-1} h_{k+1} \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt + \kappa^3 N d + \kappa C, \tag{96}$$
where
$$C := \sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{i \in m(x_{t_k})} \sum_{c \in [S]} s_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})\, \log \frac{\widehat{s}_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})}{s_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})}.$$
Consequently, with the exponential-then-constant step size schedule where $\kappa = O\big( \frac{T + \log(\delta^{-1})}{N} \big)$, for any $\varepsilon > 0$, for $T = O(\log(\varepsilon^{-1} d \log S))$ and
$$N = \widetilde{O}\bigg( \frac{D(q_{\mathsf{data}}) + C}{\varepsilon} + \sqrt{\frac{d}{\varepsilon}} \bigg), \tag{97}$$
it satisfies $\mathsf{KL}(q_\delta \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Note that the guarantee in Eqn. (96) closely parallels Eqn. (17) in Theorem 3 for Algorithm 1. In particular, two additional terms arise in the analysis of the truncated τ-leaping algorithm. We expect the constant $C$ to be small, and remark that it also arises in the analysis of Conforti et al. (2025), as $C_{M_2}$ in their Theorem 3.2.1, in the form of a maximum rather than an average. Under a (one-sided) boundedness assumption $\widehat{s}_{T-t_k} \ge M^{-1}$, the constant $C$ can be upper bounded via the Cauchy–Schwarz inequality, as the next corollary shows.

Corollary 3. Consider the setting of Theorem 5. If, additionally, there exists $M > 0$ such that for all $k \in \{0, \ldots, N-1\}$, $x \in ([S] \cup \{\texttt{MASK}\})^d$, $i \in m(x)$, and $c \in [S]$ it holds that $\log \widehat{s}_{T-t_k}(x \odot_i c, x) \ge -\log M$, then
$$N = \widetilde{O}\bigg( \frac{D(q_{\mathsf{data}}) + \sqrt{\varepsilon_{\mathsf{score}}\, d \log M}}{\varepsilon} + \sqrt{\frac{d}{\varepsilon}} \bigg)$$
is sufficient to ensure $\mathsf{KL}(q_\delta \,\|\, p_{T-\delta}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Proof. For fixed $k \in \{0, \ldots$
, N u and ℓ : “ t k , by the Cauch y-Sch warz inequality , it satisfies E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ď ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˛ ‚ 1 { 2 ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ˛ ‚ 1 { 2 ď ? d ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ˛ ‚ 1 { 2 . Next, using z ´ 1 ´ log z Á p log z q 2 B for log z ě ´ B , together with log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ě ´ log M ` log ` e T ´ ℓ ´ 1 ˘ ě ´ log M ` log p T ´ ℓ q ě ´ log p M δ ´ 1 q , w e upper b ound E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ď log p M δ ´ 1 q ˆ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q D p p s T ´ ℓ p x ℓ d i c, x ℓ q , s T ´ ℓ p x ℓ d i c, x ℓ qq . As a result, C can b e con trolled as C : “ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q log p s T ´ t k p x t k d i c, x t k q s T ´ t k p x t k d i c, x t k q ď a d log p M δ ´ 1 qˆ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q ¨ ˝ E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q D p p s T ´ t k p x t k d i c, x t k q , s T ´ t k p x t k d i c, x t k qq ˛ ‚ 1 { 2 44 p a q ď a κN d log p M δ ´ 1 qˆ ¨ ˝ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q D p p s T ´ t k p x t k d i c, x t k q , s T ´ t k p x t k d i c, x t k qq ˛ ‚ 1 { 2 ď a κN d log p M δ ´ 1 q ε score , (98) where in p a q we used t k ` 1 ´ t k ď κ together with the Cauch y-Sch warz inequalit y . Combining the bound of Eqn. ( 98 ) with Eqn. 
( 97 ) and κ “ r O p 1 { N q completes the proof. W e emphasize that, in con trast to Theorem 3 , in Theorem 5 we require early stopping for some δ ą 0 , whic h in turn leads to the exp onential-then-constan t step size schedule. W e now elaborate on the difference b et w een Theorem 5 and Theorem 3 , and provide some in tuition for the app earance of the t wo additional terms in Eqn. ( 96 ). R emark 3 . T o obtain an accurate sampler, it is natural to require that p Q t « Ð Q t uniformly for all t P r 0 , T s . The main challenge is that w e only hav e access to score estimates at discrete time points. In the truncated τ -leaping algorithm analyzed in this section, this results in appro ximating p s T ´ t : “ p s T ´ t k « s T ´ t . Informally , w e establish this by sho wing p s T ´ t k « s T ´ t k « s T ´ t , (99) where the first approximation is ensured by Assumption 1 and the second results from the prop erties of the score function for the masking noising pro cess, requiring the step size t k ` 1 ´ t k to b e small. In contrast, for Algorithm 1 considered in Theorem 3 , the condition p Q t « Ð Q t translates to p s T ´ t : “ e T ´ t k ´ 1 e T ´ t ´ 1 p s T ´ t k « s T ´ t . In view of Prop osition 6 , the abov e condition is equiv alent to p s T ´ t : “ e T ´ t k ´ 1 e T ´ t ´ 1 p s T ´ t k « e T ´ t k ´ 1 e T ´ t ´ 1 s T ´ t k “ s T ´ t , whic h is guaran teed b y Assumption 1 . Notably , this simple rescaling eliminates the need for the second appro ximation step that is required in the truncated τ -leaping analysis. This distinction explains wh y Theorem 5 contains tw o additional error terms and necessitates early stopping, in con trast to the cleaner guaran tee obtained in Theorem 3 . Pro of of Theorem 5 . The proof follows the pro of of Theorem 3 closely with several additional steps. W e b egin with Eqn. 
( 91 ) KL p Ð q T } p T q À e ´ T d log S ` N ´ 1 ÿ k “ 0 ż t k ` 1 t k E x t k ,x t „ Ð q t k ,t ÿ y ‰ x t « p Q t p x t , y q ´ Ð Q t p x t , y q ` Ð Q t p x t , y q log ˜ Ð Q t p x t , y q p Q t p x t , y q ¸ff d t. (100) Next, since for this sampler, the rate matrices p Q t are given b y the following: p Q t p x, y q : “ $ ’ & ’ % p s T ´ t k p x t k d i y i , x t k q I t x i “ MASK u , if d H p x, y q “ 1 , x i ‰ y i , and x i t k “ MASK , ´ ř z ‰ x p Q t p x, z q , if y “ x, 0 , otherwise, 45 w e can therefore b ound ÿ y ‰ x t « p Q t p x t , y q ´ Ð Q t p x t , y q ` Ð Q t p x t , y q log ˜ Ð Q t p x t , y q p Q t p x t , y q ¸ff “ ÿ i P m p x t q ÿ c Pr S s « p Q t p x t , x t d i c q ´ Ð Q t p x t , x t d i c q ` Ð Q t p x t , x t d i c q log ˜ Ð Q t p x t , x t d i c q p Q t p x t , x t d i c q ¸ff “ ÿ i P m p x t q ÿ c Pr S s „ p s T ´ ℓ p x ℓ d i c, x ℓ q ´ s T ´ t p x t d i c, x t q ` s T ´ t p x t d i c, x t q log ˆ s T ´ t p x t d i c, x t q p s T ´ ℓ p x ℓ d i c, x ℓ q ˙ȷ “ ÿ i P m p x t q ÿ c Pr S s s T ´ t p x t d i c, x t q D p p s T ´ ℓ p x ℓ d i c, x ℓ q , s T ´ t p x t d i c, x t qq . T o proceed, w e again apply the la w of cosines D p α, γ q “ D p α , β q ` D p β , γ q ` p α ´ β q β ´ γ β γ with α “ p s T ´ ℓ p x ℓ d i c, x ℓ q , β “ s T ´ ℓ p x ℓ d i c, x ℓ q , and γ “ s T ´ t p x t d i c, x t q . In the follo wing, we sligh tly abuse the notation and write x t : “ p x t d i c, x t q and x ℓ : “ p x ℓ d i c, x ℓ q whenev er i P m p x t q and c P r S s are fixed. As a result, for fixed i, c , one has s T ´ t p x t q D p p s T ´ ℓ p x ℓ q , s T ´ t p x t qq “ s T ´ t p x t q D p p s T ´ ℓ p x ℓ q , s T ´ ℓ p x ℓ qq ` s T ´ t p x t q D p s T ´ ℓ p x ℓ q , s T ´ t p x t qq ` p s T ´ ℓ p x ℓ q ´ s T ´ ℓ p x ℓ q s T ´ ℓ p x ℓ q p s T ´ ℓ p x ℓ q ´ s T ´ t p x t qq . 
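As a quick numerical sanity check, the "law of cosines" identity above can be verified on random positive triples. The sketch below assumes the divergence takes the form $D(a,b) = a/b - 1 - \log(a/b)$; this form is inferred from the surrounding derivation rather than quoted from the paper's definition of $D$.

```python
import math
import random

random.seed(1)

# Assumed form of the divergence used in the law-of-cosines identity:
# D(a, b) = a/b - 1 - log(a/b).  (An assumption inferred from the derivation.)
def D(a, b):
    return a / b - 1.0 - math.log(a / b)

# Verify D(alpha, gamma) = D(alpha, beta) + D(beta, gamma)
#                          + (alpha - beta)(beta - gamma)/(beta * gamma)
# on random positive triples.
for _ in range(1000):
    alpha, beta, gamma = (random.uniform(0.1, 10.0) for _ in range(3))
    lhs = D(alpha, gamma)
    rhs = D(alpha, beta) + D(beta, gamma) + (alpha - beta) * (beta - gamma) / (beta * gamma)
    assert abs(lhs - rhs) < 1e-9
```

Under this form of $D$, the identity is an exact algebraic cancellation (the cross term is precisely $\alpha/\gamma - \alpha/\beta - \beta/\gamma + 1$), which is what the check confirms.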
This can be rearranged as follows:
$$s_{T-t}(x_t)\, D\big(\widehat{s}_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big) = \underbrace{s_{T-\ell}(x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell)\big)}_{=:\,T_1} + \underbrace{s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big)}_{=:\,T_2} + \underbrace{\big( s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{=:\,T_3}.$$
It is therefore sufficient to control each term separately.

Similar to the proof of Theorem 3, the first term $T_1$, after summation, is upper bounded by the score entropy loss (using $m(x_t)\subseteq m(x_\ell)$ and the non-negativity of the summands):
$$\mathbb{E}_{x_t,x_\ell\sim\overleftarrow{q}_{t,\ell}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell),\, s_{T-\ell}(x_\ell\odot_i c, x_\ell)\big) \le \mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell),\, s_{T-\ell}(x_\ell\odot_i c, x_\ell)\big) = \mathcal{L}_{\mathrm{SE}}\big(T-\ell,\, \widehat{s}_{T-\ell},\, s_{T-\ell}\big),$$
and, by Assumption 1,
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}} \mathcal{L}_{\mathrm{SE}}\big(T-\ell,\, \widehat{s}_{T-\ell},\, s_{T-\ell}\big)\,\mathrm{d}t \le \varepsilon_{\mathrm{score}}. \qquad (101)$$
We now turn to controlling the terms $T_2$ and $T_3$, which require a different treatment than in the proof of Theorem 3.

For the term $T_2$, we again apply the law of cosines, this time with $\alpha = s_{T-\ell}(x_\ell)$, $\beta = s_{T-t}(x_\ell)$, and $\gamma = s_{T-t}(x_t)$. This leads to the decomposition
$$s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big) = \underbrace{s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_\ell)\big)}_{=:\,T_{21}} + \underbrace{s_{T-t}(x_t)\, D\big(s_{T-t}(x_\ell),\, s_{T-t}(x_t)\big)}_{=:\,T_{22}} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big)\,\frac{s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell)}{s_{T-t}(x_\ell)}}_{=:\,T_{23}}.$$

$\bullet$ For $T_{21}$, using Eqn. (60), observe that
$$D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_\ell)\big) = \frac{e^{T-t}-1}{e^{T-\ell}-1} - 1 - \log\frac{e^{T-t}-1}{e^{T-\ell}-1} \le \frac{\big(e^{T-\ell}-e^{T-t}\big)^2}{2\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)} \lesssim \kappa^2,$$
where $\kappa$ is the parameter of the step size schedule, $t_{k+1}-t_k \le \kappa\min(1, T-t_{k+1})$. Since $\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t) = (e^{T-t}-1)^{-1}$ by Proposition 6 (as $\sum_{c\in[S]} q_0(x_t\odot_i c) = q_0(x_t)$), the total contribution of the terms $T_{21}$ is
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_t,x_{t_k}\sim\overleftarrow{q}_{t,t_k}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t)\, D\big(s_{T-t_k}(x_{t_k}\odot_i c, x_{t_k}),\, s_{T-t}(x_{t_k}\odot_i c, x_{t_k})\big)\,\mathrm{d}t \lesssim \kappa^2\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\,\frac{|m(x_t)|}{e^{T-t}-1}\,\mathrm{d}t \le \kappa^2 d\sum_{k=0}^{N-1}\frac{h_{k+1}}{e^{T-t_{k+1}}-1} \le \kappa^3 N d, \qquad (102)$$
where the last step uses $h_{k+1} \le \kappa\min(1, T-t_{k+1}) \le \kappa\big(e^{T-t_{k+1}}-1\big)$.

$\bullet$ The term $T_{22}$ coincides with the term $T_2$ from the proof of Theorem 3; thus, using Lemma 15, after summation over $i\in m(x_t)$ and $c\in[S]$ we obtain
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_{t_k},x_t\sim\overleftarrow{q}_{t_k,t}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t)\, D\big(s_{T-t}(x_{t_k}\odot_i c, x_{t_k}),\, s_{T-t}(x_t\odot_i c, x_t)\big)\,\mathrm{d}t = \sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\int_{t_k}^t e^{t-v}\, I(T-v)\,\mathrm{d}v\,\mathrm{d}t \lesssim \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \qquad (103)$$
Here, we invoke the assumption $\kappa = O(1)$ and the non-negativity of conditional mutual information in the last inequality.

$\bullet$ To control $T_{23}$, observe that by Eqn. (60) the score function satisfies
$$\frac{s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell)}{s_{T-t}(x_\ell)} = \frac{e^{T-t}-e^{T-\ell}}{e^{T-\ell}-1},$$
and, importantly, the right-hand side does not depend on $x_\ell$. This implies that, upon summation over $i\in m(x_t)$ and $c\in[S]$, the term $T_{23}$ contributes zero, i.e.,
$$\sum_{i\in m(x_t)}\sum_{c\in[S]}\big( s_{T-t}(x_\ell\odot_i c, x_\ell) - s_{T-t}(x_t\odot_i c, x_t) \big)\,\frac{e^{T-t}-e^{T-\ell}}{e^{T-\ell}-1} = \frac{e^{T-t}-e^{T-\ell}}{\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)}\sum_{i\in m(x_t)}\Big( \frac{\sum_{c\in[S]} q_0(x_\ell\odot_i c)}{q_0(x_\ell)} - \frac{\sum_{c\in[S]} q_0(x_t\odot_i c)}{q_0(x_t)} \Big) = \frac{e^{T-t}-e^{T-\ell}}{\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)}\sum_{i\in m(x_t)}(1-1) = 0.$$
Putting the pieces together, we conclude
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_{t_k},x_t\sim\overleftarrow{q}_{t_k,t}}\sum_{i\in m(x_t)}\sum_{c\in[S]} T_2 \le \kappa^3 N d + \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \qquad (104)$$

It therefore remains to control the term $T_3$. Recall its definition:
$$T_3 := \big( s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}.$$
Crucially, unlike in the proof of Theorem 3, we no longer have a martingale property for this term. However, we can decompose
$$T_3 = \big( s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{\text{contributes zero by Lemma 14}}.$$
It remains to bound the first term, which can be written as
$$\big( s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} = \frac{e^{T-t}-e^{T-\ell}}{e^{T-t}-1}\, s_{T-\ell}(x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}.$$
In view of the step size assumption $t_{k+1}-t_k \le \kappa\min(1, T-t_{k+1})$, it satisfies $\big|\frac{e^{T-t}-e^{T-\ell}}{e^{T-t}-1}\big| \lesssim \kappa$. The total contribution of the terms $T_3$ can therefore be upper bounded by
$$\kappa\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}\,\mathrm{d}t = \kappa\sum_{k=0}^{N-1}(t_{k+1}-t_k)\,\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}. \qquad (105)$$

Collecting Eqns. (100), (101), (102), (103), and (105) proves
$$\mathsf{KL}(q_\delta\,\|\,p_{T-\delta}) \lesssim \varepsilon_{\mathrm{score}} + e^{-T} d\log(S) + \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t + \kappa^3 N d + \kappa C,$$
where
$$C := \sum_{k=0}^{N-1}(t_{k+1}-t_k)\,\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}.$$
For our step size schedule, as in Corollary 2, we upper bound
$$\sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t \le \kappa\sum_{k=0}^{N-1}\int_{T-t_{k+1}}^{T-t_k}\min(1,\, T-t_{k+1})\, I(t)\,\mathrm{d}t \le \kappa\int_\delta^T\min(1,t)\, I(t)\,\mathrm{d}t \le \kappa\, D(q_{\mathrm{data}}). \qquad (106)$$
Plugging in the choices $\kappa = O\big(\frac{T+\log(\delta^{-1})}{N}\big)$, $T = O(\log(\varepsilon^{-1} d\log S))$, and $N = \widetilde{O}\big( \frac{D(q_{\mathrm{data}})+C}{\varepsilon} + \sqrt{d/\varepsilon} \big)$ yields $\mathsf{KL}(q_\delta\,\|\,p_{T-\delta}) \lesssim \varepsilon_{\mathrm{score}} + \varepsilon$.

E Proofs of the main lemmas

E.1 Characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$

The characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ is summarized in the following lemma.

Lemma 16. Consider a masking noising process with initial distribution $q_0 = q_{\mathrm{data}}$. Let $C(q_{\mathrm{data}})$ and $B(q_{\mathrm{data}})$ be the total correlation and the dual total correlation, respectively. Then
$$B(q_{\mathrm{data}}) = \int_0^\infty I(t)\,\mathrm{d}t \qquad\text{and}\qquad C(q_{\mathrm{data}}) = \int_0^\infty\big(e^t-1\big)\, I(t)\,\mathrm{d}t.$$
Consequently, $D(q_{\mathrm{data}}) \le \min\big( B(q_{\mathrm{data}}),\, C(q_{\mathrm{data}}) \big)$.

Proof. Let $p \equiv p(t) = e^{-t}$ be the probability that at time $t$ a coordinate is unmasked, and set $X(p) \equiv (X_1,\dots,X_d) := (x_t^1,\dots,x_t^d)$. We also write $X_R := (X_i)_{i\in R}$ for $R\subseteq[d]$. With a slight abuse of notation we write $I(p) := I(t(p))$. We have
$$I(p) = \sum_{i\ne j} I\big( x_t^i;\, x_t^j \,\big|\, x_t^{-(i,j)} \big) = \sum_{i\ne j} p^2\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\, I\big( X_i;\, X_j \,\big|\, X_R \big),$$
where the factor $p^2$ appears since, for the term $I(X_i; X_j\,|\,X_R)$ to be non-zero, both $X_i$ and $X_j$ must be unmasked.
For $i\in[d]$, define
$$h_i(p) := \sum_{R\subseteq[d]\setminus\{i\}} p^{|R|}(1-p)^{d-1-|R|}\, H\big( X_i \,\big|\, X_R \big),$$
with
$$\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} = \sum_{R\subseteq[d]\setminus\{i\}}\Big[ |R|\, p^{|R|-1}(1-p)^{d-1-|R|} - (d-1-|R|)\, p^{|R|}(1-p)^{d-2-|R|} \Big] H(X_i\,|\,X_R) = \sum_{j\ne i}\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\big( H(X_i\,|\,X_{R\cup\{j\}}) - H(X_i\,|\,X_R) \big) = -\sum_{j\ne i}\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\, I(X_i;\, X_j\,|\,X_R).$$
Therefore,
$$I(p) = \sum_{i=1}^d p^2\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big).$$
Since $p = e^{-t}$, we have $\mathrm{d}t = -\mathrm{d}p/p$, and we can write
$$\int_0^\infty I(t)\,\mathrm{d}t = \int_0^1\sum_{i=1}^d p\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big)\mathrm{d}p = \sum_{i=1}^d\big( -p\, h_i(p) \big)\Big|_0^1 + \int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p.$$
Observe that $\frac{\mathrm{d}H(X(p))}{\mathrm{d}p} = \sum_{i=1}^d h_i(p)$; therefore,
$$\int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p = H(X(1)) - H(X(0)) = H(x_0).$$
Since $\sum_{i=1}^d h_i(1) = \sum_{i=1}^d H\big( x_0^i\,\big|\,x_0^{-i} \big)$, we have proved the first part:
$$\int_0^\infty I(t)\,\mathrm{d}t = H(x_0) - \sum_{i=1}^d H\big( x_0^i\,\big|\,x_0^{-i} \big) = B(q_0).$$
We proceed similarly for the total correlation:
$$\int_0^\infty\big(e^t-1\big)\, I(t)\,\mathrm{d}t = \int_0^1 (1-p)\sum_{i=1}^d\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big)\mathrm{d}p = -\Big( \sum_{i=1}^d (1-p)\, h_i(p)\Big|_0^1 + \int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p \Big) = \sum_{i=1}^d H(x_0^i) - H(x_0) = C(q_0),$$
using $h_i(0) = H(x_0^i)$.

E.2 Proof of Lemma 8

For any $i\in[d]$ and $c\in[S]$, define
$$f_{i,c}(t, x_t) := s_{T-t}\big( x_t\oplus_i c,\, x_t \big). \qquad (107)$$
The analysis below holds for all $i\in[d]$ and $c\in[S]$, so we omit the indices $i,c$ and simply write $f(t,x_t)$. Consider the backward process $\{x_t\}_{t\in[0,T]}\sim\{\overleftarrow{q}_t\}_{t\in[0,T]}$, which is a Poisson jump process with generator $\overleftarrow{L}_t$ given by
$$\big( \overleftarrow{L}_t f \big)(t,x) = \sum_{y:\, d_H(y,x)\le 1} Q_{T-t}(y,x)\, s_{T-t}(y,x)\big( f(t,y) - f(t,x) \big) = \frac{1}{S}\sum_{y:\, d_H(y,x)=1} s_{T-t}(y,x)\big( f(t,y) - f(t,x) \big).$$
By Itô's formula for Poisson point processes (Lemma 6), $f(t,x_t)$ satisfies the following stochastic differential equation: for $0\le\ell\le t< T$,
$$f(t,x_t) = f(\ell,x_\ell) + \int_\ell^t\Big[ \partial_t f(s, x_{s-}) + \big( \overleftarrow{L}_s f \big)(s, x_{s-}) \Big]\mathrm{d}s + M_t, \qquad (108)$$
where $x_{s-} = \lim_{u\to s^-} x_u$, which exists for Lebesgue-almost every $s\in[0,T)$ since there are finitely many jumps almost surely. The compensation process $\{M_u\}_{u\in[\ell,t]}$ in Eqn. (108) is defined as
$$M_u = \sum_{y_s:\, d_H(y_s,x_s)=1}\int_\ell^u\big( f(s,y_s) - f(s,x_s) \big)\big( \mathrm{d}N_s^{x_s,y_s} - \lambda_s^{x_s,y_s}\,\mathrm{d}s \big), \qquad (109)$$
where $N_s^{x,y}$ is the counting process of jumps from $x$ to $y$, $\mathrm{d}N_s^{x,y}$ is the associated random counting measure, and $\lambda_s^{x,y} = S^{-1}\, s_{T-s}(y,x)\,\mathbb{1}\{x_{s-} = x\}$ is the intensity of the process $N_s^{x,y}$. Since $x_{s-} = x_s$ for almost every $s\in(0,T)$ (again due to the almost-sure finiteness of the number of jumps along each path), we can rewrite Eqn. (108) as
$$f(t,x_t) - f(\ell,x_\ell) = \int_\ell^t\Big[ \partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) \Big]\mathrm{d}s + M_t. \qquad (110)$$
To further simplify the right-hand side, we assert that
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = 0. \qquad (111)$$
To see this, first recall the definition (107); direct calculations give
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = \frac{\partial}{\partial s}\Big( \frac{q_{T-s}(x_s\oplus_i c)}{q_{T-s}(x_s)} \Big) + \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_{i'}c', x_s)\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s) \Big)$$
$$\overset{(a)}{=} \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_i c, x_s)\Big( s_{T-s}(x_s\oplus_{i'}c', x_s) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s\oplus_i c \big) \Big) + \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_{i'}c', x_s)\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s) \Big)$$
$$= \frac{1}{S}\sum_{i',c'}\Big( s_{T-s}(x_s\oplus_i c, x_s)\, s_{T-s}(x_s\oplus_{i'}c', x_s) - s_{T-s}(x_s\oplus_{i'}c', x_s)\, s_{T-s}(x_s\oplus_i c, x_s) \Big) + \frac{1}{S}\sum_{i',c'}\Big( s_{T-s}(x_s\oplus_{i'}c', x_s)\, s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s)\, s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s\oplus_i c \big) \Big)$$
$$\overset{(b)}{=} \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]}\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s \big) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s \big) \Big),$$
where in equality (a) we apply the Kolmogorov forward equation to $q_{T-t}$, and in equality (b) we use the fact that $s_{T-t}(x,y)\, s_{T-t}(y,z) = s_{T-t}(x,z)$ for any $x,y,z\in\mathcal{X}$. It is direct to check that the $\oplus$ operators commute, i.e., $x_s\oplus_{i'}c'\oplus_i c = x_s\oplus_i c\oplus_{i'}c'$ for $i\ne i'$, and the relation holds trivially when $i = i'$. This directly reveals that
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]}\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s \big) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s \big) \Big) = 0,$$
which completes the proof of Eqn. (111).

Taking $u = \ell$ in Eqn. (109), we have $M_\ell = 0$ almost surely, and $M_u$ is a local martingale for $u\in[\ell,t]$ by definition. Recalling Lemma 10, we can bound
$$\sup_{s\in[\ell,t]}\sup_{x\in\mathcal{X}} f(s,x) \le \exp\Big( \log(S) + \max\big\{ \log\big( (T-t)^{-1} \big),\, 0 \big\} \Big) < \infty.$$
Similarly, the intensity of the counting process satisfies
$$\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}}\lambda_s^{x,y} \le \frac{1}{S}\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}} s_{T-s}(y,x) \le \frac{1}{S}\exp\Big( \log(S) + \max\big\{ \log\big( (T-t)^{-1} \big),\, 0 \big\} \Big) < \infty.$$
Now it is direct to check that
$$\sup_{s\in[\ell,t]}\mathbb{E}\big[ |M_s| \big] \lesssim (t-\ell)\, d(S-1)\cdot\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}}\big[ f(s,x)\cdot\lambda_s^{x,y} \big] < \infty.$$
As a result, we conclude that $\{M_u\}_{u\in[\ell,t]}$ is $L^1$ and hence a martingale. By the definition of a martingale,
$$\mathbb{E}_{\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = M_\ell = 0.$$
Returning to Eqn. (110), we obtain
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) - f(\ell,x_\ell) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = 0. \qquad (112)$$
Thus, we conclude that
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \big( s_{T-\ell}(x_\ell\oplus_i c, x_\ell) - s_{T-t}(x_t\oplus_i c, x_t) \big)\log\widehat{s}_{T-\ell}(x_\ell\oplus_i c, x_\ell) \Big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(\ell,x_\ell) - f(t,x_t) \big]\cdot\log\widehat{s}_{T-\ell}(x_\ell\oplus_i c, x_\ell) = 0,$$
where we plug in Eqn. (112) in the last line.

E.3 Proof of Lemma 9

The proof of Lemma 9 follows directly from exchanging the order of summation. Specifically, with $h(z) := z\log z - z + 1$, we can write
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} h\big( s_{T-t}(y_t,x_t) \big) \Big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} s_{T-t}(y_t,x_t)\log s_{T-t}(y_t,x_t) - s_{T-t}(y_t,x_t) + 1 \Big]$$
$$= \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\frac{q_{T-t}(y_t)}{q_{T-t}(x_t)}\log s_{T-t}(y_t,x_t) \Big] - \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\frac{q_{T-t}(y_t)}{q_{T-t}(x_t)} \Big] + d(S-1)$$
$$= \sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t)\log s_{T-t}(y_t,x_t) - \sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t) + d(S-1)$$
$$\overset{(a)}{=} -\sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(x_t)\log s_{T-t}(y_t,x_t) - \sum_{y_t\in[S]^d}\sum_{x_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t) + d(S-1)$$
$$= -\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\log s_{T-t}(y_t,x_t) \Big] - d(S-1) + d(S-1) = \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} -\log s_{T-t}(y_t,x_t) \Big],$$
where in equality (a) we switch the roles of $x_t$ and $y_t$ in the summations.

E.4 Proof of Lemma 10

Lemma 10 is a direct consequence of Liang et al. (2025c, Lemma 2). Here, we present a simplified proof based on Proposition 6. It can be easily checked that
$$\alpha_t = \frac{1-e^{-t}}{1+(S-1)e^{-t}} \in (0,1).$$
By Eqn. (59), one has, for $d_H(x,y) = 1$,
$$s_t(y,x) = \frac{\mathbb{E}_{x_0\sim q_0}\,\alpha_t^{d_H(y,x_0)}}{\mathbb{E}_{x_0\sim q_0}\,\alpha_t^{d_H(x,x_0)}} \le \alpha_t^{-\sup_{y,x,x_0}|d_H(y,x_0)-d_H(x,x_0)|} = \exp\Big( -\log(\alpha_t)\cdot\sup_{y,x,x_0}\big| d_H(y,x_0) - d_H(x,x_0) \big| \Big) \le \exp\Big( -\log(\alpha_t)\cdot\sup_{y,x} d_H(y,x) \Big) = \exp\big( -\log(\alpha_t) \big),$$
where the last step uses $d_H(y,x) = 1$. With a similar calculation, one can establish the reversed inequality
$$s_t(y,x) \ge \exp\Big( \log(\alpha_t)\cdot\sup_{y,x} d_H(y,x) \Big) = \exp\big( \log(\alpha_t) \big).$$
As a result, we conclude
$$\big| \log s_t(y,x) \big| \le -\log(\alpha_t) \lesssim \log(S) + \max\big\{ \log(t^{-1}),\, 0 \big\}.$$
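The bound in Lemma 10 can be sanity-checked numerically in small dimension. The sketch below takes the representation displayed above from Eqn. (59), $s_t(y,x) = \mathbb{E}_{x_0}\alpha_t^{d_H(y,x_0)}/\mathbb{E}_{x_0}\alpha_t^{d_H(x,x_0)}$ (an assumed kernel form read off the formulas above, not an official implementation), draws a random $q_0$, and verifies $|\log s_t(y,x)| \le -\log\alpha_t$ over all Hamming-neighbor pairs.

```python
import itertools
import math
import random

random.seed(0)
d, S, t = 3, 4, 0.7
alpha = (1 - math.exp(-t)) / (1 + (S - 1) * math.exp(-t))

# Random target distribution q0 on [S]^d.
states = list(itertools.product(range(S), repeat=d))
w = [random.random() for _ in states]
tot = sum(w)
q0 = [v / tot for v in w]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def q_alpha(x):
    """Unnormalized q_t(x): E_{x0 ~ q0} alpha^{d_H(x, x0)} (assumed form from Eqn. (59))."""
    return sum(p * alpha ** hamming(x, x0) for p, x0 in zip(q0, states))

bound = -math.log(alpha)
worst = 0.0
for x in states:
    for i in range(d):
        for c in range(S):
            if c == x[i]:
                continue
            y = x[:i] + (c,) + x[i + 1:]  # Hamming neighbor of x at coordinate i
            worst = max(worst, abs(math.log(q_alpha(y) / q_alpha(x))))
assert worst <= bound + 1e-12
```

The check passes for any choice of $q_0$ because $|d_H(y,x_0) - d_H(x,x_0)| \le d_H(y,x) = 1$ pointwise, which is exactly the mechanism of the proof above.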
E.5 Proof of Lemma 11

Let us start by proving the first equation, i.e., $\varphi_{i,c}(t) = \mathsf{KL}\big( q_t \,\|\, (N_{i,-c})_\# q_t \big)$. Recall the definition of $\varphi_{i,c}(t)$:
$$\varphi_{i,c}(t) = \mathbb{E}_{x_t\sim q_t}\Big[ -\log\frac{q_t\big( N_{i,c}(x_t) \big)}{q_t(x_t)} \Big] = \sum_{x\in\mathcal{X}} q_t(x)\log\frac{q_t(x)}{q_t\big( N_{i,c}(x) \big)}. \qquad (113)$$
As in Eqn. (81), the pushforward measure satisfies $(N_{i,-c})_\# q_t(x) = q_t\big( N_{i,c}(x) \big)$ for any $x\in\mathcal{X}$. As such, we can express Eqn. (113) as
$$\varphi_{i,c}(t) = \sum_{x\in\mathcal{X}} q_t(x)\log\frac{q_t(x)}{(N_{i,-c})_\# q_t(x)} = \mathsf{KL}\big( q_t \,\|\, (N_{i,-c})_\# q_t \big),$$
which proves the first equation.

For the second relation, the definition of the KL divergence gives
$$-\frac{\partial}{\partial t}\mathsf{KL}(q_t\,\|\,p_0) = -\frac{\partial}{\partial t}\sum_{x\in[S]^d} q_t(x)\log q_t(x) = -\sum_{x\in[S]^d}\frac{\mathrm{d}q_t(x)}{\mathrm{d}t}\big( \log q_t(x) + 1 \big) = -\sum_{x\in[S]^d}\frac{\mathrm{d}q_t(x)}{\mathrm{d}t}\log q_t(x). \qquad (114)$$
Using the Kolmogorov forward equation for the forward noising process, we have
$$\frac{\mathrm{d}q_t(x)}{\mathrm{d}t} = \sum_{y\in\mathcal{X}} Q(y,x)\, q_t(y) = \frac{1}{S}\sum_{y:\, d_H(y,x)=1} q_t(y) - \frac{d(S-1)}{S}\, q_t(x).$$
Plugging this into Eqn. (114), we arrive at
$$-\frac{\partial}{\partial t}\mathsf{KL}(q_t\,\|\,p_0) = -\sum_{x\in[S]^d}\Big( \sum_{y:\, d_H(y,x)=1}\frac{1}{S}\, q_t(y) - \frac{d(S-1)}{S}\, q_t(x) \Big)\log q_t(x) = -\frac{1}{S}\sum_{x\in[S]^d}\sum_{y:\, d_H(y,x)=1}\big( q_t(y) - q_t(x) \big)\log q_t(x) = -\frac{1}{S}\sum_{x\in[S]^d}\sum_{y:\, d_H(y,x)=1} q_t(x)\big( \log q_t(y) - \log q_t(x) \big) = \varphi(t).$$
In addition, recall $\varphi(t) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\varphi_{i,c}(t)$. We reach
$$\varphi(t) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\mathsf{KL}\big( q_t\,\|\,(N_{i,c})_\# q_t \big).$$

E.6 Proof of Lemma 12

Let $L$ be the time-homogeneous infinitesimal generator of the forward process. Since each coordinate $i\in[d]$ is updated independently in the forward process, we can write $L = L_i + L_{-i}$, where $L_i$ only updates coordinate $i$ and $L_{-i}$ updates all other coordinates. It is direct to show that $L_i$ and $L_{-i}$ commute; therefore, for any $u\ge 0$,
$$q_{t+u} = q_t e^{uL_i} e^{uL_{-i}}, \qquad (N_{i,-c})_\# q_{t+u} = \big( (N_{i,-c})_\# q_t \big) e^{uL_i} e^{uL_{-i}},$$
where the second equation holds because the operator $N_{i,-c}$ commutes with the semigroup $\{e^{uL}\}_{u\ge 0}$. With this formulation, we reach
$$\varphi_{i,c}(t+u) = \mathsf{KL}\big( q_{t+u}\,\|\,(N_{i,-c})_\# q_{t+u} \big) \le \mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big), \qquad (115)$$
where in the last inequality we apply the (weak) data processing inequality for the KL divergence to remove the common channel $e^{uL_{-i}}$. Since both $N_{i,-c}$ and $L_i$ operate only on coordinate $i$, we arrive at the decomposition
$$\mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) = \mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\, e^{uL_i} \,\big\|\, \big( (N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) e^{uL_i} \big) \Big], \qquad (116)$$
where $(q_t)_{-i}$ is the marginal distribution of $q_t$ with coordinate $i$ excluded. Define $K_u$ to be the transition kernel on $[S]\times[S]$ induced by $e^{uL_i}$; one can show that
$$K_u(v_1,v_2) = \begin{cases} \frac{1}{S} + \big( 1 - \frac{1}{S} \big) e^{-u}, & \text{if } v_1 = v_2;\\ \frac{1}{S}\big( 1 - e^{-u} \big), & \text{if } v_1\ne v_2.\end{cases}$$
It can be directly checked that $K_u$ is an $S$-ary symmetric channel with noise scale $\sigma_u = (1-S^{-1})(1-e^{-u})$. By Makur and Polyanskiy (2018, Proposition 12), a strong data processing inequality holds for the channel $K_u$: for any distributions $p,q$ supported on $[S]$,
$$\mathsf{KL}\big( p\, e^{uL_i}\,\|\,q\, e^{uL_i} \big) \le \eta_{\mathsf{KL}}(K_u)\,\mathsf{KL}(p\,\|\,q),$$
where $\eta_{\mathsf{KL}}(K_u)$ satisfies
$$\eta_{\mathsf{KL}}(K_u) \le \Big| 1 - \sigma_u - \frac{\sigma_u}{S-1} \Big| = 1 - \frac{S}{S-1}\big( 1 - S^{-1} \big)\big( 1 - e^{-u} \big) = e^{-u}.$$
Combining this strong data processing inequality with Eqn. (116) yields
$$\mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) = \mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\, e^{uL_i}\,\big\|\,\big( (N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) e^{uL_i} \big) \Big] \le e^{-u}\,\mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\,\big\|\,(N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) \Big] = e^{-u}\,\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big).$$
Then, by Eqn. (115), we have
$$\varphi_{i,c}(t+u) \le \mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) \le e^{-u}\,\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big) = e^{-u}\,\varphi_{i,c}(t),$$
which holds for any $u\ge 0$. Therefore, the derivative can be bounded as
$$\varphi'_{i,c}(t) = \lim_{u\to 0^+}\frac{\varphi_{i,c}(t+u) - \varphi_{i,c}(t)}{u} \le \lim_{u\to 0^+}\frac{e^{-u}-1}{u}\,\varphi_{i,c}(t) = -\varphi_{i,c}(t),$$
which yields the desired result $-\varphi'_{i,c}(t) \ge \varphi_{i,c}(t)$.

E.7 Proof of Lemma 14

The proof follows (Conforti et al., 2025, Lemma 5.2.2); we include the argument below for completeness. Define
$$f(t,x_t) := s_{T-t}\big( x_t\odot_i c,\, x_t \big)\,\mathbb{1}\{ i\in m(x_t) \},$$
where the dependence on $i$ and $c$ is omitted for simplicity. In view of Lemma 6, for $0\le\ell\le t< T$, we can write
$$f(t,x_t) = f(\ell,x_\ell) + \int_\ell^t\Big[ \partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) \Big]\mathrm{d}s + M_t,$$
with the generator $\{\overleftarrow{L}_s\}_{s\in[\ell,t]}$ given by
$$\big( \overleftarrow{L}_s f \big)(s,x) = \sum_{y\ne x} Q_{T-s}(y,x)\, s_{T-s}(y,x)\big( f(s,y) - f(s,x) \big) = \sum_{i'\in m(x)}\sum_{c'\in[S]} s_{T-s}(x\odot_{i'}c', x)\big( f(s, x\odot_{i'}c') - f(s,x) \big),$$
and the compensation process $\{M_u\}_{u\in[\ell,t]}$ defined as
$$M_u = \int_\ell^u\sum_{i'\in m(x_s)}\sum_{c'\in[S]}\big( f(s, x_s\odot_{i'}c') - f(s,x_s) \big)\big( \mathrm{d}N_s^{x_s,\, x_s\odot_{i'}c'} - \lambda_s^{x_s,\, x_s\odot_{i'}c'}\,\mathrm{d}s \big).$$
By a similar argument as in the proof of Lemma 8, one has $\mathbb{E}_{\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = 0$, which leads to
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) - f(\ell,x_\ell) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \int_\ell^t\big( \partial_t f(s,x_s) + (\overleftarrow{L}_s f)(s,x_s) \big)\,\mathrm{d}s \Big].$$
Taking the derivative with respect to $t$ on both sides, we arrive at
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \partial_t f(t,x_t) + \big( \overleftarrow{L}_t f \big)(t,x_t) \Big].$$
Now let us consider each term on the right-hand side separately. By Proposition 6, it obeys
$$s_{T-t}\big( x_t\odot_i c,\, x_t \big) = \frac{1}{e^{T-t}-1}\cdot\frac{q_0(x_t\odot_i c)}{q_0(x_t)},$$
and hence
$$\partial_t f(t,x_t) = \frac{e^{T-t}}{e^{T-t}-1}\, f(t,x_t).$$
Next, direct calculations yield
$$\big( \overleftarrow{L}_t f \big)(t,x_t) = \sum_{i'\in m(x_t)}\sum_{c'\in[S]} s_{T-t}(x_t\odot_{i'}c', x_t)\Big( s_{T-t}\big( x_t\odot_i c\odot_{i'}c',\, x_t\odot_{i'}c' \big)\,\mathbb{1}\{ i\in m(x_t\odot_{i'}c') \} - s_{T-t}(x_t\odot_i c, x_t)\,\mathbb{1}\{ i\in m(x_t) \} \Big)$$
$$= \frac{1}{e^{T-t}-1}\, f(t,x_t)\Big( \sum_{i'\in m(x_t)\setminus\{i\}}\sum_{c'\in[S]}\frac{q_0(x_t\odot_i c\odot_{i'}c')}{q_0(x_t\odot_i c)} - \sum_{i'\in m(x_t)}\sum_{c'\in[S]}\frac{q_0(x_t\odot_{i'}c')}{q_0(x_t)} \Big) = \frac{1}{e^{T-t}-1}\, f(t,x_t)\big( |m(x_t)\setminus\{i\}| - |m(x_t)| \big) = -\frac{1}{e^{T-t}-1}\, f(t,x_t).$$
Putting everything together leads to
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big],$$
and therefore
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = e^{t-\ell}\cdot f(\ell,x_\ell).$$
Finally, in view of the relation
$$\Pr\big( x_t^i = \mathrm{MASK}\,\big|\, x_\ell^i = \mathrm{MASK} \big) = \frac{1 - e^{-(T-t)}}{1 - e^{-(T-\ell)}},$$
we conclude the following:
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ s_{T-t}(x_t\odot_i c, x_t)\,\mathbb{1}\{ i\in m(x_t) \} \big] = e^{t-\ell}\cdot s_{T-\ell}(x_\ell\odot_i c, x_\ell)\,\mathbb{1}\{ i\in m(x_\ell) \} = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ s_{T-t}(x_\ell\odot_i c, x_\ell)\,\mathbb{1}\{ i\in m(x_t) \} \big],$$
which completes the proof of the desired result.

F Proofs of the auxiliary lemmas

F.1 Proof of Lemma 2

For a continuous random variable $U$ in $\mathbb{R}^d$ with density $p_U$ with respect to the Lebesgue measure, define the differential entropy of $U$ as
$$H_{\mathrm{diff}}(U) = -\int_{\mathbb{R}^d} p_U\log(p_U)\,\mathrm{d}x, \qquad (117)$$
where we again adopt the convention $0\log(0) = 0$.
By definition of m utual information, we ha ve I p W ; W ` ε noise q “ H diff p W ` ε noise q ´ H diff p W ` ε noise | W q p a q “ H diff p W ` ε noise q ´ E w “ H diff p w ` ε noise | W “ w q ‰ p b q “ H diff p W ` ε noise q ´ E w “ H diff p ε noise | W “ w q ‰ p c q “ H diff p W ` ε noise q ´ H diff p ε noise q , (118) where in (a), w e use the c hain rule of differen tial entrop y; in (b), w e apply the translation in v ariance prop ert y , i.e., H diff p U q “ H diff p c 0 ` U q for any constan t c 0 ; in (c), we use the condition that ε noise K K W . Denote the Gaussian densit y function with mean 0 and v ariance σ 2 I d as ϕ σ p¨q . Since ε noise „ N p 0 , σ 2 I d q , w e can compute with Eqn. ( 117 ) that H diff p ε noise q “ ´ ż R d ϕ σ p x q log ` ϕ σ p x q ˘ d x “ ´ ż R d ϕ σ p x q ˆ ´ d 2 log p 2 π σ 2 q ´ } x } 2 2 2 σ 2 ˙ d x “ d 2 log p 2 π σ 2 q ` E r} ε noise } 2 2 s 2 σ 2 “ d 2 log p 2 π eσ 2 q , (119) 56 where } ¨ } 2 is the Euclidean norm in R d . F or H diff p W ` ε noise q , notice that V ar “ W ` ε noise ‰ “ V ar r W s ` V ar r ε noise s ` 2 Cov “ W , ε noise ‰ “ V ar r W s ` σ 2 I d . By Cov er ( 1999 , P age 255), for distributions with the same finite v ariance, H diff is maximized at the centered Gaussian random v ariable. The refore, w e ha ve H diff p W ` ε noise q ď H diff ´ N ` 0 , V ar r W s ` σ 2 I d ˘ ¯ “ d 2 log p 2 π e q ` 1 2 log ` det ` V ar r W s ` σ 2 I d ˘˘ , where det p¨q is the determinant of matrices, and the calculation is the same as in Eqn. ( 119 ). Since V ar r W s is a positive semidefinite matrix, w e can apply the matrix inequality that log ` det ` V ar r W s ` σ 2 I d ˘˘ “ d log p σ 2 q ` log ` det ` I d ` V ar r W { σ 2 s ˘˘ ď d log p σ 2 q ` T r ` V ar r W { σ 2 s ˘ “ d log p σ 2 q ` T r p V ar r W sq σ 2 , whic h leads to H diff p W ` ε noise q ď d 2 log p 2 π eσ 2 q ` T r p V ar r W sq 2 σ 2 . (120) Plugging Eqns. ( 119 ) and ( 120 ) in to Eqn. 
( 118 ), we conclude that I p W ; W ` ε noise q ď d 2 log p 2 π eσ 2 q ` T r p V ar r W sq 2 σ 2 ´ d 2 log p 2 π eσ 2 q “ T r p V ar r W sq 2 σ 2 . F.2 Pro of of Lemma 3 F or X „ Bin p n, 1 { 2 q , its pmf is giv en by P p X “ x q “ ˆ n x ˙ ˆ 1 2 ˙ n 9 ˆ n x ˙ . Notice that our desired b ound is equiv alent to the following equation: ÿ x : x mo d 2 “ 0 ˆ n x ˙ “ ÿ x : x mo d 2 “ 1 ˆ n x ˙ , whic h follo ws from the binomial theorem for 0 “ p 1 ´ 1 q n . F.3 Pro of of Lemma 13 The CTMC ( 88 ) in the lemma statement can b e decomp osed into d indep enden t CTMCs for each co ordinate. F or co ordinates i such that x i t k ‰ MASK clearly neither Eqn. ( 88 ) nor Algorithm 1 makes a c hange. Next, w e fix i P m p x t k q . First, w e compute the probability that the i -th co ordinate remains masked for y t k ` 1 : Pr p y i t k ` 1 “ MASK | y t k q “ exp ¨ ˝ ż t k ` 1 t k ¨ ˝ ´ ÿ c Pr S s p s T ´ t k p y t k d i c, y t k q e T ´ t k ´ 1 e T ´ t ´ 1 ˛ ‚ d t ˛ ‚ “ exp p p Q i k p MASK q ∆ k q “ P k , where p Q i k p MASK q , ∆ k , and P k are defined in Algorithm 1 . Next, for c P r S s w e can write Pr p y i t k ` 1 “ c | x t k q “ Pr p x i t k ` 1 “ c | x t k and x i t k ` 1 ‰ MASK qp 1 ´ P k q . 57 Since for an y t P r t k , t k ` 1 q the rates p Q t p x, x d i c q are proportional to p Q i k p c q , we get that Pr p y i t k ` 1 “ c | x t k and y i t k ` 1 ‰ MASK q “ p Q i k p c q ř b Pr S s p Q i k p b q , whic h matc hes the expression in Algorithm 1 . Therefore, the distribution of y t k ` 1 defined by the CTMC matc hes the distribution of x t k ` 1 from the algorithm. F.4 Pro of of Lemma 15 In view of the definition of D p¨ , ¨q , one can write s T ´ t p x t d i c, x t q D p s T ´ t p x ℓ d i c, x ℓ q , s T ´ t p x t d i c, x t qq “ s T ´ t p x ℓ d i c, x ℓ q ´ s T ´ t p x t d i c, x t q ` s T ´ t p x t d i c, x t q log s T ´ t p x t d i c, x t q s T ´ t p x ℓ d i c, x ℓ q . 
The first two terms cancel out in expectation by Lemma 14; i.e., for any $c \in [S]$, one has
\[
\mathbb{E}_{x_t \sim \overleftarrow{q}_{t \mid \ell}(\cdot \mid x_\ell)} \Big[ \sum_{i \in m(x_t)} \big( s_{T-t}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) \big) \Big] = 0.
\]
Next, using Eqn. (60), we obtain
\[
\frac{s_{T-t}(x_t \odot_i c, x_t)}{s_{T-t}(x_\ell \odot_i c, x_\ell)} = \frac{q_0(x_t \odot_i c)\, q_0(x_\ell)}{q_0(x_t)\, q_0(x_\ell \odot_i c)}.
\]
Using this relation, we further bound
\[
\begin{aligned}
& \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t) \log \frac{q_0(x_t \odot_i c)\, q_0(x_\ell)}{q_0(x_t)\, q_0(x_\ell \odot_i c)} \\
&\qquad = \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \notin m(y_t)} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)} \\
&\qquad = \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)},
\end{aligned}
\tag{121}
\]
where in the second line, we used the definition of the score function along with the natural bijection between the sets $\{(x, i, c) : x \in \mathcal{X},\, i \in m(x),\, c \in [S]\}$ and $\{(y, i) : y \in \mathcal{X},\, i \notin m(y)\}$ to change the measure under the expectation:
\[
x_t \to y_t \odot_i \mathsf{MASK}, \qquad x_\ell \to y_\ell \odot_i \mathsf{MASK}, \qquad x_t \odot_i c \to y_t, \qquad x_\ell \odot_i c \to y_\ell \odot_i y^i_t.
\]
Note that since $y_\ell$ appears earlier in the backward process, $y^i_\ell$ can be masked or unmasked. Since the $i$-th element of $x_\ell \odot_i c$ is unmasked by construction, we explicitly set the $i$-th element of $y_\ell$ to $y^i_t$. The third line follows from the fact that, for $i \in m(y_t)$, the corresponding term is equal to zero.

Next, we define, for fixed $t$, $y_t$, and $i \in [d]$,
\[
f_i(y) := \log \frac{q_0(y \odot_i y^i_t)}{q_0(y \odot_i \mathsf{MASK})}.
\]
To further control the right-hand side of Eqn. (121), we invoke Dynkin's formula as described in Lemma 6 to obtain
\[
\begin{aligned}
& \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)}
= \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \big[ f_i(y_t) - f_i(y_\ell) \big] \\
&\qquad = \sum_{i \in [d]} \int_\ell^t \mathbb{E}_{y_v, y_t \sim \overleftarrow{q}_{v, t}} \sum_{j \notin m(y_v) \cup \{i\}} \log \frac{q_0(y_v \odot_i y^i_t)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_i y^i_t \odot_j \mathsf{MASK})}\, \mathrm{d}v \\
&\qquad \overset{(i)}{=} \sum_{i \neq j \in [d]} \int_\ell^t \mathbb{E}_{y_v, y_t \sim \overleftarrow{q}_{v, t}} \log \frac{q_0(y_v \odot_i y^i_t)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_i y^i_t \odot_j \mathsf{MASK})}\, \mathrm{d}v \\
&\qquad \overset{(ii)}{=} \sum_{i \neq j \in [d]} \int_\ell^t e^{t - v}\, \mathbb{E}_{y_v \sim \overleftarrow{q}_v} \log \frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})}\, \mathrm{d}v.
\end{aligned}
\tag{122}
\]
Here for part (i), as before, we extended the sum to include $j \in m(y_v) \setminus \{i\}$, since the additional terms are equal to zero. As $f_i(y)$ only depends on $d - 1$ coordinates of $y$ (all except the $i$-th), the constraint $j \neq i$ appears. Regarding part (ii), it follows from $\Pr\big(y^i_v \neq \mathsf{MASK} \mid y^i_t \neq \mathsf{MASK}\big) = e^{v - t}$.

Next, let $y_v^{-(i, j)}$ denote all unmasked elements of $y_v$ except the $i$-th and $j$-th. We can write
\[
\frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})} = \frac{q_0\big(y^i_v, y^j_v \mid y_v^{-(i,j)}\big)}{q_0\big(y^i_v \mid y_v^{-(i,j)}\big)\, q_0\big(y^j_v \mid y_v^{-(i,j)}\big)},
\]
and thus,
\[
\sum_{i \neq j \in [d]} \int_\ell^t e^{t - v}\, \mathbb{E}_{y_v \sim \overleftarrow{q}_v} \log \frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})}\, \mathrm{d}v = \sum_{i \neq j} \int_\ell^t e^{t - v}\, I\big( y^i_v;\, y^j_v \mid y_v^{-(i,j)} \big)\, \mathrm{d}v = \int_\ell^t e^{t - v}\, I(T - v)\, \mathrm{d}v,
\tag{123}
\]
since $y_v \sim q_{T - v}$. Combining Eqns. (121), (122), and (123) concludes the proof.
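Two elementary ingredients of this appendix lend themselves to a quick numerical sanity check: the mutual-information bound $I(W; W + \varepsilon_{\mathrm{noise}}) \le \mathsf{Tr}(\mathsf{Var}[W])/(2\sigma^2)$ (which, in the special case of Gaussian $W$, can be compared against the closed-form value $\tfrac{1}{2}\log\det(I_d + \mathsf{Var}[W]/\sigma^2)$), and the parity identity behind Lemma 3. The sketch below is illustrative only and not part of the argument; it assumes NumPy, and the specific dimension, seed, and noise level are arbitrary choices.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, sigma = 5, 0.7

# A random positive semidefinite covariance playing the role of Var[W].
A = rng.standard_normal((d, d))
Sigma = A @ A.T

# For Gaussian W ~ N(0, Sigma) and independent eps ~ N(0, sigma^2 I_d),
# the mutual information I(W; W + eps) has the closed form
# 0.5 * log det(I_d + Sigma / sigma^2).
_, logdet = np.linalg.slogdet(np.eye(d) + Sigma / sigma**2)
mi_exact = 0.5 * logdet

# The bound from the proof: I(W; W + eps) <= Tr(Var[W]) / (2 sigma^2),
# which for the Gaussian case reduces to log det(I + M) <= Tr(M).
mi_bound = np.trace(Sigma) / (2 * sigma**2)
assert mi_exact <= mi_bound + 1e-12

# Lemma 3: for X ~ Bin(n, 1/2), the even and odd outcomes carry equal
# mass, i.e. the even and odd binomial sums agree (both equal 2^(n-1)),
# which is the alternating-sum identity 0 = (1 - 1)^n.
n = 11
even = sum(comb(n, x) for x in range(0, n + 1, 2))
odd = sum(comb(n, x) for x in range(1, n + 1, 2))
assert even == odd == 2 ** (n - 1)
```

The Gaussian comparison also illustrates why the bound is tight only in the small-signal regime: $\log\det(I_d + M) \approx \mathsf{Tr}(M)$ when the eigenvalues of $M = \mathsf{Var}[W]/\sigma^2$ are small, while for large eigenvalues the logarithm makes the exact value much smaller than the trace bound.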