Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees

Daniil Dmitriev*†, Zhihan Huang*†, Yuting Wei†

February 17, 2026

*Equal contribution, alphabetical order.
†Department of Statistics and Data Science, the Wharton School, University of Pennsylvania; email: {daniild,zhihanh,ytwei}@wharton.upenn.edu

Abstract

Diffusion models over discrete spaces have recently shown striking empirical success, yet their theoretical foundations remain incomplete. In this paper, we study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation, with a focus on τ-leaping-based samplers. We establish sharp convergence guarantees for attaining ε accuracy in Kullback–Leibler (KL) divergence for both uniform and masking noising processes. For uniform discrete diffusion, we show that the τ-leaping algorithm achieves an iteration complexity of order $\widetilde{O}(d/\varepsilon)$, with d the ambient dimension of the target distribution, eliminating linear dependence on the vocabulary size S and improving existing bounds by a factor of d; moreover, we establish a matching algorithmic lower bound showing that linear dependence on the ambient dimension is unavoidable in general. For masking discrete diffusion, we introduce a modified τ-leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity, termed the effective total correlation, which is bounded by $d \log S$ but can be sublinear or even constant for structured data. As a consequence, the sampler provably adapts to low-dimensional structure without prior knowledge or algorithmic modification, yielding sublinear convergence rates for various practical examples (such as hidden Markov models, image data, and random graphs). Our analysis requires no boundedness or smoothness assumptions on the score estimator beyond control of the score entropy loss.

Contents

1 Introduction
  1.1 Sampling efficiency and adaptivity
  1.2 Our contributions
  1.3 Notation
2 Preliminaries of discrete diffusion models
  2.1 A continuous-time Markov chain formulation
  2.2 Score estimation
  2.3 Score-based sampling algorithms
3 Main results
  3.1 Uniform discrete diffusion
  3.2 Masking discrete diffusion
4 Discussion
A Examples of low intrinsic dimensions
  A.1 Details and formal results
  A.2 Proofs of results in Section A.1
B Technical preparations
  B.1 Score functions
  B.2 Technical lemmas
C Proofs of results in Section 3.1
  C.1 Proof of Theorem 1
  C.2 Proof of Corollary 1
  C.3 Proof of Theorem 2
  C.4 Efficient sampling for high-entropy distributions
D Proofs of results in Section 3.2
  D.1 Proof of Theorem 3
  D.2 Proof of Corollary 2
  D.3 τ-leaping for masking discrete diffusion
E Proofs of the main lemmas
  E.1 Characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$
  E.2 Proof of Lemma 8
  E.3 Proof of Lemma 9
  E.4 Proof of Lemma 10
  E.5 Proof of Lemma 11
  E.6 Proof of Lemma 12
  E.7 Proof of Lemma 14
F Proofs of the auxiliary lemmas
  F.1 Proof of Lemma 2
  F.2 Proof of Lemma 3
  F.3 Proof of Lemma 13
  F.4 Proof of Lemma 15

1 Introduction

Diffusion models have recently emerged as state-of-the-art approaches for high-fidelity image generation and video synthesis (Dhariwal and Nichol (2021); Ho et al.
(2020, 2022); Song and Ermon (2019)), and have already led to significant scientific advances in various domains, including climate modeling, protein structure prediction, and materials science (Li et al. (2024); Watson et al. (2023); Zeni et al. (2025)). At their core, diffusion models are built upon two stochastic processes: a forward process that gradually corrupts the data distribution into pure noise, and a reverse process that generates samples by learning the logarithmic gradient of the perturbed marginals, commonly referred to as the score function.

Despite their broad empirical success, diffusion models have been predominantly developed for continuous data. Their extension to discrete domains, such as natural language, graph-structured data, and categorical labels, has long remained challenging, although it was already discussed in Sohl-Dickstein et al. (2015). This perspective began to shift following the seminal work of Austin et al. (2021), which revealed the promise of diffusion-based approaches in discrete settings. Analogous to the continuous case, discrete diffusion models rely on a pair of noisy forward and reverse processes, with sampling achieved by learning appropriate ratios of distributions. Among recent developments (Bach and Saremi (2025); Campbell et al. (2022); Ou et al. (2025); Sahoo et al. (2024)), score-entropy discrete diffusion (SEDD) has demonstrated striking performance in text generation (Lou et al. (2024)), challenging the long-standing dominance of autoregressive language models. In contrast to autoregressive approaches, diffusion-based language models are not constrained to a fixed generation order (such as left-to-right), and they naturally lend themselves to more flexible forms of controlled generation, including conditional and structured text synthesis.

The promise of discrete diffusion models has spurred growing interest in their theoretical foundations.
A particularly influential line of work formulates discrete diffusion through the lens of continuous-time Markov chains (CTMCs) (Campbell et al., 2022), in which the forward dynamics is governed by a carefully designed rate matrix, and the backward dynamics is approximated via a learned score function. Among the proposed constructions, two choices have emerged as especially prominent: the uniform rate matrix, which induces a uniform stationary distribution for the forward process, and the absorbing rate matrix, which yields a degenerate stationary distribution with an absorbing state. In practice, the performance of the resulting samplers depends sensitively on the choice of the rate matrix (Lou et al. (2024); von Rütte et al. (2025)). Correspondingly, two parallel lines of work have sought to understand the sampling efficiency of discrete diffusion models (specifically, the number of steps required to produce sufficiently accurate samples) under these respective constructions. Representative results include Chen and Ying (2025); Liang et al. (2025c); Pham et al. (2025); Ren et al. (2025); Zhang et al. (2025) for uniform diffusion, and Chen et al. (2025); Conforti et al. (2025); Li and Cai (2025); Liang et al. (2025b); Park et al. (2025) for masking diffusion (also referred to as absorbing diffusion).

Existing theoretical analyses for score-based discrete diffusions suggest that convergence rates typically scale at least linearly with both the vocabulary size S and the ambient dimension d. Such scaling can quickly become prohibitive in applications; for instance, in GPT-2-based tasks, the vocabulary size is $S = 50{,}257$ and the dimension is $d = 10^2 \sim 10^3$ (Lou et al., 2024). These considerations naturally motivate a fundamental question:

How efficient are discrete diffusion models? When is sublinear convergence possible?
1.1 Sampling efficiency and adaptivity

To put our discussion in context, there has been substantial progress in understanding the sampling efficiency of continuous diffusion models. Seminal work by Chen et al. (2023b) characterizes the iteration complexity of the DDPM sampler under Lipschitz (or smoothness) assumptions on the score functions across all steps. Subsequent studies significantly relax these conditions and establish convergence guarantees for broader classes of continuous distributions (Benton et al., 2024; Chen et al., 2023a; Li et al., 2023). Nevertheless, it is now well understood that for general distributions, a linear dependence on the ambient dimension d is unavoidable. By contrast, when the target distribution exhibits additional structure, such as Gaussian mixture models or support on low-dimensional manifolds, a growing body of work shows that popular samplers can adaptively exploit intrinsic low-dimensional geometry, achieving improved efficiency without explicit dimension reduction (see, e.g., Huang et al. (2024); Li et al. (2025); Li and Yan (2024); Liang et al. (2025a)).

The landscape shifts considerably as we move to discrete diffusion models. Under the CTMC formulation, algorithms such as Gillespie's method and uniformization allow for exact simulation of the reverse process, free of discretization error (Chen and Ying, 2025; Gillespie, 1976; Van Dijk, 1992). However, these methods suffer from high computational costs in high-dimensional settings. Moreover, their convergence guarantees are inherently stochastic, as they depend on a random number of transitions. An alternative and widely adopted approach, particularly in diffusion-based language models, is provided by τ-leaping and its variants, including truncated τ-leaping (Campbell et al., 2022; Gillespie, 2001).
Originally developed in chemical kinetics, τ-leaping replaces sequential state transitions with parallel updates across coordinates, offering substantial computational gains in large systems. Yet our theoretical understanding of τ-leaping remains incomplete. Current state-of-the-art results exhibit at least linear dependence on the vocabulary size S, linear dependence on d for the absorbing case, and quadratic dependence on d for the uniform case (Conforti et al. (2025); Liang et al. (2025b,c)); see Table 1 for more details. It remains an open question whether these dependencies are fundamental information-theoretic barriers or merely analytical artifacts.

Furthermore, as in the continuous setting, an ideal sampling algorithm should automatically adjust to the intrinsic difficulty of the target distribution. For example, one would expect substantially faster convergence for Dirac delta measures or uniform target distributions, without prior knowledge of the structure or modifications to the algorithm. Existing analyses of τ-leaping do not illuminate whether such adaptivity is possible. More specifically, we aim to address the following question:

Can score-based samplers automatically adapt to structured target distributions?

1.2 Our contributions

The contributions of this work are centered on establishing sharp convergence guarantees for discrete diffusion models, bridging the gap between empirical success and theoretical understanding. Specifically, our contributions are mainly threefold:

• Optimal rates for uniform diffusion: We establish that for the uniform diffusion process, the τ-leaping sampler requires only $\widetilde{O}(d/\varepsilon)$ discretization steps to achieve an ε-error in KL divergence. This result significantly sharpens the previously best-known bound of $\widetilde{O}(d^2 S/\varepsilon)$ (Liang et al., 2025c), effectively removing a factor of d and the dependence on the vocabulary size S.
• Fundamental lower bounds: We demonstrate that the linear dependence on the dimension d is essentially unimprovable for the τ-leaping algorithm. Specifically, we show that under uniform diffusion, an $o(d)$ complexity bound is unattainable unless the target distribution is already proximal to the uniform measure. This result characterizes a fundamental price of sampling for informative distributions.

• Adaptivity for masking diffusion: For the masking diffusion process, we introduce a refined τ-leaping sampler whose complexity is governed by $\widetilde{O}(D/\varepsilon)$, where D is the effective total correlation, an information-theoretic measure of the target distribution's intrinsic complexity. Notably, while D is always bounded by the classical total correlation and the dual total correlation (and thus by $d \log S$), it can be sublinear or even $O(1)$ for highly structured data, allowing our sampler to adapt automatically to low-dimensional target distributions.

In contrast to prior work, our upper bounds do not require boundedness of the score estimator or any auxiliary regularity assumptions beyond control of the score entropy loss. The key technical ingredients include a Girsanov change-of-measure argument, combined with establishing the martingale properties of the sampling dynamics. The latter effectively separates the approximation error from the discretization error, allowing each to be analyzed independently. For the lower bound, we leverage a log-Sobolev inequality together with a strong data-processing inequality along the uniform noising process. To demonstrate the scope of our adaptivity results for masking discrete diffusion, we present several examples whose analysis requires careful constructions and delicate control of information-theoretic quantities, which may be of independent interest.

1.3 Notation

For a positive integer n, we define $[n] := \{1, \ldots
, n\}$ and let $I_n \in \mathbb{R}^{n \times n}$ denote the identity matrix. Let $d > 0$ denote the number of dimensions, $S > 0$ the vocabulary size, and $T > 0$ the time horizon. Let MASK denote a special value outside of $[S]$. Let $\mathcal{X} := \mathcal{V}^d$ denote the domain, where, depending on the context, $\mathcal{V} := [S]$ or $\mathcal{V} := [S] \cup \{\mathrm{MASK}\}$. We denote the set of all distributions on $\mathcal{X}$ by $\mathcal{P}(\mathcal{X})$. Let H, KL, and I denote entropy, Kullback–Leibler (KL) divergence, and mutual information, respectively. Let $\delta_x$ denote the Dirac measure at point x. We adopt the standard asymptotic notation $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, $\lesssim$, and $\ll$. Additionally, $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ are defined analogously, except that the logarithmic dependency on d, S, 1/ε, and 1/δ is hidden. For a vector $x = (x_1, x_2, \ldots, x_d) \in \mathcal{X}$, $i \in [d]$, and $c \in \mathcal{V}$, we define the vectors $x_{-i} := (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d)$, and $x \oplus_i c,\, x \odot_i c \in \mathcal{X}$ as follows:

• $(x \oplus_i c)_j = x_j$ for all $j \neq i$, and $(x \oplus_i c)_i = (x_i + c) \bmod |\mathcal{V}|$,¹
• $(x \odot_i c)_j = x_j$ for all $j \neq i$, and $(x \odot_i c)_i = c$.

For $x, y \in \mathcal{X}$, we denote the Hamming distance by $d_H(x, y) := |\{i \in [d] : x_i \neq y_i\}|$, and for $x \in ([S] \cup \{\mathrm{MASK}\})^d$, we denote $m(x) := |\{i \in [d] : x_i = \mathrm{MASK}\}|$.

¹In this case, we assume that $\mathcal{V}$ has additive structure. We only apply this notation when $\mathcal{V} = [S]$. We use the convention that $0 \bmod S = S$.

| Paper | Noising process | Score Est. Assump. | No Early Stopping | Sampler | Iteration Complexity | Adaptation |
|---|---|---|---|---|---|---|
| Ren et al. (2025) | Uniform | Bounded | ✗ | τ-leaping | $d^2 S^2/\varepsilon$ | ✗ |
| Liang et al. (2025c) | Uniform | Bounded | ✗ | τ-leaping | $d^2 S/\varepsilon$ | ✗ |
| Our work, Theorems 1 & 2 | Uniform | No requirement | ✓ | τ-leaping | $d/\varepsilon$* | ✗ |
| Liang et al. (2025b) | Masking | Bounded | ✗ | τ-leaping | $dS/\varepsilon$ | ✗ |
| Conforti et al. (2025) | Masking | $\hat{s}_t \approx s_t$ | ✗ | DMPM | $dS/\varepsilon$ | ✗ |
| Our work, Theorem 3 | Masking | No requirement | ✓ | Algorithm 1 | $D/\varepsilon$ | ✓ |

Table 1: Comparison with prior work.
Logarithmic factors in the iteration complexity are omitted. Ren et al. (2025) and Liang et al. (2025b) describe bounds without early stopping under more stringent assumptions on the target distribution, the score function, or the score estimator. The bound in Conforti et al. (2025) depends on additional quantities involving the score estimator beyond Assumption 1, which are small whenever $s_t \approx \hat{s}_t$. The quantity D (defined in Eqn. (16)) is upper bounded by $d \log(S)$ and captures the intrinsic low-dimensional structure of the target distribution. The entry marked with * indicates sharp rates, with matching lower bounds established in Theorem 2.

2 Preliminaries of discrete diffusion models

2.1 A continuous-time Markov chain formulation

Our goal is to model d-dimensional discrete data $X_0 = (X_0^1, X_0^2, \ldots, X_0^d) \in [S]^d$. Let $q_{\mathrm{data}} = q_0$ denote the probability mass function (p.m.f.) of $X_0$ from which we aim to sample, and let $q_0^i$ be the marginal p.m.f. of the i-th coordinate. Analogous to continuous diffusion models, their discrete counterparts consist of a forward and a reverse process over the discrete space.

The forward process. We define a forward noising process that progressively transforms the data distribution $q_0$ into a distribution $q_T$ that is close to an easy-to-sample distribution. This process is modeled using a continuous-time Markov chain (CTMC).

Definition 1 (Continuous-time Markov chain). A CTMC with an initial distribution $q_0$ and rate matrices $(Q_t)_{t \in [0,T]}$ is a right-continuous stochastic process $(x_t)_{t \in [0,T]}$ such that

• $(x_t)_{t \in [0,T]}$ satisfies the Markov property: for any $0 \le s < t \le T$, the conditional distribution of $x_t$ given the history $\{x_u, u \le s\}$ depends only on $x_s$;
• for any $0 \le t < T$, the transition probabilities satisfy, as $\Delta t \to 0^+$:

$$\Pr(x_{t + \Delta t} = y \mid x_t = x) = \mathbb{1}\{x = y\} + Q_t(x, y)\,\Delta t + o(\Delta t). \tag{1}$$
Here, the rate matrices satisfy $Q_t(x, y) \ge 0$ for all $x \neq y \in \mathcal{X}$ and $Q_t(x, x) = -\sum_{y \neq x} Q_t(x, y)$.

We refer to Feinberg et al. (2014); Feller (1940) for a rigorous treatment of CTMCs. For a given $q_0$, the marginals $(q_t)_{t \in [0,T]}$ satisfying Eqn. (1) are the solutions to the Kolmogorov forward equation:

$$\frac{\mathrm{d} q_t}{\mathrm{d} t} = Q_t^\top q_t, \quad \text{for } 0 \le t \le T.$$

The reverse process. For such a CTMC, there exists a time-reversed process with an initial distribution $q_T$, rate matrices $(\overleftarrow{Q}_t)_{t \in [0,T]}$, and marginals $(\overleftarrow{q}_t)_{t \in [0,T]}$, such that $q_t \equiv \overleftarrow{q}_{T-t}$ for $t \in [0, T]$. The forward and reverse rate matrices are explicitly related (Campbell et al., 2022) by

$$\overleftarrow{Q}_t(x, y) = Q_{T-t}(y, x)\,\frac{q_{T-t}(y)}{q_{T-t}(x)}, \quad \text{for } x \neq y \in \mathcal{X} \text{ and } 0 \le t \le T. \tag{2}$$

In this paper, we focus on rate matrices that satisfy three conditions:

1. they are time-homogeneous, $Q_t \equiv Q$;
2. $Q(x, y) = 0$ whenever $d_H(x, y) \ge 2$;
3. if $d_H(x, y) = 1$ and $x_i \neq y_i$, then $Q(x, y) = Q^{\mathrm{tok}}(x_i, y_i)$, for some fixed matrix $Q^{\mathrm{tok}}$.

In particular, we consider two important instances of CTMCs that are widely adopted in practice, namely the uniform noising process and the masking (or absorbing) noising process, which are defined through the choice of $Q^{\mathrm{tok}}$:

• Uniform noising process: a CTMC is a uniform noising process if, for $a \neq b \in [S]$,

$$Q^{\mathrm{tok}}(a, b) = 1/S. \tag{3}$$

This CTMC converges to the uniform distribution on the domain $\mathcal{X} := [S]^d$ in the limit.

• Masking noising process: a CTMC on the domain $\mathcal{X} := ([S] \cup \{\mathrm{MASK}\})^d$ is a masking noising process if, for $a \neq b \in [S] \cup \{\mathrm{MASK}\}$,

$$Q^{\mathrm{tok}}(a, b) = \mathbb{1}\{a \neq \mathrm{MASK} \text{ and } b = \mathrm{MASK}\}. \tag{4}$$

The corresponding CTMC converges to the Dirac measure $(\delta_{\mathrm{MASK}})^{\otimes d}$ as $t \to \infty$. Note that we constrain the initial distribution $q_0$ to be supported on non-masked data, i.e., on $[S]^d$.
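For concreteness, both forward processes admit simple per-coordinate closed forms, obtained by exponentiating the rate matrices in (3) and (4): under uniform noising, each coordinate keeps its value with total probability $e^{-t} + (1 - e^{-t})/S$ and is otherwise uniform over $[S]$; under masking noising, each coordinate is absorbed into MASK with probability $1 - e^{-t}$. The sketch below samples $x_t \mid x_0$ coordinate-wise; the token encoding (0, ..., S−1 plus a sentinel MASK value) and function names are implementation choices for illustration, not the paper's notation.

```python
import numpy as np

MASK = -1  # sentinel token outside {0, ..., S-1}

def uniform_forward(x0, t, S, rng):
    """Sample x_t | x_0 under the uniform noising CTMC, Q_tok(a, b) = 1/S.
    With prob. e^{-t} a coordinate has seen no effective jump and keeps x0;
    otherwise it is uniform over the S tokens (so the overall probability of
    keeping the value is e^{-t} + (1 - e^{-t})/S)."""
    d = x0.shape[0]
    keep = rng.random(d) < np.exp(-t)        # no effective jump on this coordinate
    fresh = rng.integers(0, S, size=d)       # otherwise: uniform over {0, ..., S-1}
    return np.where(keep, x0, fresh)

def masking_forward(x0, t, rng):
    """Sample x_t | x_0 under the masking CTMC: each coordinate is absorbed
    into MASK at rate 1, i.e. masked with probability 1 - e^{-t}."""
    masked = rng.random(x0.shape[0]) < 1.0 - np.exp(-t)
    return np.where(masked, MASK, x0)
```

At $t = 0$ both maps return $x_0$ unchanged, and as $t \to \infty$ the samples converge to the uniform distribution and to the all-MASK state, respectively, matching the stationary distributions described above.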
2.2 Score estimation

Recall that the reverse process is a CTMC with rate matrices satisfying the relation (2), which is similar to the reverse process in the continuous case. The density ratio here generalizes the typical score function $\nabla_x \log q_t(x)$ in the continuous case and is often referred to as the (concrete) score function for discrete diffusion models (Meng et al., 2022). Formally, we define the score function $s_t(y, x)$ as

$$s_t(y, x) = \frac{q_t(y)}{q_t(x)}, \quad \text{for } x \neq y \in \mathcal{X}.$$

Score entropy loss. For both the uniform and masking noising processes, the marginals $(q_t)$, and consequently the score function, are intractable in general. In practice, one therefore resorts to an approximation $\hat{s}_t(y, x)$ of the true score function $s_t(y, x)$, which is learned from data sampled from the target distribution $q_0$. To evaluate the quality of the estimated score, a widely used loss function is the score entropy loss, originally introduced in Lou et al. (2024), which has since become the de facto standard for training score-based discrete diffusion models. This loss provides a principled objective for matching the approximate score $\hat{s}_t$ to the true score induced by the forward diffusion process. Specifically, for $t \ge 0$ and functions $\hat{s}, s : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, the score entropy loss $\mathcal{L}_{\mathrm{SE}}$ is defined as follows:

$$\mathcal{L}_{\mathrm{SE}}(t, \hat{s}, s) := \mathbb{E}_{x \sim q_t}\Big[\sum_{y \neq x} Q_t(y, x)\, s(y, x)\, D\big(\hat{s}(y, x), s(y, x)\big)\Big] \ge 0.$$

Here, for $a, b \ge 0$, the Bregman divergence for $\phi(a) = a - \log a$ is given by $D(a, b) := \frac{a}{b} - 1 - \log\frac{a}{b} \ge 0$.

In practice, to implement any sampling algorithm, one has to discretize the continuous dynamics and obtain score estimates at discrete time steps. Suppose that the score estimates $\hat{s}_{T-t}$ are obtained at discrete time points $0 \le t_0 < t_1 < \ldots < t_N \le T$. We make the following standard assumption regarding the score estimation errors.
[Figure 1 diagram labels: parallel updates (Euler method, Tweedie τ-leaping); CTMC-based (Gillespie's alg., uniformization, DMPM); intersection, i.e. τ-bridging (τ-leaping alg., truncated τ-leaping, Algorithm 1).]

Figure 1: Overview of score-based samplers. The left part comprises score-based samplers that allow parallel updates, defined as τ-leaping strategies in Lou et al. (2024). The right part consists of samplers that can be defined through the CTMC framework. At the intersection are τ-bridging strategies, defined in Eqn. (8).

Assumption 1 (Approximation error). Let $N > 0$ and $0 \le t_0 < t_1 < \ldots < t_N \le T$. We assume that

$$\sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathcal{L}_{\mathrm{SE}}\big(T - t_k,\, \hat{s}_{T-t_k},\, s_{T-t_k}\big) \le \varepsilon_{\mathrm{score}}. \tag{5}$$

This assumption is concerned with the aggregated estimation errors over all N steps. Several works have constructed estimates that satisfy this assumption; examples include Benton et al. (2024); Lou et al. (2024); Ou et al. (2025).

2.3 Score-based sampling algorithms

Armed with the score estimates $(\hat{s}_{T-t})_{t \in \{t_0, \ldots, t_N\}}$, we aim to construct a generative model $\hat{q}_0$ that approximates the data distribution $q_0$. A natural approach proposed in Campbell et al. (2022) is to define a surrogate CTMC that starts from an easy-to-sample distribution $p_0 \approx q_T$ and approximates the backward dynamics in (2). Concretely, we define the time-inhomogeneous rate matrix

$$\hat{Q}_t(x, y) = Q_{T-t}(y, x)\, \hat{s}_{T-t}(y, x). \tag{6}$$

In practice, score estimates are only available on a fixed discretization $\tau = (t_0, \ldots, t_N)$, and extending these estimates to the full interval $[0, T]$ introduces discretization error.

τ-leaping algorithm. As mentioned above, a widely adopted sampler is the τ-leaping algorithm (Campbell et al., 2022), which approximates Eqn. (6) with multiple possible transitions within each discretization interval. Formally, for $k \in \{0, \ldots
, N-1\}$ and $t \in [t_k, t_{k+1})$, given $x_{t_k}$ and $\hat{s}_{T-t_k}$, τ-leaping obtains $x_{t_{k+1}}$ as a random vector whose coordinates are sampled independently via d one-dimensional CTMCs. For each $i \in [d]$, the initial distribution is $\delta_{x_{t_k}^i}$ and the rate matrices are given by²

$$\hat{Q}_t^i(a, b) = \hat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \oplus_i (b - a)\big), \quad \text{for } a \neq b \in \mathcal{V}. \tag{7}$$

The formulation in Eqn. (7) requires either an additive structure on the state space or the restriction that each coordinate undergoes at most one transition between discretization points. Existing analyses for uniform and masking diffusions (Campbell et al., 2022; Liang et al., 2025c) adopt the latter assumption. In Section 3.1, we explore the necessity of this requirement for the uniform noising process. Lou et al. (2024) generalizes τ-leaping by introducing a class of samplers termed τ-leaping strategies, which allow arbitrary transformations $x_{t_{k+1}}^i = f_k^i(\hat{s}_{T-t_k}, x_{t_k})$. Both the Euler method and Tweedie τ-leaping fall into this class. However, they remain challenging for direct theoretical analysis due to the absence of a CTMC structure.

²The algorithm admits an equivalent Poisson formulation, in which dS Poisson random variables corresponding to coordinate-value transitions are sampled and applied in parallel.

This paper: τ-bridging strategies. We introduce a structured class of samplers that generalizes the τ-leaping algorithm. We name this class of algorithms τ-bridging strategies; they retain the parallel updating structure while remaining analytically tractable. A τ-bridging strategy generates $x_{t_{k+1}}$ from $x_{t_k}$ by evolving d independent one-dimensional CTMCs on $[t_k, t_{k+1})$. For each coordinate $i \in [d]$, the chain is initialized at $\delta_{x_{t_k}^i}$ and has the rate matrix given by

$$\hat{Q}_t^i = G_t^i(\hat{s}_{T-t_k}, x_{t_k}), \tag{8}$$

for some mapping $G_t^i : \mathbb{R}_+^{\mathcal{X} \times \mathcal{X}} \times \mathcal{X} \to \mathbb{R}^{\mathcal{V} \times \mathcal{V}}$.
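To make the parallel-update mechanics concrete, the following is a minimal sketch of one discretization step in the at-most-one-transition regime mentioned above: each coordinate's rates are frozen at $t_k$, so once a coordinate jumps it cannot jump again, and its one-dimensional chain can be simulated exactly through its first jump time (a jump occurs before $t_{k+1}$ with probability $1 - e^{-\lambda_i h}$, with the target drawn proportionally to the rates). The rate table `s_hat` and the function name are illustrative stand-ins for the frozen score quantities appearing in Eqns. (7) and (8), not the paper's implementation.

```python
import numpy as np

def frozen_rate_step(x, s_hat, h, rng):
    """One parallel update x_{t_k} -> x_{t_{k+1}} with frozen per-coordinate rates.

    x     : (d,) current state (integer tokens)
    s_hat : (d, V) array; s_hat[i, b] is the frozen transition rate of
            coordinate i toward token b (the entry at b == x[i] is ignored)
    h     : step size t_{k+1} - t_k

    With at most one transition per coordinate, the frozen one-dimensional
    chain is simulated exactly via its first jump time Exp(lambda_i)."""
    d, V = s_hat.shape
    x_new = x.copy()
    for i in range(d):
        rates = s_hat[i].copy()
        rates[x[i]] = 0.0                          # no self-transition
        lam = rates.sum()                          # total jump intensity lambda_i
        if lam > 0 and rng.random() < 1.0 - np.exp(-lam * h):
            # first jump lands inside [t_k, t_{k+1}): pick target proportional to rates
            x_new[i] = rng.choice(V, p=rates / lam)
    return x_new
```

With all rates zero the state is unchanged, and concentrating a very large rate on a single token moves that coordinate there almost surely, which matches the intended behavior of a one-jump-per-coordinate update.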
Compared to general τ-leaping strategies, τ-bridging strategies restrict updates to CTMC-based transitions. This restriction preserves parallel coordinate updates while facilitating theoretical analysis. Figure 1 summarizes the relationships between these classes of sampling algorithms.

A representative instance of a τ-bridging sampler is the truncated τ-leaping sampler of Liang et al. (2025c). For $k \in [N]$ and $t \in [t_k, t_{k+1})$, the corresponding rate matrices take the form

$$G_t^i(\hat{s}_{T-t_k}, x_{t_k})(a, b) = \hat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \odot_i b\big)\, \mathbb{1}\{x_{t_k}^i = a\}, \quad \text{for } a \neq b \in \mathcal{V}. \tag{9}$$

The indicator $\mathbb{1}\{x_{t_k}^i = a\}$ enforces the constraint that at most one transition occurs per coordinate $i \in [d]$ within each discretization interval $[t_k, t_{k+1})$. In Section 3.2, we show that an instance of this scheme achieves sublinear complexity for the masking noising process under mild distributional assumptions. To the best of our knowledge, this is the first result establishing such a guarantee.

3 Main results

In this section, we characterize the sampling efficiency of τ-bridging strategies for both the uniform and masking noising processes. We develop sharp convergence guarantees and highlight cases where adaptivity is automatically achieved. We provide proof sketches for all results in this section, with full proofs deferred to the appendix.

3.1 Uniform discrete diffusion

3.1.1 A sharp convergence characterization

We begin with uniform discrete diffusion models, whose forward dynamics are given by the uniform noising process. We establish explicit sampling guarantees for the τ-leaping algorithm, measured in KL divergence. The proof is given in Appendix C.1.

Theorem 1. Let $q_{\mathrm{data}} = q_0$ be the data distribution on $\mathcal{X} := [S]^d$. For $0 = t_0 < t_1 < \ldots < t_N = T$, let $\Delta := \max_k \{t_{k+1} - t_k\} = O(1)$. Set $p_0 = \mathrm{Unif}(\mathcal{X})$.
Under Assumption 1, the τ-leaping algorithm initialized at $p_0$ generates a sample from $p_{\mathrm{output}} = p_T$ such that

$$\mathrm{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \lesssim \varepsilon_{\mathrm{score}} + e^{-T} d \log(S) + \Delta\, d \log(S/\Delta). \tag{10}$$

As expected, the KL divergence bound in Theorem 1 decomposes into three terms. The first term $\varepsilon_{\mathrm{score}}$ quantifies the quality of score estimation and captures the accumulation of estimation errors over the N discretization steps. The second term corresponds to the initialization error, arising from initializing the sampler with the uniform distribution $p_0$ instead of the true terminal distribution $q_T$; this term decays exponentially with the diffusion horizon T. Finally, the third term accounts for the discretization error incurred by approximating the continuous-time reverse process with a discrete-time τ-leaping scheme.

To further interpret Theorem 1 and place it in context with existing results, we highlight several of its salient features. First, the discretization error scales linearly with the dimension d and only logarithmically with the vocabulary size S. This matches the result obtained for the random walk model (Conforti et al., 2025) and reveals that the discretization error is insensitive to the distribution scale, as has been shown for continuous diffusion models (e.g., Huang et al. (2024)). Second, the theorem permits a flexible choice of step size schedules and does not require early stopping. In contrast to prior analyses that rely on carefully selected step sizes and introduce an early stopping time δ (where the algorithm outputs $p_{T-\delta}$ in place of $p_T$), the bound in Theorem 1 depends only on the maximum step size. Moreover, the same bound applies uniformly to early stopping variants: the right-hand side of (10) remains unchanged for any $\delta \ll 1$.
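As a quick sanity check on the scaling in (10) (purely illustrative arithmetic: constants and the $\varepsilon_{\mathrm{score}}$ term are ignored), a constant step size must satisfy $\Delta\, d \log(S/\Delta) \le \varepsilon$, which forces $\Delta$ on the order of $\varepsilon/(d \log S)$ and hence $N = T/\Delta$ on the order of $d/\varepsilon$ up to logarithmic factors. The helper below is a throwaway illustration, not part of the paper:

```python
import math

def steps_for_accuracy(d, S, eps, T):
    """Largest constant step size delta with delta * d * log(S/delta) <= eps
    (found by bisection on (0, 1]), and the implied iteration count
    N = ceil(T / delta).  Constants in the bound (10) are ignored."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * d * math.log(S / mid) <= eps:
            lo = mid          # feasible: discretization term is small enough
        else:
            hi = mid
    return lo, math.ceil(T / lo)
```

For GPT-2-scale parameters ($d = 1024$, $S = 50{,}257$) with $\varepsilon = 0.1$ and $T = \log(d \log(S)/\varepsilon)$, this yields $N$ on the order of $10^6$; doubling $d$ roughly doubles $N$, consistent with the $\widetilde{O}(d/\varepsilon)$ iteration complexity derived below.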
The only requirement we have on score estimation is Assumption 1, with no additional boundedness or regularity conditions (typically assumed in the existing literature). As a result, the theorem applies to a broad class of score estimation procedures commonly used in practice. We provide a sketch of its proof to illustrate the main proof ideas.

Proof sketch of Theorem 1. In view of the data-processing inequality and the chain rule for KL divergence, we upper bound the KL divergence between $q_0$ and $p_T$ by the KL divergence between the paths $q_{T-t_0, \ldots, T-t_N}$ and $p_{t_0, \ldots, t_N}$, which can be decomposed as

$$\mathrm{KL}(q_0 \,\|\, p_T) \le \mathrm{KL}\big(q_{T-t_0, \ldots, T-t_N} \,\|\, p_{t_0, \ldots, t_N}\big) = \mathrm{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}}\Big[\mathrm{KL}\big(\overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\big\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k})\big)\Big].$$

The first term is the initialization error, which can be upper bounded by the log-Sobolev inequality in Lemma 7. For the second term, we apply Girsanov's change-of-measure theorem for continuous-time Markov chains to obtain the following upper bound:

$$\frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c,\, x_t)\, D\big(\hat{s}_{T-t_k}(x_{t_k} \oplus_i c,\, x_{t_k}),\, s_{T-t}(x_t \oplus_i c,\, x_t)\big)\, \mathrm{d}t.$$

The details can be found around Eqn. (67).
To further control the right-hand side, we apply the law of cosines for Bregman divergences and derive that (with $\ell := t_k$)
\[
\begin{aligned}
&\sum_{i\in[d]}\sum_{c\in[S]} s_{T-t}(x_t \oplus_i c,\, x_t)\, D\big(\widehat{s}_{T-t_k}(x_{t_k}\oplus_i c,\, x_{t_k}),\; s_{T-t}(x_t\oplus_i c,\, x_t)\big) \\
&\quad= \underbrace{\sum_{y_\ell:\, d_H(y_\ell,x_\ell)=1} s_{T-\ell}(y_\ell, x_\ell)\, D\big(\widehat{s}_{T-\ell}(y_\ell, x_\ell),\; s_{T-\ell}(y_\ell, x_\ell)\big)}_{\text{controlled by Assumption 1}} \\
&\qquad+ \underbrace{\sum_{i\in[d]}\sum_{c\in[S]} \big(s_{T-\ell}(x_\ell\oplus_i c,\, x_\ell) - s_{T-t}(x_t\oplus_i c,\, x_t)\big)\log \widehat{s}_{T-\ell}(x_\ell\oplus_i c,\, x_\ell)}_{\text{expectation controlled by Lemma 8}} \\
&\qquad+ \sum_{y_t:\, d_H(y_t,x_t)=1}\big(-\log s_{T-t}(y_t, x_t)\big) \;-\; \sum_{y_\ell:\, d_H(y_\ell,x_\ell)=1}\big(-\log s_{T-\ell}(y_\ell, x_\ell)\big).
\end{aligned}
\]
The first term can be controlled by Assumption 1 after taking the expectation over $x_\ell \sim \overleftarrow{q}_\ell$ and integrating over time. The second term can be shown to be zero with the help of Lemma 8 after taking the expectation over $x_t \sim \overleftarrow{q}_{t|\ell}(\cdot\,|\,x_\ell)$. Thus, the problem boils down to upper bounding the third term above, whose properties are characterized in Lemma 10. After taking the expectation and integrating over time, we can upper bound the third term by $\Delta\, d \log(S/\Delta)$. Combining the bounds for all three terms completes the proof.

Next, we specialize Theorem 1 to a concrete choice of discretization schedule to derive the iteration complexity required to obtain an $\varepsilon$-accurate sampler in KL divergence. For a simple step size schedule, it turns out that $d/\varepsilon$ steps (up to logarithmic factors) suffice for convergence, significantly improving on the state-of-the-art complexity of $d^2 S/\varepsilon$ from Liang et al. (2025c). Refer to Appendix C.2 for the proof.

Corollary 1.
For the setting in Theorem 1 and $\varepsilon > 0$, the output of the τ-leaping algorithm with the constant step size schedule $t_{k+1} - t_k = T/N$ for $k \in [N-1]$ achieves
\[
\mathsf{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \;\lesssim\; \varepsilon_{\mathrm{score}} + \varepsilon,
\]
provided that the time horizon and iteration number satisfy
\[
T = \log\big(d\log(S)/\varepsilon\big) \qquad \text{and} \qquad N = \widetilde{O}\Big(\frac{d}{\varepsilon}\Big). \tag{11}
\]

Remark 1 (Step size schedule). In Corollary 1, we adopt the constant step size schedule for simplicity. This choice is optimal in the sense that it minimizes the worst-case upper bound for a fixed number of steps $N$, and it is also empirically effective (Campbell et al., 2022). However, other step size schedules commonly used in practice and theory achieve the same iteration complexity, including the exponential-then-constant schedule (defined as in Corollary 2 and used in Liang et al. (2025c)) and the log-linear schedule (Lou et al., 2024). In these works, early stopping is introduced to maintain numerical stability of score estimation during training and also to ensure a small discretization error. Our result shows that, under Assumption 1, early stopping is not necessary for a small discretization error.

3.1.2 A matching lower bound for τ-leaping

While Theorem 1 establishes an upper bound for the τ-leaping algorithm scaling nearly linearly with the dimension $d$ and logarithmically with the vocabulary size $S$, a fundamental question remains: is this dependence an intrinsic limit or merely a technical artifact? We show that the former is the case by establishing a matching lower bound.

We note that for target distributions sufficiently close to the uniform distribution, sampling can be achieved with very few steps, as the forward CTMC converges efficiently to its limit. To avoid these pathological instances, we restrict our focus to the class of distributions that remain sufficiently well separated from the uniform distribution.
Specifically, for any $\gamma \in [0,1]$, define the subset $\mathcal{P}_\gamma(\mathcal{X}) \subseteq \mathcal{P}(\mathcal{X})$ as
\[
\mathcal{P}_\gamma(\mathcal{X}) \;=\; \big\{\, q_0 \in \mathcal{P}(\mathcal{X}) \,:\, H(q_1) \le (1-\gamma)\cdot H(\mathrm{Unif}(\mathcal{X})) = (1-\gamma)\, d\log(S) \,\big\},
\]
where $q_1$ is the marginal distribution at $t=1$ of the uniform noising process initialized at $q_0$, $\mathrm{Unif}(\mathcal{X})$ is the uniform distribution on $\mathcal{X}$, and $H(\cdot)$ denotes the entropy of a distribution. Intuitively, for $\gamma \in (0,1)$, the class $\mathcal{P}_\gamma(\mathcal{X})$ imposes a structural constraint on the convergence of the forward process: it describes distributions that do not mix rapidly. In this sense, for $\gamma = O(1)$, $\mathcal{P}_\gamma(\mathcal{X})$ contains distributions that remain informative enough at time $t=1$ of the forward process. This covers most distributions of practical interest, since they carry nontrivial information characterized by relatively low entropy.

The following lower bound shows that, when sampling from a distribution in $\mathcal{P}_\gamma(\mathcal{X})$ with the τ-leaping algorithm, the iteration complexity bound in Corollary 1 cannot be improved beyond logarithmic factors. The proof is given in Appendix C.3.

Theorem 2. For any target distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$ and early stopping time $0 \le \delta \ll 1$, denote the path measure of the backward process by $Q = \{\overleftarrow{q}_t\}_{t\in[0,T-\delta]}$ and that of the sampling process by $P = \{p_t\}_{t\in[0,T-\delta]}$. Let $\gamma = \Omega(1)$. Then, for any step size schedule $0 = t_0 < t_1 < \dots < t_N = T-\delta$ with $\max_k \{t_{k+1}-t_k\} \le \frac{1}{2}$, the τ-leaping algorithm requires at least
\[
N = \Omega\big(d\log(S)\big) \tag{12}
\]
iterations to achieve $\mathsf{KL}(Q\,\|\,P) \le \varepsilon_{\mathrm{score}} + O(1)$.

We make several remarks concerning the nature and implications of our lower bound. Theorem 2 reveals that for informative target distributions in $\mathcal{P}_\gamma(\mathcal{X})$, ensuring that the KL divergence between the sampling process and the reverse process is small requires the number of steps to scale at least linearly with the dimension $d$; this cannot be avoided for general distributions.
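As a concrete illustration (ours, not from the paper) of membership in $\mathcal{P}_\gamma(\mathcal{X})$: under the standard per-coordinate uniform-noising kernel $q_{t|0}(y^i \mid x^i) = e^{-t}\mathbb{1}\{y^i = x^i\} + (1-e^{-t})/S$ (an assumption here, since the paper's forward dynamics are not reproduced above), a point mass $q_0 = \delta_x$ has a product marginal $q_1$, so $H(q_1) = d \cdot H(\text{one coordinate})$ and the resulting $\gamma$ is independent of $d$ and bounded away from zero.

```python
import math

# For q_0 a point mass on [S]^d under the assumed uniform-noising kernel,
# q_1 is a product measure; gamma = 1 - H(q_1) / (d log S) is dimension-free.
def gamma_of_dirac(S, t=1.0):
    stay = math.exp(-t) + (1 - math.exp(-t)) / S   # prob. of keeping the symbol
    move = (1 - math.exp(-t)) / S                  # prob. of each other symbol
    h = -stay * math.log(stay) - (S - 1) * move * math.log(move)
    return 1.0 - h / math.log(S)                   # H(q_1) = (1 - gamma) d log S

# gamma does not depend on d and stays bounded away from zero:
print(gamma_of_dirac(2), gamma_of_dirac(50_000))
```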
In addition, the lower bound is uniform over both early stopping schedules ($0 < \delta \ll 1$) and non-early-stopping schemes ($\delta = 0$).

This lower bound is algorithm-dependent: it relies on structural properties of the τ-leaping algorithm and therefore differs fundamentally from information-theoretic or minimax lower bounds. In principle, alternative sampling schemes may circumvent the linear dependence on $d$. Indeed, in Section 3.2, we show that a modified τ-leaping procedure achieves sublinear dependence on $d$ for structured target distributions under the masking noising process. Whether analogous improvements are possible for uniform discrete diffusion through modified algorithms remains an open question.

When the target distribution has high entropy, the lower bound need not apply. Indeed, when $q_{\mathrm{data}}$ satisfies $\mathsf{KL}(q_{\mathrm{data}} \,\|\, \mathrm{Unif}(\mathcal{X})) = o(d)$, one can show that $H(q_1) = \Theta(d\log S)$, and that a sample from a distribution with KL error at most $\varepsilon_{\mathrm{score}} + \varepsilon$ can be obtained using $N = o(d)$ steps. A precise formulation of this claim is given in Appendix C.4.

We remark that the quantity controlled in Theorem 2 is the KL divergence between two path measures, rather than the divergence between the terminal output distributions, which may appear weaker than the upper bound in Corollary 1. However, to the best of our knowledge, all existing upper-bound analyses for the KL divergence, including ours, proceed by first bounding the KL divergence between path measures and then invoking the data-processing inequality. Consequently, the lower bound applies to all current analysis techniques. In this sense, Theorem 2 establishes the optimality of the iteration complexity in Corollary 1 within the scope of existing analysis techniques. Finally, we provide a proof sketch of Theorem 2 to illustrate the main techniques.

Proof sketch of Theorem 2.
The proof is based on a refined analysis of the decay of the KL divergence along the forward process for any distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$. While we state our result for $\gamma = \Omega(1)$, the proof works for every $\gamma \in (0,1)$. It can be shown that the KL divergence along the forward process is a differentiable function of time $t$; denote its negative rate of change by $\varphi(t)$, i.e.,
\[
\varphi(t) \;=\; -\frac{\mathrm{d}}{\mathrm{d}t}\,\mathsf{KL}(q_t\,\|\,p_0) \;=\; \sum_{x,y:\, d_H(x,y)=1} q_t(x)\, s_t(y,x) \log\Big(\frac{s_t(y,x)}{s_t(x,y)}\Big),
\]
where $p_0 = \mathrm{Unif}(\mathcal{X})$ is the limit distribution of the forward noising process. First, we show that the condition $\mathsf{KL}(Q\,\|\,P) \le \varepsilon_{\mathrm{score}} + O(1)$, combined with the definition of $\mathcal{P}_\gamma(\mathcal{X})$, implies that $T > 1$ and the bound
\[
\sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}} \big(\varphi(T-t) - \varphi(T-t_k)\big)\,\mathrm{d}t \;=\; O(1). \tag{13}
\]
Furthermore, we can show that $\varphi(t)$ is a non-increasing and differentiable function of $t$. Thus, Eqn. (13) and the Newton–Leibniz formula lead to a stronger condition:
\[
\sum_{k=1}^{N-1} \inf_{t_k\le t\le t_{k+1}}\big(-\varphi'(T-t)\big)\cdot\frac{1}{2}(t_{k+1}-t_k)^2 \;\le\; \sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}}\int_{T-t}^{T-t_k} -\varphi'(u)\,\mathrm{d}u\,\mathrm{d}t \;=\; \sum_{k=1}^{N-1}\int_{t_k}^{t_{k+1}}\big(\varphi(T-t)-\varphi(T-t_k)\big)\,\mathrm{d}t \;=\; O(1). \tag{14}
\]
Next, we view the forward process as an $S$-ary symmetric channel (Makur and Polyanskiy, 2018) and apply the strong data-processing inequality to prove that for any $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$, the function $-\varphi'(t)$ is lower bounded by a quantity scaling with $\gamma\, d \log(S)$ for all $t \in (0,1)$. Since $\max_k\{t_{k+1}-t_k\}\le \frac{1}{2}$, we can choose a suitable $M$ with $1 < M < N$ and $T - t_M \in [\frac{1}{2}, 1]$. Combining this with Eqn. (14), we obtain
\[
\sum_{k=M}^{N-1}(t_{k+1}-t_k)^2 \;\lesssim\; \frac{1}{\gamma\, d \log(S)},
\]
which implies $N = \Omega(\gamma\, d \log(S))$ by the Cauchy–Schwarz inequality: indeed, $\big(\sum_{k\ge M}(t_{k+1}-t_k)\big)^2 \le N \sum_{k\ge M}(t_{k+1}-t_k)^2$, while $\sum_{k\ge M}(t_{k+1}-t_k) = T-\delta-t_M = \Omega(1)$.

3.2 Masking discrete diffusion

We now turn our attention to the masking noising process.
Our main result in this setting is an upper bound that depends intrinsically on the structural properties of the target distribution $q_{\mathrm{data}}$, rather than scaling with the ambient dimension $d$. This aligns with the intuition that for highly structured distributions, such as a sparse mixture of Dirac measures, a sensible sampler should converge at a sublinear scale, or perhaps even logarithmically in $d$.

3.2.1 Preliminaries

We begin by recalling two fundamental quantities from information theory: the total correlation and the dual total correlation. For a distribution $q$ over $[S]^d$ and $x \sim q$, the total correlation $C(q)$ and the dual total correlation $B(q)$ are defined as
\[
C(q) := \sum_{i=1}^d H(x^i) - H(x) \qquad \text{and} \qquad B(q) := H(x) - \sum_{i=1}^d H(x^i \mid x^{-i}). \tag{15}
\]
We now introduce a time-dependent quantity associated with the masking noising process. Consider a masking noising process defined by Eqn. (4) with marginals $(q_t)_{t\ge 0}$. For $x \in ([S]\cup\{\mathrm{MASK}\})^d$ and $i \ne j \in [d]$, let $x^{-(i,j)}$ denote the collection of all unmasked elements of $x$, excluding the $i$-th and $j$-th coordinates. We define the effective total correlation of the target distribution as
\[
D(q_0) := \int_0^\infty \min(1,t)\, I(t)\,\mathrm{d}t \qquad \text{with} \qquad I(t) := \sum_{i\ne j \in [d]} I\big(x_t^i;\, x_t^j \,\big|\, x_t^{-(i,j)}\big) \;\ge\; 0, \tag{16}
\]
where $I(A;B\mid C)$ denotes conditional mutual information and $x_t \sim q_t$. Lemma 16 shows that the total correlation and the dual total correlation can be expressed through $I(t)$ as
\[
B(q_0) = \int_0^\infty I(t)\,\mathrm{d}t \qquad \text{and} \qquad C(q_0) = \int_0^\infty (e^t - 1)\, I(t)\,\mathrm{d}t.
\]
Consequently, $D(q_0) \le \min\big(B(q_0), C(q_0)\big)$. The statement and proof of this result are given in Appendix E.1. Note that $B(q_0)$, $C(q_0)$, and hence $D(q_0)$ are all upper bounded by $d\log(S)$. Moreover, there exist distributions $q_0$ with $B(q_0) = O(1)$ while $C(q_0) = \Omega(d\log(S))$, and vice versa.
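Both quantities in Eqn. (15) are directly computable for small instances. The following sketch (our illustration, exponential in $d$, so only for toy examples) evaluates $C(q)$ and $B(q)$ by brute-force marginalization; for $d = 2$ both reduce to the mutual information between the two coordinates.

```python
import math

def entropy(pmf):
    # Shannon entropy (in nats) of a pmf given as {outcome: probability}.
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(q, coords):
    # Marginal pmf of the coordinates listed in `coords`.
    out = {}
    for x, p in q.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def total_correlations(q, d):
    # C(q) and B(q) as in Eqn. (15), using H(x^i | x^{-i}) = H(x) - H(x^{-i}).
    H = entropy(q)
    C = sum(entropy(marginal(q, [i])) for i in range(d)) - H
    B = H - sum(H - entropy(marginal(q, [j for j in range(d) if j != i]))
                for i in range(d))
    return C, B

# Two fully correlated uniform coordinates over [S]: for d = 2, both C and B
# reduce to the mutual information I(x^1; x^2) = log S.
S = 4
q_corr = {(a, a): 1.0 / S for a in range(S)}
C, B = total_correlations(q_corr, d=2)
```

For a product distribution, both quantities vanish, consistent with their role as dependence measures.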
We refer to Austin (2020) for a detailed study of the total correlation and the dual total correlation. Importantly, there are also natural distributions for which both $B(q_0)$ and $C(q_0)$ are of order $d$, while $D(q_0)$ remains small. See Proposition 5 for an example of such a distribution.

3.2.2 An adaptive characterization

Equipped with the above preliminaries, we present our main result on the masking noising process. The proof is given in Appendix D.1.

Theorem 3. Let $q_{\mathrm{data}} = q_0$ be the target distribution on $[S]^d$. For $0 = t_0 < t_1 < \dots < t_N = T$, let $h_k := t_{k+1} - t_k$ be the step sizes and assume that $\Delta := \max_k h_k = O(1)$. Let
\[
p_0 := \Big(\big(1 - e^{-T}\big)\,\delta_{\mathrm{MASK}} + S^{-1} e^{-T} \sum_{k=1}^S \delta_k\Big)^{\otimes d}.
\]
Under Assumption 1, Algorithm 1 initialized at $p_0$ produces a sample from $p_{\mathrm{output}} = p_T$ such that
\[
\mathsf{KL}(q_{\mathrm{data}} \,\|\, p_{\mathrm{output}}) \;\lesssim\; \varepsilon_{\mathrm{score}} \,+\, e^{-T} d\log(S) \,+\, \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \tag{17}
\]
A few remarks on the consequences and implications of Theorem 3 are in order. As in Theorem 1, the last term in the upper bound corresponds to the discretization error, here measured via the integrated mutual information defined in Eqn. (16). While the first two terms are generic, the third term governs the dependence on the dimension $d$ and reflects the information-theoretic properties of the target distribution. For structured distributions, our algorithm implicitly adapts to the underlying structure of the target distribution without requiring any prior knowledge of that structure or any modification to the algorithm itself.

Algorithm 1: Modified truncated τ-leaping

Input: initial distribution $p_0$; discretization steps $0 = t_0 < t_1 < \dots < t_N = T$; score estimate $\widehat{s}_{T-t}$ for $t \in \{t_0,\dots,t_{N-1}\}$.
Output: sample $\widehat{x} \in [S]^d$.
1: Sample $x_0$ from $p_0$
2: for $k = 0, \dots, N-1$ do
3:  for $i \in m(x_{t_k}) := \{\,i : x_{t_k}^i = \mathrm{MASK}\,\}$ do
4:   $\widehat{Q}_k^i(a) \leftarrow \widehat{s}_{T-t_k}\big(x_{t_k} \odot_i a,\, x_{t_k}\big)$ for $a \in [S]$
5:   $\widehat{Q}_k^i(\mathrm{MASK}) \leftarrow -\sum_{a\in[S]} \widehat{Q}_k^i(a)$
6:   if $k < N-1$ then
7:    $\Delta_k \leftarrow \big(e^{T-t_k}-1\big)\log\Big(\frac{e^T - e^{t_k}}{e^T - e^{t_{k+1}}}\Big)$
8:    $P_k \leftarrow \exp\big(\widehat{Q}_k^i(\mathrm{MASK})\,\Delta_k\big)$
9:   else
10:   $P_k \leftarrow 0$
11:  end if
12:  $x_{t_{k+1}}^i \leftarrow \begin{cases} \mathrm{MASK}, & \text{with probability } P_k, \\[2pt] a, & \text{with probability } \frac{\widehat{Q}_k^i(a)}{\sum_{b\in[S]}\widehat{Q}_k^i(b)}\,(1 - P_k), \text{ for } a \in [S]. \end{cases}$
13:  end for
14: end for
15: return $x_{t_N}$

In Appendix D.3, we analyze the performance of truncated τ-leaping as an alternative to Algorithm 1; it incurs an additional $d/N^2$ term in the upper bound of Eqn. (17), ignoring lower-order contributions. Although for structured target distributions the resulting iteration complexity already scales as $\sqrt{d}$ rather than $d$ (as in the existing literature), it does not fully adapt to the geometry of the target distribution. To provide some intuition, the standard (or truncated) τ-leaping algorithm informally satisfies, for $t \in [t_k, t_{k+1})$ (see Eqn. (9)),
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k}) \approx G_{t_k}^i(\widehat{s}_{T-t_k}, x_{t_k}), \qquad \text{and thus} \qquad \widehat{Q}_t \approx \overleftarrow{Q}_{t_k}, \tag{18}
\]
where we recall the mapping $G_t^i$ from Eqn. (8). That is, even when the score estimation is exact, $\widehat{s}_{T-t_k} \equiv s_{T-t_k}$, the τ-leaping algorithm introduces a mismatch between the surrogate and true rate matrices, since $s_{T-t_k} \not\equiv s_{T-t}$. Algorithm 1 corrects this discrepancy by enforcing
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k}) \approx G_t^i(s_{T-t}, x_{t_k}), \qquad \text{and thus} \qquad \widehat{Q}_t \approx \overleftarrow{Q}_t, \tag{19}
\]
through a rescaling of the score estimate:
\[
\widehat{s}_{T-t} \;=\; \frac{e^{T-t_k}-1}{e^{T-t}-1}\,\widehat{s}_{T-t_k}.
\]
Since this is a linear transformation of the score estimate, we can simulate the resulting dynamics while evaluating the score only at the discrete points $T-t_0, \dots, T-t_N$ (see Algorithm 1 and Lemma 13). This leads to a sharper upper bound in Theorem 3 relative to the analogous bound for truncated τ-leaping (Theorem 5; see also Remark 3).
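For concreteness, Algorithm 1 transcribes almost line by line into code. The sketch below is an illustrative implementation, not the authors' code: `score_hat` is a hypothetical user-supplied score estimate (in practice a trained network), and the loop mirrors the pseudocode above, including the quantity $\Delta_k$ and the forced unmasking at the final step.

```python
import math, random

MASK = -1  # sentinel standing in for the MASK symbol

def modified_tau_leaping(score_hat, S, d, ts, rng=random):
    """Sketch of Algorithm 1 (modified truncated tau-leaping).

    score_hat(x, k, i, a) is a stand-in for s_hat_{T-t_k}(x odot_i a, x):
    the estimated rate of unmasking coordinate i of x to symbol a at step k.
    ts = [t_0, ..., t_N] is the discretization grid with t_N = T.
    """
    T, N = ts[-1], len(ts) - 1
    # Initialization p_0: each coordinate is MASK w.p. 1 - e^{-T}, else uniform.
    x = [MASK if rng.random() < 1 - math.exp(-T) else rng.randrange(S)
         for _ in range(d)]
    for k in range(N):
        frozen = list(x)  # all masked coordinates update against x_{t_k}
        for i in range(d):
            if frozen[i] != MASK:
                continue
            Q = [score_hat(frozen, k, i, a) for a in range(S)]  # Q_hat^i_k(a)
            Q_mask = -sum(Q)                                    # Q_hat^i_k(MASK)
            if k < N - 1:
                delta_k = (math.exp(T - ts[k]) - 1) * math.log(
                    (math.exp(T) - math.exp(ts[k]))
                    / (math.exp(T) - math.exp(ts[k + 1])))
                P_k = math.exp(Q_mask * delta_k)  # prob. of staying masked
            else:
                P_k = 0.0                          # final step always unmasks
            if rng.random() < P_k:
                continue                           # coordinate stays MASK
            u, acc = rng.random() * sum(Q), 0.0    # unmask to a w.p. Q[a]/sum(Q)
            for a in range(S):
                acc += Q[a]
                if u <= acc:
                    x[i] = a
                    break
            else:
                x[i] = S - 1                       # guard against rounding
    return x
```

With exact scores for a given target, the returned samples approximate $q_{\mathrm{data}}$; since $P_k = 0$ at the last step, the output never contains MASK.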
Empirically, the benefit of rescaling the score function in masking discrete diffusion models has also been observed in prior work; see, for example, Lou et al. (2024) and Ou et al. (2025). Notably, our results are closely connected to an intriguing parallel line of work on masking diffusion models (Chen et al. (2025); Li and Cai (2025)), which focuses on the design of unmasking schedules without adopting a CTMC perspective. In particular, Chen et al. (2025) derives optimal unmasking schedules and discusses two representative instances in which the number of steps scales linearly with $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$, respectively. Their algorithms require an a priori estimate of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ or a doubling search procedure to calibrate the unmasking schedule, and they rely on a different sampling mechanism. The fact that our score-based sampler automatically exploits similar information-theoretic quantities without additional hyperparameters underscores both the fundamental nature of these quantities and the robustness of the CTMC framework.

Below we provide a proof sketch of Theorem 3, with details deferred to Appendix D.1.

Proof sketch of Theorem 3. First, Lemma 13 shows that Algorithm 1 outputs a sample from a CTMC with initial distribution $p_0$ and rate matrices
\[
\widehat{Q}_t(x,y) := \begin{cases}
\widehat{s}_{T-t_k}\big(x_{t_k} \odot_i y^i,\, x_{t_k}\big)\, \dfrac{e^{T-t_k}-1}{e^{T-t}-1}\, \mathbb{1}\{x^i = \mathrm{MASK}\}, & \text{if } d_H(x,y) = 1,\ x^i \ne y^i, \text{ and } x_{t_k}^i = \mathrm{MASK}, \\[4pt]
-\sum_{z \ne x} \widehat{Q}_t(x,z), & \text{if } y = x, \\[2pt]
0, & \text{otherwise}.
\end{cases} \tag{20}
\]
This corresponds to a τ-bridging strategy (Eqn. (8)) with the following function $G_t^i(\widehat{s}_{T-t_k}, x_{t_k})$:
\[
G_t^i(\widehat{s}_{T-t_k}, x_{t_k})(a,b) \;=\; \frac{e^{T-t_k}-1}{e^{T-t}-1}\, \widehat{s}_{T-t_k}\big(x_{t_k},\, x_{t_k} \odot_i b\big)\, \mathbb{1}\{x_{t_k}^i = a\} \qquad \text{for } a \ne b \in \mathcal{V}.
\]
By the data-processing inequality, we upper bound the KL divergence between $q_0$ and $p_T$ by the KL divergence between the paths $q_{T-t_0,\dots,T-t_N}$ and $p_{t_0,\dots,t_N}$. Next, we apply the Markov property of the paths along with Girsanov's change-of-measure theorem to upper bound $\mathsf{KL}(q_0\,\|\,p_T)$ by
\[
\mathsf{KL}(q_T\,\|\,p_0) \;+\; \sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k},x_t\sim \overleftarrow{q}_{t_k,t}}\Bigg[\sum_{y_t:\, Q(y_t,x_t)>0} s_{T-t}(y_t,x_t)\, D\Big(\frac{e^{T-t_k}-1}{e^{T-t}-1}\,\widehat{s}_{T-t_k}(y_t,x_t),\; s_{T-t}(y_t,x_t)\Big)\Bigg]\mathrm{d}t.
\]
The first term is the initialization error and is controlled by choosing the time horizon $T = \Omega\big(\log d + \log\log(\varepsilon^{-1} S)\big)$. For the second term, we apply the law of cosines for Bregman divergences and obtain (with $\ell := t_k$ and $y_\ell := x_\ell \odot_i c$, where $y_t = x_t \odot_i c$):
\[
\begin{aligned}
&s_{T-t}(y_t,x_t)\, D\Big(\frac{e^{T-\ell}-1}{e^{T-t}-1}\,\widehat{s}_{T-\ell}(y_t,x_t),\; s_{T-t}(y_t,x_t)\Big) \\
&\quad= \underbrace{\frac{e^{T-\ell}-1}{e^{T-t}-1}\, s_{T-\ell}(y_\ell,x_\ell)\, D\big(\widehat{s}_{T-\ell}(y_\ell,x_\ell),\; s_{T-\ell}(y_\ell,x_\ell)\big)}_{\text{controlled by Assumption 1}} \\
&\qquad+ \underbrace{\big(s_{T-t}(y_\ell,x_\ell) - s_{T-t}(y_t,x_t)\big)\log \frac{\widehat{s}_{T-\ell}(y_\ell,x_\ell)}{s_{T-\ell}(y_\ell,x_\ell)}}_{\text{expectation controlled by Lemma 14}} \;+\; s_{T-t}(y_t,x_t)\, D\big(s_{T-t}(y_\ell,x_\ell),\; s_{T-t}(y_t,x_t)\big).
\end{aligned}
\]
Similar to the proof for the uniform discrete diffusion model, the first term can be controlled by Assumption 1 after taking the expectation over $x_{t_k}\sim\overleftarrow{q}_{t_k}$ and integrating over time, and the second term can be proved to be zero by the martingale property from Lemma 14. Finally, using Dynkin's formula, we relate the third term to the effective total correlation $D(q_0)$.

Next, we derive iteration complexity guarantees for our algorithm under specific choices of step size schedules. The proof is given in Appendix D.2.

Corollary 2.
Consider the setting in Theorem 3 and let $T = \log\big(d\log(S)/\varepsilon\big)$. For a fixed $\varepsilon > 0$, the distribution $p_{\mathrm{output}}$ satisfies $\mathsf{KL}(q_{\mathrm{data}}\,\|\,p_{\mathrm{output}}) \lesssim \varepsilon_{\mathrm{score}} + \varepsilon$:

• under the constant step size schedule $t_k - t_{k-1} = T/N$ for all $k \in [N]$, provided
\[
N = \widetilde{O}\Big(\frac{B(q_{\mathrm{data}})}{\varepsilon}\Big);
\]
• under the exponential-then-constant step size schedule, with $t_{k+1} - t_k \le \kappa \min\big(1, T - t_{k+1}\big)$ for $k \in \{0,\dots,N-2\}$, $T - t_{N-1} = \varepsilon/(d\log(S))$, and $\kappa = N^{-1}\big(T + \log(\varepsilon^{-1} d\log(S))\big)$, provided
\[
N = \widetilde{O}\Big(\frac{D(q_{\mathrm{data}})}{\varepsilon}\Big) \;\le\; \widetilde{O}\Big(\frac{\min\{B(q_{\mathrm{data}}),\, C(q_{\mathrm{data}})\}}{\varepsilon}\Big).
\]

In words, Corollary 2 shows that the sampling complexity of Algorithm 1 required to obtain an $\varepsilon$-accurate distribution is governed by intrinsic complexity measures of the target distribution. Under the constant step size schedule, the iteration complexity is controlled by the dual total correlation of the target distribution, whereas under the exponential-then-constant schedule, the effective total correlation becomes the relevant quantity. For illustration, consider the following two simple examples.

• Consider first the uniform distribution on $[S]^d$. In this case, both complexity measures are independent of the ambient dimension $d$, so that
\[
N = \widetilde{O}\Big(\frac{1}{\varepsilon}\Big), \tag{21}
\]
reflecting the fact that it is exceptionally easy to sample from uniform distributions. While intuitive in hindsight, this phenomenon had not previously been formalized in the literature.

• As a second example, consider a mixture of two Dirac measures, $\frac{1}{2}\delta_{k_1} + \frac{1}{2}\delta_{k_2}$. A direct calculation shows that the dual total correlation is independent of $d$, so that
\[
N = \widetilde{O}\Big(\frac{1}{\varepsilon}\Big), \tag{22}
\]
indicating that such distributions are also handled automatically by our algorithm.
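The Dirac-mixture example is easy to check numerically. The sketch below (an illustration, not from the paper) computes $B$ and $C$ exactly for $q = \frac{1}{2}\delta_{k_1} + \frac{1}{2}\delta_{k_2}$ on $\{0,1\}^d$ with $k_1, k_2$ differing in every coordinate: $B(q) = \log 2$ for every $d$, while $C(q) = (d-1)\log 2$ grows linearly, so the constant-schedule complexity in Corollary 2 is dimension-free precisely because it is governed by $B$ rather than $C$.

```python
import math

# Exact B(q) and C(q) for q = 1/2 delta_{k1} + 1/2 delta_{k2} on {0,1}^d,
# with k1 = (0,...,0) and k2 = (1,...,1). Brute force over the 2-point support.
def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(q, coords):
    out = {}
    for x, p in q.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def B_and_C(d):
    q = {tuple([0] * d): 0.5, tuple([1] * d): 0.5}
    H = entropy(q)
    C = sum(entropy(marginal(q, [i])) for i in range(d)) - H
    B = H - sum(H - entropy(marginal(q, [j for j in range(d) if j != i]))
                for i in range(d))
    return B, C

# B stays at log 2 while C = (d - 1) log 2 grows with the dimension:
for d in (3, 6, 12):
    B, C = B_and_C(d)
```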
To further illustrate the implications of Theorem 3, we consider several representative distributions for which one or more of the quantities $B(q_{\mathrm{data}})$, $C(q_{\mathrm{data}})$, and $D(q_{\mathrm{data}})$ are small. Since the iteration complexity scales linearly with these quantities, our result shows that discrete diffusion models provably achieve efficient sampling in these settings. Appendix A develops these examples in detail and provides rigorous proofs of the stated claims.

• Hidden Markov models. Here, the observed variables correspond to words or tokens in a sentence, while the hidden states encode latent semantic topics. Under the natural assumption that topics evolve slowly, we show that $B(q_{\mathrm{data}})$ grows sublinearly with the sequence length.

• Low-dimensional structures. Motivated by image generation, when the discrete data arise from the quantization of a continuous distribution with intrinsic dimension $k$, the dual total correlation $B(q_{\mathrm{data}})$ scales linearly with $k$ rather than with the ambient dimension $d$.

• Random graph models. Such models define distributions over $d = \binom{n}{2}$ binary variables corresponding to the edges of a graph with $n$ vertices. Besides Erdős–Rényi random graphs, which have independent edges and are therefore easy to sample, we consider both sparse random regular graphs and stochastic block models. In these cases, $B(q_{\mathrm{data}})$ grows at most linearly (up to logarithmic factors) with $n$, rather than quadratically.

• Structure-with-noise distributions. Finally, we present an example in which both the total correlation $C(q_{\mathrm{data}})$ and the dual total correlation $B(q_{\mathrm{data}})$ are of order $d$, while the effective total correlation $D(q_{\mathrm{data}})$ remains of constant order. Such distributions are motivated by applications such as error-correcting codes and DNA sequences, where substantial noise may be present, yet the underlying signal is highly structured.
4 Discussion

In this work, we establish novel theoretical results for both uniform and masking discrete diffusions. For uniform diffusion models, we show that the τ-leaping algorithm requires $\widetilde{O}(d/\varepsilon)$ iterations to achieve $\varepsilon$ accuracy in KL divergence, improving on the prior bound $\widetilde{O}(d^2 S/\varepsilon)$. We further establish the first algorithmic lower bound for the τ-leaping sampler, which shows that our upper bound is unimprovable for a large class of distributions. For masking discrete diffusion, we derive an upper bound that captures the intrinsic complexity of the data distribution and can scale logarithmically with the ambient dimension. Importantly, our results for both models only require a small score estimation error and, in contrast to prior work, do not rely on early stopping or boundedness assumptions on the score estimator.

The improved bound for the masking noising process is achieved via a modification of the τ-leaping algorithm. This modification falls within a structured subclass of τ-leaping strategies that (i) allow for parallel coordinate updates, and thus sublinear rates, and (ii) preserve CTMC dynamics, which facilitates theoretical analysis. We hope that this perspective motivates the development of adaptive samplers for uniform discrete diffusion in future work.

Several other open questions remain. Understanding which noising mechanisms (masking, uniform, or others) are best suited to different classes of target distributions is an important direction for future work. Moreover, the problem of learning accurate score functions in discrete diffusion models remains largely unexplored and warrants further investigation.

Acknowledgements

This work is supported in part by the NSF grants CCF-2106778, CCF-2418156 and CAREER award DMS-2143215.

References

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993.

Austin, T. (2020). Multi-variate correlation and mixtures of product measures. Kybernetika, pages 459–499.

Bach, F. and Saremi, S. (2025). Sampling binary data by denoising through score functions. arXiv preprint arXiv:2502.00557.

Benton, J., Shi, Y., De Bortoli, V., Deligiannidis, G., and Doucet, A. (2024). From denoising diffusions to denoising Markov models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301.

Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., and Doucet, A. (2022). A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266–28279.

Chen, H., Lee, H., and Lu, J. (2023a). Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR.

Chen, H. and Ying, L. (2025). Convergence analysis of discrete diffusion model: exact implementation through uniformization. Journal of Machine Learning, 4(2):108–127.

Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. (2023b). Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations.

Chen, S., Cong, K., and Li, J. (2025). Optimal inference schedules for masked diffusion models. arXiv preprint arXiv:2511.04647.

Conforti, G., Durmus, A., and Pham, L.-T.-N. (2025). Non-asymptotic convergence of discrete diffusion models: Masked and random walk dynamics. arXiv preprint arXiv:2512.00580.

Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.

Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794.

Feinberg, E. A., Mandava, M., and Shiryaev, A. N. (2014). On solutions of Kolmogorov's equations for nonhomogeneous jump Markov processes. Journal of Mathematical Analysis and Applications, 411(1):261–270.

Feller, W. (1940). On the integro-differential equations of purely discontinuous Markoff processes. Transactions of the American Mathematical Society, 48(3):488–515.

Gales, M. and Young, S. (2024). The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing, 1(3):195–304.

Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics, 22(4):403–434.

Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of Chemical Physics, 115(4):1716–1733.

Gorban, A. N. and Tyukin, I. Y. (2018). Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2118):20170237.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. (2022). Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646.

Huang, Z., Wei, Y., and Chen, Y. (2024). Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality. arXiv preprint arXiv:2410.18784.

Ingraham, J., Garg, V., Barzilay, R., and Jaakkola, T. (2019). Generative models for graph-based protein design. Advances in Neural Information Processing Systems, 32.

Li, G. and Cai, C. (2025). Breaking AR's sampling bottleneck: Provable acceleration via diffusion language models. Advances in Neural Information Processing Systems, 38.

Li, G., Cai, C., and Wei, Y. (2025). Dimension-free convergence of diffusion models for approximate Gaussian mixtures. arXiv preprint arXiv:2504.05300.

Li, G., Wei, Y., Chen, Y., and Chi, Y. (2023). Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv preprint arXiv:2306.09251.

Li, G. and Yan, Y. (2024). Adapting to unknown low-dimensional structures in score-based diffusion models. Advances in Neural Information Processing Systems, 37:126297–126331.

Li, L., Carver, R., Lopez-Gomez, I., Sha, F., and Anderson, J. (2024). Generative emulation of weather forecast ensembles with diffusion models. Science Advances, 10(13).

Liang, J., Huang, Z., and Chen, Y. (2025a). Low-dimensional adaptation of diffusion models: Convergence in total variation. arXiv preprint arXiv:2501.12982.

Liang, Y., Huang, R., Lai, L., Shroff, N., and Liang, Y. (2025b). Absorb and converge: Provable convergence guarantee for absorbing discrete diffusion models. Advances in Neural Information Processing Systems, 39.

Liang, Y., Liang, Y., Lai, L., and Shroff, N. (2025c). Discrete diffusion models: Novel analysis and new sampler guarantees. Advances in Neural Information Processing Systems, 39.

Liebenau, A. and Wormald, N. (2024). Asymptotic enumeration of graphs by degree sequence, and the degree sequence of a random graph. Journal of the European Mathematical Society, 26:1–40.

Lou, A., Meng, C., and Ermon, S. (2024). Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, pages 4735–4763. PMLR.

Makur, A. and Polyanskiy, Y. (2018). Comparison of channels: Criteria for domination by a symmetric channel. IEEE Transactions on Information Theory, 64(8):5704–5725.
Meng, C., Choi, K., Song, J., and Ermon, S. (2022). Concrete score matc hing: Generalized score matc hing for discrete data. A dvanc es in Neur al Information Pr o c essing Systems , 35:34532–34545. Mor, B., Garh wal, S., and Kumar, A. (2021). A systematic review of hidden mark ov mo dels and their applications. Ar chives of c omputational metho ds in engine ering , 28(3). Ou, J., Nie, S., Xue, K., Zhu, F., Sun, J., Li, Z., and Li, C. (2025). Y our absorbing discrete diffusion secretly mo dels the conditional distributions of clean data. In The Thirte enth International Confer enc e on L e arning R epr esentations . P ark, Y.-H., Lai, C.-H., Ha yak aw a, S., T akida, Y., and Mitsufuji, Y. (2025). Jump your steps: Optimizing sampling schedule of discrete diffusion mo dels. In The Thirte enth International Confer enc e on L e arning R epr esentations . Pham, L.-T.-N., Shariatian, D., Ocello, A., Conforti, G., and Durm us, A. O. (2025). Discrete mark o v probabilistic mo dels: An improv ed discrete score-based framew ork with sharp conv ergence b ounds under minimal assumptions. In International Confer enc e on Machine L e arning . P op e, P ., Zh u, C., Ab delk ader, A., Goldblum, M., and Goldstein, T. (2021). The in trinsic dimension of images and its impact on learning. In International Confer enc e on L e arning R epr esentations . Ren, Y., Chen, H., Rotsk off, G. M., and Ying, L. (2025). How discrete and contin uous diffusion meet: Comprehensiv e analysis of discrete diffusion models via a sto c hastic integral framework. In International Confer enc e on L e arning R epr esentations . Saho o, S., Arriola, M., Schiff, Y., Gok aslan, A., Marro quin, E., Chiu, J., Rush, A., and Kulesho v, V. (2024). Simple and effectiv e masked diffusion language mo dels. A dvanc es in Neur al Information Pr o c essing Systems , 37:130136–130184. Sohl-Dic kstein, J., W eiss, E., Mahesw aranathan, N., and Ganguli, S. (2015). 
Deep unsup ervised learning using nonequilibrium thermodynamics. In International Confer enc e on Machine L e arning , pages 2256– 2265. Song, Y. and Ermon, S. (2019). Generativ e modeling b y estimating gradien ts of the data distribution. A dvanc es in Neur al Information Pr o c essing Systems , 32. V an Dijk, N. M. (1992). Uniformization for nonhomogeneous marko v chains. Op er ations r ese ar ch letters , 12(5):283–291. v on Rütte, D., Fluri, J., P o oladzandi, O., Sc hölkopf, B., Hofmann, T., and Orvieto, A. (2025). Scaling b eha vior of discrete diffusion language models. arXiv pr eprint arXiv:2512.10858 . W atson, J. L., Juergens , D., Bennett, N. R., T ripp e, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. (2023). De nov o design of protein structure and function with rfdiffusion. Natur e , 620(7976):1089–1100. Xu, M., Y u, L., Song, Y., Shi, C., Ermon, S., and T ang, J. (2022). Geo diff: A geometric diffusion mo del for molecular conformation generation. In International Confer enc e on L e arning R epr esentations . 18 Zeni, C., Pinsler, R., Zügner, D., F owler, A., Horton, M., F u, X., W ang, Z., Shyshey a, A., Crabbé, J., Ueda, S., et al. (2025). A generativ e model for inorganic materials design. Natur e , 639(8055):624–632. Zhang, Z., Chen, Z., and Gu, Q. (2025). Conv ergence of score-based discrete diffusion mo dels: A discrete-time analysis. In International Confer enc e on L e arning R epr esentations . A Examples of lo w in trinsic dimensions A.1 Details and formal results In this section, we revisit the examples outlined in Section 3.2.2 and develop them in full detail. W e formalize the statements in this section, and provide rigorous pro ofs in Appendix A.2 . Hidden Marko v mo dels. A hidden Marko v mo del (HMM) consists of a latent Mark ov c hain whose states are observed only indirectly through noisy measurements. 
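Before turning to the formal definitions below, the latent-chain-plus-noisy-emission structure can be made concrete with a short simulation. The sketch below is purely illustrative (the helper names `sample_hmm` and `prop1_bound`, the symmetric flip-noise emission channel, and all parameter values are our own choices, not from the paper); it draws one lazy latent chain and evaluates the quantity $pd\log(|\mathcal{Z}|/p)$ that Proposition 1 in this section uses to bound the dual total correlation.

```python
import math
import random

def sample_hmm(d, num_states, stay_prob, flip_prob, rng):
    """Sample one (z, x) path from a lazy latent chain with noisy emissions.

    z_i stays at z_{i-1} with probability `stay_prob`; the observation x_i
    equals z_i except for an independent corruption (playing the role of eps_i).
    """
    z = [rng.randrange(num_states)]
    for _ in range(d - 1):
        z.append(z[-1] if rng.random() < stay_prob else rng.randrange(num_states))
    x = [zi if rng.random() > flip_prob else rng.randrange(num_states) for zi in z]
    return z, x

def prop1_bound(d, num_states, p):
    """The bound p * d * log(|Z| / p) from Proposition 1 (natural log)."""
    return p * d * math.log(num_states / p)

rng = random.Random(0)
d, num_states = 1000, 4
p = 2.0 / d                      # infrequent topic switches: p = Theta(1/d)
z, x = sample_hmm(d, num_states, stay_prob=1 - p, flip_prob=0.1, rng=rng)
switches = sum(zi != zj for zi, zj in zip(z[1:], z[:-1]))
bound = prop1_bound(d, num_states, p)
print(f"latent switches: {switches}, bound on B(q_data): {bound:.2f}")
```

With $p = \Theta(1/d)$ the bound grows only logarithmically in $d$, while the ambient quantity $d\log S$ grows linearly, mirroring the discussion that follows Proposition 1.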
Such models are widely used in natural language processing and pattern recognition (Gales and Young, 2024; Mor et al., 2021). In language modeling, for instance, the hidden states $z_i$ may encode the semantic topic or grammatical structure of the $i$-th token or word, while the observed variables $x_i$ represent the realized words or tokens. Formally, let $\{z_i\}_{i \in [d]}$ be a discrete-state Markov chain supported on $\mathcal{Z}$, and let $\{x_i\}_{i \in [d]}$ be observations generated according to $x_i = f_i(z_i, \varepsilon_i)$, where $\{\varepsilon_i\}_{i \in [d]}$ are i.i.d. noise variables independent of $\{z_i\}_{i \in [d]}$. When $z_i$ represents the semantic topic of the $i$-th paragraph in a document, it is natural to assume that topic transitions happen only infrequently; that is, $z_i = z_{i-1}$ with high probability for $i > 1$. Under this model, we establish the following proposition, whose proof is deferred to Section A.2.1.

Proposition 1. Consider the HMM described above. Suppose the transition probability of $\{z_i\}_{i \in [d]}$ satisfies $\Pr(z_i \neq z_{i-1}) \leq p$ for all $i \in \{2, \ldots, d\}$. Assume that $1/d \lesssim p \ll 1$. Then
$$B(q_{\mathrm{data}}) \leq pd \log\Big(\frac{|\mathcal{Z}|}{p}\Big). \tag{23}$$

To develop some intuition, consider generating a document with a constant number of paragraphs, where the transition probability scales as $p = \Theta(1/d)$. Suppose further that the latent space satisfies $\mathcal{Z} \subseteq [S]^k$ for some $k \ll d$, where $S$ denotes the vocabulary size. Then the above bound yields $B(q_{\mathrm{data}}) \lesssim k \log(Sd)$, which is substantially smaller than the ambient dimension $d \log(S)$. As such, with Theorem 3, the sampling complexity scales with the intrinsic topic dimension $k$ rather than the document length $d$.

Low-dimensional structures.
In image generation and other structured data settings, it is commonly assumed that the data lie on or near a low-dimensional manifold embedded in a high-dimensional ambient space, which is often referred to as the manifold hypothesis (Gorban and Tyukin, 2018; Pope et al., 2021). For example, natural images may be viewed as points on a manifold parameterized by a small number of underlying factors, such as lighting conditions, pose, and object identity.

In discrete settings, the notion of a manifold is not mathematically well defined. To capture low-dimensional structure, we instead model the data as arising from a continuous mapping from a latent representation into a high-dimensional observation space. For some latent continuous random variable $z$ supported on $\mathcal{Z} \subset \mathbb{R}^k$, consider a decoding procedure $f : [0,1]^k \to \mathbb{R}^d$ with
$$x_{\mathrm{con}} = f(z) + \varepsilon_{\mathrm{noise}},$$
for additive perturbations $\varepsilon_{\mathrm{noise}}$. Thus, the data lie close to the manifold $\{f(z) : z \in \mathcal{Z}\}$. The final discrete observation is obtained via a quantization operator $Q_S$, i.e., $x = Q_S(x_{\mathrm{con}}) \sim q_{\mathrm{data}}$.

To align the model with standard image processing pipelines, we work with the uniform lattice quantization function $Q_S : \mathbb{R}^d \to [S]^d$ defined coordinate-wise as $[Q_S(x)]_i = \mathrm{clip}(\lfloor x_i \rfloor, 0, S)$ for $i \in [d]$, where $\mathrm{clip}(x, a, b) := \min\{\max\{x, a\}, b\}$ is the clip function and $\lfloor \cdot \rfloor$ is the floor function. To ensure regularity of both the manifold and the induced data distribution, we focus on the case where $\mathcal{Z}$ is a compact set and $f$ is a Lipschitz function. The noise $\varepsilon_{\mathrm{noise}}$ is taken to be Gaussian for simplicity of analysis; the arguments extend readily to more general smooth noise distributions.

Proposition 2. Let $\mathcal{Z} \subset \mathbb{R}^k$ be compact with diameter $D$, and let $f : \mathcal{Z} \to \mathbb{R}^d$ be $L$-Lipschitz. Assume the noise satisfies $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, generated independently for each observation.
Then the resulting distribution satisfies
$$B(q_{\mathrm{data}}) \leq k \log\Big(2 + \frac{2DL}{\sigma}\Big). \tag{24}$$

In image generation, the “ideal image” $x_{\mathrm{con}}$ may be interpreted as the vector of continuous pixel intensities prior to quantization, while the observed image $x$ is obtained by applying pixel-wise quantization to $x_{\mathrm{con}}$. When $k \ll d$, the above bound yields $B(q_{\mathrm{data}}) = \widetilde{O}(k) = o(d)$, and hence we can efficiently sample such images despite the high dimensionality of the observation space.

Random graph models. Discrete diffusion models have also found applications in scientific domains such as molecular generation and protein design, where data are naturally represented as random graphs with fixed vertex sets and random edges (Ingraham et al., 2019; Xu et al., 2022). To make this concrete, we consider two widely studied random graph models on $n$ vertices, which can be viewed as discrete distributions over adjacency matrices of dimension $n^2$.

• Regular graphs: A $k$-regular graph is a graph in which each vertex has degree exactly $k$. Suppose we want to sample a random graph $G$ from some distribution supported on the set of $k$-regular graphs with $n$ vertices.

Proposition 3. For the sparse regular graph model, i.e., $k \leq n/\log(n)$, we have
$$B(G) \lesssim kn \log\Big(\frac{n}{k}\Big) = o(n^2). \tag{25}$$

• Stochastic block models: A stochastic block model (SBM) is a generative model for random graphs that captures community structure within networks. In an SBM, the $n$ vertices of the graph are partitioned into $r$ distinct communities or blocks, represented by latent variables $\{z_i\}_{i \in [n]}$ taking values in $[r]$. Conditioned on the latent labels, edges are generated independently. For two vertices $i, j \in [n]$, an edge is created with probability $p\,\mathbb{I}\{z_i = z_j\} + q\,\mathbb{I}\{z_i \neq z_j\}$, where $p, q \in [0,1]$ govern the within- and between-community connection probabilities, respectively.

Proposition 4.
Let $G$ be a random graph drawn from the above $r$-block SBM. Then $B(G) \leq n \log(r) = o(n^2)$.

For both random graph models, as the number of vertices $n$ grows large, the complexity satisfies $B(G) = o(n^2)$, which is strictly smaller than the ambient dimension $n^2$. This indicates that diffusion-based methods can sample efficiently from such graph distributions.

In fact, the analyses of Propositions 2 and 4 extend naturally to generalized random geometric graphs. Consider the following example. Let each vertex $i \in [n]$ be associated with a latent variable $z_i \in \mathcal{Z}$. For distinct vertices $i$ and $j$, an edge is placed independently with probability
$$\beta \exp\Big(-\frac{d(z_i, z_j)}{r_0}\Big),$$
where $\beta \in [0,1]$, $r_0 > 0$, and $d(\cdot, \cdot)$ is an appropriate metric on the latent space $\mathcal{Z}$.

• When the latent variables $\{z_i\}$ are discrete with $o(n)$ entropy, as is the case, for example, when they take values in a fixed-dimensional latent space, the dual total correlation of the resulting random graph is $o(n^2)$.

• For continuous latent variables, suppose $\mathcal{Z} = \mathbb{S}^{d_z - 1}$, the unit sphere in $\mathbb{R}^{d_z}$. Under some regularity conditions, the dual total correlation scales with $d_z \cdot n$, by a covering number argument analogous to that of Proposition 2. In particular, whenever $d_z = o(n)$, the complexity is again subquadratic, leading to sublinear (in $n^2$) convergence rates for diffusion-based sampling.

Structure-with-noise distributions. A prototypical example of a distribution with small dual total correlation $B(q_0)$ and large total correlation $C(q_0)$ is the mixture of two Dirac measures:
$$p_{\mathrm{m}} := \tfrac{1}{2}\delta_{\mathbf{0}} + \tfrac{1}{2}\delta_{\mathbf{1}},$$
where $\mathbf{0}$ and $\mathbf{1}$ are the vectors of all-zeros and all-ones, respectively. It can be easily computed that $B(p_{\mathrm{m}}) = \log(2)$, whereas $C(p_{\mathrm{m}}) = (d-1)\log(2)$. The opposite happens, for instance, for the following XOR distribution $p_{\mathrm{XOR}}$:
$$x_1, \ldots, x_{d-1} \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(1/2) \quad \text{and} \quad x_d = \sum_{i=1}^{d-1} x_i \bmod 2.$$
In this case, $B(p_{\mathrm{XOR}}) = (d-1)\log(2)$ and $C(p_{\mathrm{XOR}}) = \log(2)$.

Real-world data distributions can combine features of both extremes: a strong low-dimensional signal corrupted by weakly correlated noise. In such cases, both $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ can be large, while $D(q_{\mathrm{data}})$ remains small. To illustrate this phenomenon, consider the following entrywise mixture of the two preceding examples.

1. Fix a bi-partition $[d] = I_0 \sqcup I_1$ for non-empty index sets $I_0$ and $I_1$;
2. For all indices $i \in I_0$, set $x_i = b$ for $b \sim \mathrm{Bern}(1/2)$;
3. Among all indices $i \in I_1$, sample all but one $x_i \sim \mathrm{Bern}(1/2)$ independently;
4. For the last index $i^\star$, set $x_{i^\star} = \big(b + \sum_{i \neq i^\star} x_i\big) \bmod 2$.

Denote this distribution by $p_{\mathrm{ex}}$, and let $x = (x_1, \ldots, x_d) \sim p_{\mathrm{ex}}$.

Proposition 5. Suppose that $\min\{|I_0|, |I_1|\}/d = \Theta(1)$. The distribution $p_{\mathrm{ex}}$ satisfies
$$B(p_{\mathrm{ex}}) = \Theta(d), \quad C(p_{\mathrm{ex}}) = \Theta(d), \quad \text{and} \quad D(p_{\mathrm{ex}}) = O(1). \tag{26}$$

By Proposition 5, $p_{\mathrm{ex}}$, which can be viewed as a non-trivial mixing of $p_{\mathrm{m}}$ and $p_{\mathrm{XOR}}$, satisfies $D(p_{\mathrm{ex}}) \ll \min\{B(p_{\mathrm{ex}}), C(p_{\mathrm{ex}})\}$. This example highlights the fundamental role of the effective total correlation in characterizing sampling efficiency.

A.2 Proofs of results in Section A.1

Variants of the following lemma will be used repeatedly throughout this section. We state it here for convenience and to streamline the proofs.

Lemma 1. Consider any $d$-dimensional discrete random variable $X$ and any random variable $W$ such that $X_i \perp\!\!\!\perp X_{-i} \mid W$ for any $i \in [d]$, where $X = (X_1, \ldots, X_d)$ and $X_{-i}$ is the $(d-1)$-dimensional marginal of $X$ with the $i$-th coordinate excluded. Then, $B(X) \leq I(X; W)$. If $W$ is discrete, we additionally have $B(X) \leq H(W)$.

Proof of Lemma 1.
We first notice that for any random variable $W$ such that $X_i \perp\!\!\!\perp X_{-i} \mid W$ for any $i \in [d]$, we have
$$H(X_i \mid X_{-i}) \geq H(X_i \mid X_{-i}, W) = H(X_i \mid W),$$
where the inequality holds since conditioning reduces entropy, and the equality follows from the conditional independence. Recalling the definition of $B(\cdot)$, we obtain
$$B(X) = H(X) - \sum_{i=1}^d H(X_i \mid X_{-i}) \leq H(X) - \sum_{i=1}^d H(X_i \mid W).$$
Using the conditional independence condition again, we have
$$H(X \mid W) = H\big((X_1, \ldots, X_d) \mid W\big) = \sum_{i=1}^d H(X_i \mid W),$$
which implies
$$B(X) \leq H(X) - H(X \mid W) = I(X; W) \overset{(a)}{=} H(W) - H(W \mid X) \overset{(b)}{\leq} H(W),$$
where (a) and (b) apply when $W$ is a discrete random variable.

A.2.1 Proof of Proposition 1

For the hidden Markov structure of $\{(x_i, z_i)\}_{i \in [d]}$, it holds that $x_i \perp\!\!\!\perp x_j \mid (z_i, z_j)$, since $\varepsilon_i \perp\!\!\!\perp \varepsilon_j \mid (z_i, z_j)$. In view of Lemma 1 above, we can upper bound $B(q_{\mathrm{data}})$ by $H(z)$, the entropy of the latent Markov chain. By the chain rule of entropy and the Markov property, we have
$$B(q_{\mathrm{data}}) \leq H(z) = H(z_1) + \sum_{i=2}^d H\big(z_i \mid \{z_j\}_{j \in [i-1]}\big) = H(z_1) + \sum_{i=2}^d H(z_i \mid z_{i-1}).$$
When $\{z_i\}_{i \in [d]}$ is supported on a single point, we have $|\mathcal{Z}| = 1$ and $H(z) = 0$. When the state space satisfies $2 \leq |\mathcal{Z}| < \infty$, the maximum-entropy distribution is achieved when $z_1 \sim \mathrm{Unif}(\mathcal{Z})$ and $z_i \mid z_{i-1} \sim (1-p)\delta_{z_{i-1}} + p\,\mathrm{Unif}\big(\mathcal{Z} \setminus \{z_{i-1}\}\big)$. We obtain
$$H(z) \leq \log(|\mathcal{Z}|) + \sum_{i=2}^d \Big[-(1-p)\log(1-p) - (|\mathcal{Z}|-1)\cdot\frac{p}{|\mathcal{Z}|-1}\log\Big(\frac{p}{|\mathcal{Z}|-1}\Big)\Big] \overset{(a)}{\leq} \log(|\mathcal{Z}|) + (d-1)\Big(2p + p\log\Big(\frac{|\mathcal{Z}|}{p}\Big)\Big) \overset{(b)}{\leq} pd\log\Big(\frac{|\mathcal{Z}|}{p}\Big),$$
where in (a), we use $-\log(1-p) \leq 2p$, since $p \ll 1$; in (b), we use the condition $p \gtrsim 1/d$ and $|\mathcal{Z}|/p \geq 2/p \gg 1$. This completes the proof of the desired result.

A.2.2 Proof of Proposition 2

Write $\varepsilon_{\mathrm{noise}} = (\varepsilon^1_{\mathrm{noise}}, \ldots, \varepsilon^d_{\mathrm{noise}})$. Since $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, we have $\varepsilon^i_{\mathrm{noise}} \perp\!\!\!\perp \varepsilon^{-i}_{\mathrm{noise}}$ for any $i \in [d]$.
Processing through the decoder $f$, we have $[x_{\mathrm{con}}]_i = [f(z)]_i + \varepsilon^i_{\mathrm{noise}}$ for any $i \in [d]$, which leads to
$$[x_{\mathrm{con}}]_i \perp\!\!\!\perp [x_{\mathrm{con}}]_{-i} \mid z, \tag{27}$$
where $[x_{\mathrm{con}}]_{-i}$ is the $(d-1)$-dimensional marginal of $x_{\mathrm{con}}$ with the $i$-th coordinate excluded. Note that $Q_S$ is an entry-wise quantization, i.e., we can write $Q_S(x) = (\widetilde{Q}_S(x_1), \ldots, \widetilde{Q}_S(x_d))$ for an entry-wise deterministic quantization function $\widetilde{Q}_S : \mathbb{R} \to [S]$, and $x_i = \widetilde{Q}_S([x_{\mathrm{con}}]_i)$ by the generation process. Eqn. (27) therefore implies that for any $i \in [d]$, $x_i \perp\!\!\!\perp x_{-i} \mid z$. Applying Lemma 1, we obtain
$$B(q_{\mathrm{data}}) = B(x) \leq I(x; z) \leq I(x_{\mathrm{con}}; z), \tag{28}$$
where the last inequality follows from the data processing inequality for mutual information. In the remainder of the proof, we proceed to control $I(x_{\mathrm{con}}; z)$. Since $\varepsilon_{\mathrm{noise}}$ is independent noise, using the data processing inequality, we reach
$$I(x_{\mathrm{con}}; z) \leq I\big(f(z) + \varepsilon_{\mathrm{noise}}; f(z)\big) = I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big). \tag{29}$$
Without loss of generality, we assume $\mathcal{Z} \subseteq [0, D]^k$. Partition $[0, D]^k$ into hypercubes of side length $h_J = \sigma/L$, and write this partition as $\{C_1, \ldots, C_{\lceil D/h_J \rceil^k}\}$ such that
$$[0, D]^k \subseteq \bigsqcup_{i=1}^{\lceil D/h_J \rceil^k} C_i.$$
Define $J = J(z)$ to be the hypercube index $i(z)$ such that $z \in C_{i(z)}$, and $\mathcal{F}_J$ to be the $\sigma$-algebra generated by $J(z)$. By the chain rule and data processing inequality for mutual information, we have
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq I\big(J(z), f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) = I\big(J(z); f(z) + \varepsilon_{\mathrm{noise}}\big) + I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big) \leq k\log\Big(1 + \frac{D}{h_J}\Big) + I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big), \tag{30}$$
where in the last line, we use $I(J(z); f(z) + \varepsilon_{\mathrm{noise}}) \leq H(J(z)) \leq \log(|\mathrm{supp}(J(z))|)$. To upper bound the second term above, we introduce the following lemma on the Gaussian channel, whose proof is given in Section F.1.

Lemma 2.
For any random variable $W \in \mathbb{R}^d$ and independent noise $\varepsilon_{\mathrm{noise}} \sim \mathcal{N}(0, \sigma^2 I_d)$, we have
$$I(W; W + \varepsilon_{\mathrm{noise}}) \leq \frac{\mathrm{Tr}\big(\mathrm{Var}[W]\big)}{2\sigma^2},$$
where $\mathrm{Tr}(\cdot)$ is the trace function.

In Lemma 2, taking $W \overset{d}{=} f(z) \mid \mathcal{F}_J$, we arrive at
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}} \mid J\big) \leq \frac{\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big)}{2\sigma^2}. \tag{31}$$
To further control the right-hand side, direct calculations show
$$\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big) = \sum_{i=1}^d \mathrm{Var}\big[[f(z)]_i \mid \mathcal{F}_J\big] = \mathbb{E}\Big[\big\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\big\|_2^2 \,\Big|\, \mathcal{F}_J\Big]. \tag{32}$$
It is therefore sufficient to consider the quantity $\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\|_2^2$. We make the observation that
$$\big\|f(z) - \mathbb{E}[f(z) \mid \mathcal{F}_J]\big\|_2 \overset{(a)}{\leq} \sup_{w \in \mathrm{Conv}(f(C_{J(z)}))} \|f(z) - w\|_2 \overset{(b)}{=} \sup_{w \in f(C_{J(z)})} \|f(z) - w\|_2 \overset{(c)}{\leq} \|f\|_{\mathrm{Lip}} \cdot \sup_{z' \in C_{J(z)}} \|z - z'\|_2 \overset{(d)}{\leq} L\sqrt{k}\,h_J,$$
where $\|\cdot\|_2$ denotes the Euclidean norm in $\mathbb{R}^d$, and $\mathrm{Conv}(\cdot)$ denotes the convex hull of a given set. In (a), we use the fact that $\mathbb{E}[f(z) \mid \mathcal{F}_J] \in \mathrm{Conv}(f(C_{J(z)}))$; in (b), we use the fact that $f$ is continuous, hence $f(C_{J(z)})$ is bounded, together with the property of the convex hull that $\mathrm{diam}\big(\mathrm{Conv}(A)\big) = \mathrm{diam}(A)$ for any bounded subset $A \subseteq \mathbb{R}^d$; in (c), we recall the Lipschitz condition on $f$; in (d), we notice that $\mathrm{diam}(C_i) \leq \sqrt{k}\,h_J$ for any hypercube $C_i$. Putting the pieces together gives
$$\mathrm{Tr}\big(\mathrm{Var}[f(z) \mid \mathcal{F}_J]\big) \leq \big(L\sqrt{k}\,h_J\big)^2 = k\sigma^2. \tag{33}$$
Finally, plugging Eqns. (31) and (33) into Eqn. (30), we obtain
$$I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq k\log\Big(1 + \frac{DL}{\sigma}\Big) + \frac{k}{2} \leq k\log\Big(2 + \frac{2DL}{\sigma}\Big).$$
Combining the above inequality with Eqns. (28) and (29), we conclude
$$B(q_{\mathrm{data}}) \leq I(x_{\mathrm{con}}; z) = I\big(f(z); f(z) + \varepsilon_{\mathrm{noise}}\big) \leq k\log\Big(2 + \frac{2DL}{\sigma}\Big).$$

A.2.3 Proof of Proposition 3

Define the set of all $k$-regular graphs with $n$ vertices as $\mathcal{G}_{n,k}$. Without loss of generality, we assume that $nk$ is even, as otherwise $\mathcal{G}_{n,k}$ is empty.
By a corollary of Liebenau and Wormald (2024, Theorem 1.4), we have the following asymptotic result:
$$|\mathcal{G}_{n,k}| = \Theta\bigg(\binom{n-1}{k}^n \binom{n(n-1)/2}{m}^{-1}\bigg), \quad \text{where } m = kn/2.$$
By Stirling's formula of the form $\log(a!) = a\log(a) - a + O(\log(a))$, we can compute that
$$\log(|\mathcal{G}_{n,k}|) \lesssim n\log\binom{n-1}{k} - \log\binom{n(n-1)/2}{m} = \frac{kn}{2}\log\Big(\frac{n-1-k}{k}\Big) + \frac{n(n-1)}{2}\log\Big(\frac{n-1}{n-1-k}\Big) \leq \frac{kn}{2}\log\Big(\frac{n}{k}\Big) + \frac{n^2}{2}\log\Big(1 + \frac{k}{n-1-k}\Big) \leq \frac{kn}{2}\log\Big(\frac{n}{k}\Big) + \frac{kn^2}{2(n-1-k)} \lesssim kn\log\Big(\frac{n}{k}\Big),$$
where in the last line, we invoke the condition that $k \leq n/\log(n) \ll n-1-k$. Recalling the definition of $B(\cdot)$, we conclude
$$B(G) \leq H(G) \leq \log(|\mathcal{G}_{n,k}|) \lesssim kn\log\Big(\frac{n}{k}\Big) = o(n^2).$$

A.2.4 Proof of Proposition 4

By the definition of the $r$-block SBM, the latent variable vector $(z_1, \ldots, z_n)$ is supported on $[r]^n$, which satisfies
$$H\big((z_1, \ldots, z_n)\big) \leq \log\big(|[r]^n|\big) = n\log(r).$$
Given the latent variables $(z_1, \ldots, z_n)$, the block structure is fixed and hence each edge is sampled independently from a Bernoulli distribution. Therefore, we have $e_{ij} \perp\!\!\!\perp e_{k\ell} \mid \{z_i\}_{i \in [n]}$ for any $i, j, k, \ell \in [n]$, where $e_{ij}$ and $e_{k\ell}$ are the indicator variables of the existence of edges between vertices $i, j$ and between vertices $k, \ell$. By Lemma 1, we conclude
$$B(G) \leq H\big((z_1, \ldots, z_n)\big) \leq n\log(r) \leq n\log(n) = o(n^2),$$
where we use the convention that the number of blocks satisfies $r \leq n$.

Remark 2. The setting of Proposition 4 can be viewed as a special case of the generalized random geometric graph model, in which the latent variable corresponds to the block index. More generally, the same conclusion holds under analogous assumptions, with essentially the same proof strategy.

A.2.5 Proof of Proposition 5

Let $r := |I_0|/d$ be the proportion of coordinates in $I_0$.
Throughout, we assume $\min\{r, 1-r\} = \Theta(1)$.

Step 1: Establish $B(p_{\mathrm{ex}}) = \Theta(d)$ and $C(p_{\mathrm{ex}}) = \Theta(d)$. For a random variable $x \sim p_{\mathrm{ex}}$, we shall demonstrate that
$$\sum_{i=1}^d H(x_i) = d\log(2), \quad \log(2)(|I_1| - 1) \leq H(x) \leq \log(2)|I_1|, \quad \text{and} \quad \sum_{i=1}^d H(x_i \mid x_{-i}) = 0. \tag{34}$$
Towards this goal, we make the observation that for any $i \in I_0$ or $i \in I_1 \setminus \{i^\star\}$, $x_i \sim \mathrm{Bern}(1/2)$ and hence $H(x_i) = \log(2)$. For $i = i^\star$, we assert that $x_{i^\star} \sim \mathrm{Bern}(1/2)$ as well. In fact, we have
$$\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 0 \bmod 2\bigg) = \mathbb{P}\Big(\mathrm{Bin}\big(|I_1| - 1, \tfrac{1}{2}\big) \equiv 0 \bmod 2\Big) = \frac{1}{2},$$
where in the last equality, we invoke the following lemma.

Lemma 3. For any $n \in \mathbb{N}^+$ and $X \sim \mathrm{Bin}(n, 1/2)$, we have $\mathbb{P}(X \equiv 0 \bmod 2) = \mathbb{P}(X \equiv 1 \bmod 2) = \frac{1}{2}$.

As a result, the distribution of $x_{i^\star}$ satisfies
$$\mathbb{P}(x_{i^\star} = 0) = \mathbb{P}(b = 0)\cdot\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 0 \bmod 2\bigg) + \mathbb{P}(b = 1)\cdot\mathbb{P}\bigg(\sum_{i \in I_1 \setminus \{i^\star\}} x_i \equiv 1 \bmod 2\bigg) = \frac{1}{2},$$
which reveals that $x_{i^\star} \sim \mathrm{Bern}(1/2)$ and hence $H(x_{i^\star}) = \log(2)$. In conclusion, we obtain
$$\sum_{i=1}^d H(x_i) = \sum_{i \in [d] \setminus \{i^\star\}} H(x_i) + H(x_{i^\star}) = d\log(2). \tag{35}$$
To upper bound $H(x)$, we invoke the elementary bound of entropy by support size to get
$$H(x) \leq \log(|\mathrm{supp}(x)|) \leq \log\big(2 \cdot 2^{|I_1|-1}\big) = \log(2)|I_1|. \tag{36}$$
The lower bound can be obtained through
$$H(x) \geq H\big(\{x_i\}_{i \in I_1 \setminus \{i^\star\}}\big) = \log\big(2^{|I_1|-1}\big) = \log(2)(|I_1| - 1). \tag{37}$$
For any $i \in [d]$, when $x_{-i}$ is given, we can recover $x_i$ by first observing the value of $b$ from $x_j$ for any $j \in I_0$, then applying the formula
$$x_i = \Big(b + \sum_{k \in I_1 \setminus \{i\}} x_k\,\mathbb{I}\{i \in I_1\}\Big) \bmod 2.$$
Thus, $x_i \mid x_{-i}$ is always a Dirac measure, which leads to
$$\sum_{i=1}^d H(x_i \mid x_{-i}) = 0. \tag{38}$$
Combining Eqns. (35), (36), (37) and (38) proves Eqn. (34). Equipped with Eqn. (34), we are ready to bound $B(p_{\mathrm{ex}})$ and $C(p_{\mathrm{ex}})$.
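As a quick numerical sanity check (separate from the proof itself), both Lemma 3 and the three identities in Eqn. (34) can be confirmed by brute-force enumeration for a small instance. The sketch below uses a hypothetical partition $I_0 = \{0,1\}$, $I_1 = \{2,3,4\}$ with $d = 5$ and natural-log entropies; all function names are our own.

```python
import math
from itertools import product
from collections import Counter

def pmf_pex(I0, I1):
    """Enumerate the support of p_ex for the entrywise-mixture construction."""
    d = len(I0) + len(I1)
    i_star = I1[-1]                      # the parity coordinate i* in I_1
    free = [i for i in I1 if i != i_star]
    counts = Counter()
    for b in (0, 1):
        for bits in product((0, 1), repeat=len(free)):
            x = [0] * d
            for i in I0:
                x[i] = b
            for i, v in zip(free, bits):
                x[i] = v
            x[i_star] = (b + sum(x[i] for i in range(d) if i != i_star)) % 2
            counts[tuple(x)] += 1
    total = sum(counts.values())
    return {xx: c / total for xx, c in counts.items()}

def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(pmf, idx):
    out = Counter()
    for xx, p in pmf.items():
        out[tuple(xx[i] for i in idx)] += p
    return dict(out)

I0, I1 = [0, 1], [2, 3, 4]
d = 5
pmf = pmf_pex(I0, I1)
H_joint = entropy(pmf)                                    # H(x)
H_marg = sum(entropy(marginal(pmf, [i])) for i in range(d))  # sum_i H(x_i)
# H(x_i | x_{-i}) = H(x) - H(x_{-i})
H_cond = sum(H_joint - entropy(marginal(pmf, [j for j in range(d) if j != i]))
             for i in range(d))
print(H_joint / math.log(2), H_marg / math.log(2), H_cond)

# Lemma 3: P(Bin(n, 1/2) even) = 1/2, i.e. the even binomials sum to 2^(n-1)
assert all(sum(math.comb(n, k) for k in range(0, n + 1, 2)) == 2 ** (n - 1)
           for n in range(1, 9))
```

For this instance the enumeration gives $H(x) = 3\log 2 = \log(2)|I_1|$, $\sum_i H(x_i) = 5\log 2 = d\log 2$, and $\sum_i H(x_i \mid x_{-i}) = 0$, matching Eqn. (34).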
It can be easily seen that
$$B(p_{\mathrm{ex}}) = H(x) - \sum_{i=1}^d H(x_i \mid x_{-i}) \geq \log(2)\big((1-r)d - 1\big) = \Omega(d),$$
$$C(p_{\mathrm{ex}}) = \sum_{i=1}^d H(x_i) - H(x) \geq \log(2)(d - |I_1|) = \log(2)\,rd = \Omega(d).$$
For the reverse direction, we can prove the matching upper bounds similarly, which leads to $B(p_{\mathrm{ex}}) = \Theta(d)$ and $C(p_{\mathrm{ex}}) = \Theta(d)$.

Step 2: Show $D(p_{\mathrm{ex}}) = O(1)$. Recall the definition of $D(\cdot)$ in Eqn. (16):
$$D(p_{\mathrm{ex}}) := \int_0^\infty \min(1, t)\,\mathcal{I}(t)\,\mathrm{d}t \quad \text{with} \quad \mathcal{I}(t) := \sum_{i \neq j \in [d]} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) \geq 0.$$
To upper bound $D(p_{\mathrm{ex}})$, let us write
$$D(p_{\mathrm{ex}}) = \int_0^{1/d} t\,\mathcal{I}(t)\,\mathrm{d}t + \int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + \int_{\log(d)}^\infty \mathcal{I}(t)\,\mathrm{d}t.$$
By direct calculations, one has
$$\int_0^{1/d} t\,\mathcal{I}(t)\,\mathrm{d}t \leq \frac{1}{d}\int_0^{1/d} \mathcal{I}(t)\,\mathrm{d}t \leq \frac{B(p_{\mathrm{ex}})}{d} = \Theta(1), \qquad \int_{\log(d)}^\infty \mathcal{I}(t)\,\mathrm{d}t \leq \frac{1}{d-1}\int_{\log(d)}^\infty (e^t - 1)\,\mathcal{I}(t)\,\mathrm{d}t \leq \frac{C(p_{\mathrm{ex}})}{d-1} = \Theta(1).$$
Therefore, it obeys
$$D(p_{\mathrm{ex}}) = \int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + O(1).$$
To prove $D(p_{\mathrm{ex}}) = O(1)$, it suffices to show that
$$\int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{39}$$
In view of the definition of $\mathcal{I}(t)$, we can decompose it as
$$\mathcal{I}(t) = \bigg(\sum_{i,j \in I_0, i \neq j} + \sum_{i,j \in I_1, i \neq j} + \sum_{i \in I_0, j \in I_1} + \sum_{i \in I_1, j \in I_0}\bigg) I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) =: \mathcal{I}_1(t) + \mathcal{I}_2(t) + \mathcal{I}_3(t) + \mathcal{I}_4(t),$$
and we shall bound these four terms separately. Before diving into the proofs, we make the observation that the mutual information can be computed via
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H\big(x_t^i \mid x_t^{-(i,j)}\big) - H\big(x_t^i \mid x_t^{-i}\big). \tag{40}$$
To further compute each entropy term, let us introduce two quantities below:
$$H_t^1 = H\big(e^{-t}\delta_0 + (1 - e^{-t})\delta_{\mathrm{MASK}}\big) = H\big(e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}}\big) = te^{-t} - \log(1 - e^{-t})(1 - e^{-t}), \tag{41a}$$
$$H_t^2 = H\Big(\tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}}\Big) = (t + \log(2))e^{-t} - \log(1 - e^{-t})(1 - e^{-t}).
(41b)$$

We shall relate our quantities of interest to these terms below.

Case 1: $i, j \in I_0$, $i \neq j$. For any given $x_t^{-(i,j)}$, it always holds true that $\mathbb{P}(x_t^i = \mathrm{MASK}) = 1 - e^{-t}$, since the noising process is time-homogeneous and independent between coordinates. Recall the definition $m(x) = \{i \in [d] : x^i = \mathrm{MASK}\}$. Define the event $\mathcal{E}^{i,j}_{t,1} \in \mathcal{F}^{-(i,j)}_t$, where $\mathcal{F}^{-(i,j)}_t$ is the $\sigma$-algebra generated by $x_t^{-(i,j)}$, as follows:
$$\mathcal{E}^{i,j}_{t,1} := \bigg\{x_t^{-(i,j)} : \bigg(\bigvee_{k \in I_0 \setminus \{i,j\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\},$$
where $\wedge$ is the logical operator AND, and $\vee$ is the logical operator OR. By the construction of $p_{\mathrm{ex}}$, it can be checked that
$$\big(x_t^i \mid x_t^{-(i,j)} \in \mathcal{E}^{i,j}_{t,1}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-(i,j)} \in (\mathcal{E}^{i,j}_{t,1})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
where $\delta_{0/1}$ represents either $\delta_0$ or $\delta_1$. Therefore, by the definition of conditional entropy, we have
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^{i,j}_{t,1}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^{i,j}_{t,1})\big). \tag{42}$$
Define the event $\mathcal{E}^i_{t,1} \in \mathcal{F}^{-i}_t$, where $\mathcal{F}^{-i}_t$ is the $\sigma$-algebra generated by $x_t^{-i}$, as follows:
$$\mathcal{E}^i_{t,1} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}.$$
Then, it can be checked similarly that
$$\big(x_t^i \mid x_t^{-i} \in \mathcal{E}^i_{t,1}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-i} \in (\mathcal{E}^i_{t,1})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
which leads to
$$H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,1}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,1})\big). \tag{43}$$
Plugging Eqns. (42) and (43) into Eqn.
(40) gives that for any $i, j \in I_0$, $i \neq j$,
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H\big(x_t^i \mid x_t^{-(i,j)}\big) - H\big(x_t^i \mid x_t^{-i}\big) = \big(H_t^2 - H_t^1\big)\big(\mathbb{P}(\mathcal{E}^i_{t,1}) - \mathbb{P}(\mathcal{E}^{i,j}_{t,1})\big) = \log(2)\,e^{-2t}\big(1 - e^{-t}\big)^{|I_0|-2}\Big(1 - e^{-|I_1|t}\Big) = O\Big(e^{-2t}(1 - e^{-t})^{rd/2}\Big),$$
whose value is independent of the indices $i$ and $j$. Since $|\{i, j \in I_0 : i \neq j\}| = rd(rd - 1) = \Theta(d^2)$, the quantity $\mathcal{I}_1(t)$ satisfies
$$\mathcal{I}_1(t) = \sum_{i,j \in I_0, i \neq j} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\Big(d^2 e^{-2t}(1 - e^{-t})^{rd/2}\Big). \tag{44}$$

Case 2: $i, j \in I_1$, $i \neq j$. Following the proof strategy of Case 1, for any given $x_t^{-(i,j)}$, it holds that
$$x_t^i \mid x_t^{-(i,j)} \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
which implies that
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^2. \tag{45}$$
Define the event $\mathcal{E}^i_{t,2} \in \mathcal{F}^{-i}_t$ as follows:
$$\mathcal{E}^i_{t,2} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0} \{k \notin m(x_t)\}\bigg) \wedge \bigg(\bigwedge_{\ell \in I_1 \setminus \{i\}} \{\ell \in m(x_t)\}\bigg) = 1\bigg\},$$
which induces
$$\big(x_t^i \mid x_t^{-i} \in \mathcal{E}^i_{t,2}\big) \sim e^{-t}\delta_{0/1} + (1 - e^{-t})\delta_{\mathrm{MASK}}; \qquad \big(x_t^i \mid x_t^{-i} \in (\mathcal{E}^i_{t,2})^c\big) \sim \tfrac{1}{2}e^{-t}\delta_0 + \tfrac{1}{2}e^{-t}\delta_1 + (1 - e^{-t})\delta_{\mathrm{MASK}},$$
and the conditional entropy formula
$$H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,2}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,2})\big). \tag{46}$$
Plugging Eqns. (45) and (46) into Eqn. (40) gives that for any $i, j \in I_1$, $i \neq j$,
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = H_t^2 - H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,2}) - H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,2})\big) = \big(H_t^2 - H_t^1\big)\mathbb{P}(\mathcal{E}^i_{t,2}) = O\Big(e^{-(1-r)dt}\Big),$$
whose value is, again, independent of the indices $i$ and $j$. Since $|\{i, j \in I_1 : i \neq j\}| = (1-r)d\big((1-r)d - 1\big) = \Theta(d^2)$, we reach
$$\mathcal{I}_2(t) = \sum_{i,j \in I_1, i \neq j} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\big(d^2 e^{-(1-r)dt}\big). \tag{47}$$

Case 3: $i \in I_0$, $j \in I_1$.
Define the function $H_B(p) := -p\log(p) - (1-p)\log(1-p)$ to be the entropy of the distribution $\mathrm{Bern}(p)$. Following the proofs of the two cases above, let us define the events
$$\mathcal{E}^{i,j}_{t,3} := \bigg\{x_t^{-(i,j)} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1 \setminus \{j\}} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}; \qquad \mathcal{E}^i_{t,3} := \bigg\{x_t^{-i} : \bigg(\bigvee_{k \in I_0 \setminus \{i\}} \{k \notin m(x_t)\}\bigg) \vee \bigg(\bigwedge_{\ell \in I_1} \{\ell \in m(x_t)\}\bigg) = 1\bigg\}.$$
Similar calculations yield
$$H\big(x_t^i \mid x_t^{-(i,j)}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^{i,j}_{t,3}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^{i,j}_{t,3})\big); \qquad H\big(x_t^i \mid x_t^{-i}\big) = H_t^1 \cdot \mathbb{P}(\mathcal{E}^i_{t,3}) + H_t^2 \cdot \big(1 - \mathbb{P}(\mathcal{E}^i_{t,3})\big).$$
Therefore, we obtain
$$I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = \big(H_t^2 - H_t^1\big)\big(\mathbb{P}(\mathcal{E}^i_{t,3}) - \mathbb{P}(\mathcal{E}^{i,j}_{t,3})\big) = \log(2)\,e^{-|I_1|t}\big(1 - e^{-t}\big)^{|I_0|} = O\big(e^{-H_B(r)d}\big),$$
where the last equality is due to the fact that $e^{-|I_1|t}(1 - e^{-t})^{|I_0|}$ is maximized at $t = -\log(1-r)$. Finally, with $|\{i \in I_0, j \in I_1\}| = r(1-r)d^2$, we can bound
$$\mathcal{I}_3(t) = \sum_{i \in I_0, j \in I_1} I\big(x_t^i; x_t^j \mid x_t^{-(i,j)}\big) = O\big(d^2 e^{-H_B(r)d}\big). \tag{48}$$

Case 4: $i \in I_1$, $j \in I_0$. Notice that $\mathcal{I}_3(t)$ and $\mathcal{I}_4(t)$ are invariant under swapping $i$ and $j$. We can show in the same way as above that
$$\mathcal{I}_4(t) = O\big(d^2 e^{-H_B(r)d}\big). \tag{49}$$

Putting everything together. Combining Eqns. (44), (47), (48) and (49), we arrive at
$$\mathcal{I}(t) \lesssim d^2\Big(e^{-2t}(1 - e^{-t})^{rd/2} + e^{-(1-r)dt} + e^{-H_B(r)d}\Big). \tag{50}$$
We are now in a position to prove Eqn. (39). Let us begin with the integration over the time interval $t \in [1/d, 1]$. Direct calculation yields that $e^{-2t}(1 - e^{-t})^{rd/2}$ is maximized at $t^\star = \log(1 + \frac{rd}{4}) > 1$, which reveals that
$$d^2 e^{-2t}(1 - e^{-t})^{rd/2} \leq d^2 e^{-2t^\star} = d^2\Big(1 + \frac{rd}{4}\Big)^{-2} = O(1).
\tag{51}$$
For the term from $\mathcal{I}_2(t)$, we obtain
$$\int_{1/d}^1 t \cdot d^2 e^{-(1-r)dt}\,\mathrm{d}t \overset{(a)}{=} \int_1^d s e^{-(1-r)s}\,\mathrm{d}s \leq \int_0^\infty s e^{-(1-r)s}\,\mathrm{d}s = \frac{1}{(1-r)^2} = O(1), \tag{52}$$
where in (a), we use the change of variables $s = dt$. Similarly, we can show that
$$\int_{1/d}^1 t \cdot d^2 e^{-H_B(r)d}\,\mathrm{d}t \leq \int_0^1 t \cdot d^2 e^{-H_B(r)d}\,\mathrm{d}t = \frac{1}{2}d^2 e^{-H_B(r)d} = O(1), \tag{53}$$
where the condition $\min\{r, 1-r\} = \Theta(1)$ ensures $H_B(r) = \Theta(1)$. Taking Eqns. (50), (51), (52) and (53) collectively, we arrive at
$$\int_{1/d}^1 \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_{1/d}^1 t\,\mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{54}$$
Let us move on to the integration over the time interval $t \in [1, \log(d)]$. The integral computation yields that
$$\int_1^{\log(d)} d^2 e^{-2t}(1 - e^{-t})^{rd/2}\,\mathrm{d}t \overset{(a)}{=} \int_1^{d/e} s\Big(1 - \frac{s}{d}\Big)^{rd/2}\,\mathrm{d}s \overset{(b)}{\leq} \int_0^\infty s e^{-rs/2}\,\mathrm{d}s = O(1), \tag{55}$$
where in (a), we use the change of variables $s = de^{-t}$, and in (b), we use the inequality $(1-x)^{1/x} \leq e^{-1}$ for $x \in (0, 1]$. For the remaining terms, we have
$$\int_1^{\log(d)} d^2 e^{-(1-r)dt}\,\mathrm{d}t \leq \int_0^{\log(d)} d^2 e^{-(1-r)d}\,\mathrm{d}t = d^2\log(d)\,e^{-(1-r)d} = O(1); \tag{56}$$
$$\int_1^{\log(d)} d^2 e^{-H_B(r)d}\,\mathrm{d}t \leq d^2\log(d)\,e^{-H_B(r)d} = O(1), \tag{57}$$
where the condition $\min\{r, 1-r\} = \Theta(1)$ ensures $H_B(r) = \Theta(1)$. Now, combining Eqns. (50), (55), (56) and (57) yields
$$\int_1^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_1^{\log(d)} \mathcal{I}(t)\,\mathrm{d}t = O(1). \tag{58}$$
Finally, equipped with Eqns. (54) and (58), we conclude
$$\int_{1/d}^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = \int_{1/d}^1 \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t + \int_1^{\log(d)} \min\{1, t\}\,\mathcal{I}(t)\,\mathrm{d}t = O(1),$$
which proves $D(p_{\mathrm{ex}}) = O(1)$.

B Technical preparations

B.1 Score functions

Below, we present an equivalent formulation of the score functions.

Proposition 6. Let $q_0$ be an initial distribution on $\mathcal{X}_0$. Let $x, y \in \mathcal{X}$ be such that $Q(y, x) > 0$. Then,

1.
for the uniform noising process,
$$s_t(y, x) = \frac{\mathbb{E}_{x_0 \sim q_0}\,\alpha_t^{d_H(y, x_0)}}{\mathbb{E}_{x_0 \sim q_0}\,\alpha_t^{d_H(x, x_0)}}, \tag{59}$$
where $\alpha_t := \frac{1 - e^{-t}}{1 + (S-1)e^{-t}}$.

2. for the masking noising process,
$$s_t(y, x) = \frac{1}{e^t - 1}\cdot\frac{q_0(y)}{q_0(x)}, \tag{60}$$
where for $x \in \mathcal{X} \setminus \mathcal{X}_0$, $q_0(x)$ is the marginal probability of the unmasked coordinates of $x$ under $q_0$.

Proof of Proposition 6. By the definition of the score function, one can write
$$s_t(y, x) = \frac{q_t(y)}{q_t(x)} = \frac{\sum_{x_0} q_{t|0}(y \mid x_0)\,q_0(x_0)}{\sum_{x_0} q_{t|0}(x \mid x_0)\,q_0(x_0)}.$$
For the uniform noising process, one can solve the Kolmogorov forward equation for every dimension. As a result, the transition can be written as
$$q_{t|0}(y \mid x_0) = \Big(\frac{1 - e^{-t}}{S}\Big)^{d_H(y, x_0)}\Big(\frac{1 + (S-1)e^{-t}}{S}\Big)^{d - d_H(y, x_0)} = \Big(\frac{1 + (S-1)e^{-t}}{S}\Big)^d \alpha_t^{d_H(y, x_0)},$$
which proves Eqn. (59). More details of this relation can be found in, e.g., Zhang et al. (2025, Proposition 1). For the masking noising process, for notational convenience, given any $x \in ([S] \cup \{\mathrm{MASK}\})^d$, define
$$m(x) := \{i \in [d] : x^i = \mathrm{MASK}\}. \tag{61}$$
In view of this piece of notation, as $\Pr(x_t^i = \mathrm{MASK}) = 1 - e^{-t}$ and the coordinates evolve independently, one can write
$$q_{t|0}(y \mid x_0) = (1 - e^{-t})^{|m(y)|}\,e^{-t(d - |m(y)|)}\,\mathbb{I}\{\text{for all } i \in [d],\ y^i \in \{x_0^i, \mathrm{MASK}\}\}.$$
As $Q(y, x) > 0$, it must be that $d_H(x, y) = 1$, and for the $i$ such that $x^i \neq y^i$, we have $x^i = \mathrm{MASK}$ and $y^i \neq \mathrm{MASK}$. This implies that $|m(x)| = |m(y)| + 1$, and we can write
$$\frac{\sum_{x_0} q_{t|0}(y \mid x_0)\,q_0(x_0)}{\sum_{x_0} q_{t|0}(x \mid x_0)\,q_0(x_0)} = \frac{e^{-t}}{1 - e^{-t}}\cdot\frac{\sum_{x_0} q_0(x_0)\,\mathbb{I}\{\text{for all } i \in [d],\ y^i \in \{x_0^i, \mathrm{MASK}\}\}}{\sum_{x_0} q_0(x_0)\,\mathbb{I}\{\text{for all } i \in [d],\ x^i \in \{x_0^i, \mathrm{MASK}\}\}} = \frac{1}{e^t - 1}\cdot\frac{q_0(y)}{q_0(x)}.$$

B.2 Technical lemmas

Lemma 4 (Chain rule of KL divergence).
For $N > 0$, let $a_{0:N}$ and $b_{0:N}$ be the joint distributions of two Markov processes. Then,
$$\mathsf{KL}(a_{0:N} \,\|\, b_{0:N}) = \mathsf{KL}(a_0 \,\|\, b_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x \sim a_k}\, \mathsf{KL}\big(a_{k+1|k}(\cdot \mid x) \,\|\, b_{k+1|k}(\cdot \mid x)\big).$$

Proof of Lemma 4. Invoking the definition of the KL divergence, direct calculation yields
$$\begin{aligned}
\mathsf{KL}(a_{0:N} \,\|\, b_{0:N}) &= \mathbb{E}_{x_{0:N} \sim a_{0:N}} \log \frac{a_{0:N}(x_{0:N})}{b_{0:N}(x_{0:N})} = \mathbb{E}_{x_{0:N} \sim a_{0:N}} \log\bigg( \frac{a_0(x_0)}{b_0(x_0)} \prod_{k=0}^{N-1} \frac{a_{k+1|k}(x_{k+1} \mid x_k)}{b_{k+1|k}(x_{k+1} \mid x_k)} \bigg) \\
&= \mathbb{E}_{x_0 \sim a_0} \log \frac{a_0(x_0)}{b_0(x_0)} + \sum_{k=0}^{N-1} \mathbb{E}_{x_k \sim a_k}\, \mathbb{E}_{x_{k+1} \sim a_{k+1|k}(\cdot \mid x_k)} \log \frac{a_{k+1|k}(x_{k+1} \mid x_k)}{b_{k+1|k}(x_{k+1} \mid x_k)} \\
&= \mathsf{KL}(a_0 \,\|\, b_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_k \sim a_k}\, \mathsf{KL}\big(a_{k+1|k}(\cdot \mid x_k) \,\|\, b_{k+1|k}(\cdot \mid x_k)\big).
\end{aligned}$$

Lemma 5 (Derivative of KL: an upper bound). Let $(q_t)$ and $(p_t)$ be the marginals of CTMCs with rate matrices $(Q_t)$ and $(\widehat{Q}_t)$, respectively, and let $\overleftarrow{q}_t \equiv q_{T-t}$ denote the marginals of the reverse process. Then, for any $t > t_k$ and any $z$,
$$\frac{\partial}{\partial t}\, \mathsf{KL}\big(\overleftarrow{q}_{t|t_k}(\cdot \mid z) \,\|\, p_{t|t_k}(\cdot \mid z)\big) \le \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}(\cdot \mid z)} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg].$$

Proof of Lemma 5. Let us omit the conditioning on $z$ for notational brevity. By direct calculation, one can write
$$A := \frac{\partial}{\partial t}\, \mathsf{KL}\big(\overleftarrow{q}_{t|t_k} \,\|\, p_{t|t_k}\big) = \sum_{x \in \mathcal{X}} \Big( \frac{\partial}{\partial t} \overleftarrow{q}_{t|t_k}(x) \Big) \log \frac{\overleftarrow{q}_{t|t_k}(x)}{p_{t|t_k}(x)} - \sum_{x \in \mathcal{X}} \overleftarrow{q}_{t|t_k}(x)\, \frac{\frac{\partial}{\partial t} p_{t|t_k}(x)}{p_{t|t_k}(x)}.$$
Recall the Kolmogorov equations:
$$\frac{\partial}{\partial t} \overleftarrow{q}_{t|t_k}(x) = \sum_{y \in \mathcal{X}} \overleftarrow{Q}_t(y, x)\, \overleftarrow{q}_{t|t_k}(y) \qquad \text{and} \qquad \frac{\partial}{\partial t} p_{t|t_k}(x) = \sum_{y \in \mathcal{X}} \widehat{Q}_t(y, x)\, p_{t|t_k}(y).$$
Putting the above together, we obtain (after relabeling $x$ and $y$)
$$\begin{aligned}
A &= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \in \mathcal{X}} \bigg[ \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(y)}{p_{t|t_k}(y)} - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \bigg\{ \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(y)}{p_{t|t_k}(y)} - \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{q}_{t|t_k}(x)}{p_{t|t_k}(x)} - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] - \widehat{Q}_t(x, x) \bigg\} \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \bigg\{ \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \bigg( \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg) - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] - \widehat{Q}_t(x, x) \bigg\} \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \overleftarrow{Q}_t(x, y) \log \bigg( \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg) - \widehat{Q}_t(x, y)\, \frac{\overleftarrow{q}_{t|t_k}(y)}{\overleftarrow{q}_{t|t_k}(x)} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} + \widehat{Q}_t(x, y) \bigg],
\end{aligned} \tag{62}$$
where we invoke the property that $\overleftarrow{Q}_t(x, x) = -\sum_{y \ne x} \overleftarrow{Q}_t(x, y)$ and $\widehat{Q}_t(x, x) = -\sum_{y \ne x} \widehat{Q}_t(x, y)$.
Then, letting $C_{xy}$ be such that (recall that $z$ is fixed)
$$\frac{\overleftarrow{q}_{t|t_k}(y \mid z)}{\overleftarrow{q}_{t|t_k}(x \mid z)} = \overleftarrow{Q}_t(x, y)\, C_{xy}, \tag{63}$$
it satisfies that
$$A = \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} + \overleftarrow{Q}_t(x, y) \log\Big( \widehat{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \Big) - \widehat{Q}_t(x, y)\, \overleftarrow{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg]. \tag{64}$$
Finally, since $\log z \le z - 1$,
$$\begin{aligned}
A &\le \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} - \overleftarrow{Q}_t(x, y) + \overleftarrow{Q}_t(x, y)\, \widehat{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} - \widehat{Q}_t(x, y)\, \overleftarrow{Q}_t(x, y)\, C_{xy} \cdot \frac{p_{t|t_k}(x)}{p_{t|t_k}(y)} \bigg] \\
&= \mathbb{E}_{x \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x} \bigg[ \widehat{Q}_t(x, y) - \overleftarrow{Q}_t(x, y) + \overleftarrow{Q}_t(x, y) \log \frac{\overleftarrow{Q}_t(x, y)}{\widehat{Q}_t(x, y)} \bigg].
\end{aligned}$$

Lemma 6 (Itô's lemma for Poisson jump processes). For a Poisson jump process $\{x_t\}_{t \ge 0}$ with generator $\{\mathcal{L}_t\}_{t \ge 0}$ and rate matrix $\{R_t\}_{t \ge 0}$, Itô's formula can be written as
$$f(t, x_t) = f(0, x_0) + \int_0^t \big[ \partial_s f(s, x_{s^-}) + (\mathcal{L}_s f)(s, x_{s^-}) \big]\,ds + M_t, \tag{65}$$
where $x_{s^-} = \lim_{u \to s^-} x_u$, which exists for Lebesgue-almost every $s \in [0, t)$. The compensation process $\{M_u\}_{u \in [\ell, t]}$ is defined as
$$M_u = \sum_{y_s :\, y_s \ne x_s} \int_{\ell}^{u} \big( f(s, y_s) - f(s, x_s) \big)\big( dN_s^{x_s, y_s} - \lambda_s^{x_s, y_s}\,ds \big),$$
where $N_s^{x, y}$ is the counting process of jumps from $x$ to $y$ up to time $s$ and $\lambda_s^{x, y}$ is the $\mathcal{F}_{s^-}$-intensity of $N_s^{x, y}$, i.e., $\lambda_s^{x, y} = \mathbb{I}\{x_{s^-} = x\}\, R_s(x, y)$. See Conforti et al. (2025, Appendix A.5) for more details.
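Before moving to the proofs, the closed form in Proposition 6 admits a quick numerical sanity check. The sketch below (our own illustration on a toy state space; the helper names `q_t`, `score_formula` are ours, not from the paper) compares the defining marginal ratio $q_t(y)/q_t(x)$ of the uniform noising process against the expression in Eqn. (59) for a random $q_0$.

```python
import itertools
import math

import numpy as np

# Toy setting: vocabulary size S, dimension d, time t, random data distribution q0.
S, d, t = 3, 2, 0.7
states = list(itertools.product(range(S), repeat=d))
rng = np.random.default_rng(0)
q0 = rng.random(len(states))
q0 /= q0.sum()

# Per-coordinate transition kernel of the uniform noising process at time t:
# stay with prob (1+(S-1)e^{-t})/S, move to each other symbol with prob (1-e^{-t})/S.
e = math.exp(-t)
P = (1 - e) / S * np.ones((S, S)) + e * np.eye(S)

def q_t(x):
    """Marginal q_t(x) = sum_{x0} q0(x0) * prod_i P(x0_i, x_i)."""
    return sum(q0[k] * math.prod(P[x0[i], x[i]] for i in range(d))
               for k, x0 in enumerate(states))

def d_hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

def score_formula(y, x):
    """Eqn. (59): s_t(y,x) = E[alpha_t^{d_H(y,x0)}] / E[alpha_t^{d_H(x,x0)}]."""
    alpha = (1 - e) / (1 + (S - 1) * e)
    num = sum(q0[k] * alpha ** d_hamming(y, x0) for k, x0 in enumerate(states))
    den = sum(q0[k] * alpha ** d_hamming(x, x0) for k, x0 in enumerate(states))
    return num / den

# The score is the marginal ratio q_t(y)/q_t(x); check it against Eqn. (59)
# for every neighboring pair (Hamming distance 1).
for x in states:
    for y in states:
        if d_hamming(x, y) == 1:
            assert abs(q_t(y) / q_t(x) - score_formula(y, x)) < 1e-10
```

The agreement is exact up to floating point because the common prefactor $((1+(S-1)e^{-t})/S)^d$ in $q_{t|0}$ cancels in the ratio, which is precisely the content of the proof above.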
C Proofs of results in Section 3.1

C.1 Proof of Theorem 1

We first decompose the KL divergence between the output distribution $p_T$ and the target distribution $q_0$ as
$$\mathsf{KL}(q_0 \,\|\, p_T) \le \mathsf{KL}\big(q_{T-t_0, \ldots, T-t_N} \,\|\, p_{t_0, \ldots, t_N}\big) = \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big], \tag{66}$$
where the inequality follows from the data-processing inequality for the KL divergence and the equality follows from the chain rule for the KL divergence in Lemma 4. The first term is the initialization error, which can be upper bounded by the following lemma.

Lemma 7. For the uniform noising process and any initial distribution $q_0 \in \mathcal{P}(\mathcal{X})$, one has the same limit distribution $q_t \overset{d}{\to} p_0 = \mathrm{Unif}(\mathcal{X})$ as $t \to \infty$. Further, the modified log-Sobolev constant of $q_t$ satisfies $C_{\mathsf{LSI}} = 2$, which leads to
$$\mathsf{KL}(q_t \,\|\, p_0) \le e^{-t}\, \mathsf{KL}(q_0 \,\|\, p_0) \le e^{-t}\, d \log(S).$$

The proof of Lemma 7 can be found in previous works, e.g., Zhang et al. (2025, Proposition 2). Applying the lemma above together with Lemma 5 and Eqn. (7) to the second term in Eqn. (66), we obtain
$$\begin{aligned}
\mathsf{KL}(q_0 \,\|\, p_T) &\le \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \bigg[ \int_{t_k}^{t_{k+1}} \frac{\partial}{\partial t}\, \mathsf{KL}\big( \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t|t_k}(\cdot \mid x_{t_k}) \big)\,dt \bigg] \\
&\le e^{-T} d \log(S) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg]\,dt \\
&\le e^{-T} d \log(S) + \frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) \bigg]\,dt.
\end{aligned} \tag{67}$$
In the following, we focus on the quantity
$$\mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big).$$
(68)
For simplicity, we write $\ell := t_k$. Direct calculation yields the decomposition
$$\sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) = T_1^{t,\ell} + T_2^{t,\ell} + T_3^{t,\ell},$$
where
$$\begin{aligned}
T_1^{t,\ell} &:= \sum_{y_\ell :\, d_H(y_\ell, x_\ell) = 1} s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big), \\
T_2^{t,\ell} &:= \sum_{i \in [d]} \sum_{c \in [S]} \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell), \\
T_3^{t,\ell} &:= \sum_{y_t :\, d_H(y_t, x_t) = 1} h\big( s_{T-t}(y_t, x_t) \big) - \sum_{y_\ell :\, d_H(y_\ell, x_\ell) = 1} h\big( s_{T-\ell}(y_\ell, x_\ell) \big),
\end{aligned}$$
and $h(x) = x \log x - x + 1$. We proceed by bounding each term separately.

[Footnote 3: $C_{\mathsf{LSI}}$ is defined as the smallest number such that for any $q \in \mathcal{P}(\mathcal{X})$, $\mathsf{KL}(q \,\|\, \mathrm{Unif}(\mathcal{X})) \le (C_{\mathsf{LSI}}/2) \cdot \mathcal{E}(q, \log q)$, where $\mathcal{E}$ is the Dirichlet form associated with the uniform noising process, i.e., $\mathcal{E}(f, g) = -(2|\mathcal{X}|)^{-1} \sum_{x, y \in \mathcal{X}} (f(x) - f(y))(g(x) - g(y))\, Q(x, y)$.]

• For term $T_1^{t,\ell}$, notice that $T_1^{t,\ell}$ is independent of $t$. In view of the definition of the score entropy loss, we have
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_1^{t,\ell} \big] &= \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] \qquad (69) \\
&= S \cdot \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} Q_{T-\ell}(y_\ell, x_\ell)\, s_{T-\ell}(y_\ell, x_\ell)\, D\big( \widehat{s}_{T-\ell}(y_\ell, x_\ell),\, s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] = S \cdot \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}), \qquad (70)
\end{aligned}$$
where we use the fact that $Q_{T-t}(y, x) = S^{-1}$ for any $d_H(y, x) = 1$.

• For term $T_2^{t,\ell}$, we establish the following lemma, whose proof is provided in Section E.2.

Lemma 8. Consider the uniform noising process and let $0 \le \ell < t < T$.
Then, for any $c \in [S]$, $i \in [d]$ and $x_\ell \in \mathcal{X}$, it obeys
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \Big] = 0.$$

With Lemma 8, it is easily seen that
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_2^{t,\ell} \big] &= \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \bigg] \\
&= \sum_{i \in [d]} \sum_{c \in [S]} \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-\ell}(x_\ell \oplus_i c, x_\ell) - s_{T-t}(x_t \oplus_i c, x_t) \big) \log \widehat{s}_{T-\ell}(x_\ell \oplus_i c, x_\ell) \Big] \bigg] = 0.
\end{aligned} \tag{71}$$

• For term $T_3^{t,\ell}$, we make the crucial observation that $\mathbb{E}_{x_t \sim \overleftarrow{q}_t}\big[ \sum_{y_t :\, d_H(y_t, x_t)=1} h(s_{T-t}(y_t, x_t)) \big]$ admits a simple representation, formalized in the following lemma.

Lemma 9. For any $t \in [0, T]$, we have
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} h\big( s_{T-t}(y_t, x_t) \big) \bigg] = \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_{T-t}(y_t, x_t) \bigg].$$

In view of this lemma, we can further express the term $T_3^{t,\ell}$ as
$$\begin{aligned}
\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}}\big[ T_3^{t,\ell} \big] &= \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} h\big( s_{T-t}(y_t, x_t) \big) \bigg] - \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} h\big( s_{T-\ell}(y_\ell, x_\ell) \big) \bigg] \qquad (72) \\
&= \mathbb{E}_{x_t \sim \overleftarrow{q}_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_{T-t}(y_t, x_t) \bigg] - \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \bigg[ \sum_{y_\ell :\, d_H(y_\ell, x_\ell)=1} -\log s_{T-\ell}(y_\ell, x_\ell) \bigg]. \qquad (73)
\end{aligned}$$
Plugging Eqns. (70), (71), and (73) into Eqn.
(68), we end up with
$$\mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) = S \cdot \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}) + S \big( \varphi(T-t) - \varphi(T-\ell) \big),$$
where we define $\varphi(t)$ as
$$\varphi(t) := \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_t(y_t, x_t) \bigg]. \tag{74}$$
Returning to Eqn. (67), we conclude that
$$\begin{aligned}
\mathsf{KL}(q_0 \,\|\, p_T) &\le e^{-T} d \log(S) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \Big[ \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k}) + \big( \varphi(T-t) - \varphi(T-t_k) \big) \Big]\,dt \\
&\le \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt.
\end{aligned} \tag{75}$$
To establish Theorem 1, it remains to control the last term in Eqn. (75). First, by Jensen's inequality, $\varphi(t)$ is lower bounded by
$$\varphi(t) \ge -\frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \log\big( \mathbb{E}_{x_t \sim q_t}[ s_t(x_t \oplus_i c, x_t) ] \big) = 0. \tag{76}$$
For the upper bound, from the definition of $\varphi(t)$, it satisfies that
$$\varphi(t) \le \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \Big[ \big| \{ y_t : d_H(y_t, x_t) = 1 \} \big| \cdot \sup_{x, y :\, d_H(x, y)=1} \big| \log s_t(y, x) \big| \Big], \tag{77}$$
where $|\{ y_t : d_H(y_t, x_t) = 1 \}|$ denotes the cardinality of the set $\{ y_t : d_H(y_t, x_t) = 1 \}$, which equals $d(S-1)$ for any $x_t \in \mathcal{X}$. It therefore suffices to control the quantity $|\log s_t(y_t, x_t)|$, which is achieved through the following lemma.

Lemma 10. For any distribution $q_0$ on $\mathcal{X}$, let $q_t$ be the marginal distribution of the uniform noising process at time $t$. Then, for any $x, y \in \mathcal{X}$ such that $d_H(x, y) = 1$, it holds that
$$|\log s_t(y, x)| \lesssim \log(S) + \max\{ \log(t^{-1}), 0 \}.$$
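The two structural facts about $\varphi$ used in this argument — the lower bound $\varphi(t) \ge 0$ from Eqn. (76) and the monotonicity invoked just below via Lemma 12 — can be checked numerically on a toy instance where $q_t$ is computable exactly. The sketch below is our own illustration (the helpers `marginal`, `phi` are not from the paper); it evaluates $\varphi(t)$ from its definition in Eqn. (74) for a small uniform-noising chain.

```python
import itertools
import math

import numpy as np

S, d = 3, 2
states = list(itertools.product(range(S), repeat=d))
index = {x: k for k, x in enumerate(states)}
rng = np.random.default_rng(1)
q0 = rng.random(len(states))
q0 /= q0.sum()

def marginal(t):
    """Exact marginal q_t of the uniform noising process started from q0."""
    e = math.exp(-t)
    P = (1 - e) / S * np.ones((S, S)) + e * np.eye(S)  # per-coordinate kernel
    qt = np.zeros(len(states))
    for k, x0 in enumerate(states):
        for m, x in enumerate(states):
            qt[m] += q0[k] * math.prod(P[x0[i], x[i]] for i in range(d))
    return qt

def neighbors(x):
    """All y with Hamming distance 1 from x."""
    for i in range(d):
        for c in range(S):
            if c != x[i]:
                yield x[:i] + (c,) + x[i + 1:]

def phi(t):
    """phi(t) = (1/S) E_{x~q_t} sum_{y: d_H(y,x)=1} -log s_t(y,x), cf. Eqn. (74)."""
    qt = marginal(t)
    total = 0.0
    for m, x in enumerate(states):
        for y in neighbors(x):
            total += qt[m] * (math.log(qt[m]) - math.log(qt[index[y]]))
    return total / S

values = [phi(t) for t in (0.2, 0.5, 1.0, 2.0)]
assert all(v >= -1e-9 for v in values)                          # Eqn. (76): phi >= 0
assert all(a + 1e-9 >= b for a, b in zip(values, values[1:]))   # non-increasing in t
```

Both assertions pass for generic random $q_0$, in line with Eqn. (76) and the monotonicity statement of Lemma 12.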
As a consequence of Lemma 10, we arrive at
$$\varphi(t) \le \frac{d(S-1)}{S} \cdot \sup_{x, y :\, d_H(x, y)=1} |\log s_t(y, x)| \lesssim d\big( \log(S) + \max\{ \log(t^{-1}), 0 \} \big). \tag{78}$$
In addition, we make the observation in Lemma 12 that $\varphi(t)$ is a non-increasing function of $t$.

Now we are ready to combine everything and bound the last term of Eqn. (75). Define $\Delta = \max_k \{ t_{k+1} - t_k \}$, and choose $1 \le M \le N-1$ such that $T - t_M \in [\Delta, 2\Delta]$. Armed with Eqns. (76), (78) and the monotonicity of $\varphi(t)$, we obtain
$$\begin{aligned}
\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt &\le \int_{0}^{T-t_M} \varphi(t)\,dt + \sum_{k=0}^{M-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big)\,dt \\
&\le 2\Delta\, d\big( \log(S) + \log(2/\Delta) + 1 \big) + \Delta \sum_{k=0}^{N-2} \big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big) \\
&\le 2\Delta\, d\big( \log(S) + \log(2/\Delta) + 1 \big) + \Delta\, \varphi(\Delta) \lesssim \Delta\, d \log(S/\Delta).
\end{aligned}$$
Combining the inequality above with Eqn. (75) achieves
$$\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \Delta\, d \log(S/\Delta),$$
which completes the proof of Theorem 1.

C.2 Proof of Corollary 1

Choose time horizon $T = \log(d \log(S)/\varepsilon)$ and number of discretization steps
$$N = \Theta\bigg( \frac{d \log(S) \log^3(d \log(S)/\varepsilon)}{\varepsilon} \bigg) = \widetilde{\Theta}\Big( \frac{d}{\varepsilon} \Big).$$
Adopting the upper bound in Theorem 1 leads to
$$\mathsf{KL}(q_{\mathsf{data}} \,\|\, p_{\mathsf{output}}) = \mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T} d \log(S) + \frac{T d}{N} \log\Big( \frac{S N}{T} \Big) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon + \frac{\varepsilon}{\log(S)\, T^2}\big( \log(S) + 3T \big) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon.$$

C.3 Proof of Theorem 2

Recall that the path measures of the backward process and the sampling process are denoted by $\mathcal{Q} \overset{d}{=} \{ \overleftarrow{q}_t \}_{t \in [0, T-\delta]}$ and $\mathcal{P} \overset{d}{=} \{ p_t \}_{t \in [0, T-\delta]}$, respectively. It can be checked that the path measure $\mathcal{Q}$ is absolutely continuous with respect to $\mathcal{P}$. By Girsanov's theorem for the backward process (e.g., Ren et al.
(2025, Corollary 3.4)), it satisfies
$$\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) = \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \frac{1}{S} \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \bigg[ \sum_{i \in [d]} \sum_{c \in [S]} s_{T-t}(x_t \oplus_i c, x_t)\, D\big( \widehat{s}_{T-t_k}(x_{t_k} \oplus_i c, x_{t_k}),\, s_{T-t}(x_t \oplus_i c, x_t) \big) \bigg]\,dt.$$
Following the same analysis as for Eqn. (68) in Appendix C.1, we arrive at
$$\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) = \varepsilon_{\mathsf{score}} + \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = \varepsilon_{\mathsf{score}} + \mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt,$$
where the function $\varphi(\cdot)$ is defined as in Eqn. (74):
$$\varphi(t) := \frac{1}{S}\, \mathbb{E}_{x_t \sim q_t} \bigg[ \sum_{y_t :\, d_H(y_t, x_t)=1} -\log s_t(y_t, x_t) \bigg].$$
Thus, to achieve $\mathsf{KL}(\mathcal{Q} \,\|\, \mathcal{P}) \le \varepsilon_{\mathsf{score}} + O(1)$, we need to select $N$, $T$ and the step size schedule such that
$$\mathsf{KL}(q_T \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = O(1). \tag{79}$$

To understand the first term in Eqn. (79), let us first consider the case $T = 1$. By the assumption $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$, we are ensured that
$$\mathsf{KL}(q_1 \,\|\, p_0) = \sum_{x \in \mathcal{X}} q_1(x) \log q_1(x) - \sum_{x \in \mathcal{X}} q_1(x) \log p_0(x) \ge \gamma\, d \log(S) \gg 1.$$
Hence, $T$ must satisfy $T > 1$.

We then focus on the second term in Eqn. (79). We aim to show that the rate of change, i.e., $-\varphi'(t)$, is lower bounded as we come close to the target data distribution (i.e., $t \in [0, 1]$), which in turn leads to a lower bound on the difference $\varphi(T-t) - \varphi(T-t_k)$. We proceed with our analysis under an information-theoretic framework. For notational convenience, for every $i \in [d]$ and $c \in [S]$, define
$$\varphi_{i,c}(t) = \mathbb{E}_{x_t \sim q_t}\big[ -\log s_t(x_t \oplus_i c, x_t) \big] = \mathbb{E}_{x_t \sim q_t}\big[ -\log s_t(N^{i,c}(x_t), x_t) \big], \tag{80}$$
where the operator $N^{i,c} : \mathcal{X} \to \mathcal{X}$ is defined as $N^{i,c}(x) = x \oplus_i c$. It is easy to check that
$$\varphi(t) = \frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi_{i,c}(t).$$
Notice that $N^{i,c}$ is a bijection on $\mathcal{X}$. We define $N^{i,-c} := (N^{i,c})^{-1} = N^{i, S-c}$, where $(N^{i,c})^{-1}$ denotes the inverse of $N^{i,c}$. Since $\varphi(t)$ can be written as a linear combination of the $\varphi_{i,c}(t)$, it suffices to study the properties of the individual $\varphi_{i,c}(t)$ to characterize $\varphi(t)$. To begin with, the following lemma characterizes $\varphi(t)$ and $\varphi_{i,c}(t)$ as information-theoretic quantities.

Lemma 11. For $\varphi(t)$ and $\varphi_{i,c}(t)$ defined in Eqns. (74) and (80), we have
$$\varphi_{i,c}(t) = \mathsf{KL}\big( q_t \,\|\, (N^{i,-c})_\# q_t \big); \qquad \varphi(t) = -\frac{\partial}{\partial t}\, \mathsf{KL}(q_t \,\|\, p_0) = \sum_{i \in [d]} \sum_{c \in [S]} \mathsf{KL}\big( q_t \,\|\, (N^{i,c})_\# q_t \big),$$
where $(N^{i,c})_\#$ denotes the pushforward of $q_t$ under the operator $N^{i,c}$. Here, the pushforward measure satisfies, for any $x \in \mathcal{X}$,
$$(N^{i,-c})_\# q_t(x) = q_t\big( N^{i,c}(x) \big). \tag{81}$$

Lemma 11 allows us to write $\varphi_{i,c}(t)$ as the KL divergence between the marginal of the forward process and its pushforward under $N^{i,c}$. Viewing $N^{i,c}$ as an information channel, one can show that it belongs to a special family of channels, the $S$-ary symmetric channels (Makur and Polyanskiy, 2018), which satisfy a strong data processing inequality. Through this idea, we can prove the following lemma; the details are provided in Section E.6.

Lemma 12. For $t \in (0, T]$, $\varphi_{i,c}(t)$ is differentiable in $t$ and it holds that $-\varphi'_{i,c}(t) \ge \varphi_{i,c}(t)$.

Consequently, Lemma 12 leads to
$$-\varphi'(t) = -\frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi'_{i,c}(t) \ge \frac{1}{S} \sum_{i \in [d]} \sum_{c \in [S]} \varphi_{i,c}(t) = \varphi(t).$$
Recalling the log-Sobolev inequality in Lemma 7, we have, for any target distribution $q_0 \in \mathcal{P}_\gamma(\mathcal{X})$ and $t \in (0, 1)$,
$$-\varphi'(t) \ge \varphi(t) \ge \mathsf{KL}(q_t \,\|\, p_0) \ge \mathsf{KL}(q_1 \,\|\, p_0) \ge \gamma\, d \log(S).$$
Equipped with the above relation, we are ready to control the second term in Eqn. (79). By the fundamental theorem of calculus, we obtain
$$\int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt = \int_{t_k}^{t_{k+1}} \int_{T-t}^{T-t_k} -\varphi'(\tau)\,d\tau\,dt \gtrsim \int_{t_k}^{t_{k+1}} (t - t_k)\, \gamma\, d \log(S)\,dt = \frac{1}{2} (t_{k+1} - t_k)^2\, \gamma\, d \log(S).$$
Choose $M$ such that $T - t_M \in [\tfrac{1}{2}, 1]$; such an $M$ exists since $T > 1$ and $\max_k \{ t_{k+1} - t_k \} \le \tfrac{1}{2}$. It holds in this case that
$$O(1) = \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt \gtrsim \sum_{k=M}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt \gtrsim \sum_{k=M}^{N-1} (t_{k+1} - t_k)^2\, \gamma\, d \log(S). \tag{82}$$
By the Cauchy–Schwarz inequality, it is direct to show that
$$\sum_{k=M}^{N-1} (t_{k+1} - t_k)^2 \ge \frac{1}{N - M} \bigg( \sum_{k=M}^{N-1} (t_{k+1} - t_k) \bigg)^2 = \frac{(T - t_M - \delta)^2}{N - M} \gtrsim \frac{1}{N - M}, \tag{83}$$
where the last inequality uses the fact that $T - t_M \ge \tfrac{1}{2}$ and $\delta \ll 1$. Plugging Eqn. (83) into Eqn. (82) leads to
$$N \ge N - M = \Omega\big( \gamma\, d \log(S) \big) = \widetilde{\Omega}(d).$$

C.4 Efficient sampling for high-entropy distributions

In the discussion following Theorem 2, we pointed out that the τ-leaping algorithm can attain iteration complexity sublinear in $d$ for the uniform noising process when the target distribution is close to the uniform distribution on $\mathcal{X}$. We now state this result formally.

Theorem 4. Let $q_0 \in \mathcal{P}(\mathcal{X})$ denote the data distribution. Choose time points $0 = t_0 < t_1 < \ldots < t_N = T - \delta$ with an exponential-then-constant step size schedule, i.e., $t_{k+1} - t_k \le \kappa \min(1, T - t_{k+1})$ for $k = 0, \ldots, N-2$. Suppose $0 < \kappa < 0.9$. Then,
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \big( e^{-T} + \kappa \log(\delta^{-1}) \big) \cdot \mathsf{KL}\big( q_0 \,\|\, \mathrm{Unif}(\mathcal{X}) \big).$$

Theorem 4 reveals that, with an exponential-then-constant schedule and early stopping time $\delta$, the error upper bound depends only on the initial KL divergence $\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))$, which can potentially be small if $q_0$ is close to the forward limit distribution $\mathrm{Unif}(\mathcal{X})$.
To be more concrete, we can choose $T = \log\big( \mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))/\varepsilon \big)$, $\delta^{-1} = \mathrm{poly}(d)$ and $\kappa = e^{-T}/\log(d)$ to achieve
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon,$$
with iteration complexity
$$N = \widetilde{\Theta}\bigg( \frac{\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X}))}{\varepsilon} \bigg).$$
In particular, this bound is sublinear in $d$ when $\mathsf{KL}(q_0 \,\|\, \mathrm{Unif}(\mathcal{X})) = o(d)$.

Proof of Theorem 4. The proof proceeds along the same lines as the proof of Theorem 1. Write $p_0 = \mathrm{Unif}(\mathcal{X})$ for the initial distribution of the sampling process. Following the proof of Eqn. (75), we bound
$$\mathsf{KL}(q_{T-\delta} \,\|\, p_{T-\delta}) \le \varepsilon_{\mathsf{score}} + e^{-T}\, \mathsf{KL}(q_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt, \tag{84}$$
where, as shown in Lemma 11, $\varphi(t) = -\partial_t \mathsf{KL}(q_t \,\|\, p_0) \ge 0$. By Lemma 12, $\varphi(t)$ is a non-increasing function of $t \in (0, T]$, which leads to
$$\varphi(t) \le \frac{1}{t} \int_0^t \varphi(s)\,ds = \frac{1}{t} \int_0^t -\partial_s \mathsf{KL}(q_s \,\|\, p_0)\,ds \le \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{t}. \tag{85}$$
Without loss of generality, choose $M$ with $1 \le M \le N-1$ such that $T - t_M = 1$; for $1 \le k < M$, we have $t_{k+1} - t_k = t_k - t_{k-1} = \kappa$. With Eqn. (85), it can be seen that
$$\begin{aligned}
\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \big( \varphi(T-t) - \varphi(T-t_k) \big)\,dt &\le \sum_{k=0}^{N-1} (t_{k+1} - t_k)\big( \varphi(T-t_{k+1}) - \varphi(T-t_k) \big) \\
&= (t_N - t_{N-1})\, \varphi(T-t_N) + \sum_{k=1}^{N-1} \big( (t_k - t_{k-1}) - (t_{k+1} - t_k) \big)\, \varphi(T-t_k) - (t_1 - t_0)\, \varphi(T) \\
&\overset{(a)}{\le} \frac{\kappa \delta}{1 - \kappa} \cdot \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{\delta} + \sum_{k=M}^{N-1} \frac{\kappa^2}{1 - \kappa}\, (T - t_k) \cdot \frac{\mathsf{KL}(q_0 \,\|\, p_0)}{T - t_k} \\
&\lesssim (N - M)\, \kappa^2\, \mathsf{KL}(q_0 \,\|\, p_0) = \log_{(1-\kappa)}(\delta)\, \kappa^2\, \mathsf{KL}(q_0 \,\|\, p_0) \overset{(b)}{\le} \kappa \log(\delta^{-1})\, \mathsf{KL}(q_0 \,\|\, p_0),
\end{aligned}$$
where we apply Eqn. (85) in (a) and $\log(1 - \kappa) \le -\kappa$ in (b). Putting the above bound and Eqn. (84) together proves the desired result.

D Proofs of results in Section 3.2

D.1 Proof of Theorem 3

For each $t_k \in \{t_0, \ldots, t_N\}$, let $p_{t_k}$ denote the marginal distribution of $x_{t_k}$ in Algorithm 1.
Using the data-processing inequality $\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \le \mathsf{KL}(\overleftarrow{q}_{t_0, \ldots, t_N} \,\|\, p_{t_0, \ldots, t_N})$ and Lemma 4, we decompose the KL divergence between the target distribution $q_0 \equiv \overleftarrow{q}_T$ and the output distribution $p_T$ as follows:
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \le \mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big]. \tag{86}$$
The first term, the initialization error, was bounded in Conforti et al. (2025); Liang et al. (2025b) as follows:
$$\mathsf{KL}(\overleftarrow{q}_0 \,\|\, p_0) \le e^{-T}\, d\, (1 + \log S + T) \lesssim e^{-T}\, d \log S. \tag{87}$$
Next, we move on to controlling the second term. The following lemma states that for each $k$, conditioned on $x_{t_k}$, one can consider a CTMC on the interval $[t_k, t_{k+1}]$ whose marginal at time $t_{k+1}$ is $p_{t_{k+1}|t_k}(\cdot \mid x_{t_k})$. The proof is given in Section F.3.

Lemma 13. Fix $k = 0, \ldots, N-1$. Let $x_{t_k}$ and $x_{t_{k+1}}$ be as in Algorithm 1. Let $(y_t)_{t \in [t_k, t_{k+1}]}$ be a CTMC with $y_{t_k} = x_{t_k}$ and the following rate matrix:
$$\widehat{Q}_t(a, b) := \begin{cases} \widehat{s}_{T-t_k}(y_{t_k} \odot_i b^i,\, y_{t_k})\, \dfrac{e^{T-t_k} - 1}{e^{T-t} - 1}\, \mathbb{I}\{a^i = \texttt{MASK}\}, & \text{if } d_H(a, b) = 1,\ a^i \ne b^i, \text{ and } y_{t_k}^i = \texttt{MASK}, \\[4pt] -\sum_{c \ne a} \widehat{Q}_t(a, c), & \text{if } a = b, \\[4pt] 0, & \text{otherwise.} \end{cases} \tag{88}$$
Then, $x_{t_{k+1}}$ has the same distribution as $y_{t_{k+1}}$.

Armed with this result, we rewrite the right-hand side of Eqn. (86) with the marginals $p_{t|t_k}(\cdot \mid x_{t_k})$ of this CTMC:
$$\begin{aligned}
\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) &\lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \Big[ \mathsf{KL}\big( \overleftarrow{q}_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t_{k+1}|t_k}(\cdot \mid x_{t_k}) \big) \Big] \qquad (89) \\
&= e^{-T}\, d \log S + \sum_{k=0}^{N-1} \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \bigg[ \int_{t_k}^{t_{k+1}} \frac{\partial}{\partial t}\, \mathsf{KL}\big( \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k}) \,\|\, p_{t|t_k}(\cdot \mid x_{t_k}) \big)\,dt \bigg]. \qquad (90)
\end{aligned}$$
To further control the second term above, we apply Lemma 5 with the rate matrices specified in Lemma 13.
We can write
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg]\,dt. \tag{91}$$
Fix $k \in \{0, \ldots, N-1\}$ and $t \in [t_k, t_{k+1})$, and let $\ell := t_k$. Invoking Eqn. (88) further leads to
$$\begin{aligned}
&\sum_{y \ne x_t} \bigg[ \widehat{Q}_t(x_t, y) - \overleftarrow{Q}_t(x_t, y) + \overleftarrow{Q}_t(x_t, y) \log \frac{\overleftarrow{Q}_t(x_t, y)}{\widehat{Q}_t(x_t, y)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} \bigg[ \widehat{Q}_t(x_t, x_t \odot_i c) - \overleftarrow{Q}_t(x_t, x_t \odot_i c) + \overleftarrow{Q}_t(x_t, x_t \odot_i c) \log \frac{\overleftarrow{Q}_t(x_t, x_t \odot_i c)}{\widehat{Q}_t(x_t, x_t \odot_i c)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} \bigg[ \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) + s_{T-t}(x_t \odot_i c, x_t) \log \frac{s_{T-t}(x_t \odot_i c, x_t)}{\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell)} \bigg] \\
&\qquad = \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\bigg( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\ s_{T-t}(x_t \odot_i c, x_t) \bigg).
\end{aligned} \tag{92}$$
To proceed, we make the observation that the Bregman divergence satisfies the following law of cosines:
$$D(\alpha, \gamma) = D(\alpha, \beta) + D(\beta, \gamma) + (\alpha - \beta)\, \frac{\beta - \gamma}{\beta \gamma}.$$
We apply this decomposition to each term of Eqn. (92) with
$$\alpha = \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell), \qquad \beta = \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell), \qquad \gamma = s_{T-t}(x_t \odot_i c, x_t).$$
In the following, we slightly abuse notation and write $x_t := (x_t \odot_i c, x_t)$ and $x_\ell := (x_\ell \odot_i c, x_\ell)$ for the argument pairs whenever $i \in m(x_t)$ and $c \in [S]$ are fixed. For fixed $i, c$, each term in Eqn. (92) can be decomposed as
$$s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) = s_{T-t}(x_t)\, D\big( \widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell) \big) + s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) + \frac{\widehat{s}_{T-\ell}(x_\ell) - s_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} \Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \Big).$$
Note that we simplified the first term using the scaling invariance $D(\alpha x, \alpha y) = D(x, y)$. Observing that $\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell} \equiv s_{T-t}$ by Eqn. (60), this can be rearranged as follows:
$$s_{T-t}(x_t)\, D\Big( \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, \widehat{s}_{T-\ell}(x_\ell),\ s_{T-t}(x_t) \Big) = \underbrace{\frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell) \big)}_{=:\, T_1} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big) \log \frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{=:\, T_2} + \underbrace{s_{T-t}(x_t)\, D\big( s_{T-t}(x_\ell),\, s_{T-t}(x_t) \big)}_{=:\, T_3}.$$
Taking the above collectively with Eqns. (92) and (91) leads to
$$\mathsf{KL}(\overleftarrow{q}_T \,\|\, p_T) \lesssim e^{-T}\, d \log S + \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} \big( T_1 + T_2 + T_3 \big)\,dt.$$
Now, it suffices to control each term on the right, respectively.

• After taking a summation over $i \in m(x_t)$ and $c \in [S]$, we connect the first term, $T_1$, to the score entropy loss. To see this, direct calculation shows
$$\begin{aligned}
&\mathbb{E}_{x_t, x_\ell \sim \overleftarrow{q}_{t, \ell}} \sum_{i \in m(x_t)} \sum_{c \in [S]} \frac{e^{T-\ell} - 1}{e^{T-t} - 1}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\, s_{T-\ell}(x_\ell \odot_i c, x_\ell) \big) \\
&\qquad = \mathbb{E}_{x_\ell \sim \overleftarrow{q}_\ell} \sum_{i \in m(x_\ell)} \sum_{c \in [S]} e^{t-\ell}\, s_{T-\ell}(x_\ell \odot_i c, x_\ell)\, D\big( \widehat{s}_{T-\ell}(x_\ell \odot_i c, x_\ell),\, s_{T-\ell}(x_\ell \odot_i c, x_\ell) \big) = e^{t-\ell}\, \mathcal{L}_{\mathsf{SE}}(T-\ell, \widehat{s}_{T-\ell}, s_{T-\ell}),
\end{aligned}$$
where in the second line we used $\Pr\big( x_t^i = \texttt{MASK} \mid x_\ell^i = \texttt{MASK} \big) = \frac{1 - e^{-(T-t)}}{1 - e^{-(T-\ell)}}$. Therefore, recalling that $\ell := t_k$,
$$\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} e^{t - t_k}\, \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k})\,dt = \sum_{k=0}^{N-1} \big( e^{t_{k+1} - t_k} - 1 \big)\, \mathcal{L}_{\mathsf{SE}}(T-t_k, \widehat{s}_{T-t_k}, s_{T-t_k}) \lesssim \varepsilon_{\mathsf{score}}, \tag{93}$$
where we used $\Delta = O(1)$ and Assumption 1 in the last inequality.

• To control the second term, $T_2$, the next lemma describes a martingale property of the score function.
The proof is given in Section E.7.

Lemma 14. Consider the masking noising process and let $0 \le \ell < t < T$. Then, for any $c \in \mathcal{V}$ and $i \in m(x_\ell)$,
$$\mathbb{E}_{x_t \sim \overleftarrow{q}_{t|\ell}(\cdot \mid x_\ell)} \Big[ \big( s_{T-t}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) \big)\, \mathbb{I}\{ i \in m(x_t) \} \Big] = 0.$$
In view of Lemma 14, we conclude that the second term, $T_2$, contributes zero after conditioning on $x_{t_k}$:
$$\sum_{i \in [d]} \sum_{c \in [S]} \mathbb{E}_{x_t \sim \overleftarrow{q}_{t|t_k}(\cdot \mid x_{t_k})} \Big[ \mathbb{I}\{ i \in m(x_t) \}\, \big( s_{T-t}(x_{t_k} \odot_i c, x_{t_k}) - s_{T-t}(x_t \odot_i c, x_t) \big) \Big] = 0.$$

• Lastly, we move on to controlling the last term, $T_3$. Toward this goal, we introduce the following lemma, whose proof is provided in Section F.4.

Lemma 15. Let $0 \le \ell < t \le T$. Then, for $I(t)$ defined in Eqn. (16),
$$\mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\big( s_{T-t}(x_\ell \odot_i c, x_\ell),\, s_{T-t}(x_t \odot_i c, x_t) \big) = \int_\ell^t e^{t-v}\, I(T-v)\,dv. \tag{94}$$

After the summation over $i \in m(x_t)$ and $c \in [S]$, we express the contribution of the term $T_3$ using Lemma 15 as follows:
$$\sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \mathbb{E}_{x_{t_k}, x_t \sim \overleftarrow{q}_{t_k, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t)\, D\big( s_{T-t}(x_{t_k} \odot_i c, x_{t_k}),\, s_{T-t}(x_t \odot_i c, x_t) \big)\,dt = \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} \int_{t_k}^{t} e^{t-v}\, I(T-v)\,dv\,dt \le \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt, \tag{95}$$
where we used $\Delta = O(1)$ and the non-negativity of conditional mutual information in the last inequality.

Collecting Eqns. (87), (93), and (95) proves
$$\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + e^{-T}\, d \log S + \sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt.$$

D.2 Proof of Corollary 2

We upper bound the last term of Eqn. (17) under the uniform and the exponential-then-constant step size schedules.
First, under the constant step size schedule, the quantity of interest satisfies
$$\sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt = \frac{T}{N} \int_0^T I(t)\,dt \le \frac{T}{N} \int_0^\infty I(t)\,dt = \frac{T}{N}\, B(q_0),$$
where the last step follows from Lemma 16. Therefore, as long as
$$N \ge \frac{T\, B(q_0)}{\varepsilon} = \widetilde{O}\Big( \frac{B(q_0)}{\varepsilon} \Big),$$
Eqn. (17) leads to $\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Next, under the exponential-then-constant step size schedule, we bound the last term of Eqn. (17) as follows:
$$\begin{aligned}
\sum_{k=0}^{N-1} h_k \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt &= \frac{\varepsilon}{d \log(S)} \int_0^{\varepsilon/(d \log S)} I(t)\,dt + \sum_{k=0}^{N-2} \int_{T-t_{k+1}}^{T-t_k} (t_{k+1} - t_k)\, I(t)\,dt \\
&\le \frac{\varepsilon}{2 \log(S)} + \kappa \sum_{k=0}^{N-2} \int_{T-t_{k+1}}^{T-t_k} \min(1, T - t_{k+1})\, I(t)\,dt \le \varepsilon + \kappa \int_0^T \min(1, t)\, I(t)\,dt \le \varepsilon + \kappa\, D(q_0).
\end{aligned}$$
For $N > 0$, such a step size schedule is possible with $\kappa = O\big( \frac{T + \log(\varepsilon^{-1} d \log(S))}{N} \big)$. Thus, choosing
$$N \ge \frac{\big( T + \log(\varepsilon^{-1} d \log(S)) \big)\, D(q_0)}{\varepsilon} = \widetilde{O}\Big( \frac{D(q_0)}{\varepsilon} \Big)$$
gives $\mathsf{KL}(q_0 \,\|\, p_T) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

D.3 τ-leaping for masking discrete diffusion

In this section, we prove the analogue of Theorem 3 for the truncated τ-leaping algorithm. Note that since applying multiple jumps to a single coordinate is ill-defined for the masking noising process (where should we transition if the τ-leaping algorithm requires two transitions $\texttt{MASK} \to 1$ at some coordinate?), we analyze the truncated version (Eqn. (9)) instead of the classical τ-leaping algorithm.

Theorem 5. Let $q_{\mathsf{data}} = q_0$ be the target distribution on $[S]^d$. Let $0 < \delta < T$ and $0 = t_0 < t_1 < \ldots < t_N = T - \delta$, such that $h_k := t_k - t_{k-1} \le \kappa \min(1, T - t_k)$ for $k \in [N]$, and $\kappa = O(1)$. Let
$$p_0 := \bigg( \big( 1 - e^{-T} \big)\, \delta_{\texttt{MASK}} + \frac{e^{-T}}{S} \sum_{k=1}^{S} \delta_k \bigg)^{\otimes d}.$$
Under Assumption 1, the truncated τ-leaping algorithm of Eqn.
(9) initialized at $p_0$ produces a sample from $p_{\mathsf{output}} := p_{T-\delta}$ such that
$$\mathsf{KL}(q_\delta \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + e^{-T}\, d \log(S) + \sum_{k=0}^{N-1} h_{k+1} \int_{T-t_{k+1}}^{T-t_k} I(t)\,dt + \kappa^3 N d + \kappa C, \tag{96}$$
where
$$C := \sum_{k=0}^{N-1} (t_{k+1} - t_k)\, \mathbb{E}_{x_{t_k} \sim \overleftarrow{q}_{t_k}} \sum_{i \in m(x_{t_k})} \sum_{c \in [S]} s_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})\, \log \frac{\widehat{s}_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})}{s_{T-t_k}(x_{t_k} \odot_i c, x_{t_k})}.$$
Consequently, with the exponential-then-constant step size schedule where $\kappa = O\big( \frac{T + \log(\delta^{-1})}{N} \big)$, for any $\varepsilon > 0$, for $T = O(\log(\varepsilon^{-1} d \log S))$ and
$$N = \widetilde{O}\bigg( \frac{D(q_{\mathsf{data}}) + C}{\varepsilon} + \sqrt{\frac{d}{\varepsilon}} \bigg), \tag{97}$$
it satisfies $\mathsf{KL}(q_\delta \,\|\, p_{\mathsf{output}}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Note that the guarantee in Eqn. (96) closely parallels Eqn. (17) in Theorem 3 for Algorithm 1. In particular, two additional terms arise in the analysis of the truncated τ-leaping algorithm. We expect the constant $C$ to be small, and remark that it also arises in the analysis of Conforti et al. (2025), as $C_{M_2}$ in their Theorem 3.2.1, in the form of a maximum rather than an average. Under a (one-sided) boundedness assumption $\widehat{s}_{T-t_k} \ge M^{-1}$, the constant $C$ can be upper bounded via the Cauchy–Schwarz inequality, as the next corollary shows.

Corollary 3. Consider the setting of Theorem 5. If, additionally, there exists $M > 0$ such that for all $k \in \{0, \ldots, N-1\}$, $x \in ([S] \cup \{\texttt{MASK}\})^d$, $i \in m(x)$, and $c \in [S]$ it holds that $\log \widehat{s}_{T-t_k}(x \odot_i c, x) \ge -\log M$, then
$$N = \widetilde{O}\bigg( \frac{D(q_{\mathsf{data}}) + \sqrt{\varepsilon_{\mathsf{score}}\, d \log M}}{\varepsilon} + \sqrt{\frac{d}{\varepsilon}} \bigg)$$
is sufficient to ensure $\mathsf{KL}(q_\delta \,\|\, p_{T-\delta}) \lesssim \varepsilon_{\mathsf{score}} + \varepsilon$.

Proof. For fixed $k \in \{0, \ldots$
, N u and ℓ : “ t k , by the Cauch y-Sch warz inequality , it satisfies E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ď ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˛ ‚ 1 { 2 ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ˛ ‚ 1 { 2 ď ? d ¨ ˝ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ˛ ‚ 1 { 2 . Next, using z ´ 1 ´ log z Á p log z q 2 B for log z ě ´ B , together with log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ě ´ log M ` log ` e T ´ ℓ ´ 1 ˘ ě ´ log M ` log p T ´ ℓ q ě ´ log p M δ ´ 1 q , w e upper b ound E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q ˆ log p s T ´ ℓ p x ℓ d i c, x ℓ q s T ´ ℓ p x ℓ d i c, x ℓ q ˙ 2 ď log p M δ ´ 1 q ˆ E x ℓ „ Ð q ℓ ÿ i P m p x ℓ q ÿ c Pr S s s T ´ ℓ p x ℓ d i c, x ℓ q D p p s T ´ ℓ p x ℓ d i c, x ℓ q , s T ´ ℓ p x ℓ d i c, x ℓ qq . As a result, C can b e con trolled as C : “ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q log p s T ´ t k p x t k d i c, x t k q s T ´ t k p x t k d i c, x t k q ď a d log p M δ ´ 1 qˆ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q ¨ ˝ E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q D p p s T ´ t k p x t k d i c, x t k q , s T ´ t k p x t k d i c, x t k qq ˛ ‚ 1 { 2 44 p a q ď a κN d log p M δ ´ 1 qˆ ¨ ˝ N ´ 1 ÿ k “ 0 p t k ` 1 ´ t k q E x t k „ Ð q t k ÿ i P m p x t k q ÿ c Pr S s s T ´ t k p x t k d i c, x t k q D p p s T ´ t k p x t k d i c, x t k q , s T ´ t k p x t k d i c, x t k qq ˛ ‚ 1 { 2 ď a κN d log p M δ ´ 1 q ε score , (98) where in p a q we used t k ` 1 ´ t k ď κ together with the Cauch y-Sch warz inequalit y . Combining the bound of Eqn. ( 98 ) with Eqn. 
( 97 ) and κ “ r O p 1 { N q completes the proof. W e emphasize that, in con trast to Theorem 3 , in Theorem 5 we require early stopping for some δ ą 0 , whic h in turn leads to the exp onential-then-constan t step size schedule. W e now elaborate on the difference b et w een Theorem 5 and Theorem 3 , and provide some in tuition for the app earance of the t wo additional terms in Eqn. ( 96 ). R emark 3 . T o obtain an accurate sampler, it is natural to require that p Q t « Ð Q t uniformly for all t P r 0 , T s . The main challenge is that w e only hav e access to score estimates at discrete time points. In the truncated τ -leaping algorithm analyzed in this section, this results in appro ximating p s T ´ t : “ p s T ´ t k « s T ´ t . Informally , w e establish this by sho wing p s T ´ t k « s T ´ t k « s T ´ t , (99) where the first approximation is ensured by Assumption 1 and the second results from the prop erties of the score function for the masking noising pro cess, requiring the step size t k ` 1 ´ t k to b e small. In contrast, for Algorithm 1 considered in Theorem 3 , the condition p Q t « Ð Q t translates to p s T ´ t : “ e T ´ t k ´ 1 e T ´ t ´ 1 p s T ´ t k « s T ´ t . In view of Prop osition 6 , the abov e condition is equiv alent to p s T ´ t : “ e T ´ t k ´ 1 e T ´ t ´ 1 p s T ´ t k « e T ´ t k ´ 1 e T ´ t ´ 1 s T ´ t k “ s T ´ t , whic h is guaran teed b y Assumption 1 . Notably , this simple rescaling eliminates the need for the second appro ximation step that is required in the truncated τ -leaping analysis. This distinction explains wh y Theorem 5 contains tw o additional error terms and necessitates early stopping, in con trast to the cleaner guaran tee obtained in Theorem 3 . Pro of of Theorem 5 . The proof follows the pro of of Theorem 3 closely with several additional steps. W e b egin with Eqn. 
( 91 ) KL p Ð q T } p T q À e ´ T d log S ` N ´ 1 ÿ k “ 0 ż t k ` 1 t k E x t k ,x t „ Ð q t k ,t ÿ y ‰ x t « p Q t p x t , y q ´ Ð Q t p x t , y q ` Ð Q t p x t , y q log ˜ Ð Q t p x t , y q p Q t p x t , y q ¸ff d t. (100) Next, since for this sampler, the rate matrices p Q t are given b y the following: p Q t p x, y q : “ $ ’ & ’ % p s T ´ t k p x t k d i y i , x t k q I t x i “ MASK u , if d H p x, y q “ 1 , x i ‰ y i , and x i t k “ MASK , ´ ř z ‰ x p Q t p x, z q , if y “ x, 0 , otherwise, 45 w e can therefore b ound ÿ y ‰ x t « p Q t p x t , y q ´ Ð Q t p x t , y q ` Ð Q t p x t , y q log ˜ Ð Q t p x t , y q p Q t p x t , y q ¸ff “ ÿ i P m p x t q ÿ c Pr S s « p Q t p x t , x t d i c q ´ Ð Q t p x t , x t d i c q ` Ð Q t p x t , x t d i c q log ˜ Ð Q t p x t , x t d i c q p Q t p x t , x t d i c q ¸ff “ ÿ i P m p x t q ÿ c Pr S s „ p s T ´ ℓ p x ℓ d i c, x ℓ q ´ s T ´ t p x t d i c, x t q ` s T ´ t p x t d i c, x t q log ˆ s T ´ t p x t d i c, x t q p s T ´ ℓ p x ℓ d i c, x ℓ q ˙ȷ “ ÿ i P m p x t q ÿ c Pr S s s T ´ t p x t d i c, x t q D p p s T ´ ℓ p x ℓ d i c, x ℓ q , s T ´ t p x t d i c, x t qq . T o proceed, w e again apply the la w of cosines D p α, γ q “ D p α , β q ` D p β , γ q ` p α ´ β q β ´ γ β γ with α “ p s T ´ ℓ p x ℓ d i c, x ℓ q , β “ s T ´ ℓ p x ℓ d i c, x ℓ q , and γ “ s T ´ t p x t d i c, x t q . In the follo wing, we sligh tly abuse the notation and write x t : “ p x t d i c, x t q and x ℓ : “ p x ℓ d i c, x ℓ q whenev er i P m p x t q and c P r S s are fixed. As a result, for fixed i, c , one has s T ´ t p x t q D p p s T ´ ℓ p x ℓ q , s T ´ t p x t qq “ s T ´ t p x t q D p p s T ´ ℓ p x ℓ q , s T ´ ℓ p x ℓ qq ` s T ´ t p x t q D p s T ´ ℓ p x ℓ q , s T ´ t p x t qq ` p s T ´ ℓ p x ℓ q ´ s T ´ ℓ p x ℓ q s T ´ ℓ p x ℓ q p s T ´ ℓ p x ℓ q ´ s T ´ t p x t qq . 
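As a quick numerical sanity check, the "law of cosines" identity above can be verified on random positive triples. The sketch below assumes the divergence takes the form $D(a,b) = a/b - 1 - \log(a/b)$; this form is inferred from the surrounding derivation rather than quoted from the paper's definition of $D$.

```python
import math
import random

random.seed(1)

# Assumed form of the divergence used in the law-of-cosines identity:
# D(a, b) = a/b - 1 - log(a/b).  (An assumption inferred from the derivation.)
def D(a, b):
    return a / b - 1.0 - math.log(a / b)

# Verify D(alpha, gamma) = D(alpha, beta) + D(beta, gamma)
#                          + (alpha - beta)(beta - gamma)/(beta * gamma)
# on random positive triples.
for _ in range(1000):
    alpha, beta, gamma = (random.uniform(0.1, 10.0) for _ in range(3))
    lhs = D(alpha, gamma)
    rhs = D(alpha, beta) + D(beta, gamma) + (alpha - beta) * (beta - gamma) / (beta * gamma)
    assert abs(lhs - rhs) < 1e-9
```

Under this form of $D$, the identity is an exact algebraic cancellation (the cross term is precisely $\alpha/\gamma - \alpha/\beta - \beta/\gamma + 1$), which is what the check confirms.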
This can be rearranged as follows:
$$s_{T-t}(x_t)\, D\big(\widehat{s}_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big) = \underbrace{s_{T-\ell}(x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell),\, s_{T-\ell}(x_\ell)\big)}_{=:\,T_1} + \underbrace{s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big)}_{=:\,T_2} + \underbrace{\big( s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{=:\,T_3}.$$
It is therefore sufficient to control each term separately.

Similar to the proof of Theorem 3, the first term $T_1$, after summation, is upper bounded by the score entropy loss (using $m(x_t)\subseteq m(x_\ell)$ and the non-negativity of the summands):
$$\mathbb{E}_{x_t,x_\ell\sim\overleftarrow{q}_{t,\ell}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell),\, s_{T-\ell}(x_\ell\odot_i c, x_\ell)\big) \le \mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\, D\big(\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell),\, s_{T-\ell}(x_\ell\odot_i c, x_\ell)\big) = \mathcal{L}_{\mathrm{SE}}\big(T-\ell,\, \widehat{s}_{T-\ell},\, s_{T-\ell}\big),$$
and, by Assumption 1,
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}} \mathcal{L}_{\mathrm{SE}}\big(T-\ell,\, \widehat{s}_{T-\ell},\, s_{T-\ell}\big)\,\mathrm{d}t \le \varepsilon_{\mathrm{score}}. \qquad (101)$$
We now turn to controlling the terms $T_2$ and $T_3$, which require a different treatment than in the proof of Theorem 3.

For the term $T_2$, we again apply the law of cosines, this time with $\alpha = s_{T-\ell}(x_\ell)$, $\beta = s_{T-t}(x_\ell)$, and $\gamma = s_{T-t}(x_t)$. This leads to the decomposition
$$s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_t)\big) = \underbrace{s_{T-t}(x_t)\, D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_\ell)\big)}_{=:\,T_{21}} + \underbrace{s_{T-t}(x_t)\, D\big(s_{T-t}(x_\ell),\, s_{T-t}(x_t)\big)}_{=:\,T_{22}} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big)\,\frac{s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell)}{s_{T-t}(x_\ell)}}_{=:\,T_{23}}.$$

$\bullet$ For $T_{21}$, using Eqn. (60), observe that
$$D\big(s_{T-\ell}(x_\ell),\, s_{T-t}(x_\ell)\big) = \frac{e^{T-t}-1}{e^{T-\ell}-1} - 1 - \log\frac{e^{T-t}-1}{e^{T-\ell}-1} \le \frac{\big(e^{T-\ell}-e^{T-t}\big)^2}{2\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)} \lesssim \kappa^2,$$
where $\kappa$ is the parameter of the step size schedule, $t_{k+1}-t_k \le \kappa\min(1, T-t_{k+1})$. Since $\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t) = (e^{T-t}-1)^{-1}$ by Proposition 6 (as $\sum_{c\in[S]} q_0(x_t\odot_i c) = q_0(x_t)$), the total contribution of the terms $T_{21}$ is
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_t,x_{t_k}\sim\overleftarrow{q}_{t,t_k}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t)\, D\big(s_{T-t_k}(x_{t_k}\odot_i c, x_{t_k}),\, s_{T-t}(x_{t_k}\odot_i c, x_{t_k})\big)\,\mathrm{d}t \lesssim \kappa^2\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\,\frac{|m(x_t)|}{e^{T-t}-1}\,\mathrm{d}t \le \kappa^2 d\sum_{k=0}^{N-1}\frac{h_{k+1}}{e^{T-t_{k+1}}-1} \le \kappa^3 N d, \qquad (102)$$
where the last step uses $h_{k+1} \le \kappa\min(1, T-t_{k+1}) \le \kappa\big(e^{T-t_{k+1}}-1\big)$.

$\bullet$ The term $T_{22}$ coincides with the term $T_2$ from the proof of Theorem 3; thus, using Lemma 15, after summation over $i\in m(x_t)$ and $c\in[S]$ we obtain
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_{t_k},x_t\sim\overleftarrow{q}_{t_k,t}}\sum_{i\in m(x_t)}\sum_{c\in[S]} s_{T-t}(x_t\odot_i c, x_t)\, D\big(s_{T-t}(x_{t_k}\odot_i c, x_{t_k}),\, s_{T-t}(x_t\odot_i c, x_t)\big)\,\mathrm{d}t = \sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\int_{t_k}^t e^{t-v}\, I(T-v)\,\mathrm{d}v\,\mathrm{d}t \lesssim \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \qquad (103)$$
Here, we invoke the assumption $\kappa = O(1)$ and the non-negativity of conditional mutual information in the last inequality.

$\bullet$ To control $T_{23}$, observe that by Eqn. (60) the score function satisfies
$$\frac{s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell)}{s_{T-t}(x_\ell)} = \frac{e^{T-t}-e^{T-\ell}}{e^{T-\ell}-1},$$
and, importantly, the right-hand side does not depend on $x_\ell$. This implies that, upon summation over $i\in m(x_t)$ and $c\in[S]$, the term $T_{23}$ contributes zero, i.e.,
$$\sum_{i\in m(x_t)}\sum_{c\in[S]}\big( s_{T-t}(x_\ell\odot_i c, x_\ell) - s_{T-t}(x_t\odot_i c, x_t) \big)\,\frac{e^{T-t}-e^{T-\ell}}{e^{T-\ell}-1} = \frac{e^{T-t}-e^{T-\ell}}{\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)}\sum_{i\in m(x_t)}\Big( \frac{\sum_{c\in[S]} q_0(x_\ell\odot_i c)}{q_0(x_\ell)} - \frac{\sum_{c\in[S]} q_0(x_t\odot_i c)}{q_0(x_t)} \Big) = \frac{e^{T-t}-e^{T-\ell}}{\big(e^{T-\ell}-1\big)\big(e^{T-t}-1\big)}\sum_{i\in m(x_t)}(1-1) = 0.$$
Putting the pieces together, we conclude
$$\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_{t_k},x_t\sim\overleftarrow{q}_{t_k,t}}\sum_{i\in m(x_t)}\sum_{c\in[S]} T_2 \le \kappa^3 N d + \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t. \qquad (104)$$

It therefore remains to control the term $T_3$. Recall its definition:
$$T_3 := \big( s_{T-\ell}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}.$$
Crucially, unlike in the proof of Theorem 3, we no longer have a martingale property for this term. However, we can decompose
$$T_3 = \big( s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} + \underbrace{\big( s_{T-t}(x_\ell) - s_{T-t}(x_t) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}}_{\text{contributes zero by Lemma 14}}.$$
It remains to bound the first term, which can be written as
$$\big( s_{T-\ell}(x_\ell) - s_{T-t}(x_\ell) \big)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)} = \frac{e^{T-t}-e^{T-\ell}}{e^{T-t}-1}\, s_{T-\ell}(x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell)}{s_{T-\ell}(x_\ell)}.$$
In view of the step size assumption $t_{k+1}-t_k \le \kappa\min(1, T-t_{k+1})$, it satisfies $\big|\frac{e^{T-t}-e^{T-\ell}}{e^{T-t}-1}\big| \lesssim \kappa$. The total contribution of the terms $T_3$ can therefore be upper bounded by
$$\kappa\sum_{k=0}^{N-1}\int_{t_k}^{t_{k+1}}\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}\,\mathrm{d}t = \kappa\sum_{k=0}^{N-1}(t_{k+1}-t_k)\,\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}. \qquad (105)$$

Collecting Eqns. (100), (101), (102), (103), and (105) proves
$$\mathsf{KL}(q_\delta\,\|\,p_{T-\delta}) \lesssim \varepsilon_{\mathrm{score}} + e^{-T} d\log(S) + \sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t + \kappa^3 N d + \kappa C,$$
where
$$C := \sum_{k=0}^{N-1}(t_{k+1}-t_k)\,\mathbb{E}_{x_\ell\sim\overleftarrow{q}_\ell}\sum_{i\in m(x_\ell)}\sum_{c\in[S]} s_{T-\ell}(x_\ell\odot_i c, x_\ell)\log\frac{\widehat{s}_{T-\ell}(x_\ell\odot_i c, x_\ell)}{s_{T-\ell}(x_\ell\odot_i c, x_\ell)}.$$
For our step size schedule, as in Corollary 2, we upper bound
$$\sum_{k=0}^{N-1} h_{k+1}\int_{T-t_{k+1}}^{T-t_k} I(t)\,\mathrm{d}t \le \kappa\sum_{k=0}^{N-1}\int_{T-t_{k+1}}^{T-t_k}\min(1,\, T-t_{k+1})\, I(t)\,\mathrm{d}t \le \kappa\int_\delta^T\min(1,t)\, I(t)\,\mathrm{d}t \le \kappa\, D(q_{\mathrm{data}}). \qquad (106)$$
Plugging in the choices $\kappa = O\big(\frac{T+\log(\delta^{-1})}{N}\big)$, $T = O(\log(\varepsilon^{-1} d\log S))$, and $N = \widetilde{O}\big( \frac{D(q_{\mathrm{data}})+C}{\varepsilon} + \sqrt{d/\varepsilon} \big)$ yields $\mathsf{KL}(q_\delta\,\|\,p_{T-\delta}) \lesssim \varepsilon_{\mathrm{score}} + \varepsilon$.

E Proofs of the main lemmas

E.1 Characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$

The characterization of $B(q_{\mathrm{data}})$ and $C(q_{\mathrm{data}})$ is summarized in the following lemma.

Lemma 16. Consider a masking noising process with initial distribution $q_0 = q_{\mathrm{data}}$. Let $C(q_{\mathrm{data}})$ and $B(q_{\mathrm{data}})$ be the total correlation and the dual total correlation, respectively. Then
$$B(q_{\mathrm{data}}) = \int_0^\infty I(t)\,\mathrm{d}t \qquad\text{and}\qquad C(q_{\mathrm{data}}) = \int_0^\infty\big(e^t-1\big)\, I(t)\,\mathrm{d}t.$$
Consequently, $D(q_{\mathrm{data}}) \le \min\big( B(q_{\mathrm{data}}),\, C(q_{\mathrm{data}}) \big)$.

Proof. Let $p \equiv p(t) = e^{-t}$ be the probability that at time $t$ a coordinate is unmasked, and set $X(p) \equiv (X_1,\dots,X_d) := (x_t^1,\dots,x_t^d)$. We also write $X_R := (X_i)_{i\in R}$ for $R\subseteq[d]$. With a slight abuse of notation we write $I(p) := I(t(p))$. We have
$$I(p) = \sum_{i\ne j} I\big( x_t^i;\, x_t^j \,\big|\, x_t^{-(i,j)} \big) = \sum_{i\ne j} p^2\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\, I\big( X_i;\, X_j \,\big|\, X_R \big),$$
where the factor $p^2$ appears since, for the term $I(X_i; X_j\,|\,X_R)$ to be non-zero, both $X_i$ and $X_j$ must be unmasked.
For $i\in[d]$, define
$$h_i(p) := \sum_{R\subseteq[d]\setminus\{i\}} p^{|R|}(1-p)^{d-1-|R|}\, H\big( X_i \,\big|\, X_R \big),$$
with
$$\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} = \sum_{R\subseteq[d]\setminus\{i\}}\Big[ |R|\, p^{|R|-1}(1-p)^{d-1-|R|} - (d-1-|R|)\, p^{|R|}(1-p)^{d-2-|R|} \Big] H(X_i\,|\,X_R) = \sum_{j\ne i}\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\big( H(X_i\,|\,X_{R\cup\{j\}}) - H(X_i\,|\,X_R) \big) = -\sum_{j\ne i}\sum_{R\subseteq[d]\setminus\{i,j\}} p^{|R|}(1-p)^{d-2-|R|}\, I(X_i;\, X_j\,|\,X_R).$$
Therefore,
$$I(p) = \sum_{i=1}^d p^2\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big).$$
Since $p = e^{-t}$, we have $\mathrm{d}t = -\mathrm{d}p/p$, and we can write
$$\int_0^\infty I(t)\,\mathrm{d}t = \int_0^1\sum_{i=1}^d p\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big)\mathrm{d}p = \sum_{i=1}^d\big( -p\, h_i(p) \big)\Big|_0^1 + \int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p.$$
Observe that $\frac{\mathrm{d}H(X(p))}{\mathrm{d}p} = \sum_{i=1}^d h_i(p)$; therefore,
$$\int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p = H(X(1)) - H(X(0)) = H(x_0).$$
Since $\sum_{i=1}^d h_i(1) = \sum_{i=1}^d H\big( x_0^i\,\big|\,x_0^{-i} \big)$, we have proved the first part:
$$\int_0^\infty I(t)\,\mathrm{d}t = H(x_0) - \sum_{i=1}^d H\big( x_0^i\,\big|\,x_0^{-i} \big) = B(q_0).$$
We proceed similarly for the total correlation:
$$\int_0^\infty\big(e^t-1\big)\, I(t)\,\mathrm{d}t = \int_0^1 (1-p)\sum_{i=1}^d\Big( -\frac{\mathrm{d}h_i(p)}{\mathrm{d}p} \Big)\mathrm{d}p = -\Big( \sum_{i=1}^d (1-p)\, h_i(p)\Big|_0^1 + \int_0^1\sum_{i=1}^d h_i(p)\,\mathrm{d}p \Big) = \sum_{i=1}^d H(x_0^i) - H(x_0) = C(q_0),$$
using $h_i(0) = H(x_0^i)$.

E.2 Proof of Lemma 8

For any $i\in[d]$ and $c\in[S]$, define
$$f_{i,c}(t, x_t) := s_{T-t}\big( x_t\oplus_i c,\, x_t \big). \qquad (107)$$
The analysis below holds for all $i\in[d]$ and $c\in[S]$, so we omit the indices $i,c$ and simply write $f(t,x_t)$. Consider the backward process $\{x_t\}_{t\in[0,T]}\sim\{\overleftarrow{q}_t\}_{t\in[0,T]}$, which is a Poisson jump process with generator $\overleftarrow{L}_t$ given by
$$\big( \overleftarrow{L}_t f \big)(t,x) = \sum_{y:\, d_H(y,x)\le 1} Q_{T-t}(y,x)\, s_{T-t}(y,x)\big( f(t,y) - f(t,x) \big) = \frac{1}{S}\sum_{y:\, d_H(y,x)=1} s_{T-t}(y,x)\big( f(t,y) - f(t,x) \big).$$
By Itô's formula for Poisson point processes (Lemma 6), $f(t,x_t)$ satisfies the following stochastic differential equation: for $0\le\ell\le t< T$,
$$f(t,x_t) = f(\ell,x_\ell) + \int_\ell^t\Big[ \partial_t f(s, x_{s-}) + \big( \overleftarrow{L}_s f \big)(s, x_{s-}) \Big]\mathrm{d}s + M_t, \qquad (108)$$
where $x_{s-} = \lim_{u\to s^-} x_u$, which exists for Lebesgue-almost every $s\in[0,T)$ since there are finitely many jumps almost surely. The compensation process $\{M_u\}_{u\in[\ell,t]}$ in Eqn. (108) is defined as
$$M_u = \sum_{y_s:\, d_H(y_s,x_s)=1}\int_\ell^u\big( f(s,y_s) - f(s,x_s) \big)\big( \mathrm{d}N_s^{x_s,y_s} - \lambda_s^{x_s,y_s}\,\mathrm{d}s \big), \qquad (109)$$
where $N_s^{x,y}$ is the counting process of jumps from $x$ to $y$, $\mathrm{d}N_s^{x,y}$ is the associated random counting measure, and $\lambda_s^{x,y} = S^{-1}\, s_{T-s}(y,x)\,\mathbb{1}\{x_{s-} = x\}$ is the intensity of the process $N_s^{x,y}$. Since $x_{s-} = x_s$ for almost every $s\in(0,T)$ (again due to the almost-sure finiteness of the number of jumps along each path), we can rewrite Eqn. (108) as
$$f(t,x_t) - f(\ell,x_\ell) = \int_\ell^t\Big[ \partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) \Big]\mathrm{d}s + M_t. \qquad (110)$$
To further simplify the right-hand side, we assert that
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = 0. \qquad (111)$$
To see this, first recall the definition (107); direct calculations give
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = \frac{\partial}{\partial s}\Big( \frac{q_{T-s}(x_s\oplus_i c)}{q_{T-s}(x_s)} \Big) + \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_{i'}c', x_s)\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s) \Big)$$
$$\overset{(a)}{=} \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_i c, x_s)\Big( s_{T-s}(x_s\oplus_{i'}c', x_s) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s\oplus_i c \big) \Big) + \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]} s_{T-s}(x_s\oplus_{i'}c', x_s)\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s) \Big)$$
$$= \frac{1}{S}\sum_{i',c'}\Big( s_{T-s}(x_s\oplus_i c, x_s)\, s_{T-s}(x_s\oplus_{i'}c', x_s) - s_{T-s}(x_s\oplus_{i'}c', x_s)\, s_{T-s}(x_s\oplus_i c, x_s) \Big) + \frac{1}{S}\sum_{i',c'}\Big( s_{T-s}(x_s\oplus_{i'}c', x_s)\, s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s\oplus_{i'}c' \big) - s_{T-s}(x_s\oplus_i c, x_s)\, s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s\oplus_i c \big) \Big)$$
$$\overset{(b)}{=} \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]}\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s \big) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s \big) \Big),$$
where in equality (a) we apply the Kolmogorov forward equation to $q_{T-t}$, and in equality (b) we use the fact that $s_{T-t}(x,y)\, s_{T-t}(y,z) = s_{T-t}(x,z)$ for any $x,y,z\in\mathcal{X}$. It is direct to check that the $\oplus$ operators commute, i.e., $x_s\oplus_{i'}c'\oplus_i c = x_s\oplus_i c\oplus_{i'}c'$ for $i\ne i'$, and the relation holds trivially when $i = i'$. This directly reveals that
$$\partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) = \frac{1}{S}\sum_{i'\in[d]}\sum_{c'\in[S]}\Big( s_{T-s}\big( x_s\oplus_{i'}c'\oplus_i c,\, x_s \big) - s_{T-s}\big( x_s\oplus_i c\oplus_{i'}c',\, x_s \big) \Big) = 0,$$
which completes the proof of Eqn. (111).

Taking $u = \ell$ in Eqn. (109), we have $M_\ell = 0$ almost surely, and $M_u$ is a local martingale for $u\in[\ell,t]$ by definition. Recalling Lemma 10, we can bound
$$\sup_{s\in[\ell,t]}\sup_{x\in\mathcal{X}} f(s,x) \le \exp\Big( \log(S) + \max\big\{ \log\big( (T-t)^{-1} \big),\, 0 \big\} \Big) < \infty.$$
Similarly, the intensity of the counting process satisfies
$$\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}}\lambda_s^{x,y} \le \frac{1}{S}\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}} s_{T-s}(y,x) \le \frac{1}{S}\exp\Big( \log(S) + \max\big\{ \log\big( (T-t)^{-1} \big),\, 0 \big\} \Big) < \infty.$$
Now it is direct to check that
$$\sup_{s\in[\ell,t]}\mathbb{E}\big[ |M_s| \big] \lesssim (t-\ell)\, d(S-1)\cdot\sup_{s\in[\ell,t]}\sup_{x,y\in\mathcal{X}}\big[ f(s,x)\cdot\lambda_s^{x,y} \big] < \infty.$$
As a result, we conclude that $\{M_u\}_{u\in[\ell,t]}$ is $L^1$ and hence a martingale. By the definition of a martingale,
$$\mathbb{E}_{\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = M_\ell = 0.$$
Returning to Eqn. (110), we obtain
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) - f(\ell,x_\ell) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = 0. \qquad (112)$$
Thus, we conclude that
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \big( s_{T-\ell}(x_\ell\oplus_i c, x_\ell) - s_{T-t}(x_t\oplus_i c, x_t) \big)\log\widehat{s}_{T-\ell}(x_\ell\oplus_i c, x_\ell) \Big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(\ell,x_\ell) - f(t,x_t) \big]\cdot\log\widehat{s}_{T-\ell}(x_\ell\oplus_i c, x_\ell) = 0,$$
where we plug in Eqn. (112) in the last line.

E.3 Proof of Lemma 9

The proof of Lemma 9 follows directly from exchanging the order of summation. Specifically, with $h(z) := z\log z - z + 1$, we can write
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} h\big( s_{T-t}(y_t,x_t) \big) \Big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} s_{T-t}(y_t,x_t)\log s_{T-t}(y_t,x_t) - s_{T-t}(y_t,x_t) + 1 \Big]$$
$$= \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\frac{q_{T-t}(y_t)}{q_{T-t}(x_t)}\log s_{T-t}(y_t,x_t) \Big] - \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\frac{q_{T-t}(y_t)}{q_{T-t}(x_t)} \Big] + d(S-1)$$
$$= \sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t)\log s_{T-t}(y_t,x_t) - \sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t) + d(S-1)$$
$$\overset{(a)}{=} -\sum_{x_t\in[S]^d}\sum_{y_t:\, d_H(y_t,x_t)=1} q_{T-t}(x_t)\log s_{T-t}(y_t,x_t) - \sum_{y_t\in[S]^d}\sum_{x_t:\, d_H(y_t,x_t)=1} q_{T-t}(y_t) + d(S-1)$$
$$= -\mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1}\log s_{T-t}(y_t,x_t) \Big] - d(S-1) + d(S-1) = \mathbb{E}_{x_t\sim\overleftarrow{q}_t}\Big[ \sum_{y_t:\, d_H(y_t,x_t)=1} -\log s_{T-t}(y_t,x_t) \Big],$$
where in equality (a) we switch the roles of $x_t$ and $y_t$ in the summations.

E.4 Proof of Lemma 10

Lemma 10 is a direct consequence of Liang et al. (2025c, Lemma 2). Here, we present a simplified proof based on Proposition 6. It can be easily checked that
$$\alpha_t = \frac{1-e^{-t}}{1+(S-1)e^{-t}} \in (0,1).$$
By Eqn. (59), one has, for $d_H(x,y) = 1$,
$$s_t(y,x) = \frac{\mathbb{E}_{x_0\sim q_0}\,\alpha_t^{d_H(y,x_0)}}{\mathbb{E}_{x_0\sim q_0}\,\alpha_t^{d_H(x,x_0)}} \le \alpha_t^{-\sup_{y,x,x_0}|d_H(y,x_0)-d_H(x,x_0)|} = \exp\Big( -\log(\alpha_t)\cdot\sup_{y,x,x_0}\big| d_H(y,x_0) - d_H(x,x_0) \big| \Big) \le \exp\Big( -\log(\alpha_t)\cdot\sup_{y,x} d_H(y,x) \Big) = \exp\big( -\log(\alpha_t) \big),$$
where the last step uses $d_H(y,x) = 1$. With a similar calculation, one can establish the reversed inequality
$$s_t(y,x) \ge \exp\Big( \log(\alpha_t)\cdot\sup_{y,x} d_H(y,x) \Big) = \exp\big( \log(\alpha_t) \big).$$
As a result, we conclude
$$\big| \log s_t(y,x) \big| \le -\log(\alpha_t) \lesssim \log(S) + \max\big\{ \log(t^{-1}),\, 0 \big\}.$$
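The bound in Lemma 10 can be sanity-checked numerically in small dimension. The sketch below takes the representation displayed above from Eqn. (59), $s_t(y,x) = \mathbb{E}_{x_0}\alpha_t^{d_H(y,x_0)}/\mathbb{E}_{x_0}\alpha_t^{d_H(x,x_0)}$ (an assumed kernel form read off the formulas above, not an official implementation), draws a random $q_0$, and verifies $|\log s_t(y,x)| \le -\log\alpha_t$ over all Hamming-neighbor pairs.

```python
import itertools
import math
import random

random.seed(0)
d, S, t = 3, 4, 0.7
alpha = (1 - math.exp(-t)) / (1 + (S - 1) * math.exp(-t))

# Random target distribution q0 on [S]^d.
states = list(itertools.product(range(S), repeat=d))
w = [random.random() for _ in states]
tot = sum(w)
q0 = [v / tot for v in w]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def q_alpha(x):
    """Unnormalized q_t(x): E_{x0 ~ q0} alpha^{d_H(x, x0)} (assumed form from Eqn. (59))."""
    return sum(p * alpha ** hamming(x, x0) for p, x0 in zip(q0, states))

bound = -math.log(alpha)
worst = 0.0
for x in states:
    for i in range(d):
        for c in range(S):
            if c == x[i]:
                continue
            y = x[:i] + (c,) + x[i + 1:]  # Hamming neighbor of x at coordinate i
            worst = max(worst, abs(math.log(q_alpha(y) / q_alpha(x))))
assert worst <= bound + 1e-12
```

The check passes for any choice of $q_0$ because $|d_H(y,x_0) - d_H(x,x_0)| \le d_H(y,x) = 1$ pointwise, which is exactly the mechanism of the proof above.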
E.5 Proof of Lemma 11

Let us start by proving the first equation, i.e., $\varphi_{i,c}(t) = \mathsf{KL}\big( q_t \,\|\, (N_{i,-c})_\# q_t \big)$. Recall the definition of $\varphi_{i,c}(t)$:
$$\varphi_{i,c}(t) = \mathbb{E}_{x_t\sim q_t}\Big[ -\log\frac{q_t\big( N_{i,c}(x_t) \big)}{q_t(x_t)} \Big] = \sum_{x\in\mathcal{X}} q_t(x)\log\frac{q_t(x)}{q_t\big( N_{i,c}(x) \big)}. \qquad (113)$$
As in Eqn. (81), the pushforward measure satisfies $(N_{i,-c})_\# q_t(x) = q_t\big( N_{i,c}(x) \big)$ for any $x\in\mathcal{X}$. As such, we can express Eqn. (113) as
$$\varphi_{i,c}(t) = \sum_{x\in\mathcal{X}} q_t(x)\log\frac{q_t(x)}{(N_{i,-c})_\# q_t(x)} = \mathsf{KL}\big( q_t \,\|\, (N_{i,-c})_\# q_t \big),$$
which proves the first equation.

For the second relation, the definition of the KL divergence gives
$$-\frac{\partial}{\partial t}\mathsf{KL}(q_t\,\|\,p_0) = -\frac{\partial}{\partial t}\sum_{x\in[S]^d} q_t(x)\log q_t(x) = -\sum_{x\in[S]^d}\frac{\mathrm{d}q_t(x)}{\mathrm{d}t}\big( \log q_t(x) + 1 \big) = -\sum_{x\in[S]^d}\frac{\mathrm{d}q_t(x)}{\mathrm{d}t}\log q_t(x). \qquad (114)$$
Using the Kolmogorov forward equation for the forward noising process, we have
$$\frac{\mathrm{d}q_t(x)}{\mathrm{d}t} = \sum_{y\in\mathcal{X}} Q(y,x)\, q_t(y) = \frac{1}{S}\sum_{y:\, d_H(y,x)=1} q_t(y) - \frac{d(S-1)}{S}\, q_t(x).$$
Plugging this into Eqn. (114), we arrive at
$$-\frac{\partial}{\partial t}\mathsf{KL}(q_t\,\|\,p_0) = -\sum_{x\in[S]^d}\Big( \sum_{y:\, d_H(y,x)=1}\frac{1}{S}\, q_t(y) - \frac{d(S-1)}{S}\, q_t(x) \Big)\log q_t(x) = -\frac{1}{S}\sum_{x\in[S]^d}\sum_{y:\, d_H(y,x)=1}\big( q_t(y) - q_t(x) \big)\log q_t(x) = -\frac{1}{S}\sum_{x\in[S]^d}\sum_{y:\, d_H(y,x)=1} q_t(x)\big( \log q_t(y) - \log q_t(x) \big) = \varphi(t).$$
In addition, recall $\varphi(t) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\varphi_{i,c}(t)$. We reach
$$\varphi(t) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big) = \frac{1}{S}\sum_{i\in[d]}\sum_{c\in[S]}\mathsf{KL}\big( q_t\,\|\,(N_{i,c})_\# q_t \big).$$

E.6 Proof of Lemma 12

Let $L$ be the time-homogeneous infinitesimal generator of the forward process. Since each coordinate $i\in[d]$ is updated independently in the forward process, we can write $L = L_i + L_{-i}$, where $L_i$ only updates coordinate $i$ and $L_{-i}$ updates all other coordinates. It is direct to show that $L_i$ and $L_{-i}$ commute; therefore, for any $u\ge 0$,
$$q_{t+u} = q_t e^{uL_i} e^{uL_{-i}}, \qquad (N_{i,-c})_\# q_{t+u} = \big( (N_{i,-c})_\# q_t \big) e^{uL_i} e^{uL_{-i}},$$
where the second equation holds because the operator $N_{i,-c}$ commutes with the semigroup $\{e^{uL}\}_{u\ge 0}$. With this formulation, we reach
$$\varphi_{i,c}(t+u) = \mathsf{KL}\big( q_{t+u}\,\|\,(N_{i,-c})_\# q_{t+u} \big) \le \mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big), \qquad (115)$$
where in the last inequality we apply the (weak) data processing inequality for the KL divergence to remove the common channel $e^{uL_{-i}}$. Since both $N_{i,-c}$ and $L_i$ operate only on coordinate $i$, we arrive at the decomposition
$$\mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) = \mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\, e^{uL_i} \,\big\|\, \big( (N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) e^{uL_i} \big) \Big], \qquad (116)$$
where $(q_t)_{-i}$ is the marginal distribution of $q_t$ with coordinate $i$ excluded. Define $K_u$ to be the transition kernel on $[S]\times[S]$ induced by $e^{uL_i}$; one can show that
$$K_u(v_1,v_2) = \begin{cases} \frac{1}{S} + \big( 1 - \frac{1}{S} \big) e^{-u}, & \text{if } v_1 = v_2;\\ \frac{1}{S}\big( 1 - e^{-u} \big), & \text{if } v_1\ne v_2.\end{cases}$$
It can be directly checked that $K_u$ is an $S$-ary symmetric channel with noise scale $\sigma_u = (1-S^{-1})(1-e^{-u})$. By Makur and Polyanskiy (2018, Proposition 12), a strong data processing inequality holds for the channel $K_u$: for any distributions $p,q$ supported on $[S]$,
$$\mathsf{KL}\big( p\, e^{uL_i}\,\|\,q\, e^{uL_i} \big) \le \eta_{\mathsf{KL}}(K_u)\,\mathsf{KL}(p\,\|\,q),$$
where $\eta_{\mathsf{KL}}(K_u)$ satisfies
$$\eta_{\mathsf{KL}}(K_u) \le \Big| 1 - \sigma_u - \frac{\sigma_u}{S-1} \Big| = 1 - \frac{S}{S-1}\big( 1 - S^{-1} \big)\big( 1 - e^{-u} \big) = e^{-u}.$$
Combining this strong data processing inequality with Eqn. (116) yields
$$\mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) = \mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\, e^{uL_i}\,\big\|\,\big( (N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) e^{uL_i} \big) \Big] \le e^{-u}\,\mathbb{E}_{x^{-i}\sim(q_t)_{-i}}\Big[ \mathsf{KL}\big( q_t(\cdot|x^{-i})\,\big\|\,(N_{i,-c})_\# q_t(\cdot|x^{-i}) \big) \Big] = e^{-u}\,\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big).$$
Then, by Eqn. (115), we have
$$\varphi_{i,c}(t+u) \le \mathsf{KL}\big( q_t e^{uL_i}\,\|\,\big( (N_{i,-c})_\# q_t \big) e^{uL_i} \big) \le e^{-u}\,\mathsf{KL}\big( q_t\,\|\,(N_{i,-c})_\# q_t \big) = e^{-u}\,\varphi_{i,c}(t),$$
which holds for any $u\ge 0$. Therefore, the derivative can be bounded as
$$\varphi'_{i,c}(t) = \lim_{u\to 0^+}\frac{\varphi_{i,c}(t+u) - \varphi_{i,c}(t)}{u} \le \lim_{u\to 0^+}\frac{e^{-u}-1}{u}\,\varphi_{i,c}(t) = -\varphi_{i,c}(t),$$
which yields the desired result $-\varphi'_{i,c}(t) \ge \varphi_{i,c}(t)$.

E.7 Proof of Lemma 14

The proof follows (Conforti et al., 2025, Lemma 5.2.2); we include the argument below for completeness. Define
$$f(t,x_t) := s_{T-t}\big( x_t\odot_i c,\, x_t \big)\,\mathbb{1}\{ i\in m(x_t) \},$$
where the dependence on $i$ and $c$ is omitted for simplicity. In view of Lemma 6, for $0\le\ell\le t< T$, we can write
$$f(t,x_t) = f(\ell,x_\ell) + \int_\ell^t\Big[ \partial_t f(s,x_s) + \big( \overleftarrow{L}_s f \big)(s,x_s) \Big]\mathrm{d}s + M_t,$$
with the generator $\{\overleftarrow{L}_s\}_{s\in[\ell,t]}$ given by
$$\big( \overleftarrow{L}_s f \big)(s,x) = \sum_{y\ne x} Q_{T-s}(y,x)\, s_{T-s}(y,x)\big( f(s,y) - f(s,x) \big) = \sum_{i'\in m(x)}\sum_{c'\in[S]} s_{T-s}(x\odot_{i'}c', x)\big( f(s, x\odot_{i'}c') - f(s,x) \big),$$
and the compensation process $\{M_u\}_{u\in[\ell,t]}$ defined as
$$M_u = \int_\ell^u\sum_{i'\in m(x_s)}\sum_{c'\in[S]}\big( f(s, x_s\odot_{i'}c') - f(s,x_s) \big)\big( \mathrm{d}N_s^{x_s,\, x_s\odot_{i'}c'} - \lambda_s^{x_s,\, x_s\odot_{i'}c'}\,\mathrm{d}s \big).$$
By a similar argument as in the proof of Lemma 8, one has $\mathbb{E}_{\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}[M_t] = 0$, which leads to
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) - f(\ell,x_\ell) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \int_\ell^t\big( \partial_t f(s,x_s) + (\overleftarrow{L}_s f)(s,x_s) \big)\,\mathrm{d}s \Big].$$
Taking the derivative with respect to $t$ on both sides, we arrive at
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\Big[ \partial_t f(t,x_t) + \big( \overleftarrow{L}_t f \big)(t,x_t) \Big].$$
Now let us consider each term on the right-hand side separately. By Proposition 6, it obeys
$$s_{T-t}\big( x_t\odot_i c,\, x_t \big) = \frac{1}{e^{T-t}-1}\cdot\frac{q_0(x_t\odot_i c)}{q_0(x_t)},$$
and hence
$$\partial_t f(t,x_t) = \frac{e^{T-t}}{e^{T-t}-1}\, f(t,x_t).$$
Next, direct calculations yield
$$\big( \overleftarrow{L}_t f \big)(t,x_t) = \sum_{i'\in m(x_t)}\sum_{c'\in[S]} s_{T-t}(x_t\odot_{i'}c', x_t)\Big( s_{T-t}\big( x_t\odot_i c\odot_{i'}c',\, x_t\odot_{i'}c' \big)\,\mathbb{1}\{ i\in m(x_t\odot_{i'}c') \} - s_{T-t}(x_t\odot_i c, x_t)\,\mathbb{1}\{ i\in m(x_t) \} \Big)$$
$$= \frac{1}{e^{T-t}-1}\, f(t,x_t)\Big( \sum_{i'\in m(x_t)\setminus\{i\}}\sum_{c'\in[S]}\frac{q_0(x_t\odot_i c\odot_{i'}c')}{q_0(x_t\odot_i c)} - \sum_{i'\in m(x_t)}\sum_{c'\in[S]}\frac{q_0(x_t\odot_{i'}c')}{q_0(x_t)} \Big) = \frac{1}{e^{T-t}-1}\, f(t,x_t)\big( |m(x_t)\setminus\{i\}| - |m(x_t)| \big) = -\frac{1}{e^{T-t}-1}\, f(t,x_t).$$
Putting everything together leads to
$$\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big],$$
and therefore
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ f(t,x_t) \big] = e^{t-\ell}\cdot f(\ell,x_\ell).$$
Finally, in view of the relation
$$\Pr\big( x_t^i = \mathrm{MASK}\,\big|\, x_\ell^i = \mathrm{MASK} \big) = \frac{1 - e^{-(T-t)}}{1 - e^{-(T-\ell)}},$$
we conclude the following:
$$\mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ s_{T-t}(x_t\odot_i c, x_t)\,\mathbb{1}\{ i\in m(x_t) \} \big] = e^{t-\ell}\cdot s_{T-\ell}(x_\ell\odot_i c, x_\ell)\,\mathbb{1}\{ i\in m(x_\ell) \} = \mathbb{E}_{x_t\sim\overleftarrow{q}_{t|\ell}(\cdot|x_\ell)}\big[ s_{T-t}(x_\ell\odot_i c, x_\ell)\,\mathbb{1}\{ i\in m(x_t) \} \big],$$
which completes the proof of the desired result.

F Proofs of the auxiliary lemmas

F.1 Proof of Lemma 2

For a continuous random variable $U$ in $\mathbb{R}^d$ with density $p_U$ with respect to the Lebesgue measure, define the differential entropy of $U$ as
$$H_{\mathrm{diff}}(U) = -\int_{\mathbb{R}^d} p_U\log(p_U)\,\mathrm{d}x, \qquad (117)$$
where we again adopt the convention $0\log(0) = 0$.
By definition of m utual information, we ha ve I p W ; W ` ε noise q “ H diff p W ` ε noise q ´ H diff p W ` ε noise | W q p a q “ H diff p W ` ε noise q ´ E w “ H diff p w ` ε noise | W “ w q ‰ p b q “ H diff p W ` ε noise q ´ E w “ H diff p ε noise | W “ w q ‰ p c q “ H diff p W ` ε noise q ´ H diff p ε noise q , (118) where in (a), w e use the c hain rule of differen tial entrop y; in (b), w e apply the translation in v ariance prop ert y , i.e., H diff p U q “ H diff p c 0 ` U q for any constan t c 0 ; in (c), we use the condition that ε noise K K W . Denote the Gaussian densit y function with mean 0 and v ariance σ 2 I d as ϕ σ p¨q . Since ε noise „ N p 0 , σ 2 I d q , w e can compute with Eqn. ( 117 ) that H diff p ε noise q “ ´ ż R d ϕ σ p x q log ` ϕ σ p x q ˘ d x “ ´ ż R d ϕ σ p x q ˆ ´ d 2 log p 2 π σ 2 q ´ } x } 2 2 2 σ 2 ˙ d x “ d 2 log p 2 π σ 2 q ` E r} ε noise } 2 2 s 2 σ 2 “ d 2 log p 2 π eσ 2 q , (119) 56 where } ¨ } 2 is the Euclidean norm in R d . F or H diff p W ` ε noise q , notice that V ar “ W ` ε noise ‰ “ V ar r W s ` V ar r ε noise s ` 2 Cov “ W , ε noise ‰ “ V ar r W s ` σ 2 I d . By Cov er ( 1999 , P age 255), for distributions with the same finite v ariance, H diff is maximized at the centered Gaussian random v ariable. The refore, w e ha ve H diff p W ` ε noise q ď H diff ´ N ` 0 , V ar r W s ` σ 2 I d ˘ ¯ “ d 2 log p 2 π e q ` 1 2 log ` det ` V ar r W s ` σ 2 I d ˘˘ , where det p¨q is the determinant of matrices, and the calculation is the same as in Eqn. ( 119 ). Since V ar r W s is a positive semidefinite matrix, w e can apply the matrix inequality that log ` det ` V ar r W s ` σ 2 I d ˘˘ “ d log p σ 2 q ` log ` det ` I d ` V ar r W { σ 2 s ˘˘ ď d log p σ 2 q ` T r ` V ar r W { σ 2 s ˘ “ d log p σ 2 q ` T r p V ar r W sq σ 2 , whic h leads to H diff p W ` ε noise q ď d 2 log p 2 π eσ 2 q ` T r p V ar r W sq 2 σ 2 . (120) Plugging Eqns. ( 119 ) and ( 120 ) in to Eqn. 
( 118 ), we conclude that I p W ; W ` ε noise q ď d 2 log p 2 π eσ 2 q ` T r p V ar r W sq 2 σ 2 ´ d 2 log p 2 π eσ 2 q “ T r p V ar r W sq 2 σ 2 . F.2 Pro of of Lemma 3 F or X „ Bin p n, 1 { 2 q , its pmf is giv en by P p X “ x q “ ˆ n x ˙ ˆ 1 2 ˙ n 9 ˆ n x ˙ . Notice that our desired b ound is equiv alent to the following equation: ÿ x : x mo d 2 “ 0 ˆ n x ˙ “ ÿ x : x mo d 2 “ 1 ˆ n x ˙ , whic h follo ws from the binomial theorem for 0 “ p 1 ´ 1 q n . F.3 Pro of of Lemma 13 The CTMC ( 88 ) in the lemma statement can b e decomp osed into d indep enden t CTMCs for each co ordinate. F or co ordinates i such that x i t k ‰ MASK clearly neither Eqn. ( 88 ) nor Algorithm 1 makes a c hange. Next, w e fix i P m p x t k q . First, w e compute the probability that the i -th co ordinate remains masked for y t k ` 1 : Pr p y i t k ` 1 “ MASK | y t k q “ exp ¨ ˝ ż t k ` 1 t k ¨ ˝ ´ ÿ c Pr S s p s T ´ t k p y t k d i c, y t k q e T ´ t k ´ 1 e T ´ t ´ 1 ˛ ‚ d t ˛ ‚ “ exp p p Q i k p MASK q ∆ k q “ P k , where p Q i k p MASK q , ∆ k , and P k are defined in Algorithm 1 . Next, for c P r S s w e can write Pr p y i t k ` 1 “ c | x t k q “ Pr p x i t k ` 1 “ c | x t k and x i t k ` 1 ‰ MASK qp 1 ´ P k q . 57 Since for an y t P r t k , t k ` 1 q the rates p Q t p x, x d i c q are proportional to p Q i k p c q , we get that Pr p y i t k ` 1 “ c | x t k and y i t k ` 1 ‰ MASK q “ p Q i k p c q ř b Pr S s p Q i k p b q , whic h matc hes the expression in Algorithm 1 . Therefore, the distribution of y t k ` 1 defined by the CTMC matc hes the distribution of x t k ` 1 from the algorithm. F.4 Pro of of Lemma 15 In view of the definition of D p¨ , ¨q , one can write s T ´ t p x t d i c, x t q D p s T ´ t p x ℓ d i c, x ℓ q , s T ´ t p x t d i c, x t qq “ s T ´ t p x ℓ d i c, x ℓ q ´ s T ´ t p x t d i c, x t q ` s T ´ t p x t d i c, x t q log s T ´ t p x t d i c, x t q s T ´ t p x ℓ d i c, x ℓ q . 
The first two terms cancel out in expectation by Lemma 14; i.e., for any $c \in [S]$, one has
\[
\mathbb{E}_{x_t \sim \overleftarrow{q}_{t \mid \ell}(\cdot \mid x_\ell)} \Big[ \sum_{i \in m(x_t)} \big( s_{T-t}(x_\ell \odot_i c, x_\ell) - s_{T-t}(x_t \odot_i c, x_t) \big) \Big] = 0.
\]
Next, using Eqn. (60), we obtain
\[
\frac{s_{T-t}(x_t \odot_i c, x_t)}{s_{T-t}(x_\ell \odot_i c, x_\ell)} = \frac{q_0(x_t \odot_i c)\, q_0(x_\ell)}{q_0(x_t)\, q_0(x_\ell \odot_i c)}.
\]
Using this relation, we further bound
\[
\begin{aligned}
& \mathbb{E}_{x_\ell, x_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \in m(x_t)} \sum_{c \in [S]} s_{T-t}(x_t \odot_i c, x_t) \log \frac{q_0(x_t \odot_i c)\, q_0(x_\ell)}{q_0(x_t)\, q_0(x_\ell \odot_i c)} \\
&\qquad = \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \sum_{i \notin m(y_t)} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)} \\
&\qquad = \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)},
\end{aligned}
\tag{121}
\]
where in the second line, we used the definition of the score function along with the natural bijection between the sets $\{(x, i, c) : x \in \mathcal{X},\, i \in m(x),\, c \in [S]\}$ and $\{(y, i) : y \in \mathcal{X},\, i \notin m(y)\}$ to change the measure under the expectation:
\[
x_t \to y_t \odot_i \mathsf{MASK}, \qquad x_\ell \to y_\ell \odot_i \mathsf{MASK}, \qquad x_t \odot_i c \to y_t, \qquad x_\ell \odot_i c \to y_\ell \odot_i y^i_t.
\]
Note that since $y_\ell$ appears earlier in the backward process, $y^i_\ell$ can be masked or unmasked. Since the $i$-th element of $x_\ell \odot_i c$ is unmasked by construction, we explicitly set the $i$-th element of $y_\ell$ to $y^i_t$. The third line follows from the fact that, for $i \in m(y_t)$, the corresponding term is equal to zero.

Next, we define, for fixed $t$, $y_t$, and $i \in [d]$,
\[
f_i(y) := \log \frac{q_0(y \odot_i y^i_t)}{q_0(y \odot_i \mathsf{MASK})}.
\]
To further control the right-hand side of Eqn. (121), we invoke Dynkin's formula as described in Lemma 6 to obtain
\[
\begin{aligned}
& \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \log \frac{q_0(y_t)\, q_0(y_\ell \odot_i \mathsf{MASK})}{q_0(y_t \odot_i \mathsf{MASK})\, q_0(y_\ell \odot_i y^i_t)}
= \sum_{i \in [d]} \mathbb{E}_{y_\ell, y_t \sim \overleftarrow{q}_{\ell, t}} \big[ f_i(y_t) - f_i(y_\ell) \big] \\
&\qquad = \sum_{i \in [d]} \int_\ell^t \mathbb{E}_{y_v, y_t \sim \overleftarrow{q}_{v, t}} \sum_{j \notin m(y_v) \cup \{i\}} \log \frac{q_0(y_v \odot_i y^i_t)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_i y^i_t \odot_j \mathsf{MASK})}\, \mathrm{d}v \\
&\qquad \overset{(i)}{=} \sum_{i \neq j \in [d]} \int_\ell^t \mathbb{E}_{y_v, y_t \sim \overleftarrow{q}_{v, t}} \log \frac{q_0(y_v \odot_i y^i_t)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_i y^i_t \odot_j \mathsf{MASK})}\, \mathrm{d}v \\
&\qquad \overset{(ii)}{=} \sum_{i \neq j \in [d]} \int_\ell^t e^{t - v}\, \mathbb{E}_{y_v \sim \overleftarrow{q}_v} \log \frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})}\, \mathrm{d}v.
\end{aligned}
\tag{122}
\]
Here for part (i), as before, we extended the sum to include $j \in m(y_v) \setminus \{i\}$, since the additional terms are equal to zero. As $f_i(y)$ only depends on $d - 1$ coordinates of $y$ (all except the $i$-th), the constraint $j \neq i$ appears. Regarding part (ii), it follows from $\Pr\big(y^i_v \neq \mathsf{MASK} \mid y^i_t \neq \mathsf{MASK}\big) = e^{v - t}$.

Next, let $y_v^{-(i, j)}$ denote all unmasked elements of $y_v$ except the $i$-th and $j$-th. We can write
\[
\frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})} = \frac{q_0\big(y^i_v, y^j_v \mid y_v^{-(i,j)}\big)}{q_0\big(y^i_v \mid y_v^{-(i,j)}\big)\, q_0\big(y^j_v \mid y_v^{-(i,j)}\big)},
\]
and thus,
\[
\sum_{i \neq j \in [d]} \int_\ell^t e^{t - v}\, \mathbb{E}_{y_v \sim \overleftarrow{q}_v} \log \frac{q_0(y_v)\, q_0(y_v \odot_i \mathsf{MASK} \odot_j \mathsf{MASK})}{q_0(y_v \odot_i \mathsf{MASK})\, q_0(y_v \odot_j \mathsf{MASK})}\, \mathrm{d}v = \sum_{i \neq j} \int_\ell^t e^{t - v}\, I\big( y^i_v;\, y^j_v \mid y_v^{-(i,j)} \big)\, \mathrm{d}v = \int_\ell^t e^{t - v}\, I(T - v)\, \mathrm{d}v,
\tag{123}
\]
since $y_v \sim q_{T - v}$. Combining Eqns. (121), (122), and (123) concludes the proof.
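Two elementary ingredients of this appendix lend themselves to a quick numerical sanity check: the mutual-information bound $I(W; W + \varepsilon_{\mathrm{noise}}) \le \mathsf{Tr}(\mathsf{Var}[W])/(2\sigma^2)$ (which, in the special case of Gaussian $W$, can be compared against the closed-form value $\tfrac{1}{2}\log\det(I_d + \mathsf{Var}[W]/\sigma^2)$), and the parity identity behind Lemma 3. The sketch below is illustrative only and not part of the argument; it assumes NumPy, and the specific dimension, seed, and noise level are arbitrary choices.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d, sigma = 5, 0.7

# A random positive semidefinite covariance playing the role of Var[W].
A = rng.standard_normal((d, d))
Sigma = A @ A.T

# For Gaussian W ~ N(0, Sigma) and independent eps ~ N(0, sigma^2 I_d),
# the mutual information I(W; W + eps) has the closed form
# 0.5 * log det(I_d + Sigma / sigma^2).
_, logdet = np.linalg.slogdet(np.eye(d) + Sigma / sigma**2)
mi_exact = 0.5 * logdet

# The bound from the proof: I(W; W + eps) <= Tr(Var[W]) / (2 sigma^2),
# which for the Gaussian case reduces to log det(I + M) <= Tr(M).
mi_bound = np.trace(Sigma) / (2 * sigma**2)
assert mi_exact <= mi_bound + 1e-12

# Lemma 3: for X ~ Bin(n, 1/2), the even and odd outcomes carry equal
# mass, i.e. the even and odd binomial sums agree (both equal 2^(n-1)),
# which is the alternating-sum identity 0 = (1 - 1)^n.
n = 11
even = sum(comb(n, x) for x in range(0, n + 1, 2))
odd = sum(comb(n, x) for x in range(1, n + 1, 2))
assert even == odd == 2 ** (n - 1)
```

The Gaussian comparison also illustrates why the bound is tight only in the small-signal regime: $\log\det(I_d + M) \approx \mathsf{Tr}(M)$ when the eigenvalues of $M = \mathsf{Var}[W]/\sigma^2$ are small, while for large eigenvalues the logarithm makes the exact value much smaller than the trace bound.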