Manifold-Aligned Generative Transport


Authors: Xinyu Tian, Xiaotong Shen

Abstract

High-dimensional generative modeling is fundamentally a manifold-learning problem: real data concentrate near a low-dimensional structure embedded in the ambient space. Effective generators must therefore balance support fidelity (placing probability mass near the data manifold) with sampling efficiency. Diffusion models often capture near-manifold structure but require many iterative denoising steps and can leak off-support; normalizing flows sample in one pass but are limited by invertibility and dimension preservation. We propose MAGT (Manifold-Aligned Generative Transport), a flow-like generator that learns a one-shot, manifold-aligned transport from a low-dimensional base distribution to the data space. Training is performed at a fixed Gaussian smoothing level, where the score is well-defined and numerically stable. We approximate this fixed-level score using a finite set of latent anchor points with self-normalized importance sampling, yielding a tractable objective. MAGT samples in a single forward pass, concentrates probability near the learned support, and induces an intrinsic density with respect to the manifold volume measure, enabling principled likelihood evaluation for generated samples. We establish finite-sample Wasserstein bounds linking smoothing level and score-approximation accuracy to generative fidelity, and empirically improve fidelity and manifold concentration across synthetic and benchmark datasets while sampling substantially faster than diffusion models.

Keywords: Manifold learning, Diffusion, Flows, High fidelity, Synthetic data generation.

1 Introduction

Modern generative modeling is characterized by a trade-off between fidelity and efficiency.
Diffusion models can produce highly realistic samples but typically rely on iterative denoising at inference time, which makes generation expensive even with improved solvers and distillation (Dhariwal and Nichol, 2021; Karras et al., 2022; Rombach et al., 2022; Song et al., 2023; Salimans and Ho, 2022; Lu et al., 2022). On the other hand, normalizing flows enable single-pass sampling and tractable likelihoods via change-of-variables training and invertible architectures (Dinh et al., 2017; Kingma and Dhariwal, 2018; Papamakarios et al., 2021). Continuous-time, transport-based formulations, including probability flow, flow matching, rectified flow, and stochastic interpolants, help bridge these paradigms by casting generation as transport, often reducing the number of function evaluations needed for sampling (Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole, 2021; Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2023). Nonetheless, the most efficient flow constructions remain constrained by invertibility and dimension preservation, while diffusion-based samplers require multiple evaluations at inference time (Kobyzev et al., 2021; Papamakarios et al., 2021).

∗ This work was supported in part by the National Science Foundation (NSF) under Grant DMS-2513668, by the National Institutes of Health (NIH) under Grants R01AG069895, R01AG065636, R01AG074858, and U01AG073079, and by the Minnesota Supercomputing Institute. (Corresponding author: Xiaotong Shen.)
† Xinyu Tian is with the School of Statistics, University of Minnesota, MN 55455 USA (email: tianx@umn.edu).
‡ Xiaotong Shen is with the School of Statistics, University of Minnesota, MN 55455 USA (email: xshen@umn.edu).

The limitations of existing approaches become most acute in the manifold regime, where data concentrate near a low-dimensional set embedded in a high-dimensional ambient space.
This setting is common for images, biological measurements, and learned feature embeddings. When probability mass lies near a thin support, ambient-space modeling can waste capacity in directions orthogonal to the data support and may lead to off-manifold leakage, miscalibrated likelihoods, or unreliable uncertainty estimates and out-of-distribution behavior (Nalisnick et al., 2019; Kirichenko et al., 2020; Ren et al., 2019). Geometry-aware generative methods aim to address these issues by incorporating manifold structure into training, but accurately capturing the relevant geometry while maintaining scalability and stable optimization remains challenging (De Bortoli et al., 2022; Huang, Aghajohari, Bose, Panangaden and Courville, 2022).

We introduce MAGT (Manifold-Aligned Generative Transport), a flow-inspired framework designed to reconcile high fidelity with one-shot sampling in the manifold regime. The method trains at a fixed level of Gaussian smoothing in the ambient space, where the perturbed data distribution has a well-defined density and score. A central posterior identity shows that the smoothed score is determined by the clean sample averaged under the posterior given a noisy observation. Building on classical connections between score matching and denoising (Hyvärinen, 2005; Vincent, 2011), MAGT approximates this conditional mean using a finite collection of latent anchors together with self-normalized importance sampling. The anchor approximation can be instantiated with standard Monte Carlo, quasi-Monte Carlo variance reduction, or Laplace-based proposals, yielding a practical score estimator and an end-to-end single-level denoising score-matching objective.
On the theoretical side, we establish a new single-level pull-back inequality that translates a squared score discrepancy between two smoothed distributions at a fixed noise level into a Wasserstein error bound between their corresponding unsmoothed generators. This result highlights the roles of smoothing and underlying manifold geometry in determining generation error. Building on this inequality, we combine it with a finite-sample complexity analysis of fixed-level score-matching risk minimization to obtain nonasymptotic generation bounds whose rates depend on the intrinsic dimension and explicitly quantify the anchor approximation error.

Empirically, experiments on synthetic manifolds as well as image and tabular benchmarks demonstrate that MAGT outperforms diffusion baselines in fidelity across all reported settings and uniformly outperforms GANs. Relative to flow matching, MAGT matches or improves fidelity on the synthetic-manifold suite and is best on three of four real benchmarks (MNIST, Superconduct, Genomes); CIFAR10-0 (airplanes) is the only case where flow matching attains a lower FID. These gains come with one-shot sampling, while simultaneously improving support concentration and substantially reducing inference-time function evaluations.

Our contributions are as follows.

1) Methodology: We introduce a non-invertible transport $h : \mathbb{R}^d \to \mathbb{R}^D$ tailored to the manifold regime, trained at a fixed Gaussian smoothing level via a posterior score identity. A finite set of latent anchors combined with self-normalized importance sampling yields a practical single-level denoising score-matching objective. The learned $h$ enables one-shot sampling and induces an intrinsic density on its image (with respect to the $d$-dimensional Hausdorff measure) that is computable under mild regularity conditions; see Table 1.

2) Theory: We prove (i) a new single-level pull-back inequality that converts fixed-level score error into Wasserstein generation error, and (ii) an excess-risk bound for our fixed-level score-matching risk minimization with finite anchors via bracketing entropy. Together, these yield finite-sample generation rates that depend on the intrinsic dimension and explicitly track smoothing and manifold geometry; see Table 2.

3) Algorithms: Practical Monte Carlo, quasi-Monte Carlo, and Laplace-based proposals for anchor selection within a unified training objective.

4) Evidence: Empirical results on synthetic and real image/tabular data indicate consistent fidelity gains over diffusion baselines (all benchmarks) and GANs (all benchmarks); relative to flow matching, MAGT improves or matches fidelity on synthetic manifolds and is best on MNIST, Superconduct, and Genomes, with CIFAR10-0 the only benchmark where flow matching is clearly better in FID. In tabular settings, these gains are substantial: MAGT reduces $W_2$ by 74.9% on Superconduct and 43.6% on Genomes relative to DDIM (Table 5). Under one-shot sampling, MAGT also substantially improves concentration near the data support and reduces inference-time network evaluations by orders of magnitude.

The remainder of the paper is organized as follows. Section 2 introduces the MAGT framework, including the transport-based score identity, the resulting one-shot sampler, and practical considerations for likelihood evaluation. Section 3 develops nonasymptotic risk bounds that relate single-level score estimation error to Wasserstein generation accuracy. Section 4 discusses practical Monte Carlo and quasi-Monte Carlo schemes for approximating the conditional expectations that appear in the MAGT score estimator. Section 5 describes practical implementation details for training MAGT, including a memory-efficient update for large anchor banks.
Section 6 presents empirical results on synthetic manifolds and real image/tabular datasets. Section 7 concludes with a brief discussion. The Appendix contains proofs, auxiliary lemmas, and additional experimental and implementation details.

2 MAGT: Manifold-aligned generative transport

2.1 Dimension alignment via perturbation

Consider generative modeling in which observations $Y_0 \in \mathbb{R}^D$ concentrate near a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ of intrinsic dimension $d \ll D$. Our goal is to learn a deterministic transport (generator) map $h : \mathbb{R}^d \to \mathbb{R}^D$ such that, for a latent variable $U$ drawn from a base distribution with density $\pi$ (e.g., a standard Gaussian on $\mathbb{R}^d$ or a uniform distribution on $[0,1]^d$), the generated sample $Y_0 = h(U)$ follows the data distribution $p_{Y_0}$. Importantly, $h$ need not be invertible, and the latent and data dimensions may differ, which is essential when the target distribution concentrates on or near a lower-dimensional manifold.

A key obstacle is that if $p_{Y_0}$ is supported on a manifold, it can be singular with respect to Lebesgue measure on $\mathbb{R}^D$, so an ambient density and score for $Y_0$ may be ill-defined. MAGT resolves this by working at a fixed smoothing level $t$: we add Gaussian noise in the ambient space so that the corrupted variable $Y_t$ has an everywhere-positive density and a well-defined score $\nabla_{y_t} \log p_{Y_t}(y_t)$. Crucially, this smoothed score admits a posterior/mixture representation in terms of $h$ and the base distribution $\pi$, which we approximate with a finite set of latent "anchors" and then convert into a one-shot transport map.

Ambient Gaussian perturbations. We introduce a noise schedule $(\alpha_t, \sigma_t)_{t \in [0,1]}$ and define, for each $t$, a perturbed observation
\[
Y_t = \alpha_t Y_0 + \sigma_t Z_t, \qquad Z_t \sim \mathcal{N}(0, I_D). \tag{1}
\]
This construction defines a dimension-preserving Gaussian corruption of $Y_0$ directly in the ambient space.
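The corruption in (1) can be sketched in a few lines. A minimal illustration, assuming the variance-preserving schedule $\sigma_t^2 = t$, $\alpha_t = \sqrt{1-t}$ that the paper adopts later in Section 3; the function names are illustrative, not from the paper:

```python
import numpy as np

def vp_schedule(t):
    """Variance-preserving schedule: alpha_t = sqrt(1 - t), sigma_t = sqrt(t)."""
    return np.sqrt(1.0 - t), np.sqrt(t)

def perturb(y0, t, rng):
    """Ambient Gaussian corruption of Eq. (1): Y_t = alpha_t * Y_0 + sigma_t * Z."""
    alpha_t, sigma_t = vp_schedule(t)
    z = rng.standard_normal(y0.shape)
    return alpha_t * y0 + sigma_t * z

rng = np.random.default_rng(0)
y0 = rng.standard_normal((4, 10))   # four points in ambient dimension D = 10
yt = perturb(y0, t=0.25, rng=rng)   # smoothed observations at fixed level t
```

Under this schedule $\alpha_t^2 + \sigma_t^2 = 1$ for every $t$, and $t = 0$ returns the clean data unchanged, consistent with the role of $t$ as a smoothing level rather than a diffusion time.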
It provides a probabilistic link between the distribution of $Y_t$ in $\mathbb{R}^D$ and that of the clean data $Y_0$, which may be supported on a $d$-dimensional manifold with $d \le D$. We use only the marginal Gaussian corruption in (1); no underlying SDE or diffusion dynamics are assumed.

2.2 Score matching and generator

MAGT uses the perturbed data $Y_t$ to define a score-matching objective that learns $h$ via $\nabla_{y_t} \log p(y_t)$ in the ambient space, linking the noisy observation $Y_t$ to its clean counterpart $Y_0 = h(U)$. The noise level is controlled by $t$ through the schedule $(\alpha_t, \sigma_t)$.

From (1) and $Y_0 = h(U)$, the conditional density of $Y_t$ given $U = u$ is $p_t(y_t \mid u) = \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$, where $\phi(y; m, \Sigma)$ denotes the density of $\mathcal{N}(m, \Sigma)$ and $\pi$ is the base density of $U$. The marginal density of $Y_t$ is the continuous mixture
\[
p_t(y_t) = \int \phi\big(y_t; \alpha_t h(u), \sigma_t^2 I_D\big)\, \pi(u)\, du.
\]
Differentiating $\log p_t(y_t)$ yields the mixture score
\[
\nabla_{y_t} \log p_t(y_t) = \mathbb{E}\big[\nabla_{y_t} \log p_t(y_t \mid U) \mid Y_t = y_t\big] = \frac{1}{\sigma_t^2}\big(\alpha_t\, \mathbb{E}[h(U) \mid Y_t = y_t] - y_t\big), \tag{2}
\]
where $\nabla_{y_t} \log p_t(y_t \mid u) = -(y_t - \alpha_t h(u))/\sigma_t^2$ and the conditional expectation is under $p(u \mid y_t) \propto \pi(u)\, p_t(y_t \mid u)$. This identity highlights that the mixture score depends on the posterior mean $\mathbb{E}[h(U) \mid y_t]$, which encodes the geometry of the latent space and the generator $h$.

Transport-based score estimator. In (2), score estimation at noise level $t$ requires computing the posterior mean $\mathbb{E}[h(U) \mid y_t]$ under $p(u \mid y_t) \propto \pi(u)\, \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$. We approximate this conditional expectation using self-normalized importance sampling. Specifically, let $\tilde{\pi}(\cdot \mid y_t)$ be a proposal distribution on the latent space, possibly depending on $y_t$. We draw $U^{(1)}, \ldots, U^{(K)} \overset{\text{i.i.d.}}{\sim} \tilde{\pi}(\cdot \mid y_t)$ and form the unnormalized importance weights
\[
\omega_t^{(j)}(y_t) := \frac{\pi\big(U^{(j)}\big)}{\tilde{\pi}\big(U^{(j)} \mid y_t\big)}\, \phi\big(y_t; \alpha_t h\big(U^{(j)}\big), \sigma_t^2 I_D\big). \tag{3}
\]
Then the posterior mean $\mathbb{E}[h(U) \mid y_t]$ is approximated by
\[
\tilde{m}_{t,K}(y_t) := \frac{\sum_{j=1}^K \omega_t^{(j)}(y_t)\, h\big(U^{(j)}\big)}{\sum_{j=1}^K \omega_t^{(j)}(y_t)},
\]
and we define the corresponding transport-based score estimator
\[
\tilde{s}_{t,K}(y_t; h, \pi, \tilde{\pi}) := \frac{1}{\sigma_t^2}\big(\alpha_t\, \tilde{m}_{t,K}(y_t) - y_t\big). \tag{4}
\]
When $\tilde{\pi}(\cdot \mid y_t) \equiv \pi(\cdot)$ (i.e., we sample anchors from the generative base), the importance ratio cancels and $\omega_t^{(j)}(y_t) \propto \phi(y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D)$, recovering the finite-mixture form. This estimator is "transport-based" because it is an explicit functional of the learned map $h$ and the base distribution, with the conditional expectation in (2) approximated by a weighted set of anchors $\{U^{(j)}\}_{j=1}^K$.

Choice of proposal distribution $\tilde{\pi}$. The base distribution $\pi$ specifies the generative model: draw $U \sim \pi$ and set $Y_0 = h(U)$. The proposal $\tilde{\pi}$ in (3) is purely a computational device for approximating $\mathbb{E}[h(U) \mid y_t]$: as long as $\tilde{\pi}(\cdot \mid y_t)$ has support covering the high-density regions of the posterior and the importance ratio $\pi/\tilde{\pi}$ is included, the estimator in (4) is consistent (and asymptotically unbiased) for the true posterior mean.

For small $\sigma_t$ (or for expressive $h$ in high ambient dimension), the latent posterior $p(u \mid y_t) \propto \pi(u)\, \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ can be sharply concentrated: only a tiny subset of latent points produces $\alpha_t h(u)$ close to the observed $y_t$. If we draw anchors from the prior $\pi$, most samples receive negligible likelihood weight, leading to a low effective sample size and a high-variance estimate of $\mathbb{E}[h(U) \mid y_t]$ (and therefore of the score).
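The estimator in (3)-(4) is short to implement. A minimal sketch, assuming the proposal equals the base, $\tilde{\pi} = \pi = \mathcal{N}(0, I_d)$, so the importance ratio cancels and the unnormalized log-weight is the Gaussian log-likelihood; the constant generator used in the demo is a deliberately degenerate stand-in, not the paper's model:

```python
import numpy as np

def snis_score(y_t, h, alpha_t, sigma_t, K, rng, d):
    """Self-normalized importance-sampling score estimator (Eqs. 3-4),
    with anchors drawn from the base pi = N(0, I_d)."""
    U = rng.standard_normal((K, d))               # anchors U^(j) ~ pi
    hU = h(U)                                     # (K, D) generator outputs
    log_w = -np.sum((y_t - alpha_t * hU) ** 2, axis=1) / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())               # stabilized, self-normalized below
    m = (w[:, None] * hU).sum(axis=0) / w.sum()   # posterior-mean estimate m~_{t,K}
    return (alpha_t * m - y_t) / sigma_t ** 2     # score estimate (Eq. 4)

rng = np.random.default_rng(1)
c = np.array([1.0, -2.0, 0.5])
h_const = lambda U: np.tile(c, (U.shape[0], 1))   # degenerate generator h(u) = c
score = snis_score(np.zeros(3), h_const, alpha_t=0.9, sigma_t=0.5, K=64, rng=rng, d=2)
```

For a constant generator the posterior mean is exactly $c$, so the estimator returns $(\alpha_t c - y_t)/\sigma_t^2$ with no Monte Carlo error, which gives a quick correctness check; working with log-weights and subtracting the maximum before exponentiating is the standard guard against underflow when the posterior is sharply concentrated.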
A proposal $\tilde{\pi}(\cdot \mid y_t)$ that better matches the posterior geometry yields more balanced weights, improving numerical stability and reducing Monte Carlo variance without changing the underlying model. Section 4 presents practical choices that plug directly into (3)-(4): (i) MAGT-MC uses $\tilde{\pi} = \pi$ (baseline sampling); (ii) MAGT-QMC replaces i.i.d. draws from $\pi$ with low-discrepancy point sets to reduce integration error; and (iii) MAGT-MAP uses a data-dependent Gaussian proposal $\tilde{\pi}(\cdot \mid y_t) = q(\cdot \mid y_t)$ obtained from a MAP-Laplace approximation to the posterior. All three choices estimate the same quantity $\mathbb{E}[h(U) \mid y_t]$; they differ only in how efficiently they approximate it.

Training loss. Given the transport-based score estimator (4), we estimate the transport map $h$ by minimizing a single-level denoising score-matching objective. Specifically, for each training sample $y_0^i \sim p_{Y_0}$, we draw $z^i \sim \mathcal{N}(0, I_D)$ and construct the perturbed observation $y_t^i = \alpha_t y_0^i + \sigma_t z^i$. The per-sample loss is
\[
\ell_K(y_t, y_0; h) := \big\| \tilde{s}_{t,K}(y_t; h, \pi, \tilde{\pi}) - \nabla_{y_t} \log p(y_t \mid y_0) \big\|_2^2, \tag{5}
\]
where $p(y_t \mid y_0)$ is the normal density for $\mathcal{N}(\alpha_t y_0, \sigma_t^2 I_D)$. Given $(y_t^i, y_0^i)_{i=1}^n$, we then solve the empirical risk minimization problem
\[
L_{n,K}(h) := \frac{1}{n} \sum_{i=1}^n \ell_K\big(y_t^i, y_0^i; h\big), \qquad \hat{h}_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h), \tag{6}
\]
over a prescribed hypothesis class $\mathcal{H}$ (e.g., ReLU neural networks). When $\mathcal{H}$ is instantiated by a neural network family $\{h_\theta : \theta \in \Theta\}$, we equivalently optimize over parameters $\theta$ and obtain $\hat{\theta}_\lambda \in \arg\min_{\theta \in \Theta} L_{n,K}(h_\theta)$, with the learned transport defined as $\hat{h}_\lambda := h_{\hat{\theta}_\lambda}$. (We suppress the dependence on $\hat{\theta}_\lambda$ in the theory and write $\hat{h}_\lambda$ for the learned function.)
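The objective (5)-(6) can be exercised end to end on a toy problem. A minimal sketch, assuming a one-dimensional linear family $h_\theta(u) = \theta u$, data generated by $h^*(u) = 2u$, the VP schedule at $t = 0.25$, and a shared anchor bank with $\tilde{\pi} = \pi$; evaluating the empirical risk on a grid of $\theta$ stands in for the gradient-based optimization the paper would use:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_t, sigma_t = np.sqrt(0.75), np.sqrt(0.25)     # VP schedule at t = 0.25

def snis_score_1d(y_t, theta, anchors):
    """Score estimator (4) for the linear toy generator h_theta(u) = theta * u."""
    hU = theta * anchors
    log_w = -(y_t - alpha_t * hU) ** 2 / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())
    m = (w * hU).sum() / w.sum()
    return (alpha_t * m - y_t) / sigma_t ** 2

def empirical_risk(theta, y0, yt, anchors):
    """Objective (5)-(6): mean squared gap between the transport-based score
    and the conditional target -(y_t - alpha_t * y_0) / sigma_t^2."""
    total = 0.0
    for y0_i, yt_i in zip(y0, yt):
        target = -(yt_i - alpha_t * y0_i) / sigma_t ** 2
        total += (snis_score_1d(yt_i, theta, anchors) - target) ** 2
    return total / len(y0)

# Data from the true transport h*(u) = 2u; the risk should be smallest near theta = 2.
n, K = 4000, 256
u = rng.standard_normal(n)
y0 = 2.0 * u
yt = alpha_t * y0 + sigma_t * rng.standard_normal(n)
anchors = rng.standard_normal(K)                    # anchor bank U^(j) ~ pi
risks = {th: empirical_risk(th, y0, yt, anchors) for th in (0.5, 2.0, 6.0)}
```

Because the conditional target in (5) is shared across candidate $\theta$ on a fixed dataset, its noise largely cancels when risks are compared, so the well-specified $\theta = 2$ attains a visibly smaller empirical risk than badly misspecified alternatives.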
Here $\lambda := (t, d, K)$ collects the tuning parameters: $(t, d)$ govern the bias-variance trade-off of the estimator, while $K$ controls the accuracy of the Monte Carlo approximation used in $\ell_K$. The tuning parameter $\hat{\lambda} = (\hat{t}, \hat{d}, \hat{K})$ is selected by cross-validation: we choose $\lambda$ to minimize a validation generative criterion (e.g., an estimated Wasserstein distance) computed on an independent validation set, and we report the resulting generator $\hat{h}_{\hat{\lambda}}$.

At the population level, the conditional-score target is unbiased for the marginal score at time $t$ because $\mathbb{E}[\nabla_{Y_t} \log p(Y_t \mid Y_0) \mid Y_t] = \nabla_{Y_t} \log p_{Y_t}(Y_t)$. This identity motivates the denoising score-matching objective in (6); the theory in Section 4 makes the dependence on the finite-anchor approximation explicit.

Sample generation. Given the selected generator $\hat{h}_{\hat{\lambda}}$, we generate new samples by drawing $u \sim \pi$ and pushing it forward through the learned transport, $\tilde{y}_0 = \hat{h}_{\hat{\lambda}}(u)$, where $\hat{\lambda}$ is selected via cross-validation. Thus, $\hat{h}_{\hat{\lambda}}$ provides a one-pass sampler for the target distribution and, together with the anchor bank, defines the transport-based score estimator in (4).

Intrinsic density and likelihood evaluation. Assume $h : \mathbb{R}^d \to \mathbb{R}^D$ is $C^1$ and has rank $d$ almost everywhere, and let $\mathcal{M} := h(\mathcal{U})$ denote its image over a latent domain $\mathcal{U} \subset \mathbb{R}^d$ that contains the support of $\pi$. Assume further that $\pi$ admits a density on $\mathcal{U}$ with respect to Lebesgue measure. By the classical area formula, the pushforward measure $h_\# \pi$ is absolutely continuous with respect to the $d$-dimensional Hausdorff measure $\mathcal{H}^d$ restricted to $\mathcal{M}$, and for $\mathcal{H}^d$-a.e. $y_0 \in \mathcal{M}$,
\[
p_{\mathcal{M}}(y_0) = \sum_{u \in h^{-1}(\{y_0\})} \frac{\pi(u)}{|J_h(u)|}, \qquad \text{where } |J_h(u)| = \sqrt{\det\big(J_h(u)^\top J_h(u)\big)}. \tag{7}
\]
Here $p_{\mathcal{M}}$ is the intrinsic density of $Y_0 = h(U)$ with respect to $\mathcal{H}^d|_{\mathcal{M}}$, $J_h(u) \in \mathbb{R}^{D \times d}$ is the Jacobian, and $|J_h(u)|$ is the $d$-dimensional Jacobian determinant.
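A single branch of the sum in (7) contributes $\log \pi(u) - \tfrac{1}{2} \log \det(J_h(u)^\top J_h(u))$ in log form, which can be checked numerically. A minimal sketch, assuming the unit-circle generator $h(u) = (\cos u, \sin u)$ with $d = 1$, $D = 2$, where $J_h^\top J_h = 1$ and the intrinsic log-density therefore equals $\log \pi(u)$; the finite-difference Jacobian is purely illustrative:

```python
import numpy as np

def numerical_jacobian(h, u, eps=1e-6):
    """Central finite-difference Jacobian J_h(u) in R^{D x d}."""
    u = np.asarray(u, dtype=float)
    d, D = u.size, h(u).size
    J = np.zeros((D, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        J[:, k] = (h(u + e) - h(u - e)) / (2 * eps)
    return J

def branch_log_density(h, u, log_pi):
    """Branchwise term of (7): log pi(u) - 0.5 * log det(J_h(u)^T J_h(u))."""
    J = numerical_jacobian(h, u)
    _, logdet = np.linalg.slogdet(J.T @ J)
    return log_pi(u) - 0.5 * logdet

h_circle = lambda u: np.array([np.cos(u[0]), np.sin(u[0])])
log_pi_uniform = lambda u: -np.log(2 * np.pi)   # uniform base on [0, 2*pi)
val = branch_log_density(h_circle, np.array([0.7]), log_pi_uniform)
```

Since the circle chart has $\det(J_h^\top J_h) = \sin^2 u + \cos^2 u = 1$, the computed value agrees with $\log \pi(u) = -\log 2\pi$ up to finite-difference error.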
If $h$ is injective on $\mathcal{U}$ (e.g., a $C^1$ embedding), then $h^{-1}(\{y_0\})$ is a singleton for $\mathcal{H}^d$-a.e. $y_0 \in \mathcal{M}$, and (7) reduces to the familiar chart formula
\[
\log p_{\mathcal{M}}\big(h(u)\big) = \log \pi(u) - \tfrac{1}{2} \log \det\big(J_h(u)^\top J_h(u)\big). \tag{8}
\]
More generally, if $h$ has bounded multiplicity, the sum in (7) contains finitely many terms; evaluating $p_{\mathcal{M}}(y_0)$ requires identifying and summing all contributing preimages. Equation (8) gives the branchwise chart contribution associated with a specific preimage $u$. For generated samples $y_0 = h(u)$, this branchwise log-density is directly computable from $(u, h(u))$ and coincides with $\log p_{\mathcal{M}}(y_0)$ whenever the fiber is a singleton (in particular, under injectivity). If $h$ is not injective and additional preimages exist, (8) should be interpreted as a local contribution unless the remaining preimages are recovered and included in (7). For an observed point $y_0 \in \mathcal{M}$, evaluating $p_{\mathcal{M}}(y_0)$ requires identifying one or more latent preimages solving $h(u) = y_0$ (or approximately minimizing $\|h(u) - y_0\|_2$); this can be done via a separate encoder or numerical optimization when needed, but is not required for sampling.

Finally, note that an ambient-space density for $Y_0$ does not exist when its law is supported on a manifold; at any fixed smoothing level $t > 0$ with $\sigma_t > 0$, however, the corrupted law admits the mixture representation $p_{Y_t}(y) = \int \phi(y; \alpha_t h(u), \sigma_t^2 I_D)\, \pi(u)\, du$, which can be estimated using the same anchor bank employed for score approximation.

2.3 Comparisons with diffusion and flow models

This section evaluates MAGT against diffusion- and flow-based baselines across several practical dimensions, including sampling cost, support alignment, likelihood accessibility, computational footprint, and statistical guarantees.
As shown in Sections 2.1-2.2, the fixed-$t$ training scheme in MAGT yields an ambient-space mixture representation of $Y_t$ (so the smoothed density at level $t$ is Monte Carlo-estimable) and induces an intrinsic density on the learned manifold. This combination bridges flow-style density evaluation on the support with the computational efficiency of one-shot sampling. Table 1 highlights that MAGT combines one-shot sampling, support alignment to a thin manifold, and intrinsic densities on the learned support, together with a Monte Carlo route to smoothed ambient likelihoods at the training noise level.

Sampling cost. MAGT generates samples in a single forward evaluation of the transport map $h$, as in flow models. Diffusion models, by contrast, generate samples by numerically integrating a reverse-time stochastic differential equation (SDE) or ordinary differential equation (ODE) through sequential denoising steps, often requiring tens to thousands of neural network evaluations per sample, making them substantially slower without distillation (Lu et al., 2022; Karras et al., 2022). Moreover, because this model relies on time discretization, reducing the number of steps increases the discretization (solver) error, which vanishes only as the number of sequential denoising steps increases or higher-order solvers are used (Chen et al., 2023; Zheng et al., 2023). This makes MAGT attractive for interactive or streaming use without distillation.

Support alignment. MAGT trains at a fixed smoothing level $t$, concentrating probability mass near the data manifold and avoiding the boundary bias that arises from averaging across noise scales in diffusion. Its finite-mixture approximation further emphasizes anchors that best explain each observation while down-weighting off-manifold ones, improving boundary fidelity. This aspect is confirmed by the experiment in Section 6.
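The Monte Carlo route to the smoothed ambient likelihood mentioned above is a direct average of Gaussian kernels over the anchor bank. A minimal sketch, assuming the identity generator in one dimension, where $p_{Y_t} = \mathcal{N}(0, \alpha_t^2 + \sigma_t^2)$ is available in closed form for comparison; this toy choice is illustrative only:

```python
import numpy as np

def smoothed_density_mc(y, h, alpha_t, sigma_t, K, rng, d):
    """Anchor Monte Carlo estimate of the smoothed ambient density
    p_{Y_t}(y) = E_{U ~ pi}[ phi(y; alpha_t * h(U), sigma_t^2 I_D) ]."""
    U = rng.standard_normal((K, d))       # anchors from the base pi = N(0, I_d)
    hU = h(U)                             # (K, D)
    D = hU.shape[1]
    sq = np.sum((y - alpha_t * hU) ** 2, axis=1)
    log_phi = -0.5 * sq / sigma_t ** 2 - 0.5 * D * np.log(2 * np.pi * sigma_t ** 2)
    return np.exp(log_phi).mean()

# Identity generator with d = D = 1: Y_t ~ N(0, alpha_t^2 + sigma_t^2) exactly.
rng = np.random.default_rng(2)
alpha_t, sigma_t = np.sqrt(0.75), np.sqrt(0.25)   # VP schedule at t = 0.25
est = smoothed_density_mc(np.array([0.3]), lambda U: U, alpha_t, sigma_t, 200_000, rng, d=1)
exact = np.exp(-0.3 ** 2 / 2) / np.sqrt(2 * np.pi)  # N(0, 1) density, since variance = 1
```

With a couple of hundred thousand anchors the estimate agrees with the closed-form density to well under a percent here; in higher ambient dimension the same average should be accumulated in log space to avoid underflow.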
Table 1: Comparison of MAGT with diffusion models and normalizing flows. MAGT achieves strong manifold alignment and high fidelity with single-pass sampling, avoiding diffusion's long chains and flow invertibility. NFE denotes the number of function evaluations during sampling.

| | MAGT | Diffusion / Flow matching | Normalizing flows |
|---|---|---|---|
| Training | Matching loss at fixed $t$ | Time-averaged loss over $t$ | MLE via change of variables |
| Sampling cost | One forward pass (NFE = 1) | NFE steps (NFE $\gg$ 1) | One inverse pass |
| Support | Manifold via fixed-$t$ smoothing | Near-manifold leakage | No measure-zero manifolds |
| Architecture | Non-invertible; dimensions may differ | Unconstrained | Invertible; dimensions must match |
| Likelihood | Intrinsic density on manifold via area formula (exact for embeddings; otherwise requires summing preimages); smoothed ambient density via anchor MC | Unnormalized; no tractable likelihood | Exact (ambient) |
| Failures | $t$ mis-specification | Boundary bias; high cost | Invertibility bottleneck; manifold mismatch |

Likelihoods on the support and at fixed smoothing. MAGT induces an intrinsic density on its image manifold via the area formula (7)-(8). In particular, for generated samples $y_0 = h(u)$ one can evaluate $\log p_{\mathcal{M}}(y_0)$ in closed form from $(u, J_h(u))$ when $h$ is locally injective. An ambient-space likelihood for $Y_0$ is not defined when the target law is manifold-supported; however, at the fixed smoothing level $t > 0$, the corrupted law has density $p_{Y_t}(y) = \int \phi(y; \alpha_t h(u), \sigma_t^2 I_D)\, \pi(u)\, du$, which can be approximated by the same anchor bank used for score estimation.

Computational footprint and architectural freedom. Because $h$ need not be invertible and the latent and data dimensions may differ, MAGT avoids the $D \times D$ Jacobian log-determinants and the coupling or triangular constraints that are standard in invertible architectures, particularly normalizing flows.
This architectural freedom reduces training overhead and facilitates scaling to high-dimensional embeddings. Importantly, neither training nor inference requires taking the limit $t \to 0$; this contrasts with many diffusion-based objectives, where score magnitudes can diverge as the noise level vanishes and may destabilize optimization.

Statistical guarantees. Our non-asymptotic risk analysis depends on the intrinsic dimension $d$ and the geometric regularity of the manifold, rather than on the ambient dimension $D$. This clarifies why MAGT remains data-efficient when observations are high-dimensional but effectively low-dimensional in geometry.

3 Theory: excess risk and generation fidelity

This section establishes a finite-sample bound for the one-shot generation accuracy of $\hat{h}_\lambda(U)$ with $U \sim \pi$ independent, measured by the 2-Wasserstein error $W_2(P_Y, P_{\tilde{Y}})$ with $\tilde{Y} = \hat{h}_\lambda(U)$ estimated at a noise level $t \in (0,1)$. The analysis decomposes into two ingredients. First, Theorem 1 (a pull-back inequality) converts a single-level score mismatch between the smoothed laws at level $t$ into a bound on $W_2$ at $t = 0$. Second, Theorem 2 controls the fixed-$t$ score-matching excess risk of the empirical minimizer $\hat{h}_\lambda$ of (6) via bracketing entropy. This result is an adaptation to our setting of classical bracketing-entropy arguments for (generalized) $M$-estimators, as in Shen and Wong (1994). Combining these two ingredients yields the generation-fidelity bound in Theorem 3. To the best of our knowledge, both the pull-back inequality in Theorem 1 and the resulting generation-fidelity bound in Theorem 3 are new.

3.1 Setup and geometric assumptions

Let $\mathcal{U} \subset \mathbb{R}^d$ be a bounded latent domain and let $\pi$ denote a base density on $\mathcal{U}$ with respect to Lebesgue measure.
We consider a manifold-supported data distribution that is well specified by an (unknown) transport map $h^* : \mathcal{U} \to \mathbb{R}^D$ with sufficient smoothness to define a regular $d$-dimensional image manifold (precise regularity is stated in Assumption 1). Define the target manifold $\mathcal{M}^* := h^*(\mathrm{supp}\, \pi) \subset \mathbb{R}^D$, where $\mathrm{supp}$ denotes support. Throughout, $\|\cdot\|$ denotes the Euclidean norm and $\|A\|_{\mathrm{op}} := \sup_{\|x\|=1} \|Ax\|$ denotes the operator norm. We impose the following regularity and geometric conditions.

Definition 1 (Hölder class). For $s > 0$, a bounded set $\mathcal{U} \subset \mathbb{R}^d$, and a radius $B > 0$, we write $C^s(\mathcal{U}, B)$ for the (vector-valued) Hölder ball of order $s$: the set of functions $h : \mathcal{U} \to \mathbb{R}^D$ whose derivatives up to order $\lfloor s \rfloor$ exist and are bounded, and whose $\lfloor s \rfloor$-th derivative is $(s - \lfloor s \rfloor)$-Hölder with Hölder seminorm at most $B$. In particular, when $s \in (0, 1]$ this reduces to the condition $\|h(u) - h(v)\|_2 \le B \|u - v\|^s$ for all $u, v \in \mathcal{U}$, together with $\sup_{u \in \mathcal{U}} \|h(u)\|_2 \le B$.

Definition 2 (Reach and tubular neighborhood). For a closed set $\mathcal{M} \subset \mathbb{R}^D$, its reach $\mathrm{reach}(\mathcal{M}) \in [0, \infty]$ is the largest $r$ such that every point $x$ with $\mathrm{dist}(x, \mathcal{M}) < r$ has a unique nearest-point projection $\Pi_{\mathcal{M}}(x) \in \mathcal{M}$. Equivalently, the open tube $T_r(\mathcal{M}) := \{x \in \mathbb{R}^D : \mathrm{dist}(x, \mathcal{M}) < r\}$ admits a well-defined projection map $x \mapsto \Pi_{\mathcal{M}}(x)$.

Assumption 1 (Regular transport class). The true transport $h^*$ lies in the Hölder smoothness class $C^{\eta+1}(\mathcal{U}, B)$ over a bounded latent domain $\mathcal{U} \subset \mathbb{R}^d$. Set $\gamma := \min(1, \eta)$. There exist constants $0 < m \le M < \infty$ and $H_\gamma < \infty$, and a regular subset $\mathcal{H}_{\mathrm{reg}} \subseteq \mathcal{H}$ with $h^* \in \mathcal{H}_{\mathrm{reg}}$, such that the following hold for every $h \in \mathcal{H}_{\mathrm{reg}}$:

(i) (Full rank and conditioning) The Jacobian $J_h(u) \in \mathbb{R}^{D \times d}$ has rank $d$ for all $u \in \mathcal{U}$ and its singular values lie in $[m, M]$.
(ii) ($C^{1,\gamma}$ regularity) $J_h$ is $\gamma$-Hölder with constant $H_\gamma$, i.e., $\|J_h(u) - J_h(v)\|_{\mathrm{op}} \le H_\gamma \|u - v\|^\gamma$ for all $u, v \in \mathcal{U}$.

In the theoretical results below, we assume the learned estimator $\hat{h}_\lambda$ belongs to $\mathcal{H}_{\mathrm{reg}}$; see Remark 1 for discussion.

Assumption 2 (Positive reach). There exists a constant $\rho_{\mathcal{M}} > 0$ such that the image manifold $\mathcal{M}^* := h^*(\mathrm{supp}\, \pi)$ has reach at least $\rho_{\mathcal{M}}$. Moreover, every $h \in \mathcal{H}_{\mathrm{reg}}$ has an image manifold $h(\mathrm{supp}\, \pi)$ with reach at least $\rho_{\mathcal{M}}$.

Remark 1. Assumptions 1-2 impose uniform chart regularity (full-rank, well-conditioned Jacobians) and a positive reach in order to control tubular neighborhoods and justify the Hessian bounds that underlie the single-level pull-back analysis in Section 3. They are stated as uniform conditions over the restricted set $\mathcal{H}_{\mathrm{reg}}$, but they can be localized: both the smallest singular value of $J_h$ and the reach of $h(\mathrm{supp}\, \pi)$ are stable under sufficiently small $C^1$ perturbations of $h$ on the bounded domain $\mathcal{U}$. Consequently, it is enough for the estimator $\hat{h}_\lambda$ to lie in a $C^1$ neighborhood of $h^*$, which is consistent with the excess-risk control for large $n$. In practice, one can encourage membership in $\mathcal{H}_{\mathrm{reg}}$ via Jacobian-conditioning penalties (e.g., penalizing $\|J_h(u)^\top J_h(u) - I_d\|$ over sampled $u$), spectral normalization, and post hoc checks on a dense latent grid.

Definition 3 (Log-Sobolev constant). Let $\mu$ be a positive density with respect to Lebesgue measure. For $g \ge 0$ with $\int g\, d\mu < \infty$, define the entropy functional
\[
\mathrm{Ent}_\mu(g) := \int g \log\left(\frac{g}{\int g\, d\mu}\right) d\mu.
\]
We say that $\mu$ satisfies a logarithmic Sobolev inequality (LSI) with constant $C_{\mathrm{LSI}}(\mu) > 0$ if
\[
\mathrm{Ent}_\mu(f^2) \le 2\, C_{\mathrm{LSI}}(\mu) \int \|\nabla f\|_2^2\, d\mu
\]
for all smooth $f \not\equiv \mathrm{const}$; equivalently, the optimal constant is
\[
C_{\mathrm{LSI}}(\mu) := \sup_{f \not\equiv \mathrm{const}} \frac{\mathrm{Ent}_\mu(f^2)}{2 \int \|\nabla f(x)\|_2^2\, \mu(x)\, dx}.
\]

Assumption 3 (Smooth base density).
The base density $\pi$ is $C^2$ on $\mathcal{U}$ and its log-density has bounded Hessian: $\Lambda_2 := \sup_{u \in \mathcal{U}} \|\nabla_u^2 \log \pi(u)\|_{\mathrm{op}} < \infty$. Moreover, the latent density $\pi$ satisfies a log-Sobolev inequality with constant $C_{\mathrm{LSI}}(\pi) > 0$.

Under Assumptions 1-3, we assume, for simplicity, that both $U$ and $Y_0$ have bounded support. In fact, this assumption can be relaxed to a uniform tail-control condition (e.g., sub-Gaussian tails), with only minor modifications to the proof.

3.2 From a single-level score error to $W_2$ generation error

Consider the variance-preserving (VP) schedule $\sigma_t^2 = t$ and $\alpha_t = \sqrt{1-t}$ for a fixed $t \in (0,1)$ in (1). Define the smoothed variables
\[
Y_t := \alpha_t Y_0 + \sigma_t Z, \qquad \tilde{Y}_t := \alpha_t \tilde{Y}_0 + \sigma_t Z, \qquad Z \sim \mathcal{N}(0, I_D),
\]
where $Z$ is independent of $Y_0$ and of $\tilde{Y}_0 = \hat{h}_\lambda(U)$. Let $p_t$ denote the density of $Y_t$ and let $\tilde{p}_t$ denote the density of $\tilde{Y}_t$; both are smooth and everywhere positive for $t > 0$.

We measure the single-level mismatch by the squared error between the two denoisers (posterior means) under Gaussian corruption:
\[
\mathcal{E}_{\mathrm{MAG}}(t) := \mathbb{E}_{Y \sim p_t} \big\| m_{p,t}(Y) - m_{\tilde{p},t}(Y) \big\|_2^2, \tag{9}
\]
where $m_{p,t}(y) := \mathbb{E}[Y_0 \mid Y_t = y]$ and $m_{\tilde{p},t}(y) := \mathbb{E}[\tilde{Y}_0 \mid \tilde{Y}_t = y]$. For Gaussian corruption, Tweedie's formula gives
\[
m_{p,t}(y) = \frac{y + t\, \nabla \log p_t(y)}{\alpha_t}, \qquad m_{\tilde{p},t}(y) = \frac{y + t\, \nabla \log \tilde{p}_t(y)}{\alpha_t},
\]
and therefore
\[
\mathcal{E}_{\mathrm{MAG}}(t) = \frac{t^2}{1-t}\, \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p_t(Y) - \nabla \log \tilde{p}_t(Y) \big\|_2^2.
\]
In particular, the expectation on the right is the squared Fisher divergence $J(p_t \,\|\, \tilde{p}_t)$ (up to convention). Recall that for positive densities $q$ and $p$ on $\mathbb{R}^D$, the (relative) Fisher divergence is
\[
J(q \,\|\, p) := \int_{\mathbb{R}^D} \big\| \nabla \log q(y) - \nabla \log p(y) \big\|_2^2\, q(y)\, dy;
\]
see, e.g., Shen (1997). Theorem 1 shows that controlling $\mathcal{E}_{\mathrm{MAG}}(t)$ at a single noise level $t > 0$ controls the one-shot $W_2$ generation error.

Theorem 1 (Single-level pull-back bound).
Under Assumptions 1-3, suppose the VP noise level $t \in (0,1)$ lies in the tube regime
$$t \le t_{\max} := c_{tube}^2\, \rho_M^2, \qquad \theta_t := C^{(\gamma)}_N\, \frac{t^\gamma}{(1-t)^\gamma} < 1.$$
Then the one-shot generation error is controlled by the single-level mismatch:
$$W_2\big(P_{Y_0}, P_{\tilde Y_0}\big) \le C_{PB}(t)\, \sqrt{\mathcal{E}_{MAG}(t)}, \qquad (10)$$
where the pull-back constant is
$$C_{PB}(t) := \frac{\sqrt{1-t}}{t}\, \big( \Phi(t)\, \bar C_{LSI}(t) + \Psi(t) \big), \qquad \bar C_{LSI}(t) := \frac{(1-t) M^2 + t}{\min\{ C_{LSI}(\pi), 1 \}}.$$
Here
$$\Phi(t) := \frac{\exp\big(I_\gamma(t)\big)}{\sqrt{1-t}}, \qquad \Psi(t) := \frac{\Gamma(t)\, \exp\big(4 I_\gamma(t)\big)}{\sqrt{1-t}}, \qquad \Gamma(t) := -\log(1-t),$$
with
$$I_\gamma(t) := 2\gamma\, C^{(\gamma)}_T\, \frac{t^{\gamma/2}}{(1-t)^{1+\gamma/2}} + \frac{1}{\gamma}\, \frac{\big(C^{(\gamma)}_S\big)^2}{1-\theta_t}\, \frac{t^\gamma}{(1-t)^{1+\gamma}}.$$
The constants $C^{(\gamma)}_T$, $C^{(\gamma)}_S$, and $C^{(\gamma)}_N$ depend solely on the parameters $(m, M, H_\gamma, \Lambda_2, \rho_M)$. Moreover, for fixed problem constants and $t \le t_{\max}$, the dominant scaling is $C_{PB}(t) = O(t^{-1})$ as $t \downarrow 0$, and $C_{PB}(t) = O(1)$ if $t$ is bounded away from 0.

Theorem 1 provides a single-level bound on $W_2(P_{Y_0}, P_{\tilde Y_0})$ based on the score mismatch at a fixed noise level $0 < t < t_{\max}$, in contrast to diffusion analyses that integrate score errors over time. The result highlights a bias-stability trade-off in the choice of $t$: larger $t$ yields smoother densities and more stable score estimation but incurs greater smoothing bias, while smaller $t$ reduces bias at the cost of more concentrated posteriors and higher Monte Carlo variance. Practical strategies for mitigating Monte Carlo error at small $t$ are discussed in Section 4.

3.3 Learning the single-level score by empirical risk minimization

To connect Theorem 1 to the training objective, we make explicit the roles of (i) the finite-anchor approximation and (ii) the underlying population score-matching risk. Recall the finite-anchor loss $\ell_K$ in (5) and the empirical objective
$$L_{n,K}(h) := \frac{1}{n} \sum_{i=1}^n \ell_K\big(y_t^i, y_0^i; h\big), \qquad \hat h_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h),$$
as in (6).
Ideal (infinite-anchor) risk. For the statistical analysis, it is convenient to introduce the ideal (infinite-anchor) loss
$$\ell(y_t, y_0; h) := \big\| s_t(y_t; h) - \nabla_{y_t} \log p(y_t \mid y_0) \big\|_2^2, \qquad (11)$$
where $s_t(\cdot\,; h) := \nabla \log p^h_t(\cdot)$ is the score of the smoothed model induced by $h$, and $p^h_t$ denotes the density of $Y^h_t := \alpha_t h(U) + \sigma_t Z$ with $U \sim \pi$ and $Z \sim N(0, I_D)$. The corresponding population risk is $R(h) := \mathbb{E}\big[\ell(Y_t, Y_0; h)\big]$, where $(Y_t, Y_0)$ follow the data corruption model (1) with $Y_0 = h^*(U)$ and $U \sim \pi$. Define the excess population risk relative to the ground-truth map $h^*$ by
$$\rho^2(h^*, h) := R(h) - R(h^*) \ge 0, \qquad \rho(h^*, h) := \sqrt{\rho^2(h^*, h)}. \qquad (12)$$
If $h^* \notin \mathcal{H}$, the approximation error is $\inf_{h \in \mathcal{H}} \rho^2(h^*, h)$.

From score-matching risk to Fisher divergence and $\mathcal{E}_{MAG}(t)$. Let $p_t$ denote the smoothed data density of $Y_t = \alpha_t Y_0 + \sigma_t Z$ under $h^*$, and let $p^h_t$ be the smoothed density induced by a candidate transport $h$ as above. Write $s_t(\cdot\,; h) = \nabla \log p^h_t(\cdot)$ and $s_t(\cdot\,; h^*) = \nabla \log p_t(\cdot)$. The conditional score target used for training satisfies the unbiasedness identity
$$\mathbb{E}\big[ \nabla_{y_t} \log p(Y_t \mid Y_0) \,\big|\, Y_t \big] = \nabla \log p_t(Y_t),$$
which yields the orthogonal decomposition
$$R(h) = R(h^*) + \mathbb{E}_{Y \sim p_t} \big\| s_t(Y; h) - s_t(Y; h^*) \big\|_2^2.$$
Consequently, the excess risk is a Fisher-divergence-type score mismatch measured under the smoothed data law:
$$\rho^2(h^*, h) = \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p^h_t(Y) - \nabla \log p_t(Y) \big\|_2^2 = J\big( p_t \,\big\|\, p^h_t \big). \qquad (13)$$
Now specialize to the learned transport $\hat h_\lambda$ and denote $\tilde Y_0 := \hat h_\lambda(U)$ and $\tilde p_t := p^{\hat h_\lambda}_t$, so that $\tilde Y_t = \alpha_t \tilde Y_0 + \sigma_t Z$ has density $\tilde p_t$. Under the VP schedule $\sigma_t^2 = t$ and $\alpha_t = \sqrt{1-t}$, the denoiser mismatch in (9) satisfies
$$\mathcal{E}_{MAG}(t) = \frac{t^2}{1-t}\, \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p_t(Y) - \nabla \log \tilde p_t(Y) \big\|_2^2 = \frac{t^2}{1-t}\, \rho^2\big(h^*, \hat h_\lambda\big).$$
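Tweedie's formula used above can be verified numerically in a one-dimensional Gaussian example (a minimal illustration of ours, not part of the paper; function names are hypothetical). With $Y_0 \sim N(0,1)$ under the VP schedule, $Y_t \sim N(0,1)$, so $\nabla \log p_t(y) = -y$, and Tweedie's denoiser $(y + t\,\nabla\log p_t(y))/\alpha_t$ reduces to $\sqrt{1-t}\,y$, which coincides with the Gaussian conditional mean $\mathbb{E}[Y_0 \mid Y_t = y]$:

```python
import math

def tweedie_denoiser(y, t, score):
    # Tweedie's formula under the VP schedule sigma_t^2 = t, alpha_t = sqrt(1-t):
    # m(y) = (y + t * score(y)) / sqrt(1 - t)
    return (y + t * score(y)) / math.sqrt(1.0 - t)

def gaussian_posterior_mean(y, t):
    # Exact E[Y0 | Yt = y] for Y0 ~ N(0,1), Yt = sqrt(1-t) Y0 + sqrt(t) Z:
    # Cov(Y0, Yt) = sqrt(1-t) and Var(Yt) = 1, hence m(y) = sqrt(1-t) * y.
    return math.sqrt(1.0 - t) * y

t = 0.3
score = lambda y: -y  # score of Yt ~ N(0,1): d/dy log p_t(y) = -y
for y in [-2.0, -0.5, 0.0, 1.3, 2.7]:
    assert abs(tweedie_denoiser(y, t, score) - gaussian_posterior_mean(y, t)) < 1e-12
```

The two expressions agree exactly here because $(y - ty)/\sqrt{1-t} = \sqrt{1-t}\,y$; for non-Gaussian $Y_0$ the identity still holds but the score is no longer linear.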
Corollary 1 (Training-to-$W_2$ pipeline). Fix $t \in (0,1)$ in the tube regime of Theorem 1. Let $\hat h_\lambda$ be the learned transport and $\tilde Y_0 = \hat h_\lambda(U)$ its one-shot generator. Then
$$W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le \big( \Phi(t)\, \bar C_{LSI}(t) + \Psi(t) \big)\, \rho\big(h^*, \hat h_\lambda\big).$$

Bracketing entropy. Let $P$ denote the joint law of $(Y_t, Y_0)$ under (1). For a class of measurable functions $\mathcal{F}$ on the sample space and $u > 0$, let $N_B(u, \mathcal{F}, L_2(P))$ be the $u$-bracketing number in $L_2(P)$ and $H_B(u, \mathcal{F}) := \log N_B(u, \mathcal{F}, L_2(P))$ its bracketing entropy (Shen and Wong, 1994). We apply this with the excess-loss class
$$\mathcal{F} := \big\{ \ell(\cdot, \cdot\,; h) - \ell(\cdot, \cdot\,; h^*) : h \in \mathcal{H} \big\}.$$
In (15) below, the generic variable $x$ ranges over the sample space of $(Y_t, Y_0)$.

Theorem 2 bounds the excess risk of the empirical minimizer $\hat h_\lambda$ in terms of (i) the approximation error, (ii) a bracketing-entropy integral, and (iii) the additional perturbation introduced by the finite-anchor loss $\ell_K$.

Theorem 2 (Score-matching excess-risk bound). Fix any $k$ such that $0 < \frac{c_b}{4 c_v} \le k < 1$, with $c_v = 40\, \alpha_t^2 B^2 / \sigma_t^4$ and $c_b = 16\, \alpha_t^2 B^2 / \sigma_t^4$. Let $\hat h_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h)$ be the empirical minimizer defined above. Then, for any $\varepsilon > 0$ satisfying the entropy condition
$$\int_{k \varepsilon^2 / 16}^{4 c_v^{1/2} \varepsilon} H_B^{1/2}(u, \mathcal{F})\, du \le c_h\, n^{1/2} \varepsilon^2, \qquad (14)$$
and the lower bound
$$\varepsilon^2 \ge \max\Big\{ 4 \inf_{h \in \mathcal{H}} \rho^2(h^*, h),\ 8 \sup_{h,\, x} \big| \ell_K(y_t, y_0; h) - \ell(y_t, y_0; h) \big| \Big\}, \qquad (15)$$
with $c_h = k^{3/2} / 2^{11}$, we have the deviation bound
$$P\big( \rho(h^*, \hat h_\lambda) \ge \varepsilon \big) \le 4 \exp\big( - c_e\, n\, \varepsilon^2 \big), \qquad c_e := \frac{1 - k}{8 \big( 64 c_v + \tfrac{2 c_b}{3} \big)}.$$

3.4 Main result: finite-sample generation fidelity

Before stating the main result, we introduce several definitions.

Neural network class. Fix integers $d_{in}, d_{out} \ge 1$ and depth $L \ge 2$. A feedforward ReLU network $h : \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$ is defined by
$$x^{(0)} = x, \qquad x^{(\ell)} = \sigma\big( A_\ell x^{(\ell-1)} + b_\ell \big) \quad (\ell = 1, \dots, L-1), \qquad h(x) = A_L x^{(L-1)} + b_L,$$
where $\sigma(z) = \max\{z, 0\}$ is applied componentwise, $A_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$, and $b_\ell \in \mathbb{R}^{d_\ell}$, with $d_0 = d_{in}$ and $d_L = d_{out}$. The width is $\max_{0 \le \ell \le L-1} d_\ell$. We write $NN(d_{in}, d_{out}, L, W, S, B, E)$ for the class of such networks with maximum width at most $W$, at most $S$ nonzero parameters, entrywise parameter bound $E$, and output uniformly bounded by $B$ on the latent domain $\mathcal{U}$:
$$NN(d_{in}, d_{out}, L, W, S, B, E) := \Big\{ h : \max_{0 \le \ell \le L-1} d_\ell \le W,\ \sum_{\ell=1}^L \big( \|A_\ell\|_0 + \|b_\ell\|_0 \big) \le S,\ \max_\ell \big( \|A_\ell\|_{\max}, \|b_\ell\|_{\max} \big) \le E,\ \sup_{u \in \mathcal{U}} \|h(u)\|_\infty \le B \Big\}, \qquad (16)$$
where $\|M\|_{\max} := \max_{i,j} |M_{ij}|$ denotes the max-entry norm and $\|\cdot\|_0$ counts nonzero entries.

Assume that the finite-anchor approximation induces a uniform perturbation of the loss:
$$\sup_{h \in \mathcal{H},\, y_t,\, y_0} \big| \ell_K(y_t, y_0; h) - \ell(y_t, y_0; h) \big| \le \varepsilon(\tilde\pi, t, K),$$
for some deterministic function $\varepsilon(\tilde\pi, t, K)$ (see Section 4 for explicit bounds). With these definitions in place, we obtain the following generation-accuracy guarantee by combining the pull-back inequality (Theorem 1), the excess-risk bound (Theorem 2), and standard approximation and entropy estimates for ReLU networks.

Theorem 3 (MAGT's generation fidelity). Under Assumptions 1-3 and the estimator class setting $\mathcal{H} = NN(d, D, L, W, W^2 L, B, B)$ with $d \ge d^*$, there exist constants $c_1, c_2, c_3 > 0$ (depending only on $(m, M, H_\gamma, \Lambda_2, \rho_M)$ and the VP schedule, but not on $n, W, L, K$) such that
$$\mathbb{E}\, W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le C_{PB}(t) \bigg[ c_1 (WL)^{-\frac{2(\eta+1)}{d^*}} + c_2\, \sigma_t^{-(\eta+2)} \Big( \frac{(WL)^2 \log^5(WL)}{n} \Big)^{\frac{\eta+1}{2\eta}} + c_3\, \varepsilon(\tilde\pi, t, K) \bigg], \qquad (17)$$
where the expectation is over the training sample and any Monte Carlo randomness used to form the anchor-based score estimator.

Corollary 2 (Explicit $n$-rate).
Under the assumptions of Theorem 3, set
$$\kappa := \frac{d^*}{2(2\eta + d^*)}, \qquad r := \frac{\eta+1}{2\eta + d^*},$$
and choose $(W, L)$ so that $WL = \big\lceil (n / \log^5 n)^\kappa \big\rceil$. Then
$$\mathbb{E}\, W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le C_{PB}(t) \Big[ \big( c_1 + c_2\, \sigma_t^{-(\eta+2)} \big) \Big( \frac{n}{\log^5 n} \Big)^{-r} + c_3\, \varepsilon(\tilde\pi, t, K) \Big],$$
where $\varepsilon(\tilde\pi, t, K) \to 0$ as $K \to \infty$ for each fixed $t \in (0,1)$, with explicit $K$-dependent bounds in Section 4.

The exponent $\big( \frac{n}{\log^5 n} \big)^{-r} = \big( \frac{n}{\log^5 n} \big)^{-\frac{\eta+1}{2\eta+d^*}}$ in Corollary 2 matches the intrinsic-dimension minimax scaling established for Wasserstein-risk estimation of $\eta$-regular distributions on a $d^*$-dimensional manifold; see, e.g., Tang and Yang (2024) (up to polylogarithmic factors).

Generation error bounds. As summarized in Table 2, the results listed there that achieve an intrinsic-dimension Wasserstein rate scale as $n^{-(\eta+1)/(2\eta+d^*)}$. In particular, manifold-adaptive diffusion (Tang and Yang, 2024) attains this exponent (for $W_1$) under a boundaryless-manifold assumption, while MAGT attains the same intrinsic-dimension exponent for $W_2$ without requiring a no-boundary condition. By contrast, existing guarantees for ambient-space diffusion and flow-matching methods typically scale with the ambient dimension $D$, leading to slower rates when $d^* \ll D$. To the best of our knowledge, comparable nonasymptotic Wasserstein-risk guarantees for normalizing flows are not currently available.

A salient difference is that the analysis of manifold-adaptive diffusion (Tang and Yang, 2024) assumes the data manifold is without boundary. This excludes many practical settings in which the support has a boundary (e.g., manifolds embedded in a bounded region), including the six synthetic manifolds considered in Section 6.1. In contrast, our pull-back analysis does not rely on a boundaryless assumption, and empirically (Section 6.1) MAGT remains effective in precisely these boundary-affected regimes.

Neural network architecture.
Beyond the rate comparison in Table 2, Corollary 2 suggests a flexible architecture trade-off for MAGT: the rate is achieved by choosing width and depth so that the product $WL$ scales as $\lceil (n/\log^5 n)^\kappa \rceil$. This permits relatively deep architectures provided the width is adjusted accordingly. By contrast, the constructions in Oko et al. (2023) and Tang and Yang (2024) typically realize their rates by letting the width (and sparsity) grow rapidly with $n$, yielding substantially larger networks that can be less aligned with standard practical design choices.

4 Practical choices for Monte Carlo approximation

This section gives practical schemes for approximating the posterior mean that enters the transport-based score estimator, together with nonasymptotic bounds controlling the finite-anchor term $\varepsilon(\tilde\pi, t, K)$ in Theorem 3.

Relation to existing Monte Carlo/QMC theory. Lemmas 1-3 below rely on standard tools from self-normalized importance sampling (SNIS) and quasi-Monte Carlo (QMC): nonasymptotic SNIS error bounds and variance expansions (see, e.g., Owen (2013)), and the Koksma-Hlawka inequality together with classical discrepancy estimates for low-discrepancy point sets (see, e.g., Niederreiter (1992); Dick and Pillichshammer (2010)). We restate these bounds mainly to track how the constants depend on the smoothing level $\sigma_t$ and the intrinsic dimension $d$, and to make the finite-anchor term $\varepsilon(\tilde\pi, t, K)$ in Theorem 3 explicit. The proofs are given in Appendix D.

Table 2: Theoretical guarantees for score-based diffusion and flow models (rates up to a polylogarithmic factor of $n$ and constants).
Method; metric; rate (scaling in $n$); key assumptions; estimator / class:

- Diffusion (Oko et al., 2023). Metric: TV, $W_1$. Rate: $n^{-\eta/(2\eta+D)}$ (TV), $n^{-(\eta+1-\delta)/(2\eta+D)}$ ($W_1$). Assumptions: $\eta$-smooth density in $\mathbb{R}^D$ (Besov-type); boundary regularity. Estimator: ReLU score nets with $L_n = \Theta(\log n)$, $W_n, S_n = \tilde\Theta\big(n^{D/(2\eta+D)}\big)$.
- Manifold-adaptive diffusion (Tang and Yang, 2024). Metric: $W_1$. Rate: $n^{-(\eta+1)/(2\eta+d^*)}$. Assumptions: $\eta$-smooth density on a $d^*$-manifold; no boundary. Estimator: ReLU score nets with $L_n = \Theta(\log^4 n)$, $W_n, S_n = \tilde\Theta\big(n^{d/(2\eta+d^*)}\big)$.
- Lower-bound-free diffusion (Zhang et al., 2024). Metric: TV. Rate: $n^{-\eta/(2\eta+D)}$. Assumptions: sub-Gaussian data in $\mathbb{R}^D$; (optionally) $\eta$-Sobolev with $\eta \le 2$. Estimator: truncated KDE plug-in score.
- KDE-based flow matching (Kunkel and Trabs, 2025). Metric: $W_1$. Rate: $n^{-(\eta+1)/(2\eta+D)}$. Assumptions: $\eta$-smooth density (Besov); compact support in $\mathbb{R}^D$. Estimator: Lipschitz vector-field nets; sufficiently expressive class.
- MAGT (Theorem 3). Metric: $W_2$. Rate: $n^{-(\eta+1)/(2\eta+d^*)}$. Assumptions: $\eta$-smooth density on a $d^*$-manifold (boundary allowed). Estimator: anchor-based score estimator; flexible ReLU architecture.

Recall from (2)-(4) that the smoothed score at noise level $t$ depends on the posterior mean $m_t(y_t) := \mathbb{E}[h(U) \mid y_t]$. Any finite-anchor approximation produces $\tilde m_{t,K}(y_t)$ and the induced score estimate $\tilde s_{t,K}(y_t) = \sigma_t^{-2}\big( \alpha_t \tilde m_{t,K}(y_t) - y_t \big)$. Since
$$\tilde s_{t,K}(y_t) - \nabla_{y_t} \log p(y_t) = \frac{\alpha_t}{\sigma_t^2} \big( \tilde m_{t,K}(y_t) - m_t(y_t) \big), \qquad (18)$$
it suffices to control the approximation error of the posterior mean for each method below.

4.1 MAGT-MC

MAGT-MC is the baseline variant in which we approximate the posterior expectation in (2) using standard Monte Carlo anchors drawn i.i.d. from the base distribution. Equivalently, we choose the proposal in (3) to be $\tilde\pi(\cdot \mid y_t) \equiv \pi(\cdot)$, draw $u_1, \dots, u_K \sim \pi$, and compute $\tilde s_{t,K}(y_t; h, \pi, \pi)$ via (4).
In this case, the importance ratio cancels and the weights are proportional to the Gaussian likelihood terms $\phi\big(y_t; \alpha_t h(u_j), \sigma_t^2 I_D\big)$, yielding a simple finite-mixture approximation. This approach is embarrassingly parallel and requires no optimization at inference time. However, when the posterior $\pi_t(u \mid y_t)$ is much more concentrated than the base distribution $\pi$ (e.g., for small $\sigma_t$), most anchors receive negligible weight, and the effective sample size can collapse. The QMC and MAP variants below are designed to mitigate this variance in complementary ways.

Lemma 1 ($K$-approximation error). Consider the estimator with $\tilde\pi = \pi$, i.e., draw i.i.d. anchors $U^{(1)}, \dots, U^{(K)} \sim \pi$ (independent of $Y_t$) and form the transport-based score estimate $\tilde s_{t,K}(Y_t; h, \pi, \pi)$ via (4). Then for all sufficiently large $K$, there exists a constant $C > 0$, depending only on $(B, d, D)$ and the geometric constants in Assumptions 1-2, such that
$$\mathbb{E}\, \big\| \tilde s_{t,K}(y_t; h, \pi, \pi) - s_t(y_t; h) \big\|_2^2 \le \frac{C\, \alpha_t^2}{K\, \sigma_t^{d+4}}, \qquad (19)$$
where $s_t(\cdot\,; h) := \nabla_y \log p_t(y)$ is the ideal (infinite-anchor) mixture score at level $t$.

4.2 MAGT-QMC

MAGT-QMC replaces i.i.d. anchors with quasi-Monte Carlo point sets that cover the base distribution $\pi$ more evenly. Concretely, we use a low-discrepancy sequence (optionally randomized via scrambling) in $[0,1]^d$ and map it to the target base (e.g., via the inverse-CDF transform for factorized bases, or other standard transports). We then plug these anchors into (4) exactly as in MAGT-MC.

Because $\tilde m_{t,K}(y_t)$ is a ratio of two posterior expectations, reducing the integration error of each term can significantly improve score stability. In many smooth settings, QMC yields a faster empirical convergence rate than $K^{-1/2}$ and often provides a practical variance-reduction mechanism without changing the underlying model.
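The MAGT-MC estimator just described can be sketched in a few lines (an illustrative NumPy sketch of ours, not the paper's code; `magt_mc_score` and its arguments are hypothetical names, and the transport $h$ is passed in as a black box):

```python
import numpy as np

def magt_mc_score(y_t, h, t, K, d, rng):
    """Finite-anchor MAGT-MC score estimate at noise level t (VP schedule).

    Anchors are drawn i.i.d. from the standard Gaussian base; with the proposal
    equal to the base, the self-normalized weights reduce to the Gaussian
    likelihoods phi(y_t; alpha_t h(u_k), sigma_t^2 I).
    """
    alpha, sigma2 = np.sqrt(1.0 - t), t
    u = rng.standard_normal((K, d))          # anchors u_k ~ pi = N(0, I_d)
    outputs = h(u)                           # (K, D) transport outputs h(u_k)
    # Log-weights equal the Gaussian log-likelihoods up to a shared constant;
    # stabilize the softmax with a max-subtraction (log-sum-exp trick).
    logits = -np.sum((y_t - alpha * outputs) ** 2, axis=1) / (2.0 * sigma2)
    logits -= logits.max()
    w = np.exp(logits)
    w /= w.sum()                             # self-normalized weights
    m = w @ outputs                          # posterior-mean estimate m_{t,K}(y_t)
    return (alpha * m - y_t) / sigma2        # score estimate s_{t,K}(y_t)

# Sanity check: for the identity transport h(u) = u with d = D, the smoothed
# law is N(0, I), whose score is -y; the estimate approaches it for large K.
rng = np.random.default_rng(0)
h = lambda u: u
y = np.array([0.5, -0.2])
est = magt_mc_score(y, h, t=0.5, K=200000, d=2, rng=rng)
```

The sanity check exploits that $\alpha_t^2 + \sigma_t^2 = 1$ under the VP schedule, so the identity transport yields a standard Gaussian smoothed law whatever $t$ is; for small $t$ the weights concentrate on few anchors and the variance collapse discussed above becomes visible.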
Star discrepancy and Hardy-Krause variation. For a point set $P_K = \{z_1, \dots, z_K\} \subset [0,1]^d$, its star discrepancy is
$$D^*(P_K) := \sup_{x \in [0,1]^d} \bigg| \frac{1}{K} \sum_{j=1}^K \mathbf{1}\{ z_j \in [0, x) \} - \prod_{i=1}^d x_i \bigg|, \qquad (20)$$
where $x = (x_1, \dots, x_d)$ and $[0, x) := \prod_{i=1}^d [0, x_i)$ is an anchored axis-aligned box. We write $V_{HK}(g)$ for the Hardy-Krause variation of an integrand $g : [0,1]^d \to \mathbb{R}$; for smooth $g$ it can be bounded in terms of mixed partial derivatives. See standard QMC references for the formal definition and the Koksma-Hlawka inequality (Niederreiter, 1992; Dick and Pillichshammer, 2010).

Lemma 2 (MAGT-QMC score discrepancy bound). Fix $t \in (0,1)$ and an observation $y_t \in \mathbb{R}^D$. Assume there exists a measurable map $T : [0,1]^d \to \mathcal{U}$ such that $T(Z) \sim \pi$ when $Z \sim \mathrm{Unif}([0,1]^d)$. Let $P_K = \{z_1, \dots, z_K\} \subset [0,1]^d$ be a point set with star discrepancy $D^*(P_K)$, and define the QMC anchors $U^{(j)} := T(z_j)$. Form the QMC posterior-mean and score estimators
$$\tilde m^{QMC}_{t,K}(y_t) := \frac{\sum_{j=1}^K \phi\big( y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D \big)\, h(U^{(j)})}{\sum_{j=1}^K \phi\big( y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D \big)}, \qquad \tilde s^{QMC}_{t,K}(y_t) := \frac{1}{\sigma_t^2} \big( \alpha_t \tilde m^{QMC}_{t,K}(y_t) - y_t \big).$$
Define the integrands on $[0,1]^d$,
$$f_0(z) := \phi\big( y_t; \alpha_t h(T(z)), \sigma_t^2 I_D \big), \qquad f_{1,r}(z) := h_r(T(z))\, \phi\big( y_t; \alpha_t h(T(z)), \sigma_t^2 I_D \big), \quad r = 1, \dots, D,$$
and set $V_{HK}(f_0)$ for the Hardy-Krause variation of $f_0$ and $V_{HK}(f_1) := \sum_{r=1}^D V_{HK}(f_{1,r})$. Let $I_0 := \int_{[0,1]^d} f_0(z)\, dz$ and $I_1 := \int_{[0,1]^d} f_1(z)\, dz$. If $V_{HK}(f_0)\, D^*(P_K) \le I_0 / 2$, then
$$\big\| \tilde s^{QMC}_{t,K}(y_t) - \nabla_{y_t} \log p_t(y_t) \big\|_2 \le \frac{2 \alpha_t}{\sigma_t^2\, I_0} \big( V_{HK}(f_1) + \| m_t(y_t) \|_2\, V_{HK}(f_0) \big)\, D^*(P_K). \qquad (21)$$
In particular, for classical low-discrepancy constructions one has $D^*(P_K) = O\big(K^{-1} (\log K)^d\big)$, yielding an $O\big(K^{-1} (\log K)^d\big)$ deterministic integration rate whenever $V_{HK}(f_0)$ and $V_{HK}(f_1)$ are finite (Niederreiter, 1992; Dick and Pillichshammer, 2010).

4.3 MAGT-MAP

MAGT-MAP approximates the expectation using a data-dependent proposal constructed from a MAP-Laplace-Gauss-Newton approximation. This yields a proposal distribution concentrated around high-posterior-mass regions, thereby improving the efficiency of the expectation approximation. For a fixed noise level $t$ and observation $y_t \in \mathbb{R}^D$, the latent posterior induced by the base density $\pi$ and the map $h$ is
$$\pi_t(u \mid y_t) \propto \pi(u)\, \phi\big( y_t; \alpha_t h(u), \sigma_t^2 I_D \big). \qquad (22)$$
We compute a MAP estimate $\hat u \in \arg\max_u \pi_t(u \mid y_t)$, equivalently $\hat u \in \arg\min_u \Phi(u)$ for the negative log-posterior
$$\Phi(u) := \frac{1}{2 \sigma_t^2} \big\| y_t - \alpha_t h(u) \big\|_2^2 - \log \pi(u).$$
A Laplace approximation of (22) yields a Gaussian proposal of the form $q(u \mid y_t) = N(\hat u, \tilde\Sigma)$. In MAGT-MAP we use this data-dependent Gaussian as the proposal $\tilde\pi(\cdot \mid y_t) = q(\cdot \mid y_t)$ in the importance weights (3), which concentrates anchors near high posterior mass and typically improves the effective sample size when $\sigma_t$ is small.

Gauss-Newton Laplace proposal. Computing the exact Hessian $\nabla^2 \Phi(\hat u)$ can be expensive because it involves second derivatives of $h$. Instead, we use a Gauss-Newton approximation based on the Jacobian $J_h(\hat u)$. Concretely, we take
$$\hat\Lambda := I_d + \frac{\alpha_t^2}{\sigma_t^2}\, J_h(\hat u)^\top J_h(\hat u), \qquad \tilde\Sigma := \big( \zeta \hat\Lambda + \tau^2 I_d \big)^{-1},$$
where $I_d$ is the $d \times d$ identity matrix, $\zeta \ge 0$ is an optional inflation factor, and $\tau^2 \ge 0$ provides numerical damping.

Lemma 3 (MAGT-MAP self-normalized IS error). Fix $t \in (0,1)$ and $y_t \in \mathbb{R}^D$. Assume $\pi_t(\cdot \mid y_t) \ll q(\cdot \mid y_t)$ and $\|h(u)\|_2 \le B$ for all $u$.
Let $\tilde m_{t,K}(y_t)$ denote the self-normalized importance-sampling estimator of $m_t(y_t) = \mathbb{E}_{\pi_t(\cdot \mid y_t)}[h(U)]$ formed from i.i.d. samples $U^{(1)}, \dots, U^{(K)} \sim q(\cdot \mid y_t)$ and weights proportional to $\pi_t(u \mid y_t) / q(u \mid y_t)$ (equivalently, the unnormalized weights (3) with $\tilde\pi = q$). Define the order-2 divergence factor
$$D_2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big) := \mathbb{E}_{q(\cdot \mid y_t)} \bigg[ \Big( \frac{\pi_t(U \mid y_t)}{q(U \mid y_t)} \Big)^2 \bigg] = 1 + \chi^2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big).$$
Then
$$\mathbb{E}\Big[ \big\| \tilde s_{t,K}(y_t; h, \pi, q) - \nabla_{y_t} \log p_t(y_t) \big\|_2^2 \,\Big|\, y_t \Big] \le \frac{32\, \alpha_t^2 B^2}{K\, \sigma_t^4}\, D_2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big).$$

Thus MAGT-MAP achieves the usual $K^{-1/2}$ root-MSE rate, with a constant governed by how well the Laplace proposal matches the posterior (through $D_2$) (Owen, 2013). Standard Laplace-approximation bounds quantify when the Gaussian proposal $q(\cdot \mid y_t)$ is close to the true posterior: small local curvature mismatch and accurate covariance approximation (together with posterior concentration) yield a small divergence. Moreover, importance-sampling efficiency depends on how well $q(\cdot \mid y_t)$ matches the posterior, with weight dispersion governed by divergences such as $D_2(\pi_t \| q)$. Finally, even structured covariances (e.g., diagonal or low-rank) can work well in practice when they preserve the dominant directions of $\tilde\Sigma$, which keeps the covariance-mismatch contribution small.

5 Implementation

This section describes how we optimize the empirical objective $L_{n,K}$ in (6), especially when the anchor budget $K$ is large. Recall that $K$ enters the loss inside the score estimator $\tilde s_{t,K}$, through the self-normalized importance weights in (3). Consequently, increasing $K$ improves the Monte Carlo accuracy of each per-sample score estimate, but it also changes the computational profile of the stochastic gradient descent (SGD) optimization.
For an SGD minibatch $\{(y_0^{(b)}, y_t^{(b)})\}_{b=1}^B$ at a fixed time $t$ and a set of anchors $\{u_k\}_{k=1}^K$, define the center outputs $\tilde y_0^{(k)} = h_\theta(u_k) \in \mathbb{R}^D$. For the common choice $\tilde\pi \equiv \pi$, the unnormalized weights are proportional to Gaussian likelihoods, so the (normalized) soft assignment for item $b$ is
$$w_{b,k} := \frac{\exp(z_{b,k})}{\sum_{j=1}^K \exp(z_{b,j})}, \qquad z_{b,k} := -\frac{\big\| y_t^{(b)} - \alpha_t \tilde y_0^{(k)} \big\|_2^2}{2 \sigma_t^2}; \qquad b = 1, \dots, B;\ k = 1, \dots, K. \qquad (23)$$
The posterior mean is $m_b = \sum_{k=1}^K w_{b,k}\, \tilde y_0^{(k)}$, and the score estimator becomes $\tilde s_{t,K}(y_t^{(b)}) = \big(\alpha_t m_b - y_t^{(b)}\big) / \sigma_t^2$. Thus evaluating one SGD step for (6) requires (i) computing all logits $z_{b,k}$, $b = 1, \dots, B$, $k = 1, \dots, K$; (ii) forming a $K$-term softmax and weighted sum per batch item; and (iii) differentiating through these operations.

If we directly implement (6) with automatic differentiation, the computation graph contains all $BK$ interactions. This has two practical issues. First, the computational cost of the forward pass per step scales as $O(BKD)$ (distance evaluations and weighted sums). Second, naively backpropagating through the softmax and weighted sums requires storing $\{z_{b,k}, w_{b,k}, \tilde y_0^{(k)}\}$ (and intermediate activations through $h_\theta$), which can scale like $O(BK)$ plus the activations for $K$ anchor forward passes. When $K \gg B$, this is often the bottleneck. In addition, the softmax in (23) can be numerically unstable for large $K$, as many logits may lie far in the tail; in practice, we compute it with log-sum-exp stabilization, i.e., we subtract $\max_k z_{b,k}$ before exponentiating.

From the optimization standpoint, $K$ also controls the stochasticity of the gradient: small $K$ yields a noisier Monte Carlo approximation to the posterior mean and hence a higher-variance stochastic gradient, while larger $K$ reduces this variance but increases per-step cost.
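The per-minibatch computation in (23), with the log-sum-exp stabilization just mentioned, can be sketched in a few vectorized lines (a NumPy illustration under our own naming; the training code additionally differentiates through these operations):

```python
import numpy as np

def batched_score(y_t, centers, t):
    """Vectorized soft assignments (23) and score estimates for a minibatch.

    y_t:     (B, D) noisy minibatch items y_t^(b)
    centers: (K, D) anchor outputs h_theta(u_k)
    Returns the (B, D) score estimates s_{t,K}(y_t^(b)).
    """
    alpha, sigma2 = np.sqrt(1.0 - t), t
    # Logits z_{b,k} = -||y_t^(b) - alpha_t * center_k||^2 / (2 sigma_t^2): (B, K)
    diff = y_t[:, None, :] - alpha * centers[None, :, :]
    z = -np.sum(diff ** 2, axis=2) / (2.0 * sigma2)
    # Log-sum-exp stabilization: subtract the row-wise max before exponentiating
    z -= z.max(axis=1, keepdims=True)
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)        # (B, K) soft assignments w_{b,k}
    m = w @ centers                          # (B, D) posterior means m_b
    return (alpha * m - y_t) / sigma2        # (alpha_t m_b - y_t^(b)) / sigma_t^2

rng = np.random.default_rng(0)
centers = rng.standard_normal((64, 3))       # K = 64 anchor outputs in R^3
y = rng.standard_normal((8, 3))              # B = 8 minibatch items
s = batched_score(y, centers, t=0.4)
```

The intermediate `diff` tensor makes the $O(BKD)$ forward cost explicit; a memory-conscious implementation would instead expand the squared distance and compute the logits as a matrix product, and would process the anchors in chunks as discussed next.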
Our implementation therefore separates (a) the Monte Carlo noise induced by the finite anchor budget $K$ from (b) the usual minibatch noise induced by the finite batch size $B$.

Exact gradients with respect to anchor outputs plus chunked backprop. To keep the estimator in (4) unchanged while making SGD practical for large $K$, we use a two-stage gradient computation.

Stage 1 (no-grad forward; compute exact center gradients). We compute the center outputs $\tilde y_0^{(k)} = h_\theta(u_k)$ and the logits/weights (23) without storing the full autograd graph, where $h$ is parametrized by $h_\theta$ with model parameters $\theta$. Given the minibatch loss, we then compute the exact gradient of the loss with respect to each center output, $g_k := \partial L / \partial \tilde y_0^{(k)}$, using a closed-form expression obtained by differentiating through $m_b = \sum_k w_{b,k}\, \tilde y_0^{(k)}$ and the softmax Jacobian.

Stage 2 (chunked VJP through $h_\theta$). Once $\{g_k\}_{k=1}^K$ are computed, the parameter gradient factors through the Jacobian:
$$\nabla_\theta L = \sum_{k=1}^K \nabla_\theta h_\theta(u_k)^\top g_k = \nabla_\theta \sum_{k=1}^K \big\langle h_\theta(u_k), g_k \big\rangle, \qquad (24)$$
where $\nabla_\theta h_\theta(u_k) \in \mathbb{R}^{D \times |\theta|}$ denotes the Jacobian of $h_\theta(u_k)$ with respect to the parameter vector $\theta$. Equation (24) means we can backpropagate through $h_\theta$ by treating the $g_k$ as constants and processing anchors in chunks of size $K_c$. Peak memory then scales with $K_c$ rather than $K$, while the computed gradient is exact for the chosen anchors. Algorithm 1 summarizes the resulting update.

Algorithm 1: Best single-level transport map $\hat h_{\hat\lambda}$ given $(K, d)$

Require: training data $D_{train}$, validation data $D_{val}$, latent dimension $d$, anchor count $K$, evaluation metric $E(\cdot, \cdot)$, candidate times $T = \{t_1, \dots, t_L\}$, learning rate $r$
Ensure: trained transport $\hat h_{\hat\lambda}$
1: for $t \in T$ do
2:   for $e = 1, \dots, \text{max\_epoch}$ do
3:     for each minibatch $\{Y^{(i)}\}_{i=1}^B \subset D_{train}$ do
4:       Sample noises $\{Z^{(i)}\}_{i=1}^B \sim N(0, I)$; set $t_b \leftarrow t$
5:       $\alpha_b = \alpha(t_b)$, $\sigma_b = \sigma(t_b)$; $Y_t^{(b)} \leftarrow \alpha_b Y^{(b)} + \sigma_b Z^{(b)}$
6:       Parameter update: $\theta \leftarrow \textsc{Update}\big(\theta, r; \{(Y_t^{(b)}, t_b)\}_{b=1}^B, K, d, K_c, M\big)$
7:     end for
8:   end for
9:   Set $\hat h^{(t,d,K)} \leftarrow h_\theta$
10:  Sample anchors $\{U_j\}_{j=1}^{|D_{val}|} \sim \pi_U$; generate $\hat D^{(t)} \leftarrow \{\hat h^{(t,d,K)}(U_j)\}_{j=1}^{|D_{val}|}$
11:  Compute $d^{(t)} \leftarrow E\big(\hat D^{(t)}, D_{val}\big)$
12: end for
13: $\hat t \leftarrow \arg\min_{t \in T} d^{(t)}$
14: Set $\hat\lambda \leftarrow (\hat t, d, K)$
15: Set $\hat h_{\hat\lambda} \leftarrow \hat h^{(\hat t, d, K)}$
16: return $\hat h_{\hat\lambda}$
17: function Update($\theta, r; \{(Y_t^{(b)}, t_b)\}_{b=1}^B, K, d, K_c, M$)
18:   Sample latents $u_k \sim \pi$ for $k = 1, \dots, K$
19:   Compute centers without grad: $\tilde Y^{(k)} = h_\theta(u_k)$
20:   Phase 1 (no-grad forward and exact center gradients):
21:   for $b = 1$ to $B$ do
22:     $z_{b,k} \leftarrow -\|Y_t^{(b)} - \alpha_b \tilde Y^{(k)}\|^2 / (2 \sigma_b^2)$, $\quad w_{b,k} \leftarrow \mathrm{softmax}_k(z_{b,\cdot})$
23:     $m_b \leftarrow \sum_{k=1}^K w_{b,k}\, \tilde Y^{(k)}$, $\quad s_b \leftarrow (\alpha_b m_b - Y_t^{(b)}) / \sigma_b^2$
24:   end for
25:   Define the target score $T_b$; set $g_b \leftarrow s_b - T_b$, $\quad c_b \leftarrow (\alpha_b / \sigma_b^2)\, g_b$, $\quad \Delta_{b,k} \leftarrow \langle \tilde Y^{(k)} - m_b, c_b \rangle$
26:   For each $k$, set $g_k \leftarrow \sum_{b=1}^B \Big[ w_{b,k}\, c_b + w_{b,k}\, \Delta_{b,k} \Big( \frac{\alpha_b}{\sigma_b^2} Y_t^{(b)} - \frac{\alpha_b^2}{\sigma_b^2} \tilde Y^{(k)} \Big) \Big]$
27:   Phase 2 (chunked VJP; $g$ frozen):
28:   for chunks $\mathcal{K} \subset \{1, \dots, K\}$ of size $\le K_c$ do
29:     $S_{\mathcal{K}}(\theta) \leftarrow \sum_{k \in \mathcal{K}} \langle h_\theta(u_k), g_k \rangle$ (treat $g$ as constant)
30:     Backprop $\nabla_\theta S_{\mathcal{K}}(\theta)$ and accumulate
31:   end for
32:   Gradient step: $\theta \leftarrow \theta - r \sum_{\mathcal{K}} \nabla_\theta S_{\mathcal{K}}(\theta)$
33:   return $\theta$
34: end function

6 Experiments

This section evaluates whether MAGT can reconcile three objectives that are often in tension in generative modeling: (i) generation fidelity (matching the target distribution), (ii) sampling efficiency (fast inference-time generation), and (iii) manifold alignment (concentrating probability mass near the intrinsic low-dimensional support rather than leaking into the ambient space). We benchmark MAGT against diffusion-, ODE/transport-, and adversarial-based baselines on both controlled synthetic manifolds (where the ground-truth geometry is known) and real datasets (images and tabular/high-dimensional sequences).

6.1 Synthetic benchmarks: low-dimensional manifolds

We first evaluate on controlled synthetic distributions where both the ground-truth distribution and (for manifold datasets) the underlying manifold are known. Table 3 reports six representative benchmarks: four non-Gaussian distributions in $\mathbb{R}^2$ (rings2d, spiral2d, moons2d, checker2d) and two thin manifolds embedded in $\mathbb{R}^3$ (helix3d, torus3d). For the manifold datasets, we sample latent parameters from a uniform distribution, map them through a nonlinear embedding into $\mathbb{R}^3$, and add small i.i.d. Gaussian jitter; full simulation details are provided in Appendix A.

Data splits and evaluation protocol. Across all synthetic benchmarks, the training set $D_{train} = \{y_0^i\}_{i=1}^n$ contains $n = 10{,}000$ observations. We use an independent validation set $D_{val}$ of size 5,000 for hyperparameter selection and a held-out test set for computing $W_2$ and the off-manifold rate. Unless otherwise stated, each method generates 10,000 samples for evaluation.

MAGT configuration (synthetic).
For each dataset, we set the latent dimension $d$ equal to the known intrinsic dimension reported in Table 3 and use the standard Gaussian base $\pi_U = N(0, I_d)$. The transport $h_\theta : \mathbb{R}^d \to \mathbb{R}^D$ is a 5-hidden-layer MLP of width 512 (ReLU). We train MAGT with the MAGT-MC score estimator (proposal $\tilde\pi = \pi_U$) and tune the smoothing level $t$ by validation: we search $t \in \{0.1, 0.2, \dots, 0.9\}$ and select the value minimizing the fixed-$t$ score-matching loss on $D_{val}$. Unless otherwise noted, we report results at anchor budget $K = 1024$; Figure 2 further studies sensitivity to $K$ and $t$.

Diffusion baselines (synthetic). We train a time-conditioned score network with the same MLP backbone and a 32-dimensional time embedding, using denoising score matching over $t \in [t_{low}, t_{high}]$. Sampling uses: (i) DDIM (Song, Meng and Ermon, 2021) with 1000 steps and $\eta = 1.0$ on $t \in [0.05, 0.90]$ (NFE = 1000); and (ii) DPM-Solver++ (Lu et al., 2022) with either 20 (first-order) or 40 (second-order midpoint) function evaluations on the same interval.

Flow matching baseline (synthetic). We train a time-conditioned velocity field with the same MLP backbone and 32-dimensional time embedding. Sampling solves the learned ODE with a midpoint integrator and step size 0.05, yielding 20 velocity evaluations (NFE = 20) over $t \in [0.05, 0.90]$.

WGAN-GP baseline (synthetic). We train WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017) with both generator and critic implemented as 5-hidden-layer MLPs of width 512 (ReLU), matching the MAGT backbone capacity. The generator takes a $d$-dimensional latent input (matching the intrinsic dimension in this controlled setting), and we use five critic updates per generator update, gradient-penalty coefficient 10, and Adam with learning rate $10^{-4}$ and $(\beta_1, \beta_2) = (0.5, 0.999)$.

Iterative refinement using MAGT's score estimator.
We additionally evaluate MAGT-DDIM (M-DDIM), which uses the same learned transport $h_\theta$ and anchor-based score estimator $\tilde s_{t,K}$ as MAGT, but performs iterative DDIM-style refinement from $t_{high} = 0.90$ to $t_{low} = 0.05$ over 205 steps (NFE = 205; $\eta = 1.0$). To avoid repeatedly recomputing anchor outputs, we cache the anchor bank $\{h_\theta(u_k)\}_{k=1}^K$ with $u_k \sim N(0, I_d)$ once and reuse it at every refinement step. This baseline isolates whether MAGT's empirical gains come from (a) the single-level score objective and anchor posterior estimator, or (b) the one-shot amortized transport.

Fidelity, manifold alignment, and speed. Table 3 shows that MAGT outperforms all diffusion samplers on $W_2$ across all six manifolds, and improves over flow matching on five of six (with checker2d essentially tied in $W_2$), while using only a single transport evaluation at sampling time (NFE = 1). MAGT also exhibits the strongest manifold alignment: on the thin manifolds helix3d and torus3d it attains off-manifold rates of 0.0938 and 0.0341, compared with 0.1673/0.5557 for flow matching and 0.333-0.386/0.634-0.670 for diffusion samplers at practical step budgets. Finally, MAGT is the fastest high-fidelity method in this suite, generating 10,000 samples in about $10^{-3}$ seconds, while diffusion and ODE-based samplers require tens to thousands of network evaluations.

Interpretation. Two aspects of the experimental configuration are important for interpreting Table 3. First, only MAGT and the GAN baseline are dimension-mismatched one-shot generators: MAGT learns $h_\theta : \mathbb{R}^d \to \mathbb{R}^D$ with $d \ll D$ (matched to the intrinsic dimension), so samples lie on the $d$-dimensional image of $h_\theta$ by construction.
By contrast, diffusion and ODE baselines model ambient-space dynamics and must learn to contract probability mass in directions normal to the manifold, which can leave residual off-manifold scatter under finite capacity and finite solver steps. Second, MAGT is trained at a single smoothing level t (selected by validation), so modeling capacity is focused on matching the fixed-t score that appears in the pull-back bound (Theorem 1); iterative samplers repeatedly apply approximate scores/velocities across many steps, so approximation and discretization errors can accumulate. This also helps explain why M-DDIM need not outperform the one-shot MAGT sampler: it repeatedly queries an approximate score estimator (finite K) over 205 refinement steps.

Figure 1 shows synthetic samples generated by MAGT across six toy tasks. In all cases, MAGT concentrates probability mass tightly along the underlying data manifold, with only a few outliers near the boundary. Relative to diffusion (DDIM), MAGT produces visibly cleaner manifold support and achieves uniformly lower W_2 in Table 3 while incurring substantially lower sampling cost; relative to flow matching, MAGT is typically better and otherwise similar in fidelity, again at much lower sampling cost.

Table 3: Empirical W_2 (↓), off-manifold rate (fraction of samples with distance > 0.1; ↓), and wall-clock sampling time (seconds) to generate 10,000 samples (identical batching and hardware across methods). Parentheses report standard deviations across runs for W_2 and the off-manifold rate. NFE denotes the number of score/velocity network evaluations. Boldface indicates the best-performing method for each metric.
                    rings2d (d = 1)                   spiral2d (d = 1)                  moons2d (d = 2)
                    W2 (↓)            %out (↓)        W2 (↓)            %out (↓)        W2 (↓)            %out (↓)
MAGT (NFE=1)        0.0592 (0.0070)   0.1406 (0.0255) 0.0767 (0.0185)   0.4150 (0.1501) 0.0331 (0.0036)   0.0360 (0.0103)
  Time (s)          0.001                             0.001                             0.001
M-DDIM (NFE=205)    0.0613 (0.0100)   0.2014 (0.0211) 0.0836 (0.0054)   0.3740 (0.0927) 0.0733 (0.0219)   0.2199 (0.0709)
  Time (s)          0.615                             0.613                             0.668
DPM++ (1s, NFE=20)  0.1052 (0.0228)   0.6905 (0.0283) 0.1992 (0.0300)   0.7584 (0.0261) 0.1199 (0.0083)   0.1672 (0.0293)
  Time (s)          0.153                             0.132                             0.164
DPM++ (2m, NFE=40)  0.1057 (0.0239)   0.6922 (0.0188) 0.1976 (0.0290)   0.7644 (0.0259) 0.1222 (0.0115)   0.1614 (0.0200)
  Time (s)          0.300                             0.256                             0.327
DDIM (NFE=1000)     0.1107 (0.0172)   0.7057 (0.0188) 0.1494 (0.0292)   0.7767 (0.0254) 0.0920 (0.0311)   0.1843 (0.0146)
  Time (s)          3.190                             2.720                             3.392
WGAN (NFE=1)        0.2904 (0.1112)   0.7316 (0.0622) 0.3603 (0.0472)   0.8110 (0.0769) 0.4562 (0.1310)   0.9292 (0.0829)
  Time (s)          0.001                             0.001                             0.001
FM (NFE=20)         0.0712 (0.0026)   0.5838 (0.0142) 0.0954 (0.0030)   0.6422 (0.0162) 0.0470 (0.0191)   0.0556 (0.0168)
  Time (s)          0.091                             0.090                             0.091

                    checker2d (d = 2)                 helix3d (d = 1)                   torus3d (d = 2)
                    W2 (↓)            %out (↓)        W2 (↓)            %out (↓)        W2 (↓)            %out (↓)
MAGT (NFE=1)        0.0608 (0.0040)   0.0064 (0.0029) 0.0425 (0.0022)   0.0938 (0.0501) 0.0717 (0.0049)   0.0341 (0.0182)
  Time (s)          0.001                             0.001                             0.001
M-DDIM (NFE=205)    0.0640 (0.0063)   0.0088 (0.0040) 0.0536 (0.0086)   0.2812 (0.0740) 0.0922 (0.0123)   0.1208 (0.0223)
  Time (s)          0.615                             0.664                             0.616
DPM++ (1s, NFE=20)  0.0944 (0.0213)   0.0228 (0.0205) 0.1115 (0.0162)   0.3338 (0.0230) 0.1650 (0.0223)   0.6453 (0.0656)
  Time (s)          0.141                             0.183                             0.137
DPM++ (2m, NFE=40)  0.0914 (0.0216)   0.0257 (0.0195) 0.1056 (0.0128)   0.3356 (0.0275) 0.1640 (0.0221)   0.6344 (0.0588)
  Time (s)          0.261                             0.362                             0.276
DDIM (NFE=1000)     0.1108 (0.0276)   0.0860 (0.0407) 0.0883 (0.0162)   0.3861 (0.0300) 0.1603 (0.0269)   0.6699 (0.0658)
  Time (s)          2.760                             3.709                             2.934
WGAN (NFE=1)        0.2470 (0.0646)   0.3107 (0.3267) 0.2352 (0.1335)   0.8212 (0.1933) 0.4036 (0.0285)   0.7916 (0.0337)
  Time (s)          0.001                             0.001                             0.001
FM (NFE=20)         0.0592 (0.0015)   0.0147 (0.0047) 0.0533 (0.0072)   0.1673 (0.0294) 0.1130 (0.0074)   0.5557 (0.0268)
  Time (s)          0.090                             0.099                             0.092

Figure 1: Qualitative comparison of generative models on six synthetic manifolds. Each row corresponds to one toy dataset (rings2d, spiral2d, moons2d, checker2d, helix3d, torus3d). Columns show, from left to right, ground-truth samples, MAGT one-shot transport samples, diffusion-model samples generated with DDIM, and flow-matching samples.

Figure 2: Effect of sample size n, anchor count K, and smoothing level t on MAGT and MAGT–DDIM across six synthetic benchmarks. Curves report Wasserstein distance (W_2; lower is better). Consistent with the bias–variance trade-off in Section 3, increasing n and K improves fidelity, while intermediate noise levels provide the most stable performance.

Sensitivity to sample size, anchor count, and smoothing level for MAGT. Figure 2 examines the effect of the training sample size n, the number of mixture anchors K, and the smoothing level t. Consistent with the bias–variance trade-off described in Section 3, increasing either n or K reduces the W_2 error. Performance is most stable at intermediate noise levels: small t amplifies variance in the posterior score estimator, while large t oversmooths the underlying manifold geometry. Together, these results support fixed-t training as an effective bias–variance compromise, eliminating the need to integrate a full diffusion trajectory.

6.2 Real-data benchmarks

Datasets. We evaluate the proposed method on image and tabular/high-dimensional benchmarks. MNIST (LeCun et al., 1998) contains 28 × 28 grayscale handwritten digits (60,000 training, 10,000 test). CIFAR10-0 is the single-class subset of CIFAR-10 (Krizhevsky and Hinton, 2009) containing only class 0 (airplanes), with 5,000 training and 1,000 test images.
The single-class setting isolates a single semantic mode and yields a more concentrated distribution, providing a stress test for manifold-aligned generators.

For tabular data, we use Superconduct (Hamidieh, 2018), which contains 21,263 samples with D = 81 numeric features. We model the standardized feature vectors and evaluate the distributional discrepancy between generated samples and a held-out test split (20% of the data). We also consider Genomes from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). Following prior studies (Yelmen et al., 2023; Ahronoviz and Gronau, 2024), we focus on D = 10,000 biallelic SNPs on chromosome 6 (a 3 Mbp region including HLA genes), encoded as binary sequences. We use 4,004 genomes for training and 1,002 for testing (stratified by continental group).

Model architectures. We match capacity within each modality as closely as practical, subject to standard architectures for each baseline.
(1) Images (MNIST, CIFAR10-0): MAGT uses a 5-layer convolutional generator h_θ, and the GAN baseline adopts the same generator architecture. Diffusion and flow-matching baselines use a 2D U-Net (von Platen et al., 2022) to parameterize the score (diffusion) and velocity field (flow matching), which is substantially more intricate than our generator network.
(2) Superconduct: All methods use a 5-hidden-layer MLP of width 512 (ReLU). Diffusion and flow matching additionally take a 32-dimensional time embedding as input. The GAN generator takes a 64-dimensional latent vector.
(3) Genomes: Diffusion and MAGT use closely matched 1D U-Net backbones, following the architecture used in genomic diffusion (Kenneweg et al., 2025). In MAGT, we additionally include a linear projection that maps the low-dimensional latent input into the channel dimension expected by the U-Net.

Baselines and sampler settings.
We compare against DDIM (Song, Meng and Ermon, 2021) (diffusion) and flow matching (FM) (Lipman et al., 2022; Liu et al., 2022) as strong iterative baselines, as well as WGAN-GP as a one-shot baseline. We run DDIM with 200 denoising steps on the interval t ∈ [0.05, 0.90] with stochasticity parameter η = 1.0. For FM, we integrate the learned ODE with a midpoint solver using step size 0.05 (20 steps, NFE = 20) on the same time interval. For GANs, sampling is one generator forward pass (NFE = 1); we use a latent dimension that matches MAGT whenever applicable (MNIST: d = 80, CIFAR10-0: d = 128, Genomes: d = 128) and use a 64-dimensional latent for Superconduct, matching the strongest-performing MAGT setting (d = 64).

Table 4: MAGT configuration on real-data benchmarks. We report the base latent distribution π_U, the MAGT posterior-estimation variant, the anchor budget K in the score estimator, the candidate smoothing levels t, the latent dimension d, and the validation metric used for model selection.

Dataset        Variant   Base π                       K       t candidates          d candidates    Selection metric
MNIST          MC        N(0, I_64) × Bern(0.5)^16    4096    {0.1, 0.2, …, 0.9}    80              FID
CIFAR10-0      MAP       N(0, I_128)                  16384   {0.01, 0.1}           128             FID
Superconduct   MC        N(0, I_d)                    49152   {0.1, 0.2, …, 0.9}    {16, 32, 64}    W2
Genomes        MC        N(0, I_128)                  2048    {0.1, 0.2, …, 0.9}    128             W2

MAGT configuration. Table 4 reports the exact MAGT configurations used in our real-data experiments: the base latent distribution π, latent dimension d, anchor budget K in the fixed-t score estimator used during training, the candidate smoothing levels t, the posterior-estimation variant (MC or MAP), and the validation metric used for model selection. Unless otherwise noted, sampling always draws U ∼ π and outputs a sample h_θ(U) in one forward pass.
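The one-shot sampler and the anchor-based, self-normalized importance-sampling score estimator described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes the VP identities α_t² = 1 − t and σ_t² = t stated in the text, and `h_theta` is a hypothetical stand-in for the learned MLP transport, with the proposal taken to be the base π itself (the MAGT-MC variant).

```python
import numpy as np

rng = np.random.default_rng(0)

def h_theta(u):
    # Hypothetical stand-in for the learned transport h_theta : R^d -> R^D;
    # here a fixed smooth map onto a 1-d curve in R^2, for illustration only.
    return np.stack([np.cos(2 * np.pi * u[..., 0]),
                     np.sin(2 * np.pi * u[..., 0])], axis=-1)

d, D, K, t = 1, 2, 1024, 0.3
alpha_t, sigma2_t = np.sqrt(1.0 - t), t      # VP schedule: alpha_t^2 = 1 - t, sigma_t^2 = t

# One-shot sampling: draw U ~ pi and push through the transport (NFE = 1).
U = rng.standard_normal((10_000, d))
samples = h_theta(U)

# Anchor bank {h_theta(u_k)}, cached once as in the M-DDIM baseline.
anchors = h_theta(rng.standard_normal((K, d)))

def score_t(y):
    """Self-normalized importance-sampling estimate of the fixed-t score at y,
    via the posterior mean m(y) = E[alpha_t h(U) | Y_t = y]."""
    a = alpha_t * anchors                            # (K, D) smoothed anchor means
    logw = -np.sum((y - a) ** 2, axis=-1) / (2 * sigma2_t)
    w = np.exp(logw - logw.max())                    # stabilize before normalizing
    w /= w.sum()                                     # self-normalized weights
    m = w @ a                                        # posterior-mean estimate
    return (m - y) / sigma2_t                        # Tweedie-style score
```

For the MAP variant, the same estimator would simply be driven by anchors drawn from a Laplace approximation around a posterior mode instead of from the prior.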
For CIFAR10-0 we adopt MAGT-MAP to stabilize training: at small t the latent posterior becomes sharply concentrated, and the MAP–Laplace proposal yields a substantially higher effective sample size than prior sampling at the same anchor budget. We perform model selection using the validation metric reported in the last column of Table 4. The selected configurations are t = 0.3 for MNIST, t = 0.1 for CIFAR10-0, (t, d) = (0.8, 64) for Superconduct, and t = 0.3 for Genomes.

Quantitative results and interpretation. As summarized in Table 5, across all four real-data benchmarks, MAGT consistently surpasses both the diffusion baseline (DDIM) and the GAN baseline in sample fidelity. It delivers the best overall performance on MNIST, Superconduct, and Genomes, and ranks second on CIFAR10-0, narrowly trailing flow matching (FM). Importantly, MAGT requires only a single forward pass through the generator at sampling time (NFE = 1). This one-shot generation design yields orders-of-magnitude reductions in wall-clock latency relative to iterative diffusion and ODE-based approaches, while maintaining equal or superior fidelity.

On MNIST, MAGT slightly outperforms DDIM in FID (109.22 vs. 109.53) while reducing sampling time from 320.63 s to 0.25 s under our evaluation protocol. Relative to flow matching (FM), MAGT improves FID (109.22 vs. 115.00) while reducing sampling time from 30.18 s to 0.25 s. On CIFAR10-0, FM attains the best FID (61.91), while MAGT still substantially outperforms DDIM (89.09) and GAN (159.79) in fidelity at dramatically lower cost (0.09 s vs. 121.82 s). The GAN baseline is extremely fast in this single-class setting (0.002 s) but has substantially worse perceptual fidelity (FID 159.79).

Table 5: Fidelity (FID for image datasets; W_2 for tabular datasets) and wall-clock sampling time (in seconds) required to generate the evaluation batch under identical hardware and batching conditions. Lower values indicate better performance for both metrics. Boldface highlights the best fidelity result within each dataset. A dash ("–") denotes that no implementation is available for the corresponding example.

Dataset        Quantity    MAGT     DDIM     FM       GAN
MNIST          FID (↓)     109.22   109.53   115.00   109.23
               Time (s)    0.25     320.63   30.18    0.19
CIFAR10-0      FID (↓)     65.64    89.09    61.91    159.79
               Time (s)    0.09     22.97    121.82   0.002
Superconduct   W2 (↓)      0.0998   0.3982   0.1263   0.1443
               Time (s)    0.0020   0.0974   0.0750   0.0013
Genomes        W2 (↓)      0.1688   0.2993   –        0.4911
               Time (s)    1.63     980.34   –        0.11

On Superconduct, MAGT achieves the lowest distributional error in W_2 (0.0998), improving over FM (0.1263), GAN (0.1443), and the DDIM baseline (0.3982); this corresponds to a 74.9% reduction in W_2 relative to DDIM (and 21.0% relative to FM), a substantial gain in tabular fidelity. Sampling is essentially instantaneous (0.0020 s). On Genomes, MAGT again achieves the lowest W_2 (0.1688), with substantial margins over DDIM (0.2993) and GAN (0.4911); this is a 43.6% reduction in W_2 relative to DDIM (and 65.6% relative to GAN). These results indicate that the manifold-aligned transport remains effective even in very high ambient dimensions. Additional representative samples generated by MAGT are provided in Appendix A.

Empirical performance and underlying mechanisms. MAGT is not merely competitive: in our evaluations it outperforms diffusion baselines and GANs across all benchmarks, and it achieves especially strong reductions in off-manifold leakage on thin manifolds (Table 3). These empirical gains are driven by two structural features of the experimental configuration. First, MAGT employs a dimension-aligned transport map h_θ : R^d → R^D, so every generated sample lies in the d-dimensional image of h_θ by construction.
This built-in support constraint directly targets the leakage captured by the off-manifold rate. By contrast, flow- and diffusion-based methods must map R^D to R^D. In the manifold regime, these approaches must learn to concentrate probability mass into a lower-dimensional set by contracting in directions normal to the data manifold, a difficult task under finite capacity and finite solver steps that can leave residual off-manifold scatter.

Second, MAGT concentrates learning at a single smoothing level t and matches the smoothed score at that level. The pull-back inequality (Theorem 1) bounds the Wasserstein generation error by a fixed-level score discrepancy, with a geometry- and t-dependent prefactor. This perspective explains the empirical bias–variance trade-off in t (Figure 2): very small t produces highly anisotropic scores and high estimator variance, whereas very large t oversmooths geometric structure. Selecting an intermediate t via validation therefore focuses modeling capacity precisely where the bound is evaluated, avoiding the accumulation of approximation and discretization error across many time steps. Moreover, the anchor-based posterior estimator explicitly averages over latent pre-images consistent with a noisy observation, stabilizing score estimation in normal directions and improving support concentration relative to iterative diffusion/ODE baselines at comparable fidelity.

7 Discussion

MAGT shows that a single-level (fixed-t) training objective can recover a high-fidelity generator in the manifold regime while retaining one-shot sampling.
Crucially, by identifying a manifold-induced transport from a low-dimensional latent to the ambient space, MAGT can match and often surpass diffusion-baseline fidelity in practice (e.g., Tables 3 and 5) without learning or simulating a full reverse-time diffusion process: generation does not require integrating a long trajectory and therefore avoids the discretization error and stepwise stochastic sampling error that can accumulate in diffusion samplers.

Conceptually, the method decouples support fidelity from the need to simulate an entire reverse-time trajectory. By learning a transport map whose induced Gaussian smoothing admits an explicit posterior identity for the score, MAGT turns fixed-t denoising into a practical generative mechanism. In this view, the complexity of diffusion-style time-dependent models is replaced by a single, manifold-aligned transport together with a one-shot score evaluation; any remaining approximation stems primarily from the finite-anchor posterior approximation rather than from time discretization.

The smoothing level t plays a dual role. Larger t improves numerical stability of score estimation (the posterior over U is less concentrated) but increases smoothing bias and can blur fine-scale structure. Smaller t reduces bias but makes the latent posterior sharply peaked, which stresses finite-anchor approximations. Our experiments suggest that an intermediate t often provides the best bias–variance trade-off; developing principled, data-adaptive rules for selecting t (or using a small set of carefully chosen noise levels) is a promising direction.

The anchor approximation is the main computational lever in MAGT. Prior sampling is simple and parallelizable, QMC reduces variance for moderate intrinsic dimension, and MAP/Laplace proposals improve effective sample size when the posterior is highly concentrated.
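The effective-sample-size diagnostic referred to above can be made concrete with a short sketch. This is an illustrative computation (the weight values below are synthetic, not produced by a trained model): it evaluates the standard Kish effective sample size of self-normalized importance weights in log space, the quantity that collapses when the latent posterior is much sharper than the proposal.

```python
import numpy as np

def effective_sample_size(log_weights):
    """Kish effective sample size of self-normalized importance weights:
    ESS = (sum w)^2 / sum w^2, computed stably in log space."""
    lw = np.asarray(log_weights, dtype=float)
    lw = lw - lw.max()              # stabilize before exponentiating
    w = np.exp(lw)
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(0)
# Near-uniform log-weights: a broad posterior relative to the proposal.
flat = effective_sample_size(rng.normal(0.0, 0.1, size=1024))
# Widely dispersed log-weights: a sharply concentrated posterior (small t),
# where a few anchors dominate -- the regime where a MAP/Laplace proposal helps.
peaky = effective_sample_size(rng.normal(0.0, 8.0, size=1024))
```

Monitoring this quantity per training batch is one simple way to decide whether the anchor budget K or the proposal needs to be adjusted.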
A practical limitation is that very small t may require large K to maintain stable weights; further work could explore learned proposals, amortized MAP initializations, or hierarchical anchor banks. Finally, MAGT induces an intrinsic density on the learned image manifold (directly computable for generated samples) and, at the fixed smoothing level t > 0, an ambient density that can be approximated using the same anchor bank. Developing robust procedures for evaluating these quantities on arbitrary observed points (which may require approximate inversion) and for leveraging them for calibrated out-of-distribution scoring remains an open and practically important problem.

More broadly, MAGT suggests that explicitly dimension-mismatched, non-invertible transports can provide a lightweight alternative to diffusion-style modeling when data concentrate near thin supports. Rather than fitting a time-dependent score field and sampling it via a discretized reverse-time process, one can learn the manifold-induced transport and use a single-level posterior identity for generation. Potential extensions include conditional generation, higher-resolution image benchmarks, and hybrid samplers that use MAGT for initialization followed by a short refinement chain when maximum fidelity is required.

Appendix A  Experiment details and more results

A.1 Toy Datasets

Rings2d. We generate a two-dimensional mixture of concentric rings. First sample a discrete radius index k ∼ Unif{1, …, K} and set the ring radius r_k (e.g., equally spaced radii in a fixed interval). Then draw an angle θ ∼ Unif[0, 2π] and form the noiseless point x_1 = r_k cos θ, x_2 = r_k sin θ. Finally, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε. The noise level σ is set by the jitter parameter (default σ = 0.02).

Spiral2d.
We sample a latent parameter t ∼ Unif[t_min, t_max] and construct a planar spiral in polar form. Let the radius grow with t, e.g., r(t) = a + bt for constants a, b > 0, and set the angle to θ(t) = t. The noiseless point is

    x_1 = r(t) cos t,   x_2 = r(t) sin t.

We then add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, with σ controlled by jitter.

Moons2d. We generate the standard two-moons dataset consisting of two interleaving semicircles. Sample a label c ∈ {0, 1} uniformly and draw an angle θ ∼ Unif[0, π]. For the first moon (c = 0), set x_1 = cos θ, x_2 = sin θ. For the second moon (c = 1), we apply a shift to create the interleaving structure:

    x_1 = 1 − cos θ,   x_2 = 1 − sin θ − δ,

where δ > 0 controls the vertical separation (fixed throughout the experiments). As in the other settings, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, with σ set by jitter.

Checker2d. We generate a two-dimensional checkerboard distribution supported on alternating squares of a regular grid. Let m ∈ N denote the number of cells per axis and partition [−1, 1]² into an m × m grid with cell width w = 2/m. Sample integer indices (i, j) uniformly from {0, …, m−1}² subject to the parity constraint (i + j) mod 2 = 0 (i.e., only the "black" squares). Conditional on (i, j), draw a point uniformly within the selected cell:

    x_1 ∼ Unif[−1 + iw, −1 + (i+1)w],   x_2 ∼ Unif[−1 + jw, −1 + (j+1)w].

Finally, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, where σ is set by the jitter parameter (default σ = 0.02).

Helix3d. We sample t ∼ Unif[0, 2π] and define a three-dimensional helix:

    x_1 = cos(t),   x_2 = sin(t),   x_3 = t.

Noise ε ∼ N(0, σ²I_3) is added to obtain x = (x_1, x_2, x_3)^⊤ + ε.

Torus3d.
We sample two independent latent parameters (u, v) ∼ Unif[0, 2π]². Given a major radius R and minor radius r, the torus embedding in R³ is

    x_1 = (R + r cos v) cos u,   x_2 = (R + r cos v) sin u,   x_3 = r sin v.

Again, Gaussian jitter ε ∼ N(0, σ²I_3) is added. We use R = 2 and r = 1.

A.2 Examples of generated samples

In this subsection, we present additional qualitative results to supplement the main quantitative evaluations. We include representative sample grids for MNIST and CIFAR10-0 (airplanes) to visually compare the perceptual quality and diversity of unconditional generations produced by MAGT, diffusion sampling (DDIM), and flow matching. For the genomic experiment, we provide class-wise two-dimensional PCA visualizations of real test genomes and synthetic genomes generated by MAGT and a diffusion baseline. PCA is fit separately within each class and applied to both real and generated samples from that class, enabling an interpretable comparison of class-conditional structure in the high-dimensional SNP space.

Figure 3: Unconditional generation on MNIST, comparing samples from MAGT, DDIM, and flow matching (FM), alongside held-out real test images (left to right).

B  Proofs in Section 3

This appendix gathers the proofs and auxiliary technical lemmas underlying the results presented in Section 3 of the main text.

Proof of Theorem 1. We work with the VP probability flow with constant schedule β(s) ≡ 1. In the standard time parameter s ≥ 0, the closed-form coefficients are α(s) = exp(−s/2) and σ²(s) = 1 − exp(−s). Parameterizing by the noise level t := σ²(s) ∈ (0, 1) gives

    s = Γ(t) := −log(1 − t),   α_t = √(1 − t),   σ_t² = t.   (25)

Figure 4: Unconditional generation results on CIFAR10-0 (airplanes): (a) MAGT, (b) diffusion (DDIM), (c) flow matching (FM).
Figure 5: Class-wise PCA projections for five classes, comparing real genomic data with samples generated by MAGT and diffusion-based models.

Set T := Γ(t). Let p_s and p̃_s denote the VP-smoothed densities at time s of the data and generator, respectively, so that p_T = p_t and p̃_T = p̃_t. Apply Lemma 14 on [0, T] with the identification p_s ← p̃_s and q_s ← p_s. By symmetry of W_2,

    W_2(P_{Y_0}, P_{Ỹ_0}) = W_2(p_0, p̃_0) ≤ exp(∫_0^T L(s) ds) W_2(p̃_t, p_t) + exp(∫_0^T L(s) ds) ∫_0^T exp(∫_u^T (K − L)) du · √(J_T),

where (with this choice of roles)

    J_T := ∫_{R^D} ‖∇log p̃_t(y) − ∇log p_t(y)‖₂² p_t(y) dy = J(p_t ‖ p̃_t).

With t ≤ t_max = c_tube² ρ_M², we have σ_s² ≤ t for all s ∈ [0, T], hence σ_s ≤ c_tube ρ_M, so tube projections are well-defined along the flow. Lemma 15 applies to both p_s and p̃_s and yields ‖∇² log p_s‖_op ∨ ‖∇² log p̃_s‖_op ≤ L⋆(s) for all s ≤ T, with

    L⋆(s) := C_T^{(γ)} σ_s^{γ−2} α_s^γ + [(C_S^{(γ)})² / (1 − θ_t)] σ_s^{2γ−2} α_s^{2γ},   θ_t = C_N^{(γ)} t^γ (1 − t)^γ.

Moreover, for the VP flow v_s(x) = −(1/2)x − ∇log ρ_s(x), we have ‖∇v_s(x)‖_op ≤ 1/2 + ‖∇² log ρ_s(x)‖_op, hence Assumption 4(A1) holds with L(s) := 1/2 + L⋆(s). Finally, by the score-gap growth lemma (Lemma 16), Assumption 4(A2) holds with K(s) := 1/2 + 4L⋆(s). In particular, with this choice, K(s) − L(s) = 3L⋆(s).

By Lemma 11 with reference measure p̃_t,

    W_2(p̃_t, p_t) = W_2(p_t, p̃_t) ≤ (1 / C_LSI(p̃_t)) √(J(p_t ‖ p̃_t)) = C̄_LSI(t) √(J_T).

Moreover, by Lemma 10 applied to p̃_t,

    C̄_LSI(t) = 1 / C_LSI(p̃_t) ≤ (α_t² M² + σ_t²) / min{C_LSI(π), 1} = ((1 − t) M² + t) / min{C_LSI(π), 1}.

Change variables r = σ_s² = 1 − exp(−s), so that dr = (1 − r) ds and α_s² = 1 − r. Then

    σ_s^{γ−2} α_s^γ = r^{γ/2−1}(1 − r)^{γ/2},   σ_s^{2γ−2} α_s^{2γ} = r^{γ−1}(1 − r)^γ,   ds = dr / (1 − r).
Using (1 − r)^{−a} ≤ (1 − t)^{−a} for r ∈ [0, t] yields

    ∫_0^T L⋆(s) ds ≤ (2/γ) C_T^{(γ)} t^{γ/2} / (1 − t)^{1+γ/2} + (1/γ) [(C_S^{(γ)})² / (1 − θ_t)] t^γ / (1 − t)^{1+γ} =: I_γ(t).

Since ∫_0^T L(s) ds = T/2 + ∫_0^T L⋆(s) ds, we obtain the bound

    exp(∫_0^T L) ≤ exp(T/2) exp(I_γ(t)) = exp(I_γ(t)) / √(1 − t) =: Φ(t).

Also, using K − L = 3L⋆ and monotonicity of the integral,

    exp(∫_0^T L) ∫_0^T exp(∫_u^T (K − L)) du ≤ Φ(t) ∫_0^T exp(3 ∫_u^T L⋆(r) dr) du ≤ Φ(t) ∫_0^T exp(3 I_γ(t)) du = T exp(4 I_γ(t)) / √(1 − t) =: Ψ(t),

where T = Γ(t) = −log(1 − t). Under the VP schedule σ_t² = t and α_t² = 1 − t, Tweedie's formula gives

    m_{p,t}(y) = (y + t ∇log p_t(y)) / α_t,   m_{p̃,t}(y) = (y + t ∇log p̃_t(y)) / α_t,

hence

    ‖m_{p,t}(y) − m_{p̃,t}(y)‖₂² = (t² / (1 − t)) ‖∇log p_t(y) − ∇log p̃_t(y)‖₂².

Taking expectation under Y ∼ p_t shows

    E_MAG(t) = E_{Y∼p_t} ‖m_{p,t}(Y) − m_{p̃,t}(Y)‖₂² = (t² / (1 − t)) J_T,   so   √(J_T) = (√(1 − t) / t) √(E_MAG(t)).

Combining the pull-back inequality with the W_2–Fisher bound yields

    W_2(P_{Y_0}, P_{Ỹ_0}) ≤ (Φ(t) C̄_LSI(t) + Ψ(t)) √(J_T) = (Φ(t) C̄_LSI(t) + Ψ(t)) (√(1 − t) / t) √(E_MAG(t)).

Then we show that C_PB(t) = O(t^{−1}). We can rewrite

    C_PB(t) = (1/t) [ exp(I_γ(t)) C̄_LSI(t) + Γ(t) exp(4 I_γ(t)) ],   (26)

and

    I_γ(t) = A_1 t^{γ/2} / (1 − t)^{1+γ/2} + [A_2 / (1 − θ_t)] t^γ / (1 − t)^{1+γ},   θ_t = C_N^{(γ)} t^γ (1 − t)^γ.

Fix any t ≤ 1/2. Then (1 − t)^{−a} ≤ 2^a for every a ≥ 0, and θ_t ≤ 2^γ C_N^{(γ)} t^γ. In particular, for all sufficiently small t we have θ_t ≤ 1/2, hence (1 − θ_t)^{−1} ≤ 2. Therefore, for all sufficiently small t,

    I_γ(t) ≤ c_1 t^{γ/2} + c_2 t^γ ≤ C t^{γ/2},

for constants c_1, c_2, C depending only on the problem parameters. Hence I_γ(t) → 0 as t ↓ 0, and in particular exp(I_γ(t)) = 1 + o(1), exp(4 I_γ(t)) = 1 + o(1).
Since (1 − t) M² + t ≤ M² + 1, we have C̄_LSI(t) ≤ (M² + 1) / min{C_LSI(π), 1} =: C_0, so C̄_LSI(t) = O(1) as t ↓ 0. Also, the Taylor expansion gives Γ(t) = −log(1 − t) = t + O(t²), so in particular Γ(t) ≤ 2t for all sufficiently small t. Plugging these bounds into (26), for all sufficiently small t we obtain

    C_PB(t) ≤ (1/t) [ exp(I_γ(t)) C_0 + 2t exp(4 I_γ(t)) ] = C_0 exp(I_γ(t)) / t + 2 exp(4 I_γ(t)) = O(t^{−1}),

because exp(I_γ(t)) and exp(4 I_γ(t)) remain bounded and tend to 1 as t ↓ 0.

Proof of Theorem 2. Let X_i := (Y_t^i, Y_0^i), i = 1, …, n, be the i.i.d. sample with law P, and write P_n f := n^{−1} Σ_{i=1}^n f(X_i). Let h_# ∈ argmin_{h∈H} R(h) be a population risk minimizer over H (a best approximation to h*). Under (15) we have ρ²(h*, h_#) ≤ ε²/4 and, writing Δ_K := sup_{h,x} |ℓ_K(x; h) − ℓ(x; h)|, also Δ_K ≤ ε²/8. For l = 0, 1, … define the shells

    A_l := { h ∈ H : 2^l ε² ≤ ρ²(h*, h) < 2^{l+1} ε² }.

Since ρ(h*, ĥ_λ) ≥ ε implies ĥ_λ ∈ ∪_{l≥0} A_l,

    P(ρ(h*, ĥ_λ) ≥ ε) ≤ Σ_{l=0}^∞ P*(ĥ_λ ∈ A_l).

Because ĥ_λ ∈ argmin_{h∈H} P_n ℓ_K(·; h), we have P_n(ℓ_K(·; h_#) − ℓ_K(·; ĥ_λ)) ≥ 0. Hence, on the event {ĥ_λ ∈ A_l},

    sup_{h∈A_l} P_n(ℓ_K(·; h_#) − ℓ_K(·; h)) ≥ 0.

For any h ∈ A_l, using the definition of ρ² and the uniform error bound Δ_K,

    E[ℓ_K(·; h_#) − ℓ_K(·; h)] = E[ℓ(·; h_#) − ℓ(·; h)] + E[(ℓ_K − ℓ)(·; h_#)] − E[(ℓ_K − ℓ)(·; h)]
        ≤ −(R(h) − R(h_#)) + 2Δ_K
        = −(ρ²(h*, h) − ρ²(h*, h_#)) + 2Δ_K
        ≤ −(2^l − 1/4) ε² + 2Δ_K
        ≤ −(2^l − 1/2) ε².

Therefore,

    P*(ĥ_λ ∈ A_l) ≤ P*( sup_{h∈A_l} ν_n(ℓ_K(·; h_#) − ℓ_K(·; h)) ≥ √n (2^l − 1/2) ε² ),

where ν_n(f) := √n (P_n − P) f.
To apply Lemma 13 to each shell A_l, set

    M_l := √n (2^l − 1/2) ε²,   v_l² := 8 c_v 2^{l+1} ε².

By Lemma 6 and the triangle inequality, sup_{h∈A_l} Var(ℓ(·; h) − ℓ(·; h⋆)) ≤ v_l². Since ℓ_K − ℓ is uniformly bounded by Δ_K and Δ_K ≤ ε²/8, the same bound (up to an absolute numerical factor absorbed into c_v) holds for ℓ_K(·; h) − ℓ_K(·; h⋆). Moreover, the centered class satisfies Bernstein's condition with constant c_b by Lemma 7 (again unaffected by a uniformly bounded perturbation). With k ≥ c_b/(4c_v), the mean–variance condition (36) in Lemma 13 holds for (M_l, v_l²), and the entropy condition (14) implies (37) uniformly over l ≥ 0 (the least favorable case is l = 0). Thus Lemma 13 yields

    P*( sup_{h∈A_l} ν_n(ℓ_K(·; h⋆) − ℓ_K(·; h)) ≥ M_l ) ≤ 3 exp( −(1 − k) M_l² / (2 [4 v_l² + M_l c_b/(3√n)]) ) ≤ 3 exp( −(1 − k)(2^l − 1/2)² n ε² / ((64 c_v + 2c_b/3) 2^{l+1}) ).

Summing over l ≥ 0 gives

    P(ρ(h*, ĥ_λ) ≥ ε) ≤ 4 exp(−c_e n ε²),   c_e = (1 − k) / (8 (64 c_v + 2c_b/3)).

This completes the proof.

Proof of Theorem 3. The result follows by combining Theorems 1 and 2 with the approximation and estimation bounds in Theorems 4 and 5.

Proof of Corollary 2. Recall the definitions

    b := (η + 1)/(2η),   κ := b / (2(η + 1)/d* + 2b),   r := (2(η + 1)/d*) κ = (η + 1)/(2η + d*).

By the choice of W and L,

    (WL)^{−2(η+1)/d*} = ((n / log⁵ n)^κ)^{−2(η+1)/d*} = (n / log⁵ n)^{−r}.

Moreover, since κ ∈ (0, 1) for every d* ≥ 1 and η > 0, we have WL ≤ n for all n ≥ 3, hence log(WL) ≤ log n. Therefore,

    ((WL)² log⁵(WL) / n)^b ≤ ((WL)² log⁵ n / n)^b = ((n / log⁵ n)^{2κ} log⁵ n / n)^b = (n / log⁵ n)^{−b(1−2κ)}.

Finally, by the definition of κ,

    r = (2(η + 1)/d*) κ = (2(η + 1)/d*) · b / (2(η + 1)/d* + 2b) = b (1 − 2κ),

so the two WL-dependent terms decay at the same rate (n / log⁵ n)^{−r}, which proves the stated bound.
B.1 Approximation error

Theorem 4 (Approximation error). Let s_t(·; h) := ∇log p^h_{Y_t}(·) and s⋆_t(·) := s_t(·; h*) = ∇log p^{h*}_{Y_t}(·). Suppose h⋆ ∈ C^{η+1}(U, B) with bounded support U. Given K and H := NN(𝒲, ℒ, B) with 𝒲 = c_W W log W and ℒ = c_L L log L, we can bound the approximation error as

    inf_{h∈H} E ‖s̃_{t,K}(Y_t; h, π, π) − s⋆_t(Y_t)‖₂² ≲ (α_t² / σ_t⁴) (WL)^{−2(η+1)/d*} + ε(π̃, t, K).   (27)

Proof. Fix h ∈ H and consider the K-anchor estimator s̃_{t,K}(·; h, π, π) in (4). By (a + b)² ≤ 2a² + 2b²,

    E ‖s̃_{t,K}(Y_t; h, π, π) − s⋆_t(Y_t)‖₂² ≤ 2 E ‖s̃_{t,K}(Y_t; h, π, π) − s_t(Y_t; h)‖₂² + 2 E ‖s_t(Y_t; h) − s⋆_t(Y_t)‖₂².

The first term is the Monte Carlo approximation error and is bounded by the results in Section 4, yielding the ε(π̃, t, K) term. The second term is controlled by (i) the approximation rate of h* by networks in H when h* ∈ C^{η+1}(U, B) (Lemma 12) and (ii) the score perturbation bound in Lemma 5, which turns a uniform approximation error on h into an L² error on the induced score. Taking the infimum over h ∈ H yields the claimed bound.

Lemma 4 (Fréchet derivative of the score w.r.t. h). Fix a noise level t ∈ (0, 1) with α_t > 0 and σ_t > 0. Let (U, A, π) be a probability space and let h : U → R^D be measurable with E_π ‖h(U)‖ < ∞. Set a(u) := α_t h(u) and define the smoothed density

    p^h_{Y_t}(x) = ∫ φ_{σ_t}(x − a(u)) π(du),   φ_{σ_t}(z) = (2πσ_t²)^{−D/2} exp(−‖z‖² / (2σ_t²)).

Define the induced score at level t by s_t(x; h) := ∇_x log p^h_{Y_t}(x). Let

    r_h(u | x) := φ_{σ_t}(x − a(u)) / ∫ φ_{σ_t}(x − a(v)) π(dv),   m_h(x) := ∫ a(u) r_h(u | x) π(du) = E[a(U) | Y_t = x],

where Y_t = α_t h(U) + σ_t Z with U ∼ π and Z ∼ N(0, I_D).
Then for any direction $\delta h \in L^\infty(\pi; \mathbb{R}^D)$ the Fréchet derivative exists and
\[
D_h s_t(x;h)[\delta h]
= \frac{\alpha_t}{\sigma_t^2} \int r_h(u \mid x)
\Big[ I_D + \frac{(a(u) - m_h(x))(x - a(u))^\top}{\sigma_t^2} \Big]
\delta h(u)\, \pi(du). \tag{28}
\]

Proof. Differentiate $p^h_{Y_t}(x) = \int \phi_{\sigma_t}(x - a(u))\,\pi(du)$ in $x$:
\[
\nabla_x \log p^h_{Y_t}(x)
= \frac{\int (a(u) - x)\, \phi_{\sigma_t}(x - a(u))\, \pi(du)}{\sigma_t^2 \int \phi_{\sigma_t}(x - a(u))\, \pi(du)}
= \frac{m_h(x) - x}{\sigma_t^2},
\]
so it suffices to differentiate $m_h(x)$ with respect to $h$. Consider the numerator and denominator
\[
N(x) := \int a(u)\, \phi_{\sigma_t}(x - a(u))\, \pi(du),
\qquad
D(x) := \int \phi_{\sigma_t}(x - a(u))\, \pi(du) = p^h_{Y_t}(x),
\]
so $m_h(x) = N(x)/D(x)$. For a perturbation $\delta h$, write $\delta a = \alpha_t\, \delta h$ and apply the quotient rule:
\[
\delta m_h(x) = \frac{\delta N(x)}{D(x)} - \frac{N(x)}{D(x)^2}\, \delta D(x).
\]
A direct differentiation of the Gaussian factor yields
\[
\delta D(x) = \int \phi_{\sigma_t}(x - a(u))\, \frac{(x - a(u))^\top}{\sigma_t^2}\, \delta a(u)\, \pi(du),
\]
and similarly
\[
\delta N(x) = \int \Big[ I_D + \frac{a(u)(x - a(u))^\top}{\sigma_t^2} \Big] \phi_{\sigma_t}(x - a(u))\, \delta a(u)\, \pi(du).
\]
Combining the last three displays, using $r_h(u \mid x) = \phi_{\sigma_t}(x - a(u))/D(x)$ and $m_h(x) = N(x)/D(x)$, we obtain
\[
\delta m_h(x) = \int r_h(u \mid x)
\Big[ I_D + \frac{(a(u) - m_h(x))(x - a(u))^\top}{\sigma_t^2} \Big]
\delta a(u)\, \pi(du).
\]
Since $s_t(x;h) = (m_h(x) - x)/\sigma_t^2$ and $\delta a = \alpha_t\, \delta h$, this gives (28). Dominated convergence (justified by bounded $\delta h$ and Gaussian envelopes) allows interchanging differentiation and integration.

Lemma 5 (Score perturbation bound under $\|h - h^*\|_\infty$). Let $h, h^*: \mathcal{U} \to \mathbb{R}^D$ and fix $t$ with $\alpha_t > 0$ and $\sigma_t > 0$. Assume $\|h - h^*\|_\infty \le \varepsilon$ (i.e., $\sup_u \|h(u) - h^*(u)\| \le \varepsilon$). Then
\[
\big\| s_t(\cdot;h) - s_t(\cdot;h^*) \big\|_{L^2(p^{h^*}_{Y_t})}
\le C_D\, \frac{\alpha_t}{\sigma_t^2}\, \varepsilon,
\qquad
C_D := \sqrt{2 + 2D(D+2)}. \tag{29}
\]

Proof. Define the interpolation $h_\upsilon := h^* + \upsilon(h - h^*)$, $0 \le \upsilon \le 1$, and write $p_\upsilon := p^{h_\upsilon}_{Y_t}$.
By the fundamental theorem of calculus and Lemma 4,
\[
s_t(\cdot;h) - s_t(\cdot;h^*) = \int_0^1 D_h s_t(\cdot;h_\upsilon)[h - h^*]\, d\upsilon .
\]
By Minkowski and Jensen, for any $\upsilon \in [0,1]$,
\[
\| s_t(\cdot;h) - s_t(\cdot;h^*) \|_{L^2(p_\upsilon)}
\le \int_0^1 \| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)}\, d\mu
\le \sup_{\mu \in [0,1]} \| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)},
\]
where $\delta h := h - h^*$ and $\|\delta h\|_\infty \le \varepsilon$. Using (28) pointwise in $x$, Jensen (for the posterior average), and $\|\delta h\|_\infty \le \varepsilon$,
\[
\big\| D_h s_t(x;h_\mu)[\delta h] \big\|
\le \frac{\alpha_t}{\sigma_t^2}\, \varepsilon\,
\Big\| I_D + \frac{(a_\mu(u) - m_\mu(x))(x - a_\mu(u))^\top}{\sigma_t^2} \Big\|_{\mathrm{op},\, r_\mu(\cdot \mid x)},
\]
where $a_\mu(u) := \alpha_t h_\mu(u)$, $r_\mu(\cdot \mid x) := r_{h_\mu}(\cdot \mid x)$, and $m_\mu(x) := m_{h_\mu}(x)$. Here $\|\cdot\|_{\mathrm{op},r}$ denotes the square root of the $r(\cdot \mid x)$-average of the squared operator norm. Writing $M_\mu(u,x)$ for the bracketed matrix, bounding $(\alpha+\beta)^2 \le 2(\alpha^2+\beta^2)$, and using that the operator norm of a rank-one matrix is the product of the vector norms,
\[
\| M_\mu(\cdot,x) \|^2_{\mathrm{op},r}
\le 2\Big( 1 + \frac{1}{\sigma_t^4}\, E\big[ \|a_\mu(U) - m_\mu(X)\|^2 \|X - a_\mu(U)\|^2 \,\big|\, X = x \big] \Big).
\]
Let $A_\mu(x) := E[\|a_\mu - m_\mu\|^2 \mid X = x] = \operatorname{tr} \operatorname{Cov}(a_\mu \mid x)$ and $B_\mu(x) := E[\|X - a_\mu\|^2 \mid X = x]$. Note that $A_\mu(x) \le B_\mu(x)$ because $B_\mu(x) = \|x - m_\mu(x)\|^2 + A_\mu(x)$. Hence
\[
E_{p_\upsilon} \| M_\mu(\cdot,x) \|^2_{\mathrm{op},r}
\le 2\Big( 1 + \frac{1}{\sigma_t^4}\, E_{p_\upsilon}\big[ B_\mu(X)^2 \big] \Big).
\]
By Jensen, $B_\mu(X)^2 = \big( E[\|X - a_\mu\|^2 \mid X] \big)^2 \le E[\|X - a_\mu\|^4 \mid X]$, thus $E_{p_\upsilon}[B_\mu(X)^2] \le E_{p_\upsilon} \|X - a_\mu(U)\|^4$. But conditionally on $U$, $X - a_\mu(U) = \sigma_t Z$ with $Z \sim N(0, I_D)$, so
\[
E \|X - a_\mu(U)\|^4 = \sigma_t^4\, E\|Z\|^4 = \sigma_t^4\, D(D+2).
\]
Combining the displays and taking square roots,
\[
\| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)}
\le \frac{\alpha_t}{\sigma_t^2}\, \varepsilon\, \sqrt{2 + 2D(D+2)} .
\]
Since this bound is uniform in $\mu$ and $\upsilon$, taking $\upsilon = 0$ gives (29).

B.2 Estimation error

Lemma 6 (Variance--mean). Recall that the score model satisfies
\[
s_t(y_t;h) = \frac{\alpha_t m_h(y_t) - y_t}{\sigma_t^2},
\qquad
m_h(y_t) := E[\,h(U) \mid Y_t = y_t\,].
\]
Assume $\|h^*\|_\infty \le B$ and $\sup_{h \in \mathcal{H}} \|h\|_\infty \le B$, and define the excess risk $\rho^2(h^*, h) = E\big[ \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big]$. Then for all sufficiently small $\varepsilon > 0$,
\[
\sup_{\{\rho(h^*,h) \le \varepsilon :\, h \in \mathcal{H}\}}
\operatorname{Var}\big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big) \le c_v \varepsilon^2,
\qquad \text{with } c_v = \frac{40\, \alpha_t^2 B^2}{\sigma_t^4}.
\]

Proof. By the Gaussian conditional score, $\nabla_{y_t} \log p(y_t \mid y_0) = (\alpha_t y_0 - y_t)/\sigma_t^2$, hence
\[
s_t(Y_t;h) - \nabla_{Y_t} \log p(Y_t \mid Y_0) = \frac{\alpha_t}{\sigma_t^2}\big( m_h(Y_t) - Y_0 \big),
\]
and therefore $\ell_t(Y_t, Y_0; h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2$. Let $m_h := m_h(Y_t)$ and $m_0 := m_{h^*}(Y_t)$ and set $\Delta := m_h - m_0$. Then
\[
\ell_t(\cdot;h) - \ell_t(\cdot;h^*)
= \frac{\alpha_t^2}{\sigma_t^4}\big( \|\Delta\|_2^2 + 2\langle \Delta,\, m_0 - Y_0 \rangle \big).
\]
Since $m_0(Y_t) = E[Y_0 \mid Y_t]$, we have $E[m_0 - Y_0 \mid Y_t] = 0$, hence
\[
\rho^2(h^*, h) = E\big[ \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big] = \frac{\alpha_t^2}{\sigma_t^4}\, E\|\Delta\|_2^2 .
\]
By the sup-norm bound on the transport class, $\|m_h\| \le B$, $\|m_0\| \le B$, and $\|Y_0\| \le B$ almost surely, so $\|\Delta\| \le 2B$ and $\|m_0 - Y_0\| \le 2B$. Using $(a+b)^2 \le 2a^2 + 2b^2$ and Cauchy--Schwarz,
\[
E\Big[ \big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big)^2 \Big]
\le \frac{\alpha_t^4}{\sigma_t^8}\, E\Big[ 2\|\Delta\|_2^4 + 8\|\Delta\|_2^2 \|m_0 - Y_0\|_2^2 \Big]
\le \frac{\alpha_t^4}{\sigma_t^8}\big( 2(2B)^2 + 8(2B)^2 \big) E\|\Delta\|_2^2
= \frac{40\, \alpha_t^4 B^2}{\sigma_t^8}\, E\|\Delta\|_2^2 .
\]
Since $\operatorname{Var}(Z) \le E[Z^2]$, combining with $E\|\Delta\|_2^2 = (\sigma_t^4/\alpha_t^2)\, \rho^2(h^*, h)$ gives
\[
\operatorname{Var}\big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big)
\le \frac{40\, \alpha_t^2 B^2}{\sigma_t^4}\, \rho^2(h^*, h).
\]
Taking the supremum over $\rho(h^*, h) \le \varepsilon$ yields the claim.

Lemma 7 (Bernstein's condition). With $c_b = 16\, \alpha_t^2 B^2/\sigma_t^4$, the centered excess-loss class
\[
\mathcal{F}_\varepsilon := \Big\{ f_h(X) := \Delta\ell_h(X) - E[\Delta\ell_h(X)] :\, \rho(h^*, h) \le \varepsilon,\, h \in \mathcal{H} \Big\},
\qquad
\Delta\ell_h := \ell_t(\cdot;h) - \ell_t(\cdot;h^*),
\]
satisfies Bernstein's condition in the following moment form: there exists $v^2 = v^2(\varepsilon)$ such that $\sup_{f \in \mathcal{F}_\varepsilon} \operatorname{Var}(f(X)) \le v^2$ and, for all integers $k \ge 2$,
\[
\sup_{f \in \mathcal{F}_\varepsilon} E|f(X)|^k \le \tfrac12\, k!\, v^2\, c_b^{k-2}. \tag{30}
\]
Moreover, using Lemma 6 (variance--mean), one may take $v^2(\varepsilon) = c_v \varepsilon^2$, where $c_v = 40\, \alpha_t^2 B^2/\sigma_t^4$.

Proof. Fix $t \in (0,1)$ and assume the forward perturbation model
\[
Y_0 = h^*(U), \qquad Y_t = \alpha_t Y_0 + \sigma_t Z, \qquad Z \sim N(0, I_D),
\]
with $Z$ independent of $U$. Assume the score model admits the posterior-mean representation
\[
s_t(y_t;h) = \frac{\alpha_t m_h(y_t) - y_t}{\sigma_t^2},
\qquad
m_h(y_t) := E[\,h(U) \mid Y_t = y_t\,],
\]
so that the ideal loss reduces to $\ell_t(Y_t, Y_0; h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2$. We have $\|h(U)\|_2 \le B$ a.s.\ and therefore $\|m_h(Y_t)\|_2 = \|E[h(U) \mid Y_t]\|_2 \le B$ a.s. Also $\|Y_0\|_2 = \|h^*(U)\|_2 \le B$ a.s. Using the reduced loss form,
\[
0 \le \ell_t(\cdot;h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2 \le \frac{\alpha_t^2}{\sigma_t^4}(2B)^2 .
\]
The same bound holds for $\ell_t(\cdot;h^*)$. Thus
\[
|\Delta\ell_h| \le \frac{8\, \alpha_t^2 B^2}{\sigma_t^4}
\quad\Longrightarrow\quad
|f_h| = |\Delta\ell_h - E\Delta\ell_h| \le \frac{16\, \alpha_t^2 B^2}{\sigma_t^4} = c_b \quad \text{a.s.}
\]
For any centered random variable $f$ with $|f| \le c_b$ almost surely and any integer $k \ge 2$,
\[
|f|^k \le c_b^{k-2} f^2
\quad\Longrightarrow\quad
E|f|^k \le c_b^{k-2}\, E[f^2] = c_b^{k-2} \operatorname{Var}(f).
\]
Now let $v^2 := \sup_{f \in \mathcal{F}_\varepsilon} \operatorname{Var}(f)$. Then for all $k \ge 2$, $\sup_{f \in \mathcal{F}_\varepsilon} E|f|^k \le v^2 c_b^{k-2}$. Since $\tfrac12 k! \ge 1$ for all $k \ge 2$, this implies (30). Finally, Lemma 6 gives, on the localized class $\rho(h^*, h) \le \varepsilon$, that $\operatorname{Var}(\Delta\ell_h) \le c_v \rho^2(h^*, h) \le c_v \varepsilon^2$, hence $\operatorname{Var}(f_h) = \operatorname{Var}(\Delta\ell_h) \le c_v \varepsilon^2$. Therefore one can take $v^2(\varepsilon) = c_v \varepsilon^2$.

Theorem 5 (Estimation error). Suppose $\mathcal{H} = \mathrm{NN}(W, L)$. Then there exists $c_{NN} > 0$ such that any
\[
\varepsilon \;\ge\; \min_{\delta \ge 1,\, \zeta \ge 1}
\left\{ c_{NN} \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{2}{2-\delta}}
+ \Big( \frac{C_{t,\zeta}^2}{c_h^2\, n} \Big)^{\frac{2}{2-\delta(1 - \frac{D}{2\zeta})}} \right\} \tag{31}
\]
satisfies the integral entropy equation (14).

Proof. Consider solving the entropy equation
\[
\int_{k\varepsilon^2/16}^{4 c_v^{1/2} \varepsilon} H_B^{1/2}(u, \mathcal{L})\, du \le c_h\, n^{1/2} \varepsilon^2 .
\]
Note that we have $\mathcal{L} \subset C^\zeta\big( [-R_t, R_t]^D,\, \alpha_t/\sigma_t^{\zeta+4} \big)$ with $R_t \asymp \sigma_t \log n$.
The entropy of this smooth class is bounded by $H_B^{1/2}(u, \mathcal{L}) \lesssim \frac{1}{\sigma_t^{\zeta+4}}\, u^{-D/(2\zeta)}$. It then suffices to solve the following sufficient condition for $\varepsilon$, for a fixed $1 \le \delta < 2$:
\[
c_h\, n^{1/2} \varepsilon^2
\ge \int_{k\varepsilon^2/16}^{\varepsilon^\delta} H_B^{1/2}(u, \mathcal{L})\, du
+ \int_{\varepsilon^\delta}^{\infty} H_B^{1/2}\Big( u,\, C^\zeta\big( [-R_t, R_t]^D, \tfrac{1}{\sigma_t^{\zeta+4}} \big) \Big)\, du .
\]
We now bound the right-hand side from above. For the first term,
\[
\int_{k\varepsilon^2/16}^{\varepsilon^\delta} H_B^{1/2}(u, \mathcal{L})\, du
\le \varepsilon^\delta\, H_B^{1/2}\big( k\varepsilon^2/16, \mathcal{L} \big).
\]
For the second term, let $C_{t,\zeta} := \frac{1}{\sigma_t^{\zeta+4}}$ and assume $D/(2\zeta) > 1$; then
\[
\int_{\varepsilon^\delta}^{\infty} H_B^{1/2}\Big( u,\, C^\zeta\big( [-R_t, R_t]^D, \tfrac{1}{\sigma_t^{\zeta+4}} \big) \Big)\, du
\le C_{t,\zeta}\, \varepsilon^{\delta(1 - \frac{D}{2\zeta})} .
\]
Combining the two bounds, we obtain the entropy inequality
\[
c_h\, n^{1/2} \varepsilon^2
\ge \varepsilon^\delta\, H_B^{1/2}\big( k\varepsilon^2/16, \mathcal{L} \big)
+ C_{t,\zeta}\, \varepsilon^{\delta(1 - \frac{D}{2\zeta})} . \tag{32}
\]
Using the Lipschitz transfer lemma (Lemma 8) and plugging in the entropy bound for the NN class from Lemma 9, we get the bound
\[
\varepsilon \ge \min_{\substack{1 \le \delta < 2 \\ \zeta \ge 1}}
\left\{ c_{NN} \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{1}{2(2-\delta)}}
+ \Big( \frac{C_{t,\zeta}^2}{c_h^2\, n} \Big)^{\frac{1}{2(2-\delta(1 - \frac{D}{2\zeta}))}} \right\} . \tag{33}
\]
Let $\zeta = D(\eta+2)/d^*$ and $\delta = \frac{\eta+2}{\eta+1}$. Note that $\frac{D}{2\zeta} = \frac{d^*}{2(\eta+2)}$, so the assumption $\frac{D}{2\zeta} > 1$ holds whenever $d^* > 2(\eta+2)$. We can get the target bound
\[
\varepsilon \gtrsim \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{\eta+1}{2\eta}}
+ \frac{n^{-\frac{\eta+1}{2\eta + d^*}}}{c_h\, \sigma_t^{4 + \frac{D(\eta+2)}{d^*}}} . \tag{34}
\]

Lemma 8 (Lipschitz transfer w.r.t.\ centers). Fix $\|x\| \le R_x$ and suppose $\|y_i\|, \|y'_i\| \le R_y$ for all $i$. Define
\[
g_y(x) = \frac{x}{\sigma^2} - \frac{1}{\sigma^2} \sum_{i=1}^K w_i(x;y)\, y_i,
\qquad
w(x;y) = \operatorname{softmax}\big( s(x;y) \big),
\qquad
s_i(x;y) := -\frac{\|x - y_i\|^2}{2\sigma^2} .
\]
Then
\[
\| g_y(x) - g_{y'}(x) \| \le \frac{C_0}{\sigma^2}\, \|y - y'\|_{\infty,K},
\qquad
C_0 := 1 + \frac{2 R_y (R_x + R_y)}{\sigma^2} .
\]
Thus, with $B := \frac{R_x + R_y}{\sigma^2} + R_x$,
\[
| f_y(x) - f_{y'}(x) | \le \frac{2 B C_0}{\sigma^2}\, \|y - y'\|_{\infty,K} .
\]

Proof. Write $g_y(x) - g_{y'}(x) = -\frac{1}{\sigma^2}\big( \sum_{i=1}^K w_i(x;y)\, y_i - \sum_{i=1}^K w_i(x;y')\, y'_i \big)$.
Add and subtract $\sum_i w_i(x;y')\, y_i$ to obtain
\[
\sum_{i=1}^K w_i(x;y)\, y_i - \sum_{i=1}^K w_i(x;y')\, y'_i
= \sum_{i=1}^K w'_i (y_i - y'_i) + \sum_{i=1}^K (w_i - w'_i)\, y_i,
\]
where $w_i := w_i(x;y)$ and $w'_i := w_i(x;y')$. Therefore,
\[
\| g_y(x) - g_{y'}(x) \|
\le \frac{1}{\sigma^2} \Big\| \sum_{i=1}^K w'_i (y_i - y'_i) \Big\|
+ \frac{1}{\sigma^2} \Big\| \sum_{i=1}^K (w_i - w'_i)\, y_i \Big\|
\le \frac{1}{\sigma^2} \sum_{i=1}^K w'_i \|y_i - y'_i\|
+ \frac{1}{\sigma^2} \sum_{i=1}^K |w_i - w'_i|\, \|y_i\|
\le \frac{1}{\sigma^2} \|y - y'\|_{\infty,K} + \frac{R_y}{\sigma^2} \|w - w'\|_1 .
\]
Next we bound $\|w - w'\|_1$ by $\|s - s'\|_\infty$. For the softmax $w_i = \exp(s_i)/\sum_j \exp(s_j)$, the Jacobian satisfies
\[
\frac{\partial w_i}{\partial s_j} = w_i\big( \mathbf{1}\{i = j\} - w_j \big).
\]
For any direction $a \in \mathbb{R}^K$, the directional derivative is $(Ja)_i = w_i\big( a_i - \sum_{j=1}^K w_j a_j \big)$. Hence
\[
\|Ja\|_1 = \sum_{i=1}^K w_i \Big| a_i - \sum_{j=1}^K w_j a_j \Big|
\le \sum_{i=1}^K w_i \Big( |a_i| + \Big| \sum_{j=1}^K w_j a_j \Big| \Big)
\le \sum_{i=1}^K w_i \big( \|a\|_\infty + \|a\|_\infty \big) = 2 \|a\|_\infty .
\]
By the mean value theorem applied to the smooth map $s \mapsto \operatorname{softmax}(s)$ along the segment $s_\tau = s' + \tau(s - s')$, $\tau \in [0,1]$, we get
\[
\|w - w'\|_1 \le \sup_{\tau \in [0,1]} \| J(s_\tau)(s - s') \|_1 \le 2 \|s - s'\|_\infty .
\]
Next, we bound $\|s - s'\|_\infty$ in terms of $\|y - y'\|_{\infty,K}$. For each $i$,
\[
| s_i(x;y) - s_i(x;y') |
= \frac{1}{2\sigma^2} \big| \|x - y_i\|^2 - \|x - y'_i\|^2 \big|
\le \frac{1}{2\sigma^2} \|y_i - y'_i\| \big( \|x - y_i\| + \|x - y'_i\| \big)
\le \frac{1}{2\sigma^2} \|y_i - y'_i\| \big( R_x + R_y + R_x + R_y \big)
= \frac{R_x + R_y}{\sigma^2} \|y_i - y'_i\| .
\]
Taking the maximum over $i$ gives $\|s - s'\|_\infty \le \frac{R_x + R_y}{\sigma^2} \|y - y'\|_{\infty,K}$. Combining,
\[
\|w - w'\|_1 \le 2 \|s - s'\|_\infty \le \frac{2(R_x + R_y)}{\sigma^2} \|y - y'\|_{\infty,K} .
\]
Plugging into the earlier estimate for $\|g_y(x) - g_{y'}(x)\|$ yields
\[
\| g_y(x) - g_{y'}(x) \|
\le \frac{1}{\sigma^2} \Big( 1 + \frac{2 R_y (R_x + R_y)}{\sigma^2} \Big) \|y - y'\|_{\infty,K} .
\]
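The softmax estimate used above, $\|w - w'\|_1 \le 2\|s - s'\|_\infty$, can be sanity-checked numerically. The sketch below (a minimal illustration with arbitrary dimensions, centers, and bandwidth, not from the paper) draws random centers and a small perturbation and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

def weights(x, Y, sigma):
    # softmax over s_i = -||x - y_i||^2 / (2 sigma^2)
    s = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
    e = np.exp(s - s.max())
    return e / e.sum()

K, D, sigma = 8, 3, 0.7
x = rng.normal(size=D)
x *= 1.0 / max(1.0, np.linalg.norm(x))        # enforce ||x|| <= R_x = 1
Y = rng.normal(size=(K, D))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))
Yp = Y + 1e-3 * rng.normal(size=(K, D))       # perturbed centers y'

w, wp = weights(x, Y, sigma), weights(x, Yp, sigma)
s = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
sp = -np.sum((x - Yp) ** 2, axis=1) / (2 * sigma**2)

# softmax is 2-Lipschitz from (R^K, ||.||_inf) to the simplex with ||.||_1
assert np.sum(np.abs(w - wp)) <= 2 * np.max(np.abs(s - sp)) + 1e-12
```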
Finally, using $\big| \|a\|^2 - \|b\|^2 \big| \le (\|a\| + \|b\|)\, \|a - b\|$ with $a = g_y(x) - x$ and $b = g_{y'}(x) - x$, we obtain
\[
| f_y(x) - f_{y'}(x) | \le \big( \|g_y(x) - x\| + \|g_{y'}(x) - x\| \big)\, \|g_y(x) - g_{y'}(x)\| .
\]
Under $\|x\| \le R_x$ and $\|y_i\| \le R_y$, we have
\[
\|g_y(x)\| \le \frac{\|x\|}{\sigma^2} + \frac{1}{\sigma^2} \sum_i w_i \|y_i\| \le \frac{R_x + R_y}{\sigma^2},
\]
so $\|g_y(x) - x\| \le \frac{R_x + R_y}{\sigma^2} + R_x =: B$, and similarly for $y'$. Therefore,
\[
| f_y(x) - f_{y'}(x) | \le 2B\, \|g_y(x) - g_{y'}(x)\| \le \frac{2 B C_0}{\sigma^2}\, \|y - y'\|_{\infty,K} .
\]

Lemma 9 (Empirical $L^\infty$ covering of $\mathcal{H}$). Let $\mathcal{H} = \mathcal{H}(W, L)$ be ReLU networks of depth $L$ and width $W$, with output dimension $d_y$ and range bound $\|h(u)\| \le R_h$. Then for any finite set $\{U_i\}_{i=1}^K$ and any $\eta \in (0, 2R_h]$,
\[
\log N\big( \eta, \mathcal{H}, \|\cdot\|_{\infty,K} \big)
\le C_1\, d_y\, \operatorname{Pdim}(\mathcal{H}) \log\Big( \frac{C_2 R_h}{\eta} \Big),
\]
where $\operatorname{Pdim}(\mathcal{H})$ is the pseudo-dimension of the (scalar-output) network class and $C_1, C_2$ are universal constants. For ReLU nets, $\operatorname{Pdim}(\mathcal{H}) \le C_3\, WL \log(eW)$, hence
\[
\log N\big( \eta, \mathcal{H}, \|\cdot\|_{\infty,K} \big)
\le C\, d_y\, WL \log(eW) \log\Big( \frac{C' R_h}{\eta} \Big).
\]

Proof. Apply Thm.\ 12.5 of Anthony and Bartlett (2009) to each coordinate class $\{u \mapsto h_\ell(u)\}$, use the range bound to normalize, and union-bound over the $d_y$ coordinates to pass from scalar to vector outputs under $\ell_\infty$ on the sample. The pseudo-dimension upper bound for piecewise-linear nets is from Bartlett et al.\ (2019).

C Auxiliary lemmas

Lemma 10 (LSI condition). Assume the latent prior $\pi$ satisfies a log--Sobolev inequality with constant $C_{\mathrm{LSI}}(\pi) > 0$ (e.g., $\pi = N(0, I_d)$ gives $C_{\mathrm{LSI}}(\pi) = 1$). Fix any $h \in \mathcal{H}$ and define $\tilde Y_0 = h(U)$ with $U \sim \pi$ and $\tilde Y_t = \alpha_t \tilde Y_0 + \sigma_t Z$ with $Z \sim N(0, I_D)$ independent. Under Assumption 1, the map $h$ is $M$-Lipschitz, and $q_t$ satisfies a log--Sobolev inequality with
\[
C_{\mathrm{LSI}}(q_t) \ge \frac{\min\{ C_{\mathrm{LSI}}(\pi), 1 \}}{\alpha_t^2 M^2 + \sigma_t^2} . \tag{35}
\]

Proof. Let $\mu := \pi \otimes N(0, I_D)$ be the joint law of $(U, Z) \in \mathbb{R}^d \times \mathbb{R}^D$.
By tensorization of log--Sobolev inequalities, $\mu$ satisfies an LSI with constant $C_{\mathrm{LSI}}(\mu) = \min\{ C_{\mathrm{LSI}}(\pi), 1 \}$. Define the (deterministic) map $F(u, z) := \alpha_t h(u) + \sigma_t z$, so that $F_\#\mu$ is the law of $\tilde Y_t$. For any smooth $\varphi: \mathbb{R}^D \to \mathbb{R}$, set $\psi(u, z) := \varphi(F(u, z))$. By the chain rule,
\[
\| \nabla_{(u,z)} \psi(u, z) \|_2^2
\le \big( \alpha_t^2 \|J_h(u)\|_{\mathrm{op}}^2 + \sigma_t^2 \big)\, \| \nabla\varphi(F(u,z)) \|_2^2
\le \big( \alpha_t^2 M^2 + \sigma_t^2 \big)\, \| \nabla\varphi(F(u,z)) \|_2^2,
\]
where the last inequality uses Assumption 1. Applying the LSI for $\mu$ to $\psi$ and rewriting the result under the pushforward $F_\#\mu$ yields
\[
\operatorname{Ent}_{q_t}(\varphi^2)
\le \frac{2(\alpha_t^2 M^2 + \sigma_t^2)}{C_{\mathrm{LSI}}(\mu)} \int_{\mathbb{R}^D} \| \nabla\varphi(y) \|_2^2\, q_t(y)\, dy .
\]
This proves (35).

Lemma 11 (LSI $\Rightarrow$ $W_2$--Fisher chain; Theorem 22.17 of Villani et al.\ (2009)). If $p$ satisfies the LSI with constant $\rho > 0$ (Definition 3), then $p$ also satisfies Talagrand's $T_2$ inequality and, for any $q \ll p$,
\[
W_2^2(q, p) \le \frac{2}{\rho}\, \mathrm{KL}\big( q \,\|\, p \big) \le \frac{1}{\rho^2}\, J\big( q \,\|\, p \big),
\]
where $J(q \,\|\, p) := \int \| \nabla \log q - \nabla \log p \|_2^2\, q$ is the relative Fisher information.

The following is a ReLU approximation result for a Hölder class of smooth functions, which is a simplified version of Theorem 1.1 in Lu et al.\ (2021) and Lemma 11 in Huang, Jiao, Li, Liu, Wang and Yang (2022).

Lemma 12 (Lemma 11 in Huang, Jiao, Li, Liu, Wang and Yang (2022)). For any $f \in C^r([0,1]^d, \mathbb{R}, B)$, there exists a ReLU network $\Phi$ with $W = c_W(W \log W)$, $L = c_L(L \log L)$, and $E = (WL)^{c_E}$ for some positive constants $c_W$, $c_L$, and $c_E$ depending on $d$ and $r$, such that
\[
\sup_{x \in [0,1]^d} | \Phi(x) - f(x) | = O\big( B\, (WL)^{-\frac{2r}{d}} \big).
\]

Lemma 13. Assume that $f(Y) \in \mathcal{F}$ satisfies the Bernstein condition with some constant $c_b$ for an i.i.d.\ sample $Y_1, \ldots, Y_n$. Let
\[
\phi(M, v^2, \mathcal{F}) = \frac{M^2}{2\big[ 4 v^2 + M c_b/(3 n^{1/2}) \big]},
\]
where $\operatorname{Var}(f(Y)) \le v^2$.
Assume that
\[
M \le k\, n^{1/2} v^2 / (4 c_b), \tag{36}
\]
with $0 < k < 1$, and
\[
\int_{kM/(8 n^{1/2})}^{v} H_B^{1/2}(u, \mathcal{F})\, du \le M k^{3/2} / 2^{10}, \tag{37}
\]
then
\[
P^*\Big\{ \sup_{f \in \mathcal{F}} n^{-1/2} \sum_{i=1}^n \big( f(Y_i) - E f(Y_i) \big) \ge M \Big\}
\le 3 \exp\big( -(1-k)\, \phi(M, v^2, n) \big),
\]
where $P^*$ denotes the outer probability. Specifically, for any event $A$, $P^*(A) := \inf\{ P(B) : A \subseteq B,\, B \text{ measurable} \}$.

Proof of Lemma 13. The result follows from the same arguments as in the proof of Theorem 3 in Shen and Wong (1994) with $\operatorname{Var}(f(X)) \le v^2$. Note that Bernstein's condition replaces the upper boundedness condition there, and the second condition of (4.6) there is not needed here.

We now present the technical lemmas that serve as the foundation for the proof of Theorem 1. For $s \in [0, t]$, let $p_s, q_s$ be $C^2$ densities on $\mathbb{R}^D$ with finite second moments solving the continuity equations
\[
\partial_s p_s + \nabla \cdot (p_s v^p_s) = 0,
\qquad
\partial_s q_s + \nabla \cdot (q_s v^q_s) = 0, \tag{38}
\]
where the (variance-preserving) probability--flow fields are
\[
v^p_s(x) = -\tfrac12 \beta(s)\, x - \beta(s)\, \nabla \log p_s(x),
\qquad
v^q_s(x) = -\tfrac12 \beta(s)\, x - \beta(s)\, \nabla \log q_s(x), \tag{39}
\]
with a measurable schedule $\beta(s) \ge 0$. Denote the scores and their difference by $s^p := \nabla \log p_s$, $s^q := \nabla \log q_s$, $\Delta := s^p - s^q$, and define $J_s := \int_{\mathbb{R}^D} \|\Delta(y)\|^2\, q_s(y)\, dy$. Assume there is a measurable $L^\star(s) \ge 0$ such that for all $x$ and all $s \in [0, t]$,
\[
\| \nabla^2 \log p_s(x) \|_{\mathrm{op}} \le L^\star(s),
\qquad
\| \nabla^2 \log q_s(x) \|_{\mathrm{op}} \le L^\star(s), \tag{40}
\]
and that all integrals below are justified (sufficient decay/integrability; boundary terms vanish).

Assumption 4. There exist measurable functions $L(\cdot), K(\cdot): [0, t] \to [0, \infty)$ such that, for all $s \in [0, t]$:

(A1) Flow Lipschitz:
\[
\| \nabla v^p_s(x) \|_{\mathrm{op}} \le L(s),
\qquad
\| \nabla v^q_s(x) \|_{\mathrm{op}} \le L(s) \quad \forall x. \tag{41}
\]

(A2) Score--gap growth:
\[
\frac{d}{ds} J_s \le 2 K(s)\, J_s . \tag{42}
\]

Lemma 14 (Variable-coefficient pull--back). Let $P := p_0$ and $Q := q_0$.
Under Assumption 4, for every $t > 0$,
\[
W_2(p_0, q_0)
\le e^{\int_0^t L}\, W_2(p_t, q_t)
+ e^{\int_0^t L} \int_0^t \beta(u) \exp\Big( \int_u^t \big( K(r) - L(r) \big)\, dr \Big)\, du\; \sqrt{J_t}. \tag{43}
\]

Proof. Let $\pi_t$ be an optimal coupling of $p_t, q_t$; draw $(X_t, Y_t) \sim \pi_t$ and evolve backward
\[
X_s := \Phi^p_{s \leftarrow t}(X_t),
\qquad
Y_s := \Phi^q_{s \leftarrow t}(Y_t),
\qquad s \in [0, t].
\]
Then $X_s \sim p_s$, $Y_s \sim q_s$. Set $\Delta^{\mathrm{traj}}_s := X_s - Y_s$ and $R_s := \big( E \|\Delta^{\mathrm{traj}}_s\|^2 \big)^{1/2}$; then $W_2(p_s, q_s) \le R_s$, $W_2(p_0, q_0) \le R_0$, and $W_2(p_t, q_t) \le R_t$. Differentiate $\tfrac12 \|\Delta^{\mathrm{traj}}_s\|^2$ and use the flow ODEs:
\[
\frac{d}{ds}\, \tfrac12 \|\Delta^{\mathrm{traj}}_s\|^2
= \big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(X_s) - v^p_s(Y_s) \big\rangle
+ \big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(Y_s) - v^q_s(Y_s) \big\rangle .
\]
By (41), $\| v^p_s(X_s) - v^p_s(Y_s) \| \le L(s) \|\Delta^{\mathrm{traj}}_s\|$. Moreover, $v^p_s - v^q_s = -\beta(s) \Delta_s$ (where $\Delta_s := \nabla \log p_s - \nabla \log q_s$) pointwise, so
\[
\big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(Y_s) - v^q_s(Y_s) \big\rangle
\le \beta(s)\, \|\Delta^{\mathrm{traj}}_s\|\, \|\Delta_s(Y_s)\| .
\]
Taking expectations and applying Cauchy--Schwarz yields
\[
\frac{d}{ds} R_s \le L(s) R_s + \beta(s) \sqrt{J_s},
\qquad 0 \le s \le t. \tag{44}
\]
By Grönwall from $s$ to $t$ (integrating the backward flow stability),
\[
R_s \le e^{\int_s^t L}\, R_t + \int_s^t \beta(u)\, e^{\int_s^u L}\, \sqrt{J_u}\, du. \tag{45}
\]
By (42) and Grönwall, the growth condition implies that for $0 \le u \le t$,
\[
\sqrt{J_u} \le e^{\int_u^t K}\, \sqrt{J_t}. \tag{46}
\]
Insert (46) into (45) with $s = 0$, and use $W_2(p_0, q_0) \le R_0$, $R_t \ge W_2(p_t, q_t)$:
\[
W_2(p_0, q_0)
\le e^{\int_0^t L}\, W_2(p_t, q_t)
+ e^{\int_0^t L} \int_0^t \beta(u)\, e^{-\int_u^t L}\, e^{\int_u^t K}\, du\; \sqrt{J_t},
\]
which is (43).

Lemma 15 (Hessian bound with $h \in C^{1+\gamma}$). Assume the latent prior has bounded Hessian: $\sup_u \| \nabla^2_u \log \pi_U(u) \| \le \Lambda_2$. Consider the VP corruption at level $s$,
\[
Y_s = \alpha_s X + \sigma_s Z,
\qquad \alpha_s \in (0, 1],\ \sigma_s > 0,
\]
with $X = h(U)$ and $\sigma_s \le c\rho$ (inside the tube).
Then, with $H_s(y) := \nabla^2 \log p_s(y)$, there exist constants $C^{(\gamma)}_T, C^{(\gamma)}_S, C^{(\gamma)}_N$ depending only on $(m, M, H_\gamma, \Lambda_2, \rho)$ such that for all $y$,
\[
\| \Pi_T H_s(y) \Pi_T \|_{\mathrm{op}} \le C^{(\gamma)}_T\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}, \tag{47}
\]
\[
\| \Pi_T H_s(y) \Pi_N \|_{\mathrm{op}} \le C^{(\gamma)}_S\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}, \tag{48}
\]
\[
\Pi_N H_s(y) \Pi_N \preceq -\Big( \sigma_s^{-2} - C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}} \Big) \Pi_N . \tag{49}
\]
Consequently,
\[
L^\star(s) := \sup_y \lambda_{\max}\big( H_s(y) \big)
\le C_\gamma\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}},
\qquad
C_\gamma := C^{(\gamma)}_T + \frac{\big( C^{(\gamma)}_S \big)^2}{1 - C^{(\gamma)}_N \sigma_s^{2\gamma}/\alpha_s^{2\gamma}} .
\]

Proof. For $Y_s = \alpha_s X + \sigma_s Z$ with $Z \sim N(0, I_D)$ independent of $X$, the score and Hessian satisfy
\[
\nabla \log p_s(y) = \frac{\alpha_s}{\sigma_s^2}\Big( E[X \mid Y_s = y] - \frac{y}{\alpha_s} \Big),
\qquad
H_s(y) = \nabla^2 \log p_s(y) = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}(X \mid Y_s = y) - \frac{1}{\sigma_s^2} I_D . \tag{A}
\]
Fix $y$ and let $x_0 := \Pi_{\mathcal{M}}(y/\alpha_s)$ be the unique nearest-point projection onto $\mathcal{M}$ (well-defined since $\sigma_s \le c\rho$). Let $\Pi_T, \Pi_N$ denote the orthogonal projections onto the tangent/normal spaces at $x_0$. Because $x_0$ is the nearest-point projection, the residual $r := y/\alpha_s - x_0$ is normal: $\Pi_T r = 0$ and $r = \Pi_N r$. Choose $u_0$ such that $h(u_0) = x_0$, and write a local $C^{1,\gamma}$ parametrization of $\mathcal{M}$: for $\xi$ in a small ball in $\mathbb{R}^d$,
\[
x(\xi) = x_0 + J\xi + R(\xi),
\qquad J := J_h(u_0),
\qquad \|R(\xi)\| \le C \|\xi\|^{1+\gamma}, \tag{B}
\]
where $C$ depends only on $(m, H_\gamma)$. The conditional law of $\xi$ given $Y_s = y$ has (unnormalized) density proportional to
\[
\exp\Big( -\frac{1}{2\sigma_s^2} \| r - (J\xi + R(\xi)) \|^2 \Big)\, \pi_U(u_0 + \xi).
\]
Using (B) and $\Pi_T r = 0$, standard Laplace/Gaussian comparison bounds imply that there exists $c_0 > 0$ (depending only on $(m, M, H_\gamma)$) such that the posterior concentrates on $\{ \|\xi\| \lesssim \varepsilon \}$, and the moments obey
\[
E[\xi \mid y] = O\big( \varepsilon^{1+\gamma} \big), \tag{50}
\]
\[
\operatorname{Cov}(\xi \mid y) = \frac{\sigma_s^2}{\alpha_s^2} (J^\top J)^{-1} + O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{51}
\]
\[
E[\|\xi\|^{2+\gamma} \mid y] = O(\varepsilon^{2+\gamma}),
\qquad
E[\|\xi\|^{2+2\gamma} \mid y] = O(\varepsilon^{2+2\gamma}). \tag{52}
\]
Write $X - x_0 = J\xi + R(\xi)$ and project:
\[
\delta x_T := \Pi_T(X - x_0) = J\xi + O(\|\xi\|^{1+\gamma}),
\qquad
\delta x_N := \Pi_N(X - x_0) = O(\|\xi\|^{1+\gamma}).
\]
Using (51)--(52) and $\|J\|_{\mathrm{op}} \le M$, $\|(J^\top J)^{-1}\|_{\mathrm{op}} \le m^{-2}$, we obtain the block covariance bounds
\[
\operatorname{Cov}_T := \Pi_T \operatorname{Cov}(X \mid y) \Pi_T
= J \operatorname{Cov}(\xi \mid y) J^\top + O\big( E[\|\xi\|^{2+\gamma} \mid y] \big)
= \frac{\sigma_s^2}{\alpha_s^2} \Pi_T + O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{53}
\]
\[
\operatorname{Cov}_{TN} := \Pi_T \operatorname{Cov}(X \mid y) \Pi_N
= O\big( E[\|J\xi\|\, \|\xi\|^{1+\gamma} \mid y] \big)
= O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{54}
\]
\[
\operatorname{Cov}_N := \Pi_N \operatorname{Cov}(X \mid y) \Pi_N
= O\big( E[\|\xi\|^{2+2\gamma} \mid y] \big)
= O\Big( \frac{\sigma_s^{2+2\gamma}}{\alpha_s^{2+2\gamma}} \Big). \tag{55}
\]
Using (A) and (53)--(55),
\[
\Pi_T H_s \Pi_T = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_T - \frac{1}{\sigma_s^2} \Pi_T
= O\Big( \frac{\alpha_s^2}{\sigma_s^4} \cdot \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big)
= O\Big( \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}} \Big),
\]
which gives (47). Similarly,
\[
\Pi_T H_s \Pi_N = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_{TN} = O\Big( \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}} \Big),
\]
giving (48). Finally,
\[
\Pi_N H_s \Pi_N = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_N - \frac{1}{\sigma_s^2} \Pi_N
\preceq -\Big( \sigma_s^{-2} - C\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}} \Big) \Pi_N,
\]
which is (49) after renaming constants. In the $(T, N)$ block form, write
\[
H_s = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix},
\qquad
A = \Pi_T H_s \Pi_T,\quad B = \Pi_T H_s \Pi_N,\quad C = \Pi_N H_s \Pi_N .
\]
By (49), $-C \succeq \mu I$ with
\[
\mu := \sigma_s^{-2} - C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}}
= \sigma_s^{-2}\big( 1 - \theta_s \big),
\qquad
\theta_s = C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma}}{\alpha_s^{2\gamma}} .
\]
When $\theta_s < 1$, the Schur complement bound implies
\[
\lambda_{\max}(H_s) \le \|A\|_{\mathrm{op}} + \frac{\|B\|_{\mathrm{op}}^2}{\mu} .
\]
Combining with (47)--(48),
\[
\lambda_{\max}(H_s)
\le C^{(\gamma)}_T\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}
+ \frac{\big( C^{(\gamma)}_S \big)^2}{1 - \theta_s}\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}
= C_\gamma\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}},
\]
which yields the stated envelope for $L^\star(s)$.

Lemma 16 (Score--gap growth along the $q$--flow). Let $p_s, q_s$ solve the continuity equations (38) with VP probability--flow fields (39), and define
\[
s^p = \nabla \log p_s,
\qquad s^q = \nabla \log q_s,
\qquad \Delta := s^p - s^q,
\qquad J_s := \int_{\mathbb{R}^D} \|\Delta\|_2^2\, q_s .
\]
Assume the Hessian envelope (40) holds: $\| \nabla^2 \log p_s(x) \|_{\mathrm{op}} \le L^\star(s)$ and $\| \nabla^2 \log q_s(x) \|_{\mathrm{op}} \le L^\star(s)$ for all $x$. Then, for all $0 \le u \le t$,
\[
J_u \le J_t \exp\Big( 2 \int_u^t \beta(r)\, \big( \tfrac12 + 4 L^\star(r) \big)\, dr \Big). \tag{56}
\]
Equivalently, $dJ_s/ds \le 2 K(s) J_s$ with $K(s) = \beta(s)\big( \tfrac12 + 4 L^\star(s) \big)$.

Proof. Note that if $\rho_s$ solves $\partial_s \rho_s + \nabla \cdot (\rho_s v_s) = 0$, then its score $s^\rho := \nabla \log \rho_s$ satisfies
\[
\partial_s s^\rho + (\nabla s^\rho) v_s + (\nabla v_s)^\top s^\rho + \nabla(\nabla \cdot v_s) = 0. \tag{57}
\]
We apply (57) to $(p_s, v^p_s)$ and $(q_s, v^q_s)$, subtract the two identities, and rewrite the transport part along $v^q_s$:
\[
\partial_s \Delta + (\nabla\Delta) v^q_s + (\nabla v^q_s)^\top \Delta
= -\Big[ (\nabla s^p)(v^p_s - v^q_s)
+ \big( (\nabla v^p_s)^\top - (\nabla v^q_s)^\top \big) s^p
+ \nabla\big( \nabla \cdot (v^p_s - v^q_s) \big) \Big]. \tag{58}
\]
Using (39) one has
\[
v^p_s - v^q_s = -\beta(s)\Delta,
\qquad
\nabla v^p_s - \nabla v^q_s = -\beta(s)\nabla\Delta,
\qquad
\nabla \cdot (v^p_s - v^q_s) = -\beta(s)\, \nabla \cdot \Delta,
\]
so (58) becomes
\[
\partial_s \Delta + (\nabla\Delta) v^q_s + (\nabla v^q_s)^\top \Delta
= \beta(s) \Big[ (\nabla s^p)\Delta + (\nabla\Delta)^\top s^p + \nabla(\nabla \cdot \Delta) \Big]. \tag{59}
\]
Differentiate $J_s = \int \|\Delta\|^2 q_s$ and use $\partial_s q_s = -\nabla \cdot (q_s v^q_s)$:
\[
\frac{d}{ds} J_s
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s + \int \|\Delta\|^2\, \partial_s q_s
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s - \int \|\Delta\|^2\, \nabla \cdot (q_s v^q_s)
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s + \int q_s\, v^q_s \cdot \nabla(\|\Delta\|^2)
= 2 \int q_s \big\langle \Delta,\, \partial_s \Delta + (\nabla\Delta) v^q_s \big\rangle .
\]
Insert (59) to obtain
\[
\frac{d}{ds} J_s
= -2 \int q_s \big\langle \Delta, (\nabla v^q_s)^\top \Delta \big\rangle
+ 2\beta(s) \int q_s \big\langle \Delta, (\nabla s^p) \Delta \big\rangle
+ 2\beta(s)\, I_s, \tag{60}
\]
where
\[
I_s := \int q_s \Big[ \big\langle \Delta, \nabla(\nabla \cdot \Delta) \big\rangle
+ \big\langle \Delta, (\nabla\Delta)^\top s^p \big\rangle \Big].
\]
Let $f := \nabla \cdot \Delta$. Using $s^p = s^q + \Delta$ we split
\[
I_s = \int q_s \Big[ \langle \Delta, \nabla f \rangle + \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle \Big]
+ \int q_s \big\langle \Delta, (\nabla\Delta)^\top \Delta \big\rangle .
\]
The first bracket equals $-\int q_s \|\nabla\Delta\|_F^2 \le 0$ by Lemma 17. For the second term, use
\[
\big\langle \Delta, (\nabla\Delta)^\top \Delta \big\rangle
= \Delta^\top (\nabla\Delta) \Delta \le \|\nabla\Delta\|_{\mathrm{op}} \|\Delta\|^2,
\qquad
\|\nabla\Delta\|_{\mathrm{op}} \le \| \nabla^2 \log p_s \|_{\mathrm{op}} + \| \nabla^2 \log q_s \|_{\mathrm{op}} \le 2 L^\star(s),
\]
hence
\[
I_s \le 2 L^\star(s)\, J_s . \tag{61}
\]
From (39),
\[
\nabla v^q_s(x) = -\tfrac12 \beta(s) I - \beta(s)\, \nabla^2 \log q_s(x)
\quad\Longrightarrow\quad
\|\nabla v^q_s\|_{\mathrm{op}} \le \beta(s)\big( \tfrac12 + L^\star(s) \big).
\]
Also,
\[
\int q_s \big\langle \Delta, (\nabla s^p) \Delta \big\rangle \le L^\star(s) \int q_s \|\Delta\|^2 = L^\star(s)\, J_s .
\]
Insert these bounds and (61) into (60):
\[
\frac{d}{ds} J_s
\le 2\beta(s)\big( \tfrac12 + L^\star(s) \big) J_s + 2\beta(s) L^\star(s) J_s + 4\beta(s) L^\star(s) J_s
= \big( \beta(s) + 8\beta(s) L^\star(s) \big) J_s .
\]
Equivalently, $\frac{d}{ds} J_s \le 2\beta(s)\big( \tfrac12 + 4 L^\star(s) \big) J_s$. Applying Grönwall on $[u, t]$ yields (56).

Lemma 17 (Weighted IBP identity along the $q$--flow). Let $q$ be a $C^2$ density on $\mathbb{R}^D$ with score $s^q = \nabla \log q$. Let $\Delta = \nabla g$ be a $C^2$ gradient field (so $\nabla\Delta = \nabla^2 g$ is symmetric), and set $f := \nabla \cdot \Delta$. Assume sufficient decay/integrability so that boundary terms vanish. Then
\[
\int_{\mathbb{R}^D} q\, \langle \Delta, \nabla f \rangle\, dx
+ \int_{\mathbb{R}^D} q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= -\int_{\mathbb{R}^D} q\, \|\nabla\Delta\|_F^2\, dx \le 0. \tag{62}
\]

Proof. Write $s^q = \nabla \log q$, so $q\, s^q = \nabla q$. Using integration by parts (boundary terms vanish),
\[
\int q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= \int \Delta^\top (\nabla\Delta)\, \nabla q\, dx
= -\int q\, \nabla \cdot \big( \Delta^\top (\nabla\Delta) \big)\, dx .
\]
Compute the divergence in coordinates (summation convention):
\[
\nabla \cdot \big( \Delta^\top (\nabla\Delta) \big)
= \partial_j\big( \Delta_i\, \partial_i \Delta_j \big)
= (\partial_j \Delta_i)(\partial_i \Delta_j) + \Delta_i\, \partial_i(\partial_j \Delta_j).
\]
Since $\Delta = \nabla g$, we have $\partial_j \Delta_i = \partial_i \Delta_j$, hence
\[
(\partial_j \Delta_i)(\partial_i \Delta_j) = \sum_{i,j} (\partial_i \Delta_j)^2 = \|\nabla\Delta\|_F^2,
\qquad
\Delta_i\, \partial_i(\partial_j \Delta_j) = \langle \Delta, \nabla f \rangle .
\]
Therefore
\[
\int q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= -\int q\, \|\nabla\Delta\|_F^2\, dx - \int q\, \langle \Delta, \nabla f \rangle\, dx,
\]
which is exactly (62).

D Proofs in Section 4

Proof of Lemma 1. Fix $t \in (0, 1)$ and let $U \sim \pi$. For $y \in \mathbb{R}^D$ define
\[
\phi_{\sigma_t}(z) := (2\pi\sigma_t^2)^{-D/2} \exp\Big( -\frac{\|z\|^2}{2\sigma_t^2} \Big),
\qquad
w_t(u; y) := \phi_{\sigma_t}\big( y - \alpha_t h(u) \big),
\]
and the (unnormalized) mixture density
\[
B_t(y) := E_\pi\big[ w_t(U; y) \big]
= \int_{\mathcal{U}} \phi_{\sigma_t}\big( y - \alpha_t h(u) \big)\, \pi(u)\, du
= p_t(y).
\]
Define also the posterior mean of $h$ at level $t$,
\[
m_t(y) := E[\,h(U) \mid Y_t = y\,] = \frac{E_\pi\big[ w_t(U; y)\, h(U) \big]}{B_t(y)} .
\]
Given i.i.d.\ anchors $U^{(1)}, \ldots, U^{(K)} \overset{\text{iid}}{\sim} \pi$, define the self-normalized estimator of $m_t(y)$,
\[
\tilde m_{t,K}(y) := \frac{\sum_{j=1}^K w_t(U^{(j)}; y)\, h(U^{(j)})}{\sum_{j=1}^K w_t(U^{(j)}; y)} .
\]
When $\tilde\pi \equiv \pi$, the transport-based score estimator (4) satisfies
\[
\tilde s_{t,K}(y; h, \pi, \pi) = \frac{1}{\sigma_t^2}\big( \alpha_t\, \tilde m_{t,K}(y) - y \big).
\]
Moreover, by (2),
\[
s_t(y; h) = \nabla_y \log p_t(y) = \frac{1}{\sigma_t^2}\big( \alpha_t\, m_t(y) - y \big).
\]
Therefore, for all $y \in \mathbb{R}^D$,
\[
\tilde s_{t,K}(y; h, \pi, \pi) - s_t(y; h)
= \frac{\alpha_t}{\sigma_t^2}\big( \tilde m_{t,K}(y) - m_t(y) \big), \tag{63}
\]
and hence
\[
E\big\| \tilde s_{t,K}(Y_t; h, \pi, \pi) - s_t(Y_t; h) \big\|_2^2
= \frac{\alpha_t^2}{\sigma_t^4}\, E\big\| \tilde m_{t,K}(Y_t) - m_t(Y_t) \big\|_2^2 . \tag{64}
\]
In the first step, we bound the conditional mean-squared error of the self-normalized estimator $\tilde m_{t,K}(y)$ for fixed $y$. Assume $E_\pi[w_t(U; y)^4 \|h(U)\|^4] < \infty$ and $B_t(y) > 0$. Then the standard self-normalized importance sampling expansion (e.g., Owen (2013, Ch.\ 9)) gives
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2
= \frac{1}{K}\, \frac{E_\pi\big[ w_t(U; y)^2\, \|h(U) - m_t(y)\|_2^2 \big]}{B_t(y)^2}
+ O\Big( \frac{1}{K^2} \Big), \tag{65}
\]
where the expectation is over the anchors $U^{(1:K)}$ conditional on $Y_t = y$.

In the second step, we rewrite the leading term in (65). Using the identity $\phi_{\sigma_t}(z)^2 = (4\pi\sigma_t^2)^{-D/2}\, \phi_{\sigma_t/\sqrt{2}}(z)$, we obtain
\[
E_\pi\big[ w_t(U; y)^2\, g(U) \big]
= (4\pi\sigma_t^2)^{-D/2}\, E_\pi\Big[ \phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(U) \big)\, g(U) \Big]
\]
for any measurable $g$. Define
\[
B_{t, \sigma_t/\sqrt{2}}(y) := E_\pi\Big[ \phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(U) \big) \Big],
\qquad
C_t(y) := (4\pi\sigma_t^2)^{-D/2}\, \frac{B_{t, \sigma_t/\sqrt{2}}(y)}{B_t(y)^2} .
\]
Also define the posterior at bandwidth $\sigma_t/\sqrt{2}$ by
\[
q_{t, \sigma_t/\sqrt{2}}(u \mid y) := \frac{\phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(u) \big)\, \pi(u)}{B_{t, \sigma_t/\sqrt{2}}(y)} .
\]
Applying the above identity with $g(U) = \|h(U) - m_t(y)\|_2^2$ yields
\[
\frac{E_\pi\big[ w_t(U; y)^2\, \|h(U) - m_t(y)\|_2^2 \big]}{B_t(y)^2}
= C_t(y)\, E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big]. \tag{66}
\]
Combining (65) and (66) gives
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2
= \frac{C_t(y)}{K}\, E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big]
+ O\Big( \frac{1}{K^2} \Big). \tag{67}
\]
Then we bound $C_t(y)$ using Lemma 18 (Gaussian--manifold convolution).
The marginal density is
\[
B_t(y) = p_t(y) = \int_{\mathcal{U}} \phi_{\sigma_t}\big( y - \alpha_t h(u) \big)\, \pi(u)\, du,
\]
which is a Gaussian smoothing of the image manifold $\alpha_t \mathcal{M}$ at scale $\sigma_t$. Applying Lemma 18 with intrinsic dimension $d$ (and observing that the exponential terms cancel in the ratio defining $C_t$) yields constants $c_1, c_2, \sigma_0 > 0$ such that
\[
c_1\, \sigma_t^{-d} \le C_t(y) \le c_2\, \sigma_t^{-d}
\qquad \big( y \in T_r(\alpha_t \mathcal{M}),\ \sigma_t \le \sigma_0 \big). \tag{68}
\]
In the next step, we bound the posterior second-moment term. Since $\|h(U)\|_2 \le B$ almost surely, we have for all $y$,
\[
\|h(U) - m_t(y)\|_2^2 \le \big( \|h(U)\|_2 + \|m_t(y)\|_2 \big)^2 \le 4B^2,
\]
hence
\[
E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big] \le 4B^2 . \tag{69}
\]
Combining (67), (68), and (69), and choosing $K$ sufficiently large so that the $O(K^{-2})$ term is dominated by the leading term, we obtain for $y \in T_r(\alpha_t \mathcal{M})$ and $\sigma_t \le \sigma_0$,
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2 \le \frac{C'}{K}\, \sigma_t^{-d},
\]
for a constant $C' > 0$ depending only on $(B, d, D)$ and the geometric constants in Assumptions 1--2. Taking the expectation over $Y_t$ and substituting into (64) yields
\[
E\big\| \tilde s_{t,K}(Y_t; h, \pi, \pi) - s_t(Y_t; h) \big\|_2^2
\le \frac{\alpha_t^2}{\sigma_t^4} \cdot \frac{C'}{K}\, \sigma_t^{-d}
= \frac{C\, \alpha_t^2}{K\, \sigma_t^{d+4}},
\]
which is (19).

Lemma 18 (Gaussian--manifold convolution: two-sided bounds). Let $h \in \mathcal{H}_{\mathrm{reg}}$ satisfy Assumption 1 on a bounded latent domain $\mathcal{U} \subset \mathbb{R}^d$, and let $\pi$ be a latent density on $\mathcal{U}$ satisfying $0 < \pi_{\min} \le \pi(u) \le \pi_{\max} < \infty$ for all $u \in \mathcal{U}$. Assume furthermore that $\mathcal{M} := h(\mathcal{U})$ has reach at least $\rho_{\mathcal{M}} > 0$ (Assumption 2). For $\alpha > 0$ define
\[
p^X_\alpha(x) := \int_{\mathcal{U}} \phi_\alpha\big( x - h(u) \big)\, \pi(u)\, du,
\qquad
\phi_\alpha(z) = (2\pi\alpha^2)^{-D/2} \exp\Big( -\frac{\|z\|^2}{2\alpha^2} \Big).
\]
Fix $r \in (0, \rho_{\mathcal{M}})$ and consider $x \in T_r(\mathcal{M}) = \{ x : \operatorname{dist}(x, \mathcal{M}) \le r \}$.
Then there exist constants $c_\ell, c_u > 0$ and $\alpha_0 \in (0, r)$, depending only on $(D, d, \pi_{\min}, \pi_{\max}, m, M, \rho_{\mathcal{M}}, r, \mathcal{U})$, such that for all $x \in T_r(\mathcal{M})$ and all $\alpha \in (0, \alpha_0]$,
\[
c_\ell\, \alpha^{d-D} \exp\Big( -\frac{\operatorname{dist}(x, \mathcal{M})^2}{2\alpha^2} \Big)
\le p^X_\alpha(x)
\le c_u\, \alpha^{d-D} \exp\Big( -\frac{\operatorname{dist}(x, \mathcal{M})^2}{2\alpha^2} \Big).
\]

Proof. Fix $x \in T_r(\mathcal{M})$ and write $\delta = \operatorname{dist}(x, \mathcal{M})$. Since $r < \rho_{\mathcal{M}}$ and $\operatorname{reach}(\mathcal{M}) \ge \rho_{\mathcal{M}}$, there exists a unique nearest point $y^\star \in \mathcal{M}$ with $\|x - y^\star\| = \delta$. Let $v := x - y^\star$, so $\|v\| = \delta$ and $v \perp T_{y^\star}\mathcal{M}$.

First, by positive reach there exist $r_0 = r_0(\rho_{\mathcal{M}})$ and a local chart $\Psi: B_d(r_0) \to \mathcal{M}$ around $y^\star$ of the form $\Psi(w) = y^\star + Pw + \psi(w)$, where $P$ is an isometry onto $T_{y^\star}\mathcal{M}$, $\psi(w) \in N_{y^\star}\mathcal{M}$, $\psi(0) = 0$, $D\psi(0) = 0$, and $\|\psi(w)\| \le K\|w\|^2$ for $\|w\| \le r_0$. Write $y(w) := \Psi(w)$. Since $v, \psi(w) \in N_{y^\star}\mathcal{M}$ and $Pw \in T_{y^\star}\mathcal{M}$ are orthogonal,
\[
\|x - y(w)\|^2 = \|v - \psi(w)\|^2 + \|w\|^2 .
\]
After shrinking $r_0$ if necessary, there exist constants $a_1, a_2 > 0$ such that
\[
\delta^2 + a_1 \|w\|^2 \le \|x - y(w)\|^2 \le \delta^2 + a_2 \|w\|^2
\qquad (\|w\| \le r_0).
\]
Next, choose $u^\star \in \mathcal{U}$ such that $h(u^\star) = y^\star$. By Assumption 1(i), $J_h(u^\star)$ has smallest singular value at least $m$. Thus, by the inverse function property, there exist $r_1 \in (0, r_0)$ and a $C^1$ map $\Theta: B_d(r_1) \to \mathcal{U}$ such that $h(\Theta(w)) = \Psi(w)$, $\Theta(0) = u^\star$. Differentiating gives $J_h(\Theta(w))\, J_\Theta(w) = D\Psi(w)$. Since $D\Psi(w) = P + D\psi(w)$ and $\|D\psi(w)\| \le 2K\|w\|$, shrinking $r_1$ so that $\|D\psi(w)\| \le \tfrac12$ yields
\[
\frac{1}{2M} \le s_{\min}(J_\Theta(w)) \le s_{\max}(J_\Theta(w)) \le \frac{3}{2m} .
\]
Hence
\[
(2M)^{-d} \le | \det(J_\Theta(w)) | \le \big( 3/(2m) \big)^d
\qquad (w \in B_d(r_1)).
\]
Moreover, $\delta^2 + a_1 \|w\|^2 \le \|x - h(\Theta(w))\|^2 \le \delta^2 + a_2 \|w\|^2$.

Next, decompose
\[
p^X_\alpha(x)
= (2\pi\alpha^2)^{-D/2} \int_{\mathcal{U}} \exp\Big( -\frac{\|x - h(u)\|^2}{2\alpha^2} \Big)\, \pi(u)\, du
=: I_{\mathrm{near}} + I_{\mathrm{far}},
\]
where $I_{\mathrm{near}}$ integrates over $\Theta(B_d(r_1))$.
Changing variables $u = \Theta(w)$ and using $\pi \le \pi_{\max}$ and the lower quadratic bound,
$$I_{\mathrm{near}} \le (2\pi\alpha^2)^{-D/2}\,\pi_{\max}\Big(\frac{3}{2m}\Big)^{d}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big)\int_{\mathbb{R}^d}\exp\!\Big(-\frac{a_1\|w\|^2}{2\alpha^2}\Big)\,dw.$$
Evaluating the Gaussian integral yields
$$I_{\mathrm{near}} \le C^{(1)}_u\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$
For $u \notin \Theta(B_d(r_1))$, continuity and uniqueness of the projection imply $\|x - h(u)\|^2 \ge \delta^2 + \kappa$ for some $\kappa > 0$. Hence
$$I_{\mathrm{far}} \le (2\pi\alpha^2)^{-D/2}\,\pi_{\max}\,\operatorname{vol}(U)\exp\!\Big(-\frac{\delta^2 + \kappa}{2\alpha^2}\Big).$$
Since $\exp(-\kappa/2\alpha^2) = o(\alpha^d)$, this term is absorbed into the same bound for small $\alpha$. Thus
$$p^X_\alpha(x) \le c_u\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$

Finally, for the lower bound, restrict to $\|w\| \le c\alpha$. Using $\pi \ge \pi_{\min}$ and the upper quadratic bound,
$$p^X_\alpha(x) \ge (2\pi\alpha^2)^{-D/2}\,\pi_{\min}\,(2M)^{-d}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big)\int_{\|w\| \le c\alpha}\exp\!\Big(-\frac{a_2\|w\|^2}{2\alpha^2}\Big)\,dw.$$
Bounding the exponential below and using $\operatorname{vol}_d(B_d(c\alpha)) = \omega_d c^d \alpha^d$ gives
$$p^X_\alpha(x) \ge c_\ell\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$

Proof of Lemma 2. Let $Z \sim \mathrm{Unif}[0,1]^d$ and assume $T : [0,1]^d \to \mathbb{R}^d$ is measurable with $T(Z) \sim \pi$. Then for any integrable $\psi : \mathbb{R}^d \to \mathbb{R}$,
$$\int_{\mathbb{R}^d}\psi(u)\,\pi(du) = \mathbb{E}[\psi(T(Z))] = \int_{[0,1]^d}\psi(T(z))\,dz.$$
Applying this with $\psi_0(u) := \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ and $\psi_1(u) := h(u)\,\phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ gives
$$I_0 = \int_{[0,1]^d} f_0(z)\,dz, \qquad I_1 = \int_{[0,1]^d} f_1(z)\,dz, \qquad m_t(y_t) = \frac{I_1}{I_0}.$$
The Koksma–Hlawka inequality (Niederreiter, 1992) states that for a scalar integrand $g : [0,1]^d \to \mathbb{R}$ with finite Hardy–Krause variation $V_{\mathrm{HK}}(g)$,
$$\bigg|\frac{1}{K}\sum_{j=1}^{K} g(z_j) - \int_{[0,1]^d} g(z)\,dz\bigg| \le V_{\mathrm{HK}}(g)\,D^*(P_K).$$
Applying this to $f_0$ yields
$$|I_{0,K} - I_0| \le V_{\mathrm{HK}}(f_0)\,D^*(P_K). \tag{70}$$
For the vector integrand $f_1 = (f_{1,1}, \ldots, f_{1,p})$ with $p := \dim(h(u))$ (typically $p = d$), apply Koksma–Hlawka coordinate-wise and use $\|\cdot\|_2 \le \sum_{r=1}^{p}|\cdot|$ to get
$$\|I_{1,K} - I_1\|_2 \le \sum_{r=1}^{p}\bigg|\frac{1}{K}\sum_{j=1}^{K} f_{1,r}(z_j) - \int f_{1,r}\bigg| \le \Big(\sum_{r=1}^{p} V_{\mathrm{HK}}(f_{1,r})\Big)D^*(P_K) =: V_{\mathrm{HK}}(f_1)\,D^*(P_K).$$
Assume $V_{\mathrm{HK}}(f_0)\,D^*(P_K) \le I_0/2$. Then by (70), $I_{0,K} \ge I_0 - |I_{0,K} - I_0| \ge I_0/2$, so $1/I_{0,K} \le 2/I_0$. Now decompose the ratio error:
$$\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t) = \frac{I_{1,K}}{I_{0,K}} - \frac{I_1}{I_0} = \frac{I_{1,K} - I_1}{I_{0,K}} + I_1\Big(\frac{1}{I_{0,K}} - \frac{1}{I_0}\Big).$$
Taking $\ell_2$-norms and using $\big|\tfrac{1}{I_{0,K}} - \tfrac{1}{I_0}\big| = \tfrac{|I_{0,K} - I_0|}{I_0 I_{0,K}}$ gives
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{\|I_{1,K} - I_1\|_2}{I_{0,K}} + \frac{\|I_1\|_2}{I_0 I_{0,K}}\,|I_{0,K} - I_0|.$$
Using $I_{0,K} \ge I_0/2$ and $\|I_1\|_2 = I_0\|m_t(y_t)\|_2$ yields
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{2}{I_0}\|I_{1,K} - I_1\|_2 + \frac{2\|m_t(y_t)\|_2}{I_0}\,|I_{0,K} - I_0|.$$
Finally, substituting the Koksma–Hlawka bounds for $\|I_{1,K} - I_1\|_2$ and $|I_{0,K} - I_0|$ gives
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{2}{I_0}\big(V_{\mathrm{HK}}(f_1) + \|m_t(y_t)\|_2\,V_{\mathrm{HK}}(f_0)\big)\,D^*(P_K).$$
Applying the identity
$$\big\|\widetilde s^{\mathrm{QMC}}_{t,K}(y_t) - \nabla_{y_t}\log p_{Y_t}(y_t)\big\|_2 = \frac{\alpha_t}{\sigma_t^2}\big\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\big\|_2,$$
we obtain the result stated in the lemma.

Proof of Lemma 3. Fix $t \in (0,1)$ and $y_t \in \mathbb{R}^D$, and condition throughout on $y_t$ (so all expectations and probabilities below are conditional on $y_t$). Write
$$p(u) := \pi_t(u \mid y_t), \qquad q(u) := q(u \mid y_t), \qquad w(u) := \frac{p(u)}{q(u)}.$$
By assumption $p \ll q$, hence $w$ is well-defined $q$-a.e. and satisfies $w(u) \ge 0$. Let $U_1, \ldots, U_K \stackrel{\mathrm{i.i.d.}}{\sim} q$ and denote $w_i := w(U_i)$. Define
$$B_K := \frac{1}{K}\sum_{i=1}^{K} w_i, \qquad A_K := \frac{1}{K}\sum_{i=1}^{K} w_i h(U_i) \in \mathbb{R}^D.$$
Then the self-normalized estimator can be written as $\widetilde m_{t,K}(y_t) = A_K / B_K$.
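The self-normalized ratio $A_K / B_K$ and its $O(1/K)$ mean-squared error can be illustrated with a small simulation. The latent map $h$, noise levels, and evaluation point below are illustrative choices, not from the paper; taking the proposal equal to the latent prior ($q = \pi = \mathrm{Unif}[0,1]$) makes the unnormalized Gaussian factor the importance weight, since all normalizing constants cancel in the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): latent U = [0,1], curve h(u) = (u, u^2) in R^2.
alpha_t, sigma_t = 0.9, 0.3
y = np.array([0.5, 0.3])   # observation y_t at which to estimate m_t(y_t)

def h(u):
    return np.stack([u, u**2], axis=-1)

def kernel(u):
    # Unnormalized phi(y; alpha_t h(u), sigma_t^2 I): constants cancel
    # in the self-normalized ratio A_K / B_K.
    sq = np.sum((y - alpha_t * h(u)) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma_t**2))

# Ground-truth posterior mean m_t(y) = E[h(U) | y] by dense quadrature.
ug = np.linspace(0.0, 1.0, 200001)
kg = kernel(ug)
m_true = (kg[:, None] * h(ug)).sum(axis=0) / kg.sum()

def snis(K):
    """Self-normalized estimate of m_t(y) with proposal q = pi = Unif[0,1]."""
    u = rng.uniform(0.0, 1.0, size=K)
    w = kernel(u)                                   # importance weights
    return (w[:, None] * h(u)).sum(axis=0) / w.sum()

def mse(K, reps=200):
    return float(np.mean([np.sum((snis(K) - m_true) ** 2) for _ in range(reps)]))

mse_small, mse_large = mse(250), mse(4000)
print(f"MSE(K=250)={mse_small:.2e}  MSE(K=4000)={mse_large:.2e}")
```

Consistent with the $B^2 D_2(p\|q)/K$ bound of Lemma 3, the empirical mean-squared error shrinks roughly in proportion to $1/K$ (here by about a factor of 16 between $K = 250$ and $K = 4000$, up to Monte Carlo noise).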
Moreover,
$$m_t(y_t) = \mathbb{E}_p[h(U)] = \int h(u)\,p(u)\,du = \int h(u)\,w(u)\,q(u)\,du = \mathbb{E}_q[w(U) h(U)],$$
so $\mathbb{E}[A_K] = m_t(y_t)$. Using $\mathbb{E}[A_K] = m_t$ and $\mathbb{E}[B_K] = 1$,
$$\widetilde m_{t,K} - m_t = \frac{A_K}{B_K} - m_t = \frac{A_K - m_t B_K}{B_K} = \frac{(A_K - m_t) - m_t(B_K - 1)}{B_K}. \tag{71}$$
Hence, by $(a + b)^2 \le 2a^2 + 2b^2$,
$$\|\widetilde m_{t,K} - m_t\|_2^2 \le \frac{2\|A_K - m_t\|_2^2}{B_K^2} + \frac{2\|m_t\|_2^2(B_K - 1)^2}{B_K^2}. \tag{72}$$
Define the good event $G := \{B_K \ge 1/2\}$. On $G$ we have $1/B_K^2 \le 4$, so (72) implies
$$\|\widetilde m_{t,K} - m_t\|_2^2\,\mathbf{1}_G \le 8\|A_K - m_t\|_2^2 + 8\|m_t\|_2^2(B_K - 1)^2. \tag{73}$$
On $G^c$, since $w_i \ge 0$ and $B_K > 0$ a.s., the normalized weights $\bar w_i := w_i / \sum_{j=1}^{K} w_j$ form a convex combination, hence
$$\widetilde m_{t,K} = \sum_{i=1}^{K}\bar w_i h(U_i), \qquad \text{so } \|\widetilde m_{t,K}\|_2 \le \max_i \|h(U_i)\|_2 \le B.$$
Also $\|m_t\|_2 \le \mathbb{E}_p\|h(U)\|_2 \le B$. Therefore
$$\|\widetilde m_{t,K} - m_t\|_2^2\,\mathbf{1}_{G^c} \le (\|\widetilde m_{t,K}\|_2 + \|m_t\|_2)^2\,\mathbf{1}_{G^c} \le 4B^2\,\mathbf{1}_{G^c}. \tag{74}$$
Taking expectations and combining (73)–(74) yields
$$\mathbb{E}\|\widetilde m_{t,K} - m_t\|_2^2 \le 8\,\mathbb{E}\|A_K - m_t\|_2^2 + 8\|m_t\|_2^2\,\mathbb{E}(B_K - 1)^2 + 4B^2\,\mathbb{P}(G^c). \tag{75}$$
Let $X_i := w_i h(U_i) \in \mathbb{R}^D$, so that $A_K = \frac{1}{K}\sum_{i=1}^{K} X_i$ with $\mathbb{E}[X_i] = m_t$, i.i.d. across $i$. Then
$$\mathbb{E}\|A_K - m_t\|_2^2 = \mathbb{E}\bigg\|\frac{1}{K}\sum_{i=1}^{K}(X_i - \mathbb{E}X_i)\bigg\|_2^2 = \frac{1}{K}\,\mathbb{E}\|X_1 - \mathbb{E}X_1\|_2^2 \le \frac{1}{K}\,\mathbb{E}\|X_1\|_2^2. \tag{76}$$
Using $\|h(u)\|_2 \le B$ and $X_1 = w(U)h(U)$,
$$\mathbb{E}\|X_1\|_2^2 = \mathbb{E}_q\big[w(U)^2\|h(U)\|_2^2\big] \le B^2\,\mathbb{E}_q[w(U)^2] = B^2 D_2(p\|q).$$
Plugging into (76) yields
$$\mathbb{E}\|A_K - m_t\|_2^2 \le \frac{B^2}{K}\,D_2(p\|q). \tag{77}$$
Next, since $B_K = \frac{1}{K}\sum_{i=1}^{K} w_i$ with $\mathbb{E}[w_i] = 1$,
$$\mathbb{E}(B_K - 1)^2 = \operatorname{Var}(B_K) = \frac{1}{K}\operatorname{Var}(w_1) \le \frac{1}{K}\,\mathbb{E}[w_1^2] = \frac{1}{K}\,D_2(p\|q). \tag{78}$$
With $G^c = \{B_K < 1/2\} \subset \{|B_K - 1| \ge 1/2\}$, Chebyshev's inequality gives
$$\mathbb{P}(G^c) \le \mathbb{P}(|B_K - 1| \ge 1/2) \le \frac{\mathbb{E}(B_K - 1)^2}{(1/2)^2} = 4\,\mathbb{E}(B_K - 1)^2.$$
Using (78) yields
$$\mathbb{P}(G^c) \le \frac{4}{K}\,D_2(p\|q). \tag{79}$$
Substituting (77), (78), (79) into (75) and using $\|m_t\|_2 \le B$:
$$\mathbb{E}\|\widetilde m_{t,K} - m_t\|_2^2 \le 8\cdot\frac{B^2}{K}D_2(p\|q) + 8B^2\cdot\frac{1}{K}D_2(p\|q) + 4B^2\cdot\frac{4}{K}D_2(p\|q) = \frac{32B^2}{K}\,D_2(p\|q).$$
Finally, the score error follows from
$$\widetilde s_{t,K}(y_t; h, \pi, q) - \nabla_{y_t}\log p_t(y_t) = \frac{\alpha_t}{\sigma_t^2}\big(\widetilde m_{t,K}(y_t) - m_t(y_t)\big),$$
so squaring and taking conditional expectations yields the stated bound.

References

Ahronoviz, S. and Gronau, I. (2024), 'Genome-AC-GAN: Enhancing synthetic genotype generation through auxiliary classification', bioRxiv, pp. 2024–02.

Albergo, M. S. and Vanden-Eijnden, E. (2023), 'Stochastic interpolants: A unifying framework for flows and diffusions', Proceedings of the National Academy of Sciences 120(36), e2303906120.

Anthony, M. and Bartlett, P. L. (2009), Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, UK.

Arjovsky, M., Chintala, S. and Bottou, L. (2017), Wasserstein generative adversarial networks, in 'Proceedings of the 34th International Conference on Machine Learning', PMLR, pp. 214–223.

Bartlett, P. L., Harvey, N., Liaw, C. and Mehrabian, A. (2019), 'Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks', The Journal of Machine Learning Research 20(1), 2285–2301.

Chen, H., Lee, H. and Lu, J. (2023), Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions, in 'International Conference on Machine Learning', PMLR, pp. 4735–4763.

De Bortoli, V., Thornton, J., Heng, J. and Doucet, A. (2022), Riemannian score-based generative modeling, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 12791–12804.

Dhariwal, P. and Nichol, A. Q. (2021), Diffusion models beat GANs on image synthesis, in 'Advances in Neural Information Processing Systems', Vol. 34, pp. 8780–8794.

Dick, J. and Pillichshammer, F.
(2010), Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration, Vol. 157 of Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK.

Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017), Density estimation using Real NVP, in 'International Conference on Learning Representations'.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. and Courville, A. (2017), Improved training of Wasserstein GANs, in 'Advances in Neural Information Processing Systems', pp. 5767–5777.

Hamidieh, K. (2018), 'A data-driven statistical model for predicting the critical temperature of a superconductor', Computational Materials Science 154, 346–354.

Huang, C.-W., Aghajohari, M., Bose, J., Panangaden, P. and Courville, A. C. (2022), 'Riemannian diffusion models', Advances in Neural Information Processing Systems 35, 2750–2761.

Huang, J., Jiao, Y., Li, Z., Liu, S., Wang, Y. and Yang, Y. (2022), 'An error analysis of generative adversarial networks for learning distributions', Journal of Machine Learning Research 23(116), 1–43.

Hyvärinen, A. (2005), 'Estimation of non-normalized statistical models by score matching', Journal of Machine Learning Research 6, 695–709.

Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J. and Aila, T. (2022), Elucidating the design space of diffusion-based generative models, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 26565–26577.

Kenneweg, P., Dandinasivara, R., Luo, X., Hammer, B. and Schönhuth, A. (2025), 'Generating synthetic genotypes using diffusion models', Bioinformatics 41(Supplement_1), i484–i492.

Kingma, D. P. and Dhariwal, P. (2018), Glow: Generative flow with invertible 1x1 convolutions, in 'Advances in Neural Information Processing Systems', Vol. 31, pp. 10236–10245.

Kirichenko, P., Izmailov, P. and Wilson, A. G. (2020), Why normalizing flows fail to detect out-of-distribution data, in 'Advances in Neural Information Processing Systems', Vol. 33, pp. 20578–20589.

Kobyzev, I., Prince, S. J. and Brubaker, M. A. (2021), 'Normalizing flows: An introduction and review of current methods', IEEE Transactions on Pattern Analysis and Machine Intelligence 43(11), 3964–3979.

Krizhevsky, A. and Hinton, G. (2009), Learning multiple layers of features from tiny images, Technical Report, University of Toronto.

Kunkel, L. and Trabs, M. (2025), 'On the minimax optimality of flow matching through the connection to kernel density estimation', arXiv preprint arXiv:2504.13336.

LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998), 'Gradient-based learning applied to document recognition', Proceedings of the IEEE 86(11), 2278–2324.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M. and Le, M. (2022), 'Flow matching for generative modeling', arXiv preprint arXiv:2210.02747.

Liu, X., Gong, C. and Li, Q. (2022), 'Rectified flow: A marginal preserving approach to optimal transport', arXiv preprint arXiv:2209.14577.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C. and Zhu, J. (2022), DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 5775–5787.

Lu, J., Shen, Z., Yang, H. and Zhang, S. (2021), 'Deep network approximation for smooth functions', SIAM Journal on Mathematical Analysis 53(5), 5465–5506.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D. and Lakshminarayanan, B. (2019), Do deep generative models know what they don't know?, in 'International Conference on Learning Representations'.

Niederreiter, H. (1992), Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, USA.
Oko, K., Akiyama, S. and Suzuki, T. (2023), Diffusion models are minimax optimal distribution estimators, in 'International Conference on Machine Learning', PMLR, pp. 26517–26582.

Owen, A. B. (2013), 'Monte Carlo theory, methods and examples', Stanford University. URL: https://statweb.stanford.edu/~owen/mc/

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S. and Lakshminarayanan, B. (2021), 'Normalizing flows for probabilistic modeling and inference', Journal of Machine Learning Research 22(57), 1–64.

Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M., Dillon, J. V. and Lakshminarayanan, B. (2019), Likelihood ratios for out-of-distribution detection, in 'Advances in Neural Information Processing Systems', Vol. 32.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B. (2022), High-resolution image synthesis with latent diffusion models, in 'Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition', pp. 10684–10695.

Salimans, T. and Ho, J. (2022), Progressive distillation for fast sampling of diffusion models, in 'International Conference on Learning Representations'.

Shen, X. (1997), 'On methods of sieves and penalization', The Annals of Statistics 25(6), 2555–2591.

Shen, X. and Wong, W. H. (1994), 'Convergence rate of sieve estimates', The Annals of Statistics 22(2), 580–615.

Song, J., Meng, C. and Ermon, S. (2021), Denoising diffusion implicit models, in 'International Conference on Learning Representations (ICLR)'.

Song, Y., Dhariwal, P., Chen, M. and Sutskever, I. (2023), Consistency models, in 'International Conference on Machine Learning'.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S. and Poole, B. (2021), Score-based generative modeling through stochastic differential equations, in 'International Conference on Learning Representations (ICLR)'.

Tang, R. and Yang, Y. (2024), Adaptivity of diffusion models to manifold structures, in 'Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS)', Vol. 238 of Proceedings of Machine Learning Research, PMLR, pp. 1908–1916.

The 1000 Genomes Project Consortium (2015), 'A global reference for human genetic variation', Nature 526(7571), 68–74.

Villani, C. (2009), Optimal Transport: Old and New, Vol. 338, Springer.

Vincent, P. (2011), 'A connection between score matching and denoising autoencoders', Neural Computation 23(7), 1661–1674.

von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S. and Wolf, T. (2022), 'Diffusers: State-of-the-art diffusion models', https://github.com/huggingface/diffusers.

Yelmen, B., Decelle, A., Boulos, L. L., Szatkownik, A., Furtlehner, C., Charpiat, G. and Jay, F. (2023), 'Deep convolutional and conditional neural networks for large-scale genomic data generation', PLOS Computational Biology 19(10), e1011584.

Zhang, K., Yin, C. H., Liang, F. and Liu, J. (2024), Minimax optimality of score-based diffusion models: beyond the density lower bound assumptions, in 'Proceedings of the 41st International Conference on Machine Learning', pp. 60134–60178.

Zheng, K., Lu, C., Chen, J. and Zhu, J. (2023), DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics, in 'Advances in Neural Information Processing Systems (NeurIPS)'.
