Manifold-Aligned Generative Transport


Authors: Xinyu Tian, Xiaotong Shen

Abstract

High-dimensional generative modeling is fundamentally a manifold-learning problem: real data concentrate near a low-dimensional structure embedded in the ambient space. Effective generators must therefore balance support fidelity (placing probability mass near the data manifold) with sampling efficiency. Diffusion models often capture near-manifold structure but require many iterative denoising steps and can leak off-support; normalizing flows sample in one pass but are limited by invertibility and dimension preservation. We propose MAGT (Manifold-Aligned Generative Transport), a flow-like generator that learns a one-shot, manifold-aligned transport from a low-dimensional base distribution to the data space. Training is performed at a fixed Gaussian smoothing level, where the score is well-defined and numerically stable. We approximate this fixed-level score using a finite set of latent anchor points with self-normalized importance sampling, yielding a tractable objective. MAGT samples in a single forward pass, concentrates probability near the learned support, and induces an intrinsic density with respect to the manifold volume measure, enabling principled likelihood evaluation for generated samples. We establish finite-sample Wasserstein bounds linking smoothing level and score-approximation accuracy to generative fidelity, and empirically improve fidelity and manifold concentration across synthetic and benchmark datasets while sampling substantially faster than diffusion models.

Keywords: Manifold learning, Diffusion, Flows, High fidelity, Synthetic data generation.

1 Introduction

Modern generative modeling is characterized by a trade-off between fidelity and efficiency.
Diffusion models can produce highly realistic samples but typically rely on iterative denoising at inference time, which makes generation expensive even with improved solvers and distillation (Dhariwal and Nichol, 2021; Karras et al., 2022; Rombach et al., 2022; Song et al., 2023; Salimans and Ho, 2022; Lu et al., 2022). On the other hand, normalizing flows enable single-pass sampling and tractable likelihoods via change-of-variables training and invertible architectures (Dinh et al., 2017; Kingma and Dhariwal, 2018; Papamakarios et al., 2021). Continuous-time, transport-based formulations, including probability flow, flow matching, rectified flow, and stochastic interpolants, help bridge these paradigms by casting generation as transport, often reducing the number of function evaluations needed for sampling (Song, Sohl-Dickstein, Kingma, Kumar, Ermon and Poole, 2021; Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2023). Nonetheless, the most efficient flow constructions remain constrained by invertibility and dimension preservation, while diffusion-based samplers require multiple evaluations at inference time (Kobyzev et al., 2021; Papamakarios et al., 2021).

∗ This work was supported in part by the National Science Foundation (NSF) under Grant DMS-2513668, by the National Institutes of Health (NIH) under Grants R01AG069895, R01AG065636, R01AG074858, and U01AG073079, and by the Minnesota Supercomputing Institute. (Corresponding author: Xiaotong Shen.)
† Xinyu Tian is with the School of Statistics, University of Minnesota, MN 55455 USA (email: tianx@umn.edu).
‡ Xiaotong Shen is with the School of Statistics, University of Minnesota, MN 55455 USA (email: xshen@umn.edu).

The limitations of existing approaches become most acute in the manifold regime, where data concentrate near a low-dimensional set embedded in a high-dimensional ambient space.
This setting is common for images, biological measurements, and learned feature embeddings. When probability mass lies near a thin support, ambient-space modeling can waste capacity in directions orthogonal to the data support and may lead to off-manifold leakage, miscalibrated likelihoods, or unreliable uncertainty estimates and out-of-distribution behavior (Nalisnick et al., 2019; Kirichenko et al., 2020; Ren et al., 2019). Geometry-aware generative methods aim to address these issues by incorporating manifold structure into training, but accurately capturing the relevant geometry while maintaining scalability and stable optimization remains challenging (De Bortoli et al., 2022; Huang, Aghajohari, Bose, Panangaden and Courville, 2022).

We introduce MAGT (Manifold-Aligned Generative Transport), a flow-inspired framework designed to reconcile high fidelity with one-shot sampling in the manifold regime. The method trains at a fixed level of Gaussian smoothing in the ambient space, where the perturbed data distribution has a well-defined density and score. A central posterior identity shows that the smoothed score is determined by the clean sample averaged under the posterior given a noisy observation. Building on classical connections between score matching and denoising (Hyvärinen, 2005; Vincent, 2011), MAGT approximates this conditional mean using a finite collection of latent anchors together with self-normalized importance sampling. The anchor approximation can be instantiated with standard Monte Carlo, quasi-Monte Carlo variance reduction, or Laplace-based proposals, yielding a practical score estimator and an end-to-end single-level denoising score-matching objective.
On the theoretical side, we establish a new single-level pull-back inequality that translates a squared score discrepancy between two smoothed distributions at a fixed noise level into a Wasserstein error bound between their corresponding unsmoothed generators. This result highlights the roles of smoothing and underlying manifold geometry in determining generation error. Building on this inequality, we combine it with a finite-sample complexity analysis of fixed-level score-matching risk minimization to obtain nonasymptotic generation bounds whose rates depend on the intrinsic dimension and explicitly quantify the anchor approximation error.

Empirically, experiments on synthetic manifolds as well as image and tabular benchmarks demonstrate that MAGT outperforms diffusion baselines in fidelity across all reported settings and uniformly outperforms GANs. Relative to flow matching, MAGT matches or improves fidelity on the synthetic-manifold suite and is best on three of four real benchmarks (MNIST, Superconduct, Genomes); CIFAR10-0 (airplanes) is the only case where flow matching attains a lower FID. These gains come with one-shot sampling, while simultaneously improving support concentration and substantially reducing inference-time function evaluations.

Our contributions are as follows.

1) Methodology: We introduce a non-invertible transport $h : \mathbb{R}^d \to \mathbb{R}^D$ tailored to the manifold regime, trained at a fixed Gaussian smoothing level via a posterior score identity. A finite set of latent anchors combined with self-normalized importance sampling yields a practical single-level denoising score-matching objective. The learned $h$ enables one-shot sampling and induces an intrinsic density on its image (with respect to the $d$-dimensional Hausdorff measure) that is computable under mild regularity conditions; see Table 1.

2) Theory: We prove (i) a new single-level pull-back inequality that converts fixed-level score error into Wasserstein generation error, and (ii) an excess-risk bound for our fixed-level score-matching risk minimization with finite anchors via bracketing entropy. Together, these yield finite-sample generation rates that depend on the intrinsic dimension and explicitly track smoothing and manifold geometry; see Table 2.

3) Algorithms: Practical Monte Carlo, quasi-Monte Carlo, and Laplace-based proposals for anchor selection within a unified training objective.

4) Evidence: Empirical results on synthetic and real image/tabular data indicate consistent fidelity gains over diffusion baselines (all benchmarks) and GANs (all benchmarks); relative to flow matching, MAGT improves or matches fidelity on synthetic manifolds and is best on MNIST, Superconduct, and Genomes, with CIFAR10-0 the only benchmark where flow matching is clearly better in FID. In tabular settings, these gains are substantial: MAGT reduces $W_2$ by 74.9% on Superconduct and 43.6% on Genomes relative to DDIM (Table 5). Under one-shot sampling, MAGT also substantially improves concentration near the data support and reduces inference-time network evaluations by orders of magnitude.

The remainder of the paper is organized as follows. Section 2 introduces the MAGT framework, including the transport-based score identity, the resulting one-shot sampler, and practical considerations for likelihood evaluation. Section 3 develops nonasymptotic risk bounds that relate single-level score estimation error to Wasserstein generation accuracy. Section 4 discusses practical Monte Carlo and quasi-Monte Carlo schemes for approximating the conditional expectations that appear in the MAGT score estimator. Section 5 describes practical implementation details for training MAGT, including a memory-efficient update for large anchor banks.
Section 6 presents empirical results on synthetic manifolds and real image/tabular datasets. Section 7 concludes with a brief discussion. The Appendix contains proofs, auxiliary lemmas, and additional experimental and implementation details.

2 MAGT: Manifold-aligned generative transport

2.1 Dimension alignment via perturbation

Consider generative modeling in which observations $Y_0 \in \mathbb{R}^D$ concentrate near a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ of intrinsic dimension $d \ll D$. Our goal is to learn a deterministic transport (generator) map $h : \mathbb{R}^d \to \mathbb{R}^D$ such that, for a latent variable $U$ drawn from a base distribution with density $\pi$ (e.g., a standard Gaussian on $\mathbb{R}^d$ or a uniform distribution on $[0,1]^d$), the generated sample $Y_0 = h(U)$ follows the data distribution $p_{Y_0}$. Importantly, $h$ need not be invertible, and the latent and data dimensions may differ, which is essential when the target distribution concentrates on or near a lower-dimensional manifold.

A key obstacle is that if $p_{Y_0}$ is supported on a manifold, it can be singular with respect to Lebesgue measure on $\mathbb{R}^D$, so an ambient density and score for $Y_0$ may be ill-defined. MAGT resolves this by working at a fixed smoothing level $t$: we add Gaussian noise in the ambient space so that the corrupted variable $Y_t$ has an everywhere-positive density and a well-defined score $\nabla_{y_t} \log p_{Y_t}(y_t)$. Crucially, this smoothed score admits a posterior/mixture representation in terms of $h$ and the base distribution $\pi$, which we approximate with a finite set of latent "anchors" and then convert into a one-shot transport map.

Ambient Gaussian perturbations. We introduce a noise schedule $(\alpha_t, \sigma_t)_{t \in [0,1]}$ and define, for each $t$, a perturbed observation
\[
Y_t = \alpha_t Y_0 + \sigma_t Z_t, \qquad Z_t \sim \mathcal{N}(0, I_D). \tag{1}
\]
This construction defines a dimension-preserving Gaussian corruption of $Y_0$ directly in the ambient space.
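The corruption in (1) can be sketched in a few lines. A minimal illustration, assuming the variance-preserving schedule $\sigma_t^2 = t$, $\alpha_t = \sqrt{1-t}$ that the paper adopts later in Section 3; the function names are illustrative, not from the paper:

```python
import numpy as np

def vp_schedule(t):
    """Variance-preserving schedule: alpha_t = sqrt(1 - t), sigma_t = sqrt(t)."""
    return np.sqrt(1.0 - t), np.sqrt(t)

def perturb(y0, t, rng):
    """Ambient Gaussian corruption of Eq. (1): Y_t = alpha_t * Y_0 + sigma_t * Z."""
    alpha_t, sigma_t = vp_schedule(t)
    z = rng.standard_normal(y0.shape)
    return alpha_t * y0 + sigma_t * z

rng = np.random.default_rng(0)
y0 = rng.standard_normal((4, 10))   # four points in ambient dimension D = 10
yt = perturb(y0, t=0.25, rng=rng)   # smoothed observations at fixed level t
```

Under this schedule $\alpha_t^2 + \sigma_t^2 = 1$ for every $t$, and $t = 0$ returns the clean data unchanged, consistent with the role of $t$ as a smoothing level rather than a diffusion time.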
It provides a probabilistic link between the distribution of $Y_t$ in $\mathbb{R}^D$ and that of the clean data $Y_0$, which may be supported on a $d$-dimensional manifold with $d \le D$. We use only the marginal Gaussian corruption in (1); no underlying SDE or diffusion dynamics are assumed.

2.2 Score matching and generator

MAGT uses the perturbed data $Y_t$ to define a score-matching objective that learns $h$ via $\nabla_{y_t} \log p(y_t)$ in the ambient space, linking the noisy observation $Y_t$ to its clean counterpart $Y_0 = h(U)$. The noise level is controlled by $t$ through the schedule $(\alpha_t, \sigma_t)$.

From (1) and $Y_0 = h(U)$, the conditional density of $Y_t$ given $U = u$ is $p_t(y_t \mid u) = \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$, where $\phi(y; m, \Sigma)$ denotes the density of $\mathcal{N}(m, \Sigma)$ and $\pi$ is the base density of $U$. The marginal density of $Y_t$ is the continuous mixture
\[
p_t(y_t) = \int \phi\big(y_t; \alpha_t h(u), \sigma_t^2 I_D\big)\, \pi(u)\, du.
\]
Differentiating $\log p_t(y_t)$ yields the mixture score
\[
\nabla_{y_t} \log p_t(y_t) = \mathbb{E}\big[\nabla_{y_t} \log p_t(y_t \mid U) \mid Y_t = y_t\big] = \frac{1}{\sigma_t^2}\big(\alpha_t\, \mathbb{E}[h(U) \mid Y_t = y_t] - y_t\big), \tag{2}
\]
where $\nabla_{y_t} \log p_t(y_t \mid u) = -(y_t - \alpha_t h(u))/\sigma_t^2$ and the conditional expectation is under $p(u \mid y_t) \propto \pi(u)\, p_t(y_t \mid u)$. This identity highlights that the mixture score depends on the posterior mean $\mathbb{E}[h(U) \mid y_t]$, which encodes the geometry of the latent space and the generator $h$.

Transport-based score estimator. In (2), score estimation at noise level $t$ requires computing the posterior mean $\mathbb{E}[h(U) \mid y_t]$ under $p(u \mid y_t) \propto \pi(u)\, \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$. We approximate this conditional expectation using self-normalized importance sampling. Specifically, let $\tilde{\pi}(\cdot \mid y_t)$ be a proposal distribution on the latent space, possibly depending on $y_t$. We draw $U^{(1)}, \ldots, U^{(K)} \overset{\text{i.i.d.}}{\sim} \tilde{\pi}(\cdot \mid y_t)$ and form the unnormalized importance weights
\[
\omega_t^{(j)}(y_t) := \frac{\pi\big(U^{(j)}\big)}{\tilde{\pi}\big(U^{(j)} \mid y_t\big)}\, \phi\big(y_t; \alpha_t h\big(U^{(j)}\big), \sigma_t^2 I_D\big). \tag{3}
\]
Then the posterior mean $\mathbb{E}[h(U) \mid y_t]$ is approximated by
\[
\tilde{m}_{t,K}(y_t) := \frac{\sum_{j=1}^K \omega_t^{(j)}(y_t)\, h\big(U^{(j)}\big)}{\sum_{j=1}^K \omega_t^{(j)}(y_t)},
\]
and we define the corresponding transport-based score estimator
\[
\tilde{s}_{t,K}(y_t; h, \pi, \tilde{\pi}) := \frac{1}{\sigma_t^2}\big(\alpha_t\, \tilde{m}_{t,K}(y_t) - y_t\big). \tag{4}
\]
When $\tilde{\pi}(\cdot \mid y_t) \equiv \pi(\cdot)$ (i.e., we sample anchors from the generative base), the importance ratio cancels and $\omega_t^{(j)}(y_t) \propto \phi(y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D)$, recovering the finite-mixture form. This estimator is "transport-based" because it is an explicit functional of the learned map $h$ and the base distribution, with the conditional expectation in (2) approximated by a weighted set of anchors $\{U^{(j)}\}_{j=1}^K$.

Choice of proposal distribution $\tilde{\pi}$. The base distribution $\pi$ specifies the generative model: draw $U \sim \pi$ and set $Y_0 = h(U)$. The proposal $\tilde{\pi}$ in (3) is purely a computational device for approximating $\mathbb{E}[h(U) \mid y_t]$: as long as $\tilde{\pi}(\cdot \mid y_t)$ has support covering the high-density regions of the posterior and the importance ratio $\pi/\tilde{\pi}$ is included, the estimator in (4) is consistent (and asymptotically unbiased) for the true posterior mean.

For small $\sigma_t$ (or for expressive $h$ in high ambient dimension), the latent posterior $p(u \mid y_t) \propto \pi(u)\, \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ can be sharply concentrated: only a tiny subset of latent points produces $\alpha_t h(u)$ close to the observed $y_t$. If we draw anchors from the prior $\pi$, most samples receive negligible likelihood weight, leading to a low effective sample size and a high-variance estimate of $\mathbb{E}[h(U) \mid y_t]$ (and therefore of the score).
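The estimator in (3)-(4) is short to implement. A minimal sketch, assuming the proposal equals the base, $\tilde{\pi} = \pi = \mathcal{N}(0, I_d)$, so the importance ratio cancels and the unnormalized log-weight is the Gaussian log-likelihood; the constant generator used in the demo is a deliberately degenerate stand-in, not the paper's model:

```python
import numpy as np

def snis_score(y_t, h, alpha_t, sigma_t, K, rng, d):
    """Self-normalized importance-sampling score estimator (Eqs. 3-4),
    with anchors drawn from the base pi = N(0, I_d)."""
    U = rng.standard_normal((K, d))               # anchors U^(j) ~ pi
    hU = h(U)                                     # (K, D) generator outputs
    log_w = -np.sum((y_t - alpha_t * hU) ** 2, axis=1) / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())               # stabilized, self-normalized below
    m = (w[:, None] * hU).sum(axis=0) / w.sum()   # posterior-mean estimate m~_{t,K}
    return (alpha_t * m - y_t) / sigma_t ** 2     # score estimate (Eq. 4)

rng = np.random.default_rng(1)
c = np.array([1.0, -2.0, 0.5])
h_const = lambda U: np.tile(c, (U.shape[0], 1))   # degenerate generator h(u) = c
score = snis_score(np.zeros(3), h_const, alpha_t=0.9, sigma_t=0.5, K=64, rng=rng, d=2)
```

For a constant generator the posterior mean is exactly $c$, so the estimator returns $(\alpha_t c - y_t)/\sigma_t^2$ with no Monte Carlo error, which gives a quick correctness check; working with log-weights and subtracting the maximum before exponentiating is the standard guard against underflow when the posterior is sharply concentrated.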
A proposal $\tilde{\pi}(\cdot \mid y_t)$ that better matches the posterior geometry yields more balanced weights, improving numerical stability and reducing Monte Carlo variance without changing the underlying model. Section 4 presents practical choices that plug directly into (3)-(4): (i) MAGT-MC uses $\tilde{\pi} = \pi$ (baseline sampling); (ii) MAGT-QMC replaces i.i.d. draws from $\pi$ with low-discrepancy point sets to reduce integration error; and (iii) MAGT-MAP uses a data-dependent Gaussian proposal $\tilde{\pi}(\cdot \mid y_t) = q(\cdot \mid y_t)$ obtained from a MAP-Laplace approximation to the posterior. All three choices estimate the same quantity $\mathbb{E}[h(U) \mid y_t]$; they differ only in how efficiently they approximate it.

Training loss. Given the transport-based score estimator (4), we estimate the transport map $h$ by minimizing a single-level denoising score-matching objective. Specifically, for each training sample $y_0^i \sim p_{Y_0}$, we draw $z^i \sim \mathcal{N}(0, I_D)$ and construct the perturbed observation $y_t^i = \alpha_t y_0^i + \sigma_t z^i$. The per-sample loss is
\[
\ell_K(y_t, y_0; h) := \big\| \tilde{s}_{t,K}(y_t; h, \pi, \tilde{\pi}) - \nabla_{y_t} \log p(y_t \mid y_0) \big\|_2^2, \tag{5}
\]
where $p(y_t \mid y_0)$ is the normal density for $\mathcal{N}(\alpha_t y_0, \sigma_t^2 I_D)$. Given $(y_t^i, y_0^i)_{i=1}^n$, we then solve the empirical risk minimization problem
\[
L_{n,K}(h) := \frac{1}{n} \sum_{i=1}^n \ell_K\big(y_t^i, y_0^i; h\big), \qquad \hat{h}_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h), \tag{6}
\]
over a prescribed hypothesis class $\mathcal{H}$ (e.g., ReLU neural networks). When $\mathcal{H}$ is instantiated by a neural network family $\{h_\theta : \theta \in \Theta\}$, we equivalently optimize over parameters $\theta$ and obtain $\hat{\theta}_\lambda \in \arg\min_{\theta \in \Theta} L_{n,K}(h_\theta)$, with the learned transport defined as $\hat{h}_\lambda := h_{\hat{\theta}_\lambda}$. (We suppress the dependence on $\hat{\theta}_\lambda$ in the theory and write $\hat{h}_\lambda$ for the learned function.)
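The objective (5)-(6) can be exercised end to end on a toy problem. A minimal sketch, assuming a one-dimensional linear family $h_\theta(u) = \theta u$, data generated by $h^*(u) = 2u$, the VP schedule at $t = 0.25$, and a shared anchor bank with $\tilde{\pi} = \pi$; evaluating the empirical risk on a grid of $\theta$ stands in for the gradient-based optimization the paper would use:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_t, sigma_t = np.sqrt(0.75), np.sqrt(0.25)     # VP schedule at t = 0.25

def snis_score_1d(y_t, theta, anchors):
    """Score estimator (4) for the linear toy generator h_theta(u) = theta * u."""
    hU = theta * anchors
    log_w = -(y_t - alpha_t * hU) ** 2 / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())
    m = (w * hU).sum() / w.sum()
    return (alpha_t * m - y_t) / sigma_t ** 2

def empirical_risk(theta, y0, yt, anchors):
    """Objective (5)-(6): mean squared gap between the transport-based score
    and the conditional target -(y_t - alpha_t * y_0) / sigma_t^2."""
    total = 0.0
    for y0_i, yt_i in zip(y0, yt):
        target = -(yt_i - alpha_t * y0_i) / sigma_t ** 2
        total += (snis_score_1d(yt_i, theta, anchors) - target) ** 2
    return total / len(y0)

# Data from the true transport h*(u) = 2u; the risk should be smallest near theta = 2.
n, K = 4000, 256
u = rng.standard_normal(n)
y0 = 2.0 * u
yt = alpha_t * y0 + sigma_t * rng.standard_normal(n)
anchors = rng.standard_normal(K)                    # anchor bank U^(j) ~ pi
risks = {th: empirical_risk(th, y0, yt, anchors) for th in (0.5, 2.0, 6.0)}
```

Because the conditional target in (5) is shared across candidate $\theta$ on a fixed dataset, its noise largely cancels when risks are compared, so the well-specified $\theta = 2$ attains a visibly smaller empirical risk than badly misspecified alternatives.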
Here $\lambda := (t, d, K)$ collects the tuning parameters: $(t, d)$ govern the bias-variance trade-off of the estimator, while $K$ controls the accuracy of the Monte Carlo approximation used in $\ell_K$. The tuning parameter $\hat{\lambda} = (\hat{t}, \hat{d}, \hat{K})$ is selected by cross-validation: we choose $\lambda$ to minimize a validation generative criterion (e.g., an estimated Wasserstein distance) computed on an independent validation set, and we report the resulting generator $\hat{h}_{\hat{\lambda}}$.

At the population level, the conditional-score target is unbiased for the marginal score at time $t$ because $\mathbb{E}[\nabla_{Y_t} \log p(Y_t \mid Y_0) \mid Y_t] = \nabla_{Y_t} \log p_{Y_t}(Y_t)$. This identity motivates the denoising score-matching objective in (6); the theory in Section 4 makes the dependence on the finite-anchor approximation explicit.

Sample generation. Given the selected generator $\hat{h}_{\hat{\lambda}}$, we generate new samples by drawing $u \sim \pi$ and pushing it forward through the learned transport, $\tilde{y}_0 = \hat{h}_{\hat{\lambda}}(u)$, where $\hat{\lambda}$ is selected via cross-validation. Thus, $\hat{h}_{\hat{\lambda}}$ provides a one-pass sampler for the target distribution and, together with the anchor bank, defines the transport-based score estimator in (4).

Intrinsic density and likelihood evaluation. Assume $h : \mathbb{R}^d \to \mathbb{R}^D$ is $C^1$ and has rank $d$ almost everywhere, and let $\mathcal{M} := h(\mathcal{U})$ denote its image over a latent domain $\mathcal{U} \subset \mathbb{R}^d$ that contains the support of $\pi$. Assume further that $\pi$ admits a density on $\mathcal{U}$ with respect to Lebesgue measure. By the classical area formula, the pushforward measure $h_\# \pi$ is absolutely continuous with respect to the $d$-dimensional Hausdorff measure $\mathcal{H}^d$ restricted to $\mathcal{M}$, and for $\mathcal{H}^d$-a.e. $y_0 \in \mathcal{M}$,
\[
p_{\mathcal{M}}(y_0) = \sum_{u \in h^{-1}(\{y_0\})} \frac{\pi(u)}{|J_h(u)|}, \qquad \text{where } |J_h(u)| = \sqrt{\det\big(J_h(u)^\top J_h(u)\big)}. \tag{7}
\]
Here $p_{\mathcal{M}}$ is the intrinsic density of $Y_0 = h(U)$ with respect to $\mathcal{H}^d|_{\mathcal{M}}$, $J_h(u) \in \mathbb{R}^{D \times d}$ is the Jacobian, and $|J_h(u)|$ is the $d$-dimensional Jacobian determinant.
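A single branch of the sum in (7) contributes $\log \pi(u) - \tfrac{1}{2} \log \det(J_h(u)^\top J_h(u))$ in log form, which can be checked numerically. A minimal sketch, assuming the unit-circle generator $h(u) = (\cos u, \sin u)$ with $d = 1$, $D = 2$, where $J_h^\top J_h = 1$ and the intrinsic log-density therefore equals $\log \pi(u)$; the finite-difference Jacobian is purely illustrative:

```python
import numpy as np

def numerical_jacobian(h, u, eps=1e-6):
    """Central finite-difference Jacobian J_h(u) in R^{D x d}."""
    u = np.asarray(u, dtype=float)
    d, D = u.size, h(u).size
    J = np.zeros((D, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        J[:, k] = (h(u + e) - h(u - e)) / (2 * eps)
    return J

def branch_log_density(h, u, log_pi):
    """Branchwise term of (7): log pi(u) - 0.5 * log det(J_h(u)^T J_h(u))."""
    J = numerical_jacobian(h, u)
    _, logdet = np.linalg.slogdet(J.T @ J)
    return log_pi(u) - 0.5 * logdet

h_circle = lambda u: np.array([np.cos(u[0]), np.sin(u[0])])
log_pi_uniform = lambda u: -np.log(2 * np.pi)   # uniform base on [0, 2*pi)
val = branch_log_density(h_circle, np.array([0.7]), log_pi_uniform)
```

Since the circle chart has $\det(J_h^\top J_h) = \sin^2 u + \cos^2 u = 1$, the computed value agrees with $\log \pi(u) = -\log 2\pi$ up to finite-difference error.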
If $h$ is injective on $\mathcal{U}$ (e.g., a $C^1$ embedding), then $h^{-1}(\{y_0\})$ is a singleton for $\mathcal{H}^d$-a.e. $y_0 \in \mathcal{M}$, and (7) reduces to the familiar chart formula
\[
\log p_{\mathcal{M}}\big(h(u)\big) = \log \pi(u) - \tfrac{1}{2} \log \det\big(J_h(u)^\top J_h(u)\big). \tag{8}
\]
More generally, if $h$ has bounded multiplicity, the sum in (7) contains finitely many terms; evaluating $p_{\mathcal{M}}(y_0)$ requires identifying and summing all contributing preimages. Equation (8) gives the branchwise chart contribution associated with a specific preimage $u$. For generated samples $y_0 = h(u)$, this branchwise log-density is directly computable from $(u, h(u))$ and coincides with $\log p_{\mathcal{M}}(y_0)$ whenever the fiber is a singleton (in particular, under injectivity). If $h$ is not injective and additional preimages exist, (8) should be interpreted as a local contribution unless the remaining preimages are recovered and included in (7). For an observed point $y_0 \in \mathcal{M}$, evaluating $p_{\mathcal{M}}(y_0)$ requires identifying one or more latent preimages solving $h(u) = y_0$ (or approximately minimizing $\|h(u) - y_0\|_2$); this can be done via a separate encoder or numerical optimization when needed, but is not required for sampling.

Finally, note that an ambient-space density for $Y_0$ does not exist when its law is supported on a manifold; at any fixed smoothing level $t > 0$ with $\sigma_t > 0$, however, the corrupted law admits the mixture representation $p_{Y_t}(y) = \int \phi(y; \alpha_t h(u), \sigma_t^2 I_D)\, \pi(u)\, du$, which can be estimated using the same anchor bank employed for score approximation.

2.3 Comparisons with diffusion and flow models

This section evaluates MAGT against diffusion- and flow-based baselines across several practical dimensions, including sampling cost, support alignment, likelihood accessibility, computational footprint, and statistical guarantees.
As shown in Sections 2.1-2.2, the fixed-$t$ training scheme in MAGT yields an ambient-space mixture representation of $Y_t$ (so the smoothed density at level $t$ is Monte Carlo-estimable) and induces an intrinsic density on the learned manifold. This combination bridges flow-style density evaluation on the support with the computational efficiency of one-shot sampling. Table 1 highlights that MAGT combines one-shot sampling, support alignment to a thin manifold, and intrinsic densities on the learned support, together with a Monte Carlo route to smoothed ambient likelihoods at the training noise level.

Sampling cost. MAGT generates samples in a single forward evaluation of the transport map $h$, as in flow models. Diffusion models, by contrast, generate samples by numerically integrating a reverse-time stochastic differential equation (SDE) or ordinary differential equation (ODE) through sequential denoising steps, often requiring tens to thousands of neural network evaluations per sample, making them substantially slower without distillation (Lu et al., 2022; Karras et al., 2022). Moreover, because this model relies on time discretization, reducing the number of steps increases the discretization (solver) error, which vanishes only as the number of sequential denoising steps increases or higher-order solvers are used (Chen et al., 2023; Zheng et al., 2023). This makes MAGT attractive for interactive or streaming use without distillation.

Support alignment. MAGT trains at a fixed smoothing level $t$, concentrating probability mass near the data manifold and avoiding the boundary bias that arises from averaging across noise scales in diffusion. Its finite-mixture approximation further emphasizes anchors that best explain each observation while down-weighting off-manifold ones, improving boundary fidelity. This aspect is confirmed by the experiment in Section 6.
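The Monte Carlo route to the smoothed ambient likelihood mentioned above is a direct average of Gaussian kernels over the anchor bank. A minimal sketch, assuming the identity generator in one dimension, where $p_{Y_t} = \mathcal{N}(0, \alpha_t^2 + \sigma_t^2)$ is available in closed form for comparison; this toy choice is illustrative only:

```python
import numpy as np

def smoothed_density_mc(y, h, alpha_t, sigma_t, K, rng, d):
    """Anchor Monte Carlo estimate of the smoothed ambient density
    p_{Y_t}(y) = E_{U ~ pi}[ phi(y; alpha_t * h(U), sigma_t^2 I_D) ]."""
    U = rng.standard_normal((K, d))       # anchors from the base pi = N(0, I_d)
    hU = h(U)                             # (K, D)
    D = hU.shape[1]
    sq = np.sum((y - alpha_t * hU) ** 2, axis=1)
    log_phi = -0.5 * sq / sigma_t ** 2 - 0.5 * D * np.log(2 * np.pi * sigma_t ** 2)
    return np.exp(log_phi).mean()

# Identity generator with d = D = 1: Y_t ~ N(0, alpha_t^2 + sigma_t^2) exactly.
rng = np.random.default_rng(2)
alpha_t, sigma_t = np.sqrt(0.75), np.sqrt(0.25)   # VP schedule at t = 0.25
est = smoothed_density_mc(np.array([0.3]), lambda U: U, alpha_t, sigma_t, 200_000, rng, d=1)
exact = np.exp(-0.3 ** 2 / 2) / np.sqrt(2 * np.pi)  # N(0, 1) density, since variance = 1
```

With a couple of hundred thousand anchors the estimate agrees with the closed-form density to well under a percent here; in higher ambient dimension the same average should be accumulated in log space to avoid underflow.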
Table 1: Comparison of MAGT with diffusion models and normalizing flows. MAGT achieves strong manifold alignment and high fidelity with single-pass sampling, avoiding diffusion's long chains and flow invertibility. NFE denotes the number of function evaluations during sampling.

| | MAGT | Diffusion / Flow matching | Normalizing flows |
|---|---|---|---|
| Training | Matching loss at fixed $t$ | Time-averaged loss over $t$ | MLE via change of variables |
| Sampling cost | One forward pass (NFE = 1) | NFE steps (NFE $\gg$ 1) | One inverse pass |
| Support | Manifold via fixed-$t$ smoothing | Near-manifold leakage | No measure-zero manifolds |
| Architecture | Non-invertible; dimensions may differ | Unconstrained | Invertible; dimensions must match |
| Likelihood | Intrinsic density on manifold via area formula (exact for embeddings; otherwise requires summing preimages); smoothed ambient density via anchor MC | Unnormalized; no tractable likelihood | Exact (ambient) |
| Failures | $t$ mis-specification | Boundary bias; high cost | Invertibility bottleneck; manifold mismatch |

Likelihoods on the support and at fixed smoothing. MAGT induces an intrinsic density on its image manifold via the area formula (7)-(8). In particular, for generated samples $y_0 = h(u)$ one can evaluate $\log p_{\mathcal{M}}(y_0)$ in closed form from $(u, J_h(u))$ when $h$ is locally injective. An ambient-space likelihood for $Y_0$ is not defined when the target law is manifold-supported; however, at the fixed smoothing level $t > 0$, the corrupted law has density $p_{Y_t}(y) = \int \phi(y; \alpha_t h(u), \sigma_t^2 I_D)\, \pi(u)\, du$, which can be approximated by the same anchor bank used for score estimation.

Computational footprint and architectural freedom. Because $h$ need not be invertible and the latent and data dimensions may differ, MAGT avoids the $D \times D$ Jacobian log-determinants and the coupling or triangular constraints that are standard in invertible architectures, particularly normalizing flows.
This architectural freedom reduces training overhead and facilitates scaling to high-dimensional embeddings. Importantly, neither training nor inference requires taking the limit $t \to 0$; this contrasts with many diffusion-based objectives, where score magnitudes can diverge as the noise level vanishes and may destabilize optimization.

Statistical guarantees. Our non-asymptotic risk analysis depends on the intrinsic dimension $d$ and the geometric regularity of the manifold, rather than on the ambient dimension $D$. This clarifies why MAGT remains data-efficient when observations are high-dimensional but effectively low-dimensional in geometry.

3 Theory: excess risk and generation fidelity

This section establishes a finite-sample bound for the one-shot generation accuracy of $\hat{h}_\lambda(U)$ with $U \sim \pi$ independent, measured by the 2-Wasserstein error $W_2(P_Y, P_{\tilde{Y}})$ with $\tilde{Y} = \hat{h}_\lambda(U)$ estimated at a noise level $t \in (0,1)$. The analysis decomposes into two ingredients. First, Theorem 1 (a pull-back inequality) converts a single-level score mismatch between the smoothed laws at level $t$ into a bound on $W_2$ at $t = 0$. Second, Theorem 2 controls the fixed-$t$ score-matching excess risk of the empirical minimizer $\hat{h}_\lambda$ of (6) via bracketing entropy. This result is an adaptation to our setting of classical bracketing-entropy arguments for (generalized) $M$-estimators, as in Shen and Wong (1994). Combining these two ingredients yields the generation-fidelity bound in Theorem 3. To the best of our knowledge, both the pull-back inequality in Theorem 1 and the resulting generation-fidelity bound in Theorem 3 are new.

3.1 Setup and geometric assumptions

Let $\mathcal{U} \subset \mathbb{R}^d$ be a bounded latent domain and let $\pi$ denote a base density on $\mathcal{U}$ with respect to Lebesgue measure.
We consider a manifold-supported data distribution that is well specified by an (unknown) transport map $h^* : \mathcal{U} \to \mathbb{R}^D$ with sufficient smoothness to define a regular $d$-dimensional image manifold (precise regularity is stated in Assumption 1). Define the target manifold $\mathcal{M}^* := h^*(\mathrm{supp}\, \pi) \subset \mathbb{R}^D$, where $\mathrm{supp}$ denotes support. Throughout, $\|\cdot\|$ denotes the Euclidean norm and $\|A\|_{\mathrm{op}} := \sup_{\|x\|=1} \|Ax\|$ denotes the operator norm. We impose the following regularity and geometric conditions.

Definition 1 (Hölder class). For $s > 0$, a bounded set $\mathcal{U} \subset \mathbb{R}^d$, and a radius $B > 0$, we write $C^s(\mathcal{U}, B)$ for the (vector-valued) Hölder ball of order $s$: the set of functions $h : \mathcal{U} \to \mathbb{R}^D$ whose derivatives up to order $\lfloor s \rfloor$ exist and are bounded, and whose $\lfloor s \rfloor$-th derivative is $(s - \lfloor s \rfloor)$-Hölder with Hölder seminorm at most $B$. In particular, when $s \in (0, 1]$ this reduces to the condition $\|h(u) - h(v)\|_2 \le B \|u - v\|^s$ for all $u, v \in \mathcal{U}$, together with $\sup_{u \in \mathcal{U}} \|h(u)\|_2 \le B$.

Definition 2 (Reach and tubular neighborhood). For a closed set $\mathcal{M} \subset \mathbb{R}^D$, its reach $\mathrm{reach}(\mathcal{M}) \in [0, \infty]$ is the largest $r$ such that every point $x$ with $\mathrm{dist}(x, \mathcal{M}) < r$ has a unique nearest-point projection $\Pi_{\mathcal{M}}(x) \in \mathcal{M}$. Equivalently, the open tube $T_r(\mathcal{M}) := \{x \in \mathbb{R}^D : \mathrm{dist}(x, \mathcal{M}) < r\}$ admits a well-defined projection map $x \mapsto \Pi_{\mathcal{M}}(x)$.

Assumption 1 (Regular transport class). The true transport $h^*$ lies in the Hölder smoothness class $C^{\eta+1}(\mathcal{U}, B)$ over a bounded latent domain $\mathcal{U} \subset \mathbb{R}^d$. Set $\gamma := \min(1, \eta)$. There exist constants $0 < m \le M < \infty$ and $H_\gamma < \infty$, and a regular subset $\mathcal{H}_{\mathrm{reg}} \subseteq \mathcal{H}$ with $h^* \in \mathcal{H}_{\mathrm{reg}}$, such that the following hold for every $h \in \mathcal{H}_{\mathrm{reg}}$:

(i) (Full rank and conditioning) The Jacobian $J_h(u) \in \mathbb{R}^{D \times d}$ has rank $d$ for all $u \in \mathcal{U}$ and its singular values lie in $[m, M]$.
(ii) ($C^{1,\gamma}$ regularity) $J_h$ is $\gamma$-Hölder with constant $H_\gamma$, i.e., $\|J_h(u) - J_h(v)\|_{\mathrm{op}} \le H_\gamma \|u - v\|^\gamma$ for all $u, v \in \mathcal{U}$.

In the theoretical results below, we assume the learned estimator $\hat{h}_\lambda$ belongs to $\mathcal{H}_{\mathrm{reg}}$; see Remark 1 for discussion.

Assumption 2 (Positive reach). There exists a constant $\rho_{\mathcal{M}} > 0$ such that the image manifold $\mathcal{M}^* := h^*(\mathrm{supp}\, \pi)$ has reach at least $\rho_{\mathcal{M}}$. Moreover, every $h \in \mathcal{H}_{\mathrm{reg}}$ has an image manifold $h(\mathrm{supp}\, \pi)$ with reach at least $\rho_{\mathcal{M}}$.

Remark 1. Assumptions 1-2 impose uniform chart regularity (full-rank, well-conditioned Jacobians) and a positive reach in order to control tubular neighborhoods and justify the Hessian bounds that underlie the single-level pull-back analysis in Section 3. They are stated as uniform conditions over the restricted set $\mathcal{H}_{\mathrm{reg}}$, but they can be localized: both the smallest singular value of $J_h$ and the reach of $h(\mathrm{supp}\, \pi)$ are stable under sufficiently small $C^1$ perturbations of $h$ on the bounded domain $\mathcal{U}$. Consequently, it is enough for the estimator $\hat{h}_\lambda$ to lie in a $C^1$ neighborhood of $h^*$, which is consistent with the excess-risk control for large $n$. In practice, one can encourage membership in $\mathcal{H}_{\mathrm{reg}}$ via Jacobian-conditioning penalties (e.g., penalizing $\|J_h(u)^\top J_h(u) - I_d\|$ over sampled $u$), spectral normalization, and post hoc checks on a dense latent grid.

Definition 3 (Log-Sobolev constant). Let $\mu$ be a positive density with respect to Lebesgue measure. For $g \ge 0$ with $\int g\, d\mu < \infty$, define the entropy functional
\[
\mathrm{Ent}_\mu(g) := \int g \log\left(\frac{g}{\int g\, d\mu}\right) d\mu.
\]
We say that $\mu$ satisfies a logarithmic Sobolev inequality (LSI) with constant $C_{\mathrm{LSI}}(\mu) > 0$ if
\[
\mathrm{Ent}_\mu(f^2) \le 2\, C_{\mathrm{LSI}}(\mu) \int \|\nabla f\|_2^2\, d\mu
\]
for all smooth $f \not\equiv \mathrm{const}$; equivalently, the optimal constant is
\[
C_{\mathrm{LSI}}(\mu) := \sup_{f \not\equiv \mathrm{const}} \frac{\mathrm{Ent}_\mu(f^2)}{2 \int \|\nabla f(x)\|_2^2\, \mu(x)\, dx}.
\]

Assumption 3 (Smooth base density).
The base density $\pi$ is $C^2$ on $\mathcal{U}$ and its log-density has bounded Hessian: $\Lambda_2 := \sup_{u \in \mathcal{U}} \|\nabla_u^2 \log \pi(u)\|_{\mathrm{op}} < \infty$. Moreover, the latent density $\pi$ satisfies a log-Sobolev inequality with constant $C_{\mathrm{LSI}}(\pi) > 0$.

Under Assumptions 1-3, we assume, for simplicity, that both $U$ and $Y_0$ have bounded support. In fact, this assumption can be relaxed to a uniform tail-control condition (e.g., sub-Gaussian tails), with only minor modifications to the proof.

3.2 From a single-level score error to $W_2$ generation error

Consider the variance-preserving (VP) schedule $\sigma_t^2 = t$ and $\alpha_t = \sqrt{1-t}$ for a fixed $t \in (0,1)$ in (1). Define the smoothed variables
\[
Y_t := \alpha_t Y_0 + \sigma_t Z, \qquad \tilde{Y}_t := \alpha_t \tilde{Y}_0 + \sigma_t Z, \qquad Z \sim \mathcal{N}(0, I_D),
\]
where $Z$ is independent of $Y_0$ and of $\tilde{Y}_0 = \hat{h}_\lambda(U)$. Let $p_t$ denote the density of $Y_t$ and let $\tilde{p}_t$ denote the density of $\tilde{Y}_t$; both are smooth and everywhere positive for $t > 0$.

We measure the single-level mismatch by the squared error between the two denoisers (posterior means) under Gaussian corruption:
\[
\mathcal{E}_{\mathrm{MAG}}(t) := \mathbb{E}_{Y \sim p_t} \big\| m_{p,t}(Y) - m_{\tilde{p},t}(Y) \big\|_2^2, \tag{9}
\]
where $m_{p,t}(y) := \mathbb{E}[Y_0 \mid Y_t = y]$ and $m_{\tilde{p},t}(y) := \mathbb{E}[\tilde{Y}_0 \mid \tilde{Y}_t = y]$. For Gaussian corruption, Tweedie's formula gives
\[
m_{p,t}(y) = \frac{y + t\, \nabla \log p_t(y)}{\alpha_t}, \qquad m_{\tilde{p},t}(y) = \frac{y + t\, \nabla \log \tilde{p}_t(y)}{\alpha_t},
\]
and therefore
\[
\mathcal{E}_{\mathrm{MAG}}(t) = \frac{t^2}{1-t}\, \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p_t(Y) - \nabla \log \tilde{p}_t(Y) \big\|_2^2.
\]
In particular, the expectation on the right is the squared Fisher divergence $J(p_t \,\|\, \tilde{p}_t)$ (up to convention). Recall that for positive densities $q$ and $p$ on $\mathbb{R}^D$, the (relative) Fisher divergence is
\[
J(q \,\|\, p) := \int_{\mathbb{R}^D} \big\| \nabla \log q(y) - \nabla \log p(y) \big\|_2^2\, q(y)\, dy;
\]
see, e.g., Shen (1997). Theorem 1 shows that controlling $\mathcal{E}_{\mathrm{MAG}}(t)$ at a single noise level $t > 0$ controls the one-shot $W_2$ generation error.

Theorem 1 (Single-level pull-back bound).
Under Assumptions 1-3, suppose the VP noise level $t \in (0,1)$ lies in the tube regime
$$t \le t_{\max} := c_{tube}^2\, \rho_M^2, \qquad \theta_t := C^{(\gamma)}_N\, \frac{t^\gamma}{(1-t)^\gamma} < 1.$$
Then the one-shot generation error is controlled by the single-level mismatch:
$$W_2\big(P_{Y_0}, P_{\tilde Y_0}\big) \le C_{PB}(t)\, \sqrt{\mathcal{E}_{MAG}(t)}, \qquad (10)$$
where the pull-back constant is
$$C_{PB}(t) := \frac{\sqrt{1-t}}{t}\, \big( \Phi(t)\, \bar C_{LSI}(t) + \Psi(t) \big), \qquad \bar C_{LSI}(t) := \frac{(1-t) M^2 + t}{\min\{ C_{LSI}(\pi), 1 \}}.$$
Here
$$\Phi(t) := \frac{\exp\big(I_\gamma(t)\big)}{\sqrt{1-t}}, \qquad \Psi(t) := \frac{\Gamma(t)\, \exp\big(4 I_\gamma(t)\big)}{\sqrt{1-t}}, \qquad \Gamma(t) := -\log(1-t),$$
with
$$I_\gamma(t) := 2\gamma\, C^{(\gamma)}_T\, \frac{t^{\gamma/2}}{(1-t)^{1+\gamma/2}} + \frac{1}{\gamma}\, \frac{\big(C^{(\gamma)}_S\big)^2}{1-\theta_t}\, \frac{t^\gamma}{(1-t)^{1+\gamma}}.$$
The constants $C^{(\gamma)}_T$, $C^{(\gamma)}_S$, and $C^{(\gamma)}_N$ depend solely on the parameters $(m, M, H_\gamma, \Lambda_2, \rho_M)$. Moreover, for fixed problem constants and $t \le t_{\max}$, the dominant scaling is $C_{PB}(t) = O(t^{-1})$ as $t \downarrow 0$, and $C_{PB}(t) = O(1)$ if $t$ is bounded away from 0.

Theorem 1 provides a single-level bound on $W_2(P_{Y_0}, P_{\tilde Y_0})$ based on the score mismatch at a fixed noise level $0 < t < t_{\max}$, in contrast to diffusion analyses that integrate score errors over time. The result highlights a bias-stability trade-off in the choice of $t$: larger $t$ yields smoother densities and more stable score estimation but incurs greater smoothing bias, while smaller $t$ reduces bias at the cost of more concentrated posteriors and higher Monte Carlo variance. Practical strategies for mitigating Monte Carlo error at small $t$ are discussed in Section 4.

3.3 Learning the single-level score by empirical risk minimization

To connect Theorem 1 to the training objective, we make explicit the roles of (i) the finite-anchor approximation and (ii) the underlying population score-matching risk. Recall the finite-anchor loss $\ell_K$ in (5) and the empirical objective
$$L_{n,K}(h) := \frac{1}{n} \sum_{i=1}^n \ell_K\big(y_t^i, y_0^i; h\big), \qquad \hat h_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h),$$
as in (6).
Ideal (infinite-anchor) risk. For the statistical analysis, it is convenient to introduce the ideal (infinite-anchor) loss
$$\ell(y_t, y_0; h) := \big\| s_t(y_t; h) - \nabla_{y_t} \log p(y_t \mid y_0) \big\|_2^2, \qquad (11)$$
where $s_t(\cdot\,; h) := \nabla \log p^h_t(\cdot)$ is the score of the smoothed model induced by $h$, and $p^h_t$ denotes the density of $Y^h_t := \alpha_t h(U) + \sigma_t Z$ with $U \sim \pi$ and $Z \sim N(0, I_D)$. The corresponding population risk is $R(h) := \mathbb{E}\big[\ell(Y_t, Y_0; h)\big]$, where $(Y_t, Y_0)$ follow the data corruption model (1) with $Y_0 = h^*(U)$ and $U \sim \pi$. Define the excess population risk relative to the ground-truth map $h^*$ by
$$\rho^2(h^*, h) := R(h) - R(h^*) \ge 0, \qquad \rho(h^*, h) := \sqrt{\rho^2(h^*, h)}. \qquad (12)$$
If $h^* \notin \mathcal{H}$, the approximation error is $\inf_{h \in \mathcal{H}} \rho^2(h^*, h)$.

From score-matching risk to Fisher divergence and $\mathcal{E}_{MAG}(t)$. Let $p_t$ denote the smoothed data density of $Y_t = \alpha_t Y_0 + \sigma_t Z$ under $h^*$, and let $p^h_t$ be the smoothed density induced by a candidate transport $h$ as above. Write $s_t(\cdot\,; h) = \nabla \log p^h_t(\cdot)$ and $s_t(\cdot\,; h^*) = \nabla \log p_t(\cdot)$. The conditional score target used for training satisfies the unbiasedness identity
$$\mathbb{E}\big[ \nabla_{y_t} \log p(Y_t \mid Y_0) \,\big|\, Y_t \big] = \nabla \log p_t(Y_t),$$
which yields the orthogonal decomposition
$$R(h) = R(h^*) + \mathbb{E}_{Y \sim p_t} \big\| s_t(Y; h) - s_t(Y; h^*) \big\|_2^2.$$
Consequently, the excess risk is a Fisher-divergence-type score mismatch measured under the smoothed data law:
$$\rho^2(h^*, h) = \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p^h_t(Y) - \nabla \log p_t(Y) \big\|_2^2 = J\big( p_t \,\big\|\, p^h_t \big). \qquad (13)$$
Now specialize to the learned transport $\hat h_\lambda$ and denote $\tilde Y_0 := \hat h_\lambda(U)$ and $\tilde p_t := p^{\hat h_\lambda}_t$, so that $\tilde Y_t = \alpha_t \tilde Y_0 + \sigma_t Z$ has density $\tilde p_t$. Under the VP schedule $\sigma_t^2 = t$ and $\alpha_t = \sqrt{1-t}$, the denoiser mismatch in (9) satisfies
$$\mathcal{E}_{MAG}(t) = \frac{t^2}{1-t}\, \mathbb{E}_{Y \sim p_t} \big\| \nabla \log p_t(Y) - \nabla \log \tilde p_t(Y) \big\|_2^2 = \frac{t^2}{1-t}\, \rho^2\big(h^*, \hat h_\lambda\big).$$
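Tweedie's formula used above can be verified numerically in a one-dimensional Gaussian example (a minimal illustration of ours, not part of the paper; function names are hypothetical). With $Y_0 \sim N(0,1)$ under the VP schedule, $Y_t \sim N(0,1)$, so $\nabla \log p_t(y) = -y$, and Tweedie's denoiser $(y + t\,\nabla\log p_t(y))/\alpha_t$ reduces to $\sqrt{1-t}\,y$, which coincides with the Gaussian conditional mean $\mathbb{E}[Y_0 \mid Y_t = y]$:

```python
import math

def tweedie_denoiser(y, t, score):
    # Tweedie's formula under the VP schedule sigma_t^2 = t, alpha_t = sqrt(1-t):
    # m(y) = (y + t * score(y)) / sqrt(1 - t)
    return (y + t * score(y)) / math.sqrt(1.0 - t)

def gaussian_posterior_mean(y, t):
    # Exact E[Y0 | Yt = y] for Y0 ~ N(0,1), Yt = sqrt(1-t) Y0 + sqrt(t) Z:
    # Cov(Y0, Yt) = sqrt(1-t) and Var(Yt) = 1, hence m(y) = sqrt(1-t) * y.
    return math.sqrt(1.0 - t) * y

t = 0.3
score = lambda y: -y  # score of Yt ~ N(0,1): d/dy log p_t(y) = -y
for y in [-2.0, -0.5, 0.0, 1.3, 2.7]:
    assert abs(tweedie_denoiser(y, t, score) - gaussian_posterior_mean(y, t)) < 1e-12
```

The two expressions agree exactly here because $(y - ty)/\sqrt{1-t} = \sqrt{1-t}\,y$; for non-Gaussian $Y_0$ the identity still holds but the score is no longer linear.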
Corollary 1 (Training-to-$W_2$ pipeline). Fix $t \in (0,1)$ in the tube regime of Theorem 1. Let $\hat h_\lambda$ be the learned transport and $\tilde Y_0 = \hat h_\lambda(U)$ its one-shot generator. Then
$$W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le \big( \Phi(t)\, \bar C_{LSI}(t) + \Psi(t) \big)\, \rho\big(h^*, \hat h_\lambda\big).$$

Bracketing entropy. Let $P$ denote the joint law of $(Y_t, Y_0)$ under (1). For a class of measurable functions $\mathcal{F}$ on the sample space and $u > 0$, let $N_B(u, \mathcal{F}, L_2(P))$ be the $u$-bracketing number in $L_2(P)$ and $H_B(u, \mathcal{F}) := \log N_B(u, \mathcal{F}, L_2(P))$ its bracketing entropy (Shen and Wong, 1994). We apply this with the excess-loss class
$$\mathcal{F} := \big\{ \ell(\cdot, \cdot\,; h) - \ell(\cdot, \cdot\,; h^*) : h \in \mathcal{H} \big\}.$$
In (15) below, the generic variable $x$ ranges over the sample space of $(Y_t, Y_0)$.

Theorem 2 bounds the excess risk of the empirical minimizer $\hat h_\lambda$ in terms of (i) the approximation error, (ii) a bracketing-entropy integral, and (iii) the additional perturbation introduced by the finite-anchor loss $\ell_K$.

Theorem 2 (Score-matching excess-risk bound). Fix any $k$ such that $0 < \frac{c_b}{4 c_v} \le k < 1$, with $c_v = 40\, \alpha_t^2 B^2 / \sigma_t^4$ and $c_b = 16\, \alpha_t^2 B^2 / \sigma_t^4$. Let $\hat h_\lambda \in \arg\min_{h \in \mathcal{H}} L_{n,K}(h)$ be the empirical minimizer defined above. Then, for any $\varepsilon > 0$ satisfying the entropy condition
$$\int_{k \varepsilon^2 / 16}^{4 c_v^{1/2} \varepsilon} H_B^{1/2}(u, \mathcal{F})\, du \le c_h\, n^{1/2} \varepsilon^2, \qquad (14)$$
and the lower bound
$$\varepsilon^2 \ge \max\Big\{ 4 \inf_{h \in \mathcal{H}} \rho^2(h^*, h),\ 8 \sup_{h,\, x} \big| \ell_K(y_t, y_0; h) - \ell(y_t, y_0; h) \big| \Big\}, \qquad (15)$$
with $c_h = k^{3/2} / 2^{11}$, we have the deviation bound
$$P\big( \rho(h^*, \hat h_\lambda) \ge \varepsilon \big) \le 4 \exp\big( - c_e\, n\, \varepsilon^2 \big), \qquad c_e := \frac{1 - k}{8 \big( 64 c_v + \tfrac{2 c_b}{3} \big)}.$$

3.4 Main result: finite-sample generation fidelity

Before stating the main result, we introduce several definitions.

Neural network class. Fix integers $d_{in}, d_{out} \ge 1$ and depth $L \ge 2$. A feedforward ReLU network $h : \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$ is defined by
$$x^{(0)} = x, \qquad x^{(\ell)} = \sigma\big( A_\ell x^{(\ell-1)} + b_\ell \big) \quad (\ell = 1, \dots, L-1), \qquad h(x) = A_L x^{(L-1)} + b_L,$$
where $\sigma(z) = \max\{z, 0\}$ is applied componentwise, $A_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$, and $b_\ell \in \mathbb{R}^{d_\ell}$, with $d_0 = d_{in}$ and $d_L = d_{out}$. The width is $\max_{0 \le \ell \le L-1} d_\ell$. We write $NN(d_{in}, d_{out}, L, W, S, B, E)$ for the class of such networks with maximum width at most $W$, at most $S$ nonzero parameters, entrywise parameter bound $E$, and output uniformly bounded by $B$ on the latent domain $\mathcal{U}$:
$$NN(d_{in}, d_{out}, L, W, S, B, E) := \Big\{ h : \max_{0 \le \ell \le L-1} d_\ell \le W,\ \sum_{\ell=1}^L \big( \|A_\ell\|_0 + \|b_\ell\|_0 \big) \le S,\ \max_\ell \big( \|A_\ell\|_{\max}, \|b_\ell\|_{\max} \big) \le E,\ \sup_{u \in \mathcal{U}} \|h(u)\|_\infty \le B \Big\}, \qquad (16)$$
where $\|M\|_{\max} := \max_{i,j} |M_{ij}|$ denotes the max-entry norm and $\|\cdot\|_0$ counts nonzero entries.

Assume that the finite-anchor approximation induces a uniform perturbation of the loss:
$$\sup_{h \in \mathcal{H},\, y_t,\, y_0} \big| \ell_K(y_t, y_0; h) - \ell(y_t, y_0; h) \big| \le \varepsilon(\tilde\pi, t, K),$$
for some deterministic function $\varepsilon(\tilde\pi, t, K)$ (see Section 4 for explicit bounds). With these definitions in place, we obtain the following generation-accuracy guarantee by combining the pull-back inequality (Theorem 1), the excess-risk bound (Theorem 2), and standard approximation and entropy estimates for ReLU networks.

Theorem 3 (MAGT's generation fidelity). Under Assumptions 1-3 and the estimator class setting $\mathcal{H} = NN(d, D, L, W, W^2 L, B, B)$ with $d \ge d^*$, there exist constants $c_1, c_2, c_3 > 0$ (depending only on $(m, M, H_\gamma, \Lambda_2, \rho_M)$ and the VP schedule, but not on $n, W, L, K$) such that
$$\mathbb{E}\, W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le C_{PB}(t) \bigg[ c_1 (WL)^{-\frac{2(\eta+1)}{d^*}} + c_2\, \sigma_t^{-(\eta+2)} \Big( \frac{(WL)^2 \log^5(WL)}{n} \Big)^{\frac{\eta+1}{2\eta}} + c_3\, \varepsilon(\tilde\pi, t, K) \bigg], \qquad (17)$$
where the expectation is over the training sample and any Monte Carlo randomness used to form the anchor-based score estimator.

Corollary 2 (Explicit $n$-rate).
Under the assumptions of Theorem 3, set
$$\kappa := \frac{d^*}{2(2\eta + d^*)}, \qquad r := \frac{\eta+1}{2\eta + d^*},$$
and choose $(W, L)$ so that $WL = \big\lceil (n / \log^5 n)^\kappa \big\rceil$. Then
$$\mathbb{E}\, W_2\big( P_{Y_0}, P_{\tilde Y_0} \big) \le C_{PB}(t) \Big[ \big( c_1 + c_2\, \sigma_t^{-(\eta+2)} \big) \Big( \frac{n}{\log^5 n} \Big)^{-r} + c_3\, \varepsilon(\tilde\pi, t, K) \Big],$$
where $\varepsilon(\tilde\pi, t, K) \to 0$ as $K \to \infty$ for each fixed $t \in (0,1)$, with explicit $K$-dependent bounds in Section 4.

The exponent $\big( \frac{n}{\log^5 n} \big)^{-r} = \big( \frac{n}{\log^5 n} \big)^{-\frac{\eta+1}{2\eta+d^*}}$ in Corollary 2 matches the intrinsic-dimension minimax scaling established for Wasserstein-risk estimation of $\eta$-regular distributions on a $d^*$-dimensional manifold; see, e.g., Tang and Yang (2024) (up to polylogarithmic factors).

Generation error bounds. As summarized in Table 2, the results listed there that achieve an intrinsic-dimension Wasserstein rate scale as $n^{-(\eta+1)/(2\eta+d^*)}$. In particular, manifold-adaptive diffusion (Tang and Yang, 2024) attains this exponent (for $W_1$) under a boundaryless-manifold assumption, while MAGT attains the same intrinsic-dimension exponent for $W_2$ without requiring a no-boundary condition. By contrast, existing guarantees for ambient-space diffusion and flow-matching methods typically scale with the ambient dimension $D$, leading to slower rates when $d^* \ll D$. To the best of our knowledge, comparable nonasymptotic Wasserstein-risk guarantees for normalizing flows are not currently available.

A salient difference is that the analysis of manifold-adaptive diffusion (Tang and Yang, 2024) assumes the data manifold is without boundary. This excludes many practical settings in which the support has a boundary (e.g., manifolds embedded in a bounded region), including the six synthetic manifolds considered in Section 6.1. In contrast, our pull-back analysis does not rely on a boundaryless assumption, and empirically (Section 6.1) MAGT remains effective in precisely these boundary-affected regimes.

Neural network architecture.
Beyond the rate comparison in Table 2, Corollary 2 suggests a flexible architecture trade-off for MAGT: the rate is achieved by choosing width and depth so that the product $WL$ scales as $\lceil (n/\log^5 n)^\kappa \rceil$. This permits relatively deep architectures provided the width is adjusted accordingly. By contrast, the constructions in Oko et al. (2023) and Tang and Yang (2024) typically realize their rates by letting the width (and sparsity) grow rapidly with $n$, yielding substantially larger networks that can be less aligned with standard practical design choices.

4 Practical choices for Monte Carlo approximation

This section gives practical schemes for approximating the posterior mean that enters the transport-based score estimator, together with nonasymptotic bounds controlling the finite-anchor term $\varepsilon(\tilde\pi, t, K)$ in Theorem 3.

Relation to existing Monte Carlo/QMC theory. Lemmas 1-3 below rely on standard tools from self-normalized importance sampling (SNIS) and quasi-Monte Carlo (QMC): nonasymptotic SNIS error bounds and variance expansions (see, e.g., Owen (2013)), and the Koksma-Hlawka inequality together with classical discrepancy estimates for low-discrepancy point sets (see, e.g., Niederreiter (1992); Dick and Pillichshammer (2010)). We restate these bounds mainly to track how the constants depend on the smoothing level $\sigma_t$ and the intrinsic dimension $d$, and to make the finite-anchor term $\varepsilon(\tilde\pi, t, K)$ in Theorem 3 explicit. The proofs are given in Appendix D.

Table 2: Theoretical guarantees for score-based diffusion and flow models (rates up to a polylogarithmic factor of $n$ and constants).
Method; metric; rate (scaling in $n$); key assumptions; estimator / class:

- Diffusion (Oko et al., 2023). Metric: TV, $W_1$. Rate: $n^{-\eta/(2\eta+D)}$ (TV), $n^{-(\eta+1-\delta)/(2\eta+D)}$ ($W_1$). Assumptions: $\eta$-smooth density in $\mathbb{R}^D$ (Besov-type); boundary regularity. Estimator: ReLU score nets with $L_n = \Theta(\log n)$, $W_n, S_n = \tilde\Theta\big(n^{D/(2\eta+D)}\big)$.
- Manifold-adaptive diffusion (Tang and Yang, 2024). Metric: $W_1$. Rate: $n^{-(\eta+1)/(2\eta+d^*)}$. Assumptions: $\eta$-smooth density on a $d^*$-manifold; no boundary. Estimator: ReLU score nets with $L_n = \Theta(\log^4 n)$, $W_n, S_n = \tilde\Theta\big(n^{d/(2\eta+d^*)}\big)$.
- Lower-bound-free diffusion (Zhang et al., 2024). Metric: TV. Rate: $n^{-\eta/(2\eta+D)}$. Assumptions: sub-Gaussian data in $\mathbb{R}^D$; (optionally) $\eta$-Sobolev with $\eta \le 2$. Estimator: truncated KDE plug-in score.
- KDE-based flow matching (Kunkel and Trabs, 2025). Metric: $W_1$. Rate: $n^{-(\eta+1)/(2\eta+D)}$. Assumptions: $\eta$-smooth density (Besov); compact support in $\mathbb{R}^D$. Estimator: Lipschitz vector-field nets; sufficiently expressive class.
- MAGT (Theorem 3). Metric: $W_2$. Rate: $n^{-(\eta+1)/(2\eta+d^*)}$. Assumptions: $\eta$-smooth density on a $d^*$-manifold (boundary allowed). Estimator: anchor-based score estimator; flexible ReLU architecture.

Recall from (2)-(4) that the smoothed score at noise level $t$ depends on the posterior mean $m_t(y_t) := \mathbb{E}[h(U) \mid y_t]$. Any finite-anchor approximation produces $\tilde m_{t,K}(y_t)$ and the induced score estimate $\tilde s_{t,K}(y_t) = \sigma_t^{-2}\big( \alpha_t \tilde m_{t,K}(y_t) - y_t \big)$. Since
$$\tilde s_{t,K}(y_t) - \nabla_{y_t} \log p(y_t) = \frac{\alpha_t}{\sigma_t^2} \big( \tilde m_{t,K}(y_t) - m_t(y_t) \big), \qquad (18)$$
it suffices to control the approximation error of the posterior mean for each method below.

4.1 MAGT-MC

MAGT-MC is the baseline variant in which we approximate the posterior expectation in (2) using standard Monte Carlo anchors drawn i.i.d. from the base distribution. Equivalently, we choose the proposal in (3) to be $\tilde\pi(\cdot \mid y_t) \equiv \pi(\cdot)$, draw $u_1, \dots, u_K \sim \pi$, and compute $\tilde s_{t,K}(y_t; h, \pi, \pi)$ via (4).
In this case, the importance ratio cancels and the weights are proportional to the Gaussian likelihood terms $\phi\big(y_t; \alpha_t h(u_j), \sigma_t^2 I_D\big)$, yielding a simple finite-mixture approximation. This approach is embarrassingly parallel and requires no optimization at inference time. However, when the posterior $\pi_t(u \mid y_t)$ is much more concentrated than the base distribution $\pi$ (e.g., for small $\sigma_t$), most anchors receive negligible weight, and the effective sample size can collapse. The QMC and MAP variants below are designed to mitigate this variance in complementary ways.

Lemma 1 ($K$-approximation error). Consider the estimator with $\tilde\pi = \pi$, i.e., draw i.i.d. anchors $U^{(1)}, \dots, U^{(K)} \sim \pi$ (independent of $Y_t$) and form the transport-based score estimate $\tilde s_{t,K}(Y_t; h, \pi, \pi)$ via (4). Then for all sufficiently large $K$, there exists a constant $C > 0$, depending only on $(B, d, D)$ and the geometric constants in Assumptions 1-2, such that
$$\mathbb{E}\, \big\| \tilde s_{t,K}(y_t; h, \pi, \pi) - s_t(y_t; h) \big\|_2^2 \le \frac{C\, \alpha_t^2}{K\, \sigma_t^{d+4}}, \qquad (19)$$
where $s_t(\cdot\,; h) := \nabla_y \log p_t(y)$ is the ideal (infinite-anchor) mixture score at level $t$.

4.2 MAGT-QMC

MAGT-QMC replaces i.i.d. anchors with quasi-Monte Carlo point sets that cover the base distribution $\pi$ more evenly. Concretely, we use a low-discrepancy sequence (optionally randomized via scrambling) in $[0,1]^d$ and map it to the target base (e.g., via the inverse-CDF transform for factorized bases, or other standard transports). We then plug these anchors into (4) exactly as in MAGT-MC.

Because $\tilde m_{t,K}(y_t)$ is a ratio of two posterior expectations, reducing the integration error of each term can significantly improve score stability. In many smooth settings, QMC yields a faster empirical convergence rate than $K^{-1/2}$ and often provides a practical variance-reduction mechanism without changing the underlying model.
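The MAGT-MC estimator just described can be sketched in a few lines (an illustrative NumPy sketch of ours, not the paper's code; `magt_mc_score` and its arguments are hypothetical names, and the transport $h$ is passed in as a black box):

```python
import numpy as np

def magt_mc_score(y_t, h, t, K, d, rng):
    """Finite-anchor MAGT-MC score estimate at noise level t (VP schedule).

    Anchors are drawn i.i.d. from the standard Gaussian base; with the proposal
    equal to the base, the self-normalized weights reduce to the Gaussian
    likelihoods phi(y_t; alpha_t h(u_k), sigma_t^2 I).
    """
    alpha, sigma2 = np.sqrt(1.0 - t), t
    u = rng.standard_normal((K, d))          # anchors u_k ~ pi = N(0, I_d)
    outputs = h(u)                           # (K, D) transport outputs h(u_k)
    # Log-weights equal the Gaussian log-likelihoods up to a shared constant;
    # stabilize the softmax with a max-subtraction (log-sum-exp trick).
    logits = -np.sum((y_t - alpha * outputs) ** 2, axis=1) / (2.0 * sigma2)
    logits -= logits.max()
    w = np.exp(logits)
    w /= w.sum()                             # self-normalized weights
    m = w @ outputs                          # posterior-mean estimate m_{t,K}(y_t)
    return (alpha * m - y_t) / sigma2        # score estimate s_{t,K}(y_t)

# Sanity check: for the identity transport h(u) = u with d = D, the smoothed
# law is N(0, I), whose score is -y; the estimate approaches it for large K.
rng = np.random.default_rng(0)
h = lambda u: u
y = np.array([0.5, -0.2])
est = magt_mc_score(y, h, t=0.5, K=200000, d=2, rng=rng)
```

The sanity check exploits that $\alpha_t^2 + \sigma_t^2 = 1$ under the VP schedule, so the identity transport yields a standard Gaussian smoothed law whatever $t$ is; for small $t$ the weights concentrate on few anchors and the variance collapse discussed above becomes visible.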
Star discrepancy and Hardy-Krause variation. For a point set $P_K = \{z_1, \dots, z_K\} \subset [0,1]^d$, its star discrepancy is
$$D^*(P_K) := \sup_{x \in [0,1]^d} \bigg| \frac{1}{K} \sum_{j=1}^K \mathbf{1}\{ z_j \in [0, x) \} - \prod_{i=1}^d x_i \bigg|, \qquad (20)$$
where $x = (x_1, \dots, x_d)$ and $[0, x) := \prod_{i=1}^d [0, x_i)$ is an anchored axis-aligned box. We write $V_{HK}(g)$ for the Hardy-Krause variation of an integrand $g : [0,1]^d \to \mathbb{R}$; for smooth $g$ it can be bounded in terms of mixed partial derivatives. See standard QMC references for the formal definition and the Koksma-Hlawka inequality (Niederreiter, 1992; Dick and Pillichshammer, 2010).

Lemma 2 (MAGT-QMC score discrepancy bound). Fix $t \in (0,1)$ and an observation $y_t \in \mathbb{R}^D$. Assume there exists a measurable map $T : [0,1]^d \to \mathcal{U}$ such that $T(Z) \sim \pi$ when $Z \sim \mathrm{Unif}([0,1]^d)$. Let $P_K = \{z_1, \dots, z_K\} \subset [0,1]^d$ be a point set with star discrepancy $D^*(P_K)$, and define the QMC anchors $U^{(j)} := T(z_j)$. Form the QMC posterior-mean and score estimators
$$\tilde m^{QMC}_{t,K}(y_t) := \frac{\sum_{j=1}^K \phi\big( y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D \big)\, h(U^{(j)})}{\sum_{j=1}^K \phi\big( y_t; \alpha_t h(U^{(j)}), \sigma_t^2 I_D \big)}, \qquad \tilde s^{QMC}_{t,K}(y_t) := \frac{1}{\sigma_t^2} \big( \alpha_t \tilde m^{QMC}_{t,K}(y_t) - y_t \big).$$
Define the integrands on $[0,1]^d$,
$$f_0(z) := \phi\big( y_t; \alpha_t h(T(z)), \sigma_t^2 I_D \big), \qquad f_{1,r}(z) := h_r(T(z))\, \phi\big( y_t; \alpha_t h(T(z)), \sigma_t^2 I_D \big), \quad r = 1, \dots, D,$$
and set $V_{HK}(f_0)$ for the Hardy-Krause variation of $f_0$ and $V_{HK}(f_1) := \sum_{r=1}^D V_{HK}(f_{1,r})$. Let $I_0 := \int_{[0,1]^d} f_0(z)\, dz$ and $I_1 := \int_{[0,1]^d} f_1(z)\, dz$. If $V_{HK}(f_0)\, D^*(P_K) \le I_0 / 2$, then
$$\big\| \tilde s^{QMC}_{t,K}(y_t) - \nabla_{y_t} \log p_t(y_t) \big\|_2 \le \frac{2 \alpha_t}{\sigma_t^2\, I_0} \big( V_{HK}(f_1) + \| m_t(y_t) \|_2\, V_{HK}(f_0) \big)\, D^*(P_K). \qquad (21)$$
In particular, for classical low-discrepancy constructions one has $D^*(P_K) = O\big(K^{-1} (\log K)^d\big)$, yielding an $O\big(K^{-1} (\log K)^d\big)$ deterministic integration rate whenever $V_{HK}(f_0)$ and $V_{HK}(f_1)$ are finite (Niederreiter, 1992; Dick and Pillichshammer, 2010).

4.3 MAGT-MAP

MAGT-MAP approximates the expectation using a data-dependent proposal constructed from a MAP-Laplace-Gauss-Newton approximation. This yields a proposal distribution concentrated around high-posterior-mass regions, thereby improving the efficiency of the expectation approximation. For a fixed noise level $t$ and observation $y_t \in \mathbb{R}^D$, the latent posterior induced by the base density $\pi$ and the map $h$ is
$$\pi_t(u \mid y_t) \propto \pi(u)\, \phi\big( y_t; \alpha_t h(u), \sigma_t^2 I_D \big). \qquad (22)$$
We compute a MAP estimate $\hat u \in \arg\max_u \pi_t(u \mid y_t)$, equivalently $\hat u \in \arg\min_u \Phi(u)$ for the negative log-posterior
$$\Phi(u) := \frac{1}{2 \sigma_t^2} \big\| y_t - \alpha_t h(u) \big\|_2^2 - \log \pi(u).$$
A Laplace approximation of (22) yields a Gaussian proposal of the form $q(u \mid y_t) = N(\hat u, \tilde\Sigma)$. In MAGT-MAP we use this data-dependent Gaussian as the proposal $\tilde\pi(\cdot \mid y_t) = q(\cdot \mid y_t)$ in the importance weights (3), which concentrates anchors near high posterior mass and typically improves the effective sample size when $\sigma_t$ is small.

Gauss-Newton Laplace proposal. Computing the exact Hessian $\nabla^2 \Phi(\hat u)$ can be expensive because it involves second derivatives of $h$. Instead, we use a Gauss-Newton approximation based on the Jacobian $J_h(\hat u)$. Concretely, we take
$$\hat\Lambda := I_d + \frac{\alpha_t^2}{\sigma_t^2}\, J_h(\hat u)^\top J_h(\hat u), \qquad \tilde\Sigma := \big( \zeta \hat\Lambda + \tau^2 I_d \big)^{-1},$$
where $I_d$ is the $d \times d$ identity matrix, $\zeta \ge 0$ is an optional inflation factor, and $\tau^2 \ge 0$ provides numerical damping.

Lemma 3 (MAGT-MAP self-normalized IS error). Fix $t \in (0,1)$ and $y_t \in \mathbb{R}^D$. Assume $\pi_t(\cdot \mid y_t) \ll q(\cdot \mid y_t)$ and $\|h(u)\|_2 \le B$ for all $u$.
Let $\tilde m_{t,K}(y_t)$ denote the self-normalized importance-sampling estimator of $m_t(y_t) = \mathbb{E}_{\pi_t(\cdot \mid y_t)}[h(U)]$ formed from i.i.d. samples $U^{(1)}, \dots, U^{(K)} \sim q(\cdot \mid y_t)$ and weights proportional to $\pi_t(u \mid y_t) / q(u \mid y_t)$ (equivalently, the unnormalized weights (3) with $\tilde\pi = q$). Define the order-2 divergence factor
$$D_2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big) := \mathbb{E}_{q(\cdot \mid y_t)} \bigg[ \Big( \frac{\pi_t(U \mid y_t)}{q(U \mid y_t)} \Big)^2 \bigg] = 1 + \chi^2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big).$$
Then
$$\mathbb{E}\Big[ \big\| \tilde s_{t,K}(y_t; h, \pi, q) - \nabla_{y_t} \log p_t(y_t) \big\|_2^2 \,\Big|\, y_t \Big] \le \frac{32\, \alpha_t^2 B^2}{K\, \sigma_t^4}\, D_2\big( \pi_t(\cdot \mid y_t) \,\big\|\, q(\cdot \mid y_t) \big).$$

Thus MAGT-MAP achieves the usual $K^{-1/2}$ root-MSE rate, with a constant governed by how well the Laplace proposal matches the posterior (through $D_2$) (Owen, 2013). Standard Laplace-approximation bounds quantify when the Gaussian proposal $q(\cdot \mid y_t)$ is close to the true posterior: small local curvature mismatch and accurate covariance approximation (together with posterior concentration) yield a small divergence. Moreover, importance-sampling efficiency depends on how well $q(\cdot \mid y_t)$ matches the posterior, with weight dispersion governed by divergences such as $D_2(\pi_t \| q)$. Finally, even structured covariances (e.g., diagonal or low-rank) can work well in practice when they preserve the dominant directions of $\tilde\Sigma$, which keeps the covariance-mismatch contribution small.

5 Implementation

This section describes how we optimize the empirical objective $L_{n,K}$ in (6), especially when the anchor budget $K$ is large. Recall that $K$ enters the loss inside the score estimator $\tilde s_{t,K}$, through the self-normalized importance weights in (3). Consequently, increasing $K$ improves the Monte Carlo accuracy of each per-sample score estimate, but it also changes the computational profile of the stochastic gradient descent (SGD) optimization.
For an SGD minibatch $\{(y_0^{(b)}, y_t^{(b)})\}_{b=1}^B$ at a fixed time $t$ and a set of anchors $\{u_k\}_{k=1}^K$, define the center outputs $\tilde y_0^{(k)} = h_\theta(u_k) \in \mathbb{R}^D$. For the common choice $\tilde\pi \equiv \pi$, the unnormalized weights are proportional to Gaussian likelihoods, so the (normalized) soft assignment for item $b$ is
$$w_{b,k} := \frac{\exp(z_{b,k})}{\sum_{j=1}^K \exp(z_{b,j})}, \qquad z_{b,k} := -\frac{\big\| y_t^{(b)} - \alpha_t \tilde y_0^{(k)} \big\|_2^2}{2 \sigma_t^2}; \qquad b = 1, \dots, B;\ k = 1, \dots, K. \qquad (23)$$
The posterior mean is $m_b = \sum_{k=1}^K w_{b,k}\, \tilde y_0^{(k)}$, and the score estimator becomes $\tilde s_{t,K}(y_t^{(b)}) = \big(\alpha_t m_b - y_t^{(b)}\big) / \sigma_t^2$. Thus evaluating one SGD step for (6) requires (i) computing all logits $z_{b,k}$, $b = 1, \dots, B$, $k = 1, \dots, K$; (ii) forming a $K$-term softmax and weighted sum per batch item; and (iii) differentiating through these operations.

If we directly implement (6) with automatic differentiation, the computation graph contains all $BK$ interactions. This has two practical issues. First, the computational cost of the forward pass per step scales as $O(BKD)$ (distance evaluations and weighted sums). Second, naively backpropagating through the softmax and weighted sums requires storing $\{z_{b,k}, w_{b,k}, \tilde y_0^{(k)}\}$ (and intermediate activations through $h_\theta$), which can scale like $O(BK)$ plus the activations for $K$ anchor forward passes. When $K \gg B$, this is often the bottleneck. In addition, the softmax in (23) can be numerically unstable for large $K$, as many logits may lie far in the tail; in practice, we compute it with log-sum-exp stabilization, i.e., we subtract $\max_k z_{b,k}$ before exponentiating.

From the optimization standpoint, $K$ also controls the stochasticity of the gradient: small $K$ yields a noisier Monte Carlo approximation to the posterior mean and hence a higher-variance stochastic gradient, while larger $K$ reduces this variance but increases per-step cost.
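The per-minibatch computation in (23), with the log-sum-exp stabilization just mentioned, can be sketched in a few vectorized lines (a NumPy illustration under our own naming; the training code additionally differentiates through these operations):

```python
import numpy as np

def batched_score(y_t, centers, t):
    """Vectorized soft assignments (23) and score estimates for a minibatch.

    y_t:     (B, D) noisy minibatch items y_t^(b)
    centers: (K, D) anchor outputs h_theta(u_k)
    Returns the (B, D) score estimates s_{t,K}(y_t^(b)).
    """
    alpha, sigma2 = np.sqrt(1.0 - t), t
    # Logits z_{b,k} = -||y_t^(b) - alpha_t * center_k||^2 / (2 sigma_t^2): (B, K)
    diff = y_t[:, None, :] - alpha * centers[None, :, :]
    z = -np.sum(diff ** 2, axis=2) / (2.0 * sigma2)
    # Log-sum-exp stabilization: subtract the row-wise max before exponentiating
    z -= z.max(axis=1, keepdims=True)
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)        # (B, K) soft assignments w_{b,k}
    m = w @ centers                          # (B, D) posterior means m_b
    return (alpha * m - y_t) / sigma2        # (alpha_t m_b - y_t^(b)) / sigma_t^2

rng = np.random.default_rng(0)
centers = rng.standard_normal((64, 3))       # K = 64 anchor outputs in R^3
y = rng.standard_normal((8, 3))              # B = 8 minibatch items
s = batched_score(y, centers, t=0.4)
```

The intermediate `diff` tensor makes the $O(BKD)$ forward cost explicit; a memory-conscious implementation would instead expand the squared distance and compute the logits as a matrix product, and would process the anchors in chunks as discussed next.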
Our implementation therefore separates (a) the Monte Carlo noise induced by the finite anchor budget $K$ from (b) the usual minibatch noise induced by the finite batch size $B$.

Exact gradients with respect to anchor outputs plus chunked backprop. To keep the estimator in (4) unchanged while making SGD practical for large $K$, we use a two-stage gradient computation.

Stage 1 (no-grad forward; compute exact center gradients). We compute the center outputs $\tilde y_0^{(k)} = h_\theta(u_k)$ and the logits/weights (23) without storing the full autograd graph, where $h$ is parametrized by $h_\theta$ with model parameters $\theta$. Given the minibatch loss, we then compute the exact gradient of the loss with respect to each center output, $g_k := \partial L / \partial \tilde y_0^{(k)}$, using a closed-form expression obtained by differentiating through $m_b = \sum_k w_{b,k}\, \tilde y_0^{(k)}$ and the softmax Jacobian.

Stage 2 (chunked VJP through $h_\theta$). Once $\{g_k\}_{k=1}^K$ are computed, the parameter gradient factors through the Jacobian:
$$\nabla_\theta L = \sum_{k=1}^K \nabla_\theta h_\theta(u_k)^\top g_k = \nabla_\theta \sum_{k=1}^K \big\langle h_\theta(u_k), g_k \big\rangle, \qquad (24)$$
where $\nabla_\theta h_\theta(u_k) \in \mathbb{R}^{D \times |\theta|}$ denotes the Jacobian of $h_\theta(u_k)$ with respect to the parameter vector $\theta$. Equation (24) means we can backpropagate through $h_\theta$ by treating the $g_k$ as constants and processing anchors in chunks of size $K_c$. Peak memory then scales with $K_c$ rather than $K$, while the computed gradient is exact for the chosen anchors. Algorithm 1 summarizes the resulting update.

Algorithm 1: Best single-level transport map $\hat h_{\hat\lambda}$ given $(K, d)$

Require: training data $D_{train}$, validation data $D_{val}$, latent dimension $d$, anchor count $K$, evaluation metric $E(\cdot, \cdot)$, candidate times $T = \{t_1, \dots, t_L\}$, learning rate $r$
Ensure: trained transport $\hat h_{\hat\lambda}$
1: for $t \in T$ do
2:   for $e = 1, \dots, \text{max\_epoch}$ do
3:     for each minibatch $\{Y^{(i)}\}_{i=1}^B \subset D_{train}$ do
4:       Sample noises $\{Z^{(i)}\}_{i=1}^B \sim N(0, I)$; set $t_b \leftarrow t$
5:       $\alpha_b = \alpha(t_b)$, $\sigma_b = \sigma(t_b)$; $Y_t^{(b)} \leftarrow \alpha_b Y^{(b)} + \sigma_b Z^{(b)}$
6:       Parameter update: $\theta \leftarrow \textsc{Update}\big(\theta, r; \{(Y_t^{(b)}, t_b)\}_{b=1}^B, K, d, K_c, M\big)$
7:     end for
8:   end for
9:   Set $\hat h^{(t,d,K)} \leftarrow h_\theta$
10:  Sample anchors $\{U_j\}_{j=1}^{|D_{val}|} \sim \pi_U$; generate $\hat D^{(t)} \leftarrow \{\hat h^{(t,d,K)}(U_j)\}_{j=1}^{|D_{val}|}$
11:  Compute $d^{(t)} \leftarrow E\big(\hat D^{(t)}, D_{val}\big)$
12: end for
13: $\hat t \leftarrow \arg\min_{t \in T} d^{(t)}$
14: Set $\hat\lambda \leftarrow (\hat t, d, K)$
15: Set $\hat h_{\hat\lambda} \leftarrow \hat h^{(\hat t, d, K)}$
16: return $\hat h_{\hat\lambda}$
17: function Update($\theta, r; \{(Y_t^{(b)}, t_b)\}_{b=1}^B, K, d, K_c, M$)
18:   Sample latents $u_k \sim \pi$ for $k = 1, \dots, K$
19:   Compute centers without grad: $\tilde Y^{(k)} = h_\theta(u_k)$
20:   Phase 1 (no-grad forward and exact center gradients):
21:   for $b = 1$ to $B$ do
22:     $z_{b,k} \leftarrow -\|Y_t^{(b)} - \alpha_b \tilde Y^{(k)}\|^2 / (2 \sigma_b^2)$, $\quad w_{b,k} \leftarrow \mathrm{softmax}_k(z_{b,\cdot})$
23:     $m_b \leftarrow \sum_{k=1}^K w_{b,k}\, \tilde Y^{(k)}$, $\quad s_b \leftarrow (\alpha_b m_b - Y_t^{(b)}) / \sigma_b^2$
24:   end for
25:   Define the target score $T_b$; set $g_b \leftarrow s_b - T_b$, $\quad c_b \leftarrow (\alpha_b / \sigma_b^2)\, g_b$, $\quad \Delta_{b,k} \leftarrow \langle \tilde Y^{(k)} - m_b, c_b \rangle$
26:   For each $k$, set $g_k \leftarrow \sum_{b=1}^B \Big[ w_{b,k}\, c_b + w_{b,k}\, \Delta_{b,k} \Big( \frac{\alpha_b}{\sigma_b^2} Y_t^{(b)} - \frac{\alpha_b^2}{\sigma_b^2} \tilde Y^{(k)} \Big) \Big]$
27:   Phase 2 (chunked VJP; $g$ frozen):
28:   for chunks $\mathcal{K} \subset \{1, \dots, K\}$ of size $\le K_c$ do
29:     $S_{\mathcal{K}}(\theta) \leftarrow \sum_{k \in \mathcal{K}} \langle h_\theta(u_k), g_k \rangle$ (treat $g$ as constant)
30:     Backprop $\nabla_\theta S_{\mathcal{K}}(\theta)$ and accumulate
31:   end for
32:   Gradient step: $\theta \leftarrow \theta - r \sum_{\mathcal{K}} \nabla_\theta S_{\mathcal{K}}(\theta)$
33:   return $\theta$
34: end function

6 Experiments

This section evaluates whether MAGT can reconcile three objectives that are often in tension in generative modeling: (i) generation fidelity (matching the target distribution), (ii) sampling efficiency (fast inference-time generation), and (iii) manifold alignment (concentrating probability mass near the intrinsic low-dimensional support rather than leaking into the ambient space). We benchmark MAGT against diffusion-, ODE/transport-, and adversarial-based baselines on both controlled synthetic manifolds (where the ground-truth geometry is known) and real datasets (images and tabular/high-dimensional sequences).

6.1 Synthetic benchmarks: low-dimensional manifolds

We first evaluate on controlled synthetic distributions where both the ground-truth distribution and (for manifold datasets) the underlying manifold are known. Table 3 reports six representative benchmarks: four non-Gaussian distributions in $\mathbb{R}^2$ (rings2d, spiral2d, moons2d, checker2d) and two thin manifolds embedded in $\mathbb{R}^3$ (helix3d, torus3d). For the manifold datasets, we sample latent parameters from a uniform distribution, map them through a nonlinear embedding into $\mathbb{R}^3$, and add small i.i.d. Gaussian jitter; full simulation details are provided in Appendix A.

Data splits and evaluation protocol. Across all synthetic benchmarks, the training set $D_{train} = \{y_0^i\}_{i=1}^n$ contains $n = 10{,}000$ observations. We use an independent validation set $D_{val}$ of size 5,000 for hyperparameter selection and a held-out test set for computing $W_2$ and the off-manifold rate. Unless otherwise stated, each method generates 10,000 samples for evaluation.

MAGT configuration (synthetic).
For each dataset, we set the latent dimension $d$ equal to the known intrinsic dimension reported in Table 3 and use the standard Gaussian base $\pi_U = N(0, I_d)$. The transport $h_\theta : \mathbb{R}^d \to \mathbb{R}^D$ is a 5-hidden-layer MLP of width 512 (ReLU). We train MAGT with the MAGT-MC score estimator (proposal $\tilde\pi = \pi_U$) and tune the smoothing level $t$ by validation: we search $t \in \{0.1, 0.2, \dots, 0.9\}$ and select the value minimizing the fixed-$t$ score-matching loss on $D_{val}$. Unless otherwise noted, we report results at anchor budget $K = 1024$; Figure 2 further studies sensitivity to $K$ and $t$.

Diffusion baselines (synthetic). We train a time-conditioned score network with the same MLP backbone and a 32-dimensional time embedding, using denoising score matching over $t \in [t_{low}, t_{high}]$. Sampling uses: (i) DDIM (Song, Meng and Ermon, 2021) with 1000 steps and $\eta = 1.0$ on $t \in [0.05, 0.90]$ (NFE = 1000); and (ii) DPM-Solver++ (Lu et al., 2022) with either 20 (first-order) or 40 (second-order midpoint) function evaluations on the same interval.

Flow matching baseline (synthetic). We train a time-conditioned velocity field with the same MLP backbone and 32-dimensional time embedding. Sampling solves the learned ODE with a midpoint integrator and step size 0.05, yielding 20 velocity evaluations (NFE = 20) over $t \in [0.05, 0.90]$.

WGAN-GP baseline (synthetic). We train WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017) with both generator and critic implemented as 5-hidden-layer MLPs of width 512 (ReLU), matching the MAGT backbone capacity. The generator takes a $d$-dimensional latent input (matching the intrinsic dimension in this controlled setting), and we use five critic updates per generator update, gradient-penalty coefficient 10, and Adam with learning rate $10^{-4}$ and $(\beta_1, \beta_2) = (0.5, 0.999)$.

Iterative refinement using MAGT's score estimator.
We additionally evaluate MAGT-DDIM (M-DDIM), which uses the same learned transport $h_\theta$ and anchor-based score estimator $\tilde s_{t,K}$ as MAGT, but performs iterative DDIM-style refinement from $t_{high} = 0.90$ to $t_{low} = 0.05$ over 205 steps (NFE = 205; $\eta = 1.0$). To avoid repeatedly recomputing anchor outputs, we cache the anchor bank $\{h_\theta(u_k)\}_{k=1}^K$ with $u_k \sim N(0, I_d)$ once and reuse it at every refinement step. This baseline isolates whether MAGT's empirical gains come from (a) the single-level score objective and anchor posterior estimator, or (b) the one-shot amortized transport.

Fidelity, manifold alignment, and speed. Table 3 shows that MAGT outperforms all diffusion samplers on $W_2$ across all six manifolds, and improves over flow matching on five of six (with checker2d essentially tied in $W_2$), while using only a single transport evaluation at sampling time (NFE = 1). MAGT also exhibits the strongest manifold alignment: on the thin manifolds helix3d and torus3d it attains off-manifold rates of 0.0938 and 0.0341, compared with 0.1673/0.5557 for flow matching and 0.333-0.386/0.634-0.670 for diffusion samplers at practical step budgets. Finally, MAGT is the fastest high-fidelity method in this suite, generating 10,000 samples in about $10^{-3}$ seconds, while diffusion and ODE-based samplers require tens to thousands of network evaluations.

Interpretation. Two aspects of the experimental configuration are important for interpreting Table 3. First, only MAGT and the GAN baseline are dimension-mismatched one-shot generators: MAGT learns $h_\theta : \mathbb{R}^d \to \mathbb{R}^D$ with $d \ll D$ (matched to the intrinsic dimension), so samples lie on the $d$-dimensional image of $h_\theta$ by construction.
By contrast, diffusion and ODE baselines model ambient-space dynamics and must learn to contract probability mass in directions normal to the manifold, which can leave residual off-manifold scatter under finite capacity and finite solver steps. Second, MAGT is trained at a single smoothing level t (selected by validation), so modeling capacity is focused on matching the fixed-t score that appears in the pull-back bound (Theorem 1); iterative samplers repeatedly apply approximate scores/velocities across many steps, so approximation and discretization errors can accumulate. This also helps explain why M-DDIM need not outperform the one-shot MAGT sampler: it repeatedly queries an approximate score estimator (finite K) over 205 refinement steps.

Figure 1 shows synthetic samples generated by MAGT across six toy tasks. In all cases, MAGT concentrates probability mass tightly along the underlying data manifold, with only a few outliers near the boundary. Relative to diffusion (DDIM), MAGT produces visibly cleaner manifold support and achieves uniformly lower W_2 in Table 3 while incurring substantially lower sampling cost; relative to flow matching, MAGT is typically better and otherwise similar in fidelity, again at much lower sampling cost.

Table 3: Empirical W_2 (↓), off-manifold rate (fraction of samples with distance > 0.1; ↓), and wall-clock sampling time (seconds) to generate 10,000 samples (identical batching and hardware across methods). Parentheses report standard deviations across runs for W_2 and the off-manifold rate. NFE denotes the number of score/velocity network evaluations. Boldface indicates the best-performing method for each metric.
                    rings2d (d = 1)                   spiral2d (d = 1)                  moons2d (d = 2)
                    W2 (↓)            %out (↓)        W2 (↓)            %out (↓)        W2 (↓)            %out (↓)
MAGT (NFE=1)        0.0592 (0.0070)   0.1406 (0.0255) 0.0767 (0.0185)   0.4150 (0.1501) 0.0331 (0.0036)   0.0360 (0.0103)
  Time (s)          0.001                             0.001                             0.001
M-DDIM (NFE=205)    0.0613 (0.0100)   0.2014 (0.0211) 0.0836 (0.0054)   0.3740 (0.0927) 0.0733 (0.0219)   0.2199 (0.0709)
  Time (s)          0.615                             0.613                             0.668
DPM++ (1s, NFE=20)  0.1052 (0.0228)   0.6905 (0.0283) 0.1992 (0.0300)   0.7584 (0.0261) 0.1199 (0.0083)   0.1672 (0.0293)
  Time (s)          0.153                             0.132                             0.164
DPM++ (2m, NFE=40)  0.1057 (0.0239)   0.6922 (0.0188) 0.1976 (0.0290)   0.7644 (0.0259) 0.1222 (0.0115)   0.1614 (0.0200)
  Time (s)          0.300                             0.256                             0.327
DDIM (NFE=1000)     0.1107 (0.0172)   0.7057 (0.0188) 0.1494 (0.0292)   0.7767 (0.0254) 0.0920 (0.0311)   0.1843 (0.0146)
  Time (s)          3.190                             2.720                             3.392
WGAN (NFE=1)        0.2904 (0.1112)   0.7316 (0.0622) 0.3603 (0.0472)   0.8110 (0.0769) 0.4562 (0.1310)   0.9292 (0.0829)
  Time (s)          0.001                             0.001                             0.001
FM (NFE=20)         0.0712 (0.0026)   0.5838 (0.0142) 0.0954 (0.0030)   0.6422 (0.0162) 0.0470 (0.0191)   0.0556 (0.0168)
  Time (s)          0.091                             0.090                             0.091

                    checker2d (d = 2)                 helix3d (d = 1)                   torus3d (d = 2)
                    W2 (↓)            %out (↓)        W2 (↓)            %out (↓)        W2 (↓)            %out (↓)
MAGT (NFE=1)        0.0608 (0.0040)   0.0064 (0.0029) 0.0425 (0.0022)   0.0938 (0.0501) 0.0717 (0.0049)   0.0341 (0.0182)
  Time (s)          0.001                             0.001                             0.001
M-DDIM (NFE=205)    0.0640 (0.0063)   0.0088 (0.0040) 0.0536 (0.0086)   0.2812 (0.0740) 0.0922 (0.0123)   0.1208 (0.0223)
  Time (s)          0.615                             0.664                             0.616
DPM++ (1s, NFE=20)  0.0944 (0.0213)   0.0228 (0.0205) 0.1115 (0.0162)   0.3338 (0.0230) 0.1650 (0.0223)   0.6453 (0.0656)
  Time (s)          0.141                             0.183                             0.137
DPM++ (2m, NFE=40)  0.0914 (0.0216)   0.0257 (0.0195) 0.1056 (0.0128)   0.3356 (0.0275) 0.1640 (0.0221)   0.6344 (0.0588)
  Time (s)          0.261                             0.362                             0.276
DDIM (NFE=1000)     0.1108 (0.0276)   0.0860 (0.0407) 0.0883 (0.0162)   0.3861 (0.0300) 0.1603 (0.0269)   0.6699 (0.0658)
  Time (s)          2.760                             3.709                             2.934
WGAN (NFE=1)        0.2470 (0.0646)   0.3107 (0.3267) 0.2352 (0.1335)   0.8212 (0.1933) 0.4036 (0.0285)   0.7916 (0.0337)
  Time (s)          0.001                             0.001                             0.001
FM (NFE=20)         0.0592 (0.0015)   0.0147 (0.0047) 0.0533 (0.0072)   0.1673 (0.0294) 0.1130 (0.0074)   0.5557 (0.0268)
  Time (s)          0.090                             0.099                             0.092

Figure 1: Qualitative comparison of generative models on six synthetic manifolds. Each row corresponds to one toy dataset (rings2d, spiral2d, moons2d, checker2d, helix3d, torus3d). Columns show, from left to right, ground-truth samples, MAGT one-shot transport samples, diffusion-model samples generated with DDIM, and flow-matching samples.

Figure 2: Effect of sample size n, anchor count K, and smoothing level t on MAGT and MAGT–DDIM across six synthetic benchmarks. Curves report Wasserstein distance (W_2; lower is better). Consistent with the bias–variance trade-off in Section 3, increasing n and K improves fidelity, while intermediate noise levels provide the most stable performance.

Sensitivity to sample size, anchor count, and smoothing level for MAGT. Figure 2 examines the effect of the training sample size n, the number of mixture anchors K, and the smoothing level t. Consistent with the bias–variance trade-off described in Section 3, increasing either n or K reduces the W_2 error. Performance is most stable at intermediate noise levels: small t amplifies variance in the posterior score estimator, while large t oversmooths the underlying manifold geometry. Together, these results support fixed-t training as an effective bias–variance compromise, eliminating the need to integrate a full diffusion trajectory.

6.2 Real-data benchmarks

Datasets. We evaluate the proposed method on image and tabular/high-dimensional benchmarks. MNIST (LeCun et al., 1998) contains 28 × 28 grayscale handwritten digits (60,000 training, 10,000 test). CIFAR10-0 is the single-class subset of CIFAR-10 (Krizhevsky and Hinton, 2009) containing only class 0 (airplanes), with 5,000 training and 1,000 test images.
The single-class setting isolates a single semantic mode and yields a more concentrated distribution, providing a stress test for manifold-aligned generators.

For tabular data, we use Superconduct (Hamidieh, 2018), which contains 21,263 samples with D = 81 numeric features. We model the standardized feature vectors and evaluate the distributional discrepancy between generated samples and a held-out test split (20% of the data). We also consider Genomes from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). Following prior studies (Yelmen et al., 2023; Ahronoviz and Gronau, 2024), we focus on D = 10,000 biallelic SNPs on chromosome 6 (a 3 Mbp region including HLA genes), encoded as binary sequences. We use 4,004 genomes for training and 1,002 for testing (stratified by continental group).

Model architectures. We match capacity within each modality as closely as practical, subject to standard architectures for each baseline.
(1) Images (MNIST, CIFAR10-0): MAGT uses a 5-layer convolutional generator h_θ, and the GAN baseline adopts the same generator architecture. Diffusion and flow-matching baselines use a 2D U-Net (von Platen et al., 2022) to parameterize the score (diffusion) and velocity field (flow matching), which is substantially more intricate than our generator network.
(2) Superconduct: All methods use a 5-hidden-layer MLP of width 512 (ReLU). Diffusion and flow matching additionally take a 32-dimensional time embedding as input. The GAN generator takes a 64-dimensional latent vector.
(3) Genomes: Diffusion and MAGT use closely matched 1D U-Net backbones, following the architecture used in genomic diffusion (Kenneweg et al., 2025). In MAGT, we additionally include a linear projection that maps the low-dimensional latent input into the channel dimension expected by the U-Net.

Baselines and sampler settings.
We compare against DDIM (Song, Meng and Ermon, 2021) (diffusion) and flow matching (FM) (Lipman et al., 2022; Liu et al., 2022) as strong iterative baselines, as well as WGAN-GP as a one-shot baseline. We run DDIM with 200 denoising steps on the interval t ∈ [0.05, 0.90] with stochasticity parameter η = 1.0. For FM, we integrate the learned ODE with a midpoint solver using step size 0.05 (20 steps, NFE = 20) on the same time interval. For GANs, sampling is one generator forward pass (NFE = 1); we use a latent dimension that matches MAGT whenever applicable (MNIST: d = 80, CIFAR10-0: d = 128, Genomes: d = 128) and use a 64-dimensional latent for Superconduct, matching the strongest-performing MAGT setting (d = 64).

Table 4: MAGT configuration on real-data benchmarks. We report the base latent distribution π_U, the MAGT posterior-estimation variant, the anchor budget K in the score estimator, the candidate smoothing levels t, the latent dimension d, and the validation metric used for model selection.

Dataset        Variant   Base π                       K       t candidates          d candidates    Selection metric
MNIST          MC        N(0, I_64) × Bern(0.5)^16    4096    {0.1, 0.2, …, 0.9}    80              FID
CIFAR10-0      MAP       N(0, I_128)                  16384   {0.01, 0.1}           128             FID
Superconduct   MC        N(0, I_d)                    49152   {0.1, 0.2, …, 0.9}    {16, 32, 64}    W2
Genomes        MC        N(0, I_128)                  2048    {0.1, 0.2, …, 0.9}    128             W2

MAGT configuration. Table 4 reports the exact MAGT configurations used in our real-data experiments: the base latent distribution π, latent dimension d, anchor budget K in the fixed-t score estimator used during training, the candidate smoothing levels t, the posterior-estimation variant (MC or MAP), and the validation metric used for model selection. Unless otherwise noted, sampling always draws U ∼ π and outputs a sample h_θ(U) in one forward pass.
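The one-shot sampler and the anchor-based, self-normalized importance-sampling score estimator described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes the VP identities α_t² = 1 − t and σ_t² = t stated in the text, and `h_theta` is a hypothetical stand-in for the learned MLP transport, with the proposal taken to be the base π itself (the MAGT-MC variant).

```python
import numpy as np

rng = np.random.default_rng(0)

def h_theta(u):
    # Hypothetical stand-in for the learned transport h_theta : R^d -> R^D;
    # here a fixed smooth map onto a 1-d curve in R^2, for illustration only.
    return np.stack([np.cos(2 * np.pi * u[..., 0]),
                     np.sin(2 * np.pi * u[..., 0])], axis=-1)

d, D, K, t = 1, 2, 1024, 0.3
alpha_t, sigma2_t = np.sqrt(1.0 - t), t      # VP schedule: alpha_t^2 = 1 - t, sigma_t^2 = t

# One-shot sampling: draw U ~ pi and push through the transport (NFE = 1).
U = rng.standard_normal((10_000, d))
samples = h_theta(U)

# Anchor bank {h_theta(u_k)}, cached once as in the M-DDIM baseline.
anchors = h_theta(rng.standard_normal((K, d)))

def score_t(y):
    """Self-normalized importance-sampling estimate of the fixed-t score at y,
    via the posterior mean m(y) = E[alpha_t h(U) | Y_t = y]."""
    a = alpha_t * anchors                            # (K, D) smoothed anchor means
    logw = -np.sum((y - a) ** 2, axis=-1) / (2 * sigma2_t)
    w = np.exp(logw - logw.max())                    # stabilize before normalizing
    w /= w.sum()                                     # self-normalized weights
    m = w @ a                                        # posterior-mean estimate
    return (m - y) / sigma2_t                        # Tweedie-style score
```

For the MAP variant, the same estimator would simply be driven by anchors drawn from a Laplace approximation around a posterior mode instead of from the prior.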
For CIFAR10-0 we adopt MAGT-MAP to stabilize training: at small t the latent posterior becomes sharply concentrated, and the MAP–Laplace proposal yields a substantially higher effective sample size than prior sampling at the same anchor budget. We perform model selection using the validation metric reported in the last column of Table 4. The selected configurations are t = 0.3 for MNIST, t = 0.1 for CIFAR10-0, (t, d) = (0.8, 64) for Superconduct, and t = 0.3 for Genomes.

Quantitative results and interpretation. As summarized in Table 5, across all four real-data benchmarks, MAGT consistently surpasses both the diffusion baseline (DDIM) and the GAN baseline in sample fidelity. It delivers the best overall performance on MNIST, Superconduct, and Genomes, and ranks second on CIFAR10-0, narrowly trailing flow matching (FM). Importantly, MAGT requires only a single forward pass through the generator at sampling time (NFE = 1). This one-shot generation design yields orders-of-magnitude reductions in wall-clock latency relative to iterative diffusion and ODE-based approaches, while maintaining equal or superior fidelity.

On MNIST, MAGT slightly outperforms DDIM in FID (109.22 vs. 109.53) while reducing sampling time from 320.63 s to 0.25 s under our evaluation protocol. Relative to flow matching (FM), MAGT improves FID (109.22 vs. 115.00) while reducing sampling time from 30.18 s to 0.25 s. On CIFAR10-0, FM attains the best FID (61.91), while MAGT still substantially outperforms DDIM (89.09) and GAN (159.79) in fidelity at dramatically lower cost (0.09 s vs. 121.82 s). The GAN baseline is extremely fast in this single-class setting (0.002 s) but has substantially worse perceptual fidelity (FID 159.79).

Table 5: Fidelity (FID for image datasets; W_2 for tabular datasets) and wall-clock sampling time (in seconds) required to generate the evaluation batch under identical hardware and batching conditions. Lower values indicate better performance for both metrics. Boldface highlights the best fidelity result within each dataset. A dash ("–") denotes that no implementation is available for the corresponding example.

Dataset        Quantity    MAGT     DDIM     FM       GAN
MNIST          FID (↓)     109.22   109.53   115.00   109.23
               Time (s)    0.25     320.63   30.18    0.19
CIFAR10-0      FID (↓)     65.64    89.09    61.91    159.79
               Time (s)    0.09     22.97    121.82   0.002
Superconduct   W2 (↓)      0.0998   0.3982   0.1263   0.1443
               Time (s)    0.0020   0.0974   0.0750   0.0013
Genomes        W2 (↓)      0.1688   0.2993   –        0.4911
               Time (s)    1.63     980.34   –        0.11

On Superconduct, MAGT achieves the lowest distributional error in W_2 (0.0998), improving over FM (0.1263), GAN (0.1443), and the DDIM baseline (0.3982); this corresponds to a 74.9% reduction in W_2 relative to DDIM (and 21.0% relative to FM), a substantial gain in tabular fidelity. Sampling is essentially instantaneous (0.0020 s). On Genomes, MAGT again achieves the lowest W_2 (0.1688), with substantial margins over DDIM (0.2993) and GAN (0.4911); this is a 43.6% reduction in W_2 relative to DDIM (and 65.6% relative to GAN). These results indicate that the manifold-aligned transport remains effective even in very high ambient dimensions. Additional representative samples generated by MAGT are provided in Appendix A.

Empirical performance and underlying mechanisms. MAGT is not merely competitive: in our evaluations it outperforms diffusion baselines and GANs across all benchmarks, and it achieves especially strong reductions in off-manifold leakage on thin manifolds (Table 3). These empirical gains are driven by two structural features of the experimental configuration. First, MAGT employs a dimension-aligned transport map h_θ : R^d → R^D, so every generated sample lies in the d-dimensional image of h_θ by construction.
This built-in support constraint directly targets the leakage captured by the off-manifold rate. By contrast, flow- and diffusion-based methods must map R^D to R^D. In the manifold regime, these approaches must learn to concentrate probability mass into a lower-dimensional set by contracting in directions normal to the data manifold, a difficult task under finite capacity and finite solver steps that can leave residual off-manifold scatter.

Second, MAGT concentrates learning at a single smoothing level t and matches the smoothed score at that level. The pull-back inequality (Theorem 1) bounds the Wasserstein generation error by a fixed-level score discrepancy, with a geometry- and t-dependent prefactor. This perspective explains the empirical bias–variance trade-off in t (Figure 2): very small t produces highly anisotropic scores and high estimator variance, whereas very large t oversmooths geometric structure. Selecting an intermediate t via validation therefore focuses modeling capacity precisely where the bound is evaluated, avoiding the accumulation of approximation and discretization error across many time steps. Moreover, the anchor-based posterior estimator explicitly averages over latent pre-images consistent with a noisy observation, stabilizing score estimation in normal directions and improving support concentration relative to iterative diffusion/ODE baselines at comparable fidelity.

7 Discussion

MAGT shows that a single-level (fixed-t) training objective can recover a high-fidelity generator in the manifold regime while retaining one-shot sampling.
Crucially, by identifying a manifold-induced transport from a low-dimensional latent to the ambient space, MAGT can match and often surpass diffusion-baseline fidelity in practice (e.g., Tables 3 and 5) without learning or simulating a full reverse-time diffusion process: generation does not require integrating a long trajectory and therefore avoids the discretization error and stepwise stochastic sampling error that can accumulate in diffusion samplers.

Conceptually, the method decouples support fidelity from the need to simulate an entire reverse-time trajectory. By learning a transport map whose induced Gaussian smoothing admits an explicit posterior identity for the score, MAGT turns fixed-t denoising into a practical generative mechanism. In this view, the complexity of diffusion-style time-dependent models is replaced by a single, manifold-aligned transport together with a one-shot score evaluation; any remaining approximation stems primarily from the finite-anchor posterior approximation rather than from time discretization.

The smoothing level t plays a dual role. Larger t improves numerical stability of score estimation (the posterior over U is less concentrated) but increases smoothing bias and can blur fine-scale structure. Smaller t reduces bias but makes the latent posterior sharply peaked, which stresses finite-anchor approximations. Our experiments suggest that an intermediate t often provides the best bias–variance trade-off; developing principled, data-adaptive rules for selecting t (or using a small set of carefully chosen noise levels) is a promising direction.

The anchor approximation is the main computational lever in MAGT. Prior sampling is simple and parallelizable, QMC reduces variance for moderate intrinsic dimension, and MAP/Laplace proposals improve effective sample size when the posterior is highly concentrated.
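The effective-sample-size diagnostic referred to above can be made concrete with a short sketch. This is an illustrative computation (the weight values below are synthetic, not produced by a trained model): it evaluates the standard Kish effective sample size of self-normalized importance weights in log space, the quantity that collapses when the latent posterior is much sharper than the proposal.

```python
import numpy as np

def effective_sample_size(log_weights):
    """Kish effective sample size of self-normalized importance weights:
    ESS = (sum w)^2 / sum w^2, computed stably in log space."""
    lw = np.asarray(log_weights, dtype=float)
    lw = lw - lw.max()              # stabilize before exponentiating
    w = np.exp(lw)
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(0)
# Near-uniform log-weights: a broad posterior relative to the proposal.
flat = effective_sample_size(rng.normal(0.0, 0.1, size=1024))
# Widely dispersed log-weights: a sharply concentrated posterior (small t),
# where a few anchors dominate -- the regime where a MAP/Laplace proposal helps.
peaky = effective_sample_size(rng.normal(0.0, 8.0, size=1024))
```

Monitoring this quantity per training batch is one simple way to decide whether the anchor budget K or the proposal needs to be adjusted.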
A practical limitation is that very small t may require large K to maintain stable weights; further work could explore learned proposals, amortized MAP initializations, or hierarchical anchor banks. Finally, MAGT induces an intrinsic density on the learned image manifold (directly computable for generated samples) and, at the fixed smoothing level t > 0, an ambient density that can be approximated using the same anchor bank. Developing robust procedures for evaluating these quantities on arbitrary observed points (which may require approximate inversion) and for leveraging them for calibrated out-of-distribution scoring remains an open and practically important problem.

More broadly, MAGT suggests that explicitly dimension-mismatched, non-invertible transports can provide a lightweight alternative to diffusion-style modeling when data concentrate near thin supports. Rather than fitting a time-dependent score field and sampling it via a discretized reverse-time process, one can learn the manifold-induced transport and use a single-level posterior identity for generation. Potential extensions include conditional generation, higher-resolution image benchmarks, and hybrid samplers that use MAGT for initialization followed by a short refinement chain when maximum fidelity is required.

Appendix A  Experiment details and more results

A.1 Toy Datasets

Rings2d. We generate a two-dimensional mixture of concentric rings. First sample a discrete radius index k ∼ Unif{1, …, K} and set the ring radius r_k (e.g., equally spaced radii in a fixed interval). Then draw an angle θ ∼ Unif[0, 2π] and form the noiseless point x_1 = r_k cos θ, x_2 = r_k sin θ. Finally, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε. The noise level σ is set by the jitter parameter (default σ = 0.02).

Spiral2d.
We sample a latent parameter t ∼ Unif[t_min, t_max] and construct a planar spiral in polar form. Let the radius grow with t, e.g., r(t) = a + bt for constants a, b > 0, and set the angle to θ(t) = t. The noiseless point is

    x_1 = r(t) cos t,   x_2 = r(t) sin t.

We then add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, with σ controlled by jitter.

Moons2d. We generate the standard two-moons dataset consisting of two interleaving semicircles. Sample a label c ∈ {0, 1} uniformly and draw an angle θ ∼ Unif[0, π]. For the first moon (c = 0), set x_1 = cos θ, x_2 = sin θ. For the second moon (c = 1), we apply a shift to create the interleaving structure:

    x_1 = 1 − cos θ,   x_2 = 1 − sin θ − δ,

where δ > 0 controls the vertical separation (fixed throughout the experiments). As in the other settings, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, with σ set by jitter.

Checker2d. We generate a two-dimensional checkerboard distribution supported on alternating squares of a regular grid. Let m ∈ N denote the number of cells per axis and partition [−1, 1]² into an m × m grid with cell width w = 2/m. Sample integer indices (i, j) uniformly from {0, …, m−1}² subject to the parity constraint (i + j) mod 2 = 0 (i.e., only the "black" squares). Conditional on (i, j), draw a point uniformly within the selected cell:

    x_1 ∼ Unif[−1 + iw, −1 + (i+1)w],   x_2 ∼ Unif[−1 + jw, −1 + (j+1)w].

Finally, we add Gaussian jitter ε ∼ N(0, σ²I_2) to obtain x = (x_1, x_2)^⊤ + ε, where σ is set by the jitter parameter (default σ = 0.02).

Helix3d. We sample t ∼ Unif[0, 2π] and define a three-dimensional helix:

    x_1 = cos(t),   x_2 = sin(t),   x_3 = t.

Noise ε ∼ N(0, σ²I_3) is added to obtain x = (x_1, x_2, x_3)^⊤ + ε.

Torus3d.
We sample two independent latent parameters (u, v) ∼ Unif[0, 2π]². Given a major radius R and minor radius r, the torus embedding in R³ is

    x_1 = (R + r cos v) cos u,   x_2 = (R + r cos v) sin u,   x_3 = r sin v.

Again, Gaussian jitter ε ∼ N(0, σ²I_3) is added. We use R = 2 and r = 1.

A.2 Examples of generated samples

In this subsection, we present additional qualitative results to supplement the main quantitative evaluations. We include representative sample grids for MNIST and CIFAR10-0 (airplanes) to visually compare the perceptual quality and diversity of unconditional generations produced by MAGT, diffusion sampling (DDIM), and flow matching. For the genomic experiment, we provide class-wise two-dimensional PCA visualizations of real test genomes and synthetic genomes generated by MAGT and a diffusion baseline. PCA is fit separately within each class and applied to both real and generated samples from that class, enabling an interpretable comparison of class-conditional structure in the high-dimensional SNP space.

Figure 3: Unconditional generation on MNIST, comparing samples from MAGT, DDIM, and flow matching (FM), alongside held-out real test images (left to right).

B  Proofs in Section 3

This appendix gathers the proofs and auxiliary technical lemmas underlying the results presented in Section 3 of the main text.

Proof of Theorem 1. We work with the VP probability flow with constant schedule β(s) ≡ 1. In the standard time parameter s ≥ 0, the closed-form coefficients are α(s) = exp(−s/2) and σ²(s) = 1 − exp(−s). Parameterizing by the noise level t := σ²(s) ∈ (0, 1) gives

    s = Γ(t) := −log(1 − t),   α_t = √(1 − t),   σ_t² = t.   (25)

Figure 4: Unconditional generation results on CIFAR10-0 (airplanes): (a) MAGT, (b) diffusion (DDIM), (c) flow matching (FM).
Figure 5: Class-wise PCA projections for five classes, comparing real genomic data with samples generated by MAGT and diffusion-based models.

Set T := Γ(t). Let p_s and p̃_s denote the VP-smoothed densities at time s of the data and generator, respectively, so that p_T = p_t and p̃_T = p̃_t. Apply Lemma 14 on [0, T] with the identification p_s ← p̃_s and q_s ← p_s. By symmetry of W_2,

    W_2(P_{Y_0}, P_{Ỹ_0}) = W_2(p_0, p̃_0) ≤ exp(∫_0^T L(s) ds) W_2(p̃_t, p_t) + exp(∫_0^T L(s) ds) ∫_0^T exp(∫_u^T (K − L)) du · √(J_T),

where (with this choice of roles)

    J_T := ∫_{R^D} ‖∇log p̃_t(y) − ∇log p_t(y)‖₂² p_t(y) dy = J(p_t ‖ p̃_t).

With t ≤ t_max = c_tube² ρ_M², we have σ_s² ≤ t for all s ∈ [0, T], hence σ_s ≤ c_tube ρ_M, so tube projections are well-defined along the flow. Lemma 15 applies to both p_s and p̃_s and yields ‖∇² log p_s‖_op ∨ ‖∇² log p̃_s‖_op ≤ L⋆(s) for all s ≤ T, with

    L⋆(s) := C_T^{(γ)} σ_s^{γ−2} α_s^γ + [(C_S^{(γ)})² / (1 − θ_t)] σ_s^{2γ−2} α_s^{2γ},   θ_t = C_N^{(γ)} t^γ (1 − t)^γ.

Moreover, for the VP flow v_s(x) = −(1/2)x − ∇log ρ_s(x), we have ‖∇v_s(x)‖_op ≤ 1/2 + ‖∇² log ρ_s(x)‖_op, hence Assumption 4(A1) holds with L(s) := 1/2 + L⋆(s). Finally, by the score-gap growth lemma (Lemma 16), Assumption 4(A2) holds with K(s) := 1/2 + 4L⋆(s). In particular, with this choice, K(s) − L(s) = 3L⋆(s).

By Lemma 11 with reference measure p̃_t,

    W_2(p̃_t, p_t) = W_2(p_t, p̃_t) ≤ (1 / C_LSI(p̃_t)) √(J(p_t ‖ p̃_t)) = C̄_LSI(t) √(J_T).

Moreover, by Lemma 10 applied to p̃_t,

    C̄_LSI(t) = 1 / C_LSI(p̃_t) ≤ (α_t² M² + σ_t²) / min{C_LSI(π), 1} = ((1 − t) M² + t) / min{C_LSI(π), 1}.

Change variables r = σ_s² = 1 − exp(−s), so that dr = (1 − r) ds and α_s² = 1 − r. Then

    σ_s^{γ−2} α_s^γ = r^{γ/2−1}(1 − r)^{γ/2},   σ_s^{2γ−2} α_s^{2γ} = r^{γ−1}(1 − r)^γ,   ds = dr / (1 − r).
Using (1 − r)^{−a} ≤ (1 − t)^{−a} for r ∈ [0, t] yields

    ∫_0^T L⋆(s) ds ≤ (2/γ) C_T^{(γ)} t^{γ/2} / (1 − t)^{1+γ/2} + (1/γ) [(C_S^{(γ)})² / (1 − θ_t)] t^γ / (1 − t)^{1+γ} =: I_γ(t).

Since ∫_0^T L(s) ds = T/2 + ∫_0^T L⋆(s) ds, we obtain the bound

    exp(∫_0^T L) ≤ exp(T/2) exp(I_γ(t)) = exp(I_γ(t)) / √(1 − t) =: Φ(t).

Also, using K − L = 3L⋆ and monotonicity of the integral,

    exp(∫_0^T L) ∫_0^T exp(∫_u^T (K − L)) du ≤ Φ(t) ∫_0^T exp(3 ∫_u^T L⋆(r) dr) du ≤ Φ(t) ∫_0^T exp(3 I_γ(t)) du = T exp(4 I_γ(t)) / √(1 − t) =: Ψ(t),

where T = Γ(t) = −log(1 − t). Under the VP schedule σ_t² = t and α_t² = 1 − t, Tweedie's formula gives

    m_{p,t}(y) = (y + t ∇log p_t(y)) / α_t,   m_{p̃,t}(y) = (y + t ∇log p̃_t(y)) / α_t,

hence

    ‖m_{p,t}(y) − m_{p̃,t}(y)‖₂² = (t² / (1 − t)) ‖∇log p_t(y) − ∇log p̃_t(y)‖₂².

Taking expectation under Y ∼ p_t shows

    E_MAG(t) = E_{Y∼p_t} ‖m_{p,t}(Y) − m_{p̃,t}(Y)‖₂² = (t² / (1 − t)) J_T,   so   √(J_T) = (√(1 − t) / t) √(E_MAG(t)).

Combining the pull-back inequality with the W_2–Fisher bound yields

    W_2(P_{Y_0}, P_{Ỹ_0}) ≤ (Φ(t) C̄_LSI(t) + Ψ(t)) √(J_T) = (Φ(t) C̄_LSI(t) + Ψ(t)) (√(1 − t) / t) √(E_MAG(t)).

Then we show that C_PB(t) = O(t^{−1}). We can rewrite

    C_PB(t) = (1/t) [ exp(I_γ(t)) C̄_LSI(t) + Γ(t) exp(4 I_γ(t)) ],   (26)

and

    I_γ(t) = A_1 t^{γ/2} / (1 − t)^{1+γ/2} + [A_2 / (1 − θ_t)] t^γ / (1 − t)^{1+γ},   θ_t = C_N^{(γ)} t^γ (1 − t)^γ.

Fix any t ≤ 1/2. Then (1 − t)^{−a} ≤ 2^a for every a ≥ 0, and θ_t ≤ 2^γ C_N^{(γ)} t^γ. In particular, for all sufficiently small t we have θ_t ≤ 1/2, hence (1 − θ_t)^{−1} ≤ 2. Therefore, for all sufficiently small t,

    I_γ(t) ≤ c_1 t^{γ/2} + c_2 t^γ ≤ C t^{γ/2},

for constants c_1, c_2, C depending only on the problem parameters. Hence I_γ(t) → 0 as t ↓ 0, and in particular exp(I_γ(t)) = 1 + o(1), exp(4 I_γ(t)) = 1 + o(1).
Since (1 − t) M² + t ≤ M² + 1, we have C̄_LSI(t) ≤ (M² + 1) / min{C_LSI(π), 1} =: C_0, so C̄_LSI(t) = O(1) as t ↓ 0. Also, the Taylor expansion gives Γ(t) = −log(1 − t) = t + O(t²), so in particular Γ(t) ≤ 2t for all sufficiently small t. Plugging these bounds into (26), for all sufficiently small t we obtain

    C_PB(t) ≤ (1/t) [ exp(I_γ(t)) C_0 + 2t exp(4 I_γ(t)) ] = C_0 exp(I_γ(t)) / t + 2 exp(4 I_γ(t)) = O(t^{−1}),

because exp(I_γ(t)) and exp(4 I_γ(t)) remain bounded and tend to 1 as t ↓ 0.

Proof of Theorem 2. Let X_i := (Y_t^i, Y_0^i), i = 1, …, n, be the i.i.d. sample with law P, and write P_n f := n^{−1} Σ_{i=1}^n f(X_i). Let h_# ∈ argmin_{h∈H} R(h) be a population risk minimizer over H (a best approximation to h*). Under (15) we have ρ²(h*, h_#) ≤ ε²/4 and, writing Δ_K := sup_{h,x} |ℓ_K(x; h) − ℓ(x; h)|, also Δ_K ≤ ε²/8. For l = 0, 1, … define the shells

    A_l := { h ∈ H : 2^l ε² ≤ ρ²(h*, h) < 2^{l+1} ε² }.

Since ρ(h*, ĥ_λ) ≥ ε implies ĥ_λ ∈ ∪_{l≥0} A_l,

    P(ρ(h*, ĥ_λ) ≥ ε) ≤ Σ_{l=0}^∞ P*(ĥ_λ ∈ A_l).

Because ĥ_λ ∈ argmin_{h∈H} P_n ℓ_K(·; h), we have P_n(ℓ_K(·; h_#) − ℓ_K(·; ĥ_λ)) ≥ 0. Hence, on the event {ĥ_λ ∈ A_l},

    sup_{h∈A_l} P_n(ℓ_K(·; h_#) − ℓ_K(·; h)) ≥ 0.

For any h ∈ A_l, using the definition of ρ² and the uniform error bound Δ_K,

    E[ℓ_K(·; h_#) − ℓ_K(·; h)] = E[ℓ(·; h_#) − ℓ(·; h)] + E[(ℓ_K − ℓ)(·; h_#)] − E[(ℓ_K − ℓ)(·; h)]
        ≤ −(R(h) − R(h_#)) + 2Δ_K
        = −(ρ²(h*, h) − ρ²(h*, h_#)) + 2Δ_K
        ≤ −(2^l − 1/4) ε² + 2Δ_K
        ≤ −(2^l − 1/2) ε².

Therefore,

    P*(ĥ_λ ∈ A_l) ≤ P*( sup_{h∈A_l} ν_n(ℓ_K(·; h_#) − ℓ_K(·; h)) ≥ √n (2^l − 1/2) ε² ),

where ν_n(f) := √n (P_n − P) f.
To apply Lemma 13 to each shell A_l, set

    M_l := √n (2^l − 1/2) ε²,   v_l² := 8 c_v 2^{l+1} ε².

By Lemma 6 and the triangle inequality, sup_{h∈A_l} Var(ℓ(·; h) − ℓ(·; h⋆)) ≤ v_l². Since ℓ_K − ℓ is uniformly bounded by Δ_K and Δ_K ≤ ε²/8, the same bound (up to an absolute numerical factor absorbed into c_v) holds for ℓ_K(·; h) − ℓ_K(·; h⋆). Moreover, the centered class satisfies Bernstein's condition with constant c_b by Lemma 7 (again unaffected by a uniformly bounded perturbation). With k ≥ c_b/(4c_v), the mean–variance condition (36) in Lemma 13 holds for (M_l, v_l²), and the entropy condition (14) implies (37) uniformly over l ≥ 0 (the least favorable case is l = 0). Thus Lemma 13 yields

    P*( sup_{h∈A_l} ν_n(ℓ_K(·; h⋆) − ℓ_K(·; h)) ≥ M_l ) ≤ 3 exp( −(1 − k) M_l² / (2 [4 v_l² + M_l c_b/(3√n)]) ) ≤ 3 exp( −(1 − k)(2^l − 1/2)² n ε² / ((64 c_v + 2c_b/3) 2^{l+1}) ).

Summing over l ≥ 0 gives

    P(ρ(h*, ĥ_λ) ≥ ε) ≤ 4 exp(−c_e n ε²),   c_e = (1 − k) / (8 (64 c_v + 2c_b/3)).

This completes the proof.

Proof of Theorem 3. The result follows by combining Theorems 1 and 2 with the approximation and estimation bounds in Theorems 4 and 5.

Proof of Corollary 2. Recall the definitions

    b := (η + 1)/(2η),   κ := b / (2(η + 1)/d* + 2b),   r := (2(η + 1)/d*) κ = (η + 1)/(2η + d*).

By the choice of W and L,

    (WL)^{−2(η+1)/d*} = ((n / log⁵ n)^κ)^{−2(η+1)/d*} = (n / log⁵ n)^{−r}.

Moreover, since κ ∈ (0, 1) for every d* ≥ 1 and η > 0, we have WL ≤ n for all n ≥ 3, hence log(WL) ≤ log n. Therefore,

    ((WL)² log⁵(WL) / n)^b ≤ ((WL)² log⁵ n / n)^b = ((n / log⁵ n)^{2κ} log⁵ n / n)^b = (n / log⁵ n)^{−b(1−2κ)}.

Finally, by the definition of κ,

    r = (2(η + 1)/d*) κ = (2(η + 1)/d*) · b / (2(η + 1)/d* + 2b) = b (1 − 2κ),

so the two WL-dependent terms decay at the same rate (n / log⁵ n)^{−r}, which proves the stated bound.
B.1 Approximation error

Theorem 4 (Approximation error). Let s_t(·; h) := ∇log p^h_{Y_t}(·) and s⋆_t(·) := s_t(·; h*) = ∇log p^{h*}_{Y_t}(·). Suppose h⋆ ∈ C^{η+1}(U, B) with bounded support U. Given K and H := NN(𝒲, ℒ, B) with 𝒲 = c_W W log W and ℒ = c_L L log L, we can bound the approximation error as

    inf_{h∈H} E ‖s̃_{t,K}(Y_t; h, π, π) − s⋆_t(Y_t)‖₂² ≲ (α_t² / σ_t⁴) (WL)^{−2(η+1)/d*} + ε(π̃, t, K).   (27)

Proof. Fix h ∈ H and consider the K-anchor estimator s̃_{t,K}(·; h, π, π) in (4). By (a + b)² ≤ 2a² + 2b²,

    E ‖s̃_{t,K}(Y_t; h, π, π) − s⋆_t(Y_t)‖₂² ≤ 2 E ‖s̃_{t,K}(Y_t; h, π, π) − s_t(Y_t; h)‖₂² + 2 E ‖s_t(Y_t; h) − s⋆_t(Y_t)‖₂².

The first term is the Monte Carlo approximation error and is bounded by the results in Section 4, yielding the ε(π̃, t, K) term. The second term is controlled by (i) the approximation rate of h* by networks in H when h* ∈ C^{η+1}(U, B) (Lemma 12) and (ii) the score perturbation bound in Lemma 5, which turns a uniform approximation error on h into an L² error on the induced score. Taking the infimum over h ∈ H yields the claimed bound.

Lemma 4 (Fréchet derivative of the score w.r.t. h). Fix a noise level t ∈ (0, 1) with α_t > 0 and σ_t > 0. Let (U, A, π) be a probability space and let h : U → R^D be measurable with E_π ‖h(U)‖ < ∞. Set a(u) := α_t h(u) and define the smoothed density

    p^h_{Y_t}(x) = ∫ φ_{σ_t}(x − a(u)) π(du),   φ_{σ_t}(z) = (2πσ_t²)^{−D/2} exp(−‖z‖² / (2σ_t²)).

Define the induced score at level t by s_t(x; h) := ∇_x log p^h_{Y_t}(x). Let

    r_h(u | x) := φ_{σ_t}(x − a(u)) / ∫ φ_{σ_t}(x − a(v)) π(dv),   m_h(x) := ∫ a(u) r_h(u | x) π(du) = E[a(U) | Y_t = x],

where Y_t = α_t h(U) + σ_t Z with U ∼ π and Z ∼ N(0, I_D).
Then for any direction $\delta h \in L^\infty(\pi; \mathbb{R}^D)$ the Fréchet derivative exists and
\[
D_h s_t(x;h)[\delta h]
= \frac{\alpha_t}{\sigma_t^2} \int r_h(u \mid x)
\Big[ I_D + \frac{(a(u) - m_h(x))(x - a(u))^\top}{\sigma_t^2} \Big]
\delta h(u)\, \pi(du). \tag{28}
\]

Proof. Differentiate $p^h_{Y_t}(x) = \int \phi_{\sigma_t}(x - a(u))\,\pi(du)$ in $x$:
\[
\nabla_x \log p^h_{Y_t}(x)
= \frac{\int (a(u) - x)\, \phi_{\sigma_t}(x - a(u))\, \pi(du)}{\sigma_t^2 \int \phi_{\sigma_t}(x - a(u))\, \pi(du)}
= \frac{m_h(x) - x}{\sigma_t^2},
\]
so it suffices to differentiate $m_h(x)$ with respect to $h$. Consider the numerator and denominator
\[
N(x) := \int a(u)\, \phi_{\sigma_t}(x - a(u))\, \pi(du),
\qquad
D(x) := \int \phi_{\sigma_t}(x - a(u))\, \pi(du) = p^h_{Y_t}(x),
\]
so $m_h(x) = N(x)/D(x)$. For a perturbation $\delta h$, write $\delta a = \alpha_t\, \delta h$ and apply the quotient rule:
\[
\delta m_h(x) = \frac{\delta N(x)}{D(x)} - \frac{N(x)}{D(x)^2}\, \delta D(x).
\]
A direct differentiation of the Gaussian factor yields
\[
\delta D(x) = \int \phi_{\sigma_t}(x - a(u))\, \frac{(x - a(u))^\top}{\sigma_t^2}\, \delta a(u)\, \pi(du),
\]
and similarly
\[
\delta N(x) = \int \Big[ I_D + \frac{a(u)(x - a(u))^\top}{\sigma_t^2} \Big] \phi_{\sigma_t}(x - a(u))\, \delta a(u)\, \pi(du).
\]
Combining the last three displays, using $r_h(u \mid x) = \phi_{\sigma_t}(x - a(u))/D(x)$ and $m_h(x) = N(x)/D(x)$, we obtain
\[
\delta m_h(x) = \int r_h(u \mid x)
\Big[ I_D + \frac{(a(u) - m_h(x))(x - a(u))^\top}{\sigma_t^2} \Big]
\delta a(u)\, \pi(du).
\]
Since $s_t(x;h) = (m_h(x) - x)/\sigma_t^2$ and $\delta a = \alpha_t\, \delta h$, this gives (28). Dominated convergence (justified by bounded $\delta h$ and Gaussian envelopes) allows interchanging differentiation and integration.

Lemma 5 (Score perturbation bound under $\|h - h^*\|_\infty$). Let $h, h^*: \mathcal{U} \to \mathbb{R}^D$ and fix $t$ with $\alpha_t > 0$ and $\sigma_t > 0$. Assume $\|h - h^*\|_\infty \le \varepsilon$ (i.e., $\sup_u \|h(u) - h^*(u)\| \le \varepsilon$). Then
\[
\big\| s_t(\cdot;h) - s_t(\cdot;h^*) \big\|_{L^2(p^{h^*}_{Y_t})}
\le C_D\, \frac{\alpha_t}{\sigma_t^2}\, \varepsilon,
\qquad
C_D := \sqrt{2 + 2D(D+2)}. \tag{29}
\]

Proof. Define the interpolation $h_\upsilon := h^* + \upsilon(h - h^*)$, $0 \le \upsilon \le 1$, and write $p_\upsilon := p^{h_\upsilon}_{Y_t}$.
By the fundamental theorem of calculus and Lemma 4,
\[
s_t(\cdot;h) - s_t(\cdot;h^*) = \int_0^1 D_h s_t(\cdot;h_\upsilon)[h - h^*]\, d\upsilon .
\]
By Minkowski and Jensen, for any $\upsilon \in [0,1]$,
\[
\| s_t(\cdot;h) - s_t(\cdot;h^*) \|_{L^2(p_\upsilon)}
\le \int_0^1 \| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)}\, d\mu
\le \sup_{\mu \in [0,1]} \| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)},
\]
where $\delta h := h - h^*$ and $\|\delta h\|_\infty \le \varepsilon$. Using (28) pointwise in $x$, Jensen (for the posterior average), and $\|\delta h\|_\infty \le \varepsilon$,
\[
\big\| D_h s_t(x;h_\mu)[\delta h] \big\|
\le \frac{\alpha_t}{\sigma_t^2}\, \varepsilon\,
\Big\| I_D + \frac{(a_\mu(u) - m_\mu(x))(x - a_\mu(u))^\top}{\sigma_t^2} \Big\|_{\mathrm{op},\, r_\mu(\cdot \mid x)},
\]
where $a_\mu(u) := \alpha_t h_\mu(u)$, $r_\mu(\cdot \mid x) := r_{h_\mu}(\cdot \mid x)$, and $m_\mu(x) := m_{h_\mu}(x)$. Here $\|\cdot\|_{\mathrm{op},r}$ denotes the square root of the $r(\cdot \mid x)$-average of the squared operator norm. Writing $M_\mu(u,x)$ for the bracketed matrix, bounding $(\alpha+\beta)^2 \le 2(\alpha^2+\beta^2)$, and using that the operator norm of a rank-one matrix is the product of the vector norms,
\[
\| M_\mu(\cdot,x) \|^2_{\mathrm{op},r}
\le 2\Big( 1 + \frac{1}{\sigma_t^4}\, E\big[ \|a_\mu(U) - m_\mu(X)\|^2 \|X - a_\mu(U)\|^2 \,\big|\, X = x \big] \Big).
\]
Let $A_\mu(x) := E[\|a_\mu - m_\mu\|^2 \mid X = x] = \operatorname{tr} \operatorname{Cov}(a_\mu \mid x)$ and $B_\mu(x) := E[\|X - a_\mu\|^2 \mid X = x]$. Note that $A_\mu(x) \le B_\mu(x)$ because $B_\mu(x) = \|x - m_\mu(x)\|^2 + A_\mu(x)$. Hence
\[
E_{p_\upsilon} \| M_\mu(\cdot,x) \|^2_{\mathrm{op},r}
\le 2\Big( 1 + \frac{1}{\sigma_t^4}\, E_{p_\upsilon}\big[ B_\mu(X)^2 \big] \Big).
\]
By Jensen, $B_\mu(X)^2 = \big( E[\|X - a_\mu\|^2 \mid X] \big)^2 \le E[\|X - a_\mu\|^4 \mid X]$, thus $E_{p_\upsilon}[B_\mu(X)^2] \le E_{p_\upsilon} \|X - a_\mu(U)\|^4$. But conditionally on $U$, $X - a_\mu(U) = \sigma_t Z$ with $Z \sim N(0, I_D)$, so
\[
E \|X - a_\mu(U)\|^4 = \sigma_t^4\, E\|Z\|^4 = \sigma_t^4\, D(D+2).
\]
Combining the displays and taking square roots,
\[
\| D_h s_t(\cdot;h_\mu)[\delta h] \|_{L^2(p_\upsilon)}
\le \frac{\alpha_t}{\sigma_t^2}\, \varepsilon\, \sqrt{2 + 2D(D+2)} .
\]
Since this bound is uniform in $\mu$ and $\upsilon$, taking $\upsilon = 0$ gives (29).

B.2 Estimation error

Lemma 6 (Variance--mean). Recall that the score model satisfies
\[
s_t(y_t;h) = \frac{\alpha_t m_h(y_t) - y_t}{\sigma_t^2},
\qquad
m_h(y_t) := E[\,h(U) \mid Y_t = y_t\,].
\]
Assume $\|h^*\|_\infty \le B$ and $\sup_{h \in \mathcal{H}} \|h\|_\infty \le B$, and define the excess risk $\rho^2(h^*, h) = E\big[ \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big]$. Then for all sufficiently small $\varepsilon > 0$,
\[
\sup_{\{\rho(h^*,h) \le \varepsilon :\, h \in \mathcal{H}\}}
\operatorname{Var}\big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big) \le c_v \varepsilon^2,
\qquad \text{with } c_v = \frac{40\, \alpha_t^2 B^2}{\sigma_t^4}.
\]

Proof. By the Gaussian conditional score, $\nabla_{y_t} \log p(y_t \mid y_0) = (\alpha_t y_0 - y_t)/\sigma_t^2$, hence
\[
s_t(Y_t;h) - \nabla_{Y_t} \log p(Y_t \mid Y_0) = \frac{\alpha_t}{\sigma_t^2}\big( m_h(Y_t) - Y_0 \big),
\]
and therefore $\ell_t(Y_t, Y_0; h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2$. Let $m_h := m_h(Y_t)$ and $m_0 := m_{h^*}(Y_t)$ and set $\Delta := m_h - m_0$. Then
\[
\ell_t(\cdot;h) - \ell_t(\cdot;h^*)
= \frac{\alpha_t^2}{\sigma_t^4}\big( \|\Delta\|_2^2 + 2\langle \Delta,\, m_0 - Y_0 \rangle \big).
\]
Since $m_0(Y_t) = E[Y_0 \mid Y_t]$, we have $E[m_0 - Y_0 \mid Y_t] = 0$, hence
\[
\rho^2(h^*, h) = E\big[ \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big] = \frac{\alpha_t^2}{\sigma_t^4}\, E\|\Delta\|_2^2 .
\]
By the sup-norm bound on the transport class, $\|m_h\| \le B$, $\|m_0\| \le B$, and $\|Y_0\| \le B$ almost surely, so $\|\Delta\| \le 2B$ and $\|m_0 - Y_0\| \le 2B$. Using $(a+b)^2 \le 2a^2 + 2b^2$ and Cauchy--Schwarz,
\[
E\Big[ \big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big)^2 \Big]
\le \frac{\alpha_t^4}{\sigma_t^8}\, E\Big[ 2\|\Delta\|_2^4 + 8\|\Delta\|_2^2 \|m_0 - Y_0\|_2^2 \Big]
\le \frac{\alpha_t^4}{\sigma_t^8}\big( 2(2B)^2 + 8(2B)^2 \big) E\|\Delta\|_2^2
= \frac{40\, \alpha_t^4 B^2}{\sigma_t^8}\, E\|\Delta\|_2^2 .
\]
Since $\operatorname{Var}(Z) \le E[Z^2]$, combining with $E\|\Delta\|_2^2 = (\sigma_t^4/\alpha_t^2)\, \rho^2(h^*, h)$ gives
\[
\operatorname{Var}\big( \ell_t(\cdot;h) - \ell_t(\cdot;h^*) \big)
\le \frac{40\, \alpha_t^2 B^2}{\sigma_t^4}\, \rho^2(h^*, h).
\]
Taking the supremum over $\rho(h^*, h) \le \varepsilon$ yields the claim.

Lemma 7 (Bernstein's condition). With $c_b = 16\, \alpha_t^2 B^2/\sigma_t^4$, the centered excess-loss class
\[
\mathcal{F}_\varepsilon := \Big\{ f_h(X) := \Delta\ell_h(X) - E[\Delta\ell_h(X)] :\, \rho(h^*, h) \le \varepsilon,\, h \in \mathcal{H} \Big\},
\qquad
\Delta\ell_h := \ell_t(\cdot;h) - \ell_t(\cdot;h^*),
\]
satisfies Bernstein's condition in the following moment form: there exists $v^2 = v^2(\varepsilon)$ such that $\sup_{f \in \mathcal{F}_\varepsilon} \operatorname{Var}(f(X)) \le v^2$ and, for all integers $k \ge 2$,
\[
\sup_{f \in \mathcal{F}_\varepsilon} E|f(X)|^k \le \tfrac12\, k!\, v^2\, c_b^{k-2}. \tag{30}
\]
Moreover, using Lemma 6 (variance--mean), one may take $v^2(\varepsilon) = c_v \varepsilon^2$, where $c_v = 40\, \alpha_t^2 B^2/\sigma_t^4$.

Proof. Fix $t \in (0,1)$ and assume the forward perturbation model
\[
Y_0 = h^*(U), \qquad Y_t = \alpha_t Y_0 + \sigma_t Z, \qquad Z \sim N(0, I_D),
\]
with $Z$ independent of $U$. Assume the score model admits the posterior-mean representation
\[
s_t(y_t;h) = \frac{\alpha_t m_h(y_t) - y_t}{\sigma_t^2},
\qquad
m_h(y_t) := E[\,h(U) \mid Y_t = y_t\,],
\]
so that the ideal loss reduces to $\ell_t(Y_t, Y_0; h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2$. We have $\|h(U)\|_2 \le B$ a.s.\ and therefore $\|m_h(Y_t)\|_2 = \|E[h(U) \mid Y_t]\|_2 \le B$ a.s. Also $\|Y_0\|_2 = \|h^*(U)\|_2 \le B$ a.s. Using the reduced loss form,
\[
0 \le \ell_t(\cdot;h) = \frac{\alpha_t^2}{\sigma_t^4} \|m_h(Y_t) - Y_0\|_2^2 \le \frac{\alpha_t^2}{\sigma_t^4}(2B)^2 .
\]
The same bound holds for $\ell_t(\cdot;h^*)$. Thus
\[
|\Delta\ell_h| \le \frac{8\, \alpha_t^2 B^2}{\sigma_t^4}
\quad\Longrightarrow\quad
|f_h| = |\Delta\ell_h - E\Delta\ell_h| \le \frac{16\, \alpha_t^2 B^2}{\sigma_t^4} = c_b \quad \text{a.s.}
\]
For any centered random variable $f$ with $|f| \le c_b$ almost surely and any integer $k \ge 2$,
\[
|f|^k \le c_b^{k-2} f^2
\quad\Longrightarrow\quad
E|f|^k \le c_b^{k-2}\, E[f^2] = c_b^{k-2} \operatorname{Var}(f).
\]
Now let $v^2 := \sup_{f \in \mathcal{F}_\varepsilon} \operatorname{Var}(f)$. Then for all $k \ge 2$, $\sup_{f \in \mathcal{F}_\varepsilon} E|f|^k \le v^2 c_b^{k-2}$. Since $\tfrac12 k! \ge 1$ for all $k \ge 2$, this implies (30). Finally, Lemma 6 gives, on the localized class $\rho(h^*, h) \le \varepsilon$, that $\operatorname{Var}(\Delta\ell_h) \le c_v \rho^2(h^*, h) \le c_v \varepsilon^2$, hence $\operatorname{Var}(f_h) = \operatorname{Var}(\Delta\ell_h) \le c_v \varepsilon^2$. Therefore one can take $v^2(\varepsilon) = c_v \varepsilon^2$.

Theorem 5 (Estimation error). Suppose $\mathcal{H} = \mathrm{NN}(W, L)$. Then there exists $c_{NN} > 0$ such that any
\[
\varepsilon \;\ge\; \min_{\delta \ge 1,\, \zeta \ge 1}
\left\{ c_{NN} \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{2}{2-\delta}}
+ \Big( \frac{C_{t,\zeta}^2}{c_h^2\, n} \Big)^{\frac{2}{2-\delta(1 - \frac{D}{2\zeta})}} \right\} \tag{31}
\]
satisfies the integral entropy equation (14).

Proof. Consider solving the entropy equation
\[
\int_{k\varepsilon^2/16}^{4 c_v^{1/2} \varepsilon} H_B^{1/2}(u, \mathcal{L})\, du \le c_h\, n^{1/2} \varepsilon^2 .
\]
Note that we have $\mathcal{L} \subset C^\zeta\big( [-R_t, R_t]^D,\, \alpha_t/\sigma_t^{\zeta+4} \big)$ with $R_t \asymp \sigma_t \log n$.
The entropy of this smooth class is bounded by $H_B^{1/2}(u, \mathcal{L}) \lesssim \frac{1}{\sigma_t^{\zeta+4}}\, u^{-D/(2\zeta)}$. It then suffices to solve the following sufficient condition for $\varepsilon$, for a fixed $1 \le \delta < 2$:
\[
c_h\, n^{1/2} \varepsilon^2
\ge \int_{k\varepsilon^2/16}^{\varepsilon^\delta} H_B^{1/2}(u, \mathcal{L})\, du
+ \int_{\varepsilon^\delta}^{\infty} H_B^{1/2}\Big( u,\, C^\zeta\big( [-R_t, R_t]^D, \tfrac{1}{\sigma_t^{\zeta+4}} \big) \Big)\, du .
\]
We now bound the right-hand side from above. For the first term,
\[
\int_{k\varepsilon^2/16}^{\varepsilon^\delta} H_B^{1/2}(u, \mathcal{L})\, du
\le \varepsilon^\delta\, H_B^{1/2}\big( k\varepsilon^2/16, \mathcal{L} \big).
\]
For the second term, let $C_{t,\zeta} := \frac{1}{\sigma_t^{\zeta+4}}$ and assume $D/(2\zeta) > 1$; then
\[
\int_{\varepsilon^\delta}^{\infty} H_B^{1/2}\Big( u,\, C^\zeta\big( [-R_t, R_t]^D, \tfrac{1}{\sigma_t^{\zeta+4}} \big) \Big)\, du
\le C_{t,\zeta}\, \varepsilon^{\delta(1 - \frac{D}{2\zeta})} .
\]
Combining the two bounds, we obtain the entropy inequality
\[
c_h\, n^{1/2} \varepsilon^2
\ge \varepsilon^\delta\, H_B^{1/2}\big( k\varepsilon^2/16, \mathcal{L} \big)
+ C_{t,\zeta}\, \varepsilon^{\delta(1 - \frac{D}{2\zeta})} . \tag{32}
\]
Using the Lipschitz transfer lemma (Lemma 8) and plugging in the entropy bound for the NN class from Lemma 9, we get the bound
\[
\varepsilon \ge \min_{\substack{1 \le \delta < 2 \\ \zeta \ge 1}}
\left\{ c_{NN} \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{1}{2(2-\delta)}}
+ \Big( \frac{C_{t,\zeta}^2}{c_h^2\, n} \Big)^{\frac{1}{2(2-\delta(1 - \frac{D}{2\zeta}))}} \right\} . \tag{33}
\]
Let $\zeta = D(\eta+2)/d^*$ and $\delta = \frac{\eta+2}{\eta+1}$. Note that $\frac{D}{2\zeta} = \frac{d^*}{2(\eta+2)}$, so the assumption $\frac{D}{2\zeta} > 1$ holds whenever $d^* > 2(\eta+2)$. We can get the target bound
\[
\varepsilon \gtrsim \Big( \frac{W^2 L^2 \log^5(WL)}{c_h^2\, n} \Big)^{\frac{\eta+1}{2\eta}}
+ \frac{n^{-\frac{\eta+1}{2\eta + d^*}}}{c_h\, \sigma_t^{4 + \frac{D(\eta+2)}{d^*}}} . \tag{34}
\]

Lemma 8 (Lipschitz transfer w.r.t.\ centers). Fix $\|x\| \le R_x$ and suppose $\|y_i\|, \|y'_i\| \le R_y$ for all $i$. Define
\[
g_y(x) = \frac{x}{\sigma^2} - \frac{1}{\sigma^2} \sum_{i=1}^K w_i(x;y)\, y_i,
\qquad
w(x;y) = \operatorname{softmax}\big( s(x;y) \big),
\qquad
s_i(x;y) := -\frac{\|x - y_i\|^2}{2\sigma^2} .
\]
Then
\[
\| g_y(x) - g_{y'}(x) \| \le \frac{C_0}{\sigma^2}\, \|y - y'\|_{\infty,K},
\qquad
C_0 := 1 + \frac{2 R_y (R_x + R_y)}{\sigma^2} .
\]
Thus, with $B := \frac{R_x + R_y}{\sigma^2} + R_x$,
\[
| f_y(x) - f_{y'}(x) | \le \frac{2 B C_0}{\sigma^2}\, \|y - y'\|_{\infty,K} .
\]

Proof. Write $g_y(x) - g_{y'}(x) = -\frac{1}{\sigma^2}\big( \sum_{i=1}^K w_i(x;y)\, y_i - \sum_{i=1}^K w_i(x;y')\, y'_i \big)$.
Add and subtract $\sum_i w_i(x;y')\, y_i$ to obtain
\[
\sum_{i=1}^K w_i(x;y)\, y_i - \sum_{i=1}^K w_i(x;y')\, y'_i
= \sum_{i=1}^K w'_i (y_i - y'_i) + \sum_{i=1}^K (w_i - w'_i)\, y_i,
\]
where $w_i := w_i(x;y)$ and $w'_i := w_i(x;y')$. Therefore,
\[
\| g_y(x) - g_{y'}(x) \|
\le \frac{1}{\sigma^2} \Big\| \sum_{i=1}^K w'_i (y_i - y'_i) \Big\|
+ \frac{1}{\sigma^2} \Big\| \sum_{i=1}^K (w_i - w'_i)\, y_i \Big\|
\le \frac{1}{\sigma^2} \sum_{i=1}^K w'_i \|y_i - y'_i\|
+ \frac{1}{\sigma^2} \sum_{i=1}^K |w_i - w'_i|\, \|y_i\|
\le \frac{1}{\sigma^2} \|y - y'\|_{\infty,K} + \frac{R_y}{\sigma^2} \|w - w'\|_1 .
\]
Next we bound $\|w - w'\|_1$ by $\|s - s'\|_\infty$. For the softmax $w_i = \exp(s_i)/\sum_j \exp(s_j)$, the Jacobian satisfies
\[
\frac{\partial w_i}{\partial s_j} = w_i\big( \mathbf{1}\{i = j\} - w_j \big).
\]
For any direction $a \in \mathbb{R}^K$, the directional derivative is $(Ja)_i = w_i\big( a_i - \sum_{j=1}^K w_j a_j \big)$. Hence
\[
\|Ja\|_1 = \sum_{i=1}^K w_i \Big| a_i - \sum_{j=1}^K w_j a_j \Big|
\le \sum_{i=1}^K w_i \Big( |a_i| + \Big| \sum_{j=1}^K w_j a_j \Big| \Big)
\le \sum_{i=1}^K w_i \big( \|a\|_\infty + \|a\|_\infty \big) = 2 \|a\|_\infty .
\]
By the mean value theorem applied to the smooth map $s \mapsto \operatorname{softmax}(s)$ along the segment $s_\tau = s' + \tau(s - s')$, $\tau \in [0,1]$, we get
\[
\|w - w'\|_1 \le \sup_{\tau \in [0,1]} \| J(s_\tau)(s - s') \|_1 \le 2 \|s - s'\|_\infty .
\]
Next, we bound $\|s - s'\|_\infty$ in terms of $\|y - y'\|_{\infty,K}$. For each $i$,
\[
| s_i(x;y) - s_i(x;y') |
= \frac{1}{2\sigma^2} \big| \|x - y_i\|^2 - \|x - y'_i\|^2 \big|
\le \frac{1}{2\sigma^2} \|y_i - y'_i\| \big( \|x - y_i\| + \|x - y'_i\| \big)
\le \frac{1}{2\sigma^2} \|y_i - y'_i\| \big( R_x + R_y + R_x + R_y \big)
= \frac{R_x + R_y}{\sigma^2} \|y_i - y'_i\| .
\]
Taking the maximum over $i$ gives $\|s - s'\|_\infty \le \frac{R_x + R_y}{\sigma^2} \|y - y'\|_{\infty,K}$. Combining,
\[
\|w - w'\|_1 \le 2 \|s - s'\|_\infty \le \frac{2(R_x + R_y)}{\sigma^2} \|y - y'\|_{\infty,K} .
\]
Plugging into the earlier estimate for $\|g_y(x) - g_{y'}(x)\|$ yields
\[
\| g_y(x) - g_{y'}(x) \|
\le \frac{1}{\sigma^2} \Big( 1 + \frac{2 R_y (R_x + R_y)}{\sigma^2} \Big) \|y - y'\|_{\infty,K} .
\]
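The softmax estimate used above, $\|w - w'\|_1 \le 2\|s - s'\|_\infty$, can be sanity-checked numerically. The sketch below (a minimal illustration with arbitrary dimensions, centers, and bandwidth, not from the paper) draws random centers and a small perturbation and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

def weights(x, Y, sigma):
    # softmax over s_i = -||x - y_i||^2 / (2 sigma^2)
    s = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
    e = np.exp(s - s.max())
    return e / e.sum()

K, D, sigma = 8, 3, 0.7
x = rng.normal(size=D)
x *= 1.0 / max(1.0, np.linalg.norm(x))        # enforce ||x|| <= R_x = 1
Y = rng.normal(size=(K, D))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))
Yp = Y + 1e-3 * rng.normal(size=(K, D))       # perturbed centers y'

w, wp = weights(x, Y, sigma), weights(x, Yp, sigma)
s = -np.sum((x - Y) ** 2, axis=1) / (2 * sigma**2)
sp = -np.sum((x - Yp) ** 2, axis=1) / (2 * sigma**2)

# softmax is 2-Lipschitz from (R^K, ||.||_inf) to the simplex with ||.||_1
assert np.sum(np.abs(w - wp)) <= 2 * np.max(np.abs(s - sp)) + 1e-12
```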
Finally, using $\big| \|a\|^2 - \|b\|^2 \big| \le (\|a\| + \|b\|)\, \|a - b\|$ with $a = g_y(x) - x$ and $b = g_{y'}(x) - x$, we obtain
\[
| f_y(x) - f_{y'}(x) | \le \big( \|g_y(x) - x\| + \|g_{y'}(x) - x\| \big)\, \|g_y(x) - g_{y'}(x)\| .
\]
Under $\|x\| \le R_x$ and $\|y_i\| \le R_y$, we have
\[
\|g_y(x)\| \le \frac{\|x\|}{\sigma^2} + \frac{1}{\sigma^2} \sum_i w_i \|y_i\| \le \frac{R_x + R_y}{\sigma^2},
\]
so $\|g_y(x) - x\| \le \frac{R_x + R_y}{\sigma^2} + R_x =: B$, and similarly for $y'$. Therefore,
\[
| f_y(x) - f_{y'}(x) | \le 2B\, \|g_y(x) - g_{y'}(x)\| \le \frac{2 B C_0}{\sigma^2}\, \|y - y'\|_{\infty,K} .
\]

Lemma 9 (Empirical $L^\infty$ covering of $\mathcal{H}$). Let $\mathcal{H} = \mathcal{H}(W, L)$ be ReLU networks of depth $L$ and width $W$, with output dimension $d_y$ and range bound $\|h(u)\| \le R_h$. Then for any finite set $\{U_i\}_{i=1}^K$ and any $\eta \in (0, 2R_h]$,
\[
\log N\big( \eta, \mathcal{H}, \|\cdot\|_{\infty,K} \big)
\le C_1\, d_y\, \operatorname{Pdim}(\mathcal{H}) \log\Big( \frac{C_2 R_h}{\eta} \Big),
\]
where $\operatorname{Pdim}(\mathcal{H})$ is the pseudo-dimension of the (scalar-output) network class and $C_1, C_2$ are universal constants. For ReLU nets, $\operatorname{Pdim}(\mathcal{H}) \le C_3\, WL \log(eW)$, hence
\[
\log N\big( \eta, \mathcal{H}, \|\cdot\|_{\infty,K} \big)
\le C\, d_y\, WL \log(eW) \log\Big( \frac{C' R_h}{\eta} \Big).
\]

Proof. Apply Thm.\ 12.5 of Anthony and Bartlett (2009) to each coordinate class $\{u \mapsto h_\ell(u)\}$, use the range bound to normalize, and union-bound over the $d_y$ coordinates to pass from scalar to vector outputs under $\ell_\infty$ on the sample. The pseudo-dimension upper bound for piecewise-linear nets is from Bartlett et al.\ (2019).

C Auxiliary lemmas

Lemma 10 (LSI condition). Assume the latent prior $\pi$ satisfies a log--Sobolev inequality with constant $C_{\mathrm{LSI}}(\pi) > 0$ (e.g., $\pi = N(0, I_d)$ gives $C_{\mathrm{LSI}}(\pi) = 1$). Fix any $h \in \mathcal{H}$ and define $\tilde Y_0 = h(U)$ with $U \sim \pi$ and $\tilde Y_t = \alpha_t \tilde Y_0 + \sigma_t Z$ with $Z \sim N(0, I_D)$ independent. Under Assumption 1, the map $h$ is $M$-Lipschitz, and $q_t$ satisfies a log--Sobolev inequality with
\[
C_{\mathrm{LSI}}(q_t) \ge \frac{\min\{ C_{\mathrm{LSI}}(\pi), 1 \}}{\alpha_t^2 M^2 + \sigma_t^2} . \tag{35}
\]

Proof. Let $\mu := \pi \otimes N(0, I_D)$ be the joint law of $(U, Z) \in \mathbb{R}^d \times \mathbb{R}^D$.
By tensorization of log--Sobolev inequalities, $\mu$ satisfies an LSI with constant $C_{\mathrm{LSI}}(\mu) = \min\{ C_{\mathrm{LSI}}(\pi), 1 \}$. Define the (deterministic) map $F(u, z) := \alpha_t h(u) + \sigma_t z$, so that $F_\#\mu$ is the law of $\tilde Y_t$. For any smooth $\varphi: \mathbb{R}^D \to \mathbb{R}$, set $\psi(u, z) := \varphi(F(u, z))$. By the chain rule,
\[
\| \nabla_{(u,z)} \psi(u, z) \|_2^2
\le \big( \alpha_t^2 \|J_h(u)\|_{\mathrm{op}}^2 + \sigma_t^2 \big)\, \| \nabla\varphi(F(u,z)) \|_2^2
\le \big( \alpha_t^2 M^2 + \sigma_t^2 \big)\, \| \nabla\varphi(F(u,z)) \|_2^2,
\]
where the last inequality uses Assumption 1. Applying the LSI for $\mu$ to $\psi$ and rewriting the result under the pushforward $F_\#\mu$ yields
\[
\operatorname{Ent}_{q_t}(\varphi^2)
\le \frac{2(\alpha_t^2 M^2 + \sigma_t^2)}{C_{\mathrm{LSI}}(\mu)} \int_{\mathbb{R}^D} \| \nabla\varphi(y) \|_2^2\, q_t(y)\, dy .
\]
This proves (35).

Lemma 11 (LSI $\Rightarrow$ $W_2$--Fisher chain; Theorem 22.17 of Villani et al.\ (2009)). If $p$ satisfies the LSI with constant $\rho > 0$ (Definition 3), then $p$ also satisfies Talagrand's $T_2$ inequality and, for any $q \ll p$,
\[
W_2^2(q, p) \le \frac{2}{\rho}\, \mathrm{KL}\big( q \,\|\, p \big) \le \frac{1}{\rho^2}\, J\big( q \,\|\, p \big),
\]
where $J(q \,\|\, p) := \int \| \nabla \log q - \nabla \log p \|_2^2\, q$ is the relative Fisher information.

The following is a ReLU approximation result for a Hölder class of smooth functions, which is a simplified version of Theorem 1.1 in Lu et al.\ (2021) and Lemma 11 in Huang, Jiao, Li, Liu, Wang and Yang (2022).

Lemma 12 (Lemma 11 in Huang, Jiao, Li, Liu, Wang and Yang (2022)). For any $f \in C^r([0,1]^d, \mathbb{R}, B)$, there exists a ReLU network $\Phi$ with $W = c_W(W \log W)$, $L = c_L(L \log L)$, and $E = (WL)^{c_E}$ for some positive constants $c_W$, $c_L$, and $c_E$ depending on $d$ and $r$, such that
\[
\sup_{x \in [0,1]^d} | \Phi(x) - f(x) | = O\big( B\, (WL)^{-\frac{2r}{d}} \big).
\]

Lemma 13. Assume that $f(Y) \in \mathcal{F}$ satisfies the Bernstein condition with some constant $c_b$ for an i.i.d.\ sample $Y_1, \ldots, Y_n$. Let
\[
\phi(M, v^2, \mathcal{F}) = \frac{M^2}{2\big[ 4 v^2 + M c_b/(3 n^{1/2}) \big]},
\]
where $\operatorname{Var}(f(Y)) \le v^2$.
Assume that
\[
M \le k\, n^{1/2} v^2 / (4 c_b), \tag{36}
\]
with $0 < k < 1$, and
\[
\int_{kM/(8 n^{1/2})}^{v} H_B^{1/2}(u, \mathcal{F})\, du \le M k^{3/2} / 2^{10}, \tag{37}
\]
then
\[
P^*\Big\{ \sup_{f \in \mathcal{F}} n^{-1/2} \sum_{i=1}^n \big( f(Y_i) - E f(Y_i) \big) \ge M \Big\}
\le 3 \exp\big( -(1-k)\, \phi(M, v^2, n) \big),
\]
where $P^*$ denotes the outer probability. Specifically, for any event $A$, $P^*(A) := \inf\{ P(B) : A \subseteq B,\, B \text{ measurable} \}$.

Proof of Lemma 13. The result follows from the same arguments as in the proof of Theorem 3 in Shen and Wong (1994) with $\operatorname{Var}(f(X)) \le v^2$. Note that Bernstein's condition replaces the upper boundedness condition there, and the second condition of (4.6) there is not needed here.

We now present the technical lemmas that serve as the foundation for the proof of Theorem 1. For $s \in [0, t]$, let $p_s, q_s$ be $C^2$ densities on $\mathbb{R}^D$ with finite second moments solving the continuity equations
\[
\partial_s p_s + \nabla \cdot (p_s v^p_s) = 0,
\qquad
\partial_s q_s + \nabla \cdot (q_s v^q_s) = 0, \tag{38}
\]
where the (variance-preserving) probability--flow fields are
\[
v^p_s(x) = -\tfrac12 \beta(s)\, x - \beta(s)\, \nabla \log p_s(x),
\qquad
v^q_s(x) = -\tfrac12 \beta(s)\, x - \beta(s)\, \nabla \log q_s(x), \tag{39}
\]
with a measurable schedule $\beta(s) \ge 0$. Denote the scores and their difference by $s^p := \nabla \log p_s$, $s^q := \nabla \log q_s$, $\Delta := s^p - s^q$, and define $J_s := \int_{\mathbb{R}^D} \|\Delta(y)\|^2\, q_s(y)\, dy$. Assume there is a measurable $L^\star(s) \ge 0$ such that for all $x$ and all $s \in [0, t]$,
\[
\| \nabla^2 \log p_s(x) \|_{\mathrm{op}} \le L^\star(s),
\qquad
\| \nabla^2 \log q_s(x) \|_{\mathrm{op}} \le L^\star(s), \tag{40}
\]
and that all integrals below are justified (sufficient decay/integrability; boundary terms vanish).

Assumption 4. There exist measurable functions $L(\cdot), K(\cdot): [0, t] \to [0, \infty)$ such that, for all $s \in [0, t]$:

(A1) Flow Lipschitz:
\[
\| \nabla v^p_s(x) \|_{\mathrm{op}} \le L(s),
\qquad
\| \nabla v^q_s(x) \|_{\mathrm{op}} \le L(s) \quad \forall x. \tag{41}
\]

(A2) Score--gap growth:
\[
\frac{d}{ds} J_s \le 2 K(s)\, J_s . \tag{42}
\]

Lemma 14 (Variable-coefficient pull--back). Let $P := p_0$ and $Q := q_0$.
Under Assumption 4, for every $t > 0$,
\[
W_2(p_0, q_0)
\le e^{\int_0^t L}\, W_2(p_t, q_t)
+ e^{\int_0^t L} \int_0^t \beta(u) \exp\Big( \int_u^t \big( K(r) - L(r) \big)\, dr \Big)\, du\; \sqrt{J_t}. \tag{43}
\]

Proof. Let $\pi_t$ be an optimal coupling of $p_t, q_t$; draw $(X_t, Y_t) \sim \pi_t$ and evolve backward
\[
X_s := \Phi^p_{s \leftarrow t}(X_t),
\qquad
Y_s := \Phi^q_{s \leftarrow t}(Y_t),
\qquad s \in [0, t].
\]
Then $X_s \sim p_s$, $Y_s \sim q_s$. Set $\Delta^{\mathrm{traj}}_s := X_s - Y_s$ and $R_s := \big( E \|\Delta^{\mathrm{traj}}_s\|^2 \big)^{1/2}$; then $W_2(p_s, q_s) \le R_s$, $W_2(p_0, q_0) \le R_0$, and $W_2(p_t, q_t) \le R_t$. Differentiate $\tfrac12 \|\Delta^{\mathrm{traj}}_s\|^2$ and use the flow ODEs:
\[
\frac{d}{ds}\, \tfrac12 \|\Delta^{\mathrm{traj}}_s\|^2
= \big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(X_s) - v^p_s(Y_s) \big\rangle
+ \big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(Y_s) - v^q_s(Y_s) \big\rangle .
\]
By (41), $\| v^p_s(X_s) - v^p_s(Y_s) \| \le L(s) \|\Delta^{\mathrm{traj}}_s\|$. Moreover, $v^p_s - v^q_s = -\beta(s) \Delta_s$ (where $\Delta_s := \nabla \log p_s - \nabla \log q_s$) pointwise, so
\[
\big\langle \Delta^{\mathrm{traj}}_s,\, v^p_s(Y_s) - v^q_s(Y_s) \big\rangle
\le \beta(s)\, \|\Delta^{\mathrm{traj}}_s\|\, \|\Delta_s(Y_s)\| .
\]
Taking expectations and applying Cauchy--Schwarz yields
\[
\frac{d}{ds} R_s \le L(s) R_s + \beta(s) \sqrt{J_s},
\qquad 0 \le s \le t. \tag{44}
\]
By Grönwall from $s$ to $t$ (integrating the backward flow stability),
\[
R_s \le e^{\int_s^t L}\, R_t + \int_s^t \beta(u)\, e^{\int_s^u L}\, \sqrt{J_u}\, du. \tag{45}
\]
By (42) and Grönwall, the growth condition implies that for $0 \le u \le t$,
\[
\sqrt{J_u} \le e^{\int_u^t K}\, \sqrt{J_t}. \tag{46}
\]
Insert (46) into (45) with $s = 0$, and use $W_2(p_0, q_0) \le R_0$, $R_t \ge W_2(p_t, q_t)$:
\[
W_2(p_0, q_0)
\le e^{\int_0^t L}\, W_2(p_t, q_t)
+ e^{\int_0^t L} \int_0^t \beta(u)\, e^{-\int_u^t L}\, e^{\int_u^t K}\, du\; \sqrt{J_t},
\]
which is (43).

Lemma 15 (Hessian bound with $h \in C^{1+\gamma}$). Assume the latent prior has bounded Hessian: $\sup_u \| \nabla^2_u \log \pi_U(u) \| \le \Lambda_2$. Consider the VP corruption at level $s$,
\[
Y_s = \alpha_s X + \sigma_s Z,
\qquad \alpha_s \in (0, 1],\ \sigma_s > 0,
\]
with $X = h(U)$ and $\sigma_s \le c\rho$ (inside the tube).
Then, with $H_s(y) := \nabla^2 \log p_s(y)$, there exist constants $C^{(\gamma)}_T, C^{(\gamma)}_S, C^{(\gamma)}_N$ depending only on $(m, M, H_\gamma, \Lambda_2, \rho)$ such that for all $y$,
\[
\| \Pi_T H_s(y) \Pi_T \|_{\mathrm{op}} \le C^{(\gamma)}_T\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}, \tag{47}
\]
\[
\| \Pi_T H_s(y) \Pi_N \|_{\mathrm{op}} \le C^{(\gamma)}_S\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}, \tag{48}
\]
\[
\Pi_N H_s(y) \Pi_N \preceq -\Big( \sigma_s^{-2} - C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}} \Big) \Pi_N . \tag{49}
\]
Consequently,
\[
L^\star(s) := \sup_y \lambda_{\max}\big( H_s(y) \big)
\le C_\gamma\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}},
\qquad
C_\gamma := C^{(\gamma)}_T + \frac{\big( C^{(\gamma)}_S \big)^2}{1 - C^{(\gamma)}_N \sigma_s^{2\gamma}/\alpha_s^{2\gamma}} .
\]

Proof. For $Y_s = \alpha_s X + \sigma_s Z$ with $Z \sim N(0, I_D)$ independent of $X$, the score and Hessian satisfy
\[
\nabla \log p_s(y) = \frac{\alpha_s}{\sigma_s^2}\Big( E[X \mid Y_s = y] - \frac{y}{\alpha_s} \Big),
\qquad
H_s(y) = \nabla^2 \log p_s(y) = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}(X \mid Y_s = y) - \frac{1}{\sigma_s^2} I_D . \tag{A}
\]
Fix $y$ and let $x_0 := \Pi_{\mathcal{M}}(y/\alpha_s)$ be the unique nearest-point projection onto $\mathcal{M}$ (well-defined since $\sigma_s \le c\rho$). Let $\Pi_T, \Pi_N$ denote the orthogonal projections onto the tangent/normal spaces at $x_0$. Because $x_0$ is the nearest-point projection, the residual $r := y/\alpha_s - x_0$ is normal: $\Pi_T r = 0$ and $r = \Pi_N r$. Choose $u_0$ such that $h(u_0) = x_0$, and write a local $C^{1,\gamma}$ parametrization of $\mathcal{M}$: for $\xi$ in a small ball in $\mathbb{R}^d$,
\[
x(\xi) = x_0 + J\xi + R(\xi),
\qquad J := J_h(u_0),
\qquad \|R(\xi)\| \le C \|\xi\|^{1+\gamma}, \tag{B}
\]
where $C$ depends only on $(m, H_\gamma)$. The conditional law of $\xi$ given $Y_s = y$ has (unnormalized) density proportional to
\[
\exp\Big( -\frac{1}{2\sigma_s^2} \| r - (J\xi + R(\xi)) \|^2 \Big)\, \pi_U(u_0 + \xi).
\]
Using (B) and $\Pi_T r = 0$, standard Laplace/Gaussian comparison bounds imply that there exists $c_0 > 0$ (depending only on $(m, M, H_\gamma)$) such that the posterior concentrates on $\{ \|\xi\| \lesssim \varepsilon \}$, and the moments obey
\[
E[\xi \mid y] = O\big( \varepsilon^{1+\gamma} \big), \tag{50}
\]
\[
\operatorname{Cov}(\xi \mid y) = \frac{\sigma_s^2}{\alpha_s^2} (J^\top J)^{-1} + O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{51}
\]
\[
E[\|\xi\|^{2+\gamma} \mid y] = O(\varepsilon^{2+\gamma}),
\qquad
E[\|\xi\|^{2+2\gamma} \mid y] = O(\varepsilon^{2+2\gamma}). \tag{52}
\]
Write $X - x_0 = J\xi + R(\xi)$ and project:
\[
\delta x_T := \Pi_T(X - x_0) = J\xi + O(\|\xi\|^{1+\gamma}),
\qquad
\delta x_N := \Pi_N(X - x_0) = O(\|\xi\|^{1+\gamma}).
\]
Using (51)--(52) and $\|J\|_{\mathrm{op}} \le M$, $\|(J^\top J)^{-1}\|_{\mathrm{op}} \le m^{-2}$, we obtain the block covariance bounds
\[
\operatorname{Cov}_T := \Pi_T \operatorname{Cov}(X \mid y) \Pi_T
= J \operatorname{Cov}(\xi \mid y) J^\top + O\big( E[\|\xi\|^{2+\gamma} \mid y] \big)
= \frac{\sigma_s^2}{\alpha_s^2} \Pi_T + O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{53}
\]
\[
\operatorname{Cov}_{TN} := \Pi_T \operatorname{Cov}(X \mid y) \Pi_N
= O\big( E[\|J\xi\|\, \|\xi\|^{1+\gamma} \mid y] \big)
= O\Big( \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big), \tag{54}
\]
\[
\operatorname{Cov}_N := \Pi_N \operatorname{Cov}(X \mid y) \Pi_N
= O\big( E[\|\xi\|^{2+2\gamma} \mid y] \big)
= O\Big( \frac{\sigma_s^{2+2\gamma}}{\alpha_s^{2+2\gamma}} \Big). \tag{55}
\]
Using (A) and (53)--(55),
\[
\Pi_T H_s \Pi_T = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_T - \frac{1}{\sigma_s^2} \Pi_T
= O\Big( \frac{\alpha_s^2}{\sigma_s^4} \cdot \frac{\sigma_s^{2+\gamma}}{\alpha_s^{2+\gamma}} \Big)
= O\Big( \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}} \Big),
\]
which gives (47). Similarly,
\[
\Pi_T H_s \Pi_N = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_{TN} = O\Big( \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}} \Big),
\]
giving (48). Finally,
\[
\Pi_N H_s \Pi_N = \frac{\alpha_s^2}{\sigma_s^4} \operatorname{Cov}_N - \frac{1}{\sigma_s^2} \Pi_N
\preceq -\Big( \sigma_s^{-2} - C\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}} \Big) \Pi_N,
\]
which is (49) after renaming constants. In the $(T, N)$ block form, write
\[
H_s = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix},
\qquad
A = \Pi_T H_s \Pi_T,\quad B = \Pi_T H_s \Pi_N,\quad C = \Pi_N H_s \Pi_N .
\]
By (49), $-C \succeq \mu I$ with
\[
\mu := \sigma_s^{-2} - C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma-2}}{\alpha_s^{2\gamma}}
= \sigma_s^{-2}\big( 1 - \theta_s \big),
\qquad
\theta_s = C^{(\gamma)}_N\, \frac{\sigma_s^{2\gamma}}{\alpha_s^{2\gamma}} .
\]
When $\theta_s < 1$, the Schur complement bound implies
\[
\lambda_{\max}(H_s) \le \|A\|_{\mathrm{op}} + \frac{\|B\|_{\mathrm{op}}^2}{\mu} .
\]
Combining with (47)--(48),
\[
\lambda_{\max}(H_s)
\le C^{(\gamma)}_T\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}
+ \frac{\big( C^{(\gamma)}_S \big)^2}{1 - \theta_s}\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}}
= C_\gamma\, \frac{\sigma_s^{\gamma-2}}{\alpha_s^{\gamma}},
\]
which yields the stated envelope for $L^\star(s)$.

Lemma 16 (Score--gap growth along the $q$--flow). Let $p_s, q_s$ solve the continuity equations (38) with VP probability--flow fields (39), and define
\[
s^p = \nabla \log p_s,
\qquad s^q = \nabla \log q_s,
\qquad \Delta := s^p - s^q,
\qquad J_s := \int_{\mathbb{R}^D} \|\Delta\|_2^2\, q_s .
\]
Assume the Hessian envelope (40) holds: $\| \nabla^2 \log p_s(x) \|_{\mathrm{op}} \le L^\star(s)$ and $\| \nabla^2 \log q_s(x) \|_{\mathrm{op}} \le L^\star(s)$ for all $x$. Then, for all $0 \le u \le t$,
\[
J_u \le J_t \exp\Big( 2 \int_u^t \beta(r)\, \big( \tfrac12 + 4 L^\star(r) \big)\, dr \Big). \tag{56}
\]
Equivalently, $dJ_s/ds \le 2 K(s) J_s$ with $K(s) = \beta(s)\big( \tfrac12 + 4 L^\star(s) \big)$.

Proof. Note that if $\rho_s$ solves $\partial_s \rho_s + \nabla \cdot (\rho_s v_s) = 0$, then its score $s^\rho := \nabla \log \rho_s$ satisfies
\[
\partial_s s^\rho + (\nabla s^\rho) v_s + (\nabla v_s)^\top s^\rho + \nabla(\nabla \cdot v_s) = 0. \tag{57}
\]
We apply (57) to $(p_s, v^p_s)$ and $(q_s, v^q_s)$, subtract the two identities, and rewrite the transport part along $v^q_s$:
\[
\partial_s \Delta + (\nabla\Delta) v^q_s + (\nabla v^q_s)^\top \Delta
= -\Big[ (\nabla s^p)(v^p_s - v^q_s)
+ \big( (\nabla v^p_s)^\top - (\nabla v^q_s)^\top \big) s^p
+ \nabla\big( \nabla \cdot (v^p_s - v^q_s) \big) \Big]. \tag{58}
\]
Using (39) one has
\[
v^p_s - v^q_s = -\beta(s)\Delta,
\qquad
\nabla v^p_s - \nabla v^q_s = -\beta(s)\nabla\Delta,
\qquad
\nabla \cdot (v^p_s - v^q_s) = -\beta(s)\, \nabla \cdot \Delta,
\]
so (58) becomes
\[
\partial_s \Delta + (\nabla\Delta) v^q_s + (\nabla v^q_s)^\top \Delta
= \beta(s) \Big[ (\nabla s^p)\Delta + (\nabla\Delta)^\top s^p + \nabla(\nabla \cdot \Delta) \Big]. \tag{59}
\]
Differentiate $J_s = \int \|\Delta\|^2 q_s$ and use $\partial_s q_s = -\nabla \cdot (q_s v^q_s)$:
\[
\frac{d}{ds} J_s
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s + \int \|\Delta\|^2\, \partial_s q_s
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s - \int \|\Delta\|^2\, \nabla \cdot (q_s v^q_s)
= \int 2 \langle \Delta, \partial_s \Delta \rangle\, q_s + \int q_s\, v^q_s \cdot \nabla(\|\Delta\|^2)
= 2 \int q_s \big\langle \Delta,\, \partial_s \Delta + (\nabla\Delta) v^q_s \big\rangle .
\]
Insert (59) to obtain
\[
\frac{d}{ds} J_s
= -2 \int q_s \big\langle \Delta, (\nabla v^q_s)^\top \Delta \big\rangle
+ 2\beta(s) \int q_s \big\langle \Delta, (\nabla s^p) \Delta \big\rangle
+ 2\beta(s)\, I_s, \tag{60}
\]
where
\[
I_s := \int q_s \Big[ \big\langle \Delta, \nabla(\nabla \cdot \Delta) \big\rangle
+ \big\langle \Delta, (\nabla\Delta)^\top s^p \big\rangle \Big].
\]
Let $f := \nabla \cdot \Delta$. Using $s^p = s^q + \Delta$ we split
\[
I_s = \int q_s \Big[ \langle \Delta, \nabla f \rangle + \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle \Big]
+ \int q_s \big\langle \Delta, (\nabla\Delta)^\top \Delta \big\rangle .
\]
The first bracket equals $-\int q_s \|\nabla\Delta\|_F^2 \le 0$ by Lemma 17. For the second term, use
\[
\big\langle \Delta, (\nabla\Delta)^\top \Delta \big\rangle
= \Delta^\top (\nabla\Delta) \Delta \le \|\nabla\Delta\|_{\mathrm{op}} \|\Delta\|^2,
\qquad
\|\nabla\Delta\|_{\mathrm{op}} \le \| \nabla^2 \log p_s \|_{\mathrm{op}} + \| \nabla^2 \log q_s \|_{\mathrm{op}} \le 2 L^\star(s),
\]
hence
\[
I_s \le 2 L^\star(s)\, J_s . \tag{61}
\]
From (39),
\[
\nabla v^q_s(x) = -\tfrac12 \beta(s) I - \beta(s)\, \nabla^2 \log q_s(x)
\quad\Longrightarrow\quad
\|\nabla v^q_s\|_{\mathrm{op}} \le \beta(s)\big( \tfrac12 + L^\star(s) \big).
\]
Also,
\[
\int q_s \big\langle \Delta, (\nabla s^p) \Delta \big\rangle \le L^\star(s) \int q_s \|\Delta\|^2 = L^\star(s)\, J_s .
\]
Insert these bounds and (61) into (60):
\[
\frac{d}{ds} J_s
\le 2\beta(s)\big( \tfrac12 + L^\star(s) \big) J_s + 2\beta(s) L^\star(s) J_s + 4\beta(s) L^\star(s) J_s
= \big( \beta(s) + 8\beta(s) L^\star(s) \big) J_s .
\]
Equivalently, $\frac{d}{ds} J_s \le 2\beta(s)\big( \tfrac12 + 4 L^\star(s) \big) J_s$. Applying Grönwall on $[u, t]$ yields (56).

Lemma 17 (Weighted IBP identity along the $q$--flow). Let $q$ be a $C^2$ density on $\mathbb{R}^D$ with score $s^q = \nabla \log q$. Let $\Delta = \nabla g$ be a $C^2$ gradient field (so $\nabla\Delta = \nabla^2 g$ is symmetric), and set $f := \nabla \cdot \Delta$. Assume sufficient decay/integrability so that boundary terms vanish. Then
\[
\int_{\mathbb{R}^D} q\, \langle \Delta, \nabla f \rangle\, dx
+ \int_{\mathbb{R}^D} q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= -\int_{\mathbb{R}^D} q\, \|\nabla\Delta\|_F^2\, dx \le 0. \tag{62}
\]

Proof. Write $s^q = \nabla \log q$, so $q\, s^q = \nabla q$. Using integration by parts (boundary terms vanish),
\[
\int q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= \int \Delta^\top (\nabla\Delta)\, \nabla q\, dx
= -\int q\, \nabla \cdot \big( \Delta^\top (\nabla\Delta) \big)\, dx .
\]
Compute the divergence in coordinates (summation convention):
\[
\nabla \cdot \big( \Delta^\top (\nabla\Delta) \big)
= \partial_j\big( \Delta_i\, \partial_i \Delta_j \big)
= (\partial_j \Delta_i)(\partial_i \Delta_j) + \Delta_i\, \partial_i(\partial_j \Delta_j).
\]
Since $\Delta = \nabla g$, we have $\partial_j \Delta_i = \partial_i \Delta_j$, hence
\[
(\partial_j \Delta_i)(\partial_i \Delta_j) = \sum_{i,j} (\partial_i \Delta_j)^2 = \|\nabla\Delta\|_F^2,
\qquad
\Delta_i\, \partial_i(\partial_j \Delta_j) = \langle \Delta, \nabla f \rangle .
\]
Therefore
\[
\int q\, \big\langle \Delta, (\nabla\Delta)^\top s^q \big\rangle\, dx
= -\int q\, \|\nabla\Delta\|_F^2\, dx - \int q\, \langle \Delta, \nabla f \rangle\, dx,
\]
which is exactly (62).

D Proofs in Section 4

Proof of Lemma 1. Fix $t \in (0, 1)$ and let $U \sim \pi$. For $y \in \mathbb{R}^D$ define
\[
\phi_{\sigma_t}(z) := (2\pi\sigma_t^2)^{-D/2} \exp\Big( -\frac{\|z\|^2}{2\sigma_t^2} \Big),
\qquad
w_t(u; y) := \phi_{\sigma_t}\big( y - \alpha_t h(u) \big),
\]
and the (unnormalized) mixture density
\[
B_t(y) := E_\pi\big[ w_t(U; y) \big]
= \int_{\mathcal{U}} \phi_{\sigma_t}\big( y - \alpha_t h(u) \big)\, \pi(u)\, du
= p_t(y).
\]
Define also the posterior mean of $h$ at level $t$,
\[
m_t(y) := E[\,h(U) \mid Y_t = y\,] = \frac{E_\pi\big[ w_t(U; y)\, h(U) \big]}{B_t(y)} .
\]
Given i.i.d.\ anchors $U^{(1)}, \ldots, U^{(K)} \overset{\text{iid}}{\sim} \pi$, define the self-normalized estimator of $m_t(y)$,
\[
\tilde m_{t,K}(y) := \frac{\sum_{j=1}^K w_t(U^{(j)}; y)\, h(U^{(j)})}{\sum_{j=1}^K w_t(U^{(j)}; y)} .
\]
When $\tilde\pi \equiv \pi$, the transport-based score estimator (4) satisfies
\[
\tilde s_{t,K}(y; h, \pi, \pi) = \frac{1}{\sigma_t^2}\big( \alpha_t\, \tilde m_{t,K}(y) - y \big).
\]
Moreover, by (2),
\[
s_t(y; h) = \nabla_y \log p_t(y) = \frac{1}{\sigma_t^2}\big( \alpha_t\, m_t(y) - y \big).
\]
Therefore, for all $y \in \mathbb{R}^D$,
\[
\tilde s_{t,K}(y; h, \pi, \pi) - s_t(y; h)
= \frac{\alpha_t}{\sigma_t^2}\big( \tilde m_{t,K}(y) - m_t(y) \big), \tag{63}
\]
and hence
\[
E\big\| \tilde s_{t,K}(Y_t; h, \pi, \pi) - s_t(Y_t; h) \big\|_2^2
= \frac{\alpha_t^2}{\sigma_t^4}\, E\big\| \tilde m_{t,K}(Y_t) - m_t(Y_t) \big\|_2^2 . \tag{64}
\]
In the first step, we bound the conditional mean-squared error of the self-normalized estimator $\tilde m_{t,K}(y)$ for fixed $y$. Assume $E_\pi[w_t(U; y)^4 \|h(U)\|^4] < \infty$ and $B_t(y) > 0$. Then the standard self-normalized importance sampling expansion (e.g., Owen (2013, Ch.\ 9)) gives
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2
= \frac{1}{K}\, \frac{E_\pi\big[ w_t(U; y)^2\, \|h(U) - m_t(y)\|_2^2 \big]}{B_t(y)^2}
+ O\Big( \frac{1}{K^2} \Big), \tag{65}
\]
where the expectation is over the anchors $U^{(1:K)}$ conditional on $Y_t = y$.

In the second step, we rewrite the leading term in (65). Using the identity $\phi_{\sigma_t}(z)^2 = (4\pi\sigma_t^2)^{-D/2}\, \phi_{\sigma_t/\sqrt{2}}(z)$, we obtain
\[
E_\pi\big[ w_t(U; y)^2\, g(U) \big]
= (4\pi\sigma_t^2)^{-D/2}\, E_\pi\Big[ \phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(U) \big)\, g(U) \Big]
\]
for any measurable $g$. Define
\[
B_{t, \sigma_t/\sqrt{2}}(y) := E_\pi\Big[ \phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(U) \big) \Big],
\qquad
C_t(y) := (4\pi\sigma_t^2)^{-D/2}\, \frac{B_{t, \sigma_t/\sqrt{2}}(y)}{B_t(y)^2} .
\]
Also define the posterior at bandwidth $\sigma_t/\sqrt{2}$ by
\[
q_{t, \sigma_t/\sqrt{2}}(u \mid y) := \frac{\phi_{\sigma_t/\sqrt{2}}\big( y - \alpha_t h(u) \big)\, \pi(u)}{B_{t, \sigma_t/\sqrt{2}}(y)} .
\]
Applying the above identity with $g(U) = \|h(U) - m_t(y)\|_2^2$ yields
\[
\frac{E_\pi\big[ w_t(U; y)^2\, \|h(U) - m_t(y)\|_2^2 \big]}{B_t(y)^2}
= C_t(y)\, E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big]. \tag{66}
\]
Combining (65) and (66) gives
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2
= \frac{C_t(y)}{K}\, E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big]
+ O\Big( \frac{1}{K^2} \Big). \tag{67}
\]
Then we bound $C_t(y)$ using Lemma 18 (Gaussian--manifold convolution).
The marginal density is
\[
B_t(y) = p_t(y) = \int_{\mathcal{U}} \phi_{\sigma_t}\big( y - \alpha_t h(u) \big)\, \pi(u)\, du,
\]
which is a Gaussian smoothing of the image manifold $\alpha_t \mathcal{M}$ at scale $\sigma_t$. Applying Lemma 18 with intrinsic dimension $d$ (and observing that the exponential terms cancel in the ratio defining $C_t$) yields constants $c_1, c_2, \sigma_0 > 0$ such that
\[
c_1\, \sigma_t^{-d} \le C_t(y) \le c_2\, \sigma_t^{-d}
\qquad \big( y \in T_r(\alpha_t \mathcal{M}),\ \sigma_t \le \sigma_0 \big). \tag{68}
\]
In the next step, we bound the posterior second-moment term. Since $\|h(U)\|_2 \le B$ almost surely, we have for all $y$,
\[
\|h(U) - m_t(y)\|_2^2 \le \big( \|h(U)\|_2 + \|m_t(y)\|_2 \big)^2 \le 4B^2,
\]
hence
\[
E_{q_{t, \sigma_t/\sqrt{2}}(\cdot \mid y)}\big[ \|h(U) - m_t(y)\|_2^2 \big] \le 4B^2 . \tag{69}
\]
Combining (67), (68), and (69), and choosing $K$ sufficiently large so that the $O(K^{-2})$ term is dominated by the leading term, we obtain for $y \in T_r(\alpha_t \mathcal{M})$ and $\sigma_t \le \sigma_0$,
\[
E\big\| \tilde m_{t,K}(y) - m_t(y) \big\|_2^2 \le \frac{C'}{K}\, \sigma_t^{-d},
\]
for a constant $C' > 0$ depending only on $(B, d, D)$ and the geometric constants in Assumptions 1--2. Taking the expectation over $Y_t$ and substituting into (64) yields
\[
E\big\| \tilde s_{t,K}(Y_t; h, \pi, \pi) - s_t(Y_t; h) \big\|_2^2
\le \frac{\alpha_t^2}{\sigma_t^4} \cdot \frac{C'}{K}\, \sigma_t^{-d}
= \frac{C\, \alpha_t^2}{K\, \sigma_t^{d+4}},
\]
which is (19).

Lemma 18 (Gaussian--manifold convolution: two-sided bounds). Let $h \in \mathcal{H}_{\mathrm{reg}}$ satisfy Assumption 1 on a bounded latent domain $\mathcal{U} \subset \mathbb{R}^d$, and let $\pi$ be a latent density on $\mathcal{U}$ satisfying $0 < \pi_{\min} \le \pi(u) \le \pi_{\max} < \infty$ for all $u \in \mathcal{U}$. Assume furthermore that $\mathcal{M} := h(\mathcal{U})$ has reach at least $\rho_{\mathcal{M}} > 0$ (Assumption 2). For $\alpha > 0$ define
\[
p^X_\alpha(x) := \int_{\mathcal{U}} \phi_\alpha\big( x - h(u) \big)\, \pi(u)\, du,
\qquad
\phi_\alpha(z) = (2\pi\alpha^2)^{-D/2} \exp\Big( -\frac{\|z\|^2}{2\alpha^2} \Big).
\]
Fix $r \in (0, \rho_{\mathcal{M}})$ and consider $x \in T_r(\mathcal{M}) = \{ x : \operatorname{dist}(x, \mathcal{M}) \le r \}$.
Then there exist constants $c_\ell, c_u > 0$ and $\alpha_0 \in (0, r)$, depending only on $(D, d, \pi_{\min}, \pi_{\max}, m, M, \rho_{\mathcal{M}}, r, \mathcal{U})$, such that for all $x \in T_r(\mathcal{M})$ and all $\alpha \in (0, \alpha_0]$,
\[
c_\ell\, \alpha^{d-D} \exp\Big( -\frac{\operatorname{dist}(x, \mathcal{M})^2}{2\alpha^2} \Big)
\le p^X_\alpha(x)
\le c_u\, \alpha^{d-D} \exp\Big( -\frac{\operatorname{dist}(x, \mathcal{M})^2}{2\alpha^2} \Big).
\]

Proof. Fix $x \in T_r(\mathcal{M})$ and write $\delta = \operatorname{dist}(x, \mathcal{M})$. Since $r < \rho_{\mathcal{M}}$ and $\operatorname{reach}(\mathcal{M}) \ge \rho_{\mathcal{M}}$, there exists a unique nearest point $y^\star \in \mathcal{M}$ with $\|x - y^\star\| = \delta$. Let $v := x - y^\star$, so $\|v\| = \delta$ and $v \perp T_{y^\star}\mathcal{M}$.

First, by positive reach there exist $r_0 = r_0(\rho_{\mathcal{M}})$ and a local chart $\Psi: B_d(r_0) \to \mathcal{M}$ around $y^\star$ of the form $\Psi(w) = y^\star + Pw + \psi(w)$, where $P$ is an isometry onto $T_{y^\star}\mathcal{M}$, $\psi(w) \in N_{y^\star}\mathcal{M}$, $\psi(0) = 0$, $D\psi(0) = 0$, and $\|\psi(w)\| \le K\|w\|^2$ for $\|w\| \le r_0$. Write $y(w) := \Psi(w)$. Since $v, \psi(w) \in N_{y^\star}\mathcal{M}$ and $Pw \in T_{y^\star}\mathcal{M}$ are orthogonal,
\[
\|x - y(w)\|^2 = \|v - \psi(w)\|^2 + \|w\|^2 .
\]
After shrinking $r_0$ if necessary, there exist constants $a_1, a_2 > 0$ such that
\[
\delta^2 + a_1 \|w\|^2 \le \|x - y(w)\|^2 \le \delta^2 + a_2 \|w\|^2
\qquad (\|w\| \le r_0).
\]
Next, choose $u^\star \in \mathcal{U}$ such that $h(u^\star) = y^\star$. By Assumption 1(i), $J_h(u^\star)$ has smallest singular value at least $m$. Thus, by the inverse function property, there exist $r_1 \in (0, r_0)$ and a $C^1$ map $\Theta: B_d(r_1) \to \mathcal{U}$ such that $h(\Theta(w)) = \Psi(w)$, $\Theta(0) = u^\star$. Differentiating gives $J_h(\Theta(w))\, J_\Theta(w) = D\Psi(w)$. Since $D\Psi(w) = P + D\psi(w)$ and $\|D\psi(w)\| \le 2K\|w\|$, shrinking $r_1$ so that $\|D\psi(w)\| \le \tfrac12$ yields
\[
\frac{1}{2M} \le s_{\min}(J_\Theta(w)) \le s_{\max}(J_\Theta(w)) \le \frac{3}{2m} .
\]
Hence
\[
(2M)^{-d} \le | \det(J_\Theta(w)) | \le \big( 3/(2m) \big)^d
\qquad (w \in B_d(r_1)).
\]
Moreover, $\delta^2 + a_1 \|w\|^2 \le \|x - h(\Theta(w))\|^2 \le \delta^2 + a_2 \|w\|^2$.

Next, decompose
\[
p^X_\alpha(x)
= (2\pi\alpha^2)^{-D/2} \int_{\mathcal{U}} \exp\Big( -\frac{\|x - h(u)\|^2}{2\alpha^2} \Big)\, \pi(u)\, du
=: I_{\mathrm{near}} + I_{\mathrm{far}},
\]
where $I_{\mathrm{near}}$ integrates over $\Theta(B_d(r_1))$.
Changing variables $u = \Theta(w)$ and using $\pi \le \pi_{\max}$ and the lower quadratic bound,
$$I_{\mathrm{near}} \le (2\pi\alpha^2)^{-D/2}\,\pi_{\max}\Big(\frac{3}{2m}\Big)^{d}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big)\int_{\mathbb{R}^d}\exp\!\Big(-\frac{a_1\|w\|^2}{2\alpha^2}\Big)\,dw.$$
Evaluating the Gaussian integral yields
$$I_{\mathrm{near}} \le C^{(1)}_u\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$
For $u \notin \Theta(B_d(r_1))$, continuity and uniqueness of the projection imply $\|x - h(u)\|^2 \ge \delta^2 + \kappa$ for some $\kappa > 0$. Hence
$$I_{\mathrm{far}} \le (2\pi\alpha^2)^{-D/2}\,\pi_{\max}\,\operatorname{vol}(U)\exp\!\Big(-\frac{\delta^2 + \kappa}{2\alpha^2}\Big).$$
Since $\exp(-\kappa/2\alpha^2) = o(\alpha^d)$, this term is absorbed into the same bound for small $\alpha$. Thus
$$p^X_\alpha(x) \le c_u\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$

Finally, for the lower bound, restrict to $\|w\| \le c\alpha$. Using $\pi \ge \pi_{\min}$ and the upper quadratic bound,
$$p^X_\alpha(x) \ge (2\pi\alpha^2)^{-D/2}\,\pi_{\min}\,(2M)^{-d}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big)\int_{\|w\| \le c\alpha}\exp\!\Big(-\frac{a_2\|w\|^2}{2\alpha^2}\Big)\,dw.$$
Bounding the exponential below and using $\operatorname{vol}_d(B_d(c\alpha)) = \omega_d c^d \alpha^d$ gives
$$p^X_\alpha(x) \ge c_\ell\,\alpha^{d-D}\exp\!\Big(-\frac{\delta^2}{2\alpha^2}\Big).$$

Proof of Lemma 2. Let $Z \sim \mathrm{Unif}[0,1]^d$ and assume $T : [0,1]^d \to \mathbb{R}^d$ is measurable with $T(Z) \sim \pi$. Then for any integrable $\psi : \mathbb{R}^d \to \mathbb{R}$,
$$\int_{\mathbb{R}^d}\psi(u)\,\pi(du) = \mathbb{E}[\psi(T(Z))] = \int_{[0,1]^d}\psi(T(z))\,dz.$$
Applying this with $\psi_0(u) := \phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ and $\psi_1(u) := h(u)\,\phi(y_t; \alpha_t h(u), \sigma_t^2 I_D)$ gives
$$I_0 = \int_{[0,1]^d} f_0(z)\,dz, \qquad I_1 = \int_{[0,1]^d} f_1(z)\,dz, \qquad m_t(y_t) = \frac{I_1}{I_0}.$$
The Koksma–Hlawka inequality (Niederreiter, 1992) states that for a scalar integrand $g : [0,1]^d \to \mathbb{R}$ with finite Hardy–Krause variation $V_{\mathrm{HK}}(g)$,
$$\bigg|\frac{1}{K}\sum_{j=1}^{K} g(z_j) - \int_{[0,1]^d} g(z)\,dz\bigg| \le V_{\mathrm{HK}}(g)\,D^*(P_K).$$
Applying this to $f_0$ yields
$$|I_{0,K} - I_0| \le V_{\mathrm{HK}}(f_0)\,D^*(P_K). \tag{70}$$
For the vector integrand $f_1 = (f_{1,1}, \ldots, f_{1,p})$ with $p := \dim(h(u))$ (typically $p = d$), apply Koksma–Hlawka coordinate-wise and use $\|\cdot\|_2 \le \sum_{r=1}^{p}|\cdot|$ to get
$$\|I_{1,K} - I_1\|_2 \le \sum_{r=1}^{p}\bigg|\frac{1}{K}\sum_{j=1}^{K} f_{1,r}(z_j) - \int f_{1,r}\bigg| \le \Big(\sum_{r=1}^{p} V_{\mathrm{HK}}(f_{1,r})\Big)D^*(P_K) =: V_{\mathrm{HK}}(f_1)\,D^*(P_K).$$
Assume $V_{\mathrm{HK}}(f_0)\,D^*(P_K) \le I_0/2$. Then by (70), $I_{0,K} \ge I_0 - |I_{0,K} - I_0| \ge I_0/2$, so $1/I_{0,K} \le 2/I_0$. Now decompose the ratio error:
$$\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t) = \frac{I_{1,K}}{I_{0,K}} - \frac{I_1}{I_0} = \frac{I_{1,K} - I_1}{I_{0,K}} + I_1\Big(\frac{1}{I_{0,K}} - \frac{1}{I_0}\Big).$$
Taking $\ell_2$-norms and using $\big|\tfrac{1}{I_{0,K}} - \tfrac{1}{I_0}\big| = \tfrac{|I_{0,K} - I_0|}{I_0 I_{0,K}}$ gives
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{\|I_{1,K} - I_1\|_2}{I_{0,K}} + \frac{\|I_1\|_2}{I_0 I_{0,K}}\,|I_{0,K} - I_0|.$$
Using $I_{0,K} \ge I_0/2$ and $\|I_1\|_2 = I_0\|m_t(y_t)\|_2$ yields
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{2}{I_0}\|I_{1,K} - I_1\|_2 + \frac{2\|m_t(y_t)\|_2}{I_0}\,|I_{0,K} - I_0|.$$
Finally, substituting the Koksma–Hlawka bounds for $\|I_{1,K} - I_1\|_2$ and $|I_{0,K} - I_0|$ gives
$$\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\|_2 \le \frac{2}{I_0}\big(V_{\mathrm{HK}}(f_1) + \|m_t(y_t)\|_2\,V_{\mathrm{HK}}(f_0)\big)\,D^*(P_K).$$
Applying the identity
$$\big\|\widetilde s^{\mathrm{QMC}}_{t,K}(y_t) - \nabla_{y_t}\log p_{Y_t}(y_t)\big\|_2 = \frac{\alpha_t}{\sigma_t^2}\big\|\widetilde m^{\mathrm{QMC}}_{t,K}(y_t) - m_t(y_t)\big\|_2,$$
we obtain the result stated in the lemma.

Proof of Lemma 3. Fix $t \in (0,1)$ and $y_t \in \mathbb{R}^D$, and condition throughout on $y_t$ (so all expectations and probabilities below are conditional on $y_t$). Write
$$p(u) := \pi_t(u \mid y_t), \qquad q(u) := q(u \mid y_t), \qquad w(u) := \frac{p(u)}{q(u)}.$$
By assumption $p \ll q$, hence $w$ is well-defined $q$-a.e. and satisfies $w(u) \ge 0$. Let $U_1, \ldots, U_K \stackrel{\mathrm{i.i.d.}}{\sim} q$ and denote $w_i := w(U_i)$. Define
$$B_K := \frac{1}{K}\sum_{i=1}^{K} w_i, \qquad A_K := \frac{1}{K}\sum_{i=1}^{K} w_i h(U_i) \in \mathbb{R}^D.$$
Then the self-normalized estimator can be written as $\widetilde m_{t,K}(y_t) = A_K / B_K$.
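The self-normalized ratio $A_K / B_K$ and its $O(1/K)$ mean-squared error can be illustrated with a small simulation. The latent map $h$, noise levels, and evaluation point below are illustrative choices, not from the paper; taking the proposal equal to the latent prior ($q = \pi = \mathrm{Unif}[0,1]$) makes the unnormalized Gaussian factor the importance weight, since all normalizing constants cancel in the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): latent U = [0,1], curve h(u) = (u, u^2) in R^2.
alpha_t, sigma_t = 0.9, 0.3
y = np.array([0.5, 0.3])   # observation y_t at which to estimate m_t(y_t)

def h(u):
    return np.stack([u, u**2], axis=-1)

def kernel(u):
    # Unnormalized phi(y; alpha_t h(u), sigma_t^2 I): constants cancel
    # in the self-normalized ratio A_K / B_K.
    sq = np.sum((y - alpha_t * h(u)) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma_t**2))

# Ground-truth posterior mean m_t(y) = E[h(U) | y] by dense quadrature.
ug = np.linspace(0.0, 1.0, 200001)
kg = kernel(ug)
m_true = (kg[:, None] * h(ug)).sum(axis=0) / kg.sum()

def snis(K):
    """Self-normalized estimate of m_t(y) with proposal q = pi = Unif[0,1]."""
    u = rng.uniform(0.0, 1.0, size=K)
    w = kernel(u)                                   # importance weights
    return (w[:, None] * h(u)).sum(axis=0) / w.sum()

def mse(K, reps=200):
    return float(np.mean([np.sum((snis(K) - m_true) ** 2) for _ in range(reps)]))

mse_small, mse_large = mse(250), mse(4000)
print(f"MSE(K=250)={mse_small:.2e}  MSE(K=4000)={mse_large:.2e}")
```

Consistent with the $B^2 D_2(p\|q)/K$ bound of Lemma 3, the empirical mean-squared error shrinks roughly in proportion to $1/K$ (here by about a factor of 16 between $K = 250$ and $K = 4000$, up to Monte Carlo noise).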
Moreover,
$$m_t(y_t) = \mathbb{E}_p[h(U)] = \int h(u)\,p(u)\,du = \int h(u)\,w(u)\,q(u)\,du = \mathbb{E}_q[w(U) h(U)],$$
so $\mathbb{E}[A_K] = m_t(y_t)$. Using $\mathbb{E}[A_K] = m_t$ and $\mathbb{E}[B_K] = 1$,
$$\widetilde m_{t,K} - m_t = \frac{A_K}{B_K} - m_t = \frac{A_K - m_t B_K}{B_K} = \frac{(A_K - m_t) - m_t(B_K - 1)}{B_K}. \tag{71}$$
Hence, by $(a + b)^2 \le 2a^2 + 2b^2$,
$$\|\widetilde m_{t,K} - m_t\|_2^2 \le \frac{2\|A_K - m_t\|_2^2}{B_K^2} + \frac{2\|m_t\|_2^2(B_K - 1)^2}{B_K^2}. \tag{72}$$
Define the good event $G := \{B_K \ge 1/2\}$. On $G$ we have $1/B_K^2 \le 4$, so (72) implies
$$\|\widetilde m_{t,K} - m_t\|_2^2\,\mathbf{1}_G \le 8\|A_K - m_t\|_2^2 + 8\|m_t\|_2^2(B_K - 1)^2. \tag{73}$$
On $G^c$, since $w_i \ge 0$ and $B_K > 0$ a.s., the normalized weights $\bar w_i := w_i / \sum_{j=1}^{K} w_j$ form a convex combination, hence
$$\widetilde m_{t,K} = \sum_{i=1}^{K}\bar w_i h(U_i), \qquad \text{so } \|\widetilde m_{t,K}\|_2 \le \max_i \|h(U_i)\|_2 \le B.$$
Also $\|m_t\|_2 \le \mathbb{E}_p\|h(U)\|_2 \le B$. Therefore
$$\|\widetilde m_{t,K} - m_t\|_2^2\,\mathbf{1}_{G^c} \le (\|\widetilde m_{t,K}\|_2 + \|m_t\|_2)^2\,\mathbf{1}_{G^c} \le 4B^2\,\mathbf{1}_{G^c}. \tag{74}$$
Taking expectations and combining (73)–(74) yields
$$\mathbb{E}\|\widetilde m_{t,K} - m_t\|_2^2 \le 8\,\mathbb{E}\|A_K - m_t\|_2^2 + 8\|m_t\|_2^2\,\mathbb{E}(B_K - 1)^2 + 4B^2\,\mathbb{P}(G^c). \tag{75}$$
Let $X_i := w_i h(U_i) \in \mathbb{R}^D$, so that $A_K = \frac{1}{K}\sum_{i=1}^{K} X_i$ with $\mathbb{E}[X_i] = m_t$, i.i.d. across $i$. Then
$$\mathbb{E}\|A_K - m_t\|_2^2 = \mathbb{E}\bigg\|\frac{1}{K}\sum_{i=1}^{K}(X_i - \mathbb{E}X_i)\bigg\|_2^2 = \frac{1}{K}\,\mathbb{E}\|X_1 - \mathbb{E}X_1\|_2^2 \le \frac{1}{K}\,\mathbb{E}\|X_1\|_2^2. \tag{76}$$
Using $\|h(u)\|_2 \le B$ and $X_1 = w(U)h(U)$,
$$\mathbb{E}\|X_1\|_2^2 = \mathbb{E}_q\big[w(U)^2\|h(U)\|_2^2\big] \le B^2\,\mathbb{E}_q[w(U)^2] = B^2 D_2(p\|q).$$
Plugging into (76) yields
$$\mathbb{E}\|A_K - m_t\|_2^2 \le \frac{B^2}{K}\,D_2(p\|q). \tag{77}$$
Next, since $B_K = \frac{1}{K}\sum_{i=1}^{K} w_i$ with $\mathbb{E}[w_i] = 1$,
$$\mathbb{E}(B_K - 1)^2 = \operatorname{Var}(B_K) = \frac{1}{K}\operatorname{Var}(w_1) \le \frac{1}{K}\,\mathbb{E}[w_1^2] = \frac{1}{K}\,D_2(p\|q). \tag{78}$$
With $G^c = \{B_K < 1/2\} \subset \{|B_K - 1| \ge 1/2\}$, Chebyshev's inequality gives
$$\mathbb{P}(G^c) \le \mathbb{P}(|B_K - 1| \ge 1/2) \le \frac{\mathbb{E}(B_K - 1)^2}{(1/2)^2} = 4\,\mathbb{E}(B_K - 1)^2.$$
Using (78) yields
$$\mathbb{P}(G^c) \le \frac{4}{K}\,D_2(p\|q). \tag{79}$$
Substituting (77), (78), (79) into (75) and using $\|m_t\|_2 \le B$:
$$\mathbb{E}\|\widetilde m_{t,K} - m_t\|_2^2 \le 8\cdot\frac{B^2}{K}D_2(p\|q) + 8B^2\cdot\frac{1}{K}D_2(p\|q) + 4B^2\cdot\frac{4}{K}D_2(p\|q) = \frac{32B^2}{K}\,D_2(p\|q).$$
Finally, the score error follows from
$$\widetilde s_{t,K}(y_t; h, \pi, q) - \nabla_{y_t}\log p_t(y_t) = \frac{\alpha_t}{\sigma_t^2}\big(\widetilde m_{t,K}(y_t) - m_t(y_t)\big),$$
so squaring and taking conditional expectations yields the stated bound.

References

Ahronoviz, S. and Gronau, I. (2024), 'Genome-AC-GAN: Enhancing synthetic genotype generation through auxiliary classification', bioRxiv, pp. 2024–02.

Albergo, M. S. and Vanden-Eijnden, E. (2023), 'Stochastic interpolants: A unifying framework for flows and diffusions', Proceedings of the National Academy of Sciences 120(36), e2303906120.

Anthony, M. and Bartlett, P. L. (2009), Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, UK.

Arjovsky, M., Chintala, S. and Bottou, L. (2017), Wasserstein generative adversarial networks, in 'Proceedings of the 34th International Conference on Machine Learning', PMLR, pp. 214–223.

Bartlett, P. L., Harvey, N., Liaw, C. and Mehrabian, A. (2019), 'Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks', The Journal of Machine Learning Research 20(1), 2285–2301.

Chen, H., Lee, H. and Lu, J. (2023), Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions, in 'International Conference on Machine Learning', PMLR, pp. 4735–4763.

De Bortoli, V., Thornton, J., Heng, J. and Doucet, A. (2022), Riemannian score-based generative modeling, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 12791–12804.

Dhariwal, P. and Nichol, A. Q. (2021), Diffusion models beat GANs on image synthesis, in 'Advances in Neural Information Processing Systems', Vol. 34, pp. 8780–8794.

Dick, J. and Pillichshammer, F.
(2010), Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration, Vol. 157 of Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK.

Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017), Density estimation using Real NVP, in 'International Conference on Learning Representations'.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. and Courville, A. (2017), Improved training of Wasserstein GANs, in 'Advances in Neural Information Processing Systems', pp. 5767–5777.

Hamidieh, K. (2018), 'A data-driven statistical model for predicting the critical temperature of a superconductor', Computational Materials Science 154, 346–354.

Huang, C.-W., Aghajohari, M., Bose, J., Panangaden, P. and Courville, A. C. (2022), 'Riemannian diffusion models', Advances in Neural Information Processing Systems 35, 2750–2761.

Huang, J., Jiao, Y., Li, Z., Liu, S., Wang, Y. and Yang, Y. (2022), 'An error analysis of generative adversarial networks for learning distributions', Journal of Machine Learning Research 23(116), 1–43.

Hyvärinen, A. (2005), 'Estimation of non-normalized statistical models by score matching', Journal of Machine Learning Research 6, 695–709.

Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J. and Aila, T. (2022), Elucidating the design space of diffusion-based generative models, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 26565–26577.

Kenneweg, P., Dandinasivara, R., Luo, X., Hammer, B. and Schönhuth, A. (2025), 'Generating synthetic genotypes using diffusion models', Bioinformatics 41(Supplement_1), i484–i492.

Kingma, D. P. and Dhariwal, P. (2018), Glow: Generative flow with invertible 1x1 convolutions, in 'Advances in Neural Information Processing Systems', Vol. 31, pp. 10236–10245.

Kirichenko, P., Izmailov, P. and Wilson, A. G. (2020), Why normalizing flows fail to detect out-of-distribution data, in 'Advances in Neural Information Processing Systems', Vol. 33, pp. 20578–20589.

Kobyzev, I., Prince, S. J. and Brubaker, M. A. (2021), 'Normalizing flows: An introduction and review of current methods', IEEE Transactions on Pattern Analysis and Machine Intelligence 43(11), 3964–3979.

Krizhevsky, A. and Hinton, G. (2009), Learning multiple layers of features from tiny images, Technical Report, University of Toronto.

Kunkel, L. and Trabs, M. (2025), 'On the minimax optimality of flow matching through the connection to kernel density estimation', arXiv preprint arXiv:2504.13336.

LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998), 'Gradient-based learning applied to document recognition', Proceedings of the IEEE 86(11), 2278–2324.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M. and Le, M. (2022), 'Flow matching for generative modeling', arXiv preprint arXiv:2210.02747.

Liu, X., Gong, C. and Li, Q. (2022), 'Rectified flow: A marginal preserving approach to optimal transport', arXiv preprint arXiv:2209.14577.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C. and Zhu, J. (2022), DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps, in 'Advances in Neural Information Processing Systems', Vol. 35, pp. 5775–5787.

Lu, J., Shen, Z., Yang, H. and Zhang, S. (2021), 'Deep network approximation for smooth functions', SIAM Journal on Mathematical Analysis 53(5), 5465–5506.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D. and Lakshminarayanan, B. (2019), Do deep generative models know what they don't know?, in 'International Conference on Learning Representations'.

Niederreiter, H. (1992), Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, USA.
Oko, K., Akiyama, S. and Suzuki, T. (2023), Diffusion models are minimax optimal distribution estimators, in 'International Conference on Machine Learning', PMLR, pp. 26517–26582.

Owen, A. B. (2013), 'Monte Carlo theory, methods and examples', Stanford University. URL: https://statweb.stanford.edu/~owen/mc/

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S. and Lakshminarayanan, B. (2021), 'Normalizing flows for probabilistic modeling and inference', Journal of Machine Learning Research 22(57), 1–64.

Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M., Dillon, J. V. and Lakshminarayanan, B. (2019), Likelihood ratios for out-of-distribution detection, in 'Advances in Neural Information Processing Systems', Vol. 32.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B. (2022), High-resolution image synthesis with latent diffusion models, in 'Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition', pp. 10684–10695.

Salimans, T. and Ho, J. (2022), Progressive distillation for fast sampling of diffusion models, in 'International Conference on Learning Representations'.

Shen, X. (1997), 'On methods of sieves and penalization', The Annals of Statistics 25(6), 2555–2591.

Shen, X. and Wong, W. H. (1994), 'Convergence rate of sieve estimates', The Annals of Statistics 22(2), 580–615.

Song, J., Meng, C. and Ermon, S. (2021), Denoising diffusion implicit models, in 'International Conference on Learning Representations (ICLR)'.

Song, Y., Dhariwal, P., Chen, M. and Sutskever, I. (2023), Consistency models, in 'International Conference on Machine Learning'.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S. and Poole, B. (2021), Score-based generative modeling through stochastic differential equations, in 'International Conference on Learning Representations (ICLR)'.

Tang, R. and Yang, Y. (2024), Adaptivity of diffusion models to manifold structures, in 'Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS)', Vol. 238 of Proceedings of Machine Learning Research, PMLR, pp. 1908–1916.

The 1000 Genomes Project Consortium (2015), 'A global reference for human genetic variation', Nature 526(7571), 68–74.

Villani, C. (2009), Optimal Transport: Old and New, Vol. 338, Springer.

Vincent, P. (2011), 'A connection between score matching and denoising autoencoders', Neural Computation 23(7), 1661–1674.

von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S. and Wolf, T. (2022), 'Diffusers: State-of-the-art diffusion models', https://github.com/huggingface/diffusers.

Yelmen, B., Decelle, A., Boulos, L. L., Szatkownik, A., Furtlehner, C., Charpiat, G. and Jay, F. (2023), 'Deep convolutional and conditional neural networks for large-scale genomic data generation', PLOS Computational Biology 19(10), e1011584.

Zhang, K., Yin, C. H., Liang, F. and Liu, J. (2024), Minimax optimality of score-based diffusion models: beyond the density lower bound assumptions, in 'Proceedings of the 41st International Conference on Machine Learning', pp. 60134–60178.

Zheng, K., Lu, C., Chen, J. and Zhu, J. (2023), DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics, in 'Advances in Neural Information Processing Systems (NeurIPS)'.
