Robust Physics-Guided Diffusion for Full-Waveform Inversion

Jishen Peng (a,1), Enze Jiang (a,1), Zheng Ma (a,b,c,d), Xiong-Bin Yan (e,*)

a School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
b Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, China
c Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China
d CMA-Shanghai, Shanghai Jiao Tong University, Shanghai, China
e School of Mathematics and Statistics, Lanzhou University, Lanzhou, China

Abstract. We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based (Wasserstein-2) data-consistency potential, incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.

Keywords. Full-waveform inversion; Optimal transport; Physics-guided diffusion; Preconditioned guidance

1. Introduction

In geophysics, seismic inversion is widely used to infer quantitative subsurface models that match observed seismic recordings.
From a mathematical viewpoint, recovering medium parameters from time-dependent wavefield measurements constitutes a nonlinear and ill-posed inverse problem, primarily due to limited acquisition geometry and band-limited observations. Several classes of methods have been developed, including velocity analysis based on stacked traces [1], migration and traveltime-based methods [2, 3], Born-approximation-based linearized inversion [4, 5], and full-waveform inversion (FWI) [6, 7, 8]. Among these, FWI is distinguished by its potential to recover high-resolution subsurface structure by exploiting both phase and amplitude information through the solution of a PDE-constrained optimization problem that minimizes a chosen data-misfit functional between observed and synthetic seismograms.

* Corresponding author. Email address: yanxb2015@163.com (Xiong-Bin Yan)
1 Jishen Peng and Enze Jiang contributed equally to this work.

Full-waveform inversion (FWI) is a PDE-constrained inverse problem that aims to reconstruct subsurface parameters (e.g., wavespeed or velocity) from time-dependent seismic recordings [9, 10, 6]. Given an acquisition geometry and a forward wave-propagation operator F, the classical formulation seeks a velocity model v such that F(v) matches the observed data d_obs, where v ∈ R^m denotes the discretized velocity field and F: R^m → R^n maps model parameters to recorded traces. Despite its ability to recover high-resolution structures by exploiting the full wavefield content, FWI is widely recognized as a challenging nonconvex optimization problem: the forward map is nonlinear, the inverse problem is ill-conditioned, and the misfit landscape typically contains many local minima due to phase ambiguity and limited illumination [9, 11, 12, 13].
These difficulties are especially pronounced when the initial model is inaccurate and when the observations are contaminated by noise (and other data errors). In addition, strong multi-scale components with markedly different amplitudes can further bias residual-based objectives and degrade stability [14, 9].

Since pointwise ℓ2 objectives penalize the pointwise residual in time, they are highly sensitive to small time/phase shifts between observed and synthetic traces. This sensitivity leads to a strongly nonconvex objective landscape and the well-known cycle-skipping phenomenon [15, 16], in which iterative methods can become trapped in local minima associated with incorrect phase (kinematic) alignment. Moreover, seismic traces often exhibit pronounced dynamic-range imbalance: early-arriving phases can have substantially larger amplitudes than later reflections, causing both the misfit and its gradient to be dominated by a small portion of the data [17]. As a consequence, the resulting gradients can be dominated by high-amplitude components, leading updates to underfit weaker later phases. A large body of work aims to mitigate these issues through multi-scale continuation [14], alternative misfit functions based on phase/envelope attributes [18, 19], adaptive waveform inversion [20], and geometry-aware metrics such as optimal-transport/Wasserstein distances [12, 21, 22]; see also [23] for a recent review of misfit functions and adjoint sources.

In parallel with advances in data-misfit design, recent years have witnessed rapid progress in learning expressive priors for inverse problems using generative modeling. In seismic imaging, a range of generative models (including GAN-based geological priors, latent-variable models such as VAEs, and normalizing flows) have been explored to mitigate ill-posedness and to facilitate uncertainty quantification [24, 25, 26].
Score-based diffusion models (also known as score-based generative models) provide a particularly attractive mechanism: they learn the score field ∇_x log p_t(x) of progressively noised marginals and enable conditional generation by reversing the associated diffusion dynamics [27, 28]. From the perspective of inverse problems, diffusion models can be interpreted as providing a learned prior over plausible velocity fields, trained solely from samples of the velocity model (i.e., parameter-field samples), without requiring paired data of the form (wavefield, velocity) and without embedding a forward solver in the prior-training loop; conditional inference is then performed at test time by incorporating data-consistency (likelihood) information during reverse-time sampling or related posterior-sampling schemes [29, 30]. This decoupling is especially appealing in seismic imaging, where repeated wavefield simulation is expensive and acquisition geometries may vary across applications; accordingly, diffusion-based Bayesian FWI and diffusion-driven velocity modeling have started to be investigated in recent work [31, 32].

To solve the inverse problem given observations, guided reverse-diffusion methods combine the learned prior score with a data-consistency term induced by a potential Φ (i.e., a negative log-likelihood, or more generally a data-misfit functional that measures agreement between simulated data and the observed data). Diffusion posterior sampling (DPS) is a representative method: at reverse step i it forms a clean-state estimate v̂_0^(i), which is a denoised estimate of the underlying clean variable associated with the current noisy state. This estimate is computed using the learned prior score s_{θ*}(·, i). A likelihood-guidance step is then applied using the gradient ∇Φ(v̂_0^(i)).
However, a direct use of the baseline DPS update with a fixed global guidance step size is poorly matched to the characteristics of full-waveform inversion (FWI) in two fundamental respects. First, the effectiveness of DPS critically depends on the choice of the potential Φ. In FWI, conventional pointwise ℓ2 (residual-based) potentials inherit two structural deficiencies. On the one hand, the quadratic residual makes the induced descent direction disproportionately influenced by large-amplitude arrivals, resulting in markedly unbalanced updates across traces and time windows. On the other hand, the pointwise ℓ2 geometry is highly sensitive to small time/phase shifts, which typically produces a highly nonconvex misfit landscape with spurious local minima (cycle-skipping) associated with incorrect phase alignment. When such a potential is used for likelihood guidance in DPS, these effects can yield correction terms whose scale and direction are poorly matched to the current reverse-time state, thereby degrading the stability and effectiveness of the reverse-time dynamics.

Second, even with a robust potential, a single scalar guidance strength in baseline DPS cannot capture the spatially varying sensitivity of the misfit to the velocity field v. Early reverse-time steps often yield a rough v̂_0^(i), so overly aggressive guidance can be unreliable, whereas later steps typically benefit from stronger data-consistency enforcement. Moreover, illumination and acquisition geometry induce strong spatial variability in sensitivity, so a single global step size may overshoot in well-illuminated regions while remaining ineffective in poorly illuminated ones. These mismatches can reduce stability and efficiency along the reverse-time iterations and ultimately degrade reconstruction quality.

Contributions.
This work develops a robust physics-guided diffusion framework for full-waveform inversion (FWI), combining a score-based generative prior with an OT-based data-consistency potential and a stabilized, variable-metric guidance mechanism in guided reverse diffusion. Our main contributions are:

• An OT-based data-consistency potential for physics-guided diffusion. We design a robust likelihood potential for guided reverse diffusion by combining bounded amplitude-adaptive weighting of seismic traces with a one-dimensional Wasserstein discrepancy computed via quantile functions. When used as the data-consistency term in DPS-type guidance, the resulting guidance signal is less dominated by high-amplitude arrivals and less sensitive to small time/phase misalignment, thereby alleviating cycle-skipping and yielding a more balanced and stable data-consistency mechanism for FWI.

• Adaptive, variable-metric guidance for guided reverse diffusion. We introduce a variable-metric (diagonal) preconditioner for the guidance step, replacing the scalar DPS step size by a matrix P_i = ρ_i D_i. Here ρ_i adapts the guidance strength along the reverse trajectory to suppress unreliable early updates, while the diagonal scaling D_i balances the spatially heterogeneous sensitivity of the misfit with respect to the velocity field in FWI. This yields a more stable and effective guidance mechanism than standard DPS in seismic settings.

• Numerical validation. We evaluate the proposed method on OpenFWI benchmarks and compare against deterministic optimization-based inversion baselines and the original DPS under matched computational budgets, demonstrating improved reconstruction quality and stability. We further assess generalization on other standard benchmarks beyond OpenFWI, where the method maintains strong performance.

The remainder of the paper is organized as follows.
In Section 2 we state the FWI problem and cast it in Bayesian form. Section 3 recalls score-based diffusion models and the learned prior score needed for reverse-time dynamics. Section 4 derives a conditional reverse-diffusion formulation for Bayesian inversion and presents a baseline discretization in the spirit of diffusion posterior sampling (DPS). Section 5 introduces our robust physics-guided diffusion scheme, including an OT-based data-consistency potential and an adaptive, variable-metric guidance mechanism. Section 6 provides numerical results, comparative studies on OpenFWI benchmarks, and generalization tests on additional standard benchmark datasets. Section 7 concludes with a summary of the main results.

2. Problem formulation

2.1. Full-waveform inversion (FWI)

Full-waveform inversion (FWI) aims to recover a spatially varying wave speed v(x) from seismic observations recorded at receivers {x_r}_{r=1}^{N_r}. Given observed seismograms d_r(t) = d(x_r, t), the estimate of v is typically obtained by minimizing a waveform-misfit functional between observed and simulated data, subject to the governing wave equation. By exploiting the full waveform information (including phase, amplitude, and multiple arrivals), FWI can in principle achieve higher resolution than kinematic approaches such as traveltime tomography [33, 34, 35]. At the same time, the resulting PDE-constrained optimization problem is strongly nonlinear and generally nonconvex; robust performance therefore requires careful treatment of noise and modeling errors, together with appropriate regularization or prior information.

In what follows, the simulated data are generated by a forward operator F(v) defined through the solution of an acoustic wave equation. Specifically, we consider seismic wave propagation under the constant-density acoustic approximation in an isotropic medium.
The wavefield u(x, t) denotes the acoustic pressure, and the unknown parameter is the spatially varying wave speed v(x) > 0. Neglecting variable-density and anisotropic effects leads to the standard Laplacian as the spatial operator. Let Ω ⊂ R^d (d = 2, 3) be the domain and T > 0 the final time. For a given source term s(x, t), the forward problem reads

\[
\begin{cases}
\dfrac{1}{v^2(\mathbf{x})}\,\partial_{tt} u(\mathbf{x},t) - \Delta u(\mathbf{x},t) = s(\mathbf{x},t), & (\mathbf{x},t) \in \Omega \times (0,T],\\[4pt]
u(\mathbf{x},0) = 0, \quad \partial_t u(\mathbf{x},0) = 0, & \mathbf{x} \in \Omega,\\[4pt]
\partial_n u(\mathbf{x},t) = 0, & (\mathbf{x},t) \in \Gamma_{\mathrm{ref}} \times (0,T],
\end{cases}
\tag{1}
\]

where Γ_ref ⊂ ∂Ω denotes the (top) reflecting boundary and ∂_n is the outward normal derivative. On the remaining boundary Γ_abs := ∂Ω \ Γ_ref, we impose absorbing boundary conditions, implemented in practice via perfectly matched layers (PML), in order to minimize artificial reflections from the truncation of the computational domain.

We take the domain to be the rectangle Ω = [0, L_x] × [0, L_z] ⊂ R^2 with x = (x, z), where the reflecting boundary corresponds to the top side Γ_ref = {(x, z) ∈ ∂Ω : z = L_z}. We discretize Ω by a tensor-product grid with nodes {(x_i, z_j)}_{i=1,...,m_x; j=1,...,m_z}. The wave-speed field is represented by its nodal values v_ij := v(x_i, z_j), collected in a matrix V = {v_ij} ∈ R^{m_x × m_z}, or equivalently by the vectorization v = vec(V) ∈ R^m, where m = m_x m_z. Receivers are deployed exclusively on the top boundary Γ_ref ⊂ ∂Ω. We denote the discrete receiver set by Γ_r := {x_r}_{r=1}^{N_r} ⊂ Γ_ref.

For the j-th experiment with source term s(x, t; ξ_j), we solve (1) (with this source) to obtain the wavefield u_j(x, t; v, ξ_j). Let R denote the receiver sampling (restriction) operator, defined by (R u_j)_r(t) := u_j(x_r, t; v, ξ_j), r = 1, ..., N_r, so that the corresponding noise-free synthetic data for experiment j are d_r^(j)(t) = (R u_j)_r(t). Stacking the traces across all experiments and receivers defines the parameter-to-observable operator

\[
\mathcal{F} : v \longmapsto \big\{ (R u_j)_r(t) \big\}_{j=1,\dots,N_s;\; r=1,\dots,N_r},
\]

which maps the velocity model to the predicted receiver recordings. After discretizing time at {t_k}_{k=1}^{N_T}, the data can be viewed as an array

\[
d = \{ d_{j,r}(t_k) \} \in \mathbb{R}^{N_s \times N_r \times N_T}, \qquad \big(\mathcal{F}(v)\big)_{j,r,k} = (R u_j)_r(t_k).
\]

The observation model is then

\[
d_{\mathrm{obs}} = \mathcal{F}(v) + \eta, \tag{2}
\]

where η denotes the measurement noise. We adopt additive Gaussian noise η ~ N(0, Σ), where Σ is the noise covariance in data space. In the simplest case of independent and identically distributed noise, Σ = σ²I, so that η ~ N(0, σ²I) with σ > 0 controlling the noise level.

Classically, FWI is formulated as the minimization of a data-misfit functional (possibly augmented by regularization). In this work, we instead adopt a Bayesian formulation and infer v through the posterior, where data consistency is encoded by the likelihood potential Φ (i.e., the negative log-likelihood), defined below.

2.2. Bayesian formulation of FWI

Given the observation model (2), we cast FWI as a Bayesian inverse problem and seek the subsurface velocity v through the posterior distribution p(v | d_obs). In this work, we use the posterior formulation primarily as a computational target for a diffusion-based inversion algorithm, rather than pursuing full posterior uncertainty quantification.

Likelihood as a data-consistency potential.
Rather than fixing the likelihood to a pointwise ℓ2 discrepancy a priori, we express it through a general data-consistency potential Φ(·; d_obs) (i.e., a negative log-likelihood up to an additive constant), following the standard Bayesian inverse problem formulation; see, e.g., [36, 37]:

\[
p(d_{\mathrm{obs}} \mid v) \propto \exp\big( -\Phi(v; d_{\mathrm{obs}}) \big). \tag{3}
\]

This formulation decouples the data-error model (2) from the choice of data-misfit functional used to quantify agreement between simulated and observed data. As a classical special case, the additive Gaussian noise model in (2) yields

\[
p(d_{\mathrm{obs}} \mid v) \propto \exp\Big( -\tfrac{1}{2} \big\| \mathcal{F}(v) - d_{\mathrm{obs}} \big\|_{\Sigma^{-1}}^2 \Big), \qquad \| r \|_{\Sigma^{-1}}^2 := r^\top \Sigma^{-1} r, \tag{4}
\]

corresponding to Φ(v; d_obs) = ½‖F(v) − d_obs‖²_{Σ⁻¹}. Throughout the paper, the observation d_obs is fixed. We often suppress this dependence and write Φ(v) ≡ Φ(v; d_obs). In what follows, we keep Φ general until Section 5, where we instantiate it by a robust OT-based waveform misfit tailored to amplitude dominance, cycle-skipping, and objective scaling.

Prior. To regularize the inverse problem and encode admissible structural assumptions on the subsurface, we endow the wave speed v with a prior distribution π_0(dv) (with density p(v) in the finite-dimensional discretization). Such prior information is crucial in FWI, which is typically ill-posed due to limited acquisition geometry and band-limited data, and hence may admit non-unique and unstable reconstructions under purely data-driven fitting. A common choice is a Gibbs-type prior of the form

\[
p(v) \propto \exp\big( -\mu\, R(v) \big),
\]

where R(v) is a regularization functional and μ > 0 controls its strength. In this case, the negative log-prior contributes the term μR(v) to the posterior objective, so that classical deterministic regularization is recovered at the level of maximum a posteriori (MAP) estimation.

Posterior.
Combining the likelihood (3) with the prior p(v) via Bayes' rule gives the posterior distribution

\[
p(v \mid d_{\mathrm{obs}}) \propto p(d_{\mathrm{obs}} \mid v)\, p(v) \propto \exp\big( -\Phi(v; d_{\mathrm{obs}}) \big)\, p(v). \tag{5}
\]

Once Φ is specified, the posterior combines the data-fit term encoded by the likelihood with the prior information encoded by p(v).

Connection to deterministic FWI (MAP). Classical deterministic formulations of FWI are often posed as minimizing a data-misfit functional, possibly augmented by regularization. Within the Bayesian formulation, such approaches can be interpreted as computing a MAP estimate. Taking the negative logarithm of (5) yields, up to an additive constant,

\[
-\log p(v \mid d_{\mathrm{obs}}) = \Phi(v; d_{\mathrm{obs}}) - \log p(v) + C,
\]

where C is independent of v. Consequently,

\[
v_{\mathrm{MAP}} = \arg\min_{v \in \mathcal{V}} \; \Phi(v; d_{\mathrm{obs}}) - \log p(v), \tag{6}
\]

where V denotes the admissible set (e.g., enforcing v > 0). If Φ is chosen as a quadratic (Gaussian) data-misfit and p(v) ∝ exp(−μR(v)), then (6) reduces to a classical regularized PDE-constrained optimization problem for FWI. In the remainder of the paper, we compute posterior-driven reconstructions using a diffusion-based solver with a learned prior score and a data-consistency potential Φ.

3. Score-Based Generative Model

Diffusion models, also known as score-based generative models [28, 27, 38], define a generative process by gradually perturbing samples with Gaussian noise and then learning to reverse this corruption process. From the score-based viewpoint, these models are grounded in the estimation of the score function, i.e., the gradient of the log-density with respect to the variable of interest.

3.1. Setup and notation

We identify the discretized velocity field v ∈ R^m (m = m_x m_z) with the data vector x_0 and write x_0 := v.
This is purely a notational convention to align with diffusion-model formulations, where x_0 denotes an uncorrupted sample at diffusion time t = 0. We regard x_0 as a random vector following an unknown distribution p_data on R^m, representing the population of plausible subsurface models. In our Bayesian setting, p_data plays the role of the prior p(v), from which we assume access to samples.

3.2. Forward diffusion process

A diffusion model begins by defining a continuum of progressively noised variables {x(t)}_{t∈[0, T_diff]} through an Itô stochastic differential equation (SDE) [39]

\[
\mathrm{d}\mathbf{x}(t) = f(\mathbf{x}(t), t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t, \qquad \mathbf{x}(0) = \mathbf{x}_0 \sim p_{\mathrm{data}},
\]

where w_t is a standard m-dimensional Wiener process, f is the drift, and g is the diffusion coefficient. Under mild regularity assumptions [40], the law of x(t) admits a density p_t, and {p_t}_{t∈[0, T_diff]} evolves according to the Kolmogorov forward (Fokker–Planck) equation [39]

\[
\partial_t p_t(\mathbf{x}) = -\nabla \cdot \big( f(\mathbf{x}, t)\, p_t(\mathbf{x}) \big) + \tfrac{1}{2}\, g^2(t)\, \Delta p_t(\mathbf{x}), \qquad p_{t=0} = p_{\mathrm{data}}.
\]

In particular, the forward SDE induces a family of distributions {p_t} obtained by Gaussian perturbations of p_data.

3.3. Variance-preserving diffusion and transition kernel

We adopt the variance-preserving (VP) SDE [28], defined by the linear Itô SDE

\[
\mathrm{d}\mathbf{x}(t) = -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t)\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{7}
\]

where β(t) > 0 is a prescribed noise schedule. In our setting, x(0) = x_0 represents a clean (vectorized) velocity model, while x(t) denotes its progressively perturbed version. Since (7) is linear with time-dependent coefficients, it admits an explicit mild solution. Define the integrating factor

\[
\alpha(t) := \exp\Big( -\tfrac{1}{2} \int_0^t \beta(\tau)\,\mathrm{d}\tau \Big).
\]
Applying Itô's formula to α(t)^{-1} x(t) yields

\[
\mathbf{x}(t) = \alpha(t)\,\mathbf{x}_0 + \alpha(t) \int_0^t \alpha(s)^{-1} \sqrt{\beta(s)}\,\mathrm{d}\mathbf{w}_s
= \alpha(t)\,\mathbf{x}_0 + \int_0^t \exp\Big( -\tfrac{1}{2} \int_s^t \beta(\tau)\,\mathrm{d}\tau \Big) \sqrt{\beta(s)}\,\mathrm{d}\mathbf{w}_s. \tag{8}
\]

The stochastic integral in (8) is Gaussian with mean zero. Setting

\[
\hat{\sigma}^2(t) := 1 - \exp\Big( -\int_0^t \beta(\tau)\,\mathrm{d}\tau \Big) = 1 - \alpha^2(t),
\]

we obtain the Gaussian transition kernel

\[
\mathbf{x}(t) \mid \mathbf{x}_0 \sim \mathcal{N}\big( \alpha(t)\,\mathbf{x}_0,\; \hat{\sigma}^2(t)\, I \big), \tag{9}
\]

or, equivalently, the reparameterization x(t) = α(t) x_0 + σ̂(t) z, z ~ N(0, I). If the schedule is chosen so that ∫_0^{T_diff} β(τ) dτ is sufficiently large, then α(T_diff) ≈ 0 and σ̂²(T_diff) ≈ 1, implying that x(T_diff) is approximately standard Gaussian and essentially independent of x_0. In practice we sample x(T_diff) ~ N(0, I) to initialize the reverse-time procedure.

3.4. Reverse-time dynamics and score learning

Let q_t(·|x_0) denote the transition kernel of the Markov diffusion (7), and define, for each t ∈ [0, T_diff], the time-t density

\[
p_t(\mathbf{x}) := \int_{\mathbb{R}^m} q_t(\mathbf{x} \mid \mathbf{x}_0)\, p_{\mathrm{data}}(\mathbf{x}_0)\,\mathrm{d}\mathbf{x}_0,
\]

whenever the density exists. The main object of interest is the score field

\[
s_\star(\mathbf{x}, t) := \nabla_{\mathbf{x}} \log p_t(\mathbf{x}),
\]

i.e., the logarithmic derivative of the marginal density induced by the forward diffusion at time t (equivalently, at the corresponding noise level). Under mild regularity assumptions, a time-reversal formula for diffusion processes [41] implies that the reverse-time dynamics associated with (7) can be expressed in terms of the score. In the VP case, where f(x, t) = −½β(t)x and g²(t) = β(t), the reverse-time SDE can be written as

\[
\mathrm{d}\mathbf{x}(t) = \Big[ -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t) - \beta(t)\, s_\star(\mathbf{x}(t), t) \Big]\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{10}
\]

interpreted in reverse time, i.e., the process is run backward from t = T_diff to t = 0, where w̄_t is a Wiener process in the reverse-time parameterization.
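As a small numerical illustration, the closed-form VP quantities α(t) and σ̂(t) and the reparameterized draw from the transition kernel (9) can be sketched as follows. The linear schedule β(t) = b0 + (b1 − b0) t/T and the endpoint values b0, b1, T are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

# Sketch of the VP transition kernel (9) under an assumed linear noise schedule
# beta(t) = B0 + (B1 - B0) * t / T. B0, B1, T are illustrative values.
B0, B1, T = 0.1, 20.0, 1.0

def int_beta(t):
    """Integral of beta over [0, t] for the linear schedule."""
    return B0 * t + 0.5 * (B1 - B0) * t**2 / T

def alpha(t):
    """Integrating factor alpha(t) = exp(-0.5 * int_0^t beta(tau) dtau)."""
    return np.exp(-0.5 * int_beta(t))

def perturb(x0, t, rng):
    """Reparameterized draw x(t) = alpha(t) x0 + sigma_hat(t) z, z ~ N(0, I)."""
    a = alpha(t)
    sigma_hat = np.sqrt(1.0 - a**2)
    return a * x0 + sigma_hat * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x_T = perturb(np.zeros(8), T, rng)   # at t = T, close to a standard-normal draw
```

With these endpoint values, ∫_0^T β = 10.05, so α(T) ≈ exp(−5.03) is already negligible, matching the remark that x(T_diff) is approximately standard Gaussian.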
We approximate the true score s⋆(x, t) = ∇_x log p_t(x) by a neural network s_θ(x, t) (the score network), parameterized by θ. A natural training objective is the mean-squared-error regression

\[
\mathcal{L}_1(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{\mathbf{x} \sim p_t} \Big[ \kappa(t)\, \big\| s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big\|_2^2 \Big], \tag{11}
\]

where κ(t) > 0 reweights the contributions of different noise levels. However, (11) is not directly computable in practice, since the marginal density p_t is induced by the unknown data distribution p_data. Since the marginal score ∇_x log p_t(x) is intractable, we adopt denoising score matching (DSM) [42]. DSM exploits the fact that the forward transition kernel q_t(·|x_0) is known, and trains the score network to match the corresponding conditional score. Concretely, it minimizes

\[
\mathcal{L}_2(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{\mathbf{x}_0 \sim p_{\mathrm{data}}}\, \mathbb{E}_{\mathbf{x} \sim q_t(\cdot \mid \mathbf{x}_0)} \Big[ \kappa(t)\, \big\| s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) \big\|_2^2 \Big]. \tag{12}
\]

The mathematical rationale is the identity

\[
\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) = \mathbb{E}_{\mathbf{x}_0 \sim p(\cdot \mid \mathbf{x}(t) = \mathbf{x})} \big[ \nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) \big],
\]

so the marginal score is the conditional expectation of the conditional score. By the L² projection property of conditional expectation, L_2(θ) = L_1(θ) + C with a constant C ≥ 0 independent of θ. Hence L_1 and L_2 have the same minimizers and L_2(θ) ≥ L_1(θ). For the VP diffusion, q_t(·|x_0) is Gaussian (see (9)), hence

\[
\nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) = -\frac{\mathbf{x} - \alpha(t)\,\mathbf{x}_0}{\hat{\sigma}^2(t)}.
\]

After training, let θ* denote the learned parameters (e.g., an empirical minimizer of the training loss (12)). Replacing the unknown score in the reverse-time dynamics by s_{θ*} yields the approximate reverse SDE

\[
\mathrm{d}\mathbf{x}(t) = \Big[ f(\mathbf{x}(t), t) - g^2(t)\, s_{\theta^*}(\mathbf{x}(t), t) \Big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}],
\]

interpreted in reverse time (run backward from t = T_diff to t = 0), with f(x, t) = −½β(t)x and g(t) = √β(t) for the VP diffusion.
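The DSM objective (12) is easy to state concretely because the Gaussian kernel (9) gives the regression target in closed form. The following is a minimal Monte Carlo sketch at a single noise level; the stand-in `score_fn` and the fixed values of α and σ̂ are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

def dsm_loss(score_fn, x0_batch, a, sigma_hat, rng):
    """Monte Carlo estimate of the DSM objective (12) at one noise level.

    The conditional score of the Gaussian kernel (9) is the closed-form target
    grad_x log q_t(x | x0) = -(x - a * x0) / sigma_hat^2  (which equals -z / sigma_hat).
    `score_fn` is a hypothetical stand-in for the score network s_theta.
    """
    z = rng.standard_normal(x0_batch.shape)
    x = a * x0_batch + sigma_hat * z              # sample from q_t(. | x0)
    target = -(x - a * x0_batch) / sigma_hat**2   # closed-form regression target
    diff = score_fn(x) - target
    return np.mean(np.sum(diff**2, axis=-1))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((128, 4))                # batch of "clean" samples, m = 4
# a^2 + sigma_hat^2 = 1 as required by the VP kernel:
loss_zero = dsm_loss(lambda x: np.zeros_like(x), x0, a=0.8, sigma_hat=0.6, rng=rng)
```

For the zero score the loss concentrates around E‖z/σ̂‖² = m/σ̂², which makes the scale of the target (and hence the role of the reweighting κ(t)) explicit: targets blow up as σ̂ → 0.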
4. Conditional Diffusion for Bayesian Inversion

4.1. Diffusion-based posterior solver

We now describe a diffusion-based solver for computing posterior-driven reconstructions of v from the Bayesian model (5). Following the diffusion posterior sampling (DPS) framework, we incorporate data consistency through a likelihood potential and use a learned score model to represent prior information. For notational consistency with the DPS literature, we denote the observation by y^δ := d_obs and recall the reverse-time SDE (10) for the VP diffusion. Throughout this section, the observation y^δ is fixed; hence we suppress this dependence in the notation and write Φ(·) ≡ Φ(·; y^δ). To draw samples from the Bayesian posterior distribution p(v | y^δ) in (5), we replace the marginal score s⋆(x, t) = ∇_x log p_t(x) by the conditional score ∇_x log p_t(x | y^δ), leading to

\[
\mathrm{d}\mathbf{x}(t) = \Big[ -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t) - \beta(t)\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}(t) \mid y^\delta) \Big]\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}],
\]

interpreted in reverse time (run backward from t = T_diff to t = 0). Using Bayes' rule, the conditional score decomposes as

\[
\nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t) \mid y^\delta) = \nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t)) + \nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)), \tag{13}
\]

so that, given an approximation of the marginal score ∇ log p_t(x(t)) ≈ s_{θ*}(x(t), t), it remains to approximate the term ∇_{x(t)} log p_t(y^δ | x(t)).

From likelihood to a general potential. Under a Gaussian likelihood (4), a common approximation is

\[
\nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)) \simeq -\tfrac{1}{2}\, \nabla_{\mathbf{x}(t)} \big\| y^\delta - \mathcal{F}(\hat{\mathbf{x}}_0(t)) \big\|_{\Sigma^{-1}}^2,
\]

where x̂_0(t) denotes the denoised (posterior-mean) estimate of the clean variable associated with x(t). Following Eq.
(10) in [30], we set

\[
\hat{\mathbf{x}}_0(t) := \frac{1}{\sqrt{\bar{\alpha}(t)}} \Big( \mathbf{x}(t) + \big( 1 - \bar{\alpha}(t) \big)\, s_{\theta^*}(\mathbf{x}(t), t) \Big),
\]

where s_{θ*}(x, t) is the trained score network and, for the VP diffusion, ᾱ(t) := exp(−∫_0^t β(τ) dτ). In our formulation (3), data consistency is encoded through a potential Φ. Accordingly, we approximate the conditioning term by

\[
\nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)) \simeq -\nabla_{\mathbf{x}(t)} \Phi(\hat{\mathbf{x}}_0(t)), \tag{14}
\]

so that the observational constraint enters the reverse-time dynamics via the chosen potential Φ. Combining (13) with (14) and replacing the prior score by the trained score network yields

\[
\nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t) \mid y^\delta) \simeq s_{\theta^*}(\mathbf{x}(t), t) - \zeta\, \nabla_{\mathbf{x}(t)} \Phi(\hat{\mathbf{x}}_0(t)), \tag{15}
\]

where ζ > 0 is a user-chosen guidance-strength parameter that rescales the data-consistency term to account for objective scaling.

4.2. Baseline DPS discretization for FWI

Let {β_i}_{i=1}^N be the discrete noise schedule, α_i := 1 − β_i, and ᾱ_i := ∏_{j=1}^i α_j. Denote by v_i the reverse-time diffusion state at step i (DDPM indexing [27]), so that v_i ≈ x(t_i). Here i = N corresponds to the highest noise level (an approximately Gaussian state), and i = 1 corresponds to the lowest noise level; thus {v_i}_{i=1}^N traces a discrete reverse-time trajectory that progressively denoises the state from v_N toward a clean reconstruction. Denote by s_{θ*}(·, i) the trained score network. The standard clean estimator is

\[
\hat{\mathbf{v}}_0^{(i)} = \hat{\mathbf{v}}_0(\mathbf{v}_i, i) := \frac{1}{\sqrt{\bar{\alpha}_i}} \Big( \mathbf{v}_i + (1 - \bar{\alpha}_i)\, s_{\theta^*}(\mathbf{v}_i, i) \Big). \tag{16}
\]

Algorithm 1: Baseline Diffusion Posterior Sampling (DPS)
Require: N, schedule {β_i}_{i=1}^N, score network s_{θ*}, variances {σ̂_i}, fixed observation d_obs, hyperparameters {ζ_i} (and optional clipping bounds), objective function Φ_{ℓ2}.
Result: the inversion velocity model v = v_0.
Initialize v_N ~ N(0, I); set α_i := 1 − β_i and ᾱ_i := ∏_{j=1}^i α_j.
for i = N to 1 do
    v̂_0^{(i)} ← (1/√ᾱ_i) ( v_i + (1 − ᾱ_i) s_{θ*}(v_i, i) )
    z ~ N(0, I)
    v'_{i−1} ← [ √α_i (1 − ᾱ_{i−1}) v_i + √ᾱ_{i−1} β_i v̂_0^{(i)} ] / (1 − ᾱ_i) + σ̂_i z
    v_{i−1} ← v'_{i−1} − ζ ∇_{v_i} Φ_{ℓ2}(v̂_0^{(i)})
end for

Baseline likelihood potential (least squares). In the original DPS formulation [30], data consistency is enforced through a Gaussian likelihood, leading to the least-squares potential

\[
\Phi_{\ell_2}(v) := \tfrac{1}{2} \big\| \mathcal{F}(v) - y^\delta \big\|_{\Sigma^{-1}}^2.
\]

Given the prior-driven proposal v'_{i−1}, baseline DPS applies a scalar guidance correction evaluated at the current state v_i:

\[
\mathbf{v}_{i-1} = \mathbf{v}'_{i-1} - \zeta\, \nabla_{\mathbf{v}_i} \Phi_{\ell_2}(\hat{\mathbf{v}}_0^{(i)})
= \mathbf{v}'_{i-1} - \zeta\, \big( \mathcal{F}'(\hat{\mathbf{v}}_0^{(i)}) \big)^* \Sigma^{-1} \big( \mathcal{F}(\hat{\mathbf{v}}_0^{(i)}) - y^\delta \big), \qquad \zeta > 0. \tag{17}
\]

Based on the above formulation, the DPS-based procedure for solving the full-waveform inversion problem is summarized in Algorithm 1.

4.3. Limitations of ℓ2-guided DPS in FWI

Although Algorithm 1 is consistent with a Gaussian noise model, its direct application to FWI presents several well-known and practically important difficulties. Specifically, the pointwise ℓ2 potential may induce cycle-skipping and objective imbalance, while the use of a single global scalar guidance parameter is inadequate for the strongly space-dependent sensitivities characteristic of FWI.

(i) Cycle-skipping and nonconvexity. The ℓ2 misfit compares waveforms pointwise in time; small phase shifts can yield large residuals and highly oscillatory gradients. As a consequence, Φ_{ℓ2} typically exhibits a strongly nonconvex landscape with many spurious local minima, which can lead to cycle-skipping in solving the present inverse problem. Within DPS, the guidance term is applied to the denoised estimate v̂_0^{(i)}, and early reverse-time iterates often produce rough v̂_0^{(i)}; in this regime, the ℓ2 gradient can be particularly unreliable, causing unstable or ineffective guidance.
(ii) Amplitude dominance and objective scaling. Seismic data often exhibit a pronounced dynamic-range imbalance across arrivals. Since $\Phi_{\ell_2}$ is quadratic in the residual, the gradient $\nabla \Phi_{\ell_2}$ is dominated by the largest residual components, which are typically associated with high-amplitude early arrivals. As a result, the baseline correction step tends to prioritize fitting the strongest events, while weaker phases that are informative for deeper or poorly illuminated regions are comparatively down-weighted. Moreover, $\Phi_{\ell_2}(v) = \tfrac{1}{2}\|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2$ is used without normalization by an observation-dependent scale; when the data (or residuals after preprocessing) have small magnitude, both $\Phi_{\ell_2}$ and $\nabla \Phi_{\ell_2}$ can become numerically small, making the guidance strength $\zeta$ in (17) delicate to tune and potentially leading to overly weak data-consistency enforcement.

(iii) Heterogeneous sensitivity and scalar step size. FWI sensitivities are strongly space-dependent (e.g., due to limited illumination), so a single global scalar guidance parameter $\zeta$ cannot balance updates across well-illuminated and poorly illuminated regions. This mismatch can slow progress in weak-sensitivity areas and may yield unstable behavior in sensitive regions.

These observations motivate two modifications: we (a) replace the pointwise ℓ2 potential by a robust OT-based data-consistency potential to mitigate cycle-skipping and amplitude imbalance, and (b) introduce a variable-metric preconditioned guidance scheme to stabilize and balance spatial updates. We present these components next.
5. Our method: OT-guided preconditioned DPS for FWI

This section presents two modifications motivated by the limitations of ℓ2-guided DPS in FWI discussed in the above section: (a) an OT-based data-consistency potential $\Phi = J$ that mitigates amplitude dominance, reduces phase sensitivity (cycle-skipping), and stabilizes objective scaling; and (b) a variable-metric preconditioned guidance scheme that adapts the guidance strength across noise levels and balances spatial updates under heterogeneous sensitivities.

5.1. OT-based data-consistency potential

As discussed in Section 4.3, pointwise least-squares waveform fitting can lead to amplitude-dominated updates, strong phase sensitivity (cycle-skipping), and ill-conditioned objective scaling. We therefore construct a robust OT-based misfit $J$ in three steps, each targeting one of these difficulties.

(I) Amplitude-adaptive data weighting. Let $d_{\mathrm{obs}}(s, r, t)$ denote the observed trace associated with the source–receiver pair $(s, r)$, and let $d_{\mathrm{syn}}(s, r, t; v)$ be its synthetic counterpart generated by the velocity model $v$. We introduce a globally normalized instantaneous amplitude
\[
a(s, r, t) = \frac{|d_{\mathrm{obs}}(s, r, t)|}{\max_{i,j,t} |d_{\mathrm{obs}}(s_i, r_j, t)|} \in [0, 1],
\]
and define the bounded weight
\[
\omega(s, r, t) = \frac{1}{1 + k\, a(s, r, t)}, \qquad k \ge 0. \qquad (18)
\]
By construction, $\omega(s, r, t) \in [1/(1+k),\, 1]$ and is monotone decreasing in $a(s, r, t)$, so large-amplitude arrivals are down-weighted while low-amplitude portions are largely preserved. We apply the same pointwise weighting to both observed and synthetic signals,
\[
\tilde{d}_{\mathrm{obs}}(s, r, t) = \omega(s, r, t)\, d_{\mathrm{obs}}(s, r, t), \qquad \tilde{d}_{\mathrm{syn}}(s, r, t; v) = \omega(s, r, t)\, d_{\mathrm{syn}}(s, r, t; v), \qquad (19)
\]
so that the comparison is performed under a fixed, observation-dependent weighting.
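A minimal NumPy sketch of this weighting (the array layout `(n_src, n_rec, n_t)` and the function name are our own assumptions):

```python
import numpy as np

def bounded_weight(d_obs, k=100.0):
    """Amplitude-adaptive bounded weight of Eq. (18), computed once from the
    observations; d_obs has shape (n_src, n_rec, n_t)."""
    a = np.abs(d_obs) / np.abs(d_obs).max()      # normalized amplitude a in [0, 1]
    return 1.0 / (1.0 + k * a)                   # weights in [1/(1+k), 1]

# The same fixed weight multiplies observed and synthetic data, Eq. (19):
#   d_obs_w = w * d_obs;   d_syn_w = w * d_syn
```

Because the weight depends only on the observations, it is computed once and reused at every inversion step.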
Equivalently, $\tilde{d}_{\mathrm{syn}} - \tilde{d}_{\mathrm{obs}} = \omega\, (d_{\mathrm{syn}} - d_{\mathrm{obs}})$, and thus the contribution of high-amplitude events to the misfit and its gradient is reduced. Importantly, $\omega$ is computed from the observations and kept fixed during inversion, so the weighting does not introduce additional model-dependent nonlinearity. This amplitude-adaptive weighting alleviates scale disparities induced by raw amplitudes and prevents the descent direction from being dominated by the high-amplitude portions of the data.

As an empirical illustration, we take the CurveFault-A dataset [43] as an example and compare the original traces and their weighted counterparts at Receiver 1 for the first three source–receiver pairs in Figure 1. In the original recordings (Figure 1a), the early-arriving portions exhibit substantially larger amplitudes than later-arriving events, thereby overwhelming the contribution of weaker phases in the least-squares misfit. After applying the bounded weighting (18) with $k = 100$ (Figure 1b), the dynamic range is effectively compressed: high-amplitude arrivals are suppressed while low-amplitude events are largely preserved. This amplitude balancing reduces the tendency of large-amplitude portions of the data to dominate the gradient and yields a more balanced weighted comparison for subsequent inversion.

Nevertheless, even with amplitude balancing, pointwise ℓ2 misfits can remain highly sensitive to small time shifts and phase mismatch. To alleviate the resulting nonlinearity and reduce cycle-skipping, we therefore adopt the Wasserstein-$p$ distance as the data-misfit measure (we use $p = 2$ in all computations).

(a) The original observed wavefield.
(b) The weighted wavefield with $k = 100$.

Figure 1: Comparison of the original and weighted wavefield traces at Receiver 1 for the first three source–receiver pairs. The original recordings are dominated by early large-amplitude arrivals, whereas the proposed bounded weighting compresses the dynamic range and renders later weak-amplitude events more comparable in scale.

(II) Transport-based misfit (Wasserstein-$p$). The least-squares misfit induces a pointwise ℓ2 geometry that is overly sensitive to temporal misalignment, which can lead to a highly nonconvex objective landscape and promote cycle-skipping. We therefore adopt the Wasserstein-$p$ distance, which equips the data space with an optimal-transport geometry and measures discrepancy via mass displacement (after suitable normalization), rather than by pointwise differences.

Since the Wasserstein distance is formally defined on nonnegative measures with equal total mass, we follow the standard OT-FWI construction [12] and transform each enhanced 1D trace into a probability density function (PDF) via a fixed shift–rescale normalization. Specifically, for a trace $\tilde{d}(t)$ (either $\tilde{d}_{\mathrm{obs}}$ or $\tilde{d}_{\mathrm{syn}}$), we introduce a constant $c' > 0$ to ensure nonnegativity. To guarantee that $\tilde{d}(t) + c' \ge 0$ on $[0, T]$ for all traces, we fix $(s, r)$ and take
\[
c' = 1.1\, \bigl|\min_t \tilde{d}_{\mathrm{obs}}(s, r, t)\bigr|,
\]
and keep $c'$ constant throughout the inversion, so that the mapping does not introduce additional model-dependent nonlinearity in the gradient. In practice, $c'$ is chosen sufficiently large so that $\tilde{d}_{\mathrm{syn}} + c' \ge 0$ throughout the inversion.
We then define the density mapping
\[
P(\tilde{d})(t) = \frac{\tilde{d}(t) + c'}{\int_0^T \bigl(\tilde{d}(\tau) + c'\bigr)\, d\tau}, \qquad (20)
\]
which ensures $P(\tilde{d})(t) \ge 0$ and $\int_0^T P(\tilde{d})(t)\, dt = 1$. Accordingly, for each source–receiver pair $(s, r)$, the associated probability densities are
\[
\tilde\rho_{\mathrm{obs}}(\cdot\,; s, r) = P\bigl(\tilde{d}_{\mathrm{obs}}(\cdot\,; s, r)\bigr), \qquad \tilde\rho_{\mathrm{syn}}(\cdot\,; s, r, v) = P\bigl(\tilde{d}_{\mathrm{syn}}(\cdot\,; s, r, v)\bigr).
\]
With the signals transformed into densities, the Wasserstein-$p$ distance is defined by
\[
W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \left( \inf_{\gamma \in \tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})} \int_{[0,T] \times [0,T]} |t - \tau|^p \, d\gamma(t, \tau) \right)^{1/p}, \qquad p \ge 1,
\]
where $\tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})$ denotes the set of admissible transport plans with marginals $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$. In the one-dimensional setting, $W_p$ admits an explicit representation in terms of quantile functions. Let $\tilde{F}_{\mathrm{syn}}$ and $\tilde{F}_{\mathrm{obs}}$ be the cumulative distribution functions (CDFs) of $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$, respectively; then
\[
W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \left( \int_0^1 \bigl|\tilde{F}_{\mathrm{syn}}^{-1}(\xi) - \tilde{F}_{\mathrm{obs}}^{-1}(\xi)\bigr|^p \, d\xi \right)^{1/p}, \qquad (21)
\]
where $\tilde{F}^{-1}$ denotes the quantile function, defined as the generalized inverse $\tilde{F}^{-1}(\xi) := \inf\{ t \in [0, T] : \tilde{F}(t) \ge \xi \}$. While we present the definition for general $p \ge 1$, all subsequent results and computations in this paper use $p = 2$. Numerically, for each trace we compute the PDF using (20), construct the CDF via cumulative summation, and evaluate the quantile difference in (21) by interpolation.

(III) Misfit scale normalization. To stabilize the numerical scale of the OT guidance term, we rescale the aggregated discrepancy by an observation-dependent characteristic magnitude. Specifically, let $\tilde{F}_{\mathrm{obs},s,r}^{-1}$ denote the quantile function associated with the observed PDF for the pair $(s, r)$, and define the scale factor
\[
S_{\mathrm{obs}} = \sum_s \sum_r \left( \int_0^1 \bigl|\tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi)\bigr|^p \, d\xi \right)^{1/p},
\]
which depends only on the observations and is independent of the velocity model $v$.
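The trace-wise computation of Eqs. (20)–(21) for $p = 2$ can be sketched as follows (a self-contained NumPy version; the function names, the uniform $\xi$-grid, and the interpolation details are our own choices):

```python
import numpy as np

def to_density(d, c, dt):
    """Shift-rescale map of Eq. (20): shift by c > 0 and normalize to unit
    mass on a uniform time grid with spacing dt."""
    p = d + c
    return p / (p.sum() * dt)

def w2_trace(d_syn, d_obs, t, c, n_xi=1000):
    """1D Wasserstein-2 distance between two traces via quantile functions, Eq. (21)."""
    dt = t[1] - t[0]
    rho_s = to_density(d_syn, c, dt)
    rho_o = to_density(d_obs, c, dt)
    # CDFs by cumulative summation, renormalized so the last entry is exactly 1
    F_s = np.cumsum(rho_s) * dt; F_s /= F_s[-1]
    F_o = np.cumsum(rho_o) * dt; F_o /= F_o[-1]
    # quantile functions by inverse interpolation on a uniform xi-grid
    xi = np.linspace(0.0, 1.0, n_xi)
    Q_s = np.interp(xi, F_s, t)
    Q_o = np.interp(xi, F_o, t)
    return np.sqrt(np.mean((Q_s - Q_o) ** 2))    # approximates the xi-integral
```

Note that the distance grows with the time shift between two otherwise identical pulses, which is the behavior that mitigates cycle-skipping.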
Therefore, this standardization does not alter the minimizer of the objective; it merely rescales the functional and its gradient by a constant factor, improving numerical conditioning and reducing sensitivity to step-size tuning in the guidance updates.

Weighted and normalized Wasserstein misfit. Combining the above ingredients, we define the normalized Wasserstein-based objective functional as
\[
J(v) = \frac{\sum_s \sum_r W_p\bigl(\tilde\rho_{\mathrm{syn}}(\cdot\,; s, r; v),\, \tilde\rho_{\mathrm{obs}}(\cdot\,; s, r)\bigr)}{\sum_s \sum_r \left( \int_0^1 |\tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi)|^p \, d\xi \right)^{1/p}}, \qquad (22)
\]
where $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$ are the PDFs obtained from the enhanced wavefields $\tilde{d}_{\mathrm{syn}}$ and $\tilde{d}_{\mathrm{obs}}$, respectively. This construction of $J$ is designed to (i) reduce the dominance of high-amplitude arrivals through bounded weighting, (ii) alleviate phase sensitivity via the OT geometry, and (iii) stabilize the numerical scale through observation-dependent normalization.

Choice of likelihood potential. With the above construction, we instantiate the data-consistency potential in the Bayesian likelihood (3) as $\Phi(v; d_{\mathrm{obs}}) := J(v)$, and we often suppress the dependence on $d_{\mathrm{obs}}$ by writing $J(v) \equiv J(v; d_{\mathrm{obs}})$ and $\Phi(v) \equiv \Phi(v; d_{\mathrm{obs}})$.

5.2. Preconditioned guidance in DPS

Building on the conditional-score approximation (15), we now introduce a variable-metric (diagonal) preconditioned guidance scheme tailored to FWI.

Our modification: variable-metric preconditioned guidance. We replace the uniform scalar step size by a positive diagonal preconditioner,
\[
v_{i-1} = v'_{i-1} - P_i\, \nabla_{v_i} \Phi(\hat{v}_0^{(i)}), \qquad P_i = \rho_i D_i, \qquad (23)
\]
where $\rho_i > 0$ is a scalar schedule and $D_i$ is a diagonal positive matrix. When $P_i = \zeta I$, (23) reduces to the baseline DPS step. The guidance gradient in (23) is computed with respect to $v_i$ through the clean estimator (16):
\[
\nabla_{v_i} \Phi(\hat{v}_0^{(i)}) = \bigl(\nabla_{v_i} \hat{v}_0(v_i, i)\bigr)^{\top} \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}).
\]
(24)

We evaluate (24) in practice by automatic differentiation.

Stabilized guidance scheduling. Early reverse-time iterates typically produce $\hat{v}_0^{(i)}$ with spurious high-frequency artifacts. In this regime, the guidance direction can be unreliable, and overly aggressive updates may destabilize the reverse trajectory. We therefore adopt a conservative guidance strength while $\hat{v}_0^{(i)}$ is still rough, and gradually increase it as the estimate becomes smoother and more structured.

To quantify the roughness of the current denoised estimate, we introduce a TV-based indicator [44]. Specifically, let $\mathrm{TV}(\cdot)$ denote the discrete total-variation semi-norm on the 2D grid:
\[
\mathrm{TV}(v) = \frac{1}{N_g} \sum_{\ell} \bigl( |(\nabla_x v)_\ell| + |(\nabla_z v)_\ell| \bigr), \qquad (25)
\]
where $\ell$ indexes grid points, $N_g$ is the number of grid points, and $\nabla_x$, $\nabla_z$ are forward finite differences. We use $\mathrm{TV}(\hat{v}_0^{(i)})$ as a roughness proxy because spurious oscillatory artifacts in early iterates typically manifest themselves through large local finite differences and hence elevated TV values. At the same time, TV is less sensitive than quadratic smoothness indicators to genuine sharp interfaces, which are common in subsurface models. It therefore provides a simple and computationally efficient criterion for distinguishing unstable, artifact-dominated iterates from structurally meaningful ones. Based on this proxy, we define the step-dependent scalar factor
\[
\rho_i = \rho_0 \exp\!\left( -\frac{\bigl(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau\bigr)_+}{c} \right), \qquad (\,\cdot\,)_+ := \max\{\cdot, 0\}, \qquad (26)
\]
with hyperparameters $\rho_0 > 0$, $c > 0$, and an optional threshold $\tau \ge 0$. By construction, $\rho_i \in (0, \rho_0]$. Moreover, $\rho_i = \rho_0$ whenever $\mathrm{TV}(\hat{v}_0^{(i)}) \le \tau$, and it decreases monotonically with $\mathrm{TV}(\hat{v}_0^{(i)})$ once $\mathrm{TV}(\hat{v}_0^{(i)}) > \tau$. The parameter $c$ sets the decay scale: larger $c$ leads to slower attenuation, that is, less sensitivity to roughness.
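The indicator (25) and schedule (26) can be sketched directly (a minimal NumPy version; dropping the last row/column of forward differences at the boundary is our own simplification):

```python
import numpy as np

def tv_indicator(v):
    """Discrete TV semi-norm of Eq. (25): mean absolute forward difference
    over the 2D grid (boundary differences omitted for simplicity)."""
    dx = np.abs(np.diff(v, axis=1))              # forward differences along x
    dz = np.abs(np.diff(v, axis=0))              # forward differences along z
    return (dx.sum() + dz.sum()) / v.size

def guidance_strength(v0_hat, rho0, c, tau):
    """TV-aware scalar schedule of Eq. (26): full strength rho0 for smooth
    estimates (TV <= tau), exponentially damped for rough ones."""
    excess = max(tv_indicator(v0_hat) - tau, 0.0)
    return rho0 * np.exp(-excess / c)
```

A constant field thus receives the full strength $\rho_0$, while a highly oscillatory estimate is strongly damped.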
The threshold $\tau$ defines a tolerance level for total variation and prevents excessive suppression of guidance in the presence of genuine interfaces, such as sharp geological contrasts.

Spatial diagonal preconditioning. FWI gradients typically decay with depth due to limited illumination and attenuation, leading to highly nonuniform update magnitudes. To mitigate this imbalance, we introduce a diagonal rescaling based on the current misfit gradient with respect to the clean estimate. Let $g^{(i)} := \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}) \in \mathbb{R}^n$. We define the diagonal weights
\[
\kappa_j^{(i)} = \left( \frac{\|g^{(i)}\|_\infty + \varepsilon}{|g_j^{(i)}| + \varepsilon} \right)^{\gamma}, \qquad D_i := \mathrm{diag}\bigl(\kappa_1^{(i)}, \ldots, \kappa_n^{(i)}\bigr), \qquad (27)
\]
where $\gamma \ge 0$ and $\varepsilon > 0$. We build $D_i$ from $g^{(i)} = \nabla_{\hat{v}_0} \Phi$ because it directly reflects the physical FWI sensitivity in the model domain; the chain rule in (24) then maps this sensitivity back to the current diffusion state.

Interpretation via a descent inequality. For each reverse-time index $i$, consider the composite functional $\tilde\Phi_i(v) := \Phi(\hat{v}_0^{(i)}(v))$. Assume that $\tilde\Phi_i$ is differentiable and has a locally $L_i$-Lipschitz gradient on a neighborhood containing the relevant iterates. Then the descent lemma [45] yields that, for any symmetric positive semidefinite matrix $P$ satisfying $\|P\|_2 \le 1/L_i$,
\[
\tilde\Phi_i\bigl(v - P \nabla \tilde\Phi_i(v)\bigr) \le \tilde\Phi_i(v) - \tfrac{1}{2} \|\nabla \tilde\Phi_i(v)\|_P^2, \qquad \|a\|_P^2 := a^\top P a. \qquad (28)
\]

Algorithm 2: OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS)
  Require: $N$, schedule $\{\beta_i\}_{i=1}^N$, score network $s_{\theta^*}$, variances $\{\hat\sigma_i\}$, fixed observation $d_{\mathrm{obs}}$, hyperparameters $\rho_0, c, \tau, \gamma, \varepsilon$ (and optional clipping bounds), objective function $J$ as in (22).
  Result: approximate posterior sample $v_0$.
  Initialize $v_N \sim \mathcal{N}(0, I)$; set $\alpha_i := 1 - \beta_i$ and $\bar\alpha_i := \prod_{j=1}^{i} \alpha_j$.
  for $i = N$ down to $1$ do
    $\hat{v}_0^{(i)} \leftarrow \frac{1}{\sqrt{\bar\alpha_i}}\bigl( v_i + (1 - \bar\alpha_i)\, s_{\theta^*}(v_i, i) \bigr)$;
    $z \sim \mathcal{N}(0, I)$;
    $v'_{i-1} \leftarrow \frac{\sqrt{\alpha_i}\,(1 - \bar\alpha_{i-1})}{1 - \bar\alpha_i}\, v_i + \frac{\sqrt{\bar\alpha_{i-1}}\,\beta_i}{1 - \bar\alpha_i}\, \hat{v}_0^{(i)} + \hat\sigma_i z$;
    $g^{(i)} \leftarrow \nabla_{\hat{v}_0^{(i)}} J(\hat{v}_0^{(i)})$;
    $\kappa_j^{(i)} \leftarrow \bigl( (\|g^{(i)}\|_\infty + \varepsilon) / (|g_j^{(i)}| + \varepsilon) \bigr)^{\gamma}$; form $D_i = \mathrm{diag}(\kappa_1^{(i)}, \ldots, \kappa_n^{(i)})$;
    $\rho_i \leftarrow \rho_0 \exp\bigl( -(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau)_+ / c \bigr)$;
    set $P_i \leftarrow \rho_i D_i$ (optionally with clipping and/or spectral-norm control);
    $v_{i-1} \leftarrow v'_{i-1} - P_i\, \nabla_{v_i} J(\hat{v}_0^{(i)})$;

Applying (28) pointwise with $P$ at the current iterate, this estimate provides a heuristic justification for choosing $P_i = \rho_i D_i$. Indeed, by tuning $\rho_i$ and clipping the diagonal entries of $D_i$, we control the effective spectral norm of $P_i$ and obtain a stabilized variable-metric guidance step, analogous to diagonal preconditioning in ill-conditioned inverse problems. Combining the above ingredients, we arrive at the final Algorithm 2.

6. Experiments

We evaluate our method on OpenFWI [43]. We train the score-based prior using velocity models from the Vel Family and Fault Family datasets. All inversion results are reported on held-out test samples that are disjoint from the training set; representative test models are shown in the first row of Figure 2. The score network is a U-Net [46] trained via denoising score matching under the VP diffusion described in Section 3. Importantly, prior training uses only samples of the model parameter field (velocity) and does not require any wavefield simulations. We compute posterior-driven reconstructions by running a discretized reverse-time diffusion whose drift combines the learned prior score and the data-consistency term induced by $\mathcal{F}$, implemented via our OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS) (Algorithm 2).
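The preconditioned correction at the core of Algorithm 2 can be sketched as follows (a minimal NumPy version; the default parameter values and the clipping strategy are our own assumptions):

```python
import numpy as np

def diag_weights(g, gamma=0.5, eps=1e-4):
    """Diagonal weights kappa of Eq. (27): entries where the misfit gradient is
    small (e.g., poorly illuminated regions) receive proportionally larger steps."""
    g_abs = np.abs(g)
    return ((g_abs.max() + eps) / (g_abs + eps)) ** gamma

def preconditioned_step(v_prop, grad_vi, g_model, rho_i, gamma=0.5, eps=1e-4, clip=None):
    """Variable-metric guidance update of Eq. (23): v_{i-1} = v'_{i-1} - rho_i * D_i * grad.
    grad_vi is the chain-rule gradient w.r.t. the diffusion state (Eq. 24);
    g_model is the misfit gradient w.r.t. the clean estimate, used to build D_i."""
    kappa = diag_weights(g_model, gamma, eps)
    if clip is not None:                         # optional cap on the diagonal entries
        kappa = np.minimum(kappa, clip)
    return v_prop - rho_i * kappa * grad_vi
```

Clipping `kappa` caps the diagonal of $P_i$, which is one simple way to control its spectral norm as suggested by the descent inequality (28).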
The likelihood potential is chosen as $\Phi = J$, the normalized OT misfit of Section 5.1. The reported reconstructions correspond to unseen models that were not used in training the prior score network.

Experimental performance is evaluated using both qualitative inspection and quantitative metrics. As a primary accuracy measure, we report the relative ℓ2 reconstruction error
\[
e_{\ell_2} := \frac{\|v_{\mathrm{true}} - v_{\mathrm{rec}}\|_2}{\|v_{\mathrm{true}}\|_2},
\]
where $v_{\mathrm{true}} \in \mathbb{R}^m$ denotes the ground-truth velocity model (vectorized on the computational grid), $v_{\mathrm{rec}} \in \mathbb{R}^m$ is the reconstruction, and $\|\cdot\|_2$ is the Euclidean norm. In addition, to quantify perceptual fidelity and structural consistency, we report the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) computed between $v_{\mathrm{rec}}$ and $v_{\mathrm{true}}$.

Baselines and implementation details. We compare our method against three baselines:

(i) Baseline 1 ($W_2$ + TV) [47]. We consider a classical baseline that minimizes a normalized $W_2$ waveform misfit with total-variation (TV) regularization. Define
\[
J_{W_2}^{\mathrm{raw}}(v) := \frac{\sum_s \sum_r W_2^2\bigl(\rho_{\mathrm{syn}}(\cdot\,; s, r; v),\, \rho_{\mathrm{obs}}(\cdot\,; s, r)\bigr)}{\sum_s \sum_r \int_0^1 |F_{\mathrm{obs},s,r}^{-1}(\xi)|^2 \, d\xi},
\]
where $\rho_{\mathrm{syn}}(\cdot\,; s, r; v)$ and $\rho_{\mathrm{obs}}(\cdot\,; s, r)$ are probability densities obtained from the raw synthetic and observed traces $d_{\mathrm{syn}}(\cdot\,; s, r; v)$ and $d_{\mathrm{obs}}(\cdot\,; s, r)$, respectively, via the fixed shift–rescale density map (cf. (20)); in particular, no wavefield enhancement is applied here. Moreover, $F_{\mathrm{obs},s,r}$ denotes the CDF of $\rho_{\mathrm{obs}}(\cdot\,; s, r)$, and $F_{\mathrm{obs},s,r}^{-1}$ is the associated quantile function. We then minimize the TV-regularized objective
\[
\min_v\; J_{W_2}^{\mathrm{raw}}(v) + \alpha\, \mathrm{TV}(v),
\]
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). Using a first-order method, a (sub)gradient descent iteration reads
\[
v \leftarrow v - \rho_0 \bigl( \nabla_v J_{W_2}^{\mathrm{raw}}(v) + \alpha\, g_{\mathrm{TV}}(v) \bigr), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),
\]
with a fixed step size $\rho_0 > 0$.
(ii) Baseline 2 (OT-WE + TV). We consider an improved version of the above method that minimizes the wavefield-enhanced normalized OT misfit $J(v)$ in (22) (Section 5.1), augmented with a TV penalty:
\[
\min_v\; J(v) + \alpha\, \mathrm{TV}(v),
\]
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). We solve this problem using a first-order method; in particular, we use the diagonal preconditioned (sub)gradient step
\[
v \leftarrow v - P(v) \bigl( \nabla_v J(v) + \alpha\, g_{\mathrm{TV}}(v) \bigr), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),
\]
where $P(v) = \rho(v) D(v)$ is a positive diagonal preconditioner updated from the current iterate. Here $\rho(v)$ and $D(v)$ are defined by (26) and (27), respectively, after replacing $\hat{v}_0^{(i)}$ by $v$.

(iii) Baseline 3 (DPS) [30]. We use the original diffusion posterior sampling baseline with an ℓ2 (MSE) data-consistency potential,
\[
\Phi_{\ell_2}(v) := \tfrac{1}{2}\, \|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2,
\]
and apply the standard scalar guidance update (17) at each reverse-time step. No OT-based misfit, wavefield enhancement, or preconditioned guidance is used in this baseline.

Across all methods, we use the same forward solver, acquisition geometry, and observation data. For a fair comparison, we tune the hyperparameters of each competing method as carefully as possible (within the same experimental protocol) to achieve its best attainable performance.

6.1. Forward solver and observation data

The forward operator $\mathcal{F}$ (defined in Section 2) is evaluated by numerically solving problem (1) for the wavefield $u$ given the velocity field $v$. For a source located at $\xi_j$, we use the source term $s(x, t; \xi_j) = f(t)\, \delta(x - \xi_j)$, where $f$ is a Ricker wavelet
\[
f(t) = \bigl( 1 - 2\pi^2 f_p^2 (t - t_0)^2 \bigr) \exp\bigl( -\pi^2 f_p^2 (t - t_0)^2 \bigr),
\]
with $f_p = 15$ and $t_0 = 1.1 / f_p$. We discretize (1) on a uniform Cartesian grid of size $71 \times 71$ with spacing $dx = dz = 10$ m and advance in time with step $dt = 10^{-3}$ s for $n_t = 1000$ steps.
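The source wavelet above can be generated directly on the stated time grid (a small sketch; the function name is our own):

```python
import numpy as np

def ricker(t, fp=15.0):
    """Ricker wavelet with peak frequency fp and delay t0 = 1.1/fp, as in the text."""
    t0 = 1.1 / fp
    a = (np.pi * fp * (t - t0)) ** 2             # pi^2 fp^2 (t - t0)^2
    return (1.0 - 2.0 * a) * np.exp(-a)

# time axis matching the discretization in the text: dt = 1e-3 s, n_t = 1000 steps
t = np.arange(1000) * 1e-3
f = ricker(t)
```

The wavelet attains its maximum value 1 at $t = t_0$ and decays rapidly away from it.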
We use an 8th-order accurate finite-difference discretization in space and a second-order accurate explicit time-stepping scheme, as implemented in Deepwave [48]. To approximate an unbounded domain, we truncate the computational domain with absorbing perfectly matched layers (PMLs) of thickness 6 grid points on the left, right, and bottom boundaries. On the top boundary, we impose a reflecting (free-surface) boundary condition. The observed data are generated according to (2), with additive Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma = 0.05$.

6.2. Numerical calculation of the Wasserstein-2 distance

For each source–receiver trace, we compute a one-dimensional Wasserstein-2 distance between the observed and synthetic signals after enforcing nonnegativity and unit mass. In our method, the Wasserstein distance is computed using the wavefield-enhanced traces $\tilde{d}_{\mathrm{obs}}(t)$ and $\tilde{d}_{\mathrm{syn}}(t)$ of (19). We first shift each trace by a constant $c' > 0$ to ensure positivity and then normalize to obtain probability densities:
\[
\tilde\rho_{\mathrm{obs}}(t) = \frac{\tilde{d}_{\mathrm{obs}}(t) + c'}{\int_0^T \bigl(\tilde{d}_{\mathrm{obs}}(\tau) + c'\bigr)\, d\tau + \varepsilon'}, \qquad \tilde\rho_{\mathrm{syn}}(t) = \frac{\tilde{d}_{\mathrm{syn}}(t) + c'}{\int_0^T \bigl(\tilde{d}_{\mathrm{syn}}(\tau) + c'\bigr)\, d\tau + \varepsilon'},
\]
where the integrals are taken over the recording window $[0, T]$ and evaluated on the discrete sampling grid, $\varepsilon' = 10^{-9}$ is a small constant used to avoid division by a vanishing discrete normalization factor, and we fix $c' = 1.1\, |\min_{t \in [0,T]} \tilde{d}_{\mathrm{obs}}(t)|$ in all experiments. Let $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ denote the corresponding CDFs, and let $\tilde{Q}_{\mathrm{obs}} = \tilde{F}_{\mathrm{obs}}^{-1}$ and $\tilde{Q}_{\mathrm{syn}} = \tilde{F}_{\mathrm{syn}}^{-1}$ be the associated quantile functions.
Using the one-dimensional representation
\[
W_2(\tilde\rho_{\mathrm{obs}}, \tilde\rho_{\mathrm{syn}}) = \left( \int_0^1 \bigl| \tilde{Q}_{\mathrm{obs}}(\xi) - \tilde{Q}_{\mathrm{syn}}(\xi) \bigr|^2 \, d\xi \right)^{1/2},
\]
we discretize $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ on the time grid $\{t_k\}_{k=0}^{n_t - 1}$ by trapezoidal quadrature, evaluate $\tilde{Q}_{\mathrm{obs}}(\xi)$ and $\tilde{Q}_{\mathrm{syn}}(\xi)$ via linear interpolation, and approximate the $\xi$-integral on a uniform grid $\xi_j = j/(N_\xi - 1)$ by the trapezoidal rule ($N_\xi = 1000$). The overall OT misfit is obtained by averaging the resulting $W_2$ values over all source–receiver pairs.

Table 1: Hyperparameters used for each dataset. Here $\gamma$ is the exponent in the diagonal preconditioner $D_i$ for OT-WE-PDPS (27). The symbol $\rho_0$ denotes the base guidance strength in the noise-aware schedule $\rho_i$ for OT-WE-PDPS (26), and it also denotes the (constant) step size used in the first-order solver for the $W_2$ + TV baseline. The parameter $\alpha$ is the TV regularization weight used in the TV-regularized baselines (Baseline 1 and Baseline 2), with TV defined in (25).

            CurveVel-A                         CurveVel-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.55   -     0.65      -         0.35   -     0.65      -
ρ0      1.15   8     1.3       13        12.4   4     4         14
α       -      -     0.05      0.2       -      -     1.0       0.2

            FlatFault-A                        FlatFault-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.35   -     0.65      -         0.45   -     0.65      -
ρ0      1.2    0.3   1.1       5         1.45   9     0.6       2
α       -      -     0.3       1.0       -      -     0.2       0.5

            CurveFault-A                       CurveFault-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.45   -     0.65      -         0.55   -     0.65      -
ρ0      1.8    4.15  1.3       12        1.75   5     0.6       14
α       -      -     0.1       0.2       -      -     0.1       0.5

6.3. Quantitative and qualitative comparisons

6.3.1. Acquisition geometry and discretization. We consider a surface acquisition setting in which both sources and receivers are located on the free surface $z = L_z$, where $L_x = L_z = 700$ m. The computational domain is discretized on a $71 \times 71$ Cartesian grid with spacing $dx = dz = 10$ m, where $x$ denotes the horizontal axis and $z$ the depth axis.
Grid nodes are given by $x_i = i\, dx$ and $z_j = j\, dz$ for $i, j = 0, \ldots, 70$. Ten point sources are placed at surface nodes
\[
(x_{s_k}, z_{s_k}) = (i_k\, dx,\, L_z), \qquad i_k \in \{0, 7, 14, 21, 28, 35, 42, 49, 56, 63\},
\]
and 70 receivers are deployed at all surface nodes, $(x_r, z_r) = (r\, dx,\, L_z)$, $r = 0, 1, \ldots, 69$.

6.3.2. Test datasets. We evaluate six velocity models shown in Figure 2: CurveVel-A/B, FlatFault-A/B, and CurveFault-A/B. These models are designed to probe increasing structural complexity, from smooth layered interfaces to sharp discontinuities and their combinations. CurveVel-A/B are layered media with clear interfaces (including curved boundaries), intended to test interface recovery.

Table 2: Quantitative comparison of inversion methods on OpenFWI velocity models. Metrics are the relative ℓ2 error $e_{\ell_2}$, PSNR, and SSIM of the reconstructed velocity field (arrows indicate the preferred direction). Methods include $W_2$ + TV (Baseline 1), OT-WE + TV (Baseline 2), DPS (Baseline 3), and the proposed method.

            CurveVel-A                CurveVel-B                FlatFault-A
Method      e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑
W2+TV       12.10%   20.65  0.4971   10.72%   21.05  0.5425   11.84%   20.56  0.7198
OT-WE+TV    7.96%    24.28  0.6987   7.40%    24.27  0.6434   7.28%    24.79  0.9217
DPS         23.15%   15.01  0.5747   15.70%   17.73  0.4329   16.68%   17.58  0.6965
Ours        3.50%    31.42  0.9039   2.04%    35.45  0.9517   2.91%    32.74  0.9646

            FlatFault-B               CurveFault-A              CurveFault-B
Method      e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑
W2+TV       22.82%   17.05  0.4584   17.63%   17.74  0.4875   16.34%   18.20  0.4343
OT-WE+TV    6.31%    28.21  0.8729   4.79%    29.06  0.7897   9.39%    23.01  0.6997
DPS         13.54%   21.58  0.5662   10.72%   22.06  0.6149   21.82%   15.69  0.2210
Ours        6.13%    28.48  0.9272   2.41%    35.03  0.9540   5.32%    27.95  0.7956
In CurveVel-A, velocities within layers increase gradually with depth, whereas in CurveVel-B the layer velocities are randomly distributed. FlatFault-A/B introduce faults that shift layers and generate sharp discontinuities; compared with CurveVel-A/B, they exhibit more discontinuities and more severe velocity contrasts. CurveFault-A/B combine curved stratified interfaces with fault-induced discontinuities, thereby coupling smooth geometric variation with sharp jumps and posing the most challenging scenarios among the six datasets.

6.3.3. Settings. We set $k = 100$ in (18) and $\varepsilon = 10^{-4}$ in the construction of $D_i$ (27). When computing $\rho_i$ (26), we set $c = 0.1$ and $\tau = 0$ in our method. These settings are kept fixed across all datasets. Other hyperparameters vary across datasets and methods and are listed in Table 1. For Baseline 1 ($W_2$ + TV) and Baseline 2 (OT-WE + TV), we run gradient descent for 10,000 iterations and report the final iterate; step sizes and regularization weights are listed in Table 1. Across all methods, we use the same forward solver, acquisition geometry, and observation data. Hyperparameters are tuned under a unified protocol to obtain the best attainable performance for each competing method.

6.3.4. Results. The following observations are drawn from Table 2 and Figure 2:

1. Our method attains the lowest reconstruction errors and the highest image-quality metrics across the datasets, while preserving interfaces and fault geometries in Figure 2.

Figure 2: Inversion results across different datasets. The first row displays the true velocity field $v_{\mathrm{true}}$.
Rows 2 to 5 show the inverted velocity fields $v_{\mathrm{rec}}$ obtained using the different algorithms. Rows 6 to 9 present the difference between the true and inverted velocity fields, $v_{\mathrm{rec}} - v_{\mathrm{true}}$. The observed wavefield has been perturbed with additive noise of intensity $\sigma = 0.05$.

Figure 3: Relative ℓ2-error of the reconstructed velocity field $v$ versus the TV regularization parameter $\alpha$ for the two deterministic methods: (a) $W_2$ + TV and (b) OT-WE + TV.

Figure 4: Relative ℓ2-error of the reconstructed velocity field $v$ versus the iteration number for the two deterministic methods with their optimal TV regularization parameters: (a) $W_2$ + TV with $\alpha = 0.2$ and (b) OT-WE + TV with $\alpha = 0.08$.

In contrast, the DPS baseline often yields reconstructions with a larger mismatch to the ground truth, suggesting that the standard MSE-based wavefield guidance may be too weak to enforce data consistency under surface-only acquisition.
These results indicate that our proposed method achieves higher reconstruction quality than the DPS baseline under surface-only acquisition.

2. The $W_2$ + TV baseline often produces over-smoothed reconstructions, with blurred interfaces and loss of structural detail, which is reflected by lower SSIM and visually smeared layer boundaries. Our method mitigates this over-smoothing and recovers the main structures, although errors remain in challenging regions.

3. As the model complexity increases (e.g., more layers, stronger discontinuities, and curved/faulted interfaces), the inversion problem becomes more ill-posed and the reconstruction accuracy degrades. For our method, residual errors concentrate near sharp velocity transitions and at larger depths, consistent with reduced illumination and parameter sensitivity in deeper regions for surface-only acquisition.

4. The OT-WE + TV baseline frequently outperforms the $W_2$ + TV baseline in terms of both quantitative metrics and visual quality. This observation suggests that wavefield enhancement and adaptive step sizing are effective not only within the DPS framework, but may also improve the performance of other inversion methods.

Table 3: Ablation results on CurveVel-B: relative ℓ2 error ($e_{\ell_2}$; lower is better), PSNR and SSIM (higher is better). Here "Only" denotes the baseline DPS update with the specified misfit, "Enh" enables wavefield enhancement (amplitude-adaptive data weighting and misfit-scale normalization), "Step" enables the preconditioned guidance update $P_i = \rho_i D_i$, and "All" enables both Enh and Step.

Method                      e_ℓ2↓     PSNR↑   SSIM↑
MSE guidance  Only          18.21%    16.45   0.3827
              Step          12.52%    19.70   0.5082
              Enh           9.20%     22.37   0.6578
              All           5.10%     27.50   0.8226
W2 guidance   Only          10.81%    20.98   0.6307
              Step          8.65%     22.91   0.6726
              Enh           5.33%     27.13   0.8244
              All (ours)    2.04%     35.45   0.9517
To verify that the selected regularization parameter α is appropriate for the deterministic methods, we take the CurveVel-A dataset as an example and plot, in Figure 3, the relationship between the regularization parameter and the final relative error for (i) the W2 + TV method and (ii) the OT-WE + TV method. In addition, Figure 4 shows the evolution of the relative error with respect to the iteration number for these two methods. Since Figure 4a indicates that the relative error of the W2 + TV method is still decreasing, we further continue this method to 80,000 iterations under the same parameter setting. Even then, the relative error is only reduced to 10.5%, which is still higher than the best relative error achieved by the OT-WE + TV method, namely 8%. The results indicate that, for the deterministic methods, the final inversion error is sensitive to the choice of the TV regularization parameter α, and a best-performing value can be identified within the tested parameter range. For the remaining datasets, the parameter α was selected by the same tuning procedure in order to achieve the best performance attainable within the tested parameter range for each deterministic method. In addition, measured in terms of iteration count, the deterministic methods require substantially more iterations to converge than our method.

6.4. Ablation Study

In the preceding experiments, we evaluated the reconstruction performance of our method for FWI. We next report an ablation study to assess the contribution of its principal components.

Figure 5: Ablation study on CurveVel-B. The top panel shows the ground-truth velocity model. The second and third panels report reconstructions obtained with W2-based and MSE-based guidance, respectively. From left to right, the columns correspond to Only (baseline DPS), Step (preconditioned guidance update P_i = ρ_i D_i), Enh (wavefield enhancement: amplitude-adaptive weighting and misfit-scale normalization), and All (Step + Enh). The bottom two panels show the corresponding error maps (reconstruction minus ground truth).

Taking the original DPS framework as the reference, we investigate: (i) the choice of data-misfit metric, comparing MSE with the Wasserstein-2 distance (W2); (ii) wavefield enhancement, consisting of amplitude-adaptive data weighting and misfit-scale normalization (Section 5.1); and (iii) the preconditioned guidance update P_i = ρ_i D_i (Section 5.2), which combines stabilized guidance scheduling with spatially adaptive diagonal scaling. Unless otherwise noted, all ablation experiments are performed on the CurveVel-B model in Figure 2. For each setting, hyperparameters are selected according to the same tuning protocol so as to achieve the best performance attainable within that setting. The results are summarized in Figure 5 and Table 3. Figure 6 further reports the evolution of the relative ℓ2 error (with respect to v_true) for the intermediate reconstructions v̂_0^(i) and v_i over the reverse-diffusion steps t.

Notation. Throughout this ablation study, each label has the form (misfit)-(component). The prefix W2- or MSE- specifies the guidance misfit used in the data-consistency term, namely the Wasserstein-2 distance (W2) or the mean-squared error (MSE), respectively. The suffix indicates which additional components are enabled:

• Only: baseline DPS with the chosen misfit, without any of our additional modifications.
• Enh: baseline DPS plus wavefield enhancement, i.e., amplitude-adaptive data weighting together with misfit-scale normalization (Sec. 5.1).

• Step: baseline DPS plus the preconditioned guidance update P_i = ρ_i D_i (Sec. 5.2), which combines the noise-aware scalar schedule ρ_i and the spatial diagonal scaling D_i.

• All: baseline DPS with both Enh and Step enabled.

For example, W2-Enh uses W2 guidance together with wavefield enhancement, while MSE-All enables both enhancements under MSE guidance.

Observations. Figures 5–6 and Table 3 reveal the following trends:

(i) Combining wavefield enhancement with the preconditioned update yields the best overall performance. In particular, W2-All achieves e_ℓ2 = 2.04%, PSNR = 35.45, and SSIM = 0.9517, substantially improving over W2-Only (e_ℓ2 = 10.81%, PSNR = 20.98, SSIM = 0.6307).

(ii) Wavefield enhancement provides the larger gain when used alone, while the preconditioned update offers additional improvement when combined. For example, under W2 guidance, adding wavefield enhancement reduces e_ℓ2 from 10.81% (W2-Only) to 5.33% (W2-Enh), whereas adding the preconditioned update alone reduces it to 8.65% (W2-Step). Enabling both components yields the best performance, achieving e_ℓ2 = 2.04% (W2-All).

(iii) W2-based guidance is consistently more robust than MSE-based guidance across all ablation settings, both in the baseline configuration (Only) and in the combined configuration (All). This is also reflected in the visual reconstructions and error maps in Figure 5.

(iv) The metric trajectories suggest improved stability with preconditioned guidance. As shown in Figure 6, enabling P_i = ρ_i D_i reduces transient oscillations during intermediate reverse steps and leads to more stable convergence, with the combined setting (All) achieving the best final metrics.

6.5. Robustness Study

6.5.1.
Robustness to Observation Noise

We evaluate the robustness of our method to observation noise by corrupting the recorded data with additive noise at increasing levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5 × 10^-2}. All other hyperparameters are kept identical to those used in the previous experiments. Qualitative reconstructions are shown in Figure 7, and quantitative results, including the relative ℓ2-error, PSNR, and SSIM, are reported in Table 4. Overall, the reconstructions remain visually stable across all noise levels for all six datasets: the major interfaces and fault geometries are preserved, and the predicted velocity ranges remain consistent (Figure 7). Quantitatively, the performance degrades only mildly as σ increases, with nonmonotone but limited variations in e_ℓ2, PSNR, and SSIM (Table 4). These results indicate that the proposed guidance remains effective under moderate levels of observation noise.

Figure 6: Evolution of the statistical metrics over reverse-time steps for the ablation settings Enh, Step, Only and All. Panels (a)–(c) show the relative ℓ2-error, PSNR and SSIM of the intermediate iterate v_i under W2-based guidance; panels (d)–(f) show the corresponding metrics for the denoised estimate v̂_0^(i) under W2-based guidance. Panels (g)–(i) and (j)–(l) report the analogous quantities under MSE-based guidance for v_i and v̂_0^(i), respectively.

Figure 7: Qualitative robustness to observation noise across the six datasets. Each column corresponds to one dataset. The first row shows the true velocity field, and the remaining rows show the reconstructed velocity fields under progressively increasing noise levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5 × 10^-2}. The reconstructions remain visually stable as the noise level increases, with the main interfaces and fault structures largely preserved.

6.6. Generalization across forward operators

An important feature of the proposed framework is that the learned score-based prior is trained only on prior velocity models and does not depend on a specific forward operator. Once this prior has been trained, it can be coupled with different forward models F for inversion without retraining the prior network. In this subsection, we examine the robustness of the proposed method with respect to changes in the forward operator by perturbing the acquisition and physical settings and then re-running the inversion. We consider the FlatFault-B dataset and define five forward-operator configurations. The baseline configuration, denoted by Init, coincides with that used in the preceding experiments. We then modify the acquisition and physical settings in several ways, thereby
obtaining perturbed forward operators as follows:

1. Freq10: we replace the temporal source function by a Ricker wavelet of peak frequency f_p = 10 Hz, while keeping the spatial point-source locations, all other parameters, and the acquisition geometry unchanged. Consequently, the observed and synthetic wavefields are generated with a different source peak frequency.

2. Depth2: we shift the acquisition array downward by two grid intervals by setting the source and receiver depths to z = L_z − 2 dx instead of z = L_z, while keeping their horizontal coordinates fixed. This modification changes how waves propagate from the sources to the receivers and how the subsurface structures are illuminated.

3. Shot17: relative to the baseline configuration with 10 sources, we increase the number of sources to 17 and distribute them uniformly along the surface, while keeping the receiver layout unchanged. This perturbation yields denser illumination and increased data redundancy for the inversion.

4. AllChange: we apply Freq10, Depth2, and Shot17 simultaneously, thereby introducing a compound perturbation of the forward operator that modifies the source peak frequency, the source and receiver depths, and the number of sources.

Table 4: Quantitative robustness to observation noise across the six datasets. Relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better) of the reconstructed velocity fields under different noise levels σ.

              CurveVel-A               CurveVel-B               FlatFault-A
σ         e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑
0         3.29%   31.96  0.9111   1.95%   35.83  0.9584   1.60%   37.94  0.9820
10^-4     3.29%   31.96  0.9113   1.94%   35.90  0.9594   1.84%   36.74  0.9691
10^-3     3.39%   31.70  0.9075   2.23%   34.69  0.9451   1.95%   36.20  0.9739
10^-2     3.62%   31.12  0.9121   2.07%   35.34  0.9557   1.97%   36.15  0.9747
5×10^-2   3.50%   31.42  0.9039   2.04%   35.45  0.9517   2.91%   32.74  0.9646

              FlatFault-B              CurveFault-A             CurveFault-B
σ         e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑
0         6.07%   28.56  0.9242   2.41%   35.01  0.9602   5.67%   27.39  0.7920
10^-4     6.03%   28.61  0.9240   2.22%   35.75  0.9606   6.03%   26.86  0.7779
10^-3     6.10%   28.51  0.9225   2.45%   34.90  0.9552   6.34%   26.43  0.7407
10^-2     6.15%   28.45  0.9214   2.26%   35.56  0.9609   6.65%   26.01  0.7304
5×10^-2   6.13%   28.48  0.9272   2.41%   35.03  0.9540   5.32%   27.95  0.7956

For a fair comparison across the different operator settings, the network architecture, diffusion schedule, and all remaining hyperparameters are kept identical to those used in the preceding experiments. For each operator configuration, only the guidance-scale parameter ρ_0 is tuned, following the same protocol, so as to achieve the best performance attainable for our method; the selected values are listed in Table 5.
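The Ricker wavelet used in the Freq10 configuration has the standard closed form r(t) = (1 − 2π²f_p²(t − t0)²) exp(−π²f_p²(t − t0)²). A minimal sketch follows; the causal delay t0 = 1.5/f_p is our illustrative choice, not a value taken from the paper.

```python
import numpy as np

def ricker(t, f_peak, t0=None):
    # Standard Ricker wavelet with peak (dominant) frequency f_peak in Hz.
    # t0 delays the wavelet so it is approximately causal on t >= 0;
    # the 1.5/f_peak heuristic is an assumption, not the paper's setting.
    if t0 is None:
        t0 = 1.5 / f_peak
    a = (np.pi * f_peak * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

dt = 1e-3                          # 1 ms sampling
t = np.arange(0.0, 1.0, dt)
src = ricker(t, f_peak=10.0)       # the Freq10 source wavelet
print(f"peak at t = {t[np.argmax(src)]:.3f} s")   # 0.150 s, i.e. the chosen delay
```

Lowering the peak frequency widens the wavelet in time, which changes both the observed and the synthetic wavefields consistently, as noted above.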
The corresponding quantitative metrics and qualitative reconstructions are reported in Table 5 and Figure 8, respectively. It can be observed that, across all forward-operator configurations, our method consistently yields high-quality reconstructions, with SSIM exceeding 0.9 and relative errors below 6.2% (Table 5). These results indicate that, once the score-based prior is trained on velocity models, it can be coupled with modified acquisition/physics settings (e.g., changes in source frequency, source/receiver depth, or the number of shots) without retraining the prior. This operator-agnostic property substantially broadens the applicability of the proposed inversion framework.

Figure 8: Inversion results under different forward-operator configurations (FlatFault-B). From left to right: Init (baseline setup), Shot17 (17 uniformly distributed sources), Depth2 (sources and receivers shifted downward by 2 dx), Freq10 (Ricker wavelet peak frequency f_p = 10 Hz), and AllChange (all modifications applied). The reconstructions remain visually consistent across operator changes, indicating that the proposed method generalizes well to perturbed forward operators without retraining the prior.

Table 5: Reconstruction performance on FlatFault-B under different forward-operator configurations. Reported are the tuned values of ρ_0, together with the relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better).

Operator     ρ_0    e_ℓ2↓    PSNR↑    SSIM↑
Init         1.45   6.13%    28.48    0.9272
Shot17       2.25   5.24%    29.84    0.9399
Depth2       1.60   5.37%    29.62    0.9343
Freq10       1.00   5.85%    28.87    0.9160
AllChange    1.30   5.82%    28.92    0.9168

6.7. A unified diffusion model for heterogeneous velocity fields

6.7.1.
Performance of the mixed model on OpenFWI

In the preceding experiments, a separate diffusion model was trained for each family of velocity fields. Although this dataset-specific strategy is effective when the structural class is known a priori, it is of limited practical value for inverse problems, where the subsurface structure is typically unknown. To address this issue, we next consider a mixed model trained jointly on a heterogeneous dataset assembled from multiple structural classes. The resulting unified model is designed to learn a broader prior and thus avoids the need to select a class-specific model at the inference stage.

Figure 9: Comparison of inversion results produced by the mixed and separate models for CurveVel-A, FlatFault-A, and CurveFault-B. The first row shows the true velocity models v_true. The second and third rows display the reconstructed velocity models v_rec obtained by the mixed model and the separate model, respectively. The fourth and fifth rows show the reconstruction errors, v_rec − v_true, corresponding to the mixed and separate models, respectively.

Specifically, we randomly selected 7,500 samples from each of CurveVel-A, CurveVel-B, FlatFault-A, FlatFault-B, CurveFault-A, and CurveFault-B, and merged them into a single training set. The mixed model was then evaluated on several test datasets, among which CurveVel-A, FlatFault-A, and CurveFault-B are reported here as representative examples. We adopted the same operator setting as in Section 6.3.1 and carefully tuned the remaining hyperparameters. More precisely, we set k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, and chose ρ_0 = 0.95, 0.85, and 1.4 for CurveVel-A, FlatFault-A, and CurveFault-B, respectively. For each dataset, the performance of the mixed model is compared with that of the corresponding separate model trained exclusively on the same dataset. The quantitative and qualitative results are reported in Table 6 and Figure 9, respectively.

It can be observed that, when the score-based neural network is trained on the mixed dataset rather than on a single dataset, the inversion performance degrades slightly across all three representative examples. Nevertheless, the mixed model still yields satisfactory reconstructions and remains clearly competitive in practice. These results indicate that, when strong prior knowledge of the target velocity field is available, a dataset-specific model trained on structurally homogeneous samples is generally preferable. By contrast, when prior information is limited or the structural type is unknown, the mixed model provides a practical and flexible alternative. A plausible reason for the performance gap is that the heterogeneous training distribution makes it more challenging for a single score network to capture the full range of structural patterns with equal fidelity. As a consequence, the learned prior may become less accurate for individual classes than that obtained from dataset-specific training. Moreover, owing to computational constraints, the mixed model was not trained on the full collection of available samples, which may also contribute to the observed performance gap.

Table 6: Quantitative comparison of inversion results on OpenFWI velocity models using the separate and mixed models. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).
Dataset                         e_ℓ2↓    PSNR↑    SSIM↑
CurveVel-A (separate model)     3.50%    31.42    0.9039
CurveVel-A (mixed model)        3.87%    30.54    0.8906
FlatFault-A (separate model)    2.91%    32.74    0.9646
FlatFault-A (mixed model)       3.28%    31.72    0.9179
CurveFault-B (separate model)   5.32%    27.95    0.7956
CurveFault-B (mixed model)      5.91%    27.04    0.7079

6.7.2. Performance of the mixed model on the Marmousi2 model

An important practical advantage of the mixed model is its ability to provide a unified prior for structures not explicitly represented by any single training class. To investigate this generalization capability, we applied the mixed model to the inversion of the Marmousi2 dataset [49]. We emphasize that the training data contained no samples from Marmousi2. In the numerical experiments, local velocity models of different sizes were cropped from Marmousi2 and normalized to form dimensionless 71 × 71 data matrices. For all local velocity models, we fixed k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, while ρ_0 was chosen separately for each example. We considered four local velocity models cropped from Marmousi2 at different physical scales. Their physical sizes, grid spacings, and the corresponding values of ρ_0 are summarized in Table 7. For all cases, we used exactly the same trained mixed model without any fine-tuning or distillation. We followed the same experimental setup as in the Depth2 experiment of the forward-operator generalization section. The quantitative metrics and qualitative reconstructions are reported in Table 8 and Figure 10, respectively.

Figure 10: Inversion results for the four local velocity models extracted from Marmousi2, denoted by Marmousi2-P1 to P4. The first row shows the true velocity models v_true. The second row presents the reconstructed velocity models v_rec obtained by the proposed method with the mixed model. The third row shows the corresponding reconstruction errors v_rec − v_true.

Table 7: Experimental settings for the four local velocity models extracted from Marmousi2.

Local model     Physical size     dx (m)   dz (m)   ρ_0
Marmousi2-P1    700 m × 175 m     10       2.5      3.9
Marmousi2-P2    700 m × 263 m     10       3.75     17.9
Marmousi2-P3    1000 m × 175 m    100/7    2.5      6.1
Marmousi2-P4    1500 m × 175 m    150/7    2.5      5.0

Table 8: Quantitative comparison of inversion results obtained by the mixed model for the four local velocity models extracted from Marmousi2. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).

Local model                      e_ℓ2↓    PSNR↑    SSIM↑
Marmousi2-P1 (700 m × 175 m)     1.80%    35.95    0.9206
Marmousi2-P2 (700 m × 263 m)     3.57%    30.49    0.7553
Marmousi2-P3 (1000 m × 175 m)    2.81%    33.21    0.8975
Marmousi2-P4 (1500 m × 175 m)    5.31%    26.96    0.8536

The local velocity models extracted from Marmousi2 differ substantially from the OpenFWI datasets in structural features. Despite this mismatch, the proposed method using the mixed model still produces satisfactory reconstructions for local velocity models of different sizes. This is supported both qualitatively by Figure 10, where the reconstructed velocity models preserve the main subsurface structures, and quantitatively by the relative ℓ2 error, PSNR, and SSIM reported in Table 8, which indicate good agreement between the reconstructed and true velocity models. Since no Marmousi2-specific samples were included in the training data, these results provide evidence of the cross-dataset generalization capability of the proposed approach. In this sense, the method offers a promising data-driven framework for mitigating the difficulty arising from limited training data in full-waveform inversion.

7.
Conclusion

In this paper, we proposed a physics-guided diffusion framework for seismic velocity reconstruction in full-waveform inversion. The method combines an optimal-transport (OT) data-consistency potential with a stabilized variable-metric guidance rule in the reverse diffusion process. On the data-consistency side, the OT potential incorporates bounded amplitude-adaptive reweighting together with a one-dimensional Wasserstein discrepancy evaluated through quantile functions. This construction reduces the dominance of large-amplitude early arrivals and alleviates the sensitivity of pointwise ℓ2 objectives to time and phase misalignment. On the guidance side, the scalar DPS correction is replaced by a diagonal preconditioner P_i = ρ_i D_i, where ρ_i provides a step-dependent scalar schedule and D_i supplies spatially adaptive scaling. This yields a more balanced and stable guidance mechanism for the heterogeneous sensitivities arising in FWI.

The numerical results on the OpenFWI benchmarks show that, under matched computational budgets, the proposed method improves reconstruction quality relative to both the standard DPS approach and the deterministic baselines. In particular, the reconstructions produced by the proposed method exhibit lower relative ℓ2 errors and fewer visible artifacts across the tested datasets.

Several directions remain for future work. First, the computational cost of physics-based guidance is still substantial, and it would be valuable to develop more efficient implementations and acceleration strategies, for example through multi-fidelity surrogates. Second, an important extension is to consider more realistic forward models, including elastic, anisotropic, and attenuative media, as well as three-dimensional acquisition geometries.

Acknowledgments

Zheng Ma is supported by NSFC Grant No.
12531016 and Beijing Institute of Applied Physics and Computational Mathematics funding HX02023-6. Xiong-Bin Yan is supported by NSFC Grant No. 12401563. In addition, we thank the Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS) for their financial support. This research was funded by SIMIS under grant number SIMIS-ID-2025-ST. The authors are grateful for the resources and facilities provided by SIMIS, which were essential for the completion of this work.

References

[1] A. J. Berkhout, Pushing the limits of seismic imaging, Part II: Integration of prestack migration, velocity estimation, and AVO analysis, Geophysics 62 (3) (1997) 954–969.
[2] C. Zelt, R. Smith, Seismic traveltime inversion for 2-D crustal velocity structure, Geophysical Journal International 108 (1) (1992) 16–34.
[3] F. Clement, G. Chavent, S. Gómez, Migration-based traveltime waveform inversion of 2-D simple structures: A synthetic example, Geophysics 66 (3) (2001) 845–860.
[4] J. Hudson, J. Heritage, The use of the Born approximation in seismic scattering problems, Geophysical Journal International 66 (1) (1981) 221–240.
[5] K. Muhumuza, M. Jakobsen, T. Luostari, T. Lähivaara, Seismic monitoring of CO2 injection using a distorted Born T-matrix approach in acoustic approximation, Journal of Seismic Exploration 27 (2018) 403–431.
[6] A. Tarantola, Inversion of seismic reflection data in the acoustic approximation, Geophysics 49 (8) (1984) 1259–1266.
[7] M. Warner, A. Ratcliffe, T. Nangoo, J. Morgan, A. Umpleby, N. Shah, V. Vinje, I. Štekl, L. Guasch, C. Win, Anisotropic 3D full-waveform inversion, Geophysics 78 (2) (2013) R59–R80.
[8] M. Jakobsen, B. Ursin, Full waveform inversion in the frequency domain using direct iterative T-matrix methods, Journal of Geophysics and Engineering 12 (3) (2015) 400–418.
[9] J. Virieux, S. Operto, An overview of full-waveform inversion in exploration geophysics, Geophysics 74 (6) (2009) WCC1–WCC26.
[10] A. Fichtner, Full Seismic Waveform Modelling and Inversion, Springer Science & Business Media, 2010.
[11] G. Yao, N. V. da Silva, M. Warner, D. Wu, C. Yang, Tackling cycle skipping in full-waveform inversion with intermediate data, Geophysics 84 (3) (2019) R411–R427.
[12] Y. Yang, B. Engquist, J. Sun, B. F. Hamfeldt, Application of optimal transport and the quadratic Wasserstein metric to full-waveform inversion, Geophysics 83 (1) (2018) R43–R62.
[13] R. G. Pratt, Seismic waveform inversion in the frequency domain, Part 1: Theory and verification in a physical scale model, Geophysics 64 (3) (1999) 888–901.
[14] C. Bunks, F. M. Saleck, S. Zaleski, G. Chavent, Multiscale seismic waveform inversion, Geophysics 60 (5) (1995) 1457–1473.
[15] O. Gauthier, J. Virieux, A. Tarantola, Two-dimensional nonlinear inversion of seismic waveforms: Numerical results, Geophysics 51 (7) (1986) 1387–1403.
[16] L. Métivier, R. Brossier, Q. Mérigot, E. Oudet, J. Virieux, Measuring the misfit between seismograms using an optimal transport distance: Application to full waveform inversion, Geophysical Supplements to the Monthly Notices of the Royal Astronomical Society 205 (1) (2016) 345–377.
[17] Y. Liu, J. Teng, T. Xu, Y. Wang, Q. Liu, J. Badal, Robust time-domain full waveform inversion with normalized zero-lag cross-correlation objective function, Geophysical Journal International 209 (1) (2017) 106–122.
[18] E. Bozdağ, J. Trampert, J. Tromp, Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements, Geophysical Journal International 185 (2) (2011) 845–870.
[19] B. Chi, L. Dong, Y. Liu, Full waveform inversion method using envelope objective function without low frequency data, Journal of Applied Geophysics 109 (2014) 36–46.
[20] M. Warner, L. Guasch, Adaptive waveform inversion: Theory, Geophysics 81 (6) (2016) R429–R445.
[21] L. Qiu, J. Ramos-Martínez, A. Valenciano, Y. Yang, B. Engquist, Full-waveform inversion with an exponentially encoded optimal-transport norm, in: SEG Technical Program Expanded Abstracts 2017, 2017, pp. 1286–1290.
[22] Z. Li, Y. Tang, J. Chen, H. Wu, The quadratic Wasserstein metric with squaring scaling for seismic velocity inversion, arXiv preprint arXiv:2201.11305 (2022).
[23] Y. Gao, F. Tilmann, A. Rietbrock, A review of misfit functions for adjoint full waveform inversion in seismology, Geophysical Journal International 235 (3) (2023) 2794–2827.
[24] L. Mosser, O. Dubrule, M. J. Blunt, Stochastic seismic waveform inversion using generative adversarial networks as a geological prior, Mathematical Geosciences 52 (1) (2020) 53–79.
[25] J. Lopez-Alvis, F. Nguyen, M. Looms, T. Hermans, Geophysical inversion using a variational autoencoder to model an assembled spatial prior uncertainty, Journal of Geophysical Research: Solid Earth 127 (3) (2022) e2021JB022581.
[26] C. Sun, A. Malcolm, R. Kumar, W. Mao, Enabling uncertainty quantification in a standard full-waveform inversion method using normalizing flows, Geophysics 89 (5) (2024) R493–R507.
[27] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems 33 (2020) 6840–6851.
[28] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint arXiv:2011.13456 (2020).
[29] Y. Song, L. Shen, L. Xing, S. Ermon, Solving inverse problems in medical imaging with score-based generative models, arXiv preprint arXiv:2111.08005 (2021).
[30] H. Chung, J. Kim, M. T. McCann, M. L. Klasky, J. C. Ye, Diffusion posterior sampling for general noisy inverse problems, arXiv preprint arXiv:2209.14687 (2022).
[31] Y. Li, H. Zhang, Z. Yan, T. Alkhalifah, DiffusionInv: Prior-enhanced Bayesian full waveform inversion using diffusion models, arXiv preprint arXiv:2505.03138 (2025).
[32] F. Wang, X. Huang, T. Alkhalifah, Controllable seismic velocity synthesis using generative diffusion models, Journal of Geophysical Research: Machine Learning and Computation 1 (3) (2024) e2024JH000153.
[33] W. S. Phillips, M. C. Fehler, Traveltime tomography: A comparison of popular methods, Geophysics 56 (10) (1991) 1639–1649.
[34] J. Chen, G. Chen, M. Nagaso, P. Tong, Adjoint-state traveltime tomography for azimuthally anisotropic media in spherical coordinates, Geophysical Journal International 234 (1) (2023) 712–736.
[35] S. Hao, J. Chen, M. Xu, P. Tong, Topography-incorporated adjoint-state surface wave traveltime tomography: Method and a case study in Hawaii, Journal of Geophysical Research: Solid Earth 129 (1) (2024) e2023JB027454.
[36] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010) 451–559.
[37] M. Dashti, A. M. Stuart, The Bayesian approach to inverse problems, arXiv preprint arXiv:1302.6989 (2013).
[38] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, M.-H. Yang, Diffusion models: A comprehensive survey of methods and applications, ACM Computing Surveys 56 (4) (2023) 1–39.
[39] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations, 2021.
[40] B. Øksendal, Stochastic differential equations, in: Stochastic Differential Equations: An Introduction with Applications, Springer, 2003, pp. 38–50.
[41] B. D. Anderson, Reverse-time diffusion equation models, Stochastic Processes and their Applications 12 (3) (1982) 313–326.
[42] P. Vincent, A connection between score matching and denoising autoencoders, Neural Computation 23 (7) (2011) 1661–1674.
[43] C. Deng, S. Feng, H. Wang, X. Zhang, P. Jin, Y. Feng, Q. Zeng, Y. Chen, Y. Lin, OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion, Advances in Neural Information Processing Systems 35 (2022) 6007–6020.
[44] L. I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena 60 (1-4) (1992) 259–268.
[45] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer Science & Business Media, 2013.
[46] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[47] B. Engquist, Y. Yang, Optimal transport based seismic inversion: Beyond cycle skipping, Communications on Pure and Applied Mathematics 75 (10) (2022) 2201–2244.
[48] A. Richardson, Deepwave (Oct. 2025). doi:10.5281/zenodo.3829886. URL https://doi.org/10.5281/zenodo.3829886
[49] G. S. Martin, R. Wiley, K. J. Marfurt, Marmousi2: An elastic upgrade for Marmousi, The Leading Edge 25 (2) (2006) 156–166.
