Robust Physics-Guided Diffusion for Full-Waveform Inversion

Jishen Peng (a,1), Enze Jiang (a,1), Zheng Ma (a,b,c,d), Xiong-Bin Yan (e,*)

a School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
b Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai, China
c Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China
d CMA-Shanghai, Shanghai Jiao Tong University, Shanghai, China
e School of Mathematics and Statistics, Lanzhou University, Lanzhou, China

Abstract. We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based (Wasserstein-2) data-consistency potential, incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.

Keywords. Full-waveform inversion; Optimal transport; Physics-guided diffusion; Preconditioned guidance

1. Introduction

In geophysics, seismic inversion is widely used to infer quantitative subsurface models that match observed seismic recordings.
From a mathematical viewpoint, recovering medium parameters from time-dependent wavefield measurements constitutes a nonlinear and ill-posed inverse problem, primarily due to limited acquisition geometry and band-limited observations. Several classes of methods have been developed, including velocity analysis based on stacked traces [1], migration and traveltime-based methods [2, 3], Born-approximation-based linearized inversion [4, 5], and full-waveform inversion (FWI) [6, 7, 8]. Among these, FWI is distinguished by its potential to recover high-resolution subsurface structure by exploiting both phase and amplitude information through the solution of a PDE-constrained optimization problem that minimizes a chosen data-misfit functional between observed and synthetic seismograms.

* Corresponding author. Email address: yanxb2015@163.com (Xiong-Bin Yan)
1 Jishen Peng and Enze Jiang contributed equally to this work.

Full-waveform inversion (FWI) is a PDE-constrained inverse problem that aims to reconstruct subsurface parameters (e.g., wavespeed or velocity) from time-dependent seismic recordings [9, 10, 6]. Given an acquisition geometry and a forward wave-propagation operator F, the classical formulation seeks a velocity model v such that F(v) matches the observed data d_obs, where v ∈ R^m denotes the discretized velocity field and F: R^m → R^n maps model parameters to recorded traces. Despite its ability to recover high-resolution structures by exploiting the full wavefield content, FWI is widely recognized as a challenging nonconvex optimization problem: the forward map is nonlinear, the inverse problem is ill-conditioned, and the misfit landscape typically contains many local minima due to phase ambiguity and limited illumination [9, 11, 12, 13].
These difficulties are especially pronounced when the initial model is inaccurate and when the observations are contaminated by noise (and other data errors). In addition, strong multi-scale components with markedly different amplitudes can further bias residual-based objectives and degrade stability [14, 9].

Since pointwise ℓ2 objectives penalize the pointwise residual in time, they are highly sensitive to small time/phase shifts between observed and synthetic traces. This sensitivity leads to a strongly nonconvex objective landscape and the well-known cycle-skipping phenomenon [15, 16], in which iterative methods can become trapped in local minima associated with incorrect phase (kinematic) alignment. Moreover, seismic traces often exhibit pronounced dynamic-range imbalance: early-arriving phases can have substantially larger amplitudes than later reflections, causing both the misfit and its gradient to be dominated by a small portion of the data [17]. As a consequence, the resulting gradients can be dominated by high-amplitude components, leading updates to underfit weaker later phases. A large body of work aims to mitigate these issues through multi-scale continuation [14], alternative misfit functions based on phase/envelope attributes [18, 19], adaptive waveform inversion [20], and geometry-aware metrics such as optimal-transport/Wasserstein distances [12, 21, 22]; see also [23] for a recent review of misfit functions and adjoint sources.

In parallel with advances in data-misfit design, recent years have witnessed rapid progress in learning expressive priors for inverse problems using generative modeling. In seismic imaging, a range of generative models (including GAN-based geological priors, latent-variable models such as VAEs, and normalizing flows) have been explored to mitigate ill-posedness and to facilitate uncertainty quantification [24, 25, 26].
Score-based diffusion models (also known as score-based generative models) provide a particularly attractive mechanism: they learn the score field ∇_x log p_t(x) of progressively noised marginals and enable conditional generation by reversing the associated diffusion dynamics [27, 28]. From the perspective of inverse problems, diffusion models can be interpreted as providing a learned prior over plausible velocity fields, trained solely from samples of the velocity model (i.e., parameter-field samples), without requiring paired data of the form (wavefield, velocity) and without embedding a forward solver in the prior-training loop; conditional inference is then performed at test time by incorporating data-consistency (likelihood) information during reverse-time sampling or related posterior-sampling schemes [29, 30]. This decoupling is especially appealing in seismic imaging, where repeated wavefield simulation is expensive and acquisition geometries may vary across applications; accordingly, diffusion-based Bayesian FWI and diffusion-driven velocity modeling have started to be investigated in recent work [31, 32].

To solve the inverse problem given observations, guided reverse-diffusion methods combine the learned prior score with a data-consistency term induced by a potential Φ (i.e., a negative log-likelihood, or more generally a data-misfit functional that measures agreement between simulated data and the observed data). Diffusion posterior sampling (DPS) is a representative method: at reverse step i it forms a clean-state estimate v̂_0^(i), which is a denoised estimate of the underlying clean variable associated with the current noisy state. This estimate is computed using the learned prior score s_{θ*}(·, i). A likelihood-guidance step is then applied using the gradient ∇Φ(v̂_0^(i)).
However, a direct use of the baseline DPS update with a fixed global guidance step size is poorly matched to the characteristics of full-waveform inversion (FWI) in two fundamental respects. First, the effectiveness of DPS critically depends on the choice of the potential Φ. In FWI, conventional pointwise ℓ2 (residual-based) potentials inherit two structural deficiencies. On the one hand, the quadratic residual makes the induced descent direction disproportionately influenced by large-amplitude arrivals, resulting in markedly unbalanced updates across traces and time windows. On the other hand, the pointwise ℓ2 geometry is highly sensitive to small time/phase shifts, which typically produces a highly nonconvex misfit landscape with spurious local minima (cycle-skipping) associated with incorrect phase alignment. When such a potential is used for likelihood guidance in DPS, these effects can yield correction terms whose scale and direction are poorly matched to the current reverse-time state, thereby degrading the stability and effectiveness of the reverse-time dynamics.

Second, even with a robust potential, a single scalar guidance strength in baseline DPS cannot capture the spatially varying sensitivity of the misfit to the velocity field v. Early reverse-time steps often yield a rough v̂_0^(i), so overly aggressive guidance can be unreliable, whereas later steps typically benefit from stronger data-consistency enforcement. Moreover, illumination and acquisition geometry induce strong spatial variability in sensitivity, so a single global step size may overshoot in well-illuminated regions while remaining ineffective in poorly illuminated ones. These mismatches can reduce stability and efficiency along the reverse-time iterations and ultimately degrade reconstruction quality.

Contributions.
This work develops a robust physics-guided diffusion framework for full-waveform inversion (FWI), combining a score-based generative prior with an OT-based data-consistency potential and a stabilized, variable-metric guidance mechanism in guided reverse diffusion. Our main contributions are:

• An OT-based data-consistency potential for physics-guided diffusion. We design a robust likelihood potential for guided reverse diffusion by combining bounded amplitude-adaptive weighting of seismic traces with a one-dimensional Wasserstein discrepancy computed via quantile functions. When used as the data-consistency term in DPS-type guidance, the resulting guidance signal is less dominated by high-amplitude arrivals and less sensitive to small time/phase misalignment, thereby alleviating cycle-skipping and yielding a more balanced and stable data-consistency mechanism for FWI.

• Adaptive, variable-metric guidance for guided reverse diffusion. We introduce a variable-metric (diagonal) preconditioner for the guidance step, replacing the scalar DPS step size by a matrix P_i = ρ_i D_i. Here ρ_i adapts the guidance strength along the reverse trajectory to suppress unreliable early updates, while the diagonal scaling D_i balances the spatially heterogeneous sensitivity of the misfit with respect to the velocity field in FWI. This yields a more stable and effective guidance mechanism than standard DPS in seismic settings.

• Numerical validation. We evaluate the proposed method on OpenFWI benchmarks and compare against deterministic optimization-based inversion baselines and the original DPS under matched computational budgets, demonstrating improved reconstruction quality and stability. We further assess generalization on other standard benchmarks beyond OpenFWI, where the method maintains strong performance.

The remainder of the paper is organized as follows.
In Section 2 we state the FWI problem and cast it in Bayesian form. Section 3 recalls score-based diffusion models and the learned prior score needed for reverse-time dynamics. Section 4 derives a conditional reverse-diffusion formulation for Bayesian inversion and presents a baseline discretization in the spirit of diffusion posterior sampling (DPS). Section 5 introduces our robust physics-guided diffusion scheme, including an OT-based data-consistency potential and an adaptive, variable-metric guidance mechanism. Section 6 provides numerical results, comparative studies on OpenFWI benchmarks, and generalization tests on additional standard benchmark datasets. Section 7 concludes with a summary of the main results.

2. Problem formulation

2.1. Full-waveform inversion (FWI)

Full-waveform inversion (FWI) aims to recover a spatially varying wave speed v(x) from seismic observations recorded at receivers {x_r}_{r=1}^{N_r}. Given observed seismograms d_r(t) = d(x_r, t), the estimate of v is typically obtained by minimizing a waveform-misfit functional between observed and simulated data, subject to the governing wave equation. By exploiting the full waveform information (including phase, amplitude, and multiple arrivals), FWI can in principle achieve higher resolution than kinematic approaches such as traveltime tomography [33, 34, 35]. At the same time, the resulting PDE-constrained optimization problem is strongly nonlinear and generally nonconvex; robust performance therefore requires careful treatment of noise and modeling errors, together with appropriate regularization or prior information.

In what follows, the simulated data are generated by a forward operator F(v) defined through the solution of an acoustic wave equation. Specifically, we consider seismic wave propagation under the constant-density acoustic approximation in an isotropic medium.
The wavefield u(x, t) denotes the acoustic pressure, and the unknown parameter is the spatially varying wave speed v(x) > 0. Neglecting variable-density and anisotropic effects leads to the standard Laplacian as the spatial operator. Let Ω ⊂ R^d (d = 2, 3) be the domain and T > 0 the final time. For a given source term s(x, t), the forward problem reads

\[
\begin{cases}
\dfrac{1}{v^2(\mathbf{x})}\,\partial_{tt} u(\mathbf{x},t) - \Delta u(\mathbf{x},t) = s(\mathbf{x},t), & (\mathbf{x},t) \in \Omega \times (0,T],\\[4pt]
u(\mathbf{x},0) = 0, \quad \partial_t u(\mathbf{x},0) = 0, & \mathbf{x} \in \Omega,\\[4pt]
\partial_n u(\mathbf{x},t) = 0, & (\mathbf{x},t) \in \Gamma_{\mathrm{ref}} \times (0,T],
\end{cases}
\tag{1}
\]

where Γ_ref ⊂ ∂Ω denotes the (top) reflecting boundary and ∂_n is the outward normal derivative. On the remaining boundary Γ_abs := ∂Ω \ Γ_ref, we impose absorbing boundary conditions, implemented in practice via perfectly matched layers (PML), in order to minimize artificial reflections from the truncation of the computational domain.

We take the domain to be the rectangle Ω = [0, L_x] × [0, L_z] ⊂ R^2 with x = (x, z), where the reflecting boundary corresponds to the top side Γ_ref = {(x, z) ∈ ∂Ω : z = L_z}. We discretize Ω by a tensor-product grid with nodes {(x_i, z_j)}_{i=1,...,m_x; j=1,...,m_z}. The wave-speed field is represented by its nodal values v_ij := v(x_i, z_j), collected in a matrix V = {v_ij} ∈ R^{m_x × m_z}, or equivalently by the vectorization v = vec(V) ∈ R^m, where m = m_x m_z. Receivers are deployed exclusively on the top boundary Γ_ref ⊂ ∂Ω. We denote the discrete receiver set by Γ_r := {x_r}_{r=1}^{N_r} ⊂ Γ_ref.

For the j-th experiment with source term s(x, t; ξ_j), we solve (1) (with this source) to obtain the wavefield u_j(x, t; v, ξ_j). Let R denote the receiver sampling (restriction) operator, defined by (R u_j)_r(t) := u_j(x_r, t; v, ξ_j), r = 1, ..., N_r, so that the corresponding noise-free synthetic data for experiment j are d_r^(j)(t) = (R u_j)_r(t). Stacking the traces across all experiments and receivers defines the parameter-to-observable operator

\[
\mathcal{F} : v \longmapsto \big\{ (R u_j)_r(t) \big\}_{j=1,\dots,N_s;\; r=1,\dots,N_r},
\]

which maps the velocity model to the predicted receiver recordings. After discretizing time at {t_k}_{k=1}^{N_T}, the data can be viewed as an array

\[
d = \{ d_{j,r}(t_k) \} \in \mathbb{R}^{N_s \times N_r \times N_T}, \qquad \big(\mathcal{F}(v)\big)_{j,r,k} = (R u_j)_r(t_k).
\]

The observation model is then

\[
d_{\mathrm{obs}} = \mathcal{F}(v) + \eta, \tag{2}
\]

where η denotes the measurement noise. We adopt additive Gaussian noise η ~ N(0, Σ), where Σ is the noise covariance in data space. In the simplest case of independent and identically distributed noise, Σ = σ²I, so that η ~ N(0, σ²I) with σ > 0 controlling the noise level.

Classically, FWI is formulated as the minimization of a data-misfit functional (possibly augmented by regularization). In this work, we instead adopt a Bayesian formulation and infer v through the posterior, where data consistency is encoded by the likelihood potential Φ (i.e., the negative log-likelihood), defined below.

2.2. Bayesian formulation of FWI

Given the observation model (2), we cast FWI as a Bayesian inverse problem and seek the subsurface velocity v through the posterior distribution p(v | d_obs). In this work, we use the posterior formulation primarily as a computational target for a diffusion-based inversion algorithm, rather than pursuing full posterior uncertainty quantification.

Likelihood as a data-consistency potential.
Rather than fixing the likelihood to a pointwise ℓ2 discrepancy a priori, we express it through a general data-consistency potential Φ(·; d_obs) (i.e., a negative log-likelihood up to an additive constant), following the standard Bayesian inverse problem formulation; see, e.g., [36, 37]:

\[
p(d_{\mathrm{obs}} \mid v) \propto \exp\big( -\Phi(v; d_{\mathrm{obs}}) \big). \tag{3}
\]

This formulation decouples the data-error model (2) from the choice of data-misfit functional used to quantify agreement between simulated and observed data. As a classical special case, the additive Gaussian noise model in (2) yields

\[
p(d_{\mathrm{obs}} \mid v) \propto \exp\Big( -\tfrac{1}{2} \big\| \mathcal{F}(v) - d_{\mathrm{obs}} \big\|_{\Sigma^{-1}}^2 \Big), \qquad \| r \|_{\Sigma^{-1}}^2 := r^\top \Sigma^{-1} r, \tag{4}
\]

corresponding to Φ(v; d_obs) = ½‖F(v) − d_obs‖²_{Σ⁻¹}. Throughout the paper, the observation d_obs is fixed. We often suppress this dependence and write Φ(v) ≡ Φ(v; d_obs). In what follows, we keep Φ general until Section 5, where we instantiate it by a robust OT-based waveform misfit tailored to amplitude dominance, cycle-skipping, and objective scaling.

Prior. To regularize the inverse problem and encode admissible structural assumptions on the subsurface, we endow the wave speed v with a prior distribution π_0(dv) (with density p(v) in the finite-dimensional discretization). Such prior information is crucial in FWI, which is typically ill-posed due to limited acquisition geometry and band-limited data, and hence may admit non-unique and unstable reconstructions under purely data-driven fitting. A common choice is a Gibbs-type prior of the form

\[
p(v) \propto \exp\big( -\mu\, R(v) \big),
\]

where R(v) is a regularization functional and μ > 0 controls its strength. In this case, the negative log-prior contributes the term μR(v) to the posterior objective, so that classical deterministic regularization is recovered at the level of maximum a posteriori (MAP) estimation.

Posterior.
Combining the likelihood (3) with the prior p(v) via Bayes' rule gives the posterior distribution

\[
p(v \mid d_{\mathrm{obs}}) \propto p(d_{\mathrm{obs}} \mid v)\, p(v) \propto \exp\big( -\Phi(v; d_{\mathrm{obs}}) \big)\, p(v). \tag{5}
\]

Once Φ is specified, the posterior combines the data-fit term encoded by the likelihood with the prior information encoded by p(v).

Connection to deterministic FWI (MAP). Classical deterministic formulations of FWI are often posed as minimizing a data-misfit functional, possibly augmented by regularization. Within the Bayesian formulation, such approaches can be interpreted as computing a MAP estimate. Taking the negative logarithm of (5) yields, up to an additive constant,

\[
-\log p(v \mid d_{\mathrm{obs}}) = \Phi(v; d_{\mathrm{obs}}) - \log p(v) + C,
\]

where C is independent of v. Consequently,

\[
v_{\mathrm{MAP}} = \arg\min_{v \in \mathcal{V}} \; \Phi(v; d_{\mathrm{obs}}) - \log p(v), \tag{6}
\]

where V denotes the admissible set (e.g., enforcing v > 0). If Φ is chosen as a quadratic (Gaussian) data-misfit and p(v) ∝ exp(−μR(v)), then (6) reduces to a classical regularized PDE-constrained optimization problem for FWI. In the remainder of the paper, we compute posterior-driven reconstructions using a diffusion-based solver with a learned prior score and a data-consistency potential Φ.

3. Score-Based Generative Model

Diffusion models, also known as score-based generative models [28, 27, 38], define a generative process by gradually perturbing samples with Gaussian noise and then learning to reverse this corruption process. From the score-based viewpoint, these models are grounded in the estimation of the score function, i.e., the gradient of the log-density with respect to the variable of interest.

3.1. Setup and notation

We identify the discretized velocity field v ∈ R^m (m = m_x m_z) with the data vector x_0 and write x_0 := v.
This is purely a notational convention to align with diffusion-model formulations, where x_0 denotes an uncorrupted sample at diffusion time t = 0. We regard x_0 as a random vector following an unknown distribution p_data on R^m, representing the population of plausible subsurface models. In our Bayesian setting, p_data plays the role of the prior p(v), from which we assume access to samples.

3.2. Forward diffusion process

A diffusion model begins by defining a continuum of progressively noised variables {x(t)}_{t∈[0, T_diff]} through an Itô stochastic differential equation (SDE) [39]

\[
\mathrm{d}\mathbf{x}(t) = f(\mathbf{x}(t), t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t, \qquad \mathbf{x}(0) = \mathbf{x}_0 \sim p_{\mathrm{data}},
\]

where w_t is a standard m-dimensional Wiener process, f is the drift, and g is the diffusion coefficient. Under mild regularity assumptions [40], the law of x(t) admits a density p_t, and {p_t}_{t∈[0, T_diff]} evolves according to the Kolmogorov forward (Fokker–Planck) equation [39]

\[
\partial_t p_t(\mathbf{x}) = -\nabla \cdot \big( f(\mathbf{x}, t)\, p_t(\mathbf{x}) \big) + \tfrac{1}{2}\, g^2(t)\, \Delta p_t(\mathbf{x}), \qquad p_{t=0} = p_{\mathrm{data}}.
\]

In particular, the forward SDE induces a family of distributions {p_t} obtained by Gaussian perturbations of p_data.

3.3. Variance-preserving diffusion and transition kernel

We adopt the variance-preserving (VP) SDE [28], defined by the linear Itô SDE

\[
\mathrm{d}\mathbf{x}(t) = -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t)\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{7}
\]

where β(t) > 0 is a prescribed noise schedule. In our setting, x(0) = x_0 represents a clean (vectorized) velocity model, while x(t) denotes its progressively perturbed version. Since (7) is linear with time-dependent coefficients, it admits an explicit mild solution. Define the integrating factor

\[
\alpha(t) := \exp\Big( -\tfrac{1}{2} \int_0^t \beta(\tau)\,\mathrm{d}\tau \Big).
\]
Applying Itô's formula to α(t)^{-1} x(t) yields

\[
\mathbf{x}(t) = \alpha(t)\,\mathbf{x}_0 + \alpha(t) \int_0^t \alpha(s)^{-1} \sqrt{\beta(s)}\,\mathrm{d}\mathbf{w}_s
= \alpha(t)\,\mathbf{x}_0 + \int_0^t \exp\Big( -\tfrac{1}{2} \int_s^t \beta(\tau)\,\mathrm{d}\tau \Big) \sqrt{\beta(s)}\,\mathrm{d}\mathbf{w}_s. \tag{8}
\]

The stochastic integral in (8) is Gaussian with mean zero. Setting

\[
\hat{\sigma}^2(t) := 1 - \exp\Big( -\int_0^t \beta(\tau)\,\mathrm{d}\tau \Big) = 1 - \alpha^2(t),
\]

we obtain the Gaussian transition kernel

\[
\mathbf{x}(t) \mid \mathbf{x}_0 \sim \mathcal{N}\big( \alpha(t)\,\mathbf{x}_0,\; \hat{\sigma}^2(t)\, I \big), \tag{9}
\]

or, equivalently, the reparameterization x(t) = α(t) x_0 + σ̂(t) z, z ~ N(0, I). If the schedule is chosen so that ∫_0^{T_diff} β(τ) dτ is sufficiently large, then α(T_diff) ≈ 0 and σ̂²(T_diff) ≈ 1, implying that x(T_diff) is approximately standard Gaussian and essentially independent of x_0. In practice we sample x(T_diff) ~ N(0, I) to initialize the reverse-time procedure.

3.4. Reverse-time dynamics and score learning

Let q_t(·|x_0) denote the transition kernel of the Markov diffusion (7), and define, for each t ∈ [0, T_diff], the time-t density

\[
p_t(\mathbf{x}) := \int_{\mathbb{R}^m} q_t(\mathbf{x} \mid \mathbf{x}_0)\, p_{\mathrm{data}}(\mathbf{x}_0)\,\mathrm{d}\mathbf{x}_0,
\]

whenever the density exists. The main object of interest is the score field

\[
s_\star(\mathbf{x}, t) := \nabla_{\mathbf{x}} \log p_t(\mathbf{x}),
\]

i.e., the logarithmic derivative of the marginal density induced by the forward diffusion at time t (equivalently, at the corresponding noise level). Under mild regularity assumptions, a time-reversal formula for diffusion processes [41] implies that the reverse-time dynamics associated with (7) can be expressed in terms of the score. In the VP case, where f(x, t) = −½β(t)x and g²(t) = β(t), the reverse-time SDE can be written as

\[
\mathrm{d}\mathbf{x}(t) = \Big[ -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t) - \beta(t)\, s_\star(\mathbf{x}(t), t) \Big]\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}], \tag{10}
\]

interpreted in reverse time, i.e., the process is run backward from t = T_diff to t = 0, where w̄_t is a Wiener process in the reverse-time parameterization.
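As a small numerical illustration, the closed-form VP quantities α(t) and σ̂(t) and the reparameterized draw from the transition kernel (9) can be sketched as follows. The linear schedule β(t) = b0 + (b1 − b0) t/T and the endpoint values b0, b1, T are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

# Sketch of the VP transition kernel (9) under an assumed linear noise schedule
# beta(t) = B0 + (B1 - B0) * t / T. B0, B1, T are illustrative values.
B0, B1, T = 0.1, 20.0, 1.0

def int_beta(t):
    """Integral of beta over [0, t] for the linear schedule."""
    return B0 * t + 0.5 * (B1 - B0) * t**2 / T

def alpha(t):
    """Integrating factor alpha(t) = exp(-0.5 * int_0^t beta(tau) dtau)."""
    return np.exp(-0.5 * int_beta(t))

def perturb(x0, t, rng):
    """Reparameterized draw x(t) = alpha(t) x0 + sigma_hat(t) z, z ~ N(0, I)."""
    a = alpha(t)
    sigma_hat = np.sqrt(1.0 - a**2)
    return a * x0 + sigma_hat * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x_T = perturb(np.zeros(8), T, rng)   # at t = T, close to a standard-normal draw
```

With these endpoint values, ∫_0^T β = 10.05, so α(T) ≈ exp(−5.03) is already negligible, matching the remark that x(T_diff) is approximately standard Gaussian.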
We approximate the true score s⋆(x, t) = ∇_x log p_t(x) by a neural network s_θ(x, t) (the score network), parameterized by θ. A natural training objective is the mean-squared-error regression

\[
\mathcal{L}_1(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{\mathbf{x} \sim p_t} \Big[ \kappa(t)\, \big\| s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big\|_2^2 \Big], \tag{11}
\]

where κ(t) > 0 reweights the contributions of different noise levels. However, (11) is not directly computable in practice, since the marginal density p_t is induced by the unknown data distribution p_data. Since the marginal score ∇_x log p_t(x) is intractable, we adopt denoising score matching (DSM) [42]. DSM exploits the fact that the forward transition kernel q_t(·|x_0) is known, and trains the score network to match the corresponding conditional score. Concretely, it minimizes

\[
\mathcal{L}_2(\theta) := \mathbb{E}_{t \sim \mathcal{U}(0, T_{\mathrm{diff}})}\, \mathbb{E}_{\mathbf{x}_0 \sim p_{\mathrm{data}}}\, \mathbb{E}_{\mathbf{x} \sim q_t(\cdot \mid \mathbf{x}_0)} \Big[ \kappa(t)\, \big\| s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) \big\|_2^2 \Big]. \tag{12}
\]

The mathematical rationale is the identity

\[
\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) = \mathbb{E}_{\mathbf{x}_0 \sim p(\cdot \mid \mathbf{x}(t) = \mathbf{x})} \big[ \nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) \big],
\]

so the marginal score is the conditional expectation of the conditional score. By the L² projection property of conditional expectation, L_2(θ) = L_1(θ) + C with a constant C ≥ 0 independent of θ. Hence L_1 and L_2 have the same minimizers and L_2(θ) ≥ L_1(θ). For the VP diffusion, q_t(·|x_0) is Gaussian (see (9)), hence

\[
\nabla_{\mathbf{x}} \log q_t(\mathbf{x} \mid \mathbf{x}_0) = -\frac{\mathbf{x} - \alpha(t)\,\mathbf{x}_0}{\hat{\sigma}^2(t)}.
\]

After training, let θ* denote the learned parameters (e.g., an empirical minimizer of the training loss (12)). Replacing the unknown score in the reverse-time dynamics by s_{θ*} yields the approximate reverse SDE

\[
\mathrm{d}\mathbf{x}(t) = \Big[ f(\mathbf{x}(t), t) - g^2(t)\, s_{\theta^*}(\mathbf{x}(t), t) \Big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}],
\]

interpreted in reverse time (run backward from t = T_diff to t = 0), with f(x, t) = −½β(t)x and g(t) = √β(t) for the VP diffusion.
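The DSM objective (12) is easy to state concretely because the Gaussian kernel (9) gives the regression target in closed form. The following is a minimal Monte Carlo sketch at a single noise level; the stand-in `score_fn` and the fixed values of α and σ̂ are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

def dsm_loss(score_fn, x0_batch, a, sigma_hat, rng):
    """Monte Carlo estimate of the DSM objective (12) at one noise level.

    The conditional score of the Gaussian kernel (9) is the closed-form target
    grad_x log q_t(x | x0) = -(x - a * x0) / sigma_hat^2  (which equals -z / sigma_hat).
    `score_fn` is a hypothetical stand-in for the score network s_theta.
    """
    z = rng.standard_normal(x0_batch.shape)
    x = a * x0_batch + sigma_hat * z              # sample from q_t(. | x0)
    target = -(x - a * x0_batch) / sigma_hat**2   # closed-form regression target
    diff = score_fn(x) - target
    return np.mean(np.sum(diff**2, axis=-1))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((128, 4))                # batch of "clean" samples, m = 4
# a^2 + sigma_hat^2 = 1 as required by the VP kernel:
loss_zero = dsm_loss(lambda x: np.zeros_like(x), x0, a=0.8, sigma_hat=0.6, rng=rng)
```

For the zero score the loss concentrates around E‖z/σ̂‖² = m/σ̂², which makes the scale of the target (and hence the role of the reweighting κ(t)) explicit: targets blow up as σ̂ → 0.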
4. Conditional Diffusion for Bayesian Inversion

4.1. Diffusion-based posterior solver

We now describe a diffusion-based solver for computing posterior-driven reconstructions of v from the Bayesian model (5). Following the diffusion posterior sampling (DPS) framework, we incorporate data consistency through a likelihood potential and use a learned score model to represent prior information. For notational consistency with the DPS literature, we denote the observation by y^δ := d_obs and recall the reverse-time SDE (10) for the VP diffusion. Throughout this section, the observation y^δ is fixed; hence we suppress this dependence in the notation and write Φ(·) ≡ Φ(·; y^δ). To draw samples from the Bayesian posterior distribution p(v | y^δ) in (5), we replace the marginal score s⋆(x, t) = ∇_x log p_t(x) by the conditional score ∇_x log p_t(x | y^δ), leading to

\[
\mathrm{d}\mathbf{x}(t) = \Big[ -\tfrac{1}{2}\beta(t)\,\mathbf{x}(t) - \beta(t)\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}(t) \mid y^\delta) \Big]\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}_t, \qquad t \in [0, T_{\mathrm{diff}}],
\]

interpreted in reverse time (run backward from t = T_diff to t = 0). Using Bayes' rule, the conditional score decomposes as

\[
\nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t) \mid y^\delta) = \nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t)) + \nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)), \tag{13}
\]

so that, given an approximation of the marginal score ∇ log p_t(x(t)) ≈ s_{θ*}(x(t), t), it remains to approximate the term ∇_{x(t)} log p_t(y^δ | x(t)).

From likelihood to a general potential. Under a Gaussian likelihood (4), a common approximation is

\[
\nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)) \simeq -\tfrac{1}{2}\, \nabla_{\mathbf{x}(t)} \big\| y^\delta - \mathcal{F}(\hat{\mathbf{x}}_0(t)) \big\|_{\Sigma^{-1}}^2,
\]

where x̂_0(t) denotes the denoised (posterior-mean) estimate of the clean variable associated with x(t). Following Eq.
(10) in [30], we set

\[
\hat{\mathbf{x}}_0(t) := \frac{1}{\sqrt{\bar{\alpha}(t)}} \Big( \mathbf{x}(t) + \big( 1 - \bar{\alpha}(t) \big)\, s_{\theta^*}(\mathbf{x}(t), t) \Big),
\]

where s_{θ*}(x, t) is the trained score network and, for the VP diffusion, ᾱ(t) := exp(−∫_0^t β(τ) dτ). In our formulation (3), data consistency is encoded through a potential Φ. Accordingly, we approximate the conditioning term by

\[
\nabla_{\mathbf{x}(t)} \log p_t(y^\delta \mid \mathbf{x}(t)) \simeq -\nabla_{\mathbf{x}(t)} \Phi(\hat{\mathbf{x}}_0(t)), \tag{14}
\]

so that the observational constraint enters the reverse-time dynamics via the chosen potential Φ. Combining (13) with (14) and replacing the prior score by the trained score network yields

\[
\nabla_{\mathbf{x}(t)} \log p_t(\mathbf{x}(t) \mid y^\delta) \simeq s_{\theta^*}(\mathbf{x}(t), t) - \zeta\, \nabla_{\mathbf{x}(t)} \Phi(\hat{\mathbf{x}}_0(t)), \tag{15}
\]

where ζ > 0 is a user-chosen guidance-strength parameter that rescales the data-consistency term to account for objective scaling.

4.2. Baseline DPS discretization for FWI

Let {β_i}_{i=1}^N be the discrete noise schedule, α_i := 1 − β_i, and ᾱ_i := ∏_{j=1}^i α_j. Denote by v_i the reverse-time diffusion state at step i (DDPM indexing [27]), so that v_i ≈ x(t_i). Here i = N corresponds to the highest noise level (an approximately Gaussian state), and i = 1 corresponds to the lowest noise level; thus {v_i}_{i=1}^N traces a discrete reverse-time trajectory that progressively denoises the state from v_N toward a clean reconstruction. Denote by s_{θ*}(·, i) the trained score network. The standard clean estimator is

\[
\hat{\mathbf{v}}_0^{(i)} = \hat{\mathbf{v}}_0(\mathbf{v}_i, i) := \frac{1}{\sqrt{\bar{\alpha}_i}} \Big( \mathbf{v}_i + (1 - \bar{\alpha}_i)\, s_{\theta^*}(\mathbf{v}_i, i) \Big). \tag{16}
\]

Algorithm 1: Baseline Diffusion Posterior Sampling (DPS)
Require: N, schedule {β_i}_{i=1}^N, score network s_{θ*}, variances {σ̂_i}, fixed observation d_obs, hyperparameters {ζ_i} (and optional clipping bounds), objective function Φ_{ℓ2}.
Result: the inversion velocity model v = v_0.
Initialize v_N ~ N(0, I); set α_i := 1 − β_i and ᾱ_i := ∏_{j=1}^i α_j.
for i = N to 1 do
    v̂_0^{(i)} ← (1/√ᾱ_i) ( v_i + (1 − ᾱ_i) s_{θ*}(v_i, i) )
    z ~ N(0, I)
    v'_{i−1} ← [ √α_i (1 − ᾱ_{i−1}) v_i + √ᾱ_{i−1} β_i v̂_0^{(i)} ] / (1 − ᾱ_i) + σ̂_i z
    v_{i−1} ← v'_{i−1} − ζ ∇_{v_i} Φ_{ℓ2}(v̂_0^{(i)})
end for

Baseline likelihood potential (least squares). In the original DPS formulation [30], data consistency is enforced through a Gaussian likelihood, leading to the least-squares potential

\[
\Phi_{\ell_2}(v) := \tfrac{1}{2} \big\| \mathcal{F}(v) - y^\delta \big\|_{\Sigma^{-1}}^2.
\]

Given the prior-driven proposal v'_{i−1}, baseline DPS applies a scalar guidance correction evaluated at the current state v_i:

\[
\mathbf{v}_{i-1} = \mathbf{v}'_{i-1} - \zeta\, \nabla_{\mathbf{v}_i} \Phi_{\ell_2}(\hat{\mathbf{v}}_0^{(i)})
= \mathbf{v}'_{i-1} - \zeta\, \big( \mathcal{F}'(\hat{\mathbf{v}}_0^{(i)}) \big)^* \Sigma^{-1} \big( \mathcal{F}(\hat{\mathbf{v}}_0^{(i)}) - y^\delta \big), \qquad \zeta > 0. \tag{17}
\]

Based on the above formulation, the DPS-based procedure for solving the full-waveform inversion problem is summarized in Algorithm 1.

4.3. Limitations of ℓ2-guided DPS in FWI

Although Algorithm 1 is consistent with a Gaussian noise model, its direct application to FWI presents several well-known and practically important difficulties. Specifically, the pointwise ℓ2 potential may induce cycle-skipping and objective imbalance, while the use of a single global scalar guidance parameter is inadequate for the strongly space-dependent sensitivities characteristic of FWI.

(i) Cycle-skipping and nonconvexity. The ℓ2 misfit compares waveforms pointwise in time; small phase shifts can yield large residuals and highly oscillatory gradients. As a consequence, Φ_{ℓ2} typically exhibits a strongly nonconvex landscape with many spurious local minima, which can lead to cycle-skipping in solving the present inverse problem. Within DPS, the guidance term is applied to the denoised estimate v̂_0^{(i)}, and early reverse-time iterates often produce rough v̂_0^{(i)}; in this regime, the ℓ2 gradient can be particularly unreliable, causing unstable or ineffective guidance.
(ii) Amplitude dominance and objective scaling. Seismic data often exhibit a pronounced dynamic-range imbalance across arrivals. Since $\Phi_{\ell_2}$ is quadratic in the residual, the gradient $\nabla \Phi_{\ell_2}$ is dominated by the largest residual components, which are typically associated with high-amplitude early arrivals. As a result, the baseline correction step tends to prioritize fitting the strongest events, while weaker phases that are informative for deeper or poorly illuminated regions are comparatively down-weighted. Moreover, $\Phi_{\ell_2}(v) = \tfrac{1}{2}\|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2$ is used without normalization by an observation-dependent scale; when the data (or residuals after preprocessing) have small magnitude, both $\Phi_{\ell_2}$ and $\nabla \Phi_{\ell_2}$ can become numerically small, making the guidance strength $\zeta$ in (17) delicate to tune and potentially leading to overly weak data-consistency enforcement.

(iii) Heterogeneous sensitivity and scalar step size. FWI sensitivities are strongly space-dependent (e.g., due to limited illumination), so a single global scalar guidance parameter $\zeta$ cannot balance updates across well-illuminated and poorly illuminated regions. This mismatch can slow progress in weak-sensitivity areas and may yield unstable behavior in sensitive regions.

These observations motivate two modifications: we (a) replace the pointwise ℓ2 potential by a robust OT-based data-consistency potential to mitigate cycle-skipping and amplitude imbalance, and (b) introduce a variable-metric preconditioned guidance scheme to stabilize and balance spatial updates. We present these components next.
5. Our method: OT-guided preconditioned DPS for FWI

This section presents two modifications motivated by the limitations of ℓ2-guided DPS in FWI discussed in the above section: (a) an OT-based data-consistency potential $\Phi = J$ that mitigates amplitude dominance, reduces phase sensitivity (cycle-skipping), and stabilizes objective scaling; and (b) a variable-metric preconditioned guidance scheme that adapts the guidance strength across noise levels and balances spatial updates under heterogeneous sensitivities.

5.1. OT-based data-consistency potential

As discussed in Section 4.3, pointwise least-squares waveform fitting can lead to amplitude-dominated updates, strong phase sensitivity (cycle-skipping), and ill-conditioned objective scaling. We therefore construct a robust OT-based misfit $J$ in three steps, each targeting one of these difficulties.

(I) Amplitude-adaptive data weighting. Let $d_{\mathrm{obs}}(s, r, t)$ denote the observed trace associated with the source–receiver pair $(s, r)$, and let $d_{\mathrm{syn}}(s, r, t; v)$ be its synthetic counterpart generated by the velocity model $v$. We introduce a globally normalized instantaneous amplitude
\[
a(s, r, t) = \frac{|d_{\mathrm{obs}}(s, r, t)|}{\max_{i,j,t} |d_{\mathrm{obs}}(s_i, r_j, t)|} \in [0, 1],
\]
and define the bounded weight
\[
\omega(s, r, t) = \frac{1}{1 + k\, a(s, r, t)}, \qquad k \ge 0. \qquad (18)
\]
By construction, $\omega(s, r, t) \in [1/(1+k),\, 1]$ and is monotone decreasing in $a(s, r, t)$, so large-amplitude arrivals are down-weighted while low-amplitude portions are largely preserved. We apply the same pointwise weighting to both observed and synthetic signals,
\[
\tilde{d}_{\mathrm{obs}}(s, r, t) = \omega(s, r, t)\, d_{\mathrm{obs}}(s, r, t), \qquad \tilde{d}_{\mathrm{syn}}(s, r, t; v) = \omega(s, r, t)\, d_{\mathrm{syn}}(s, r, t; v), \qquad (19)
\]
so that the comparison is performed under a fixed, observation-dependent weighting.
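A minimal NumPy sketch of this weighting (the array layout `(n_src, n_rec, n_t)` and the function name are our own assumptions):

```python
import numpy as np

def bounded_weight(d_obs, k=100.0):
    """Amplitude-adaptive bounded weight of Eq. (18), computed once from the
    observations; d_obs has shape (n_src, n_rec, n_t)."""
    a = np.abs(d_obs) / np.abs(d_obs).max()      # normalized amplitude a in [0, 1]
    return 1.0 / (1.0 + k * a)                   # weights in [1/(1+k), 1]

# The same fixed weight multiplies observed and synthetic data, Eq. (19):
#   d_obs_w = w * d_obs;   d_syn_w = w * d_syn
```

Because the weight depends only on the observations, it is computed once and reused at every inversion step.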
Equivalently, $\tilde{d}_{\mathrm{syn}} - \tilde{d}_{\mathrm{obs}} = \omega\, (d_{\mathrm{syn}} - d_{\mathrm{obs}})$, and thus the contribution of high-amplitude events to the misfit and its gradient is reduced. Importantly, $\omega$ is computed from the observations and kept fixed during inversion, so the weighting does not introduce additional model-dependent nonlinearity. This amplitude-adaptive weighting alleviates scale disparities induced by raw amplitudes and prevents the descent direction from being dominated by the high-amplitude portions of the data.

As an empirical illustration, we take the CurveFault-A dataset [43] as an example and compare the original traces and their weighted counterparts at Receiver 1 for the first three source–receiver pairs in Figure 1. In the original recordings (Figure 1a), the early-arriving portions exhibit substantially larger amplitudes than later-arriving events, thereby overwhelming the contribution of weaker phases in the least-squares misfit. After applying the bounded weighting (18) with $k = 100$ (Figure 1b), the dynamic range is effectively compressed: high-amplitude arrivals are suppressed while low-amplitude events are largely preserved. This amplitude balancing reduces the tendency of large-amplitude portions of the data to dominate the gradient and yields a more balanced weighted comparison for subsequent inversion.

Nevertheless, even with amplitude balancing, pointwise ℓ2 misfits can remain highly sensitive to small time shifts and phase mismatch. To alleviate the resulting nonlinearity and reduce cycle-skipping, we therefore adopt the Wasserstein-$p$ distance as the data-misfit measure (we use $p = 2$ in all computations).

(a) The original observed wavefield.
(b) The weighted wavefield with $k = 100$.

Figure 1: Comparison of the original and weighted wavefield traces at Receiver 1 for the first three source–receiver pairs. The original recordings are dominated by early large-amplitude arrivals, whereas the proposed bounded weighting compresses the dynamic range and renders later weak-amplitude events more comparable in scale.

(II) Transport-based misfit (Wasserstein-$p$). The least-squares misfit induces a pointwise ℓ2 geometry that is overly sensitive to temporal misalignment, which can lead to a highly nonconvex objective landscape and promote cycle-skipping. We therefore adopt the Wasserstein-$p$ distance, which equips the data space with an optimal-transport geometry and measures discrepancy via mass displacement (after suitable normalization), rather than by pointwise differences.

Since the Wasserstein distance is formally defined on nonnegative measures with equal total mass, we follow the standard OT-FWI construction [12] and transform each enhanced 1D trace into a probability density function (PDF) via a fixed shift–rescale normalization. Specifically, for a trace $\tilde{d}(t)$ (either $\tilde{d}_{\mathrm{obs}}$ or $\tilde{d}_{\mathrm{syn}}$), we introduce a constant $c' > 0$ to ensure nonnegativity. To guarantee that $\tilde{d}(t) + c' \ge 0$ on $[0, T]$ for all traces, we fix $(s, r)$ and take
\[
c' = 1.1\, \bigl|\min_t \tilde{d}_{\mathrm{obs}}(s, r, t)\bigr|,
\]
and keep $c'$ constant throughout the inversion, so that the mapping does not introduce additional model-dependent nonlinearity in the gradient. In practice, $c'$ is chosen sufficiently large so that $\tilde{d}_{\mathrm{syn}} + c' \ge 0$ throughout the inversion.
We then define the density mapping
\[
P(\tilde{d})(t) = \frac{\tilde{d}(t) + c'}{\int_0^T \bigl(\tilde{d}(\tau) + c'\bigr)\, d\tau}, \qquad (20)
\]
which ensures $P(\tilde{d})(t) \ge 0$ and $\int_0^T P(\tilde{d})(t)\, dt = 1$. Accordingly, for each source–receiver pair $(s, r)$, the associated probability densities are
\[
\tilde\rho_{\mathrm{obs}}(\cdot\,; s, r) = P\bigl(\tilde{d}_{\mathrm{obs}}(\cdot\,; s, r)\bigr), \qquad \tilde\rho_{\mathrm{syn}}(\cdot\,; s, r, v) = P\bigl(\tilde{d}_{\mathrm{syn}}(\cdot\,; s, r, v)\bigr).
\]
With the signals transformed into densities, the Wasserstein-$p$ distance is defined by
\[
W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \left( \inf_{\gamma \in \tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})} \int_{[0,T] \times [0,T]} |t - \tau|^p \, d\gamma(t, \tau) \right)^{1/p}, \qquad p \ge 1,
\]
where $\tilde\Gamma(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}})$ denotes the set of admissible transport plans with marginals $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$. In the one-dimensional setting, $W_p$ admits an explicit representation in terms of quantile functions. Let $\tilde{F}_{\mathrm{syn}}$ and $\tilde{F}_{\mathrm{obs}}$ be the cumulative distribution functions (CDFs) of $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$, respectively; then
\[
W_p(\tilde\rho_{\mathrm{syn}}, \tilde\rho_{\mathrm{obs}}) = \left( \int_0^1 \bigl|\tilde{F}_{\mathrm{syn}}^{-1}(\xi) - \tilde{F}_{\mathrm{obs}}^{-1}(\xi)\bigr|^p \, d\xi \right)^{1/p}, \qquad (21)
\]
where $\tilde{F}^{-1}$ denotes the quantile function, defined as the generalized inverse $\tilde{F}^{-1}(\xi) := \inf\{ t \in [0, T] : \tilde{F}(t) \ge \xi \}$. While we present the definition for general $p \ge 1$, all subsequent results and computations in this paper use $p = 2$. Numerically, for each trace we compute the PDF using (20), construct the CDF via cumulative summation, and evaluate the quantile difference in (21) by interpolation.

(III) Misfit scale normalization. To stabilize the numerical scale of the OT guidance term, we rescale the aggregated discrepancy by an observation-dependent characteristic magnitude. Specifically, let $\tilde{F}_{\mathrm{obs},s,r}^{-1}$ denote the quantile function associated with the observed PDF for the pair $(s, r)$, and define the scale factor
\[
S_{\mathrm{obs}} = \sum_s \sum_r \left( \int_0^1 \bigl|\tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi)\bigr|^p \, d\xi \right)^{1/p},
\]
which depends only on the observations and is independent of the velocity model $v$.
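The trace-wise computation of Eqs. (20)–(21) for $p = 2$ can be sketched as follows (a self-contained NumPy version; the function names, the uniform $\xi$-grid, and the interpolation details are our own choices):

```python
import numpy as np

def to_density(d, c, dt):
    """Shift-rescale map of Eq. (20): shift by c > 0 and normalize to unit
    mass on a uniform time grid with spacing dt."""
    p = d + c
    return p / (p.sum() * dt)

def w2_trace(d_syn, d_obs, t, c, n_xi=1000):
    """1D Wasserstein-2 distance between two traces via quantile functions, Eq. (21)."""
    dt = t[1] - t[0]
    rho_s = to_density(d_syn, c, dt)
    rho_o = to_density(d_obs, c, dt)
    # CDFs by cumulative summation, renormalized so the last entry is exactly 1
    F_s = np.cumsum(rho_s) * dt; F_s /= F_s[-1]
    F_o = np.cumsum(rho_o) * dt; F_o /= F_o[-1]
    # quantile functions by inverse interpolation on a uniform xi-grid
    xi = np.linspace(0.0, 1.0, n_xi)
    Q_s = np.interp(xi, F_s, t)
    Q_o = np.interp(xi, F_o, t)
    return np.sqrt(np.mean((Q_s - Q_o) ** 2))    # approximates the xi-integral
```

Note that the distance grows with the time shift between two otherwise identical pulses, which is the behavior that mitigates cycle-skipping.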
Therefore, this standardization does not alter the minimizer of the objective; it merely rescales the functional and its gradient by a constant factor, improving numerical conditioning and reducing sensitivity to step-size tuning in the guidance updates.

Weighted and normalized Wasserstein misfit. Combining the above ingredients, we define the normalized Wasserstein-based objective functional as
\[
J(v) = \frac{\sum_s \sum_r W_p\bigl(\tilde\rho_{\mathrm{syn}}(\cdot\,; s, r; v),\, \tilde\rho_{\mathrm{obs}}(\cdot\,; s, r)\bigr)}{\sum_s \sum_r \left( \int_0^1 |\tilde{F}_{\mathrm{obs},s,r}^{-1}(\xi)|^p \, d\xi \right)^{1/p}}, \qquad (22)
\]
where $\tilde\rho_{\mathrm{syn}}$ and $\tilde\rho_{\mathrm{obs}}$ are the PDFs obtained from the enhanced wavefields $\tilde{d}_{\mathrm{syn}}$ and $\tilde{d}_{\mathrm{obs}}$, respectively. This construction of $J$ is designed to (i) reduce the dominance of high-amplitude arrivals through bounded weighting, (ii) alleviate phase sensitivity via the OT geometry, and (iii) stabilize the numerical scale through observation-dependent normalization.

Choice of likelihood potential. With the above construction, we instantiate the data-consistency potential in the Bayesian likelihood (3) as $\Phi(v; d_{\mathrm{obs}}) := J(v)$, and we often suppress the dependence on $d_{\mathrm{obs}}$ by writing $J(v) \equiv J(v; d_{\mathrm{obs}})$ and $\Phi(v) \equiv \Phi(v; d_{\mathrm{obs}})$.

5.2. Preconditioned guidance in DPS

Building on the conditional-score approximation (15), we now introduce a variable-metric (diagonal) preconditioned guidance scheme tailored to FWI.

Our modification: variable-metric preconditioned guidance. We replace the uniform scalar step size by a positive diagonal preconditioner,
\[
v_{i-1} = v'_{i-1} - P_i\, \nabla_{v_i} \Phi(\hat{v}_0^{(i)}), \qquad P_i = \rho_i D_i, \qquad (23)
\]
where $\rho_i > 0$ is a scalar schedule and $D_i$ is a diagonal positive matrix. When $P_i = \zeta I$, (23) reduces to the baseline DPS step. The guidance gradient in (23) is computed with respect to $v_i$ through the clean estimator (16):
\[
\nabla_{v_i} \Phi(\hat{v}_0^{(i)}) = \bigl(\nabla_{v_i} \hat{v}_0(v_i, i)\bigr)^{\top} \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}).
\]
(24)

We evaluate (24) in practice by automatic differentiation.

Stabilized guidance scheduling. Early reverse-time iterates typically produce $\hat{v}_0^{(i)}$ with spurious high-frequency artifacts. In this regime, the guidance direction can be unreliable, and overly aggressive updates may destabilize the reverse trajectory. We therefore adopt a conservative guidance strength while $\hat{v}_0^{(i)}$ is still rough, and gradually increase it as the estimate becomes smoother and more structured.

To quantify the roughness of the current denoised estimate, we introduce a TV-based indicator [44]. Specifically, let $\mathrm{TV}(\cdot)$ denote the discrete total-variation semi-norm on the 2D grid:
\[
\mathrm{TV}(v) = \frac{1}{N_g} \sum_{\ell} \bigl( |(\nabla_x v)_\ell| + |(\nabla_z v)_\ell| \bigr), \qquad (25)
\]
where $\ell$ indexes grid points, $N_g$ is the number of grid points, and $\nabla_x$, $\nabla_z$ are forward finite differences. We use $\mathrm{TV}(\hat{v}_0^{(i)})$ as a roughness proxy because spurious oscillatory artifacts in early iterates typically manifest themselves through large local finite differences and hence elevated TV values. At the same time, TV is less sensitive than quadratic smoothness indicators to genuine sharp interfaces, which are common in subsurface models. It therefore provides a simple and computationally efficient criterion for distinguishing unstable, artifact-dominated iterates from structurally meaningful ones. Based on this proxy, we define the step-dependent scalar factor
\[
\rho_i = \rho_0 \exp\!\left( -\frac{\bigl(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau\bigr)_+}{c} \right), \qquad (\,\cdot\,)_+ := \max\{\cdot, 0\}, \qquad (26)
\]
with hyperparameters $\rho_0 > 0$, $c > 0$, and an optional threshold $\tau \ge 0$. By construction, $\rho_i \in (0, \rho_0]$. Moreover, $\rho_i = \rho_0$ whenever $\mathrm{TV}(\hat{v}_0^{(i)}) \le \tau$, and it decreases monotonically with $\mathrm{TV}(\hat{v}_0^{(i)})$ once $\mathrm{TV}(\hat{v}_0^{(i)}) > \tau$. The parameter $c$ sets the decay scale: larger $c$ leads to slower attenuation, that is, less sensitivity to roughness.
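The indicator (25) and schedule (26) can be sketched directly (a minimal NumPy version; dropping the last row/column of forward differences at the boundary is our own simplification):

```python
import numpy as np

def tv_indicator(v):
    """Discrete TV semi-norm of Eq. (25): mean absolute forward difference
    over the 2D grid (boundary differences omitted for simplicity)."""
    dx = np.abs(np.diff(v, axis=1))              # forward differences along x
    dz = np.abs(np.diff(v, axis=0))              # forward differences along z
    return (dx.sum() + dz.sum()) / v.size

def guidance_strength(v0_hat, rho0, c, tau):
    """TV-aware scalar schedule of Eq. (26): full strength rho0 for smooth
    estimates (TV <= tau), exponentially damped for rough ones."""
    excess = max(tv_indicator(v0_hat) - tau, 0.0)
    return rho0 * np.exp(-excess / c)
```

A constant field thus receives the full strength $\rho_0$, while a highly oscillatory estimate is strongly damped.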
The threshold $\tau$ defines a tolerance level for total variation and prevents excessive suppression of guidance in the presence of genuine interfaces, such as sharp geological contrasts.

Spatial diagonal preconditioning. FWI gradients typically decay with depth due to limited illumination and attenuation, leading to highly nonuniform update magnitudes. To mitigate this imbalance, we introduce a diagonal rescaling based on the current misfit gradient with respect to the clean estimate. Let $g^{(i)} := \nabla_{\hat{v}_0} \Phi(\hat{v}_0^{(i)}) \in \mathbb{R}^n$. We define the diagonal weights
\[
\kappa_j^{(i)} = \left( \frac{\|g^{(i)}\|_\infty + \varepsilon}{|g_j^{(i)}| + \varepsilon} \right)^{\gamma}, \qquad D_i := \mathrm{diag}\bigl(\kappa_1^{(i)}, \ldots, \kappa_n^{(i)}\bigr), \qquad (27)
\]
where $\gamma \ge 0$ and $\varepsilon > 0$. We build $D_i$ from $g^{(i)} = \nabla_{\hat{v}_0} \Phi$ because it directly reflects the physical FWI sensitivity in the model domain; the chain rule in (24) then maps this sensitivity back to the current diffusion state.

Interpretation via a descent inequality. For each reverse-time index $i$, consider the composite functional $\tilde\Phi_i(v) := \Phi(\hat{v}_0^{(i)}(v))$. Assume that $\tilde\Phi_i$ is differentiable and has a locally $L_i$-Lipschitz gradient on a neighborhood containing the relevant iterates. Then the descent lemma [45] yields that, for any symmetric positive semidefinite matrix $P$ satisfying $\|P\|_2 \le 1/L_i$,
\[
\tilde\Phi_i\bigl(v - P \nabla \tilde\Phi_i(v)\bigr) \le \tilde\Phi_i(v) - \tfrac{1}{2} \|\nabla \tilde\Phi_i(v)\|_P^2, \qquad \|a\|_P^2 := a^\top P a. \qquad (28)
\]

Algorithm 2: OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS)
  Require: $N$, schedule $\{\beta_i\}_{i=1}^N$, score network $s_{\theta^*}$, variances $\{\hat\sigma_i\}$, fixed observation $d_{\mathrm{obs}}$, hyperparameters $\rho_0, c, \tau, \gamma, \varepsilon$ (and optional clipping bounds), objective function $J$ as in (22).
  Result: approximate posterior sample $v_0$.
  Initialize $v_N \sim \mathcal{N}(0, I)$; set $\alpha_i := 1 - \beta_i$ and $\bar\alpha_i := \prod_{j=1}^{i} \alpha_j$.
  for $i = N$ down to $1$ do
    $\hat{v}_0^{(i)} \leftarrow \frac{1}{\sqrt{\bar\alpha_i}}\bigl( v_i + (1 - \bar\alpha_i)\, s_{\theta^*}(v_i, i) \bigr)$;
    $z \sim \mathcal{N}(0, I)$;
    $v'_{i-1} \leftarrow \frac{\sqrt{\alpha_i}\,(1 - \bar\alpha_{i-1})}{1 - \bar\alpha_i}\, v_i + \frac{\sqrt{\bar\alpha_{i-1}}\,\beta_i}{1 - \bar\alpha_i}\, \hat{v}_0^{(i)} + \hat\sigma_i z$;
    $g^{(i)} \leftarrow \nabla_{\hat{v}_0^{(i)}} J(\hat{v}_0^{(i)})$;
    $\kappa_j^{(i)} \leftarrow \bigl( (\|g^{(i)}\|_\infty + \varepsilon) / (|g_j^{(i)}| + \varepsilon) \bigr)^{\gamma}$; form $D_i = \mathrm{diag}(\kappa_1^{(i)}, \ldots, \kappa_n^{(i)})$;
    $\rho_i \leftarrow \rho_0 \exp\bigl( -(\mathrm{TV}(\hat{v}_0^{(i)}) - \tau)_+ / c \bigr)$;
    set $P_i \leftarrow \rho_i D_i$ (optionally with clipping and/or spectral-norm control);
    $v_{i-1} \leftarrow v'_{i-1} - P_i\, \nabla_{v_i} J(\hat{v}_0^{(i)})$;

Applying (28) pointwise with $P$ at the current iterate, this estimate provides a heuristic justification for choosing $P_i = \rho_i D_i$. Indeed, by tuning $\rho_i$ and clipping the diagonal entries of $D_i$, we control the effective spectral norm of $P_i$ and obtain a stabilized variable-metric guidance step, analogous to diagonal preconditioning in ill-conditioned inverse problems. Combining the above ingredients, we arrive at the final Algorithm 2.

6. Experiments

We evaluate our method on OpenFWI [43]. We train the score-based prior using velocity models from the Vel Family and Fault Family datasets. All inversion results are reported on held-out test samples that are disjoint from the training set; representative test models are shown in the first row of Figure 2. The score network is a U-Net [46] trained via denoising score matching under the VP diffusion described in Section 3. Importantly, prior training uses only samples of the model parameter field (velocity) and does not require any wavefield simulations. We compute posterior-driven reconstructions by running a discretized reverse-time diffusion whose drift combines the learned prior score and the data-consistency term induced by $\mathcal{F}$, implemented via our OT-guided Wavefield-Enhanced Preconditioned DPS (OT-WE-PDPS) (Algorithm 2).
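The preconditioned correction at the core of Algorithm 2 can be sketched as follows (a minimal NumPy version; the default parameter values and the clipping strategy are our own assumptions):

```python
import numpy as np

def diag_weights(g, gamma=0.5, eps=1e-4):
    """Diagonal weights kappa of Eq. (27): entries where the misfit gradient is
    small (e.g., poorly illuminated regions) receive proportionally larger steps."""
    g_abs = np.abs(g)
    return ((g_abs.max() + eps) / (g_abs + eps)) ** gamma

def preconditioned_step(v_prop, grad_vi, g_model, rho_i, gamma=0.5, eps=1e-4, clip=None):
    """Variable-metric guidance update of Eq. (23): v_{i-1} = v'_{i-1} - rho_i * D_i * grad.
    grad_vi is the chain-rule gradient w.r.t. the diffusion state (Eq. 24);
    g_model is the misfit gradient w.r.t. the clean estimate, used to build D_i."""
    kappa = diag_weights(g_model, gamma, eps)
    if clip is not None:                         # optional cap on the diagonal entries
        kappa = np.minimum(kappa, clip)
    return v_prop - rho_i * kappa * grad_vi
```

Clipping `kappa` caps the diagonal of $P_i$, which is one simple way to control its spectral norm as suggested by the descent inequality (28).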
The likelihood potential is chosen as $\Phi = J$, the normalized OT misfit of Section 5.1. The reported reconstructions correspond to unseen models that were not used in training the prior score network.

Experimental performance is evaluated using both qualitative inspection and quantitative metrics. As a primary accuracy measure, we report the relative ℓ2 reconstruction error
\[
e_{\ell_2} := \frac{\|v_{\mathrm{true}} - v_{\mathrm{rec}}\|_2}{\|v_{\mathrm{true}}\|_2},
\]
where $v_{\mathrm{true}} \in \mathbb{R}^m$ denotes the ground-truth velocity model (vectorized on the computational grid), $v_{\mathrm{rec}} \in \mathbb{R}^m$ is the reconstruction, and $\|\cdot\|_2$ is the Euclidean norm. In addition, to quantify perceptual fidelity and structural consistency, we report the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) computed between $v_{\mathrm{rec}}$ and $v_{\mathrm{true}}$.

Baselines and implementation details. We compare our method against three baselines:

(i) Baseline 1 ($W_2$ + TV) [47]. We consider a classical baseline that minimizes a normalized $W_2$ waveform misfit with total-variation (TV) regularization. Define
\[
J_{W_2}^{\mathrm{raw}}(v) := \frac{\sum_s \sum_r W_2^2\bigl(\rho_{\mathrm{syn}}(\cdot\,; s, r; v),\, \rho_{\mathrm{obs}}(\cdot\,; s, r)\bigr)}{\sum_s \sum_r \int_0^1 |F_{\mathrm{obs},s,r}^{-1}(\xi)|^2 \, d\xi},
\]
where $\rho_{\mathrm{syn}}(\cdot\,; s, r; v)$ and $\rho_{\mathrm{obs}}(\cdot\,; s, r)$ are probability densities obtained from the raw synthetic and observed traces $d_{\mathrm{syn}}(\cdot\,; s, r; v)$ and $d_{\mathrm{obs}}(\cdot\,; s, r)$, respectively, via the fixed shift–rescale density map (cf. (20)); in particular, no wavefield enhancement is applied here. Moreover, $F_{\mathrm{obs},s,r}$ denotes the CDF of $\rho_{\mathrm{obs}}(\cdot\,; s, r)$, and $F_{\mathrm{obs},s,r}^{-1}$ is the associated quantile function. We then minimize the TV-regularized objective
\[
\min_v\; J_{W_2}^{\mathrm{raw}}(v) + \alpha\, \mathrm{TV}(v),
\]
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). Using a first-order method, a (sub)gradient descent iteration reads
\[
v \leftarrow v - \rho_0 \bigl( \nabla_v J_{W_2}^{\mathrm{raw}}(v) + \alpha\, g_{\mathrm{TV}}(v) \bigr), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),
\]
with a fixed step size $\rho_0 > 0$.
(ii) Baseline 2 (OT-WE + TV). We consider an improved version of the above method that minimizes the wavefield-enhanced normalized OT misfit $J(v)$ in (22) (Section 5.1), augmented with a TV penalty:
\[
\min_v\; J(v) + \alpha\, \mathrm{TV}(v),
\]
where $\alpha > 0$ is the regularization parameter and $\mathrm{TV}(\cdot)$ is defined in (25). We solve this problem using a first-order method; in particular, we use the diagonal preconditioned (sub)gradient step
\[
v \leftarrow v - P(v) \bigl( \nabla_v J(v) + \alpha\, g_{\mathrm{TV}}(v) \bigr), \qquad g_{\mathrm{TV}}(v) \in \partial\, \mathrm{TV}(v),
\]
where $P(v) = \rho(v) D(v)$ is a positive diagonal preconditioner updated from the current iterate. Here $\rho(v)$ and $D(v)$ are defined by (26) and (27), respectively, after replacing $\hat{v}_0^{(i)}$ by $v$.

(iii) Baseline 3 (DPS) [30]. We use the original diffusion posterior sampling baseline with an ℓ2 (MSE) data-consistency potential,
\[
\Phi_{\ell_2}(v) := \tfrac{1}{2}\, \|\mathcal{F}(v) - y^\delta\|_{\Sigma^{-1}}^2,
\]
and apply the standard scalar guidance update (17) at each reverse-time step. No OT-based misfit, wavefield enhancement, or preconditioned guidance is used in this baseline.

Across all methods, we use the same forward solver, acquisition geometry, and observation data. For a fair comparison, we tune the hyperparameters of each competing method as carefully as possible (within the same experimental protocol) to achieve its best attainable performance.

6.1. Forward solver and observation data

The forward operator $\mathcal{F}$ (defined in Section 2) is evaluated by numerically solving problem (1) for the wavefield $u$ given the velocity field $v$. For a source located at $\xi_j$, we use the source term $s(x, t; \xi_j) = f(t)\, \delta(x - \xi_j)$, where $f$ is a Ricker wavelet
\[
f(t) = \bigl( 1 - 2\pi^2 f_p^2 (t - t_0)^2 \bigr) \exp\bigl( -\pi^2 f_p^2 (t - t_0)^2 \bigr),
\]
with $f_p = 15$ and $t_0 = 1.1 / f_p$. We discretize (1) on a uniform Cartesian grid of size $71 \times 71$ with spacing $dx = dz = 10$ m and advance in time with step $dt = 10^{-3}$ s for $n_t = 1000$ steps.
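The source wavelet above can be generated directly on the stated time grid (a small sketch; the function name is our own):

```python
import numpy as np

def ricker(t, fp=15.0):
    """Ricker wavelet with peak frequency fp and delay t0 = 1.1/fp, as in the text."""
    t0 = 1.1 / fp
    a = (np.pi * fp * (t - t0)) ** 2             # pi^2 fp^2 (t - t0)^2
    return (1.0 - 2.0 * a) * np.exp(-a)

# time axis matching the discretization in the text: dt = 1e-3 s, n_t = 1000 steps
t = np.arange(1000) * 1e-3
f = ricker(t)
```

The wavelet attains its maximum value 1 at $t = t_0$ and decays rapidly away from it.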
We use an 8th-order accurate finite-difference discretization in space and a second-order accurate explicit time-stepping scheme, as implemented in Deepwave [48]. To approximate an unbounded domain, we truncate the computational domain with absorbing perfectly matched layers (PMLs) of thickness 6 grid points on the left, right, and bottom boundaries. On the top boundary, we impose a reflecting (free-surface) boundary condition. The observed data are generated according to (2), with additive Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma = 0.05$.

6.2. Numerical calculation of the Wasserstein-2 distance

For each source–receiver trace, we compute a one-dimensional Wasserstein-2 distance between the observed and synthetic signals after enforcing nonnegativity and unit mass. In our method, the Wasserstein distance is computed using the wavefield-enhanced traces $\tilde{d}_{\mathrm{obs}}(t)$ and $\tilde{d}_{\mathrm{syn}}(t)$ of (19). We first shift each trace by a constant $c' > 0$ to ensure positivity and then normalize to obtain probability densities:
\[
\tilde\rho_{\mathrm{obs}}(t) = \frac{\tilde{d}_{\mathrm{obs}}(t) + c'}{\int_0^T \bigl(\tilde{d}_{\mathrm{obs}}(\tau) + c'\bigr)\, d\tau + \varepsilon'}, \qquad \tilde\rho_{\mathrm{syn}}(t) = \frac{\tilde{d}_{\mathrm{syn}}(t) + c'}{\int_0^T \bigl(\tilde{d}_{\mathrm{syn}}(\tau) + c'\bigr)\, d\tau + \varepsilon'},
\]
where the integrals are taken over the recording window $[0, T]$ and evaluated on the discrete sampling grid, $\varepsilon' = 10^{-9}$ is a small constant used to avoid division by a vanishing discrete normalization factor, and we fix $c' = 1.1\, |\min_{t \in [0,T]} \tilde{d}_{\mathrm{obs}}(t)|$ in all experiments. Let $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ denote the corresponding CDFs, and let $\tilde{Q}_{\mathrm{obs}} = \tilde{F}_{\mathrm{obs}}^{-1}$ and $\tilde{Q}_{\mathrm{syn}} = \tilde{F}_{\mathrm{syn}}^{-1}$ be the associated quantile functions.
Using the one-dimensional representation
\[
W_2(\tilde\rho_{\mathrm{obs}}, \tilde\rho_{\mathrm{syn}}) = \left( \int_0^1 \bigl| \tilde{Q}_{\mathrm{obs}}(\xi) - \tilde{Q}_{\mathrm{syn}}(\xi) \bigr|^2 \, d\xi \right)^{1/2},
\]
we discretize $\tilde{F}_{\mathrm{obs}}$ and $\tilde{F}_{\mathrm{syn}}$ on the time grid $\{t_k\}_{k=0}^{n_t - 1}$ by trapezoidal quadrature, evaluate $\tilde{Q}_{\mathrm{obs}}(\xi)$ and $\tilde{Q}_{\mathrm{syn}}(\xi)$ via linear interpolation, and approximate the $\xi$-integral on a uniform grid $\xi_j = j/(N_\xi - 1)$ by the trapezoidal rule ($N_\xi = 1000$). The overall OT misfit is obtained by averaging the resulting $W_2$ values over all source–receiver pairs.

Table 1: Hyperparameters used for each dataset. Here $\gamma$ is the exponent in the diagonal preconditioner $D_i$ for OT-WE-PDPS (27). The symbol $\rho_0$ denotes the base guidance strength in the noise-aware schedule $\rho_i$ for OT-WE-PDPS (26), and it also denotes the (constant) step size used in the first-order solver for the $W_2$ + TV baseline. The parameter $\alpha$ is the TV regularization weight used in the TV-regularized baselines (Baseline 1 and Baseline 2), with TV defined in (25).

            CurveVel-A                         CurveVel-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.55   -     0.65      -         0.35   -     0.65      -
ρ0      1.15   8     1.3       13        12.4   4     4         14
α       -      -     0.05      0.2       -      -     1.0       0.2

            FlatFault-A                        FlatFault-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.35   -     0.65      -         0.45   -     0.65      -
ρ0      1.2    0.3   1.1       5         1.45   9     0.6       2
α       -      -     0.3       1.0       -      -     0.2       0.5

            CurveFault-A                       CurveFault-B
Param   Ours   DPS   OT-WE+TV  W2+TV     Ours   DPS   OT-WE+TV  W2+TV
γ       0.45   -     0.65      -         0.55   -     0.65      -
ρ0      1.8    4.15  1.3       12        1.75   5     0.6       14
α       -      -     0.1       0.2       -      -     0.1       0.5

6.3. Quantitative and qualitative comparisons

6.3.1. Acquisition geometry and discretization. We consider a surface acquisition setting in which both sources and receivers are located on the free surface $z = L_z$, where $L_x = L_z = 700$ m. The computational domain is discretized on a $71 \times 71$ Cartesian grid with spacing $dx = dz = 10$ m, where $x$ denotes the horizontal axis and $z$ the depth axis.
Grid nodes are given by $x_i = i\, dx$ and $z_j = j\, dz$ for $i, j = 0, \ldots, 70$. Ten point sources are placed at surface nodes
\[
(x_{s_k}, z_{s_k}) = (i_k\, dx,\, L_z), \qquad i_k \in \{0, 7, 14, 21, 28, 35, 42, 49, 56, 63\},
\]
and 70 receivers are deployed at all surface nodes, $(x_r, z_r) = (r\, dx,\, L_z)$, $r = 0, 1, \ldots, 69$.

6.3.2. Test datasets. We evaluate six velocity models shown in Figure 2: CurveVel-A/B, FlatFault-A/B, and CurveFault-A/B. These models are designed to probe increasing structural complexity, from smooth layered interfaces to sharp discontinuities and their combinations. CurveVel-A/B are layered media with clear interfaces (including curved boundaries), intended to test interface recovery.

Table 2: Quantitative comparison of inversion methods on OpenFWI velocity models. Metrics are the relative ℓ2 error $e_{\ell_2}$, PSNR, and SSIM of the reconstructed velocity field (arrows indicate the preferred direction). Methods include $W_2$ + TV (Baseline 1), OT-WE + TV (Baseline 2), DPS (Baseline 3), and the proposed method.

            CurveVel-A                CurveVel-B                FlatFault-A
Method      e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑
W2+TV       12.10%   20.65  0.4971   10.72%   21.05  0.5425   11.84%   20.56  0.7198
OT-WE+TV    7.96%    24.28  0.6987   7.40%    24.27  0.6434   7.28%    24.79  0.9217
DPS         23.15%   15.01  0.5747   15.70%   17.73  0.4329   16.68%   17.58  0.6965
Ours        3.50%    31.42  0.9039   2.04%    35.45  0.9517   2.91%    32.74  0.9646

            FlatFault-B               CurveFault-A              CurveFault-B
Method      e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑    e_ℓ2↓    PSNR↑  SSIM↑
W2+TV       22.82%   17.05  0.4584   17.63%   17.74  0.4875   16.34%   18.20  0.4343
OT-WE+TV    6.31%    28.21  0.8729   4.79%    29.06  0.7897   9.39%    23.01  0.6997
DPS         13.54%   21.58  0.5662   10.72%   22.06  0.6149   21.82%   15.69  0.2210
Ours        6.13%    28.48  0.9272   2.41%    35.03  0.9540   5.32%    27.95  0.7956
In CurveVel-A, velocities within layers increase gradually with depth, whereas in CurveVel-B the layer velocities are randomly distributed. FlatFault-A/B introduce faults that shift layers and generate sharp discontinuities; compared with CurveVel-A/B, they exhibit more discontinuities and more severe velocity contrasts. CurveFault-A/B combine curved stratified interfaces with fault-induced discontinuities, thereby coupling smooth geometric variation with sharp jumps and posing the most challenging scenarios among the six datasets.

6.3.3. Settings. We set $k = 100$ in (18) and $\varepsilon = 10^{-4}$ in the construction of $D_i$ (27). When computing $\rho_i$ (26), we set $c = 0.1$ and $\tau = 0$ in our method. These settings are kept fixed across all datasets. Other hyperparameters vary across datasets and methods and are listed in Table 1. For Baseline 1 ($W_2$ + TV) and Baseline 2 (OT-WE + TV), we run gradient descent for 10,000 iterations and report the final iterate; step sizes and regularization weights are listed in Table 1. Across all methods, we use the same forward solver, acquisition geometry, and observation data. Hyperparameters are tuned under a unified protocol to obtain the best attainable performance for each competing method.

6.3.4. Results. The following observations are drawn from Table 2 and Figure 2:

1. Our method attains the lowest reconstruction errors and the highest image-quality metrics across the datasets, while preserving interfaces and fault geometries in Figure 2.

Figure 2: Inversion results across different datasets. The first row displays the true velocity field $v_{\mathrm{true}}$.
Rows 2 to 5 show the inverted velocity fields $v_{\mathrm{rec}}$ obtained using the different algorithms. Rows 6 to 9 present the difference between the true and inverted velocity fields, $v_{\mathrm{rec}} - v_{\mathrm{true}}$. The observed wavefield has been perturbed with additive noise of intensity $\sigma = 0.05$.

Figure 3: Relative ℓ2-error of the reconstructed velocity field $v$ versus the TV regularization parameter $\alpha$ for the two deterministic methods: (a) $W_2$ + TV and (b) OT-WE + TV.

Figure 4: Relative ℓ2-error of the reconstructed velocity field $v$ versus the iteration number for the two deterministic methods with their optimal TV regularization parameters: (a) $W_2$ + TV with $\alpha = 0.2$ and (b) OT-WE + TV with $\alpha = 0.08$.

In contrast, the DPS baseline often yields reconstructions with a larger mismatch to the ground truth, suggesting that the standard MSE-based wavefield guidance may be too weak to enforce data consistency under surface-only acquisition.
These results indicate that our proposed method achieves higher reconstruction quality than the DPS baseline under surface-only acquisition.

2. The $W_2$ + TV baseline often produces over-smoothed reconstructions, with blurred interfaces and loss of structural detail, which is reflected by lower SSIM and visually smeared layer boundaries. Our method mitigates this over-smoothing and recovers the main structures, although errors remain in challenging regions.

3. As the model complexity increases (e.g., more layers, stronger discontinuities, and curved/faulted interfaces), the inversion problem becomes more ill-posed and the reconstruction accuracy degrades. For our method, residual errors concentrate near sharp velocity transitions and at larger depths, consistent with reduced illumination and parameter sensitivity in deeper regions for surface-only acquisition.

4. The OT-WE + TV baseline frequently outperforms the $W_2$ + TV baseline in terms of both quantitative metrics and visual quality. This observation suggests that wavefield enhancement and adaptive step sizing are effective not only within the DPS framework, but may also improve the performance of other inversion methods.

Table 3: Ablation results on CurveVel-B: relative ℓ2 error ($e_{\ell_2}$; lower is better), PSNR and SSIM (higher is better). Here "Only" denotes the baseline DPS update with the specified misfit, "Enh" enables wavefield enhancement (amplitude-adaptive data weighting and misfit-scale normalization), "Step" enables the preconditioned guidance update $P_i = \rho_i D_i$, and "All" enables both Enh and Step.

Method                      e_ℓ2↓     PSNR↑   SSIM↑
MSE guidance  Only          18.21%    16.45   0.3827
              Step          12.52%    19.70   0.5082
              Enh           9.20%     22.37   0.6578
              All           5.10%     27.50   0.8226
W2 guidance   Only          10.81%    20.98   0.6307
              Step          8.65%     22.91   0.6726
              Enh           5.33%     27.13   0.8244
              All (ours)    2.04%     35.45   0.9517
To verify that the selected regularization parameter α is appropriate for the deterministic methods, we take the CurveVel-A dataset as an example and plot, in Figure 3, the relationship between the regularization parameter and the final relative error for (i) the W2 + TV method and (ii) the OT-WE + TV method. In addition, Figure 4 shows the evolution of the relative error with respect to the iteration number for these two methods. Since Figure 4a indicates that the relative error of the W2 + TV method is still decreasing, we further continue this method to 80,000 iterations under the same parameter setting. Even then, the relative error is only reduced to 10.5%, which is still higher than the best relative error achieved by the OT-WE + TV method, namely 8%. The results indicate that, for the deterministic methods, the final inversion error is sensitive to the choice of the TV regularization parameter α, and a best-performing value can be identified within the tested parameter range. For the remaining datasets, the parameter α was selected by the same tuning procedure in order to achieve the best performance attainable within the tested parameter range for each deterministic method. In addition, measured in terms of iteration count, the deterministic methods require substantially more iterations to converge than our method.

6.4. Ablation Study

In the preceding experiments, we evaluated the reconstruction performance of our method for FWI. We next report an ablation study to assess the contribution of its principal components.

Figure 5: Ablation study on CurveVel-B. The top panel shows the ground-truth velocity model. The second and third panels report reconstructions obtained with W2-based and MSE-based guidance, respectively. From left to right, the columns correspond to Only (baseline DPS), Step (preconditioned guidance update P_i = ρ_i D_i), Enh (wavefield enhancement: amplitude-adaptive weighting and misfit-scale normalization), and All (Step + Enh). The bottom two panels show the corresponding error maps (reconstruction minus ground truth).

Taking the original DPS framework as the reference, we investigate: (i) the choice of data-misfit metric, comparing MSE with the Wasserstein-2 distance (W2); (ii) wavefield enhancement, consisting of amplitude-adaptive data weighting and misfit-scale normalization (Section 5.1); and (iii) the preconditioned guidance update P_i = ρ_i D_i (Section 5.2), which combines stabilized guidance scheduling with spatially adaptive diagonal scaling. Unless otherwise noted, all ablation experiments are performed on the CurveVel-B model in Figure 2. For each setting, hyperparameters are selected according to the same tuning protocol so as to achieve the best performance attainable within that setting. The results are summarized in Figure 5 and Table 3. Figure 6 further reports the evolution of the relative ℓ2 error (with respect to v_true) for the intermediate reconstructions v̂_0^(i) and v_i over the reverse-diffusion steps t.

Notation. Throughout this ablation study, each label has the form (misfit)-(component). The prefix W2- or MSE- specifies the guidance misfit used in the data-consistency term, namely the Wasserstein-2 distance (W2) or the mean-squared error (MSE), respectively. The suffix indicates which additional components are enabled:

• Only: baseline DPS with the chosen misfit, without any of our additional modifications.
• Enh: baseline DPS plus wavefield enhancement, i.e., amplitude-adaptive data weighting together with misfit-scale normalization (Sec. 5.1).

• Step: baseline DPS plus the preconditioned guidance update P_i = ρ_i D_i (Sec. 5.2), which combines the noise-aware scalar schedule ρ_i and the spatial diagonal scaling D_i.

• All: baseline DPS with both Enh and Step enabled.

For example, W2-Enh uses W2 guidance together with wavefield enhancement, while MSE-All enables both enhancements under MSE guidance.

Observations. Figures 5–6 and Table 3 reveal the following trends:

(i) Combining wavefield enhancement with the preconditioned update yields the best overall performance. In particular, W2-All achieves e_ℓ2 = 2.04%, PSNR = 35.45, and SSIM = 0.9517, substantially improving over W2-Only (e_ℓ2 = 10.81%, PSNR = 20.98, SSIM = 0.6307).

(ii) Wavefield enhancement provides the larger gain when used alone, while the preconditioned update offers additional improvement when combined. For example, under W2 guidance, adding wavefield enhancement reduces e_ℓ2 from 10.81% (W2-Only) to 5.33% (W2-Enh), whereas adding the preconditioned update alone reduces it to 8.65% (W2-Step). Enabling both components yields the best performance, achieving e_ℓ2 = 2.04% (W2-All).

(iii) W2-based guidance is consistently more robust than MSE-based guidance across all ablation settings, both in the baseline configuration (Only) and in the combined configuration (All). This is also reflected in the visual reconstructions and error maps in Figure 5.

(iv) The metric trajectories suggest improved stability with preconditioned guidance. As shown in Figure 6, enabling P_i = ρ_i D_i reduces transient oscillations during intermediate reverse steps and leads to more stable convergence, with the combined setting (All) achieving the best final metrics.

6.5. Robustness Study

6.5.1.
Robustness to Observation Noise

We evaluate the robustness of our method to observation noise by corrupting the recorded data with additive noise at increasing levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5 × 10^-2}. All other hyperparameters are kept identical to those used in the previous experiments. Qualitative reconstructions are shown in Figure 7, and quantitative results, including the relative ℓ2-error, PSNR, and SSIM, are reported in Table 4. Overall, the reconstructions remain visually stable across all noise levels for all six datasets: the major interfaces and fault geometries are preserved, and the predicted velocity ranges remain consistent (Figure 7). Quantitatively, the performance degrades only mildly as σ increases, with nonmonotone but limited variations in e_ℓ2, PSNR, and SSIM (Table 4). These results indicate that the proposed guidance remains effective under moderate levels of observation noise.

Figure 6: Evolution of the statistical metrics over reverse-time steps for the ablation settings Enh, Step, Only and All. Panels (a)–(c) show the relative ℓ2-error, PSNR and SSIM of the intermediate iterate v_i under W2-based guidance; panels (d)–(f) show the corresponding metrics for the denoised estimate v̂_0^(i) under W2-based guidance. Panels (g)–(i) and (j)–(l) report the analogous quantities under MSE-based guidance for v_i and v̂_0^(i), respectively.

Figure 7: Qualitative robustness to observation noise across the six datasets. Each column corresponds to one dataset. The first row shows the true velocity field, and the remaining rows show the reconstructed velocity fields under progressively increasing noise levels σ ∈ {0, 10^-4, 10^-3, 10^-2, 5 × 10^-2}. The reconstructions remain visually stable as the noise level increases, with the main interfaces and fault structures largely preserved.

6.6. Generalization across forward operators

An important feature of the proposed framework is that the learned score-based prior is trained only on prior velocity models and does not depend on a specific forward operator. Once this prior has been trained, it can be coupled with different forward models F for inversion without retraining the prior network. In this subsection, we examine the robustness of the proposed method with respect to changes in the forward operator by perturbing the acquisition and physical settings and then re-running the inversion. We consider the FlatFault-B dataset and define five forward-operator configurations. The baseline configuration, denoted by Init, coincides with that used in the preceding experiments. We then modify the acquisition and physical settings in several ways, thereby
obtaining perturbed forward operators as follows:

1. Freq10: we replace the temporal source function by a Ricker wavelet of peak frequency f_p = 10 Hz, while keeping the spatial point-source locations, all other parameters, and the acquisition geometry unchanged. Consequently, the observed and synthetic wavefields are generated with a different source peak frequency.

2. Depth2: we shift the acquisition array downward by two grid intervals by setting the source and receiver depths to z = L_z − 2 dx instead of z = L_z, while keeping their horizontal coordinates fixed. This modification changes how waves propagate from the sources to the receivers and how the subsurface structures are illuminated.

3. Shot17: relative to the baseline configuration with 10 sources, we increase the number of sources to 17 and distribute them uniformly along the surface, while keeping the receiver layout unchanged. This perturbation yields denser illumination and increased data redundancy for the inversion.

4. AllChange: we apply Freq10, Depth2, and Shot17 simultaneously, thereby introducing a compound perturbation of the forward operator that modifies the source peak frequency, the source and receiver depths, and the number of sources.

Table 4: Quantitative robustness to observation noise across the six datasets. Relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better) of the reconstructed velocity fields under different noise levels σ.

              CurveVel-A               CurveVel-B               FlatFault-A
σ         e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑
0         3.29%   31.96  0.9111   1.95%   35.83  0.9584   1.60%   37.94  0.9820
10^-4     3.29%   31.96  0.9113   1.94%   35.90  0.9594   1.84%   36.74  0.9691
10^-3     3.39%   31.70  0.9075   2.23%   34.69  0.9451   1.95%   36.20  0.9739
10^-2     3.62%   31.12  0.9121   2.07%   35.34  0.9557   1.97%   36.15  0.9747
5×10^-2   3.50%   31.42  0.9039   2.04%   35.45  0.9517   2.91%   32.74  0.9646

              FlatFault-B              CurveFault-A             CurveFault-B
σ         e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑    e_ℓ2↓   PSNR↑  SSIM↑
0         6.07%   28.56  0.9242   2.41%   35.01  0.9602   5.67%   27.39  0.7920
10^-4     6.03%   28.61  0.9240   2.22%   35.75  0.9606   6.03%   26.86  0.7779
10^-3     6.10%   28.51  0.9225   2.45%   34.90  0.9552   6.34%   26.43  0.7407
10^-2     6.15%   28.45  0.9214   2.26%   35.56  0.9609   6.65%   26.01  0.7304
5×10^-2   6.13%   28.48  0.9272   2.41%   35.03  0.9540   5.32%   27.95  0.7956

For a fair comparison across the different operator settings, the network architecture, diffusion schedule, and all remaining hyperparameters are kept identical to those used in the preceding experiments. For each operator configuration, only the guidance-scale parameter ρ_0 is tuned, following the same protocol, so as to achieve the best performance attainable for our method; the selected values are listed in Table 5.
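The Ricker wavelet used in the Freq10 configuration has the standard closed form r(t) = (1 − 2π²f_p²(t − t0)²) exp(−π²f_p²(t − t0)²). A minimal sketch follows; the causal delay t0 = 1.5/f_p is our illustrative choice, not a value taken from the paper.

```python
import numpy as np

def ricker(t, f_peak, t0=None):
    # Standard Ricker wavelet with peak (dominant) frequency f_peak in Hz.
    # t0 delays the wavelet so it is approximately causal on t >= 0;
    # the 1.5/f_peak heuristic is an assumption, not the paper's setting.
    if t0 is None:
        t0 = 1.5 / f_peak
    a = (np.pi * f_peak * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

dt = 1e-3                          # 1 ms sampling
t = np.arange(0.0, 1.0, dt)
src = ricker(t, f_peak=10.0)       # the Freq10 source wavelet
print(f"peak at t = {t[np.argmax(src)]:.3f} s")   # 0.150 s, i.e. the chosen delay
```

Lowering the peak frequency widens the wavelet in time, which changes both the observed and the synthetic wavefields consistently, as noted above.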
The corresponding quantitative metrics and qualitative reconstructions are reported in Table 5 and Figure 8, respectively. It can be observed that, across all forward-operator configurations, our method consistently yields high-quality reconstructions, with SSIM exceeding 0.9 and relative errors below 6.2% (Table 5). These results indicate that, once the score-based prior is trained on velocity models, it can be coupled with modified acquisition/physics settings (e.g., changes in source frequency, source/receiver depth, or the number of shots) without retraining the prior. This operator-agnostic property substantially broadens the applicability of the proposed inversion framework.

Figure 8: Inversion results under different forward-operator configurations (FlatFault-B). From left to right: Init (baseline setup), Shot17 (17 uniformly distributed sources), Depth2 (sources and receivers shifted downward by 2 dx), Freq10 (Ricker wavelet peak frequency f_p = 10 Hz), and AllChange (all modifications applied). The reconstructions remain visually consistent across operator changes, indicating that the proposed method generalizes well to perturbed forward operators without retraining the prior.

Table 5: Reconstruction performance on FlatFault-B under different forward-operator configurations. Reported are the tuned values of ρ_0, together with the relative ℓ2-error (e_ℓ2; lower is better), PSNR, and SSIM (higher is better).

Operator     ρ_0    e_ℓ2↓    PSNR↑    SSIM↑
Init         1.45   6.13%    28.48    0.9272
Shot17       2.25   5.24%    29.84    0.9399
Depth2       1.60   5.37%    29.62    0.9343
Freq10       1.00   5.85%    28.87    0.9160
AllChange    1.30   5.82%    28.92    0.9168

6.7. A unified diffusion model for heterogeneous velocity fields

6.7.1.
Performance of the mixed model on OpenFWI

In the preceding experiments, a separate diffusion model was trained for each family of velocity fields. Although this dataset-specific strategy is effective when the structural class is known a priori, it is of limited practical value for inverse problems, where the subsurface structure is typically unknown. To address this issue, we next consider a mixed model trained jointly on a heterogeneous dataset assembled from multiple structural classes. The resulting unified model is designed to learn a broader prior and thus avoids the need to select a class-specific model at the inference stage.

Figure 9: Comparison of inversion results produced by the mixed and separate models for CurveVel-A, FlatFault-A, and CurveFault-B. The first row shows the true velocity models v_true. The second and third rows display the reconstructed velocity models v_rec obtained by the mixed model and the separate model, respectively. The fourth and fifth rows show the reconstruction errors, v_rec − v_true, corresponding to the mixed and separate models, respectively.

Specifically, we randomly selected 7,500 samples from each of CurveVel-A, CurveVel-B, FlatFault-A, FlatFault-B, CurveFault-A, and CurveFault-B, and merged them into a single training set. The mixed model was then evaluated on several test datasets, among which CurveVel-A, FlatFault-A, and CurveFault-B are reported here as representative examples. We adopted the same operator setting as in Section 6.3.1 and carefully tuned the remaining hyperparameters. More precisely, we set k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, and chose ρ_0 = 0.95, 0.85, and 1.4 for CurveVel-A, FlatFault-A, and CurveFault-B, respectively. For each dataset, the performance of the mixed model is compared with that of the corresponding separate model trained exclusively on the same dataset. The quantitative and qualitative results are reported in Table 6 and Figure 9, respectively.

It can be observed that, when the score-based neural network is trained on the mixed dataset rather than on a single dataset, the inversion performance degrades slightly across all three representative examples. Nevertheless, the mixed model still yields satisfactory reconstructions and remains clearly competitive in practice. These results indicate that, when strong prior knowledge of the target velocity field is available, a dataset-specific model trained on structurally homogeneous samples is generally preferable. By contrast, when prior information is limited or the structural type is unknown, the mixed model provides a practical and flexible alternative. A plausible reason for the performance gap is that the heterogeneous training distribution makes it more challenging for a single score network to capture the full range of structural patterns with equal fidelity. As a consequence, the learned prior may become less accurate for individual classes than that obtained from dataset-specific training. Moreover, owing to computational constraints, the mixed model was not trained on the full collection of available samples, which may also contribute to the observed performance gap.

Table 6: Quantitative comparison of inversion results on OpenFWI velocity models using the separate and mixed models. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).
Dataset                         e_ℓ2↓    PSNR↑    SSIM↑
CurveVel-A (separate model)     3.50%    31.42    0.9039
CurveVel-A (mixed model)        3.87%    30.54    0.8906
FlatFault-A (separate model)    2.91%    32.74    0.9646
FlatFault-A (mixed model)       3.28%    31.72    0.9179
CurveFault-B (separate model)   5.32%    27.95    0.7956
CurveFault-B (mixed model)      5.91%    27.04    0.7079

6.7.2. Performance of the mixed model on the Marmousi2 model

An important practical advantage of the mixed model is its ability to provide a unified prior for structures not explicitly represented by any single training class. To investigate this generalization capability, we applied the mixed model to the inversion of the Marmousi2 dataset [49]. We emphasize that the training data contained no samples from Marmousi2. In the numerical experiments, local velocity models of different sizes were cropped from Marmousi2 and normalized to form dimensionless 71 × 71 data matrices. For all local velocity models, we fixed k = 100, c = 0.1, τ = 0, ε = 10^-4, and γ = 0.55, while ρ_0 was chosen separately for each example. We considered four local velocity models cropped from Marmousi2 at different physical scales. Their physical sizes, grid spacings, and the corresponding values of ρ_0 are summarized in Table 7. For all cases, we used exactly the same trained mixed model without any fine-tuning or distillation. We followed the same experimental setup as in the Depth2 experiment of the forward-operator generalization section. The quantitative metrics and qualitative reconstructions are reported in Table 8 and Figure 10, respectively.

Figure 10: Inversion results for the four local velocity models extracted from Marmousi2, denoted by Marmousi2-P1 to P4. The first row shows the true velocity models v_true. The second row presents the reconstructed velocity models v_rec obtained by the proposed method with the mixed model. The third row shows the corresponding reconstruction errors v_rec − v_true.

Table 7: Experimental settings for the four local velocity models extracted from Marmousi2.

Local model     Physical size     dx (m)   dz (m)   ρ_0
Marmousi2-P1    700 m × 175 m     10       2.5      3.9
Marmousi2-P2    700 m × 263 m     10       3.75     17.9
Marmousi2-P3    1000 m × 175 m    100/7    2.5      6.1
Marmousi2-P4    1500 m × 175 m    150/7    2.5      5.0

Table 8: Quantitative comparison of inversion results obtained by the mixed model for the four local velocity models extracted from Marmousi2. Metrics include the relative ℓ2 error e_ℓ2, PSNR, and SSIM for the reconstructed velocity model (arrows indicate the preferred direction).

Local model                      e_ℓ2↓    PSNR↑    SSIM↑
Marmousi2-P1 (700 m × 175 m)     1.80%    35.95    0.9206
Marmousi2-P2 (700 m × 263 m)     3.57%    30.49    0.7553
Marmousi2-P3 (1000 m × 175 m)    2.81%    33.21    0.8975
Marmousi2-P4 (1500 m × 175 m)    5.31%    26.96    0.8536

The local velocity models extracted from Marmousi2 differ substantially from the OpenFWI datasets in structural features. Despite this mismatch, the proposed method using the mixed model still produces satisfactory reconstructions for local velocity models of different sizes. This is supported both qualitatively by Figure 10, where the reconstructed velocity models preserve the main subsurface structures, and quantitatively by the relative ℓ2 error, PSNR, and SSIM reported in Table 8, which indicate good agreement between the reconstructed and true velocity models. Since no Marmousi2-specific samples were included in the training data, these results provide evidence of the cross-dataset generalization capability of the proposed approach. In this sense, the method offers a promising data-driven framework for mitigating the difficulty arising from limited training data in full-waveform inversion.

7.
Conclusion

In this paper, we proposed a physics-guided diffusion framework for seismic velocity reconstruction in full-waveform inversion. The method combines an optimal-transport (OT) data-consistency potential with a stabilized variable-metric guidance rule in the reverse diffusion process. On the data-consistency side, the OT potential incorporates bounded amplitude-adaptive reweighting together with a one-dimensional Wasserstein discrepancy evaluated through quantile functions. This construction reduces the dominance of large-amplitude early arrivals and alleviates the sensitivity of pointwise ℓ2 objectives to time and phase misalignment. On the guidance side, the scalar DPS correction is replaced by a diagonal preconditioner P_i = ρ_i D_i, where ρ_i provides a step-dependent scalar schedule and D_i supplies spatially adaptive scaling. This yields a more balanced and stable guidance mechanism for the heterogeneous sensitivities arising in FWI.

The numerical results on the OpenFWI benchmarks show that, under matched computational budgets, the proposed method improves reconstruction quality relative to both the standard DPS approach and the deterministic baselines. In particular, the reconstructions produced by the proposed method exhibit lower relative ℓ2 errors and fewer visible artifacts across the tested datasets.

Several directions remain for future work. First, the computational cost of physics-based guidance is still substantial, and it would be valuable to develop more efficient implementations and acceleration strategies, for example through multi-fidelity surrogates. Second, an important extension is to consider more realistic forward models, including elastic, anisotropic, and attenuative media, as well as three-dimensional acquisition geometries.

Acknowledgments

Zheng Ma is supported by NSFC Grant No.
12531016 and Beijing Institute of Applied Physics and Computational Mathematics funding HX02023-6. Xiong-Bin Yan is supported by NSFC Grant No. 12401563. In addition, we thank the Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS) for their financial support. This research was funded by SIMIS under grant number SIMIS-ID-2025-ST. The authors are grateful for the resources and facilities provided by SIMIS, which were essential for the completion of this work.

References

[1] A. J. Berkhout, Pushing the limits of seismic imaging, Part II: Integration of prestack migration, velocity estimation, and AVO analysis, Geophysics 62 (3) (1997) 954–969.
[2] C. Zelt, R. Smith, Seismic traveltime inversion for 2-D crustal velocity structure, Geophysical Journal International 108 (1) (1992) 16–34.
[3] F. Clement, G. Chavent, S. Gómez, Migration-based traveltime waveform inversion of 2-D simple structures: A synthetic example, Geophysics 66 (3) (2001) 845–860.
[4] J. Hudson, J. Heritage, The use of the Born approximation in seismic scattering problems, Geophysical Journal International 66 (1) (1981) 221–240.
[5] K. Muhumuza, M. Jakobsen, T. Luostari, T. Lähivaara, Seismic monitoring of CO2 injection using a distorted Born T-matrix approach in acoustic approximation, Journal of Seismic Exploration 27 (2018) 403–431.
[6] A. Tarantola, Inversion of seismic reflection data in the acoustic approximation, Geophysics 49 (8) (1984) 1259–1266.
[7] M. Warner, A. Ratcliffe, T. Nangoo, J. Morgan, A. Umpleby, N. Shah, V. Vinje, I. Štekl, L. Guasch, C. Win, Anisotropic 3D full-waveform inversion, Geophysics 78 (2) (2013) R59–R80.
[8] M. Jakobsen, B. Ursin, Full waveform inversion in the frequency domain using direct iterative T-matrix methods, Journal of Geophysics and Engineering 12 (3) (2015) 400–418.
[9] J. Virieux, S. Operto, An overview of full-waveform inversion in exploration geophysics, Geophysics 74 (6) (2009) WCC1–WCC26.
[10] A. Fichtner, Full Seismic Waveform Modelling and Inversion, Springer Science & Business Media, 2010.
[11] G. Yao, N. V. da Silva, M. Warner, D. Wu, C. Yang, Tackling cycle skipping in full-waveform inversion with intermediate data, Geophysics 84 (3) (2019) R411–R427.
[12] Y. Yang, B. Engquist, J. Sun, B. F. Hamfeldt, Application of optimal transport and the quadratic Wasserstein metric to full-waveform inversion, Geophysics 83 (1) (2018) R43–R62.
[13] R. G. Pratt, Seismic waveform inversion in the frequency domain, Part 1: Theory and verification in a physical scale model, Geophysics 64 (3) (1999) 888–901.
[14] C. Bunks, F. M. Saleck, S. Zaleski, G. Chavent, Multiscale seismic waveform inversion, Geophysics 60 (5) (1995) 1457–1473.
[15] O. Gauthier, J. Virieux, A. Tarantola, Two-dimensional nonlinear inversion of seismic waveforms: Numerical results, Geophysics 51 (7) (1986) 1387–1403.
[16] L. Métivier, R. Brossier, Q. Mérigot, E. Oudet, J. Virieux, Measuring the misfit between seismograms using an optimal transport distance: Application to full waveform inversion, Geophysical Supplements to the Monthly Notices of the Royal Astronomical Society 205 (1) (2016) 345–377.
[17] Y. Liu, J. Teng, T. Xu, Y. Wang, Q. Liu, J. Badal, Robust time-domain full waveform inversion with normalized zero-lag cross-correlation objective function, Geophysical Journal International 209 (1) (2017) 106–122.
[18] E. Bozdağ, J. Trampert, J. Tromp, Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements, Geophysical Journal International 185 (2) (2011) 845–870.
[19] B. Chi, L. Dong, Y. Liu, Full waveform inversion method using envelope objective function without low frequency data, Journal of Applied Geophysics 109 (2014) 36–46.
[20] M. Warner, L. Guasch, Adaptive waveform inversion: Theory, Geophysics 81 (6) (2016) R429–R445.
[21] L. Qiu, J. Ramos-Martínez, A. Valenciano, Y. Yang, B. Engquist, Full-waveform inversion with an exponentially encoded optimal-transport norm, in: SEG Technical Program Expanded Abstracts 2017, 2017, pp. 1286–1290.
[22] Z. Li, Y. Tang, J. Chen, H. Wu, The quadratic Wasserstein metric with squaring scaling for seismic velocity inversion, arXiv preprint arXiv:2201.11305 (2022).
[23] Y. Gao, F. Tilmann, A. Rietbrock, A review of misfit functions for adjoint full waveform inversion in seismology, Geophysical Journal International 235 (3) (2023) 2794–2827.
[24] L. Mosser, O. Dubrule, M. J. Blunt, Stochastic seismic waveform inversion using generative adversarial networks as a geological prior, Mathematical Geosciences 52 (1) (2020) 53–79.
[25] J. Lopez-Alvis, F. Nguyen, M. Looms, T. Hermans, Geophysical inversion using a variational autoencoder to model an assembled spatial prior uncertainty, Journal of Geophysical Research: Solid Earth 127 (3) (2022) e2021JB022581.
[26] C. Sun, A. Malcolm, R. Kumar, W. Mao, Enabling uncertainty quantification in a standard full-waveform inversion method using normalizing flows, Geophysics 89 (5) (2024) R493–R507.
[27] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems 33 (2020) 6840–6851.
[28] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint arXiv:2011.13456 (2020).
[29] Y. Song, L. Shen, L. Xing, S. Ermon, Solving inverse problems in medical imaging with score-based generative models, arXiv preprint arXiv:2111.08005 (2021).
[30] H. Chung, J. Kim, M. T. McCann, M. L. Klasky, J. C. Ye, Diffusion posterior sampling for general noisy inverse problems, arXiv preprint arXiv:2209.14687 (2022).
[31] Y. Li, H. Zhang, Z. Yan, T. Alkhalifah, DiffusionInv: Prior-enhanced Bayesian full waveform inversion using diffusion models, arXiv preprint arXiv:2505.03138 (2025).
[32] F. Wang, X. Huang, T. Alkhalifah, Controllable seismic velocity synthesis using generative diffusion models, Journal of Geophysical Research: Machine Learning and Computation 1 (3) (2024) e2024JH000153.
[33] W. S. Phillips, M. C. Fehler, Traveltime tomography: A comparison of popular methods, Geophysics 56 (10) (1991) 1639–1649.
[34] J. Chen, G. Chen, M. Nagaso, P. Tong, Adjoint-state traveltime tomography for azimuthally anisotropic media in spherical coordinates, Geophysical Journal International 234 (1) (2023) 712–736.
[35] S. Hao, J. Chen, M. Xu, P. Tong, Topography-incorporated adjoint-state surface wave traveltime tomography: Method and a case study in Hawaii, Journal of Geophysical Research: Solid Earth 129 (1) (2024) e2023JB027454.
[36] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010) 451–559.
[37] M. Dashti, A. M. Stuart, The Bayesian approach to inverse problems, arXiv preprint arXiv:1302.6989 (2013).
[38] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, M.-H. Yang, Diffusion models: A comprehensive survey of methods and applications, ACM Computing Surveys 56 (4) (2023) 1–39.
[39] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations, 2021.
[40] B. Øksendal, Stochastic differential equations, in: Stochastic Differential Equations: An Introduction with Applications, Springer, 2003, pp. 38–50.
[41] B. D. Anderson, Reverse-time diffusion equation models, Stochastic Processes and their Applications 12 (3) (1982) 313–326.
[42] P. Vincent, A connection between score matching and denoising autoencoders, Neural Computation 23 (7) (2011) 1661–1674.
[43] C. Deng, S. Feng, H. Wang, X. Zhang, P. Jin, Y. Feng, Q. Zeng, Y. Chen, Y. Lin, OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion, Advances in Neural Information Processing Systems 35 (2022) 6007–6020.
[44] L. I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena 60 (1-4) (1992) 259–268.
[45] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer Science & Business Media, 2013.
[46] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[47] B. Engquist, Y. Yang, Optimal transport based seismic inversion: Beyond cycle skipping, Communications on Pure and Applied Mathematics 75 (10) (2022) 2201–2244.
[48] A. Richardson, Deepwave (Oct. 2025). doi:10.5281/zenodo.3829886. URL https://doi.org/10.5281/zenodo.3829886
[49] G. S. Martin, R. Wiley, K. J. Marfurt, Marmousi2: An elastic upgrade for Marmousi, The Leading Edge 25 (2) (2006) 156–166.
