Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems
Authors: Qi Liu, Laure Zanna, Joan Bruna
¹Courant Institute School of Mathematics, Computing, and Data Science, New York University, USA. Correspondence to: Qi Liu <ql2221@nyu.edu>.

Preprint. March 19, 2026.

Abstract

Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In this work, we introduce a unifying mathematical framework that makes this trade-off explicit, formalizing and generalizing hyperparameter-based strategies in existing approaches. Within this framework, we propose a robust, hyperparameter-free model, implemented as a conditional diffusion model, that balances short-time fidelity with long-time consistency by construction. Our model, the Self-refining Neural Surrogate model (SNS), can be implemented as a standalone model that refines its own autoregressive outputs or as a complementary model to existing neural surrogates to ensure long-time consistency. We also demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.

1. Introduction

Understanding and simulating physical systems is a foundational aspect of many scientific and engineering fields, including earth system modeling, neuroscience, and robotics (Lorenz, 1963; Chariker et al., 2016; Ijspeert et al., 2013). Dynamical systems are the major mathematical tools used to model these systems, describing how the states of a system evolve over time through partial differential equations (PDEs). Traditionally, these PDEs are solved numerically, which requires fine spatial discretizations to capture the local dynamics accurately. However, fine spatial discretization imposes constraints on the temporal discretization as well, such as the Courant-Friedrichs-Lewy (CFL) condition for numerical stability, leading to infeasible computational costs (Lewy et al., 1928).

Deep learning techniques have been incorporated in various ways to address this challenge. In particular, autoregressive neural surrogate models have been deployed to replace numerical schemes entirely (Li et al., 2020; Stachenfeld et al., 2022; Kochkov et al., 2024). Autoregressive neural surrogates, compared to their numerical counterparts, can achieve orders-of-magnitude computational speedups by reducing the number of function evaluations and leveraging parallel processing on GPUs (Kurth et al., 2023). These neural surrogate models are being applied to weather forecasting (Pathak et al., 2022; Bi et al., 2023) and climate modeling (Kochkov et al., 2024; Subel & Zanna, 2024; Dheeshjith et al., 2025).

Autoregressive models take their own outputs as inputs. Despite the tremendous success of autoregressive neural surrogate models in achieving short-term accuracy, their performance tends to degenerate over longer time horizons (Chattopadhyay et al., 2023; Bonavita, 2024; Parthipan et al., 2024; Bach et al., 2024). Autoregressive models are prone to distribution drift: they are trained on data from a distribution, but there is no guarantee that their outputs will remain within it during inference. This motivates decomposing the risk of an autoregressive model into conditional approximation error and out-of-distribution (OOD) error.
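The cost argument behind the CFL condition can be made concrete with a small sketch (our illustration with made-up values, not the paper's code): the admissible explicit timestep shrinks proportionally with the grid spacing.

```python
# Illustration (ours, hypothetical values) of the CFL stability constraint:
# for an advective velocity u on a grid with spacing dx, explicit schemes
# require dt <= C * dx / u, where C is a scheme-dependent Courant number.
def cfl_timestep(dx: float, max_velocity: float, courant: float = 0.5) -> float:
    """Largest stable explicit timestep under the CFL condition."""
    return courant * dx / max_velocity

# Refining the grid spacing tightens the timestep proportionally, so the
# cost of simulating a fixed horizon grows superlinearly with resolution.
dt_coarse = cfl_timestep(dx=1e-2, max_velocity=1.0)
dt_fine = cfl_timestep(dx=5e-3, max_velocity=1.0)
assert dt_fine == 0.5 * dt_coarse
```

This is precisely the coupling between spatial and temporal resolution that makes fine-grid numerical simulation expensive and motivates learned surrogates.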
Many existing efforts to enhance the performance of long rollouts implicitly balance a trade-off between these two errors. However, most prior work relies on heuristic strategies that yield strong empirical results but offer limited theoretical justification.

In this work, we provide a general mathematical framework, based on a multi-noise-level denoising oracle, that explicitly accounts for the trade-off between conditional approximation error and OOD error. Estimators of the multi-noise-level denoising oracle create a flexible design space for seeking an optimal balance between the two errors. We propose our model, the Self-refining Neural Surrogate model (SNS), which seeks to achieve the optimal balance between the errors within this framework. During inference, SNS dynamically determines the noise level for the conditioning and the denoising strategies. We implement SNS as a conditional diffusion model (Sohl-Dickstein et al., 2015; Song et al., 2020) trained with a denoising diffusion probabilistic model (DDPM)-style denoising score matching objective (Ho et al., 2020). SNS can be used as a standalone model that refines its own output or as a complementary model alongside existing neural surrogates to enhance long-term performance. We demonstrate the numerical feasibility of SNS through high-fidelity simulations over a long time horizon.

Figure 1. Top 5 rows: Vorticity fields of the Kolmogorov flow from a numerical solver, a Gaussian approximation of the transition density, ACDM with 200 denoising steps, Thermalizer, and SNS-refined trajectories for Kolmogorov flow over a trajectory of 15,000 steps. Bottom row: kinetic energy spectra at each timestep, averaged over 20 randomly initialized trajectories.
Main contributions. We propose the multi-noise-level denoising oracle with an explicit representation of the approximation-OOD trade-off, transforming a heuristic goal into a concrete, learnable objective. We present our Self-refining Neural Surrogate model, which seeks to dynamically balance the trade-off within this framework.

2. Related Work

Diffusion models as surrogates. Conditional diffusion models have been utilized by GenCast (Price et al., 2023) and the autoregressive conditional diffusion model (ACDM) (Kohl et al., 2024) for probabilistic simulations of dynamical systems. ACDM follows the Conditional Diffusive Estimator (CDiffE) framework (Batzolis et al., 2021), in which the conditioning variables and target outputs are jointly diffused. In contrast, GenCast conditions on clean, unperturbed inputs, following prior work on conditional diffusion models (Song & Ermon, 2019; Karras et al., 2022). Our approach, SNS, is also implemented as a conditional diffusion model; however, a key distinction is that both ACDM and GenCast condition their generation on previous autoregressive outputs under the assumption that these inputs remain in distribution. As a result, both models remain susceptible to distribution drift arising from the accumulation of out-of-distribution (OOD) errors during rollout.

A well-known limitation of diffusion-based models (including ours) is their long inference time, which arises from degradation in sample quality when the reverse process is insufficiently discretized. Consequently, substantial effort has been devoted to accelerating diffusion inference, including consistency models (Song et al.) and progressive distillation techniques (Salimans & Ho, 2022). To the best of our knowledge, these approaches have not yet been applied to autoregressive conditional diffusion models for faster simulation. Several recent works propose alternative strategies to reduce computational cost.
Shehata et al. (2025) accelerate inference by skipping portions, especially the end, of the reverse process via Tweedie's formula. DYffusion (Cachay et al., 2023) avoids solving the reverse stochastic differential equation by replacing the forward and backward processes with a learned interpolator and forecaster. For high-dimensional spaces, Gao et al. (2024) reduce inference cost by performing diffusion in a learned low-dimensional latent space. PDE-Refiner (Lippe et al., 2023), one of the first diffusion-based methods for PDE modeling, mitigates computational overhead by acting as a complementary refinement module on top of a point-wise autoregressive predictor. Similar to PDE-Refiner, our model, SNS, can also work around the computational cost of diffusion-based approaches by dynamically truncating the initial stages of the reverse process with a rough next-state estimate from a point-wise autoregressive predictor.

Mitigating distribution drift. Recent work has focused on mitigating distribution drift in autoregressive models by exploiting properties of the stationary distribution (Jiang et al., 2023; Schiff et al., 2024; Pedersen et al., 2025). In particular, the Thermalizer framework (Pedersen et al., 2025), which dynamically maps out-of-distribution samples back to the stationary distribution through a diffusion model, is conceptually similar to our model. However, Thermalizer's objective is solely to ensure long-time stability, while we consider the optimal trade-off one can make to ensure stability while respecting the temporal dynamics. Prior work by Stachenfeld et al. (2022) demonstrated that employing a non-degenerate Gaussian approximation to the transition density improves long-time consistency. Moreover, unrolling this Gaussian approximation and minimizing the mean squared error (MSE) over multiple time steps has been shown to further enhance long-horizon performance (Lusch et al.
, 2018; Um et al., 2020; Vlachas et al., 2020). Conditioning the model on a larger temporal context has also proven effective in reducing distribution drift (Nathaniel & Gentine, 2026; Zhang et al., 2025). While our approach offers an alternative mechanism for extending the consistency horizon of simulations, our primary objective is to provide theoretical insight into the conditions and mechanisms that enable truly infinite-length generation.

3. Problem Setup

Background. Consider a dynamical system in R^n with an unknown forward operator F : R^n → R^n and initial condition x_0, such that

    ∂_t x(t) = F(x(t), t),  x(0) = x_0.

For a typical time scale Δτ, we denote by {x_t}_{t∈N} the time-discretized snapshots of the system, where x_t = x(tΔτ). Given samples of such snapshots as training data, the goal is to generate dynamically consistent trajectories given new initial conditions. The operator F is, in general, nonlinear and exhibits chaotic dynamics, i.e., small perturbations in initial states evolve to states that are considerably different (Lorenz, 1963).

To account for the sensitivity to uncertainty in initial conditions and to numerical discretization errors, it is more appropriate to adopt a probabilistic framework for the system. We can view {x_t}_{t∈N} as a discrete realization of a continuous-time Markovian stochastic process {X_τ}_{τ∈R}, i.e., x_t = X_{tΔτ}(ω). We denote the path density of the process p_t := p(x_0, ..., x_t), which satisfies the recurrence relation p_{t+1} = p(x_{t+1}, (t+1)Δτ | x_t, tΔτ) p_t, where the transition density p(x_{t+1}, (t+1)Δτ | x_t, tΔτ) is the conditional distribution of X_{(t+1)Δτ} given X_{tΔτ} = x_t. Here we assume that the Markov transition kernel admits a density, i.e.
Q(x_t, dx_{t+1}) = p(x_{t+1}, (t+1)Δτ | x_t, tΔτ) dx_{t+1}. In addition, if the system is autonomous and ergodic, the transition density reduces to p(x_{t+1} | x_t), and a unique limiting distribution exists which coincides with the stationary distribution. We denote this distribution as µ := µ_∞ = lim_{t→∞} µ_t, where µ_t is the marginal distribution of x_t, i.e., µ_t = (Q^t)_# p(x_0). We choose to focus on ergodic systems in this work, but the method can be extended to time-dependent systems as well. Because of the Markovian property, it suffices to learn the one-step conditional distribution and to generate new trajectories by sampling from the learned distribution, p̂, autoregressively as follows: given x_0, sample x̂_{t+1} ~ p̂(x_{t+1} | x̂_t) recursively with x̂_0 = x_0.

Modeling the transition density via diffusion models. The transition density has the property that p(x_{t+1} | x_t) → δ(x_{t+1} − x_t) as Δτ → 0, i.e., the transition density is a Dirac mass at the current snapshot without time evolution. So in the asymptotic regime Δτ ≪ 1, it is reasonable to approximate the transition density with a "sharp" Gaussian distribution: p(x_{t+1} | x_t) ≈ N(x_{t+1}; F(x_t), Σ(Δτ)), where Σ(Δτ) has vanishing spectral radius as Δτ → 0. At a fixed time discretization Δτ, both the mean and the covariance matrix can be learned, and we denote the learned Gaussian kernel p̂(x_{t+1} | x_t) = N(x_{t+1}; F_θ(x_t), Σ_θ). We refer to this approach as the Gaussian approximation and provide more details about the learning objective in Appendix B.2.

Conditional diffusion models can also be used to generate samples from the transition density p(x_t | x_{t−1}). Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al.
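The autoregressive sampling scheme above can be sketched with the Gaussian approximation of the transition density (a minimal NumPy illustration of ours; `F_theta` is a toy stand-in for a learned mean map, not the paper's network):

```python
import numpy as np

# Sketch of autoregressive sampling from a learned Gaussian transition
# density p̂(x_{t+1} | x_t) = N(F_θ(x_t), Σ_θ), with a toy mean map and a
# fixed isotropic covariance standing in for the learned quantities.
rng = np.random.default_rng(0)

def F_theta(x: np.ndarray) -> np.ndarray:
    """Placeholder learned mean map; a real model would be a neural net."""
    return 0.99 * x  # toy contraction standing in for the dynamics

def rollout(x0: np.ndarray, n_steps: int, sigma: float = 0.01) -> np.ndarray:
    """Sample x̂_{t+1} ~ N(F_θ(x̂_t), σ² I) recursively with x̂_0 = x0."""
    traj = [x0]
    for _ in range(n_steps):
        mean = F_theta(traj[-1])
        traj.append(mean + sigma * rng.standard_normal(x0.shape))
    return np.stack(traj)

traj = rollout(np.ones(4), n_steps=100)
assert traj.shape == (101, 4)
```

The loop makes the drift mechanism visible: every step conditions on the model's own previous output, so any approximation error in `F_theta` is fed back into the next input.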
, 2020), when first proposed, enable transport between a source distribution p_S, usually Gaussian, and an unconditional target distribution p_0 via a forward process and a learned reverse process. We use s for diffusion time, written as a superscript, which is independent of the physical time t. Given x^0 ~ p_0, the forward process

    dx^s = f(x^s, s) ds + g(s) dw^s,  s ∈ [0, S],

is a stochastic process that starts at the target distribution and ends at the source distribution, i.e., x^0 ~ p_0, x^S ~ p_S. One can generate new samples from the target distribution by solving the SDE associated with the reverse process backwards in time (from s = S to s = 0) with initial condition x^S ~ p_S (Anderson, 1982):

    dx^s = [f(x^s, s) − g(s)^2 ∇ log p_s(x^s)] ds + g(s) dw̄^s,

where p_s(x^s) is the marginal density of x^s. Solving the reverse SDE requires estimating the score, s_θ ≈ ∇ log p_s(x^s), which can be done via various training objectives. In particular, Ho et al. (2020) proposed a denoising objective. Define the denoising oracle as

    D(x^s, s) := E[x | x^s].  (1)

Tweedie's formula (Robbins, 1992) relates the denoising oracle to the score explicitly via ∇ log p_s(x^s) = −σ_s^{−2}(x^s − D(x^s, s)), where σ_s^2 = ∫_0^s g(s')^2 ds'.

Conditional diffusion models enable sampling from a conditional distribution, p(x | c).

Figure 2. Phase space for the forward and reverse process: Moving to the right and moving to the left correspond to the forward and the reverse process for x_{t−1}. Moving up and down correspond to the forward and the reverse process for x_t, respectively. The dashed curved arrows represent a jump in the reverse process.
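Tweedie's formula can be verified on a toy case where both sides are available in closed form (our illustration, independent of the paper's models): for a standard Gaussian prior with additive Gaussian noise, the posterior mean and the score are both known analytically.

```python
# Tweedie's formula relates the denoising oracle D(x^s, s) = E[x | x^s]
# to the score: ∇ log p_s(x^s) = -(x^s - D(x^s, s)) / σ_s².
# Toy 1-D check: if x ~ N(0, 1) and x^s = x + σ_s ε, then
# p_s = N(0, 1 + σ_s²), D(x^s) = x^s / (1 + σ_s²), and the exact score
# is -x^s / (1 + σ_s²).
def denoiser(x_s: float, sigma_s: float) -> float:
    """Exact posterior mean E[x | x^s] for the toy Gaussian case."""
    return x_s / (1.0 + sigma_s**2)

def score_from_denoiser(x_s: float, sigma_s: float) -> float:
    """Score recovered from the denoiser via Tweedie's formula."""
    return -(x_s - denoiser(x_s, sigma_s)) / sigma_s**2

x_s, sigma_s = 0.7, 0.5
exact_score = -x_s / (1.0 + sigma_s**2)
assert abs(score_from_denoiser(x_s, sigma_s) - exact_score) < 1e-12
```

In practice the denoiser is a learned network, but the algebraic conversion to the score is exactly this one-liner.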
One could consider evolving the conditional reverse process SDE

    dx^s = [f(x^s, s) − g(s)^2 ∇ log p_s(x^s | c)] ds + g(s) dw̄^s

to generate samples from p(x | c). Multiple methods to estimate the conditional score ∇ log p_s(x^s | c) have been proposed, such as via Bayes' rule (Dhariwal & Nichol, 2021), classifier-free guidance (Ho & Salimans, 2022), and conditional score matching. In particular, Batzolis et al. (2021) considered the loss function

    L(θ) = (1/2) E_{s~U(0,S)} E_{x^0, x^s~p_s(x^s|x^0)} [ λ(s) ‖∇ log p_s(x^s | x^0) − s_θ(x^s, c, s)‖_2^2 ]

and showed that its minimizer coincides with the minimizer of the loss function with the score ∇ log p_s(x^s | x^0) replaced by the conditional score ∇ log p_s(x^s | c). Furthermore, one could consider learning the score of the annealed conditional distribution ∇ log p_s(x^s | c^s), which coincides with the score of the joint distribution p_s(x^s, c^s), and use it as an approximation to ∇ log p_s(x^s | c), as we will discuss in Section 4.1.

Distribution drift in autoregressive models. To make the distinction between short-horizon prediction accuracy and long-horizon physical consistency precise, it is useful to consider a finite-dimensional Markov system. Let M denote the true transition matrix and M̂ its learned approximation, with stationary distributions µ and µ̂, i.e., the Perron (left) eigenvectors of M and M̂, respectively. This naturally leads to two distinct notions of error. First, the conditional approximation error measures how well one-step transitions are approximated under the true stationary distribution:

    E_cond = Σ_i µ_i D(M_i ‖ M̂_i),

where M_i denotes the i-th row of M, µ_i is the i-th component of µ, and D is a divergence such as the KL divergence or total variation.
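One Monte Carlo term of a denoising score-matching loss of this kind can be sketched for a VP/DDPM-style forward process (our sketch; the network `s_theta` and the weighting are stand-ins, and we omit λ(s) for brevity):

```python
import numpy as np

# One Monte Carlo term of a conditional denoising score-matching loss,
# sketched for a DDPM-style forward process x^s = √ᾱ_s x⁰ + √(1-ᾱ_s) ε.
# The regression target is ∇ log p_s(x^s | x⁰) = -ε / √(1-ᾱ_s).
rng = np.random.default_rng(0)

def dsm_loss_term(x0, cond, s_theta, alpha_bar_s):
    """Single-sample squared error against the closed-form noisy score."""
    eps = rng.standard_normal(x0.shape)
    x_s = np.sqrt(alpha_bar_s) * x0 + np.sqrt(1 - alpha_bar_s) * eps
    target = -eps / np.sqrt(1 - alpha_bar_s)
    return np.mean((target - s_theta(x_s, cond)) ** 2)

# A hypothetical zero estimator: the loss stays positive, since the
# target is the (nonzero) noise-scaled score of p_s(x^s | x⁰).
zero_est = lambda x_s, c: np.zeros_like(x_s)
loss = dsm_loss_term(np.ones(8), np.ones(8), zero_est, alpha_bar_s=0.5)
assert loss > 0
```

The key point carried over from the text: the regression target depends only on the sampled noise, so the conditional score is learnable without ever evaluating p_s(x^s | c) directly.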
Second, the out-of-distribution (OOD) error measures the discrepancy between the stationary distributions induced by the true and learned transition densities, with the same or a possibly different divergence D̃:

    E_uncond = D̃(µ ‖ µ̂).

While the conditional approximation error captures one-step accuracy under in-distribution conditioning, the OOD error governs long-horizon rollout behavior. Noting that the conditional approximation error is a metric on conditional distributions and the OOD error is a metric on marginal distributions, any candidate approximation M̂ immediately faces an inherent trade-off between accurately reconstructing short-time dynamics and maintaining long-time consistency. The minimizer of the conditional approximation error, by definition, has to condition on unperturbed inputs, since any corruption of the conditioning decreases the mutual information between the two variables (Appendix C). Thus, conditional generation on the exact previous state enables accurate approximation of the one-step transition density p(x_t | x_{t−1}). However, even small errors in the conditioning state accumulate during rollouts, leading to distribution drift. This mechanism underlies the linear growth of the pathwise KL divergence over time (Pedersen et al., 2025). Conversely, weakening the dependence on the conditioning state improves robustness by biasing generation towards the marginal or stationary distribution, but at the cost of accuracy in modeling the temporal dynamics.

This trade-off can be formalized by interpolating between conditional and marginal distributions. Introducing noise into the conditioning variable yields a family of intermediate conditional distributions {p_{s_2}(x_t | x^{s_2}_{t−1})}_{s_2∈[0,S]}, which recovers the exact transition density in the limit s_2 → 0 and the marginal density p(x_t) as s_2 → S.
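The two error notions are directly computable for a finite Markov chain. Below is a toy 2-state example of ours (the matrices are made up for illustration): E_cond weights row-wise KL divergences by the true stationary distribution, while the OOD error compares the stationary distributions themselves.

```python
import numpy as np

# Toy illustration of the two error notions on a 2-state Markov chain:
# E_cond = Σ_i µ_i KL(M_i ‖ M̂_i),  E_ood = KL(µ ‖ µ̂).
def stationary(M):
    """Stationary distribution of a row-stochastic matrix (Perron vector)."""
    w, v = np.linalg.eig(M.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

M = np.array([[0.9, 0.1], [0.2, 0.8]])        # true transition matrix
M_hat = np.array([[0.85, 0.15], [0.25, 0.75]])  # learned approximation
mu, mu_hat = stationary(M), stationary(M_hat)

E_cond = sum(mu[i] * kl(M[i], M_hat[i]) for i in range(2))
E_ood = kl(mu, mu_hat)
assert E_cond > 0 and E_ood > 0
```

Even this toy case shows that the two errors are measured on different objects (rows of M versus the induced stationary distribution), which is why one can trade one against the other.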
Therefore, the choice of noise level directly controls the trade-off between temporal fidelity and long-term stability. Most existing approaches implicitly select a fixed operating point along this trade-off curve through hyperparameters or architectural choices. In contrast, our approach explicitly models the full phase space via a multi-noise-level denoising oracle, enabling adaptive selection of the appropriate balance during inference.

4. Methodology

4.1. Multi-noise-level denoising oracle

Denoting y_t := (x_t, x_{t−1}), we define the multi-noise-level denoising oracle to be the conditional expectation

    D(x^{s_1}_t, x^{s_2}_{t−1}) := E_{(x_t, x_{t−1}) ~ p(x_t, x_{t−1})}[ y_t | x^{s_1}_t, x^{s_2}_{t−1} ],  (2)

where p(x_t, x_{t−1}) is the joint distribution of (x_t, x_{t−1}) and x^{s_1}_t, x^{s_2}_{t−1} are independent realizations of the forward process

    dx^s = f(x^s, s) ds + g(s) dw^s,  s ∈ [0, s_i],  i = 1, 2,

with initial conditions x^0_t = x_t and x^0_{t−1} = x_{t−1}, respectively. It is important to note that although the forward processes are independent, x^{s_1}_t and x^{s_2}_{t−1} are not independent, since the initial conditions are drawn from their joint distribution. This is the fundamental mathematical object underlying our method that allows a trade-off between conditional approximation error and OOD error.

Figure 3. The x-axis corresponds to the diffusion time of the forward process for the conditioning variable. The images below the x-axis are realizations of the forward process with the same initial condition x^0_{t−1} at different times s. The images in the top row are in direct correspondence with the images below via an estimator of the multi-noise-level denoising oracle, x̂^0_t = D_θ(x^S_t, x^s_{t−1}). The y-axis is the L_2 distance between the denoised image and the original image. The degradation in denoising accuracy and the smoothing of the fields with higher noise injection in the conditioning variable align with the theory.

The motivation for the multi-noise-level denoising oracle is analogous to the trade-off between conditional and marginal distributions discussed in Section 3. In the limit s_2 → S, the denoising oracle reduces to the unconditional denoising oracle defined in Equation (1). In the limit s_2 → 0, the denoising oracle can be interpreted as a conditional denoising oracle, which corresponds to the expected value of the transition density. The interpolation between these limits is exactly the trade-off between approximation error and OOD error. Figure 3 demonstrates this trade-off through an estimator of the multi-noise-level denoising oracle.

Given the multi-noise-level denoising oracle and an initial condition x̂_0 = x_0, one could use it directly to generate trajectories by following the recurrence relation (x̂_t, x̂_{t−1}) = D(z_t, x^0_{t−1}), where z_t ~ p_S. However, this generates a trajectory based on point-wise estimates, and imperfect approximations of the oracle are still subject to OOD error, as a degenerate version of the Gaussian approximation of the transition density discussed in Section 3.

In order to enable probabilistic modeling, we define the multi-noise-level score, ∇ log p_s(y^s_t), to be

    (∇_{x_t} log p_s(y^s_t), ∇_{x_{t−1}} log p_s(y^s_t)),

where s = (s_1, s_2), y^s_t = (x^{s_1}_t, x^{s_2}_{t−1}), and p_s(y^s_t) is the marginal distribution of (x^{s_1}_t, x^{s_2}_{t−1}) under the forward process. Analogous to the univariate case, we can relate the multi-noise-level denoising oracle to the multi-noise-level score via

    ∇ log p_s(y^s_t) = −Σ_s^{−1}(y^s_t − D(x^{s_1}_t, x^{s_2}_{t−1})),

where Σ_s is a diagonal matrix with entries Σ_{s,ii} = ∫_0^{s_i} g^2(s') ds', i = 1, 2.
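The block-diagonal Tweedie relation above can be sketched shape-wise (our illustration; `oracle_stub` is a hypothetical stand-in for a learned denoiser D_θ, and the σ² values are the per-component diagonal entries of Σ_s):

```python
import numpy as np

# Sketch of the multi-noise-level Tweedie relation: with
# y^s = (x^{s1}_t, x^{s2}_{t-1}) and Σ_s = diag(σ²_{s1} I, σ²_{s2} I),
# the score is ∇ log p_s(y^s) = -Σ_s^{-1} (y^s - D(y^s)).
d = 3

def oracle_stub(y_s):
    """Hypothetical stand-in for a learned denoiser D_θ."""
    return 0.5 * y_s

def multi_level_score(x_t_s1, x_tm1_s2, sig2_s1, sig2_s2):
    """Apply the diagonal Σ_s^{-1} to the denoising residual, block-wise."""
    y_s = np.concatenate([x_t_s1, x_tm1_s2])
    resid = y_s - oracle_stub(y_s)
    inv_sigma = np.concatenate([np.full(d, 1 / sig2_s1), np.full(d, 1 / sig2_s2)])
    score = -inv_sigma * resid
    return score[:d], score[d:]  # (∇_{x_t}, ∇_{x_{t-1}}) components

s_xt, s_xtm1 = multi_level_score(np.ones(d), np.ones(d), 0.25, 1.0)
assert np.allclose(s_xt, -2.0) and np.allclose(s_xtm1, -0.5)
```

The two score blocks share a single oracle evaluation but are rescaled by their own noise levels, which is what couples the two reverse SDEs introduced next.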
Then, one could consider the coupled reverse process SDEs:

    dx^{s_1}_t = [ f(x^{s_1}_t, s_1) − g(s_1)^2 ∇_{x_t} log p_s(y^s_t) ] ds_1 + g(s_1) dw̄^{s_1}
    dx^{s_2}_{t−1} = [ f(x^{s_2}_{t−1}, s_2) − g(s_2)^2 ∇_{x_{t−1}} log p_s(y^s_t) ] ds_2 + g(s_2) dw̄^{s_2}

The SDEs are coupled through the multi-noise-level score, but this poses no constraints on the coupling of the diffusion times s_1 and s_2. The evolution of the reverse process SDEs is best understood as traversing the two-dimensional phase space of diffusion times. Given any point (s̃_1, s̃_2) ∈ R_+ × R_+, one could start the reverse process SDEs at any (x^{s_1}_t, x^{s_2}_{t−1}) ~ p_s(x^{s_1}_t, x^{s_2}_{t−1}) and end at any point (s'_1, s'_2) such that s'_i ≤ s̃_i, following monotonically decreasing curves in s_1 and s_2. Many existing diffusion-based neural surrogate models can be interpreted as different strategies for traversing this phase space. Figure 2 illustrates the corresponding evolution of the SDEs under ACDM (Kohl et al., 2024), Thermalizer (Pedersen et al., 2025), and GenCast (clean conditioning) (Price et al., 2023).

Estimating the multi-noise-level score can be done by minimizing the denoising score matching objective

    L(θ) := E_{s_1, s_2 ~ U(0,S), y^0_t, y^s_t ~ p_s(y^s_t | y^0_t)} ‖∇ log p_s(y^s_t | y^0_t) − Φ_θ(y^s_t, s_1, s_2)‖^2.  (3)

In Appendix C, we show that the minimizer of Equation (3) is the multi-noise-level score and that it coincides with the minimizer of a similar objective where the multi-noise-level score is replaced with

    (∇_{x_t} log p_{s_1}(x^{s_1}_t | x^{s_2}_{t−1}), ∇_{x_{t−1}} log p_{s_2}(x^{s_2}_{t−1} | x^{s_1}_t))

and the expectation is taken over (x^{s_1}_t, x^{s_2}_{t−1}) ~ p_s(x^{s_1}_t, x^{s_2}_{t−1}). This provides another perspective on the reverse process.
We can consider ∇_{x_t} log p_{s_1}(x^{s_1}_t | x^{s_2}_{t−1}) as an approximation to the score of the transition density, ∇ log p(x_t | x_{t−1}), and independently evolve

    dx^{s_1}_t = [ f(x^{s_1}_t, s_1) − g(s_1)^2 ∇_{x_t} log p_{s_1}(x^{s_1}_t | x^{s_2}_{t−1}) ] ds_1 + g(s_1) dw̄^{s_1}

backward in time by drawing realizations of x^{s_2}_{t−1} from the forward process.

Algorithm 1: SNS for refinement

Require: initial state x_0, trained SNS Φ_θ, trained surrogate model Ψ_θ
for t = 1 to T do
    x_t = Ψ_θ(x_{t−1})
    ŝ_i = Φ^{(i)}_θ(x_t, x_{t−1}), i = 1, 2
    z_1, z_2 ~ N(0, I)
    x^{ŝ_1}_t = √ᾱ_{ŝ_1} x_t + √(1 − ᾱ_{ŝ_1}) z_1
    x^{ŝ_2}_{t−1} = √ᾱ_{ŝ_2} x_{t−1} + √(1 − ᾱ_{ŝ_2}) z_2
    x^s_{t−1} ← x^{ŝ_2}_{t−1}
    for s = ŝ_2 down to 0 do
        z ~ N(0, I)
        x^{s−1}_{t−1} = (1/√α_s) ( x^s_{t−1} − ((1 − α_s)/√(1 − ᾱ_s)) Φ^{(4)}_θ(x^{ŝ_1}_t, x^s_{t−1}) ) + √β_s z
    end for
    x_{t−1} ← x^0_{t−1}
    x^s_t ← x^{ŝ_1}_t
    for s = ŝ_1 down to 0 do
        z ~ N(0, I)
        x^{s−1}_t = (1/√α_s) ( x^s_t − ((1 − α_s)/√(1 − ᾱ_s)) Φ^{(3)}_θ(x^s_t, x_{t−1}) ) + √β_s z
    end for
    x_t ← x^0_t
end for

Analogous to the multi-noise-level denoising oracle, this approximation of the score introduces a bias but provides a better guarantee of long-term performance, once again confirming the interpretation of s_2 as the variable controlling the trade-off. This naturally raises the following question: what is the optimal choice of starting point and traversal strategy in the reverse phase space (s_1, s_2) that best balances conditional approximation error against out-of-distribution (OOD) error? We address this question by constructing a conditional diffusion model that estimates both the multi-noise-level score and the noise level all at once.

4.2.
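The structure of Algorithm 1 can be sketched in NumPy (our sketch, not the paper's implementation: `surrogate`, `sns_noise_level`, and `sns_eps` are hypothetical stubs for Ψ_θ, the noise-level heads, and the learned noise/score head, respectively):

```python
import numpy as np

# NumPy sketch of one refinement step of Algorithm 1: noise both states to
# estimated levels, then run DDPM reverse recursions, conditioning first.
rng = np.random.default_rng(0)
S = 50
betas = np.linspace(1e-4, 0.02, S)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def surrogate(x):                  # stub for Ψ_θ: cheap point-wise next-state guess
    return 0.99 * x

def sns_noise_level(x_t, x_tm1):   # stub for the noise-level heads
    return 10, 10                  # fixed here purely for illustration

def sns_eps(x_noisy, cond, s):     # stub for the learned noise prediction
    return np.zeros_like(x_noisy)

def ddpm_denoise(x_s, cond, s_start):
    """Run the DDPM reverse recursion from step s_start down to 0."""
    x = x_s
    for s in range(s_start, 0, -1):
        z = rng.standard_normal(x.shape) if s > 1 else 0.0
        x = (x - (1 - alphas[s - 1]) / np.sqrt(1 - alpha_bars[s - 1])
             * sns_eps(x, cond, s)) / np.sqrt(alphas[s - 1])
        x = x + np.sqrt(betas[s - 1]) * z
    return x

def refine_step(x_tm1):
    """One outer-loop iteration of the refinement algorithm."""
    x_t = surrogate(x_tm1)
    s1, s2 = sns_noise_level(x_t, x_tm1)
    noise = lambda x, s: (np.sqrt(alpha_bars[s - 1]) * x
                          + np.sqrt(1 - alpha_bars[s - 1])
                          * rng.standard_normal(x.shape))
    x_t_s1, x_tm1_s2 = noise(x_t, s1), noise(x_tm1, s2)
    x_tm1_ref = ddpm_denoise(x_tm1_s2, x_t_s1, s2)  # refine conditioning first
    x_t_ref = ddpm_denoise(x_t_s1, x_tm1_ref, s1)   # then the current state
    return x_t_ref

out = refine_step(np.ones(16))
assert out.shape == (16,)
```

The ordering of the two inner loops mirrors the traversal path reported to work best in the paper: denoise the conditioning variable first, then the current frame.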
Self-refining Neural Surrogate model

Given an initial condition x̂_0 and an estimator of the multi-noise-level score, one may generate autoregressive snapshots by following any monotone path in diffusion-time space that starts at (S, ·) and terminates at (0, ·). The remaining degree of freedom is the evolution of the conditioning noise level s_2. In the idealized setting, sampling from the true transition density p(x_t | x_{t−1}) corresponds to fixing s_2 ≡ 0 throughout the reverse process. However, conditioning on an unperturbed autoregressive history is precisely what leads to distribution drift in practice, as confirmed by our numerical experiments and previous work (Kohl et al., 2024).

To mitigate this effect, we do not assume the conditioning input to remain in-distribution during rollouts. Instead, we dynamically determine an appropriate starting point in the (s_1, s_2) phase space and evolve the reverse process toward the endpoint (0, 0). By leveraging the multi-noise-level score, the model progressively refines both the current state and its conditioning, effectively transporting the autoregressive input back toward the data manifold during generation. This self-correcting mechanism motivates the name Self-refining Neural Surrogate (SNS). The full algorithmic implementation is provided in the appendix; below, we give a high-level overview of the SNS framework.

We estimate the multi-noise-level score and the noise levels with a neural network, Φ_θ : R^d × R^d → R^d × R^d × R^{2S}.
Denoting the outputs Φ^{(3)}_θ ∈ R^d × R^d and (Φ^{(1)}_θ, Φ^{(2)}_θ) ∈ R^{2S}, the loss function to minimize is

    L(θ) := E { ‖∇ log p_s(y^s_t | y^0_t) − Φ^{(3)}_θ(y^s_t)‖^2 − λ [ ⟨1_{s_1}, log Φ^{(1)}_θ(y^s_t)⟩ + ⟨1_{s_2}, log Φ^{(2)}_θ(y^s_t)⟩ ] },  (4)

where S is the number of discretization steps of the forward diffusion process, 1_{s_i} are one-hot vectors encoding the true noise levels, and the expectation is taken over s_1, s_2 ~ U(N ∩ [0, S]), y^s_t, y^0_t ~ p_s(y^s_t | y^0_t). We find that choosing the regularization weight λ to keep the two terms on a comparable scale works well.

Given the previous autoregressive output x̂_{t−1}, SNS can be used directly for generation by evolving the reverse process SDE from (S, Φ^{(2)}_θ(d_t)), where d_t = (z_t, x̂_{t−1}), z_t ~ p_S. Alternatively, one could refine the autoregressive output by evolving the coupled reverse SDEs from (Φ^{(1)}_θ(ŷ_t), Φ^{(2)}_θ(ŷ_t)), where ŷ_t = (x̂_t, x̂_{t−1}). Note that for refinement, the autoregressive output can be obtained from any other surrogate model. Figure 2 illustrates the phase-space traversal strategy implemented by SNS. Through numerical experiments, we found that denoising first in the conditioning variable and then denoising the current frame had the best numerical performance compared to many other candidate traversal paths (Section B.3). We do not claim this path is optimal; rather, we suspect this advantage reflects a numerical bias in estimating the multi-noise-level score, and that all monotonically decreasing paths should be equivalent. We provide a proof of the equivalence in distribution of monotonically decreasing paths in Theorem C.1, and we leave a more detailed investigation for future work.

5.
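A single evaluation of a combined objective of this shape, a score-matching term plus cross-entropy terms on the noise levels, can be sketched as follows (our illustration with stub inputs; the sign convention is the standard one-hot cross-entropy):

```python
import numpy as np

# Sketch of a combined objective in the spirit of Eq. (4): a squared-error
# score term plus cross-entropy terms that train the network to classify
# the true noise levels (s1, s2) from the noisy pair alone.
def combined_loss(score_pred, score_target, probs_s1, probs_s2, s1, s2, lam):
    """Score-matching MSE plus λ-weighted one-hot cross-entropy terms."""
    score_term = np.mean((score_target - score_pred) ** 2)
    ce_term = -np.log(probs_s1[s1]) - np.log(probs_s2[s2])
    return score_term + lam * ce_term

S = 10
uniform = np.full(S, 1.0 / S)  # stub predicted noise-level distributions
loss = combined_loss(np.zeros(4), np.ones(4), uniform, uniform,
                     s1=3, s2=7, lam=0.1)
assert loss > 0
```

Note that the noise-level heads receive only the noisy states, consistent with the point made later that SNS, unlike conventional score estimators, is not told the noise level at input.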
Numerical Considerations

We present the pseudocode for using SNS as a complementary model that provides a long-term performance guarantee for fast surrogates targeting minimal conditional approximation error (Algorithm 1). We follow DDPM-style training, discretizing the forward process according to the noise schedule {α_s, β_s}_{s∈N∩[0,S]} of Ho et al. (2020).

Figure 4. Top 6 rows: Top- and bottom-layer vorticity fields of the two-layer QG system from a numerical solver, a Gaussian approximation of the transition density, and SNS-refined trajectories over a trajectory of 20,000 steps. Bottom 2 rows: kinetic energy spectra of the top and bottom layer at each timestep, averaged over 5 randomly initialized trajectories, with min-max spread shaded.

Figure 5. For a trajectory x_t ∈ R^d, we report two temporal consistency metrics. The spatio-temporal correlation at lag τ is computed as

    C(τ) = ⟨ ⟨x_t − x̄_t, x_{t+τ} − x̄_{t+τ}⟩ ‖x_t − x̄_t‖_2^{−1} ‖x_{t+τ} − x̄_{t+τ}‖_2^{−1} ⟩_{t,IC},

where x̄_t is the spatial mean of x_t and ⟨·⟩_{t,IC} denotes averaging over time and initial conditions. The rate of change is computed from the discrete time derivative, R(t) = ‖(x_{t+1} − x_t)/Δt‖_1, and we plot the mean of R(t) over different random initial conditions. The blue curves disappear in the bottom-left plot for two different reasons. From 0 to 250, the blue curve is not visible because it overlaps with the green one, indicating strong temporal accuracy of the point-wise estimate in short rollouts. From 30000 to 30250, the blue curve is missing because the point-wise estimates return NaN due to the blow-up in scale once the distribution has drifted too far from the stationary distribution.
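The two consistency metrics of Figure 5 translate directly into a few lines of NumPy (our sketch for flattened states x[t] ∈ R^d; the test trajectory is synthetic):

```python
import numpy as np

# The two temporal-consistency metrics from Figure 5, for a trajectory
# stored as an array x of shape (T, d) of flattened states.
def spatio_temporal_corr(x, tau):
    """C(τ): correlation of mean-removed states at lag τ, time-averaged."""
    a = x[:-tau] - x[:-tau].mean(axis=1, keepdims=True)
    b = x[tau:] - x[tau:].mean(axis=1, keepdims=True)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

def rate_of_change(x, dt=1.0):
    """R(t): L1 norm of the discrete time derivative at each step."""
    return np.sum(np.abs(x[1:] - x[:-1]) / dt, axis=1)

# Synthetic random-walk trajectory just to exercise the metrics.
x = np.cumsum(np.random.default_rng(0).standard_normal((100, 32)), axis=0)
assert -1.0 <= spatio_temporal_corr(x, tau=1) <= 1.0
assert rate_of_change(x).shape == (99,)
```

A drifting rollout typically shows up here as a decaying C(τ) at fixed lag and a growing or diverging R(t), which is exactly the NaN blow-up described for the point-wise estimates.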
The training process and additional numerical details of SNS are provided in Appendix B.2.

SNS represents our first attempt to identify an optimal trade-off between conditional approximation error and OOD error within the proposed framework, and in its current form it still faces several numerical challenges. First, although access to the multi-noise-level score allows the use of many existing diffusion-based methods, it also leads to a more difficult optimization problem. In particular, the SNS training objective departs from the conventional formulation: traditional score estimators condition on both the noisy state and the noise level, whereas SNS operates solely on the noisy states. As a result, SNS requires a finer discretization of diffusion time than other autoregressive conditional diffusion models during the reverse process to generate high-fidelity next-frame predictions. This substantially increases the already high computational cost of simulating the full reverse diffusion process. Consequently, all numerical results in this work rely on a refinement strategy applied to a pointwise-estimate surrogate model (e.g., a neural network that outputs the next state directly in one pass). While this cost may be reduced in future work, potentially enabling diffusion-based models to scale to long-time-horizon generation, we restrict our attention here to refinement tasks. Defining an appropriate evaluation metric for long-time simulations remains an open problem. We measure the conditional approximation error with temporal consistency metrics such as the spatio-temporal autocorrelation and the rate of change in state space, as described under Figure 5. In terms of OOD error, while density-based metrics such as the Kullback–Leibler divergence and the Wasserstein distance are natural candidates, estimating the KL divergence in high-dimensional settings is computationally prohibitive.
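The two temporal consistency metrics described under Figure 5 can be sketched in NumPy as follows, for a single trajectory stored as an array of shape `(T, d)`; the averaging over initial conditions is omitted, and the function names are ours:

```python
import numpy as np

def spatio_temporal_correlation(x, tau):
    """C(tau): normalized correlation of spatially-centered states at lag tau,
    averaged over time (the paper additionally averages over initial conditions)."""
    a = x[:-tau] - x[:-tau].mean(axis=1, keepdims=True)   # x_t     - spatial mean
    b = x[tau:] - x[tau:].mean(axis=1, keepdims=True)     # x_{t+tau} - spatial mean
    inner = (a * b).sum(axis=1)
    return np.mean(inner / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))

def rate_of_change(x, dt):
    """R(t) = ||(x_{t+1} - x_t) / dt||_1 at every step of the trajectory."""
    return np.abs(np.diff(x, axis=0) / dt).sum(axis=1)
```

A frozen (time-constant) trajectory gives $C(\tau) = 1$ at every lag and $R(t) = 0$, which is the degenerate limit these diagnostics are designed to flag.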
Consequently, we restrict our evaluation to qualitative diagnostics until a principled theoretical framework for this class of tasks becomes available. In particular, we assess long-generation consistency through the energy spectrum in state space, provided at the bottom of all plots.

5.1. Kolmogorov flow

Simulating flows governed by the Navier–Stokes equations has long been a central challenge for neural surrogate models. We first consider Kolmogorov flow, a system on which Thermalizer has previously demonstrated strong performance. We implement Algorithm 1 using the same pointwise surrogate model employed by Thermalizer, a U-Net trained with a multi-step MSE objective (Section B.2). Such models usually achieve high-fidelity short-time predictions but diverge from the stationary distribution in the long run. As shown in Figure 1, SNS is able to keep otherwise diverging trajectories within the training distribution over long time horizons.

While ACDM remains numerically stable, it produces fields with biased large-scale structures, leading to overly smooth solutions. It is important to note that both Thermalizer and SNS were trained with a noising schedule of 1000 diffusion steps, whereas ACDM was evaluated with only 200 steps, making a direct performance comparison inherently unfair. Nevertheless, the computational cost of using ACDM for generation renders it impractical for long-time simulations. In contrast, the computational cost of SNS and Thermalizer is approximately two orders of magnitude lower than that of ACDM.

Figure 5 demonstrates the superior performance of SNS compared to Thermalizer in temporal consistency. This is expected from the theory, since Thermalizer only aims to preserve marginal distributions during its projections, while SNS also accounts for the transition density.
Although Thermalizer and SNS exhibit comparable performance in long-time simulations of Kolmogorov flow in terms of OOD error, we emphasize that SNS requires no hyperparameter tuning, whereas Thermalizer relies on tuning two diffusion hyperparameters, $s_{\text{init}}$ and $s_{\text{stop}}$, which define the start and end points of its diffusion process.

5.2. Quasigeostrophic Turbulence

We also consider two-layer quasigeostrophic (QG) flows as a test case. This dynamical system is of central importance in oceanic and atmospheric sciences, where it serves as a reduced-order model for a wide range of geophysical phenomena (Majda & Qi, 2018). We evaluate all models on the jet configuration, following the experimental setup of (Ross et al., 2023).

All other neural surrogates we considered failed when trained using their respective published procedures. In particular, ACDM is unable to generate coherent samples even in a static denoising setting with 200 denoising steps. We did not explore ACDM configurations with more than 200 diffusion steps due to their computational impracticality. The publicly available implementation of Thermalizer fails to produce rollouts due to an internal safeguard that terminates execution when the inferred noise level exceeds a prescribed threshold.

We acknowledge that further fine-tuning of ACDM and Thermalizer could potentially result in models capable of simulating high-fidelity trajectories over long horizons. It is also worth noting that the same architecture used for the Kolmogorov flow did not succeed with the two-layer QG system at first; a more expressive model architecture for the multi-noise-level score estimator was needed. We provide more numerical details and variants of SNS with an "easier" optimization objective in Section B. Additional videos for visualization are provided here. The code used for the numerical experiments in this paper is publicly available here.

6.
Conclusion

We introduced the multi-noise-level denoising oracle, a principled way to explicitly balance conditional approximation error and OOD error, formalizing heuristic-based methods for neural simulation over long time horizons. Within this framework, we identified a central question: how to achieve an optimal tradeoff between conditional approximation error and OOD error through controlled perturbations of the conditional variable. To address this question, we proposed the Self-refining Neural Surrogate model (SNS), a conditional diffusion model that adaptively infers the optimal noise level of the conditional variable to preserve stability while respecting the temporal dynamics.

Despite its conceptual succinctness, the current implementation of SNS is restricted by a more challenging training objective. Regardless, we demonstrated competitive performance with models designed specifically for stability guarantees in long-time simulation of the Kolmogorov flow. We also demonstrated the success of SNS in modeling more "complicated" dynamics such as two-layer QG flows. We view SNS as an initial step toward a broader class of methods enabled by the multi-noise-level denoising oracle, aimed at reconciling accurate short-term prediction with long-term distributional consistency. An important direction for future work is to better understand the interplay between the temporal correlations inherent in the underlying dynamical system and the model-induced distribution drift, which we believe is critical for achieving truly long-horizon neural simulations.

7. Acknowledgment

The authors would like to thank Chris Pedersen, Sara Vargas, Matthieu Blanke, Fabrizio Falasca, Pavel Perezhogin, Jiarong Wu, Oliver Bühler, Rudy Morel, Jonathan Weare, Sara Shamekh, and Ryan Dù for many valuable discussions. This project is supported by Schmidt Sciences, LLC.

8.
Impact Statement

This paper aims to help advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abernathey, R., rochanotes, Ross, A., Jansen, M., Li, Z., Poulin, F. J., Constantinou, N. C., Sinha, A., Balwada, D., SalahKouhen, Jones, S., Rocha, C. B., Wolfe, C. L. P., Meng, C., van Kemenade, H., Bourbeau, J., Penn, J., Busecke, J., Bueti, M., and Tobias. pyqg/pyqg: v0.7.2, May 2022. URL https://doi.org/10.5281/zenodo.6563667.

Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

Bach, E., Crisan, D., and Ghil, M. Forecast error growth: A dynamic-stochastic model. arXiv e-prints, art. arXiv:2411.06623, November 2024. doi: 10.48550/arXiv.2411.06623.

Batzolis, G., Stanczuk, J., Schönlieb, C., and Etmann, C. Conditional image generation with score-based diffusion models. CoRR, abs/2111.13606, 2021. URL https://arxiv.org/abs/2111.13606.

Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, July 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06185-3.

Bonavita, M. On some limitations of current machine learning weather prediction models. Geophysical Research Letters, 51(12), June 2024. ISSN 1944-8007. doi: 10.1029/2023gl107377. URL http://dx.doi.org/10.1029/2023GL107377.

Cachay, S. R., Zhao, B., Joren, H., and Yu, R. DYffusion: A dynamics-informed diffusion model for spatiotemporal forecasting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=WRGldGm5Hz.

Chariker, L., Shapley, R., and Young, L.-S.
Orientation Selectivity from Very Sparse LGN Inputs in a Comprehensive Model of Macaque V1 Cortex. Journal of Neuroscience, 36(49):12368–12384, December 2016. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.2603-16.2016.

Chattopadhyay, A., Sun, Y. Q., and Hassanzadeh, P. Challenges of learning multi-scale dynamics with AI weather models: Implications for stability and one solution. arXiv e-prints, art. arXiv:2304.07029, April 2023. doi: 10.48550/arXiv.2304.07029.

Dhariwal, P. and Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794. Curran Associates, Inc., 2021.

Dheeshjith, S., Subel, A., Adcroft, A., Busecke, J., Fernandez-Granda, C., Gupta, S., and Zanna, L. Samudra: An AI Global Ocean Emulator for Climate. Geophysical Research Letters, 52(10):e2024GL114318, May 2025. ISSN 0094-8276, 1944-8007. doi: 10.1029/2024GL114318.

Dresdner, G., Kochkov, D., Norgaard, P., Zepeda-Núñez, L., Smith, J. A., Brenner, M. P., and Hoyer, S. Learning to correct spectral methods for simulating turbulent flows, June 2023.

Gao, H., Kaltenbach, S., and Koumoutsakos, P. Generative learning for forecasting the dynamics of high-dimensional complex systems. Nature Communications, 15(1):8904, October 2024. ISSN 2041-1723. doi: 10.1038/s41467-024-53165-w.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020. URL https://api.semanticscholar.org/CorpusID:219955663.

Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., and Schaal, S. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25(2):328–373, February 2013. ISSN 0899-7667. doi: 10.1162/NECO_a_00393.

Jiang, R., Lu, P. Y., Orlova, E., and Willett, R.
Training neural operators to preserve invariant measures of chaotic attractors. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=8xx0pyMOW1.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.

Kochkov, D., Smith, J. A., Alieva, A., Wang, Q., Brenner, M. P., and Hoyer, S. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21), May 2021. ISSN 1091-6490. doi: 10.1073/pnas.2101784118. URL http://dx.doi.org/10.1073/pnas.2101784118.

Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Klöwer, M., Lottes, J., Rasp, S., Düben, P., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., and Hoyer, S. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, August 2024. ISSN 1476-4687. doi: 10.1038/s41586-024-07744-y.

Kohl, G., Chen, L.-W., and Thuerey, N. Benchmarking Autoregressive Conditional Diffusion Models for Turbulent Flow Simulation, December 2024.

Kurth, T., Subramanian, S., Harrington, P., Pathak, J., Mardani, M., Hall, D., Miele, A., Kashinath, K., and Anandkumar, A. FourCastNet: Accelerating global high-resolution weather forecasting using adaptive Fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701900. doi: 10.1145/3592979.3593412. URL https://doi.org/10.1145/3592979.3593412.

Lewy, H., Friedrichs, K., and Courant, R. Über die partiellen Differenzengleichungen der mathematischen Physik.
Mathematische Annalen, 100:32–74, 1928. URL http://eudml.org/doc/159283.

Li, Z., Kovachki, N. B., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier Neural Operator for Parametric Partial Differential Equations. In International Conference on Learning Representations, October 2020.

Lippe, P., Veeling, B. S., Perdikaris, P., Turner, R. E., and Brandstetter, J. PDE-Refiner: Achieving accurate long rollouts with neural PDE solvers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Qv6468llWS.

Lorenz, E. N. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, March 1963. ISSN 1520-0469.

Lusch, B., Kutz, J. N., and Brunton, S. L. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950, November 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-07210-0.

Majda, A. J. and Qi, D. Strategies for reduced-order models for predicting the statistical responses and uncertainty quantification in complex turbulent dynamical systems. SIAM Review, 60(3):491–549, 2018. doi: 10.1137/16M1104664.

Nathaniel, J. and Gentine, P. Generative emulation of chaotic dynamics with coherent prior. Computer Methods in Applied Mechanics and Engineering, 448:118410, January 2026. ISSN 0045-7825. doi: 10.1016/j.cma.2025.118410.

Parthipan, R., Anand, M., Christensen, H. M., Hosking, J. S., and Wischik, D. J. Defining error accumulation in ML atmospheric simulators. arXiv e-prints, art. arXiv:2405.14714, May 2024. doi: 10.48550/arXiv.2405.14714.

Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A. FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators, February 2022.

Pedersen, C., Zanna, L., and Bruna, J.
Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos. In Forty-Second International Conference on Machine Learning, June 2025.

Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., et al. GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv preprint arXiv:2312.15796, 2023.

Robbins, H. E. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and basic theory, pp. 388–394. Springer, 1992.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation, May 2015.

Ross, A., Li, Z., Perezhogin, P., Fernandez-Granda, C., and Zanna, L. Benchmarking of Machine Learning Ocean Subgrid Parameterizations in an Idealized Model. Journal of Advances in Modeling Earth Systems, 15(1):e2022MS003258, 2023. ISSN 1942-2466. doi: 10.1029/2022MS003258.

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models, 2022. URL https://arxiv.org/abs/2202.00512.

Schiff, Y., Wan, Z. Y., Parker, J. B., Hoyer, S., Kuleshov, V., Sha, F., and Zepeda-Núñez, L. DySLIM: Dynamics stable learning by invariant measure for chaotic systems. arXiv preprint arXiv:2402.04467, 2024.

Shehata, Y., Holzschuh, B., and Thuerey, N. Improved sampling of diffusion models in fluid dynamics with Tweedie's formula. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=0FbzC7B9xI.

Sohl-Dickstein, J. N., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. ArXiv, abs/1503.03585, 2015. URL https://api.semanticscholar.org/CorpusID:14888175.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency Models. In Proceedings of the 40th International Conference on Machine Learning, pp. 32211–32252. PMLR. URL https://proceedings.mlr.press/v202/song23a.html.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations, October 2020.

Stachenfeld, K., Fielding, D. B., Kochkov, D., Cranmer, M., Pfaff, T., Godwin, J., Cui, C., Ho, S., Battaglia, P., and Sanchez-Gonzalez, A. Learned simulators for turbulence. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=msRBojTz-Nh.

Subel, A. and Zanna, L. Building Ocean Climate Emulators, March 2024.

Um, K., Brand, R., Fei, Y. R., Holl, P., and Thuerey, N. Solver-in-the-loop: Learning from differentiable physics to interact with iterative PDE-solvers. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6111–6122. Curran Associates, Inc., 2020.

Vlachas, P. R., Pathak, J., Hunt, B. R., Sapsis, T. P., Girvan, M., Ott, E., and Koumoutsakos, P. Backpropagation algorithms and Reservoir Computing in Recurrent Neural Networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 126:191–217, June 2020. ISSN 0893-6080. doi: 10.1016/j.neunet.2020.02.016.

Zhang, L., Cai, S., Li, M., Wetzstein, G., and Agrawala, M. Frame context packing and drift prevention in next-frame-prediction video diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=J8JCF64aEn.
A. Dynamical Systems

We provide here an overview of the dynamical systems on which we validate our method. For a more thorough description, please refer to (Pedersen et al., 2025; Ross et al., 2023).

A.1. Kolmogorov Flow

The governing PDEs for the system are the 2D incompressible Navier–Stokes equations with sinusoidal forcing:

$$\partial_t u + (u \cdot \nabla) u = \nu \nabla^2 u - \frac{1}{\rho} \nabla p + f, \qquad \nabla \cdot u = 0,$$

where $u = (u, v)$ is the two-dimensional velocity field, $\nu$ is the kinematic viscosity, $\rho$ is the fluid density, $p$ is the pressure, and $f$ is an external forcing term. We choose the constant sinusoidal forcing $f = (\sin(4y), 0)$ as done in (Kochkov et al., 2021). Following (Pedersen et al., 2025), we set $\rho = 1$ and $\nu = 0.001$, which corresponds to a Reynolds number $Re = 10{,}000$. Defining the vorticity as the 2D curl of the velocity field, i.e. $\omega := \nabla_H \times u$, one can obtain the vorticity equation by taking the curl of the above equations:

$$\partial_t \omega + u \cdot \nabla \omega = \nu \nabla^2 \omega + (\nabla \times f) \cdot \hat{z}.$$

The equations are solved using the pseudo-spectral method with periodic boundary conditions from the open-source code jax-cfd (Dresdner et al., 2023). All results presented in this paper are based on the vorticity fields.

A.2. Two-Layer Quasigeostrophic Equations

For a two-layer fluid, define the potential vorticity to be

$$q_i = \nabla^2 \psi_i + (-1)^i \frac{f_0^2}{g' H_i} (\psi_1 - \psi_2), \qquad i \in \{1, 2\}, \tag{5}$$

where $i = 1$ and $i = 2$ denote the upper and lower layer, respectively. $H_i$ is the average depth of the layer, $\psi_i$ is the streamfunction, $u_i = \nabla^\perp \psi_i$, and $f_0$ is the Coriolis frequency. The time evolution of the system is given by

$$\partial_t q_i + J(\psi_i, q_i) + \beta_i \partial_x \psi_i + U_i \partial_x q_i = -\delta_{i,2}\, r \nabla^2 \psi_i + \mathrm{ssd}, \tag{6}$$

where $U_i$ is the mean flow in $x$, $\beta_i$ accounts for the beta effect, $r$ is the bottom drag coefficient, and $\delta_{i,2}$ is the Kronecker delta.
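Since all Kolmogorov-flow results are reported in terms of the vorticity field, it may help to note that the 2D curl on a periodic domain can be evaluated spectrally. The sketch below is ours (grid and normalization conventions are assumptions, not those of jax-cfd):

```python
import numpy as np

def vorticity_spectral(u, v, L=2 * np.pi):
    """omega = dv/dx - du/dy via FFT derivatives on a periodic [0, L)^2 grid.

    Arrays are indexed [y, x]: rows vary in y (axis 0), columns in x (axis 1).
    """
    n = u.shape[0]
    k = 2j * np.pi * np.fft.fftfreq(n, d=L / n)      # i * integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="xy")        # kx varies along axis 1
    return np.real(np.fft.ifft2(kx * np.fft.fft2(v) - ky * np.fft.fft2(u)))
```

For the band-limited Kolmogorov-type velocity $u = \sin(y)$, $v = 0$, the spectral curl reproduces $\omega = -\cos(y)$ to machine precision.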
We set the parameters following the jet configuration in (Ross et al., 2023) and solve the equations with pyqg (Abernathey et al., 2022) with a numerical timestep of $\delta t = 3600.0$. The snapshots in Figure 9 are subsampled by a factor of ten, so the physical time step is $\Delta t = 36000.0$.

B. Numerical details

B.1. Model Architecture

All models are based on a U-Net-style encoder–decoder architecture (Ronneberger et al., 2015) implemented as a residual U-Net with multi-scale skip connections. For the multi-noise-level score estimator, two auxiliary regression heads are attached to the bottleneck of the U-Net, as done by (Pedersen et al., 2025). The network first projects the inputs to a 64-channel feature map using a 3×3 convolution with circular padding, followed by GELU activations throughout the network. At the bottleneck, the model applies two residual blocks operating on 512-channel feature maps. The decoder mirrors the encoder structure, using transposed convolutions (4×4, stride 2) for upsampling and concatenating the corresponding encoder features via skip connections. Each decoder stage applies a residual block to fuse the upsampled and skip-connected features, progressively reducing the number of channels back to 64. In addition, the model used for the multi-noise-level score estimator of the QG flows has attention heads attached to the bottleneck layer; the model did not work well without this addition.

Figure 6. Different reasonable traversal strategies in the reverse SDE phase space.

B.2. Training

The training of the Gaussian approximation $p_\theta$ of the transition density, with point-wise estimated mean and fixed variance, is done by minimizing the multi-step mean squared error on the residuals of the snapshots, i.e.

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{k=1}^{N} \sum_{t=0}^{L-1} \left\| \Psi_\theta(x^k_t) - (x^k_{t+1} - x^k_t) + \sigma z \right\|^2,$$
where $z \sim \mathcal{N}(0, I)$ and $p_\theta = \mathcal{N}(\Psi_\theta, \sigma^2 I)$. We follow the training configuration proposed by (Pedersen et al., 2025) to obtain a similar baseline model.

We train SNS using the DDPM-style objective with 440,000 training pairs $y_t = (x_t, x_{t-1})$. We choose $\{\beta_s\}_{s=1}^{S}$ to be the cosine variance schedule as proposed in (Ho et al., 2020), with $\alpha_s = 1 - \beta_s$ and $\bar{\alpha}_s = \prod_{k=1}^{s} \alpha_k$. The forward diffusion process has the transition density

$$q(x^s \mid x^0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_s}\, x^0,\, (1 - \bar{\alpha}_s) I\big),$$

which allows sampling $x^s_t$ in closed form as

$$x^{s_1}_t = \sqrt{\bar{\alpha}_{s_1}}\, x^0_t + \sqrt{1 - \bar{\alpha}_{s_1}}\, \epsilon, \qquad x^{s_2}_{t-1} = \sqrt{\bar{\alpha}_{s_2}}\, x^0_{t-1} + \sqrt{1 - \bar{\alpha}_{s_2}}\, \epsilon', \qquad \epsilon, \epsilon' \sim \mathcal{N}(0, I).$$

We noise the inputs to the neural net independently with $s_1, s_2 \sim \mathcal{U}(1, S)$, and the loss function we minimize is

$$\mathcal{L}(\theta) := \mathbb{E}\Big\{ \big\| (\epsilon, \epsilon') - \Phi^{(3)}_\theta(y^s_t) \big\|^2 + \lambda \sum_{s_1, s_2} \big[ \mathbf{1}_{s_1} \log \Phi^{(1)}_\theta(y^s_t) + \mathbf{1}_{s_2} \log \Phi^{(2)}_\theta(y^s_t) \big] \Big\},$$

where the terms were defined before Equation (4). The minimizer gives us access to the multi-noise-level score via

$$\epsilon = -\sqrt{1 - \bar{\alpha}_s}\, \nabla_{x^s} \log q(x^s \mid x^0).$$

We use AdamW with a learning rate of $5 \times 10^{-4}$ and set the weighting parameter $\lambda = 0.1$. Each training run requires approximately two days on a single NVIDIA H200 GPU, using a U-Net with 73,727,522 trainable parameters. Empirically, we find that augmenting the input with a sample from the invariant measure improves convergence, even though the trained model ultimately exhibits negligible dependence on this signal, indicating that its benefit is primarily during the early stages of training.

Figure 7.
Histogram of denoising history for the numerical experiments in the main text.

Algorithm 2: SNS for generation
Require: initial state $x_0$, trained SNS $\Phi_\theta$
for $t = 1$ to $T$ do
    $\hat{x}_t, z \sim \mathcal{N}(0, I)$
    $\hat{s}_2 = \Phi^{(2)}_\theta(\hat{x}_t, \hat{x}_{t-1})$
    $x^{\hat{s}_2}_{t-1} = \sqrt{\bar{\alpha}_{\hat{s}_2}}\, x_{t-1} + \sqrt{1 - \bar{\alpha}_{\hat{s}_2}}\, z$
    for $s = S$ down to $1$ do
        $z_s \sim \mathcal{N}(0, I)$
        $\hat{x}^{s-1}_t = \frac{1}{\sqrt{\alpha_s}}\Big(\hat{x}^s_t - \frac{1 - \alpha_s}{\sqrt{1 - \bar{\alpha}_s}}\, \Phi^{(3)}_\theta\big(\hat{x}^s_t, x^{\hat{s}_2}_{t-1}\big)\Big) + \sqrt{\beta_s}\, z_s$
    end for
end for

B.3. SNS Traversal Strategies

Mathematically, all continuous traversal strategies that end at the point (0, 0) along a monotonically decreasing path satisfy the reverse-process SDE. Thus, there should in principle be no difference between different choices of traversal. Figure 8 shows some reasonable traversal strategies in the reverse-process phase space. Through numerical experiments, the most robust strategy appears to be path A in Figure 8. Despite its numerical impracticality, we still present Algorithm 2 for using SNS to directly generate next-state predictions. We also provide a histogram of the denoising steps associated with the two numerical experiments presented in the main text in Figure 7. It is worth noting that, in practice, one could choose an activation threshold for the algorithms presented, i.e., SNS only starts denoising once a certain noise level is reached. This can improve the inference time of SNS while preserving most of its performance, owing to the extremely high signal-to-noise ratio at small noise levels.

B.4. SNS Variants

We consider conditioning the multi-noise-level denoising oracle $D(x^{s_1}_t, x^{s_2}_{t-1})$ on a snapshot from the past, $x_{t-r}$. When $r \gg 1$, the conditional denoising oracle recovers the unconditional oracle due to the chaotic nature of the underlying system. Similarly, when $r = 1$, the conditional denoising oracle also reduces to the unconditional case as a consequence of the Markovian property of the dynamics.
The regime of interest is when $x_{t-1}$ is perturbed by noise and $r$ is chosen such that $x_{t-r}$ remains partially coupled to $x_{t-1}$ but is not trivially redundant. In this setting, the parameter $r$ can be interpreted as controlling the error trade-off. Analogous to the SNS procedure for estimating the noise level of $x^{s_2}_{t-1}$, one could in principle estimate how far back in the trajectory one must condition in order to sufficiently weaken temporal dependence while maintaining numerical stability.

This alternative approach is considerably more challenging to implement, as it requires access to historical trajectory data; in practice, storing long trajectories as training data may already be computationally demanding. Nevertheless, we empirically observe that providing a noise-free past frame at a predetermined lag $r$ as an auxiliary conditioning signal substantially simplifies the optimization problem. The key challenge lies in selecting a conditioning frame that is neither too close to the present (leading to excessive dependence) nor too distant (resulting in insufficiently informative conditioning). We keep this discussion informal, as the appropriate choice of $r$ is heuristic and system-dependent. In our experiments, conditioning on a clean frame 100 steps in the past consistently improved optimization performance and was effective for stably simulating the two-layer quasigeostrophic system using the same architecture backbone used for the Kolmogorov flow (Figure 9).

C. Proofs

Theorem C.1 (Equivalence between monotonic traversal strategies). Starting from an initial point $(x^{s^0_1}_t, x^{s^0_2}_{t-1})$, consider a discretized traversal path in reverse-process phase space, $\{(s^k_1, s^k_2)\}_{k=0}^{N}$, with $(s^0_1, s^0_2)$ the starting point and $(s^N_1, s^N_2) = (0, 0)$.
Assume the path is monotonically decreasing in each coordinate: $s^{k+1}_i \le s^k_i$, $i = 1, 2$. Let $(x^{s^k_1}_t, x^{s^k_2}_{t-1})$ evolve according to the coupled reverse SDE defined by the multi-noise-level score. Then the resulting joint distribution of the initial and terminal states,

$$p\big(x^{s^N_1}_t, x^{s^N_2}_{t-1}, x^{s^0_1}_t, x^{s^0_2}_{t-1}\big),$$

is independent of the particular traversal path in reverse-process phase space; i.e., for any two monotonically decreasing paths $\{(s^k_1, s^k_2)\}_{k=0}^{N}$ and $\{(s'^k_1, s'^k_2)\}_{k=0}^{N'}$ connecting $(s^0_1, s^0_2)$ to $(s^N_1, s^N_2)$, the random variables generated by the reverse coupled SDE satisfy

$$p\big(x^{s^N_1}_t, x^{s^N_2}_{t-1}, x^{s^0_1}_t, x^{s^0_2}_{t-1}\big) = p\big(x^{s'^{N'}_1}_t, x^{s'^{N'}_2}_{t-1}, x^{s^0_1}_t, x^{s^0_2}_{t-1}\big).$$

Proof. Fix a monotonically decreasing discrete path in reverse-process phase space, $\{(s^k_1, s^k_2)\}_{k=0}^{N}$, from $(s^0_1, s^0_2)$ to $(s^N_1, s^N_2) = (0, 0)$ with $s^{k+1}_i \le s^k_i$. Since the path is monotonically decreasing, we can parametrize it with $u \in [0, 1]$ and define

$$u \mapsto s(u) := (s_1(u), s_2(u)), \quad s_i(\cdot) \text{ nonincreasing}, \quad s(0) = (s^0_1, s^0_2), \quad s(1) = (0, 0),$$

such that the discretization points are contained in the image $\{s(u) : u \in [0, 1]\}$ (e.g., take $s_i(u)$ piecewise linear through $\{s^k_i\}$). Define the coupled state along the path by the single path-parameterized random variable

$$Y_u := \big(x^{s_1(u)}_t, x^{s_2(u)}_{t-1}\big).$$
Then the increments $ds_1$ and $ds_2$ along the path can be written as

$$ds_1 = \dot{s}_1(u)\, du, \qquad ds_2 = \dot{s}_2(u)\, du, \qquad \dot{s}_i(u) \le 0.$$

Substituting these relations into the coupled reverse SDE yields a single SDE in the parameter $u$ for $Y_u$ (interpreting the stochastic integrals componentwise with respect to the time-changed Brownian motions):

$$dx^{s_1(u)}_t = \Big[ f\big(x^{s_1(u)}_t, s_1(u)\big) - g(s_1(u))^2\, \nabla_{x_t} \log p_{s(u)}\big(y^{s(u)}_t\big) \Big] \dot{s}_1(u)\, du + g(s_1(u))\, d\bar{w}_{s_1(u)},$$

$$dx^{s_2(u)}_{t-1} = \Big[ f\big(x^{s_2(u)}_{t-1}, s_2(u)\big) - g(s_2(u))^2\, \nabla_{x_{t-1}} \log p_{s(u)}\big(y^{s(u)}_t\big) \Big] \dot{s}_2(u)\, du + g(s_2(u))\, d\bar{w}_{s_2(u)}.$$

Thus, following a monotone path reduces the evolution to the single random variable $Y_u$ driven by the parameter $u$.

Now consider the forward process applied independently to each component,

$$dx^s = f(x^s, s)\, ds + g(s)\, dw_s, \qquad s \in [0, S],$$

and run it along the same monotone parametrization $u \mapsto (s_1(u), s_2(u))$, i.e., consider the forward path process $Y^{\mathrm{fwd}}_u := \big(x^{s_1(u)}_t, x^{s_2(u)}_{t-1}\big)$ with

$$dx^{s_1(u)}_t = f\big(x^{s_1(u)}_t, s_1(u)\big)\, ds_1(u) + g(s_1(u))\, dw_{s_1(u)},$$

and analogously for $x^{s_2(u)}_{t-1}$. By construction, the reverse coupled SDE uses the multi-noise-level score $\nabla \log p_s(y^s_t)$, which is precisely the object that defines the time-reversal of the forward marginals at each noise level $s = (s_1, s_2)$. Consequently, when the reverse SDE is initialized at $Y_1 = (x^{s^N_1}_t, x^{s^N_2}_{t-1})$, its induced law on path space $\{Y_u : u \in [0, 1]\}$ matches (in reverse time) the law of the forward process run along the same path.
In particular, the reverse process produces the correct conditional endpoint distribution associated with that path:
$$p\big(Y^{\mathrm{bwd}}_0, Y^{\mathrm{bwd}}_1\big) = p\big(Y^{\mathrm{fwd}}_0, Y^{\mathrm{fwd}}_1\big).$$
Now take any two monotone paths $s(u) = (s_1(u), s_2(u))$ and $s'(u) = (s'_1(u), s'_2(u))$ connecting the same endpoints $(s^0_1, s^0_2)$ and $(0, 0)$. Under the forward dynamics, the two coordinates evolve independently given their own noise levels; i.e., the law of $x^{s_1}_t$ depends only on $s_1$ and the law of $x^{s_2}_{t-1}$ depends only on $s_2$, with independent driving Brownian motions. Therefore, for any such path, the forward endpoint distribution at $(0, 0)$ factors as
$$p\big(x^{s^0_1}_t,\, x^{s^0_2}_{t-1},\, x^0_t,\, x^0_{t-1}\big) = \Big( p\big(x^{s^0_1}_t \mid x^0_t\big),\, p\big(x^{s^0_2}_{t-1} \mid x^0_{t-1}\big) \Big) \otimes p\big(x^0_t, x^0_{t-1}\big),$$
where $\otimes$ denotes element-wise multiplication. This factorization depends only on the endpoints in each coordinate, not on how one jointly traverses $(s_1, s_2)$. Thus we conclude that the reverse process generates the correct distribution independently of the traversal path.

Equivalence of minimizers: joint score vs. conditional scores. Fix $s = (s_1, s_2) \in [0, S]^2$ and let $y^s_t := (x^{s_1}_t, x^{s_2}_{t-1}) \sim p_s$ denote the noised pair at noise levels $s$. Consider the denoising score matching objective
$$\mathcal L_{\mathrm{joint}}(\theta) := \mathbb E_{s \sim \mathcal U(0, S)^2}\, \mathbb E_{y^0_t \sim p}\, \mathbb E_{y^s_t \sim p_s(\cdot \mid y^0_t)} \big\| \nabla_y \log p_s(y \mid y^0_t) - \Phi_\theta(y^s_t, s_1, s_2) \big\|_2^2, \qquad (7)$$
where $y = y^s_t$ and $p_s(\cdot \mid \cdot)$ is the forward noising kernel. Also consider the objective
$$\mathcal L_{\mathrm{cond}}(\theta) := \mathbb E_{s \sim \mathcal U(0, S)^2}\, \mathbb E_{y^s_t \sim p_s} \left\| \begin{pmatrix} \nabla_{x_t} \log p_{s_1}\big(x^{s_1}_t \mid x^{s_2}_{t-1}\big) \\ \nabla_{x_{t-1}} \log p_{s_2}\big(x^{s_2}_{t-1} \mid x^{s_1}_t\big) \end{pmatrix} - \Phi_\theta(y^s_t, s_1, s_2) \right\|_2^2. \qquad (8)$$
Then, given a parametric class $\Theta$, the sets of minimizers coincide:
$$\operatorname*{arg\,min}_{\theta \in \Theta} \mathcal L_{\mathrm{joint}}(\theta) = \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal L_{\mathrm{cond}}(\theta).$$

Notation.
Let $p_s(y^s_t \mid y^0_t)$ denote the forward noising kernel induced by the independent SDEs, and define the marginal
$$p_s(y^s_t) := \int p(y^0_t)\, p_s(y^s_t \mid y^0_t)\, dy^0_t.$$

Lemma C.2. Fix $s = (s_1, s_2)$. For any measurable $\psi(y)$, the objectives
$$\mathbb E_{y^0_t \sim p_0}\, \mathbb E_{y^s_t \sim p_s(\cdot \mid y^0_t)} \big[ \| \nabla_y \log p_s(y^s \mid y^0_t) - \psi(y) \|^2 \big]$$
and
$$\mathbb E_{y \sim p_s} \big[ \| \nabla_y \log p_s(y) - \psi(y) \|^2 \big]$$
have the same minimizer over $\psi$, given by $\psi^*(y) = \nabla_y \log p_s(y^s)$.

Proof. Fix $s$ and abbreviate $y = y^s_t$. By the tower property of conditional expectation, $\mathbb E_{y^0_t} \mathbb E_{y \mid y^0_t}[\,\cdot\,] = \mathbb E_{y \sim p_s} \mathbb E_{y^0_t \mid y}[\,\cdot\,]$. Hence the first objective can be written as
$$\mathbb E_{y \sim p_s}\, \mathbb E_{y^0_t \mid y} \big[ \| \nabla_y \log p_s(y \mid y^0_t) - \psi(y) \|^2 \big].$$
For each fixed $y$, the minimizer of the inner expectation is the conditional mean,
$$\psi^*(y) = \mathbb E_{y^0_t \mid y} \big[ \nabla_y \log p_s(y \mid y^0_t) \big].$$
Using the identity $\nabla p = p\, \nabla \log p$, we compute
$$\nabla_y p_s(y) = \nabla_y \int p(y^0_t)\, p_s(y \mid y^0_t)\, dy^0_t = \int p(y^0_t)\, p_s(y \mid y^0_t)\, \nabla_y \log p_s(y \mid y^0_t)\, dy^0_t.$$
Since $p(y^0_t)\, p_s(y \mid y^0_t) = p_s(y)\, p(y^0_t \mid y)$, we obtain
$$\nabla_y \log p_s(y) = \mathbb E_{y^0_t \mid y} \big[ \nabla_y \log p_s(y \mid y^0_t) \big],$$
which proves the claim.

Theorem C.3 (Minimizer of the multi-noise-level DSM objective). The minimizer of objective (3) is given (for almost every $s$) by
$$\Phi^*(y^s_t, s) = \nabla_y \log p_s(y^s_t).$$

Proof. Objective (3) is an expectation over $s \sim \mathcal U(0, S)^2$ of the objectives considered in Lemma C.2. Since each $s$-slice is minimized by $\nabla_y \log p_s(y)$, the full objective is minimized by the same function almost everywhere in $s$.

Corollary C.4 (Equivalence with conditional scores). For $y = (x^{s_1}_t, x^{s_2}_{t-1})$,
$$\nabla_y \log p_s(y) = \Big( \nabla_{x_t} \log p_{s_1}\big(x^{s_1}_t \mid x^{s_2}_{t-1}\big),\; \nabla_{x_{t-1}} \log p_{s_2}\big(x^{s_2}_{t-1} \mid x^{s_1}_t\big) \Big).$$
Consequently, replacing the multi-noise-level score in (3) by the pair of conditional scores yields an objective with the same minimizer.

Proof. By the factorization
$$\log p_s\big(x^{s_1}_t, x^{s_2}_{t-1}\big) = \log p_{s_1}\big(x^{s_1}_t \mid x^{s_2}_{t-1}\big) + \log p_{s_2}\big(x^{s_2}_{t-1}\big),$$
the second term does not depend on $x^{s_1}_t$, yielding the first identity. The second follows analogously.

Corollary C.5 (Minimizer of the conditional approximation error requires clean conditioning).

Figure 8. Each row is a rollout from the same initial condition using the point-wise-estimate surrogate and SNS refinement, each following a different traversal strategy. The spectra are averaged over 20 snapshots at time $t$ with independent initial conditions. On the right is the $L^2$ distance between the temporal mean of the numerical fields and the mean of each run, and similarly the $L^2$ distance between the densities.

Suppose $d$ is the conditional KL divergence:
$$\mathcal R_{\mathrm{cond}}(q) := \mathbb E_{x \sim \mu} \big[ \mathrm{KL}\big( p(\cdot \mid x)\, \|\, q(\cdot \mid x) \big) \big], \qquad (9)$$
where $x = x_{t-1}$ and $p(\cdot \mid x) = p(x_t \mid x_{t-1})$. It is immediate that the unique minimizer (a.e. in $x$) is $q^*(\cdot \mid x) = p(\cdot \mid x)$, with minimum value $0$. Now suppose we do not condition on the clean state $x$ but only on a corrupted observation $\tilde x \sim r(\tilde x \mid x)$ (e.g., Gaussian noising). Any predictor based on the corrupted input has the form $q(x_t \mid \tilde x)$. The best achievable predictor in this restricted class is
$$q^*(x_t \mid \tilde x) = p(x_t \mid \tilde x) := \int p(x_t \mid x)\, p(x \mid \tilde x)\, dx. \qquad (10)$$
Moreover, the irreducible gap between conditioning on $x$ and on $\tilde x$ is exactly a conditional mutual information:
$$\inf_{q(\cdot \mid \tilde x)} \mathbb E \big[ \mathrm{KL}\big( p(\cdot \mid x)\, \|\, q(\cdot \mid \tilde x) \big) \big] = \mathbb E \big[ \mathrm{KL}\big( p(\cdot \mid x)\, \|\, p(\cdot \mid \tilde x) \big) \big] = I(X_t; X_{t-1} \mid \tilde X_{t-1}), \qquad (11)$$
where the expectation is under the joint law $p(x_{t-1})\, p(x_t \mid x_{t-1})\, r(\tilde x_{t-1} \mid x_{t-1})$. In particular, if the corruption is non-degenerate, then typically $I(X_t; X_{t-1} \mid \tilde X_{t-1}) > 0$, so conditioning on $\tilde x_{t-1}$ cannot attain the minimum of (9). Equality holds if and only if $\tilde X_{t-1}$ is a sufficient statistic for $X_{t-1}$ with respect to $X_t$, i.e., $X_t \perp X_{t-1} \mid \tilde X_{t-1}$; the degenerate case $\tilde X_{t-1} = X_{t-1}$ recovers the clean-conditioning optimum.

Figure 9. Two-layer quasigeostrophic turbulence with SNS variants.
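The gap in (11) is available in closed form for a linear-Gaussian toy chain. The sketch below uses our own illustrative parameters, not the paper's system: $X_{t-1} \sim \mathcal N(0, 1)$, $X_t = a X_{t-1} + \sigma_t \varepsilon$, and corruption $\tilde X_{t-1} = X_{t-1} + \sigma_c \eta$, for which $I(X_t; X_{t-1} \mid \tilde X_{t-1}) = \tfrac{1}{2} \log\big( \mathrm{Var}(X_t \mid \tilde X_{t-1}) / \mathrm{Var}(X_t \mid X_{t-1}) \big)$:

```python
import math

def mi_gap(a, sigma_t, sigma_c):
    """I(X_t; X_{t-1} | corrupted X_{t-1}) for the linear-Gaussian chain
    X_{t-1} ~ N(0,1), X_t = a*X_{t-1} + sigma_t*eps, Xtil = X_{t-1} + sigma_c*eta."""
    # Gaussian conditioning: Var(X_t | Xtil) = Var(X_t) - Cov(X_t, Xtil)^2 / Var(Xtil).
    var_given_corrupt = a**2 + sigma_t**2 - a**2 / (1.0 + sigma_c**2)
    var_given_clean = sigma_t**2        # Xtil adds nothing once X_{t-1} is known
    return 0.5 * math.log(var_given_corrupt / var_given_clean)

print(mi_gap(0.9, 0.3, 1.0))    # non-degenerate corruption: strictly positive gap
print(mi_gap(0.9, 0.3, 1e-8))   # corruption vanishes: the gap tends to 0
```

The gap grows with the corruption level $\sigma_c$ and vanishes as $\sigma_c \to 0$, matching the sufficiency condition stated above.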