Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows
Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an “illusion of mastery”, as recent high-profile commentaries have repeatedly emphasized: existing evaluations rely heavily on simplified, low-dimensional proxies, which fail to expose the models’ inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.
💡 Research Summary
The paper introduces REALM (Realistic AI Learning for Multiphysics), a comprehensive benchmarking suite designed to evaluate neural surrogate models on truly challenging multiphysics flow problems. Recognizing that most existing evaluations rely on low‑dimensional, simplified proxies, the authors assemble 11 high‑fidelity datasets that span canonical PDE‑ODE coupling, high‑Mach reacting flows, propulsion engine combustors, and fire‑hazard scenarios. These datasets cover both 2‑D and 3‑D domains, regular and irregular meshes, cell counts from 2⁴ to 1.2 × 10⁷, and up to 40 physical channels (density, temperature, velocity components, pressure, and dozens of species mass fractions). Each trajectory is generated by state‑of‑the‑art CFD/DNS simulations, requiring hundreds to thousands of CPU/GPU hours, thereby placing the benchmark in a regime where acceleration would be practically valuable.
REALM defines a unified training and evaluation pipeline. First, a multi‑scale preprocessing step compresses the dynamic range of species mass fractions using a Box‑Cox transform and then applies a global z‑score normalization across all channels. The surrogate’s task is to learn a time‑advancing operator that maps the full field at time t to the field at t + Δt on the native mesh. Training uses a short‑horizon autoregressive scheme, back‑propagating loss only through the final step to improve long‑term stability under stiff reaction‑transport coupling while keeping computational cost low. Crucially, the pipeline is architecture‑agnostic, allowing spectral/Fourier operators (FNO), DeepONet, convolutional backbones, transformer‑style models, point‑wise MLPs, and mesh/graph neural networks to be trained under identical conditions.
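The described preprocessing, a Box‑Cox transform on species mass fractions followed by a global per‑channel z‑score, can be sketched as below. This is a minimal NumPy illustration, not the paper's reference implementation: the function names, the Box‑Cox parameter `lam=0.1`, and the clipping floor `eps` are illustrative assumptions, since the paper's exact choices are not stated here.

```python
import numpy as np

def box_cox(x, lam=0.1, eps=1e-30):
    """Box-Cox transform to compress the dynamic range of species mass
    fractions, which span many orders of magnitude in stiff chemistry.
    lam and eps are illustrative choices, not values from the paper."""
    x = np.clip(x, eps, None)  # mass fractions can be exactly zero
    if lam == 0.0:
        return np.log(x)
    return (x**lam - 1.0) / lam

def fit_zscore(train_fields):
    """Per-channel mean/std over the training set.
    train_fields: array of shape (samples, channels, cells)."""
    mean = train_fields.mean(axis=(0, 2), keepdims=True)
    std = train_fields.std(axis=(0, 2), keepdims=True) + 1e-8
    return mean, std

def preprocess(fields, species_idx, mean, std, lam=0.1):
    """Box-Cox the species channels, then z-score all channels."""
    out = fields.copy()
    out[:, species_idx] = box_cox(out[:, species_idx], lam)
    return (out - mean) / std

# Toy example: 4 snapshots, 6 channels (2 species + rho, T, u, p), 100 cells.
rng = np.random.default_rng(0)
fields = np.abs(rng.normal(size=(4, 6, 100)))
species_idx = [0, 1]
fields[:, species_idx] *= 1e-6  # species fractions are tiny

# Fit normalization statistics on the (already transformed) training data.
transformed = fields.copy()
transformed[:, species_idx] = box_cox(transformed[:, species_idx])
mean, std = fit_zscore(transformed)

z = preprocess(fields, species_idx, mean, std)
print(z.shape)  # (4, 6, 100)
```

After this step, all channels, including species fractions that originally lived near 10⁻⁶, share a comparable scale, which is what makes a single surrogate loss across 40 heterogeneous channels well conditioned.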
The authors benchmark more than a dozen representative models, controlling for parameter count and FLOPs. Evaluation metrics include standard statistical errors (L2, R²), physical conservation checks (mass, energy, species), reconstruction of key phenomena (flame front location, pressure peaks, total heat release), and rollout error growth over multiple steps. Three robust trends emerge:
- Scaling Barrier – Dimensionality, stiffness, and mesh irregularity jointly create a scaling barrier. As the problem moves from 2‑D to 3‑D and chemical time scales shrink to 10⁻¹² s, rollout errors explode exponentially, indicating that current surrogates cannot simply be scaled up by adding parameters.
- Inductive Bias Dominates – Architectural inductive biases, not raw parameter count, dictate performance. Spectral operators excel at preserving high‑frequency content in stiff reaction zones, while graph‑based models handle irregular meshes more robustly. Purely parameter‑heavy models without appropriate bias underperform.
- Accuracy‑Trust Gap – High correlation metrics (e.g., R² > 0.95) do not guarantee physically trustworthy predictions. Models with strong statistical scores still miss critical transient structures, such as flame front propagation speed or peak over‑pressures, leading to 20‑30 % errors in integral quantities.
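The accuracy–trust gap above can be made concrete with a toy example: a prediction that smooths away a sharp transient can still score a high R² because the smooth background dominates the variance, while the peak quantity is badly wrong. The sketch below is purely illustrative (the signal, the moving‑average "surrogate", and all parameter values are invented for the demonstration and do not come from the paper).

```python
import numpy as np

# True 1-D "field": smooth background plus a sharp transient peak,
# a stand-in for an over-pressure spike or thin flame front.
x = np.linspace(0.0, 1.0, 2001)
truth = np.sin(2 * np.pi * x) + 2.0 * np.exp(-((x - 0.5) / 0.004) ** 2)

# Crude stand-in surrogate: moving-average smoothing, which keeps the
# large-scale field but smears the narrow peak (illustrative only).
k = 101
pred = np.convolve(truth, np.ones(k) / k, mode="same")

# Standard statistical score (coefficient of determination).
ss_res = np.sum((truth - pred) ** 2)
ss_tot = np.sum((truth - truth.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Physics-relevant score: relative error in the peak value.
peak_err = abs(pred.max() - truth.max()) / truth.max()
print(f"R^2 = {r2:.3f}, relative peak error = {100 * peak_err:.0f}%")
```

Running this gives an R² well above 0.9 alongside a peak error of tens of percent, the same qualitative mismatch REALM reports between nominal accuracy and physically trustworthy behavior, which is why the benchmark pairs statistical metrics with conservation checks and phenomenon‑specific diagnostics.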
These findings suggest that future neural surrogates for realistic multiphysics must embed physical conservation laws, adopt multi‑resolution or adaptive mesh strategies, and be evaluated with physics‑aware metrics rather than solely statistical ones. The REALM framework itself provides a reproducible, application‑driven testbed that can drive the development of next‑generation physics‑aware architectures, moving the field beyond the “illusion of mastery” that has plagued prior benchmark efforts.