Representation Learning for Spatiotemporal Physical Systems
Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system’s evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system’s governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at https://github.com/helenqu/physical-representation-learning.
💡 Research Summary
This paper challenges the prevailing focus in machine learning for spatiotemporal physical systems on next‑frame prediction, which aims to build accurate surrogate models that can replace costly numerical solvers. While such pixel‑level autoregressive emulators can generate realistic future frames, they are expensive to train and suffer from error accumulation during long roll‑outs, limiting their usefulness for many scientific tasks that require a deeper understanding of the underlying dynamics.
To address this gap, the authors propose evaluating learned representations through downstream scientific tasks—specifically, the estimation of governing physical parameters of the system. Parameter inference provides a quantifiable proxy for how much physically relevant information a representation retains, because the parameters directly control the evolution of the PDEs governing the system.
The study compares two families of self‑supervised learning (SSL) methods: (1) masked autoencoding (MAE) applied to video (VideoMAE), which reconstructs masked pixel values, and (2) Joint Embedding Predictive Architectures (JEPAs), which predict future latent embeddings rather than raw pixels. JEPAs are trained with a VICReg‑style loss that balances similarity, variance, and covariance regularization, preventing representation collapse while encouraging the latent space to capture high‑level dynamics. Concretely, a context window of k frames is encoded, and a predictor network is tasked with producing the latent representation of the subsequent k frames.
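The VICReg-style objective described above can be sketched in a few lines. This is an illustrative PyTorch implementation under stated assumptions, not the authors' actual code: `z_pred` and `z_target` stand in for the predictor's output and the target encoder's latents, and the loss weights are the defaults from the original VICReg paper.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_pred, z_target, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """VICReg-style loss: similarity + variance + covariance terms.

    z_pred, z_target: (batch, dim) latent vectors. Weights are the
    defaults from the original VICReg paper (illustrative choice).
    """
    # Invariance term: predicted latents should match target latents.
    sim = F.mse_loss(z_pred, z_target)

    def var_cov(z):
        z = z - z.mean(dim=0)
        # Variance term: hinge keeps each latent dimension's std above 1,
        # which prevents collapse to a constant representation.
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        var = torch.mean(F.relu(1.0 - std))
        # Covariance term: penalize off-diagonal covariance so latent
        # dimensions stay decorrelated.
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return var, (off_diag ** 2).sum() / d

    var_p, cov_p = var_cov(z_pred)
    var_t, cov_t = var_cov(z_target)
    return sim_w * sim + var_w * (var_p + var_t) + cov_w * (cov_p + cov_t)
```

In a JEPA training step, `z_target` would be the (stop-gradient) encoding of the next k frames and `z_pred` the predictor's output given the encoded context window.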
Experiments are conducted on three PDE‑driven benchmarks from “The Well”: (i) active matter (rod‑like particles in a Stokes fluid) with dipole strength α and alignment strength ζ; (ii) shear flow (incompressible Navier‑Stokes) with Reynolds number Re and Schmidt number Sc; and (iii) Rayleigh‑Bénard convection with Rayleigh number Ra and Prandtl number Pr. For each dataset, the authors pre‑train separate models (JEPA, VideoMAE) from scratch, freeze the encoders, and fine‑tune a linear regression head to predict the physical parameters, measuring mean‑squared error (MSE).
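The freeze-and-probe protocol amounts to fitting a linear head on fixed features and scoring test MSE. A minimal NumPy sketch, assuming a generic `encode` callable that stands in for the frozen pretrained encoder (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def linear_probe_mse(encode, X_train, y_train, X_test, y_test):
    """Fit a linear regression head on frozen features; return test MSE.

    `encode` maps raw inputs (e.g., frame stacks) to fixed feature
    vectors; only the linear head is fit, via ordinary least squares.
    """
    Z_tr, Z_te = encode(X_train), encode(X_test)
    # Append a bias column so the probe has an intercept.
    A_tr = np.hstack([Z_tr, np.ones((len(Z_tr), 1))])
    A_te = np.hstack([Z_te, np.ones((len(Z_te), 1))])
    W, *_ = np.linalg.lstsq(A_tr, y_train, rcond=None)  # head only
    pred = A_te @ W
    return float(np.mean((pred - y_test) ** 2))
```

Because the encoder is frozen, probe accuracy reflects how much parameter-relevant information the pretraining objective placed in the representation, rather than the capacity of the downstream model.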
Results show that JEPAs consistently outperform VideoMAE across all three systems. On active matter, JEPA reduces MSE from 0.160 (MAE) to 0.079, a 51 % relative improvement; on shear flow, MSE drops from 0.67 to 0.38 (43 % improvement); on Rayleigh‑Bénard, MSE improves from 0.18 to 0.13 (28 %). The authors also benchmark against two physics‑specific models: DISCO (operator meta‑learning) and MPP (pixel‑level autoregressive surrogate). DISCO, which also operates in latent space, achieves performance comparable to JEPA (e.g., 0.057 vs 0.079 on active matter) and markedly better than MPP, which struggles even with end‑to‑end fine‑tuning (e.g., 0.59 MSE on shear flow).
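The relative-improvement figures quoted above follow from the standard formula (baseline − new) / baseline, which can be verified directly:

```python
def rel_improvement(baseline, new):
    """Relative improvement in percent: how much `new` reduces `baseline`."""
    return 100.0 * (baseline - new) / baseline

# MSE pairs (VideoMAE baseline, JEPA) from the summary above.
for system, (mae, jepa) in {
    "active matter": (0.160, 0.079),
    "shear flow": (0.67, 0.38),
    "Rayleigh-Benard": (0.18, 0.13),
}.items():
    # Prints 51%, 43%, and 28% respectively, matching the quoted figures.
    print(f"{system}: {rel_improvement(mae, jepa):.0f}% improvement")
```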
A key observation is the data‑efficiency of latent‑prediction models. When fine‑tuning on only 10 % of the available labeled data (≈3.2k examples), JEPA attains an MSE of 0.57 on shear flow—already better than VideoMAE’s best result with the full dataset (0.67). With 50 % of the data, JEPA reaches 0.40 MSE, only 5 % away from its full‑data optimum (0.38), whereas VideoMAE degrades more sharply (0.75 with 50 % of the data vs 0.67 with the full set). This demonstrates that latent‑space predictive learning yields representations that are both physically informative and sample‑efficient.
The paper’s analysis leads to two overarching conclusions. First, learning objectives that operate in latent space (JEPAs, DISCO) produce representations that retain more physically meaningful information than pixel‑level reconstruction or autoregressive objectives. Second, downstream scientific tasks such as parameter inference provide a more meaningful benchmark for representation quality than raw frame prediction, because they directly test whether the model has captured the governing dynamics rather than merely reproducing visual appearance.
Consequently, the authors argue for a shift in the field: instead of focusing on building ever more accurate next‑frame emulators, future work should explore self‑supervised latent‑prediction frameworks as a foundation for scientific machine learning. Such approaches promise better physical fidelity, lower computational cost, and greater robustness when only limited labeled data are available, opening a pathway toward more trustworthy AI tools for physics, engineering, and the natural sciences.