LASS-ODE: Scaling ODE Computations to Connect Foundation Models with Dynamical Physical Systems
Foundation models have transformed language, vision, and time-series analysis, yet progress on dynamics prediction for physical systems remains limited. Given the complexity of physical constraints, two challenges stand out. $(i)$ Physics-computation scalability: physics-informed learning can enforce physical regularization, but its computation (e.g., ODE integration) does not scale to large collections of systems. $(ii)$ Knowledge-sharing efficiency: the attention mechanism is computed primarily within each system, which limits the extraction of shared ODE structures across systems. We show that enforcing ODE consistency does not require expensive nonlinear integration: a token-wise locally linear ODE representation preserves physical fidelity while scaling to foundation-model regimes. First, we propose novel token representations that respect locally linear ODE evolution; such linearity substantially accelerates integration while accurately approximating the local data manifold. Second, we introduce a simple yet effective inter-system attention that augments attention with a common structure hub (CSH), which stores shared tokens and aggregates knowledge across systems. The resulting model, termed LASS-ODE (\underline{LA}rge-\underline{S}cale \underline{S}mall \underline{ODE}), is pretrained on our $40$GB collection of ODE trajectories to enable strong in-domain performance, zero-shot generalization across diverse ODE systems, and further improvements through fine-tuning.
💡 Research Summary
LASS-ODE tackles two major obstacles that have prevented foundation‑model‑scale learning for physical dynamical systems: (1) the computational bottleneck of physics‑aware training, and (2) the inefficiency of knowledge sharing across heterogeneous ODE systems.
The authors observe that enforcing ODE consistency with traditional Neural ODE or PINN approaches requires costly nonlinear integration, which does not scale to billions of training samples. Their key insight is that ODE consistency can be achieved without full integration by representing the dynamics locally as a linear system on a per‑token basis. The continuous trajectory is split into many short time intervals; each interval becomes a token whose internal dynamics are approximated by a first‑order Taylor expansion (a matrix A and bias b). Because the interval is short, the linear approximation faithfully tracks the true nonlinear flow, yet the integration reduces to a simple matrix‑vector multiplication, eliminating the need for expensive ODE solvers during training and inference.
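The per-token linear update can be sketched as follows. This is a minimal illustration (names and the augmented-matrix trick are ours, not from the paper): for a short interval governed by $x' = Ax + b$, folding the bias into an augmented state makes the exact flow a single matrix-vector product, with no iterative solver.

```python
import numpy as np
from scipy.linalg import expm

def linear_token_step(A, b, x0, dt):
    """Exact flow of x' = A x + b over one short token interval.

    The bias b is folded into an augmented state [x; 1], so the update
    reduces to one matrix-vector product once the transition matrix is
    computed. (Illustrative sketch; names are not from the paper.)
    """
    d = A.shape[0]
    M = np.zeros((d + 1, d + 1))
    M[:d, :d] = A
    M[:d, d] = b
    Phi = expm(M * dt)            # precomputable once per token
    x_aug = np.append(x0, 1.0)
    return (Phi @ x_aug)[:d]
```

Because each interval is short, even this closed-form step (or a cheap first-order approximation of it) tracks the nonlinear flow closely.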
To enable cross‑system knowledge transfer, LASS‑ODE introduces a Common Structure Hub (CSH). CSH consists of a small set of global tokens that are concatenated to every system’s token sequence. Standard self‑attention is then applied over the augmented sequence, allowing each system’s tokens to attend to the shared hub tokens. This design captures common dynamical motifs—such as Hamiltonian conservation, damping patterns, or forcing structures—without adding specialized cross‑attention modules or retrieval pipelines, and incurs only a fixed, modest memory overhead.
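The hub mechanism amounts to concatenation followed by ordinary self-attention. A single-head numpy sketch (weight names and shapes are illustrative, not from the paper):

```python
import numpy as np

def hub_attention(tokens, hub, Wq, Wk, Wv):
    """Self-attention over a system's tokens augmented with shared hub
    tokens (illustrative sketch; names are not from the paper).

    tokens: (seq, dim) per-system tokens; hub: (n_hub, dim) tokens
    shared by all systems.
    """
    x = np.concatenate([hub, tokens], axis=0)       # augmented sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    out = w @ v
    return out[hub.shape[0]:]                       # keep only system tokens
```

Since the hub is a fixed number of extra tokens, the memory overhead is constant regardless of how many systems share it.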
Data handling is carefully engineered for heterogeneity. Each state channel is processed independently (channel‑wise independence), enabling the model to accept systems with varying dimensionalities. A time‑aware GRU ingests raw observations together with the normalized time‑step Δt, producing a global temporal embedding per channel. Token embeddings are then modulated by a Radial Basis Function (RBF) time encoding that encodes the token’s start time. The RBF vector is passed through layer‑norm and an MLP to generate scaling (γ) and shifting (β) parameters, which multiplicatively modulate the GRU embedding, ensuring strong locality in time. A learnable channel‑encoding lookup table adds a unique identifier for each channel, allowing the model to distinguish different state components even when they are processed independently.
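The time-modulation step can be sketched as below, with a plain linear map standing in for the layer-norm + MLP; all parameter names here are illustrative assumptions, not from the paper.

```python
import numpy as np

def time_modulated_embedding(h, t0, centers, width, W, b):
    """Sketch of the RBF time modulation: encode a token's start time t0
    with Gaussian RBFs, map the code to scaling (gamma) and shifting
    (beta) parameters, and modulate the temporal embedding h.
    (Illustrative; a linear layer stands in for the layer-norm + MLP.)
    """
    phi = np.exp(-((t0 - centers) ** 2) / (2 * width ** 2))  # RBF code
    out = W @ phi + b                  # stand-in for layer-norm + MLP
    gamma, beta = np.split(out, 2)
    return gamma * h + beta            # scale-and-shift modulation
```

Because each RBF is localized around its center, tokens whose start times are far apart receive distinct modulation, which is what enforces locality in time.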
The transformer backbone incorporates intra‑system self‑attention (without causal masking) to capture coupling across channels and across all tokens, and inter‑system attention via the CSH. A Mixture‑of‑Experts (MoE) layer further improves parameter efficiency at large scale.
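The MoE idea can be illustrated with a minimal top-1 routing sketch (ours, not the paper's implementation): each token activates only one expert, so parameter count grows without a matching growth in per-token compute.

```python
import numpy as np

def moe_layer(x, gate_W, experts):
    """Minimal top-1 Mixture-of-Experts routing (illustrative sketch):
    a gate scores each token against the experts, and each token is
    processed only by its highest-scoring expert.
    """
    logits = x @ gate_W                  # (tokens, n_experts) gate scores
    choice = logits.argmax(axis=1)       # top-1 expert per token
    out = np.empty_like(x)
    for i, e in enumerate(choice):
        out[i] = experts[e](x[i])        # sparse expert application
    return out
```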
The decoder reconstructs full trajectories by first predicting token-wise linear ODE parameters from the processed embeddings and then solving a piecewise-linear ODE over the entire normalized interval.
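The piecewise decode can be sketched by chaining one locally linear update per token. A single Euler step per interval is used here for brevity (the short intervals keep the error small); the paper's exact solver may differ, and all names are illustrative.

```python
import numpy as np

def decode_trajectory(x0, params, dt):
    """Roll the state forward through a sequence of per-token linear
    ODE parameters (A_i, b_i), one short interval at a time.
    (Illustrative sketch; one Euler step stands in for the exact flow.)
    """
    xs, x = [x0], x0
    for A, b in params:
        x = x + dt * (A @ x + b)   # one locally linear update
        xs.append(x)
    return np.stack(xs)            # (n_tokens + 1, state_dim)
```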