A Mechanistic Analysis of Transformers for Dynamical Systems

Reading time: 5 minutes
...

📝 Original Info

  • Title: A Mechanistic Analysis of Transformers for Dynamical Systems
  • ArXiv ID: 2512.21113
  • Date: 2025-12-24
  • Authors: Gregory Duthé, Nikolaos Evangelou, Wei Liu, Ioannis G. Kevrekidis, Eleni Chatzi

📝 Abstract

Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models, which benefit from well-established theoretical foundations, Transformer architectures are typically treated as black boxes. This gap becomes particularly relevant as attention-based models are considered for general-purpose or zero-shot forecasting across diverse dynamical regimes. In this work, we do not propose a new forecasting model, but instead investigate the representational capabilities and limitations of single-layer Transformers when applied to dynamical data. Building on a dynamical systems perspective we interpret causal self-attention as a linear, history-dependent recurrence and analyze how it processes temporal information. Through a series of linear and nonlinear case studies, we identify distinct operational regimes. For linear systems, we show that the convexity constraint imposed by softmax attention fundamentally restricts the class of dynamics that can be represented, leading to oversmoothing in oscillatory settings. For nonlinear systems under partial observability, attention instead acts as an adaptive delay-embedding mechanism, enabling effective state reconstruction when sufficient temporal context and latent dimensionality are available. These results help bridge empirical observations with classical dynamical systems theory, providing insight into when and why Transformers succeed or fail as models of dynamical systems.
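
One way to read the convexity claim above (written here in generic attention notation, which may not match the paper's exact formulation): a causal softmax attention output is a convex combination of the value vectors of past tokens, so it can never leave their convex hull.

```latex
% Causal softmax attention at step t (generic notation):
y_t \;=\; \sum_{s \le t} \alpha_{t,s}\, v_s,
\qquad
\alpha_{t,s} \;=\; \frac{\exp\!\big(q_t^\top k_s / \sqrt{d}\big)}
                        {\sum_{r \le t} \exp\!\big(q_t^\top k_r / \sqrt{d}\big)}.
% Since every \alpha_{t,s} \ge 0 and \sum_{s \le t} \alpha_{t,s} = 1,
% y_t is confined to the convex hull of {v_s : s <= t}; this is the
% restriction the abstract refers to as the convexity constraint.
```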


📄 Full Content

Understanding and modeling dynamical systems from data, in the form of observations, is a central problem in nonlinear science, with applications ranging from fluid mechanics and structural dynamics [1,35,40] to neuroscience, chemical kinetics, weather and power systems, and beyond [39,22,9,50]. Classical approaches rely either on explicit governing equations or on well-established data-driven identification frameworks, such as autoregressive and state-space models, for which stability, observability, and identifiability properties are well understood [7,28]. These frameworks provide a principled connection between data, latent state representations, and the underlying geometry of dynamical systems, including attractors and invariant manifolds.
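
For reference, the two classical families named above are usually written in the following generic forms (standard textbook notation, not taken from the paper):

```latex
% Autoregressive model of order p, AR(p): the next observation is a fixed
% linear combination of the p most recent observations plus noise.
y_t \;=\; \sum_{i=1}^{p} a_i\, y_{t-i} \;+\; \varepsilon_t
% Linear discrete-time state-space model: a latent state x_t evolves linearly
% and is observed only through the output map C.
x_{t+1} \;=\; A\,x_t + w_t, \qquad y_t \;=\; C\,x_t + v_t
```

Stability, observability, and identifiability statements for these models are phrased directly in terms of $A$, $C$, and the coefficients $a_i$, which is exactly the kind of theoretical handle that is missing for attention-based models.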

More recently, machine-learning architectures originally developed for sequence modeling have been increasingly applied to dynamical systems, particularly for purely data-driven inference, raising fundamental questions about their expressive power and their relationship to classical dynamical systems theory. Among these architectures, the Transformer model [49] has emerged as a dominant paradigm. Originally introduced for natural language processing, Transformers are now widely used in computer vision, speech processing, and scientific machine learning [26], including the modeling and forecasting of dynamical systems [21,45,19]. Their defining feature is the attention mechanism, which enables flexible aggregation of information across a temporal context through parallel rather than sequential computation. This property has led to strong empirical performance in step-ahead prediction tasks, including for nonlinear and weakly chaotic systems [48,10].
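
To make the "parallel rather than sequential" aggregation concrete, here is a minimal NumPy sketch of single-head causal self-attention applied to a toy scalar series; the embedding, widths, and random weights are illustrative assumptions, not the architecture studied in the paper.

```python
# Minimal sketch of single-head causal self-attention over a scalar series.
# Shapes, embedding, and random weights are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

T, d = 16, 8                        # context length, model width
x = np.sin(0.3 * np.arange(T))      # toy scalar time series

W_in = rng.normal(size=(1, d))      # scalar-to-vector token embedding
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

H = x[:, None] @ W_in               # (T, d) token embeddings
Q, K, V = H @ W_q, H @ W_k, H @ W_v

scores = Q @ K.T / np.sqrt(d)       # (T, T): all pairwise scores in parallel
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf              # causal mask: no attention to future steps

A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)  # softmax rows: nonnegative, sum to one

Y = A @ V                           # each output is a convex mix of past values
print(A.sum(axis=-1))               # all ones: convex-combination weights
```

Every score in the (T, T) matrix is computed at once, in contrast to the step-by-step update of a recurrent or state-space model, and each softmax row is a set of convex-combination weights over the history, which is the property the paper's analysis builds on.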

A growing body of work has explored the use of Transformers for dynamical and physical systems. Early studies demonstrated that attention-based models can learn surrogate evolution maps when provided with suitable spatiotemporal tokenizations. Geneva and Zabaras [21], for example, modeled diverse dynamical systems using a “vanilla” Transformer architecture, relying on Koopman-based embeddings to project high-dimensional states into lower-dimensional token representations. Subsequent work investigated the direct application of Transformers to chaotic time-series forecasting, showing that autoregressive prediction is feasible when the Lyapunov exponent is sufficiently low [48]. More recent efforts have extended these ideas towards large pretrained scientific foundation models. Aurora, for instance, is proposed as a foundation model for the Earth system, trained on heterogeneous atmospheric and oceanic datasets and equipped with an encoder-processor-decoder architecture to evolve a latent three-dimensional spatial representation forward in time [6]. These studies indicate that Transformers, or Transformer-like operator processors, can act as general temporal integrators across complex physical systems, often at substantially reduced computational cost compared to traditional numerical pipelines.
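
The condition that the Lyapunov exponent be "sufficiently low" can be made quantitative with the standard error-growth estimate for chaotic systems (a textbook relation, not a result derived in the paper):

```latex
% A small initial error grows roughly exponentially at the rate of the largest
% Lyapunov exponent \lambda_1, which bounds the useful autoregressive horizon:
\|\delta x(t)\| \;\approx\; \|\delta x(0)\|\, e^{\lambda_1 t}
\quad\Longrightarrow\quad
t_{\mathrm{pred}} \;\approx\; \frac{1}{\lambda_1}\,
\ln\!\frac{\Delta_{\mathrm{tol}}}{\|\delta x(0)\|},
% where \Delta_{tol} is the error level beyond which the forecast is no
% longer considered useful.
```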

In parallel, operator-style Transformer architectures have been developed specifically for scientific computing [43]. Poseidon introduces a multiscale operator Transformer pretrained on diverse fluid-dynamics PDE datasets and leverages time-conditioned layers together with semigroup training to enable continuous-in-time evaluation [24]. This places Transformers within the broader operator-learning lineage that includes Fourier and Graph Neural Operators. Related theoretical work has clarified connections between attention mechanisms and classical numerical integration or projection schemes. Li et al. [34,31] introduced the Fourier Neural Operator framework, which learns mappings between function spaces using spectral convolution kernels and can be interpreted as performing data-driven Galerkin projections. Building on this perspective, Cao et al. [8] showed that self-attention can be interpreted as a learnable integral operator, capable of recovering Fourier- or Galerkin-type behavior depending on positional encoding and kernelization. These results position attention mechanisms and neural operators within a shared theoretical space, as flexible integrators and solvers that may be nonlocal in space and potentially also in time.
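
For orientation, the core operation in the Fourier Neural Operator layers referenced above is a spectral convolution: transform to Fourier space, reweight a truncated set of low-frequency modes with learned coefficients, and transform back. The single-channel NumPy sketch below uses random placeholder weights and is only meant to illustrate the mechanism, not to reproduce the cited implementations.

```python
# Stripped-down sketch of an FNO-style spectral convolution on a 1-D signal.
# Real FNO layers act on multi-channel functions and learn the mode weights;
# here the weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)

n_grid, n_modes = 128, 16                  # grid points, retained Fourier modes
u = np.sin(2 * np.pi * np.linspace(0.0, 1.0, n_grid))   # toy input function

# "Learned" complex weights, one per retained low-frequency mode.
R = rng.normal(size=n_modes) + 1j * rng.normal(size=n_modes)

u_hat = np.fft.rfft(u)                     # to Fourier space
out_hat = np.zeros_like(u_hat)             # complex array of the same length
out_hat[:n_modes] = R * u_hat[:n_modes]    # reweight low modes, drop the rest

v = np.fft.irfft(out_hat, n=n_grid)        # back to physical space
print(u.shape, v.shape)                    # (128,) (128,): same grid in and out
```

Stacking such layers with pointwise nonlinearities and channel mixing yields the full operator; the result of Cao et al. [8] cited above interprets the softmax attention kernel as an analogous learnable integral kernel.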

A second, rapidly growing stream concerns time-series foundation models. Chronos treats time series as tokenized sequences via scaling and quantization and reuses T5-style Transformers to obtain zero-shot probabilistic forecasts across domains [2]. Subsequent models, including Chronos-Bolt, improved speed and accuracy and extended the foundation-model approach to multivariate systems, reinforcing the view that a single pretrained Transformer can generalize across dynamical regimes provided the data are cast into a language-like format [3]. This paradigm aligns closely with recent zero-shot and universal forecasting studies for chaotic systems [51,33,23], as well as with position papers calling ...
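
The "scaling and quantization" step that turns a real-valued series into language-like tokens can be sketched as follows; the mean-scaling rule, clipping range, and vocabulary size are illustrative assumptions rather than Chronos's exact configuration.

```python
# Rough sketch of a Chronos-style tokenization: rescale the series by its mean
# absolute value, then quantize into a fixed vocabulary of uniform bins.
import numpy as np

def tokenize(series, n_bins=512, clip=5.0):
    """Mean-scale the series, clip, and map values to integer bin ids."""
    scale = np.mean(np.abs(series)) + 1e-8        # mean scaling
    scaled = np.clip(series / scale, -clip, clip)
    edges = np.linspace(-clip, clip, n_bins - 1)  # bin boundaries
    return np.digitize(scaled, edges), scale      # ids in [0, n_bins - 1]

def detokenize(tokens, scale, n_bins=512, clip=5.0):
    """Map bin ids back to representative values and undo the scaling."""
    centers = np.linspace(-clip, clip, n_bins)
    return centers[tokens] * scale

y = 10.0 * np.sin(0.1 * np.arange(100)) + 50.0     # toy series
ids, s = tokenize(y)
y_rec = detokenize(ids, s)
print(ids[:5], float(np.max(np.abs(y - y_rec))))   # approximate round trip
```

Once the series is a sequence of integer ids, a standard sequence model can be trained and sampled exactly as in language modeling, which is what enables the zero-shot, cross-domain usage described above.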


Reference

This content is AI-processed based on open access ArXiv data.
