Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.


💡 Research Summary

The paper introduces STAR‑MD (Spatio‑Temporal Autoregressive Rollout for Molecular Dynamics), a novel generative framework that can produce physically plausible protein trajectories spanning microsecond timescales, even for large proteins. The authors identify three major shortcomings of existing data‑driven MD acceleration methods: (1) spatial and temporal information are processed separately, limiting the ability to capture coupled motions; (2) video‑style diffusion models either denoise all frames simultaneously or use fixed‑size windows, leading to prohibitive memory and compute costs for long sequences; (3) long‑horizon autoregressive rollouts suffer from error accumulation, causing structural collapse.

STAR‑MD addresses these issues with three key technical contributions. First, it employs a causal diffusion transformer equipped with joint spatio‑temporal (S × T) attention. Each token represents a residue‑frame pair (i, ℓ), and 2‑D rotary position embeddings encode both the residue index i and the frame index ℓ, allowing the model to directly learn non‑separable relationships such as how motion at one site at an earlier time influences distant residues at later times. This joint attention replaces the common “space‑then‑time” factorization and provides a more expressive representation of protein dynamics.
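To make the residue‑frame tokenization concrete, here is a minimal NumPy sketch of one common way to build a 2‑D rotary embedding. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how the two axes share the channel dimension, so the axial split (first half of the channels rotated by residue index, second half by frame index), the frequency base of 10000, and the helper names `rope_1d`, `rope_2d`, and `joint_tokens` are all hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles proportional to pos.

    x: (tokens, d) with d even; pos: (tokens,) integer positions.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per channel pair
    ang = pos[:, None] * freqs[None, :]         # (tokens, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, residue_idx, frame_idx):
    """Assumed axial 2-D RoPE: first half of the channels encodes the residue
    index, second half the frame index (channel count divisible by 4)."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :half], residue_idx), rope_1d(x[:, half:], frame_idx)],
        axis=-1,
    )

def joint_tokens(n_res, n_frames):
    """Flatten the (frame, residue) grid into one token sequence, frame-major."""
    frame_idx, residue_idx = np.meshgrid(
        np.arange(n_frames), np.arange(n_res), indexing="ij"
    )
    return residue_idx.ravel(), frame_idx.ravel()
```

With queries and keys rotated this way, the dot product between token (i, ℓ) and token (j, m) depends on position only through the offsets (i − j, ℓ − m), which is what lets a single attention layer relate residues across both space and time.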

Second, the architecture is designed for scalability. During training, clean historical frames and the current noisy frame are concatenated, and a block‑wise causal attention pattern ensures that each noisy frame attends only to earlier clean frames, preserving the autoregressive structure while still enabling parallel processing of the entire sequence in a single forward pass. At inference time, previously generated clean frames are cached as key‑value pairs (KV‑cache), so that each new frame can be denoised by attending only to O(N L) single‑residue features rather than O(N² L) pairwise states, where N is the number of residues and L the number of frames. This reduces cache memory from quadratic to linear in the number of residues (it is linear in the number of frames in both cases), and eliminates the cubic spatial cost (O(N³ L)) typical of Pairformer‑based models.
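The block‑wise causal pattern and the cache‑size comparison can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes the training sequence is laid out as all clean frames followed by all noisy frames, that noisy frame ℓ sees clean frames before ℓ plus its own residues, and that clean frame ℓ sees clean frames up to and including ℓ. The function names `blockwise_causal_mask` and `cache_elements` and the feature width `d` are hypothetical.

```python
import numpy as np

def blockwise_causal_mask(n_frames, n_res):
    """Boolean attention mask (True = may attend) over the assumed layout
    [clean frame 0..L-1 | noisy frame 0..L-1], each frame holding n_res tokens."""
    L = n_frames
    frame = np.arange(L)
    clean_to_clean = frame[:, None] >= frame[None, :]  # clean l sees clean <= l
    noisy_to_clean = frame[:, None] > frame[None, :]   # noisy l sees clean < l
    noisy_to_noisy = np.eye(L, dtype=bool)             # noisy l sees only itself
    clean_to_noisy = np.zeros((L, L), dtype=bool)      # clean never sees noisy
    frame_mask = np.block([[clean_to_clean, clean_to_noisy],
                           [noisy_to_clean, noisy_to_noisy]])
    # Expand every frame-level entry into an n_res x n_res block of tokens.
    return np.repeat(np.repeat(frame_mask, n_res, axis=0), n_res, axis=1)

def cache_elements(n_res, n_frames, d, pairwise):
    """Count cached scalars: O(N^2 L) for pairwise states, O(N L) for
    single-residue features (d is an assumed per-entry feature width)."""
    per_frame = n_res ** 2 if pairwise else n_res
    return per_frame * n_frames * d
```

For example, with 100 residues, 10 cached frames, and a feature width of 8, `cache_elements` counts 8,000 scalars for single‑residue features against 800,000 for pairwise states, a factor of N = 100.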

Third, to mitigate drift in long rollouts, the authors introduce contextual noise perturbation. During training, historical frames are deliberately corrupted with a small amount of forward diffusion noise (τ sampled from U

