Protein Autoregressive Modeling via Multiscale Structure Generation
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Leveraging the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue: it first forms a coarse topology, then refines structural details across scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Autoregressive models, however, suffer from exposure bias: a mismatch between the training and generation procedures that substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
💡 Research Summary
Protein Autoregressive Modeling (PAR) introduces the first multi‑scale autoregressive framework for generating protein backbone structures. Existing generative approaches for proteins—primarily diffusion or flow‑based models—operate at a single resolution and often discretize the continuous 3‑D coordinates into tokens, which can degrade fine‑grained fidelity. PAR overcomes these limitations by (1) downsampling a protein backbone into a hierarchy of coarse‑to‑fine representations, (2) using an autoregressive transformer to predict the next finer scale conditioned on all previously generated coarser scales, and (3) employing a flow‑based decoder that directly models continuous Cα atom positions conditioned on the transformer’s embeddings.
Multi‑scale downsampling: Given a backbone $x \in \mathbb{R}^{L \times 3}$, a deterministic operation $\mathrm{Down}(\cdot)$ creates a set of scales $\{x_1, x_2, \dots, x_n\}$, where each $x_i$ contains $\mathrm{size}_i$ 3‑D centroids obtained by linear interpolation along the sequence dimension. The scale set can be fixed (e.g., 64, 128, 256 residues) or proportionally defined (e.g., $L/4$, $L/2$, $L$). This hierarchical representation preserves pairwise distances and directional relationships, as proved in the supplementary material.
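To make the downsampling step concrete, here is a minimal sketch of interpolation-based coarsening of an $(L, 3)$ coordinate array. The function name `down` and the specific interpolation scheme are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def down(x: np.ndarray, size: int) -> np.ndarray:
    """Downsample an (L, 3) backbone to `size` centroids by linear
    interpolation along the sequence dimension (illustrative sketch)."""
    L = x.shape[0]
    # Fractional positions of the `size` centroids on the residue axis.
    pos = np.linspace(0, L - 1, size)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, L - 1)
    frac = (pos - lo)[:, None]
    return (1 - frac) * x[lo] + frac * x[hi]

# Build a coarse-to-fine hierarchy, e.g. proportional scales L/4, L/2, L.
backbone = np.random.default_rng(0).standard_normal((64, 3))
scales = [down(backbone, s) for s in (16, 32, 64)]
```

Note that at the finest scale the operation is the identity, so the hierarchy always terminates in the original backbone.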
Autoregressive transformer: A non‑equivariant transformer $T_\theta$ receives a concatenation of a learnable BOS token and upsampled versions of all previously generated scales (each upsampled to the current scale size). It outputs a scale‑wise conditioning vector $z_i = T_\theta(\mathrm{BOS}, \mathrm{Up}(x_1), \dots, \mathrm{Up}(x_{i-1}))$. By conditioning on the entire coarse context, the model captures bidirectional residue interactions that a naïve left‑to‑right token ordering would miss.
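The transformer's input at scale $i$ can be sketched as follows: each coarser scale is upsampled to the current size and concatenated behind a BOS token. The helper names `up` and `build_context`, and the use of raw coordinates in place of learned embeddings, are simplifying assumptions for illustration:

```python
import numpy as np

def up(x: np.ndarray, size: int) -> np.ndarray:
    """Linearly interpolate an (n, 3) scale to `size` points (sketch)."""
    n = x.shape[0]
    pos = np.linspace(0, n - 1, size)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = (pos - lo)[:, None]
    return (1 - frac) * x[lo] + frac * x[hi]

def build_context(prev_scales, cur_size, bos):
    """Concatenate a BOS token with every previous scale upsampled to the
    current scale's size -- the transformer input for predicting scale i."""
    tokens = [bos[None, :]]
    for s in prev_scales:
        tokens.append(up(s, cur_size))
    return np.concatenate(tokens, axis=0)

rng = np.random.default_rng(0)
bos = np.zeros(3)  # stands in for the learnable BOS embedding
prev = [rng.standard_normal((16, 3)), rng.standard_normal((32, 3))]
ctx = build_context(prev, 64, bos)  # shape (1 + 64 + 64, 3)
```

In the real model the tokens would be projected into a feature space before entering $T_\theta$; the point here is only the coarse-to-fine context layout.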
Flow‑based decoder: The decoder $v_\theta$ is a continuous normalizing flow (implemented via flow matching) that maps a standard Gaussian to the distribution of Cα coordinates at scale $i$. During training, each scale is corrupted with Gaussian noise $\epsilon_i$ and blended with a time variable $t_i$ to form $x_i^{t_i} = t_i x_i + (1 - t_i)\epsilon_i$. The loss minimizes the L2 distance between the flow output and the denoised target, while conditioning on $z_i$ through adaptive layer norm and a learned scale embedding. Sampling proceeds by integrating the ODE (or SDE) defined by the flow, sequentially from the coarsest to the finest scale.
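A minimal sketch of the flow-matching training pair implied by this interpolant: for $x^t = t\,x + (1-t)\,\epsilon$, the standard regression target for the velocity field is $x - \epsilon$ (the time derivative of the interpolant). This is the generic conditional flow-matching recipe, assumed here to match the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x: np.ndarray, t: float):
    """For the interpolant x_t = t*x + (1-t)*eps, the flow-matching
    regression target for the velocity field is (x - eps)."""
    eps = rng.standard_normal(x.shape)
    x_t = t * x + (1 - t) * eps
    target = x - eps
    return x_t, target

def l2_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred - target) ** 2))

x = rng.standard_normal((32, 3))  # C-alpha coordinates at one scale
x_t, target = flow_matching_pair(x, t=0.7)
# The decoder v_theta(x_t, t, z_i) would be trained to regress `target`;
# a perfect prediction drives the L2 loss to zero.
```

At $t = 1$ the interpolant recovers the clean structure, and at $t = 0$ it is pure Gaussian noise, which is what the coarse-to-fine sampler integrates away scale by scale.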
Mitigating exposure bias: Autoregressive models suffer when the training context (ground‑truth) differs from the inference context (model predictions). PAR introduces two remedies: (i) noisy‑context learning, where the transformer receives deliberately corrupted previous scales, encouraging robustness to imperfect inputs; and (ii) scheduled sampling, gradually replacing ground‑truth with model‑generated embeddings during training. These strategies substantially reduce error accumulation and improve generated structure quality.
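The two remedies can be sketched together. The summary does not specify the corruption strength or the sampling schedule, so the Gaussian noise level and the linear ramp below are illustrative assumptions (a linear ramp is one common scheduled-sampling choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_context(scale: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Noisy-context learning: perturb a ground-truth scale so the
    transformer learns to tolerate imperfect inputs (sigma is assumed)."""
    return scale + sigma * rng.standard_normal(scale.shape)

def scheduled_sampling_prob(step: int, total_steps: int) -> float:
    """Assumed linear schedule: the probability of feeding the model's own
    prediction instead of ground truth ramps from 0 to 1 over training."""
    return min(1.0, step / total_steps)

def pick_context(gt_scale, model_scale, step, total_steps):
    """Choose the context for the next training step."""
    if rng.random() < scheduled_sampling_prob(step, total_steps):
        return model_scale               # inference-like context
    return corrupt_context(gt_scale)     # noisy ground-truth context
```

Both paths expose the transformer to contexts that resemble its own imperfect generations, which is exactly the train/inference gap the paragraph above describes.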
Zero‑shot generalization and applications: Because the transformer learns a flexible conditional embedding, PAR can accept human‑provided prompts—such as desired topology, length constraints, or fixed structural motifs—without any fine‑tuning. In motif‑scaffolding experiments, the model successfully anchors a given motif and autoregressively fills the surrounding backbone, demonstrating practical utility for protein design pipelines.
Performance and scaling: On unconditional generation benchmarks, PAR achieves a Fréchet Protein Structure Distance (FPSD) of 161.0, comparable to state‑of‑the‑art diffusion models. Moreover, the authors observe a clear scaling law: increasing training compute consistently lowers FPSD, indicating that the model benefits from larger datasets and longer training. Sampling speed is also improved; the multi‑scale approach yields a ~2.5× speed‑up over single‑scale flow baselines because each coarse step reduces the dimensionality of subsequent refinements.
Limitations and future work: Currently PAR models only Cα atoms; side‑chain placement and full‑atom energetics must be added in downstream steps. The linear interpolation downsampling may become suboptimal for very long proteins, suggesting adaptive or learned downsampling strategies. Finally, the transformer is non‑equivariant, so rotational and translational invariance must be learned rather than built‑in; integrating equivariant architectures could further enhance data efficiency.
In summary, PAR contributes (1) a novel hierarchical autoregressive generation pipeline, (2) a seamless integration of flow‑based continuous decoding, (3) effective exposure‑bias mitigation, and (4) strong zero‑shot conditional generation and scaling behavior. These advances open a new avenue for fast, high‑fidelity protein backbone design, with promising extensions toward full‑atom modeling, functional constraint integration, and large‑scale protein engineering applications.