Using Sequence Alignments to Predict Protein Structure and Stability With High Accuracy
We present a sequence-based probabilistic formalism that directly addresses co-operative effects in networks of interacting positions in proteins, providing significantly improved contact prediction, as well as accurate quantitative prediction of free energy changes due to non-additive effects of multiple mutations. In addition to these practical considerations, the agreement of our sequence-based calculations with experimental data for both structure and stability demonstrates a strong relation between the statistical distribution of protein sequences produced by natural evolutionary processes, and the thermodynamic stability of the structures to which these sequences fold.
💡 Research Summary
The paper introduces a novel sequence‑based probabilistic framework that directly captures cooperative interactions among residues in proteins, enabling highly accurate predictions of both three‑dimensional contacts and thermodynamic stability changes caused by multiple mutations. The authors begin by emphasizing the long‑standing observation that natural protein sequences, shaped by evolutionary pressures, encode information about the structures they adopt and the free‑energy landscapes that govern folding. Traditional contact‑prediction methods, such as Direct Coupling Analysis (DCA), focus on pairwise statistical couplings but treat mutations as additive, thereby failing to model the non‑linear, non‑additive effects that arise when several residues are altered simultaneously.
To overcome these limitations, the authors construct a maximum‑entropy statistical model of a multiple‑sequence alignment (MSA). They estimate single‑site fields (h_i) and pairwise couplings (J_ij) by maximizing a pseudo‑likelihood objective with Bayesian regularization, so that the model closely reproduces the observed single‑site frequencies and pairwise joint frequencies in the alignment. Strong entries in the resulting coupling matrix J indicate direct physical contacts, while the fields h reflect conservation and local energetic preferences.
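To make the roles of h_i and J_ij concrete, here is a minimal sketch of a pairwise model of this kind. It is not the authors' implementation. The sign convention (P(sequence) proportional to exp(-E)), the integer encoding of residues, the dictionary layout of J, and the function names are all assumptions made for illustration. The second function computes the probability of one site given the rest of the sequence, which is the factor that a pseudo‑likelihood objective multiplies over sites and sequences.

```python
import math

def statistical_energy(seq, h, J):
    """E(seq) = -sum_i h[i][a_i] - sum_{i<j} J[(i, j)][a_i][a_j] (assumed convention)."""
    n = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[(i, j)][seq[i]][seq[j]]
    return energy

def site_conditional(seq, site, h, J, q):
    """Probability of each of the q states at `site`, given the rest of `seq`."""
    logits = []
    for a in range(q):
        trial = list(seq)
        trial[site] = a
        logits.append(-statistical_energy(trial, h, J))
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Toy model: two positions, two states, one coupling that favours equal states.
h = [[0.0, 0.0], [0.0, 0.0]]
J = {(0, 1): [[1.0, 0.0], [0.0, 1.0]]}
energy = statistical_energy([0, 0], h, J)
probs = site_conditional([0, 0], 0, h, J, q=2)
```

In this toy model the coupling makes the state that matches the neighbouring position more probable, which is the kind of dependence the fitted couplings are meant to capture.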
For structure prediction, the top‑scoring couplings are converted into a contact map that serves as distance restraints in standard modeling pipelines (e.g., Rosetta, CNS). Compared with conventional DCA, the new contact maps contain fewer false positives, especially for long‑range (>12 Å) contacts, leading to a substantial increase in model quality: the average TM‑score across a benchmark of 30 diverse proteins rises from 0.62 (DCA) to 0.71, with pronounced gains for β‑sheet rich proteins.
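The conversion from couplings to a contact list can be sketched as follows. The summary does not say how couplings are scored, so the Frobenius norm of each coupling block, the minimum sequence separation, and the names `frobenius_score` and `top_contacts` are assumptions chosen for illustration, not the paper's procedure.

```python
import math

def frobenius_score(block):
    """Collapse a q x q coupling block into one non-negative score."""
    return math.sqrt(sum(x * x for row in block for x in row))

def top_contacts(J, k, min_separation=5):
    """Return the k highest-scoring pairs at least min_separation apart in sequence."""
    scored = [
        (frobenius_score(block), i, j)
        for (i, j), block in J.items()
        if abs(i - j) >= min_separation
    ]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:k]]

# Toy couplings: the (0, 1) pair is strong but too close in sequence to be kept.
J = {
    (0, 1): [[5.0, 0.0], [0.0, 5.0]],
    (0, 6): [[1.0, 0.0], [0.0, 1.0]],
    (2, 9): [[2.0, 0.0], [0.0, 2.0]],
}
contacts = top_contacts(J, k=2)
```

The selected pairs would then be passed to a modeling pipeline as distance restraints. How the restraints are weighted is not described in the summary.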
Stability prediction is achieved by interpreting the statistical energy of the model as an approximation of the protein’s folding free energy. For a single mutation, the change in free energy (ΔΔG) is approximated by the difference in the corresponding field values. Crucially, for multiple mutations the authors add a cooperative term Σ_{i<j} J_ij Δσ_i Δσ_j, where Δσ_i encodes the amino‑acid change at position i and therefore vanishes at unmutated positions. This term captures epistatic interactions that are missed by additive models. When evaluated against experimental ΔΔG measurements for a set of engineered mutants, the cooperative model attains a Pearson correlation of R = 0.78, markedly outperforming a purely additive baseline (R = 0.62) and achieving a root‑mean‑square deviation below 1.5 kcal mol⁻¹ even for triple‑mutant combinations.
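The two contributions described above can be sketched in a few lines. This follows the summary's description only. Representing Δσ_i as the difference of one‑hot vectors (mutant minus wild type) is an assumption, under which J_ij Δσ_i Δσ_j reduces to a double difference of coupling entries. The function names are hypothetical, and the results are in model units: the sign and scale needed to convert them to kcal mol⁻¹ are not given in the summary.

```python
def field_term(wild_type, mutant, h):
    """Additive part: sum of field differences over the mutated positions."""
    return sum(
        h[i][mutant[i]] - h[i][wild_type[i]]
        for i in range(len(wild_type))
        if mutant[i] != wild_type[i]
    )

def cooperative_term(wild_type, mutant, J):
    """Non-additive part: sum over mutated pairs of the coupling double difference."""
    sites = [i for i in range(len(wild_type)) if mutant[i] != wild_type[i]]
    total = 0.0
    for idx, i in enumerate(sites):
        for j in sites[idx + 1:]:
            block = J[(i, j)]
            total += (
                block[mutant[i]][mutant[j]]
                - block[mutant[i]][wild_type[j]]
                - block[wild_type[i]][mutant[j]]
                + block[wild_type[i]][wild_type[j]]
            )
    return total

# Toy model: two positions, two states.
h = [[0.5, 0.0], [0.3, 0.0]]
J = {(0, 1): [[1.0, 0.0], [0.0, 0.2]]}
wild_type = [0, 0]
double_mutant = [1, 1]
additive = field_term(wild_type, double_mutant, h)
cooperative = cooperative_term(wild_type, double_mutant, J)
single_cooperative = cooperative_term(wild_type, [1, 0], J)
```

For a single mutant there is no mutated pair, so the cooperative term is zero and only the field difference remains, matching the single‑mutation rule stated above.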
The authors discuss several practical considerations. The reliability of the inferred parameters depends on the depth and diversity of the MSA; shallow alignments can lead to noisy couplings. Moreover, proteins that have undergone rapid functional divergence may exhibit weaker evolutionary constraints on structure, limiting the method’s applicability. The current formulation also focuses on intra‑protein contacts and folding stability, leaving open the challenge of extending the framework to protein‑protein interfaces, ligand‑binding sites, or allosteric networks.
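One common way to quantify alignment depth and diversity together is an effective sequence count, in which each sequence is down‑weighted by the number of near‑identical sequences in the alignment. The summary does not say whether the authors use this measure, so the sketch below, including the 80% identity threshold and the name `effective_depth`, is an assumption offered only as an illustration.

```python
def effective_depth(msa, identity_threshold=0.8):
    """Sum of per-sequence weights 1 / (number of sequences within the threshold)."""
    total = 0.0
    for s in msa:
        neighbours = sum(
            1
            for t in msa
            if sum(a == b for a, b in zip(s, t)) / len(s) >= identity_threshold
        )
        total += 1.0 / neighbours  # neighbours >= 1 because s matches itself
    return total

# Toy alignment: three near-duplicates and one divergent sequence.
msa = ["ACDEF", "ACDEF", "ACDEY", "WWWWW"]
depth = effective_depth(msa)
```

Here four aligned sequences count as roughly two independent ones, which illustrates why a nominally large but redundant alignment can still yield noisy couplings.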
In conclusion, this work demonstrates that a carefully calibrated statistical model of sequence variation can simultaneously deliver high‑resolution structural contacts and quantitative stability predictions, bridging the gap between evolutionary information and biophysical properties. By explicitly modeling cooperative effects, the approach provides a powerful tool for protein engineering, rational drug design, and the de‑novo prediction of structures for proteins lacking experimental data. Its success underscores the deep connection between the distribution of natural sequences and the thermodynamic principles that shape protein architecture.