Protein Structure Prediction by Protein Alignments
Proteins are the basic building blocks of life. They usually perform their functions by folding into a particular structure. Understanding the folding process could help researchers understand protein function, develop supplemental proteins for people with deficiencies, and gain more insight into diseases associated with proteins that misfold. Experimental methods for determining structure are both expensive and time-consuming. In this thesis I introduce a new machine-learning-based method for predicting protein structure. The new method improves performance in two directions: creating accurate protein alignments and predicting accurate protein contacts. First, I present MRFalign, an alignment framework that goes beyond state-of-the-art methods: it uses Markov Random Fields (MRFs) to model a protein family and aligns two proteins by aligning their two MRFs. Unlike other methods, which can model only local-range residue correlation, MRFs can model long-range residue interactions and thus encode global information in a protein. Second, I present a Group Graphical Lasso method for contact prediction that integrates joint multi-family Evolutionary Coupling (EC) analysis and supervised learning to improve accuracy on proteins without many sequence homologs. Unlike single-family EC analysis, which uses residue co-evolution information from only the target protein family, our joint EC analysis uses residue co-evolution in both the target family and its related families, which may have divergent sequences but similar folds. Our method can also integrate supervised learning methods to further improve accuracy. We evaluate both methods, including each of their components, on large public benchmarks. Experiments show that our methods achieve better accuracy than existing state-of-the-art methods under all measurements on most protein classes.
💡 Research Summary
The manuscript introduces a two‑pronged machine‑learning framework that simultaneously advances protein sequence alignment and inter‑residue contact prediction, two foundational steps in computational protein structure determination. The first component, MRFalign, models each protein family as a Markov Random Field (MRF). In this representation, nodes correspond to residues and edges encode pairwise statistical couplings derived from multiple‑sequence alignments (MSAs). Aligning two proteins therefore becomes the problem of finding a mapping between two MRFs that minimizes a global energy function. The authors solve this optimization by combining variational Bayesian inference for parameter estimation with a dynamic‑programming‑based alignment algorithm that efficiently explores the combinatorial space. Because MRFs capture long‑range dependencies, MRFalign can exploit global structural signals that are invisible to conventional profile‑profile or HMM‑HMM methods, which rely mainly on local conservation. Benchmarks on low‑identity pairs (≤30 % sequence identity) show a rise in TM‑score from ~0.68 (state‑of‑the‑art methods such as HHsearch and DeepAlign) to ~0.73, indicating markedly improved alignment quality.
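To make the dynamic-programming step concrete, here is a minimal sketch of a global alignment over a position-by-position score matrix. It is not MRFalign itself: it covers only a node-level (single-position) term and leaves out the pairwise edge couplings that the summary credits for capturing long-range signal. The function name `align_node_scores`, the linear gap penalty, and the toy score values are all assumptions made for illustration.

```python
import math


def align_node_scores(score, gap=-1.0):
    """Needleman-Wunsch-style global alignment.

    score[i][j] is a hypothetical similarity between position i of the
    first family model and position j of the second. Returns the best
    total score and the list of aligned (i, j) position pairs.
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = gap * i  # leading gaps in the second sequence
    for j in range(1, m + 1):
        dp[0][j] = gap * j  # leading gaps in the first sequence

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                dp[i - 1][j] + gap,                      # skip position i
                dp[i][j - 1] + gap,                      # skip position j
            )

    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if math.isclose(dp[i][j], dp[i - 1][j - 1] + score[i - 1][j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(dp[i][j], dp[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


if __name__ == "__main__":
    toy = [[2.0, -1.0, -1.0],
           [-1.0, -1.0, 2.0]]
    print(align_node_scores(toy))
```

With only node scores, the optimum can be found exactly in time proportional to the product of the two lengths. Adding pairwise couplings between aligned positions breaks this simple recurrence, which is why a method that scores edges needs more machinery than the sketch shows.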
The second component, Group Graphical Lasso (GGL), addresses the well‑known limitation of single‑family evolutionary coupling (EC) analyses: they require deep MSAs (high Neff) to produce reliable statistical couplings. GGL extends EC by jointly analyzing the target family together with a set of structurally related families that share the same fold but may have divergent sequences. Mathematically, the method imposes both an L1 penalty on each family’s precision matrix (standard graphical lasso) and an L2,1 group penalty that couples corresponding edges across families. This formulation encourages edges that are consistently supported across families while suppressing family‑specific noise. The resulting joint coupling matrix is then combined with supervised learning features extracted from modern deep‑learning contact predictors (e.g., ResNet‑based models). The hybrid predictor yields a 10–15 percentage‑point boost in Top‑L/5 precision for proteins with shallow alignments (Neff < 50) and improves overall Top‑L precision from ~0.65 to ~0.73 on public benchmarks.
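The two penalties described above can be written as a single objective. The form below is a simplified sketch rather than the authors' exact formulation: it assumes F related families whose residue positions have already been mapped to a common index set, treats each residue pair (i, j) as a single scalar entry of the precision matrix, and uses the symbols F, Ω, Σ̂, λ₁ and λ₂, none of which appear in the summary.

```latex
\max_{\Omega^{(1)},\dots,\Omega^{(F)} \succ 0}\;
\sum_{f=1}^{F} \Big[ \log\det \Omega^{(f)}
  - \operatorname{tr}\!\big(\hat{\Sigma}^{(f)} \Omega^{(f)}\big) \Big]
\;-\; \lambda_1 \sum_{f=1}^{F} \sum_{i \neq j} \big|\Omega^{(f)}_{ij}\big|
\;-\; \lambda_2 \sum_{i \neq j} \sqrt{\sum_{f=1}^{F} \big(\Omega^{(f)}_{ij}\big)^{2}}
```

The first sum is the Gaussian log-likelihood of each family given its empirical covariance. The λ₁ term is the standard graphical-lasso L1 penalty applied per family. The λ₂ term is the group penalty: it takes the L2 norm of the same edge across all families and sums those norms over edges, so an edge tends to be zeroed in every family at once unless several families support it.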
The authors evaluate the integrated pipeline on several large‑scale datasets, including CASP13 targets, CAMEO weekly releases, and a curated collection of >5,000 proteins spanning all major structural classes. In addition to alignment and contact metrics, they feed the predicted contacts into Rosetta and a lightweight AlphaFold‑like reconstruction engine. The final 3‑D models achieve an average GDT‑TS improvement of 5–7 points and a reduction in RMSD from 1.2 Å to 0.9 Å relative to models built with baseline alignments and contacts. Ablation studies demonstrate that each sub‑module (MRF parameter estimation, group lasso regularization, supervised contact refinement) contributes additively to the overall gain.
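For readers unfamiliar with the two model-quality metrics quoted above, the following sketch computes both from a list of per-residue Cα distances between a model and the native structure. It assumes the two structures have already been superimposed and the residues matched one-to-one. Standard GDT-TS tools search over many superpositions to maximize each fraction, so the value from a single fixed superposition is at best a lower bound. The function names are illustrative.

```python
import math


def gdt_ts(distances):
    """GDT-TS (0-100) from per-residue distances in angstroms.

    Averages the fraction of residues within 1, 2, 4 and 8 angstroms
    of their native positions, for one fixed superposition.
    """
    if not distances:
        raise ValueError("distances must be non-empty")
    n = len(distances)
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def rmsd(distances):
    """Root-mean-square deviation over the same per-residue distances."""
    if not distances:
        raise ValueError("distances must be non-empty")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


if __name__ == "__main__":
    example = [0.5, 1.5, 3.0, 9.0]  # made-up distances, not data from the study
    print(gdt_ts(example), rmsd(example))
```

The two metrics respond differently to outliers: one badly placed residue inflates RMSD but barely moves GDT-TS, which is one reason structure-prediction benchmarks tend to report both.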
Despite the strong results, the paper acknowledges two practical challenges. First, the computational cost of MRFalign scales quadratically with sequence length (O(N²·K)), making it memory‑intensive for proteins longer than ~500 residues. Second, GGL’s regularization hyper‑parameters are sensitive to the diversity of the input families, necessitating careful cross‑validation or automated tuning. The authors propose future directions such as approximating the MRF alignment with graph‑neural‑network encoders, employing Bayesian optimization for hyper‑parameter selection, and unifying alignment, contact prediction, and structure reconstruction into a single end‑to‑end multitask model.
In summary, the study delivers a compelling demonstration that (i) global probabilistic models like MRFs can substantially enrich protein alignment beyond local similarity, and (ii) jointly leveraging evolutionary information from multiple related families via group graphical lasso markedly improves contact prediction for proteins lacking deep sequence data. The combined system outperforms current best‑in‑class methods across a broad set of evaluation criteria, representing a significant step forward in the computational prediction of protein structures.