Direct-coupling analysis of residue co-evolution captures native contacts across many protein families
The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced Direct Coupling Analysis (DCA) (Weigt et al. (2009) Proc Natl Acad Sci 106:67). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intra-domain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and inter-domain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, provided the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
💡 Research Summary
The paper presents a fast and scalable implementation of Direct Coupling Analysis (DCA) called mfDCA, which is based on a mean‑field approximation. Traditional DCA, implemented via message‑passing (mpDCA), is computationally intensive and unsuitable for large‑scale studies. By solving the inverse statistical problem analytically in a single step, mfDCA becomes 1,000–10,000 times faster while preserving, and in many cases improving, predictive accuracy.
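For orientation, the single-step inversion can be written compactly. This is a sketch under stated notation, not a full derivation: f_i(A) and f_ij(A,B) denote the single-site and pairwise frequencies estimated from the alignment, q is the number of symbols (assumed here to be 20 amino acids plus a gap, so q = 21), and the last symbol is taken as a reference state whose couplings are set to zero (a gauge choice assumed for this sketch).

```latex
C_{ij}(A,B) = f_{ij}(A,B) - f_i(A)\, f_j(B),
\qquad
J_{ij}(A,B) = -\left(C^{-1}\right)_{ij}(A,B),
\qquad A, B \in \{1, \dots, q-1\}
```

For an alignment of length L, C is a square matrix of dimension (q-1)L, so the couplings for all residue pairs follow from a single matrix inversion rather than from an iterative procedure.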
The authors first construct multiple sequence alignments (MSAs) for protein families using Pfam hidden Markov models, then apply a sequence‑weighting scheme that down‑weights sequences sharing more than 80 % identity, yielding an effective sequence number (M_eff). From the MSA they compute single‑site frequencies f_i(A) and pairwise frequencies f_ij(A,B). The mean‑field approximation yields direct coupling parameters J_ij, from which Direct Information (DI) is derived for every residue pair. DI quantifies the strength of a direct statistical coupling after removing indirect correlations.
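The weighting and counting steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the `ALPHABET` constant, and the toy alignment are hypothetical, the threshold is read strictly ("more than 80 %"), and no pseudocount regularization is applied, which a practical implementation would likely need before inverting the correlation matrix.

```python
# Hypothetical alphabet: 20 amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def sequence_weights(msa, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of sequences, itself included,
    that share more than `identity_threshold` fractional identity with it)."""
    length = len(msa[0])
    weights = []
    for s in msa:
        neighbours = 0
        for t in msa:
            identical = sum(a == b for a, b in zip(s, t))
            if identical > identity_threshold * length:
                neighbours += 1
        weights.append(1.0 / neighbours)
    return weights


def single_site_frequencies(msa, weights, alphabet=ALPHABET):
    """Reweighted frequency f_i(A) of symbol A in column i (no pseudocount)."""
    m_eff = sum(weights)
    freqs = [{a: 0.0 for a in alphabet} for _ in range(len(msa[0]))]
    for seq, w in zip(msa, weights):
        for i, a in enumerate(seq):
            freqs[i][a] += w / m_eff
    return freqs


def pair_frequencies(msa, weights, i, j):
    """Reweighted joint frequency f_ij(A, B) for columns i and j,
    returned as a sparse dict keyed by (A, B)."""
    m_eff = sum(weights)
    joint = {}
    for seq, w in zip(msa, weights):
        key = (seq[i], seq[j])
        joint[key] = joint.get(key, 0.0) + w / m_eff
    return joint


if __name__ == "__main__":
    toy_msa = ["ACDEF", "ACDEF", "ACDEY", "GHIKL"]  # made-up sequences
    w = sequence_weights(toy_msa)
    print("weights:", w)          # the two identical sequences share weight
    print("M_eff:", sum(w))
    print("f_0(A):", single_site_frequencies(toy_msa, w)[0]["A"])
    print("f_04:", pair_frequencies(toy_msa, w, 0, 4))
```

In the toy alignment the two identical sequences each receive weight 0.5, so four sequences contribute an effective count of 3. The pairwise comparison is quadratic in the number of sequences, which is acceptable for a sketch but would need vectorization for Pfam-scale alignments.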
To evaluate performance, the study analyzes 131 bacterial domain families (totaling 856 high‑resolution crystal structures). For each family, the top‑ranked DI pairs (separated by at least five residues along the sequence) are compared to structural contacts, where two residues count as in contact if their minimum atom‑atom distance is below 8 Å. Across all families, on average 84 % of the top 20 DI pairs are true contacts; the top 10 achieve even higher precision. By contrast, only about 65 % of the pairs ranked highest by simple mutual information (MI) are true contacts, and a recent approximate Bayesian method, while better than MI, still falls short of mfDCA’s performance.
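The evaluation protocol above amounts to a precision calculation over the highest-scoring pairs. The following sketch assumes the DI scores and the minimum atom-atom distances are already available as dictionaries keyed by residue-index pairs; the function name, argument names, and the example numbers are hypothetical and serve only to illustrate the bookkeeping.

```python
def top_pair_precision(scores, distances, top_n=20,
                       min_separation=5, cutoff=8.0):
    """Fraction of the `top_n` highest-scoring pairs that are contacts.

    scores:    dict {(i, j): DI score}
    distances: dict {(i, j): minimum atom-atom distance in angstroms}
    Pairs closer than `min_separation` along the sequence are ignored.
    """
    eligible = [
        (pair, score) for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    # Rank by score, highest first, and keep the top_n pairs.
    ranked = sorted(eligible, key=lambda item: item[1], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _ in ranked if distances[pair] < cutoff)
    return hits / len(ranked)


if __name__ == "__main__":
    # Made-up scores and distances for five residue pairs.
    di = {(0, 6): 0.9, (1, 2): 0.95, (2, 9): 0.7, (3, 10): 0.4, (0, 8): 0.1}
    dist = {(0, 6): 4.2, (1, 2): 3.8, (2, 9): 12.5, (3, 10): 7.1, (0, 8): 20.0}
    print(top_pair_precision(di, dist, top_n=3))
```

In the example, the pair (1, 2) has the highest score but is excluded by the sequence-separation filter, so the top three eligible pairs are (0, 6), (2, 9), and (3, 10), of which two fall below the 8 Å cutoff.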
Distance histograms for high‑DI pairs show a characteristic double‑peak: one around 3–5 Å (typical of hydrogen bonds and secondary‑structure contacts) and another around 7–8 Å (suggestive of water‑mediated or long‑range interactions). This pattern is absent in the background distribution of all residue pairs, underscoring the specificity of the DI signal.
The authors also explore the dependence on sequence depth. Random subsampling experiments indicate that an effective sequence count of roughly 1,000 is sufficient for near‑optimal true‑positive rates, with performance improving monotonically as more sequences become available.
Beyond intra‑domain contacts, mfDCA uncovers biologically relevant non‑canonical signals: alternative conformations of the same domain, ligand‑binding sites, and inter‑domain or inter‑subunit contacts in oligomeric assemblies. These findings suggest that DCA‑derived contacts can guide de novo domain modeling, conformational ensemble generation, and the prediction of protein‑protein interfaces, provided that a sufficiently large and diverse set of homologous sequences is available.
In summary, mfDCA offers a computationally efficient route to extract high‑confidence structural constraints from sequence data alone. Its ability to process hundreds of families rapidly makes it a valuable tool for large‑scale structural genomics, enabling researchers to leverage the ever‑growing pool of genomic sequences for protein structure and interaction inference.