Composite structural motifs of binding sites for delineating biological functions of proteins
Most biological processes are described as a series of interactions between proteins and other molecules, and interactions are in turn described in terms of atomic structures. To annotate protein functions as sets of interaction states at atomic resolution, and thereby to better understand the relation between protein interactions and biological functions, we conducted exhaustive all-against-all atomic structure comparisons of all known binding sites for ligands including small molecules, proteins and nucleic acids, and identified recurring elementary motifs. By integrating the elementary motifs associated with each subunit, we defined composite motifs which represent context-dependent combinations of elementary motifs. It is demonstrated that function similarity can be better inferred from composite motif similarity compared to the similarity of protein sequences or of individual binding sites. By integrating the composite motifs associated with each protein function, we define meta-composite motifs each of which is regarded as a time-independent diagrammatic representation of a biological process. It is shown that meta-composite motifs provide richer annotations of biological processes than sequence clusters. The present results serve as a basis for bridging atomic structures to higher-order biological phenomena by classification and integration of binding site structures.
💡 Research Summary
The authors address a fundamental challenge in protein function annotation: the limited predictive power of sequence similarity alone. They propose a structural‑centric approach that captures the diversity of protein–ligand interactions at atomic resolution and leverages this information to infer functional similarity more accurately than traditional methods.
First, they extracted all protein subunits from the Protein Data Bank (PDB) that contain at least one ligand, whether a small molecule, another protein, or a nucleic acid. A binding site was defined as the set of protein atoms lying within 5 Å of any ligand atom. This yielded 197,690 subunits and over 770,000 binding sites (410,254 non‑polymer, 346,288 protein‑protein, and 20,338 nucleic‑acid sites). Using the GIRAF structure‑search engine, they performed exhaustive all‑against‑all structural comparisons of these sites. Complete‑linkage clustering, keeping only clusters with at least ten members, produced 5,869 small‑molecule, 7,678 protein‑protein, and 398 nucleic‑acid clusters. These clusters, termed “elementary motifs,” represent recurrent atomic arrangements of binding pockets independent of the chemical identity of the bound ligand.
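The distance-based definition of a binding site can be illustrated with a minimal sketch. The function name `binding_site_atoms`, the tuple-based coordinate format, and the toy coordinates are assumptions made for illustration; the authors' actual pipeline (built around GIRAF) is not reproduced here, and whether the 5 Å boundary is inclusive is also an assumption.

```python
import math

def binding_site_atoms(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return indices of protein atoms within `cutoff` angstroms of any ligand atom.

    Both arguments are sequences of (x, y, z) coordinates in angstroms.
    The inclusive boundary (<=) is an assumption of this sketch.
    """
    site = []
    for i, p in enumerate(protein_atoms):
        # An atom belongs to the site if at least one ligand atom is close enough.
        if any(math.dist(p, l) <= cutoff for l in ligand_atoms):
            site.append(i)
    return site

# Toy coordinates, invented purely for illustration.
protein = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(0.0, 4.0, 0.0)]
print(binding_site_atoms(protein, ligand))
```

This brute-force loop is quadratic in the number of atoms; for PDB-scale data a spatial index (a cell list or k-d tree, for example) would normally replace it.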
Next, the set of elementary motifs present in a given subunit defines its “composite motif.” In total, 5,738 composite motifs were identified, each shared by at least ten subunits. A composite motif comprises between one and twenty elementary motifs, though 90 % contain five or fewer. The authors measured similarity between two composite motifs as the fraction of elementary motifs they share. Importantly, this similarity showed little correlation with sequence identity, indicating that proteins with divergent sequences can nevertheless share the same combination of binding‑site architectures.
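A composite motif can be modelled as a set of elementary-motif identifiers, and its similarity to another composite motif as a set overlap. The sketch below assumes that "fraction of shared elementary motifs" means shared motifs divided by the union of both sets (a Jaccard-style ratio); the summary does not state the denominator, so this is only one plausible reading. The identifiers such as `"NP:12"` are hypothetical.

```python
def composite_motif(elementary_motif_ids):
    """Represent a subunit's composite motif as an immutable set of motif IDs."""
    return frozenset(elementary_motif_ids)

def composite_similarity(a, b):
    """Shared elementary motifs divided by all motifs in either composite motif.

    The choice of the union as denominator is an assumption of this sketch.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical identifiers: NP = non-polymer site, PP = protein-protein site.
subunit_1 = composite_motif(["NP:12", "PP:7", "PP:9"])
subunit_2 = composite_motif(["NP:12", "PP:7"])
print(composite_similarity(subunit_1, subunit_2))
```

Because the measure looks only at which motifs are present, two subunits with unrelated sequences score 1.0 whenever their sets of elementary motifs coincide.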
To test functional relevance, the authors mapped UniProt keyword annotations onto each subunit and computed functional similarity using the Jaccard index. They compared three similarity measures: (i) sequence identity (BLAST), (ii) raw binding‑site similarity (structural alignment of individual sites), and (iii) composite‑motif similarity. Across random, non‑redundant samples, composite‑motif similarity consistently correlated more strongly with functional similarity than either sequence or single‑site similarity. The relationship held even when considering only higher‑level Biological Process keywords, which are less directly tied to ligand binding. Moreover, composite motifs comprising multiple elementary motifs yielded slightly better functional discrimination than those built from a single elementary motif, suggesting that the combinatorial context of binding sites encodes functional specificity.
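The comparison described above can be sketched as follows: functional similarity is the Jaccard index of two keyword sets, and a candidate predictor is judged by how well its scores track that index over many subunit pairs. Pearson correlation is used here only as one way to quantify "correlates more strongly"; the summary does not say which statistic the authors used. The keyword sets and predictor scores are toy values invented for illustration, not data from the paper.

```python
import math

def jaccard(a, b):
    """Jaccard index of two keyword sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Toy keyword annotations for three hypothetical subunits.
keywords = {
    "A": {"Oxidoreductase", "FAD", "Flavoprotein"},
    "B": {"Oxidoreductase", "FAD"},
    "C": {"Hydrolase", "Protease"},
}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
functional = [jaccard(keywords[p], keywords[q]) for p, q in pairs]

# Hypothetical predictor scores for the same pairs (e.g. composite-motif similarity).
predictor = [0.8, 0.1, 0.2]
print(pearson(predictor, functional))
```

Running the same calculation with sequence identity, single-site similarity, and composite-motif similarity as the predictor would give three coefficients to compare.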
The authors also examined outliers: fifteen composite motifs were associated with completely unrelated functions. Most of these cases stemmed from annotation errors, engineered constructs, or simple coiled‑coil motifs that lack functional specificity. A few genuine examples involved remote homologs sharing a single elementary motif yet participating in distinct complexes, underscoring that a single motif is insufficient for precise function prediction.
Illustrative case studies highlight the concept. Glycine oxidase and glycerol‑3‑phosphate dehydrogenase share an NAD/FAD binding elementary motif and the same Rossmann‑like fold, but differ in additional protein‑protein interaction motifs, oligomeric state, and ligand‑specific motifs, leading to distinct enzymatic activities and metabolic contexts. Similar patterns are observed for D‑3‑phosphoglycerate dehydrogenase versus CtBP3, β‑trypsin versus coagulation factor VII, and cytochrome b₂ versus glycolate oxidase, where differences in composite motif composition explain divergent biological roles despite shared core structural elements.
Finally, the authors aggregated composite motifs that are linked to the same UniProt functional annotation to create “meta‑composite motifs.” Each meta‑composite motif can be visualized as a time‑independent diagram of a biological process, capturing the repertoire of binding‑site combinations employed by all proteins participating in that process. Compared with conventional sequence‑based clusters, meta‑composite motifs reveal richer internal structure, such as alternative interaction partners, oligomeric states, and ligand preferences, thereby providing a more nuanced view of pathway architecture.
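The aggregation step can be sketched as a simple grouping operation: every composite motif observed on a subunit is filed under each functional keyword attached to that subunit. The function name `build_meta_composite_motifs`, the input format, and the example records are hypothetical; how the authors then render each group as a diagram is not covered here.

```python
from collections import defaultdict

def build_meta_composite_motifs(subunits):
    """Group composite motifs by functional keyword.

    `subunits` is an iterable of (composite_motif, keywords) pairs, where
    composite_motif is a collection of elementary-motif IDs and keywords is a
    collection of annotation strings. Returns {keyword: set of composite motifs}.
    """
    meta = defaultdict(set)
    for composite, keywords in subunits:
        for keyword in keywords:
            # frozenset makes the composite motif hashable so duplicates collapse.
            meta[keyword].add(frozenset(composite))
    return dict(meta)

# Hypothetical records: (elementary motifs on the subunit, its keywords).
records = [
    (["NP:12", "PP:7"], ["Glycolysis"]),
    (["NP:12", "PP:7"], ["Glycolysis"]),
    (["NP:12", "PP:9"], ["Glycolysis", "Gluconeogenesis"]),
]
meta = build_meta_composite_motifs(records)
for keyword, composites in sorted(meta.items()):
    print(keyword, len(composites))
```

Keeping whole composite motifs as the grouped unit, rather than flattening them into one pool of elementary motifs, preserves the information about which binding sites occur together on the same subunit.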
In summary, this work demonstrates that (1) elementary motifs derived from atomic binding‑site geometry are reproducible structural building blocks; (2) the specific combination of these motifs into composite motifs encodes functional information that surpasses sequence similarity; and (3) integrating composite motifs across a biological process yields meta‑composite motifs that serve as detailed, structure‑based annotations of cellular pathways. The methodology offers a powerful framework for improving protein function prediction, guiding drug‑target identification, and constructing more accurate systems‑biology models that are grounded in three‑dimensional molecular interactions.