DISPROTBENCH: Uncovering the Functional Limits of Protein Structure Prediction Models in Intrinsically Disordered Regions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Intrinsically disordered regions (IDRs) play central roles in cellular function, yet remain poorly evaluated by existing protein structure prediction benchmarks. Current evaluations largely focus on well-folded domains, overlooking three fundamental challenges in realistic biological settings: the structural complexity of proteins, the resulting low availability of reliable ground truth, and prediction uncertainty that can propagate into high-risk downstream failures, such as in drug discovery, protein-protein interaction modeling, and functional annotation. We present DisProtBench, an IDR-centric benchmark that explicitly incorporates prediction uncertainty into the evaluation of protein structure prediction models (PSPMs). To address structural complexity and ground-truth scarcity, we curate and unify a large-scale, multi-modal dataset spanning disease-relevant IDRs, GPCR-ligand interactions, and multimeric protein complexes. To assess predictive uncertainty, we introduce Functional Uncertainty Sensitivity (FUS), a novel prediction uncertainty-stratified metric that quantifies downstream task performance under prediction uncertainty. Using this benchmark, we conduct a systematic evaluation of state-of-the-art PSPMs and reveal clear, task-dependent failure modes. Protein-protein interaction prediction degrades sharply in IDRs, while structure-based drug discovery remains comparatively robust. These effects are largely invisible to standard global accuracy metrics, which overestimate functional reliability under prediction uncertainty. We have open-sourced our benchmark and the codebase at https://github.com/Susan571/DisProtBench.


💡 Research Summary

The paper introduces DisProtBench, a novel benchmark specifically designed to evaluate protein structure prediction models (PSPMs) in the context of intrinsically disordered regions (IDRs). Traditional benchmarks such as CASP focus on global folding accuracy using metrics like RMSD or GDT‑TS, which assume a single, well‑defined native structure. This assumption breaks down for IDRs, which are characterized by conformational heterogeneity, transient interfaces, and a scarcity of reliable experimental structures. Consequently, existing evaluations can over‑estimate model reliability and mask failure modes that are critical for downstream applications such as protein‑protein interaction (PPI) prediction, drug discovery, and functional annotation.

DisProtBench addresses three intertwined challenges: (i) structural complexity, (ii) limited ground‑truth availability, and (iii) prediction uncertainty. To capture structural complexity, the authors curate a large, multi‑modal dataset comprising three complementary subsets. The “DisProt‑Based” subset gathers human proteins with long IDRs (≥20 residues) from UniProt, links them to disease‑associated missense variants from ClinVar, and incorporates protein‑protein interaction data from HINT, yielding roughly 1,000 proteins and 10,000 PPIs. The “Individual Protein” subset focuses on GPCR‑ligand interactions, extracting ~71,000 high‑affinity (pKi 6‑9) pairs from ChEMBL and BindingDB, representing about 100,000 drug‑target interactions. The “Protein Interaction” subset contains high‑quality multimeric complexes (median pLDDT ≥70, favorable pDockQ) where predicted interfaces overlap annotated IDRs, providing ~1,000 complexes for multimeric analysis. By integrating ordered domains, disordered segments, and functional contexts, the benchmark reflects realistic biological scenarios where disorder plays a functional role.
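The two numeric filters mentioned above (IDRs of at least 20 residues and pKi between 6 and 9) can be illustrated with a minimal sketch. This is not the authors' curation pipeline: the `LigandPair` record, the boolean disorder mask, and both function names are hypothetical, and the real dataset draws its annotations from the databases named in the paragraph.

```python
from dataclasses import dataclass


@dataclass
class LigandPair:
    """Hypothetical record for one target-ligand measurement."""
    target: str
    ligand: str
    pki: float


def high_affinity(pairs, lo=6.0, hi=9.0):
    """Keep pairs whose pKi lies in the inclusive range [lo, hi]."""
    return [p for p in pairs if lo <= p.pki <= hi]


def long_idrs(disorder_mask, min_len=20):
    """Return half-open (start, end) spans of consecutive disordered
    residues that are at least min_len residues long.

    disorder_mask is assumed to be a per-residue list of booleans,
    True meaning the residue is annotated as disordered.
    """
    spans = []
    start = None
    for i, is_disordered in enumerate(disorder_mask):
        if is_disordered and start is None:
            start = i  # a disordered run begins here
        elif not is_disordered and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None  # the run has ended
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(disorder_mask) - start >= min_len:
        spans.append((start, len(disorder_mask)))
    return spans
```

For example, a mask with one 25-residue disordered run and one 10-residue run yields a single span, since the shorter run falls below the length cutoff.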

Because many IDRs lack a definitive reference structure, DisProtBench shifts the evaluation focus from pure geometric accuracy to functional impact. The authors introduce Functional Uncertainty Sensitivity (FUS), a metric that stratifies predictions by their internal confidence scores (e.g., AlphaFold’s pLDDT) and measures downstream task performance within each confidence band. For PPI prediction, FUS computes the area under the ROC curve (AUC) for interface detection using predicted structures, while for drug discovery it correlates docking scores with experimental binding affinities. This uncertainty‑aware approach reveals how low‑confidence regions translate into functional risk, something that global RMSD cannot capture.
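The idea of stratifying a downstream metric by confidence can be sketched as follows. The summary does not give the exact FUS formula, so this is only an illustration of the general approach: the bin edges, the function names, and the choice to report one AUC per bin are assumptions, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation rather than any library routine.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a
    random negative; ties count as one half. Returns None when a
    class is missing, because AUC is undefined in that case."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_auc(plddt, labels, scores, edges=(0, 50, 70, 90, 100)):
    """Compute one AUC per confidence bin.

    plddt  : confidence value attached to each example (assumed 0-100)
    labels : 1 for a true interface example, 0 otherwise
    scores : the downstream predictor's score for each example
    edges  : assumed bin boundaries; bins are [lo, hi), except the
             last bin, which also includes its upper edge
    """
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        idx = [i for i, c in enumerate(plddt)
               if lo <= c < hi or (is_last and c == hi)]
        result[(lo, hi)] = roc_auc([labels[i] for i in idx],
                                   [scores[i] for i in idx])
    return result
```

Comparing the per-bin values against the single AUC over all examples shows how a strong global number can hide a weak low-confidence bin, which is the effect the metric is meant to expose.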

Eight state‑of‑the‑art PSPMs are benchmarked: AlphaFold2, AlphaFold3, OpenFold, UniFold (all Evoformer‑based with MSA inputs), Boltz, Chai (sequence‑only transformers with coarse‑grained representations), Protenix (transformer plus ligand modeling), and ESMFold (sequence‑only with language‑model embeddings). Each model is run on the same protein sequences, and its pLDDT (or analogous confidence metric) is used to define uncertainty bins. The authors then evaluate PPI interface recovery and GPCR‑ligand docking performance across these bins.
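One way to turn per-residue pLDDT into an uncertainty bin for a region such as a predicted interface is sketched below. The band names follow the conventional AlphaFold confidence bands (above 90, 70 to 90, 50 to 70, below 50). Averaging pLDDT over the region, treating boundary values as belonging to the higher band, and both function names are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean

# Lower thresholds of the conventional AlphaFold confidence bands,
# checked from highest to lowest.
BANDS = [(90.0, "very_high"), (70.0, "confident"), (50.0, "low")]


def confidence_band(plddt_value):
    """Map a single pLDDT value (0-100) to a band name.
    Values exactly on a threshold go to the higher band (an assumption)."""
    for threshold, name in BANDS:
        if plddt_value >= threshold:
            return name
    return "very_low"


def band_for_region(per_residue_plddt, residue_indices):
    """Assign a region to a band using the mean pLDDT of its residues.

    per_residue_plddt : list of pLDDT values, one per residue
    residue_indices   : zero-based indices of residues in the region
    """
    region_mean = mean(per_residue_plddt[i] for i in residue_indices)
    return confidence_band(region_mean)
```

A mean is only one possible aggregate; a minimum or a fraction of residues below 50 would penalize short disordered stretches inside an otherwise confident interface more strongly.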

Key findings include: (1) Global accuracy metrics remain high (80‑90% RMSD/GDT‑TS) for most models, but FUS drops dramatically in low‑confidence IDR regions, especially for PPI tasks where AUC can fall below 0.3. (2) AlphaFold2 and AlphaFold3, despite leading global scores, suffer the steepest performance loss when pLDDT < 50, indicating that their confidence estimates are meaningful indicators of functional reliability. (3) Drug discovery tasks are comparatively robust; docking performance shows little variation across confidence bins because ligand‑binding pockets are often ordered, even when adjacent IDRs exist. (4) Coarse‑grained models (Boltz, Chai) exhibit slightly better resilience in PPI prediction under uncertainty, suggesting that reduced representation may mitigate over‑fitting to static conformations. (5) The visual analytics interface allows users to overlay pLDDT heatmaps on predicted structures, inspect residue‑level contributions to interface prediction, and trace failures back to specific disordered segments, providing a diagnostic capability absent from prior benchmarks.

All data, code, and the web‑based evaluation portal are openly released on GitHub (https://github.com/Susan571/DisProtBench), ensuring reproducibility and encouraging community extensions. The authors argue that DisProtBench fills a critical gap by offering an IDR‑centric, uncertainty‑aware benchmark that aligns model evaluation with real‑world functional demands. Future directions include incorporating Bayesian uncertainty estimates, ensemble predictions, and integrating multi‑omics data to further refine functional assessments in disordered protein regions.

