Assessing the potential of deep learning for protein-ligand docking
The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of the latest docking and structure prediction methods within the broadly applicable context of (1) using predicted (apo) protein structures for docking (e.g., for applicability to new proteins); (2) binding multiple (cofactor) ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for generalization to unknown pockets). To enable a deeper understanding of docking methods’ real-world utility, we introduce PoseBench, the first comprehensive benchmark for broadly applicable protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL methods for apo-to-holo protein-ligand docking and protein-ligand structure prediction using both primary ligand and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that (1) DL co-folding methods generally outperform comparable conventional and DL docking baseline algorithms, yet popular methods such as AlphaFold 3 are still challenged by prediction targets with novel binding poses; (2) certain DL co-folding methods are highly sensitive to their input multiple sequence alignments, while others are not; and (3) DL methods struggle to strike a balance between structural accuracy and chemical specificity when predicting novel or multi-ligand protein targets. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.
💡 Research Summary
The paper introduces PoseBench, the first comprehensive benchmark designed to evaluate deep‑learning (DL) methods for protein‑ligand docking under three realistic and broadly applicable conditions: (1) using predicted apo protein structures as the docking target, (2) simultaneously docking multiple (co‑factor) ligands to a single protein, and (3) performing blind docking without any prior knowledge of binding pockets. Existing benchmarks typically rely on experimentally resolved holo structures, single‑ligand scenarios, or pre‑identified pockets, limiting their relevance to real‑world drug discovery pipelines. PoseBench fills this gap by assembling four curated datasets—Astex Diverse (85 complexes), DockGen‑E (122 complexes, up to 2019), PoseBusters (130 out‑of‑distribution complexes filtered from a larger set of 308), and a CASP15 multi‑ligand set (13 complexes)—that collectively span well‑studied, recent, and truly novel protein‑ligand interactions.
State‑of‑the‑art DL models are benchmarked in two groups: blind‑docking methods (DiffDock‑L, DynamicBind, NeuralPLexer) and co‑folding approaches (Chai‑1, Boltz‑1, and AlphaFold‑3), with single‑sequence variants of the co‑folding methods also evaluated to probe MSA dependence. For comparison, a conventional pipeline (P2Rank for pocket prediction followed by AutoDock Vina) is included as well. Performance is measured using three primary metrics: (i) the fraction of poses with ligand heavy‑atom RMSD ≤ 2 Å (a stricter ≤ 1 Å threshold is also reported), (ii) chemical validity as defined by the PoseBusters suite (PB‑Valid), and (iii) a newly introduced Wasserstein‑based PLIF‑WM score that quantifies how well predicted protein‑ligand interaction fingerprints (PLIFs) match the ground truth at the amino‑acid level.
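The two geometric criteria above can be sketched in a few lines. The snippet below is an illustrative approximation, not PoseBench's actual implementation: it computes a plain heavy‑atom RMSD success check, and stands in for the PLIF‑WM score with a 1‑D Wasserstein distance over hypothetical per‑residue contact counts (the real metric operates on full protein‑ligand interaction fingerprints).

```python
import numpy as np
from scipy.stats import wasserstein_distance


def heavy_atom_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD between matched heavy-atom coordinates (two N x 3 arrays)."""
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))


def docking_success(pred: np.ndarray, ref: np.ndarray, threshold: float = 2.0) -> bool:
    """Standard docking success criterion: ligand heavy-atom RMSD <= 2 Angstroms."""
    return heavy_atom_rmsd(pred, ref) <= threshold


def plif_distance(pred_contacts, ref_contacts) -> float:
    """Toy stand-in for PLIF-WM: compare per-residue contact-count
    distributions via the 1-D Wasserstein distance (0 = identical)."""
    return wasserstein_distance(pred_contacts, ref_contacts)


# Example: a pose uniformly shifted by 1 Angstrom along x has RMSD 1.0,
# so it passes the 2 Angstrom success cutoff.
ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
pred = ref + np.array([1.0, 0.0, 0.0])
assert docking_success(pred, ref)
```

Note that a real evaluation would first match predicted and reference ligand atoms (accounting for molecular symmetry) before computing RMSD; the sketch assumes the atom correspondence is already given.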
Key findings are as follows. First, co‑folding methods consistently outperform both conventional docking and blind DL docking in structural accuracy and chemical validity across the Astex Diverse and DockGen‑E datasets. Chai‑1, Boltz‑1, and AlphaFold‑3 achieve pocket‑identification success rates above 50% and maintain a favorable balance between RMSD and PLIF‑WM scores. However, AlphaFold‑3's performance degrades sharply when its input multiple‑sequence alignments (MSAs) are removed, indicating a strong dependence on evolutionary information; in contrast, Boltz‑1's single‑sequence version remains competitive, suggesting that some architectures can learn robust, MSA‑independent representations.
Second, when moving to more challenging, out‑of‑distribution data (PoseBusters) and especially to the multi‑ligand CASP15 set, overall success rates drop dramatically for all methods. Only AlphaFold‑3 shows a modest ability to generalize to multi‑ligand complexes, but even it fails to achieve high PLIF‑WM scores, highlighting a persistent trade‑off between accurate backbone placement and precise chemical interactions. The low PLIF‑WM values across datasets reveal that current DL models, while good at predicting ligand pose geometry, often miss critical residue‑level contacts, limiting their utility for tasks that require detailed interaction fingerprints (e.g., structure‑based lead optimization).
Third, the benchmark reveals that MSA‑based features, while beneficial for structural accuracy, can lead to over‑fitting on pocket types common in the training data. This manifests as reduced performance on novel pocket families (e.g., ECOD‑defined domains in DockGen‑E) and on recent PDB entries not seen during training. The authors argue that future model development should explicitly address MSA dependence, for example by integrating self‑supervised protein language models or by augmenting training with synthetic, diverse pocket conformations.
Finally, the authors provide extensive auxiliary analyses, including runtime and memory profiling, and they release all code, data, and tutorials to facilitate reproducibility. PoseBench thus serves both as a diagnostic tool for current DL docking methods and as a roadmap for future research directions: (i) designing MSA‑agnostic architectures, (ii) incorporating explicit multi‑ligand interaction modeling, (iii) expanding training with out‑of‑distribution examples, and (iv) adopting interaction‑level evaluation metrics like PLIF‑WM as standard benchmarks.
In summary, PoseBench demonstrates that while deep‑learning co‑folding approaches have surpassed traditional docking in many respects, significant challenges remain in generalizing to novel binding sites, handling multiple ligands, and faithfully reproducing residue‑level interaction patterns. The benchmark provides a rigorous, publicly available platform to drive the next generation of protein‑ligand docking algorithms toward broader applicability in drug discovery and enzyme engineering.