Preprint. February 17, 2026. ing prediction is typically framed as either regression over experimental affinity values (e.g., IC50, Kd), often called protein-ligand scoring, or binary classification over bind/nobind labels.
In contrast to crystallography and sequence-based data in structural studies, data quality remains a pervasive challenge for PLI modeling. Large repositories such as BindingDB (Liu et al., 2007) and ChEMBL (Gaulton et al., 2012) aggregate millions of affinity measurements curated from published articles and patents. However, these measurements originate from thousands of different labs, assays, and experimental protocols, resulting in data that is notoriously difficult to standardize and riddled with biases (Kramer et al., 2012; Harren et al., 2023; Volkov et al., 2022; Blevins & Quigley, 2025). Datasets that do employ consistent experimental protocols are generally too limited in their coverage of protein or chemical space to train generalizable models (Davis et al., 2011; Metz et al., 2011).
Despite these training data challenges, effective computational models for PLI binding prediction do exist. Classical physics-based methods have long provided the foundation for affinity estimation, from rigorous free energy perturbation (FEP) calculations to faster endpoint approximations like MM-PBSA and MM-GBSA (Wang et al., 2015; Genheden & Ryde, 2015). However, these approaches remain too computationally demanding for large-scale virtual screening, motivating the development of machine learning alternatives. ML approaches have proliferated over the past two decades, spanning architectures from random forests (Ballester & Mitchell, 2010) to deep learning models including sequence-based methods (Öztürk et al., 2018), graph neural networks (Nguyen et al., 2021), and structure-based approaches such as Boltz-2, which trains binding affinity and binary classification heads on top of a pretrained AlphaFold3-like architecture (Passaro et al., 2025).
Evaluating these approaches is equally challenging. Few high-quality benchmark datasets exist because they inherit the same problems as training data. Data leakage is particularly problematic: it is pervasive, difficult to detect, and lacks standardized mitigation strategies (Graber et al., 2025). Curated benchmarks with stronger leakage protections (temporal, assay, or protein-family splits) tend to be too small for comprehensive evaluation (Gilson et al., 2025). DNA-encoded libraries (DELs) (Brenner & Lerner, 1992; Gironda-Martínez et al., 2021) offer a potential solution by unifying experimental protocols across protein targets while exploring vast regions of chemical space in single experiments.
DELs are libraries of small molecules where each compound is covalently attached to a unique DNA barcode, enabling the synthesis and screening of up to billions of compounds in a single experiment. In a DEL screen, the entire library can be incubated with an immobilized protein target, nonbinders are washed away, and the remaining compounds are identified by sequencing their DNA tags; enrichment over background indicates binding. This massively parallel approach generates binding data at a scale orders of magnitude larger than traditional high-throughput screens, though the resulting enrichment scores are noisy proxies for true binding affinity.
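As a rough illustration of what "enrichment over background" can mean in practice, the sketch below computes a log2 ratio of each compound's read frequency in a target selection to its frequency in a control selection. This is only one simple convention, assumed here for illustration; the text does not specify an enrichment formula, and the function name, the pseudocount, and the toy read counts are all hypothetical.

```python
import math


def enrichment_scores(target_counts, control_counts, pseudocount=1.0):
    """Log2 enrichment of each compound in a target selection over a control.

    Assumption (not from the source): enrichment is the ratio of
    pseudocount-smoothed read frequencies in the two selections.
    """
    compounds = set(target_counts) | set(control_counts)
    n = len(compounds)

    # Totals include the pseudocounts so the smoothed frequencies sum to 1.
    target_total = sum(target_counts.values()) + pseudocount * n
    control_total = sum(control_counts.values()) + pseudocount * n

    scores = {}
    for compound in compounds:
        target_freq = (target_counts.get(compound, 0) + pseudocount) / target_total
        control_freq = (control_counts.get(compound, 0) + pseudocount) / control_total
        scores[compound] = math.log2(target_freq / control_freq)
    return scores


# Hypothetical sequencing read counts per DNA barcode (toy values).
target = {"cpd_A": 120, "cpd_B": 3, "cpd_C": 5}
control = {"cpd_A": 4, "cpd_B": 5, "cpd_C": 3}

scores = enrichment_scores(target, control)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example, `cpd_A` receives a positive score and ranks first, while the other two compounds score below zero. With real data, low read counts make such ratios highly variable, which is one reason enrichment scores are only noisy proxies for affinity.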
The massive scale of DEL data makes it well-suited for training machine learning models. Several DEL datasets have been publicly released (Quigley et al., 2024; Lim et al., 2024; Iqbal et al., 2025), and multiple groups have reported success training ML models on this data (McCloskey et al., 2020; Iqbal et al., 2025; Lim et al., 2024). However, publicly available DEL datasets are severely limited in protein diversity, restricting prior work to single-protein models. While these models have demonstrated experimental validation of virtual screening hits, such validation has been confined to small numbers of compounds against the same protein target used for training. What remains to be tested is whether PLI representations learned from DEL data can transfer more broadly: across held-out protein targets, unseen chemical scaffolds, and binding measurements from entirely different experimental systems.
In this work, we present Hermes: a fast, sequence-only PLI binding prediction