The vast majority of biological sequences encode unknown functions and bear little resemblance to experimentally characterized proteins, limiting both our understanding of biology and our ability to harness functional potential for the bioeconomy. Predicting enzyme function from sequence remains a central challenge in computational biology, complicated by low sequence diversity and imbalanced label support in publicly available datasets. Evaluations of models trained on these data can overestimate performance, and the models themselves often fail to generalize. To address this, we introduce GRIMM (Genetic stRatification for Inference in Molecular Modeling), a benchmark for enzyme function prediction that employs genetic stratification: sequences are clustered by similarity, and each cluster is assigned exclusively to the training, validation, or test set, ensuring that sequences from the same cluster never appear in multiple partitions. GRIMM produces two test sets: a closed-set test with the same label distribution as training (Test-1) and an open-set test containing novel labels (Test-2), which serves as a realistic out-of-distribution proxy for discovering novel enzyme functions. While demonstrated on enzymes, the approach generalizes to any sequence-based classification task where inputs can be clustered by similarity. By formalizing a splitting strategy often used implicitly, GRIMM provides a unified and reproducible framework for closed- and open-set evaluation. The method is lightweight, requiring only sequence clustering and label annotations, and can be adapted to different similarity thresholds, data scales, and biological tasks. GRIMM enables more realistic evaluation of functional prediction models on both familiar and unseen classes and establishes a benchmark that more faithfully assesses model performance and generalizability.
A persistent challenge in biological sequence modeling is the limited generalizability of Bio-AI models when they are applied beyond the conditions under which they are evaluated. In practice, models that predict biological characteristics from DNA or amino acid sequences are often trained and assessed on datasets with substantial sequence-similarity overlap between training, validation, and test partitions, even though real-world biological applications routinely introduce novel sequences that are evolutionarily distant from, or absent in, the training data. This disconnect between evaluation protocols and deployment conditions inflates performance estimates and precludes accurate assessment of model behavior on truly novel, out-of-distribution (OOD) sequences. In the context of genomics, we use the term out-of-distribution (OOD) specifically for protein sequences that differ substantially from the training data in amino acid sequence similarity and that lie outside the regions of sequence space well represented during model training Shih et al. [2025].
A key contributor to inflated performance estimates is redundancy within commonly used biological sequence datasets. Traditional splitting strategies frequently permit homologous sequences to appear across partitions, introducing leakage that reduces the effective difficulty of the prediction task Florensa et al. [2024], Shih et al. [2025]. As a result, standard benchmarks often fail to capture the challenges associated with generalization to OOD sequences, particularly those occupying sparsely sampled or underexplored regions of sequence space Koh et al. [2021].
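To make the leakage concern concrete, the sketch below estimates how often test sequences retain a near-identical counterpart in the training partition under a given split. The function name and threshold are illustrative assumptions, and difflib's ratio is only a crude stand-in for the alignment-based sequence identity computed by tools such as MMseqs2 or BLAST.

```python
from difflib import SequenceMatcher

def cross_partition_leakage(train_seqs, test_seqs, threshold=0.5):
    """Estimate the fraction of test sequences with a near-identical
    training homolog (illustrative sketch; real pipelines would use
    MMseqs2 or BLAST to compute sequence identity)."""
    leaked = 0
    for test_seq in test_seqs:
        if any(SequenceMatcher(None, test_seq, train_seq).ratio() >= threshold
               for train_seq in train_seqs):
            leaked += 1
    return leaked / max(len(test_seqs), 1)
```

On a redundant dataset split at random, this fraction would be expected to be high; a cluster-exclusive split of the kind described below should drive it toward zero.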
To address these limitations, we introduce GRIMM (Genetic stRatification for Inference in Molecular Modeling), a methodology for constructing similarity-aware train-test splits that yield pseudo-OOD evaluation sets approximating the biological novelty encountered in real-world deployment. In this approach, sequences are grouped into sequence-similarity clusters, such as UniRef50 Suzek et al. [2007], UniProt Consortium [2021], or any user-provided cluster ID, and each cluster is assigned exclusively to one partition for each of the classification labels of interest. This creates a clear separation between training and evaluation sequences that maximizes sequence dissimilarity between sets. We define two test sets: Test-1, a closed-set evaluation containing sequences from the same labels as training; and Test-2, an open-set, pseudo-OOD evaluation containing sequences derived from orphaned clusters whose labels are absent from training, representing novel regions of sequence space. We call Test-2 “pseudo”-OOD rather than truly OOD because it is produced by clustering publicly available data rather than drawn from genuinely never-before-seen sequences.
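As a minimal sketch of the cluster-exclusive assignment just described, the code below assumes each record carries a cluster identifier (e.g., a UniRef50 ID) and a function label; the partition fractions, field names, and function name are illustrative assumptions rather than the exact GRIMM procedure.

```python
import random
from collections import defaultdict

def grimm_style_split(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign whole similarity clusters to partitions (illustrative sketch).

    records: iterable of dicts with 'cluster_id', 'label', and 'sequence'.
    Returns train, val, test1 (closed-set), and test2 (open-set) lists.
    """
    # Group sequences so that clusters move between partitions as units.
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster_id"]].append(rec)

    cluster_ids = sorted(by_cluster)
    random.Random(seed).shuffle(cluster_ids)

    n_train = int(train_frac * len(cluster_ids))
    n_val = int(val_frac * len(cluster_ids))
    train_ids = cluster_ids[:n_train]
    val_ids = cluster_ids[n_train:n_train + n_val]
    test_ids = cluster_ids[n_train + n_val:]

    train = [r for c in train_ids for r in by_cluster[c]]
    val = [r for c in val_ids for r in by_cluster[c]]

    # Held-out sequences whose labels were seen in training form the
    # closed set (Test-1); sequences from orphaned clusters with unseen
    # labels form the open set (Test-2).
    seen_labels = {r["label"] for r in train}
    test1, test2 = [], []
    for c in test_ids:
        for r in by_cluster[c]:
            (test1 if r["label"] in seen_labels else test2).append(r)
    return train, val, test1, test2
```

Because clusters are never divided, no test sequence shares a cluster, and hence high sequence similarity, with any training sequence, and labels that never reach the training partition surface only in Test-2.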
We demonstrate GRIMM using the Enzyme Commission (EC) classification system as a concrete example, but the framework is broadly applicable to other structured biological labeling systems, including Gene Ontology terms and protein family annotations. By focusing on the construction of reproducible, pseudo-OOD data splits, GRIMM provides a general framework for evaluating model generalization in biological prediction tasks. Notably, this methodology formalizes practices that researchers may already employ implicitly, such as clustering sequences by similarity to reduce leakage, into a consistent framework with explicit closed-set and open-set definitions. Our methodology exposes limitations of traditional splitting strategies, enables the efficient creation of challenging evaluation partitions, and supports systematic benchmarking of computational methods on both familiar and evolutionarily novel sequences. As part of this work, we also release a five-fold EC function prediction dataset of amino acid sequences from proteins with experimentally characterized function, curated from the SwissProt database UniProt Consortium [2025] and publicly available via GitHub and the HuggingFace API.
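A release of this kind would typically be consumed through the datasets library, as in the sketch below; the repository id is a hypothetical placeholder (the actual path is listed in the accompanying GitHub repository), and the split and column names are assumptions.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical placeholder id; the real HuggingFace path is given in the
# project's GitHub repository.
ds = load_dataset("example-org/grimm-ec-5fold")

# Assumed split and column names; the release may organize folds differently.
for row in ds["train"].select(range(3)):
    print(row["sequence"][:40], row["label"])
```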
Several prior studies have emphasized the importance of careful dataset design for evaluating Bio-AI models and protein function prediction specifically. Work on annotation quality and benchmarking has shown that homologous leakage between training and test sets can substantially inflate accuracy and obscure true generalization performance, particularly when random or non-stringent dataset splits are used Schnoes et al. [2009], Gerlt et al. [2016], Radivojac et al. [2013], Salzberg [2019], Florensa et al. [2024]. Related analyses have also highlighted the limitations of homology-based annotation transfer in the “twilight zone” of sequence identity, where alignment signals become unreliable and evolutionary divergence frequently leads to functional shifts Rost [1999], Tawfik [2020], Khersonsky and Tawfik [2006]. In parallel, metagenomic and microbiome studies consistently reveal that the majority of enzyme sequence space remains sparsely characterized.