Accurate de novo sequencing of the modified proteome with OmniNovo

Reading time: 5 minutes

📝 Original Info

  • Title: Accurate de novo sequencing of the modified proteome with OmniNovo
  • ArXiv ID: 2512.12272
  • Date: 2025-12-13
  • Authors: Yuhan Chen, Shang Qu, Zhiqiang Gao, Yuejin Yang, Xiang Zhang, Sheng Xu, Xinjie Mao, Liujia Qian, Jiaqi Wei, Zijie Qiu, Chenyu You, Lei Bai, Ning Ding, Tiannan Guo, Bowen Zhou, Siqi Sun

📝 Abstract

Post-translational modifications (PTMs) serve as a dynamic chemical language regulating protein function, yet current proteomic methods remain blind to a vast portion of the modified proteome. Standard database search algorithms suffer from a combinatorial explosion of search spaces, limiting the identification of uncharacterized or complex modifications. Here we introduce OmniNovo, a unified deep learning framework for reference-free sequencing of unmodified and modified peptides directly from tandem mass spectra. Unlike existing tools restricted to specific modification types, OmniNovo learns universal fragmentation rules to decipher diverse PTMs within a single coherent model. By integrating a mass-constrained decoding algorithm with rigorous false discovery rate estimation, OmniNovo achieves state-of-the-art accuracy, identifying 51% more peptides than standard approaches at a 1% false discovery rate. Crucially, the model generalizes to biological sites unseen during training, illuminating the dark matter of the proteome and enabling unbiased comprehensive analysis of cellular regulation.

💡 Deep Analysis
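
The abstract mentions a mass-constrained decoding algorithm but gives no details of it. As a rough illustration of the underlying idea only, the sketch below checks whether a candidate peptide, with optional PTMs, is consistent with the measured precursor mass. The function names, the PTM labels, and the 20 ppm tolerance are assumptions made for this example and are not taken from OmniNovo; the residue and modification masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide (N-terminal H + C-terminal OH)
PROTON = 1.007276   # mass of one charge carrier

# Monoisotopic mass shifts for a few common PTMs.
# The dictionary keys are hypothetical labels chosen for this sketch.
PTM_SHIFT = {
    "phospho": 79.96633,
    "methyl": 14.01565,
    "acetyl": 42.01057,
    "oxidation": 15.99491,
}


def peptide_mass(residues):
    """Neutral monoisotopic mass of a peptide.

    residues: list of (amino_acid, ptm_label_or_None) pairs.
    """
    total = WATER
    for amino_acid, ptm in residues:
        total += RESIDUE_MASS[amino_acid]
        if ptm is not None:
            total += PTM_SHIFT[ptm]
    return total


def satisfies_mass_constraint(residues, precursor_mz, charge, tol_ppm=20.0):
    """True if the candidate's mass matches the precursor within tol_ppm."""
    observed = (precursor_mz - PROTON) * charge  # neutral mass from m/z
    theoretical = peptide_mass(residues)
    ppm_error = abs(theoretical - observed) / theoretical * 1e6
    return ppm_error <= tol_ppm


# Usage: PEPTIDE with and without a phosphate group on the threonine.
plain = [(aa, None) for aa in "PEPTIDE"]
phospho = [(aa, "phospho" if aa == "T" else None) for aa in "PEPTIDE"]
mz = (peptide_mass(phospho) + 2 * PROTON) / 2  # simulated 2+ precursor
print(satisfies_mass_constraint(phospho, mz, charge=2))
print(satisfies_mass_constraint(plain, mz, charge=2))
```

A decoder could use a check of this kind to discard candidate sequences whose total mass cannot match the spectrum; how OmniNovo actually enforces the constraint is not described in the text shown here.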

📄 Full Content

Yuhan Chen(1,2), Shang Qu(2,3), Zhiqiang Gao(2), Yuejin Yang(2,4), Xiang Zhang(5), Sheng Xu(2,4), Xinjie Mao(2,6,7), Liujia Qian(6), Jiaqi Wei(2,8), Zijie Qiu(2), Chenyu You(9), Lei Bai(2), Ning Ding(2,3)*, Tiannan Guo(6)*, Bowen Zhou(2,3)*, Siqi Sun(4)*

  • 1: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China.
  • 2: Shanghai Artificial Intelligence Laboratory, Shanghai, China.
  • 3: Department of Electronic Engineering, Tsinghua University, Beijing, China.
  • 4: Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China.
  • 5: Department of Computer Science, University of British Columbia, Vancouver, Canada.
  • 6: School of Medicine, Westlake University, Hangzhou, Zhejiang, China.
  • 7: Shanghai Innovation Institute, Shanghai, China.
  • 8: Department of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China.
  • 9: Department of Computer Science, Stony Brook University, New York, USA.

arXiv:2512.12272v1 [q-bio.QM] 13 Dec 2025

Introduction

Protein sequencing forms the foundation of modern proteomics, enabling the unbiased characterization of proteoforms and the discovery of novel biological mechanisms[1, 2]. While the primary amino acid sequence provides the structural blueprint, post-translational modifications (PTMs) constitute a dynamic chemical language that regulates protein function, stability, and interaction[3, 4]. Common modifications such as phosphorylation and methylation are pivotal in cellular signaling, transcriptional control, and cell cycle regulation[5–10]. However, the biological impact of these events is often determined by a precise molecular logic: specifically, the exact residue modified and the combinatorial presence of multiple PTMs on a single peptide[6, 11, 12]. Consequently, the ability to accurately sequence and localize diverse co-existing PTMs is not merely a technical convenience but a prerequisite for mapping the functional landscape of the cell[2, 6].

Despite the ubiquity of mass spectrometry for high-throughput sequencing, identifying modified peptides remains a fundamental bottleneck. The conventional approach relies on matching experimental spectra against static reference databases[13–19]. This strategy is inherently biased: it is effectively blind to the “dark matter” of the proteome, failing to identify non-canonical peptides, unexpected mutations, or modifications not explicitly pre-defined in the search space[20–23]. Furthermore, attempting to capture the full spectrum of biological variation triggers a combinatorial explosion; including even a modest number of variable modifications expands the search space exponentially, rendering database search computationally prohibitive and statistically underpowered[24, 25].

To overcome these reference-dependent limitations, deep learning-based de novo sequencing has emerged as a powerful alternative[26, 27]. Recent transformer-based models[22, 28–30, 30–37], such as Casanovo[34], InstaNovo[37] and π-PrimeNovo[36], have leveraged large-scale training data to achieve high accuracy on unmodified peptides. However, extending these successes to the modified proteome has proven difficult. Current solutions are fragmented: tools like InstaNovo-P[38] are restricted to single modification types (e.g., phosphorylation) or rely on fine-tuning separate models for specific PTM classes[32, 36]. This “one-model-per-PTM” paradigm is scalable neither computationally nor biologically, as it typically models modifications as fixed residue-PTM pairs. Such rigid representations prevent models from learning generalizable spectral rules, limiting their ability to resolve complex peptides where diverse modifications coexist.

Here we introduce OmniNovo, a unified deep learning framework engineered to decipher the comprehensive language of modified peptides
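
The combinatorial explosion described in the Introduction can be made concrete with a small counting sketch. It assumes a simplified model in which each residue is either unmodified or carries exactly one of the variable modifications allowed on it, with no cap on the number of modifications per peptide. The peptide string and the function name are hypothetical and chosen only for illustration.

```python
from math import prod


def count_modified_forms(peptide, variable_mods):
    """Count the candidate forms a database search would have to consider.

    variable_mods maps a residue letter to the number of variable
    modifications allowed on that residue. Each residue contributes
    (1 + number of allowed modifications) independent states.
    """
    return prod(1 + variable_mods.get(aa, 0) for aa in peptide)


peptide = "STSYKTSK"  # hypothetical peptide: six S/T/Y sites and two lysines

# Phosphorylation only, on S, T and Y: 2^6 forms.
print(count_modified_forms(peptide, {"S": 1, "T": 1, "Y": 1}))  # 64

# Adding two variable modifications on K (e.g. methyl, acetyl): 2^6 * 3^2 forms.
print(count_modified_forms(peptide, {"S": 1, "T": 1, "Y": 1, "K": 2}))  # 576
```

Under these assumptions a single eight-residue peptide already yields hundreds of candidates, and the count multiplies with every additional modifiable site. Real search engines usually cap the number of modifications per peptide, which slows the growth but does not remove it.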

Reference

This content is AI-processed based on open access ArXiv data.
