Learning the Unseen: Data-Augmented Deep Learning for PTM Discovery with Prosit-PTM
Abstract
Post-translational modifications (PTMs) are critical regulators of protein function, yet confidently identifying and localizing PTM sites across proteomes remains challenging. Integrating peptide property predictions into spectrum interpretation improves identification performance, but training data that enable zero-shot prediction across diverse PTMs are scarce. Here, we present a major expansion of the ProteomeTools dataset, comprising over 977,000 synthetic peptides and covering 22 PTM–residue combinations. We also developed Prosit-PTM, a model that combines chemically informed encoding with amino acid substitution-based augmentation and is trained on our novel ground-truth dataset; it achieves accurate zero-shot predictions. Applied to modified peptides, Prosit-PTM enhances PTM-site localization in phosphoproteomics, increases identification of multiply modified peptides in histones, and enables data-driven rescoring for unseen modifications, such as those on HLA peptides. Furthermore, the learned embeddings of amino acids and modifications capture physicochemical relationships underlying PTM-driven HLA presentation. Prosit-PTM is integrated into multiple open-source tools, enabling PTM-aware rescoring, site localization, spectral library generation, and beyond.