Learning the Unseen: Data-Augmented Deep Learning for PTM Discovery with Prosit-PTM

Abstract

Post-translational modifications (PTMs) are critical regulators of protein function, yet confidently identifying and localizing PTM sites across proteomes remains challenging. Integrating peptide property predictions into spectrum interpretation improves identification performance, but training data that enable zero-shot prediction across diverse PTMs are scarce. Here, we present a major expansion of the ProteomeTools dataset, comprising over 977,000 synthetic peptides and covering 22 PTM–residue combinations. We also developed Prosit-PTM, a model with chemically informed encoding and amino acid substitution-based augmentation, trained on our novel ground-truth dataset, that achieves accurate zero-shot predictions. Applied to modified peptides, Prosit-PTM enhances PTM-site localization in phosphoproteomics, increases identification of multiply modified peptides in histones, and enables data-driven rescoring for unseen modifications, such as those on HLA peptides. Moreover, the learned embeddings of amino acids and modifications capture physicochemical relationships underlying PTM-driven HLA presentation. Prosit-PTM is integrated into multiple open-source tools, enabling PTM-aware rescoring, site localization, spectral library generation, and beyond.
