Task-Specialized Protein Language Models Decode the Sequence Grammar of Post-Translational Modification Sites

Abstract

Post-translational modifications (PTMs) regulate protein signaling, localization, degradation, and cellular decision-making, yet the sequence determinants that distinguish modified from chemically eligible but unmodified residues remain difficult to decode at proteome scale. Here, we examine whether adapting a general protein language model to PTM-site prediction can reveal the biochemical logic underlying residue-level modification. We fine-tune ESM2, a protein language model trained on tens of millions of evolutionarily diverse protein sequences, for phosphorylation, acetylation, and ubiquitination-site prediction. To address the pronounced class imbalance inherent in proteome-wide PTM annotation, we combine parameter-efficient fine-tuning with focal-loss training. The resulting task-specialized models show that PTM recognition depends on model capacity, annotation depth, and modification chemistry: phosphorylation benefits from larger models, whereas acetylation and ubiquitination peak at intermediate scale. More importantly, the fine-tuned phosphorylation model exposes three layers of biological organization: it recovers canonical kinase-recognition motifs without kinase-label supervision, resolves pathway-level functional relationships among proteins from sequence-derived embeddings, and preserves evolutionary signatures of homologous phosphorylation sites across 200 eukaryotic species. These results establish task-specialized protein language models as interpretable instruments for probing PTM-site biochemistry, kinase specificity, functional organization, and evolutionary conservation.
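
The focal-loss training mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation or hyperparameters, so the binary form below, the function name `binary_focal_loss`, and the default values `gamma=2.0` and `alpha=0.25` are assumptions taken from the commonly used definition of focal loss, not the authors' settings. The idea is that confidently classified residues (mostly the abundant unmodified ones) are down-weighted by the factor (1 - p_t)^gamma, so the rare modified sites contribute relatively more to the gradient.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over predicted probabilities.

    probs  : predicted probability that each residue is modified (hypothetical input)
    labels : 1 for an annotated PTM site, 0 for an unmodified residue
    gamma  : focusing parameter; gamma = 0 reduces to alpha-weighted cross-entropy
    alpha  : weight on the positive class (assumed value, not from the paper)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Clamp to avoid log(0).
        p_t = min(max(p_t, eps), 1.0)
        # (1 - p_t)^gamma shrinks the loss of well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # A confidently correct negative versus a poorly predicted positive site.
    easy_negative = binary_focal_loss([0.05], [0])
    hard_positive = binary_focal_loss([0.05], [1])
    print(easy_negative, hard_positive)
```

In a fine-tuning setup this quantity would replace plain cross-entropy on the per-residue logits; larger `gamma` values focus training more strongly on hard, misclassified residues.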
