Task-Specialized Protein Language Models Decode the Sequence Grammar of Post-Translational Modification Sites
Abstract
Post-translational modifications (PTMs) regulate protein signaling, localization, degradation, and cellular decision-making, yet the sequence determinants that distinguish modified from chemically eligible but unmodified residues remain difficult to decode at proteome scale. Here, we examine whether adapting a general protein language model to PTM-site prediction can reveal the biochemical logic underlying residue-level modification. We fine-tune ESM2, a protein language model trained on tens of millions of evolutionarily diverse protein sequences, for phosphorylation, acetylation, and ubiquitination-site prediction. To address the pronounced class imbalance inherent in proteome-wide PTM annotation, we combine parameter-efficient fine-tuning with focal-loss training. The resulting task-specialized models show that PTM recognition depends on model capacity, annotation depth, and modification chemistry: phosphorylation benefits from larger models, whereas acetylation and ubiquitination peak at intermediate scale. More importantly, the fine-tuned phosphorylation model exposes three layers of biological organization: it recovers canonical kinase-recognition motifs without kinase-label supervision, resolves pathway-level functional relationships among proteins from sequence-derived embeddings, and preserves evolutionary signatures of homologous phosphorylation sites across 200 eukaryotic species. These results establish task-specialized protein language models as interpretable instruments for probing PTM-site biochemistry, kinase specificity, functional organization, and evolutionary conservation.
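The abstract names focal-loss training as the remedy for class imbalance but does not give the formulation or hyperparameters. As a minimal sketch, assuming the standard binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), the per-residue objective could look like the following. The function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions (the defaults are common choices in the focal-loss literature), not values reported in this preprint.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over candidate residues (illustrative sketch).

    probs  : predicted probability that each residue is modified
    labels : 1 for an annotated PTM site, 0 for an unmodified residue
    gamma  : focusing parameter; larger values down-weight easy examples more
    alpha  : weight on the positive class (assumed value, not from the preprint)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        p_t = min(max(p_t, eps), 1.0)           # guard against log(0)
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# One well-classified negative and one poorly classified positive:
# the hard positive dominates the mean loss.
example = binary_focal_loss([0.05, 0.2], [0, 1])
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which makes the role of the focusing term easy to check: raising `gamma` shrinks the contribution of confidently classified residues, which in proteome-wide PTM annotation are mostly the abundant unmodified sites.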