Extending Prot2Token: Aligning Protein Language Models for Unified and Diverse Protein Prediction Tasks
Abstract
Comprehensive protein function and property prediction remains a major challenge due to the vast diversity of sequences, structural variations, and limited labeled data. Existing models are often task-specific and require independent training, which limits scalability. To address this, we extend Prot2Token, a unified autoregressive framework that focuses on the post-training alignment of pre-trained protein language models (PLMs), to new applications. Our approach enables next-token prediction across a new range of protein prediction tasks, including protein-protein structure similarity, 3D structure prediction, mutation stability, post-translational modifications (PTMs), substrate-kinase phosphorylation sites, protein-protein affinity, and protein-ion binding sites. We introduce a self-supervised pre-training stage for the decoder, enhancing model initialization and improving downstream predictions. By integrating a causal autoregressive transformer with a pre-trained ESM-2 encoder, our model effectively aligns diverse protein tasks within a single framework. Additionally, we discuss the opportunities and limitations of this approach, providing insights for future research in optimizing PLMs as a general tool for broader biological applications. Code is available in the GitHub repository.
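The unifying idea above is that heterogeneous tasks (site prediction, scalar regression, similarity) can all be serialized as decoder token sequences prefixed by a task token. The sketch below illustrates this serialization scheme; the token names, vocabulary layout, and label formats are illustrative assumptions, not the framework's exact implementation.

```python
# Minimal sketch of a Prot2Token-style unified token interface.
# All names here (e.g. "<task_ptm>") are hypothetical examples.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_vocab(task_names):
    """Map special tokens, per-task tokens, and label characters to ids."""
    tokens = ["<bos>", "<eos>", "<pad>"]
    tokens += [f"<task_{t}>" for t in task_names]  # one token per task
    tokens += list(AMINO_ACIDS)                    # residue labels
    tokens += [str(d) for d in range(10)]          # digits for positions
    tokens += [".", "-"]                           # for signed scalar labels
    return {tok: i for i, tok in enumerate(tokens)}

def encode_labels(task, labels, vocab):
    """Serialize any task's labels as one flat decoder target:
    <bos> <task_X> label-characters ... <eos>."""
    ids = [vocab["<bos>"], vocab[f"<task_{task}>"]]
    for lab in labels:
        ids.extend(vocab[ch] for ch in str(lab))
    ids.append(vocab["<eos>"])
    return ids

vocab = build_vocab(["ptm", "stability"])
# PTM sites expressed as residue positions; stability as a scalar string.
ptm_ids = encode_labels("ptm", [12, 45], vocab)
stab_ids = encode_labels("stability", ["-3.7"], vocab)
```

Because every task reduces to the same next-token objective over this shared vocabulary, one decoder (conditioned on ESM-2 encoder states) can be trained on all tasks jointly, with the task token steering which output format to generate.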