PhosF3C: A Feature Fusion Architecture with Fine-Tuned Protein Language Model and Conformer for prediction of general phosphorylation site
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Protein phosphorylation, a key post-translational modification (PTM), provides essential insight into protein properties, making its prediction highly significant. Using the emerging capabilities of large language models (LLMs), we apply LoRA fine-tuning to ESM2, a powerful protein large language model, to efficiently extract features with minimal computational resources, optimizing task-specific text alignment. Additionally, we integrate the conformer architecture with the Feature Coupling Unit (FCU) to enhance local and global feature exchange, further improving prediction accuracy. Our model achieves state-of-the-art (SOTA) performance, obtaining AUC scores of 79.5%, 76.3%, and 71.4% at the S, T, and Y sites of the general data sets. Based on the powerful feature extraction capabilities of LLMs, we conduct a series of analyses on protein representations, including studies on their structure, sequence, and various chemical properties (such as Hydrophobicity (GRAVY), Surface Charge, and Isoelectric Point). We propose a test method called Linear Regression Tomography (LRT) which is a top-down method using representation to explore the model's feature extraction capabilities, offering a pathway to improved interpretability. Our resources, including data and code, are publicly accessible at \url{https://github.com/SkywalkerLuke/PhosF3C}