Fine-Tuning Protein Language Models Enhances the Identification and Interpretation of Transcription Factors

Abstract

Transcription factors (TFs) are pivotal regulators of gene expression and play essential roles in diverse cellular activities. The three-dimensional organization of the genome and transcriptional regulation are predominantly orchestrated by TFs. By recruiting the transcriptional machinery to gene enhancers or promoters, TFs can either activate or repress transcription, thereby controlling gene activity and a wide range of biological pathways. Accurate identification of TFs is therefore vital for elucidating gene regulatory mechanisms within cells, yet experimental identification remains labor-intensive and time-consuming, underscoring the need for efficient computational approaches. In this study, we present a two-layer predictive framework built on protein language models (pLMs) adapted through both full fine-tuning and parameter-efficient fine-tuning. The first layer identifies transcription factors, while the second layer predicts TFs with a binding preference for methylated DNA (TFPMs). Our approach further incorporates attention weights and protein sequence motifs to improve both interpretability and predictive performance. By leveraging attention mechanisms, we highlight biologically relevant regions of the protein sequence that contribute most strongly to the predictions, and motif analysis identifies conserved sequence patterns critical for TF recognition and function. On independent test sets, these features allowed our methods to consistently surpass contemporary models in both TF and TFPM classification.
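To ground the description above, the following is a minimal sketch of how the first-layer TF classifier could be assembled, assuming an ESM-2 checkpoint from HuggingFace transformers and LoRA adapters from the peft library for parameter-efficient fine-tuning. The checkpoint name, LoRA hyperparameters, and the attention-readout step are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a parameter-efficient pLM fine-tune for TF classification,
# with attention weights retained for interpretability. Assumptions:
# ESM-2 backbone, LoRA adapters, binary TF/non-TF labels.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model_name = "facebook/esm2_t12_35M_UR50D"  # assumed backbone; any pLM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # binary task: TF vs. non-TF
)

# Parameter-efficient fine-tuning: freeze the backbone and train low-rank
# adapters injected into the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],  # names of ESM attention projections
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the backbone

# Forward pass on one protein sequence, keeping attention weights so that
# high-attention residues can later be mapped to candidate sequence motifs.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence, not a real TF
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

probs = torch.softmax(outputs.logits, dim=-1)
print(f"P(transcription factor) = {probs[0, 1].item():.3f}")

# Average attention received by each residue in the last layer; peaks suggest
# regions (e.g., DNA-binding domains) that drive the prediction.
last_layer_attn = outputs.attentions[-1]                  # (batch, heads, seq, seq)
per_residue = last_layer_attn.mean(dim=1)[0].mean(dim=0)  # (seq,)
top = per_residue[1:-1].topk(5).indices + 1               # skip special tokens
print("Highest-attention residue positions:", sorted(top.tolist()))
```

The same pattern would extend to the second layer by swapping in TFPM labels, and the per-residue attention profile extracted at the end is the kind of signal that could be aligned against conserved motifs for the interpretation step the abstract describes.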
