Prediction of lncRNA-protein interacting pairs using LLM embeddings based on evolutionary information
Abstract
Interactions of long non-coding RNAs (lncRNAs) with proteins are responsible for numerous cellular processes, including transcriptional regulation, chromatin remodeling, cell differentiation, and intracellular signaling. In the past, numerous computational methods have been developed for predicting lncRNA–protein interacting (LPI) pairs. This study describes a highly accurate and reliable method for predicting LPI pairs, built on the largest possible non-redundant dataset, comprising 262,244 interacting pairs and an equal number of non-interacting pairs. Initially, we tried a similarity-based approach using BLAST, which showed poor discriminative power owing to low sequence similarity. Subsequently, we developed CNN-based and machine-learning-based models using traditional features and embeddings. Our CatBoost model, developed using embeddings generated by DNABERT-2 and ESM-2-t30, achieved an AUC of 0.989 and an MCC of 0.915 on an independent dataset. Our method performs better than existing methods on an independent dataset. We developed standalone software and a web server, lncrnaPI, for predicting LPI pairs and for scanning lncRNA-interacting proteins in proteomes and protein-interacting lncRNAs in genomes (https://webs.iiitd.edu.in/raghava/lncrnapi/).
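For readers unfamiliar with the Matthews correlation coefficient (MCC) reported above, the following is a minimal sketch of how it is computed from binary predictions. The function names and the toy labels are illustrative assumptions; they are not taken from the lncrnaPI code or data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = interacting, 0 = non-interacting)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = mcc(*confusion_counts(y_true, y_pred))
print(round(score, 3))
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so on a balanced dataset such as the one described here a high value requires both few false positives and few false negatives.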
HIGHLIGHTS
- Discrimination of lncRNA-protein interacting and non-interacting pairs.
- Non-redundant dataset of 262,244 interacting and 262,244 non-interacting pairs.
- Embedding of lncRNA using DNABERT-2 and protein using ESM-2-t30.
- Rapid scanning of lncRNA-interacting proteins at genome scale.
- A web server and software for predicting lncRNA-protein interacting pairs.
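One common way to turn two separate sequence embeddings into a single input for a tabular classifier is to pool each sequence's per-token vectors and concatenate the results. The sketch below illustrates that idea only. Mean pooling, concatenation, the function names, and the toy numbers are all assumptions made for illustration; the text above does not state how lncrnaPI combines the DNABERT-2 and ESM-2-t30 embeddings.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length per-token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError("token_vectors must not be empty")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def pair_features(lnc_tokens, prot_tokens):
    """Concatenate the pooled lncRNA vector and the pooled protein vector (assumed scheme)."""
    return mean_pool(lnc_tokens) + mean_pool(prot_tokens)


# Toy stand-ins for model outputs; real embeddings have hundreds of dimensions.
lnc_tokens = [[1.0, 2.0], [3.0, 4.0]]
prot_tokens = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
features = pair_features(lnc_tokens, prot_tokens)
print(features)
```

A fixed-length vector of this kind can be passed to a gradient-boosting classifier regardless of the lengths of the original sequences, which is one reason pooled embeddings are often used for pair-level prediction.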