Predicting non-coding RNA function using Artificial Intelligence

David da Costa Correia
Francisco M. Couto
Hugo Martiniano

Abstract

Non-coding RNAs (ncRNAs) represent the majority of human gene products and are involved in many important biological processes, which is why they are considered relevant disease biomarkers and therapeutic agents. However, information about these biomolecules remains scattered, mostly across scientific research articles. Aggregating and summarizing the existing information is therefore of pivotal importance.

Natural Language Processing (NLP) methods applied to text mining can be used to generate collections of annotated sentences expressing relations between entities, called relational corpora.

In this work, we developed a text mining pipeline that uses Distant Supervision Relation Extraction (DSRE) to generate a ncRNA-phenotype relational corpus (ncoRP), comprising 21,608 annotated articles, 2,835 unique ncRNAs, 1,118 unique phenotypes and 35,295 unique relations, with a precision of 0.761 and an F1-score of 0.593, calculated through human validation. DSRE methods require a set of pre-documented relations to function. To meet this requirement, a high-fidelity ncRNA-phenotype relation dataset, consisting of 214,300 unique relations, was created by aggregating five comprehensive ncRNA-disease functional annotation databases. Both ncoRP and the relation dataset are important contributions towards addressing the sparseness of information about ncRNAs.
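The distant supervision idea described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the names `aggregate_relations`, `distant_label` and `AnnotatedSentence`, the case-insensitive matching, and the placeholder entities (`RNA-A`, `Phenotype-X`) are all assumptions made for illustration, and entity recognition is assumed to have already happened upstream.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AnnotatedSentence:
    text: str
    ncrna: str
    phenotype: str
    related: bool  # distant label derived from the relation dataset, not a human judgment


def aggregate_relations(*sources):
    """Merge several relation sets into one deduplicated set of (ncRNA, phenotype) pairs.

    Assumption: names are normalized by lowercasing only; a real pipeline
    would likely map names to database identifiers instead.
    """
    merged = set()
    for source in sources:
        merged.update((rna.lower(), pheno.lower()) for rna, pheno in source)
    return merged


def distant_label(sentences, known_relations):
    """Label every co-occurring (ncRNA, phenotype) pair in each sentence.

    `sentences` is an iterable of (text, ncrnas, phenotypes) tuples.
    A pair is labeled as related when it appears in `known_relations`.
    """
    corpus = []
    for text, ncrnas, phenotypes in sentences:
        for rna, pheno in product(ncrnas, phenotypes):
            related = (rna.lower(), pheno.lower()) in known_relations
            corpus.append(AnnotatedSentence(text, rna, pheno, related))
    return corpus


# Toy, hypothetical inputs: two small "databases" with one overlapping relation.
db_a = {("RNA-A", "Phenotype-X")}
db_b = {("rna-a", "phenotype-x"), ("RNA-B", "Phenotype-Y")}
known = aggregate_relations(db_a, db_b)

sentences = [
    ("RNA-A is upregulated in Phenotype-X and Phenotype-Y.",
     ["RNA-A"], ["Phenotype-X", "Phenotype-Y"]),
]
corpus = distant_label(sentences, known)
```

Because the labels come from the relation dataset rather than from reading each sentence, some of them will be wrong, which is consistent with the abstract reporting precision and F1-score from a separate human validation step.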

Large Language Models (LLMs) are an emergent type of language model that shows great capabilities in general task-solving through text generation, without requiring fine-tuning on large datasets. In this work, an LLM-based relation extraction (RE) methodology is proposed and evaluated, achieving an F1-score of 0.978 by combining the RE task with a preceding sentence filtering task and applying prompting principles such as in-context learning and Chain-of-Thought self-explanation.
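A two-stage setup of this kind, a sentence filter followed by relation extraction, each prompted with worked examples and a request to reason before answering, might look roughly like the sketch below. The prompt wording, the example sentences, the `Answer: yes/no` reply format, and the `llm` callable are all hypothetical; the abstract does not specify them, and no particular model API is assumed.

```python
from typing import Callable, List, Tuple

# (sentence, reasoning, answer) triples used as in-context examples.
Example = Tuple[str, str, str]

# Hypothetical task descriptions; the actual prompts are not given in the abstract.
FILTER_TASK = ("Decide whether the sentence makes a statement that could express "
               "a relation between {ncrna} and {phenotype}.")
RE_TASK = "Decide whether the sentence states that {ncrna} is related to {phenotype}."

EXAMPLES: List[Example] = [
    ("RNA-A promotes Phenotype-X progression.",
     "The sentence says RNA-A drives Phenotype-X.", "yes"),
    ("We measured RNA-A and Phenotype-X separately.",
     "No link between the two is asserted.", "no"),
]


def build_prompt(task: str, sentence: str, ncrna: str, phenotype: str,
                 examples: List[Example]) -> str:
    """Assemble a few-shot prompt that asks for reasoning before the answer."""
    lines = [
        task.format(ncrna=ncrna, phenotype=phenotype),
        "Explain your reasoning first, then finish with 'Answer: yes' or 'Answer: no'.",
        "",
    ]
    for ex_sentence, reasoning, answer in examples:
        lines += [f"Sentence: {ex_sentence}", f"Reasoning: {reasoning}",
                  f"Answer: {answer}", ""]
    lines += [f"Sentence: {sentence}", "Reasoning:"]
    return "\n".join(lines)


def parse_answer(reply: str) -> bool:
    """Read the final 'Answer:' line; anything other than yes counts as no."""
    for line in reversed(reply.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip().lower().startswith("yes")
    return False


def extract_relation(sentence: str, ncrna: str, phenotype: str,
                     llm: Callable[[str], str]) -> bool:
    """Stage 1 filters the sentence; stage 2 runs only if the filter passes."""
    if not parse_answer(llm(build_prompt(FILTER_TASK, sentence, ncrna, phenotype, EXAMPLES))):
        return False
    return parse_answer(llm(build_prompt(RE_TASK, sentence, ncrna, phenotype, EXAMPLES)))
```

Here `llm` is any function that takes a prompt string and returns the model's reply, so the sketch can be exercised with a stub before being connected to a real model.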

Version published to 10.1101/2024.12.30.630736v2 on bioRxiv
Mar 11, 2025
Version published to 10.1101/2024.12.30.630736v1 on bioRxiv
Dec 30, 2024

Related articles

GeneChat: A Multi-Modal Large Language Model for Gene Function Prediction

This article has 3 authors:
1. Shashi Dhanasekar
2. Akash Saranathan
3. Pengtao Xie
This article has no evaluations
Latest version Jun 6, 2025

A Large-Scale Foundation Model for RNA Enables Diverse Function and Structure Prediction

This article has 10 authors:
1. Eric Xing
2. Shuxian Zou
3. Tianhua Tao
4. Sazan Mahbub
5. Caleb Ellington
6. Robin Algayres
7. Dian Li
8. Yonghao Zhuang
9. Hongyi Wang
10. Le Song
This article has no evaluations
Latest version May 7, 2025

Evaluating Large Language Models for Gene-to-Phenotype Mapping: The Critical Role of Full-Text Database Access

This article has 8 authors:
1. Nicolas Matthew Suhardi
2. Anastasia Oktarina
3. Julia Retzky
4. Damanpreet Dhillon
5. Dona Ninan
6. Mathias P.G. Bostrom
7. Xu Yang
8. Vincentius Jeremy Suhardi
This article has no evaluations
Latest version Jun 12, 2025
