Comprehensive Datasets for RNA Design, Machine Learning, and Beyond

Jan Badura
Agnieszka Rybarczyk
Tomasz Zok

Abstract

RNA molecules are essential in regulating biological processes such as gene expression, cellular differentiation, and development. Accurately predicting RNA secondary structures and designing sequences that fold into specific configurations remain significant challenges in computational biology, with far-reaching implications for medicine, synthetic biology, and biotechnology. While machine learning methodologies have been proposed to enhance prediction capabilities, they require high-quality training data. The lack of standardized benchmark datasets further hinders the development and evaluation of these tools. To address this, we created a comprehensive dataset of over 320 thousand instances from experimentally validated sources to establish a new community-wide benchmark for RNA design and modeling algorithms. Our dataset comprises numerous challenging structures for which state-of-the-art RNA inverse folders provide results of varying accuracy. We demonstrated the potential of the dataset by testing it with several popular open-source RNA design algorithms. Furthermore, we illustrated how our dataset can be used to train machine learning models that consider both RNA sequence and structure, potentially advancing RNA design and prediction capabilities.

Version published to 10.21203/rs.3.rs-6146242/v1 on Research Square
Mar 12, 2025

Related articles

Benchmarking Pre-trained Genomic Language Models for RNA Sequence-Related Predictive Applications

This article has 6 authors:
1. Ningyuan You
2. Chang Liu
3. Hai Lin
4. Sai Wu
5. Gang Chen
6. Ning Shen
This article has no evaluations
Latest version Mar 10, 2025

Ab initio RNA structure prediction with composite language model and denoised end-to-end learning

This article has 4 authors:
1. Yang Li
2. Chenjie Feng
3. Xi Zhang
4. Yang Zhang
This article has no evaluations
Latest version Mar 11, 2025

GeneBag: training a cell foundation model for broad-spectrum cancer diagnosis and prognosis with bulk RNA-seq data

This article has 9 authors:
1. Kun Tang
2. Yuhu Liang
3. Dan Li
4. Dong Luo
5. Augix Xu
6. Pengchao Luo
7. Yan Shao
8. Jianbo Yang
9. Xuejun Gong
This article has no evaluations
Latest version Mar 19, 2025