Diverse Database and Machine Learning Model to Narrow the Generalization Gap in RNA Structure Prediction

Silvi Rouskin
Alberic de Lajart
Yves Martin des Taillades
Colin Kalicki
Federico Fuchs Wightman
Justin Aruda
Dragui Salazar
Matthew Allan
Casper L’Esperance-Kerckhoff
Alex Kashi
Fabrice Jossinet

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Understanding macromolecular structures of proteins and nucleic acids is critical for discerning their functions and biological roles. Advanced techniques—crystallography, NMR, and CryoEM—have facilitated the determination of over 180,000 protein structures, all cataloged in the Protein Data Bank (PDB). This comprehensive repository has been pivotal in developing deep learning algorithms for predicting protein structures directly from sequences. In contrast, RNA structure prediction has lagged, and suffers from a scarcity of structural data. Here, we present the secondary structure models of 1098 pri-miRNAs and 1456 human mRNA regions determined through chemical probing. We develop a novel deep learning architecture, inspired from the Evoformer model of Alphafold and traditional architectures for secondary structure prediction. This new model, eFold, was trained on our newly generated database and over 300,000 secondary structures across multiple sources. We benchmark eFold on two new test sets of long and diverse RNA structures and show that our dataset and new architecture contribute to increasing the prediction performance, compared to similar state-of-the-art methods. All together, our results reveal that merely expanding the database size is insufficient for generalization across families, whereas incorporating a greater diversity and complexity of RNAs structures allows for enhanced model performance.

Version published to 10.21203/rs.3.rs-4159627/v1 on Research Square
Apr 4, 2024

Deep Learning Approaches for Accurate RNA 3D Structure Prediction from Primary Sequences

This article has 1 author:
1. Nnaemeka Kingsley Ugwumba
This article has no evaluationsLatest version Jan 29, 2026
The Evolution of the AlphaFold Architecture

This article has 1 author:
1. Y.C.B.J. Dissanayaka
This article has no evaluationsLatest version Jan 9, 2026
Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluationsLatest version Dec 23, 2025

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

Deep Learning Approaches for Accurate RNA 3D Structure Prediction from Primary Sequences

The Evolution of the AlphaFold Architecture

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome