LncRNA-BERT: An RNA Language Model for Classifying Coding and Long Non-Coding RNA
Abstract
Understanding (novel) RNA transcripts generated in next-generation sequencing experiments requires accurate classification, given the increasing evidence that long non-coding RNAs (lncRNAs) play crucial regulatory roles. Recent developments in Large Language Models present opportunities for classifying RNA coding potential with sequence-based algorithms that can overcome the limitations of classical approaches, which assess coding potential based on a set of predefined features. We present lncRNA-BERT, an RNA language model pre-trained and fine-tuned on human RNAs collected from the GENCODE, RefSeq, and NONCODE databases to classify lncRNAs. LncRNA-BERT matches or outperforms state-of-the-art classifiers on three test datasets, including the cross-species RNAChallenge benchmark. The pre-trained lncRNA-BERT model distinguishes coding from long non-coding RNA without supervised learning, confirming that coding potential is a sequence-intrinsic characteristic. LncRNA-BERT has been shown to benefit from pre-training on human data from GENCODE, RefSeq, and NONCODE, improving upon configurations pre-trained on the commonly used RNAcentral dataset. In addition, we propose a novel Convolutional Sequence Encoding method that is shown to be more effective and efficient than K-mer Tokenization and Byte Pair Encoding for training with long RNA sequences that would otherwise exceed the common context window size. lncRNA-BERT is available at https://github.com/luukromeijn/lncRNA-Py .
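To illustrate the contrast the abstract draws between token-based input and a convolutional encoding of nucleotides, the sketch below shows a minimal, hypothetical version of both ideas. It is not the authors' implementation; the kernel size, stride, embedding dimension, and non-overlapping k-mer splitting are illustrative assumptions chosen only to show how a strided convolution shortens the sequence seen by a transformer.

```python
# Illustrative sketch (assumptions, not the lncRNA-BERT code): k-mer tokenization
# versus a convolutional encoding that compresses nucleotide windows into
# embeddings, so longer RNAs fit inside a fixed transformer context window.
import torch
import torch.nn as nn


def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split an RNA sequence into non-overlapping k-mer tokens (assumed scheme)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


class ConvSequenceEncoder(nn.Module):
    """Map a one-hot encoded nucleotide sequence to embeddings with a strided
    convolution; each output position summarizes one window of nucleotides."""

    def __init__(self, embed_dim: int = 256, kernel_size: int = 9):
        super().__init__()
        # stride == kernel_size: non-overlapping windows, length / kernel_size positions
        self.conv = nn.Conv1d(4, embed_dim, kernel_size, stride=kernel_size)

    def forward(self, one_hot: torch.Tensor) -> torch.Tensor:
        # one_hot: (batch, 4, seq_len) -> (batch, seq_len // kernel_size, embed_dim)
        return self.conv(one_hot).transpose(1, 2)


# Example: a 9000-nt RNA becomes 1000 embedding positions (kernel_size=9),
# comfortably below a typical context window of ~1024 positions.
x = torch.zeros(1, 4, 9000)
print(ConvSequenceEncoder()(x).shape)  # torch.Size([1, 1000, 256])
print(kmer_tokenize("AUGGCCAUU"))      # ['AUG', 'GCC', 'AUU']
```

The design point being illustrated is that a convolutional front end trades a discrete vocabulary for learned window embeddings, which is one way to handle long RNAs without truncation; the paper's own parameterization may differ.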