LncRNA-BERT: An RNA Language Model for Classifying Coding and Long Non-Coding RNA
Abstract
Understanding (novel) RNA transcripts generated in next-generation sequencing experiments requires accurate classification, given the increasing evidence that long non-coding RNAs (lncRNAs) play crucial regulatory roles. Recent developments in Large Language Models present opportunities for classifying RNA coding potential with sequence-based algorithms that can overcome the limitations of classical approaches, which assess coding potential based on a set of predefined features. We present lncRNA-BERT, an RNA language model pre-trained and fine-tuned on human RNAs collected from the GENCODE, RefSeq, and NONCODE databases to classify lncRNAs. LncRNA-BERT matches or outperforms state-of-the-art classifiers on three test datasets, including the cross-species RNAChallenge benchmark. The pre-trained lncRNA-BERT model distinguishes coding from long non-coding RNA without supervised learning, confirming that coding potential is a sequence-intrinsic characteristic. LncRNA-BERT has been shown to benefit from pre-training on human data from GENCODE, RefSeq, and NONCODE, improving upon configurations pre-trained on the commonly used RNAcentral dataset. In addition, we propose a novel Convolutional Sequence Encoding method that is shown to be more effective and efficient than K-mer Tokenization and Byte Pair Encoding for training with long RNA sequences that would otherwise exceed the common context window size. lncRNA-BERT is available at https://github.com/luukromeijn/lncRNA-Py .
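To illustrate the contrast the abstract draws between token-based input and a convolutional encoding of nucleotides, the sketch below shows a minimal, hypothetical version of both ideas. It is not the authors' implementation; the kernel size, stride, embedding dimension, and non-overlapping k-mer splitting are illustrative assumptions chosen only to show how a strided convolution shortens the sequence seen by a transformer.

```python
# Illustrative sketch (assumptions, not the lncRNA-BERT code): k-mer tokenization
# versus a convolutional encoding that compresses nucleotide windows into
# embeddings, so longer RNAs fit inside a fixed transformer context window.
import torch
import torch.nn as nn


def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split an RNA sequence into non-overlapping k-mer tokens (assumed scheme)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


class ConvSequenceEncoder(nn.Module):
    """Map a one-hot encoded nucleotide sequence to embeddings with a strided
    convolution; each output position summarizes one window of nucleotides."""

    def __init__(self, embed_dim: int = 256, kernel_size: int = 9):
        super().__init__()
        # stride == kernel_size: non-overlapping windows, length / kernel_size positions
        self.conv = nn.Conv1d(4, embed_dim, kernel_size, stride=kernel_size)

    def forward(self, one_hot: torch.Tensor) -> torch.Tensor:
        # one_hot: (batch, 4, seq_len) -> (batch, seq_len // kernel_size, embed_dim)
        return self.conv(one_hot).transpose(1, 2)


# Example: a 9000-nt RNA becomes 1000 embedding positions (kernel_size=9),
# comfortably below a typical context window of ~1024 positions.
x = torch.zeros(1, 4, 9000)
print(ConvSequenceEncoder()(x).shape)  # torch.Size([1, 1000, 256])
print(kmer_tokenize("AUGGCCAUU"))      # ['AUG', 'GCC', 'AUU']
```

The design point being illustrated is that a convolutional front end trades a discrete vocabulary for learned window embeddings, which is one way to handle long RNAs without truncation; the paper's own parameterization may differ.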