ntEmbd: Deep learning embedding for nucleotide sequences

Saber Hafezqorani
Ka Ming Nip
Inanc Birol

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Enabled by the explosion of data and substantial increase in computational power, deep learning has transformed fields such as computer vision and natural language processing (NLP) and it has become a successful method to be applied to many transcriptomic analysis tasks. A core advantage of deep learning is its inherent capability to incorporate feature computation within the machine learning models. This results in a comprehensive and machine-readable representation of sequences, facilitating the downstream classification and clustering tasks. Compared to machine translation problems in NLP, feature embedding is particularly challenging for transcriptomic studies as the sequences are string of thousands of nucleotides in length, which make the long-term dependencies between features from different parts of the sequence even more difficult to capture. This highlights the need for nucleotide sequence embedding methods that are capable of learning input sequence features implicitly. Here we introduce ntEmbd, a deep learning embedding tool that captures dependencies between different features of the sequences and learns a latent representation for given nucleotide sequences. We further provide two sample use cases, describing how learned RNA features can be used in downstream analysis. The first use case demonstrates ntEmbd ’ s utility in classifying coding and noncoding RNA benchmarked against existing tools, and the second one explores the utility of learned representations in identifying adapter sequences in nanopore RNA-seq reads. The tool as well as the trained models are freely available on GitHub at https://github.com/bcgsc/ntEmbd

Version published to 10.1101/2024.04.30.591806 on bioRxiv
May 2, 2024

Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction

This article has 2 authors:
1. Jesus Antonio Motta
2. Pedro David Gomez
This article has no evaluationsLatest version Jan 27, 2026
A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluationsLatest version Dec 24, 2025
DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition

This article has 4 authors:
1. Hua-Lin Xu
2. Xiu-Jun Gong
3. Hua Yu
4. Ying-Kai Wang
This article has no evaluationsLatest version Dec 28, 2025

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

Convolutional Deep Learning Approach to identify DNA Sequences for Gene Prediction

A Survey on Efficient Protein Language Models

DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition