OmniNA: A foundation model for nucleotide sequences

Abstract

Foundation models have demonstrated exceptional efficacy across diverse downstream tasks. However, within the realms of genomics and transcriptomics, a notable gap persists in the availability of models that afford a comprehensive understanding of nucleotide sequence principles across various species. Here, we present OmniNA, a generative foundation model designed for comprehensive nucleotide sequence learning. The model was pre-trained on 91.7 million nucleotide sequences and their corresponding annotations, encompassing 1076.2 billion bases and 197 million words spanning a multitude of species. By analyzing the learned representations of the pre-trained model, we demonstrate that OmniNA gains the capacity to understand the semantics of nucleotide sequences and textual annotations. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural-language paradigms. We demonstrate that OmniNA-1.7B surpasses or rivals state-of-the-art methods in 17 nucleotide tasks, encompassing nucleotide sequence detection and species classification. The model's understanding of nucleotide grammar also enhances its capability to reveal the effects of mutations on DNA and RNA processing. We hereby release the OmniNA-1.7B model as an open-source contribution to the research community. This foundation model signifies a step toward advancing our comprehension of nucleotide sequences across diverse species and holds substantial promise for facilitating genomics and transcriptomics research.
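
As a rough illustration of the "aligning nucleotide tasks with natural-language paradigms" framing described in the abstract, the sketch below shows how such a generative model might be queried once fine-tuned. The checkpoint identifier and prompt template here are assumptions for illustration only; the abstract does not specify the released interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint identifier -- substitute the actual released
# OmniNA-1.7B weights; the abstract only states that the model is open-sourced.
MODEL_NAME = "OmniNA-1.7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Frame a sequence task (e.g., species classification) as text generation.
# The prompt template is an assumed format, not the authors' documented one.
prompt = (
    "Sequence: ACGTGGCTAAGTCCATTGCA\n"
    "Question: Which species does this sequence come from?\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=10)
# Decode only the newly generated tokens after the prompt.
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```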

Article activity feed

  1. Employing a sliding window approach, we truncate sequences longer than 3000 bp to 3000 bp. Sequences with length less than 200 bp are excluded

    I'm curious about the method employed here and the rationale behind it.

    1. What was the step size of the sliding window? Does that mean a nucleotide at position X in a genome would be captured many times, once per overlapping window, with the 3 kb before and after it covered by a series of different windows? (One possible reading is sketched below this list.)
    2. Why eliminate sequences less than 200 bp in length if they are "real"? Many non-coding RNAs or peptides are encoded by short sequences. Does this decision limit your embedding space to sequences > 200 bp long?
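
    For concreteness, here is a minimal sketch of the preprocessing the quoted passage seems to describe. Only the 3000 bp cap and 200 bp cutoff come from the quote; the window stride is a hypothetical choice, since the step size is not reported.

    ```python
    def chunk_sequences(sequences, max_len=3000, min_len=200, stride=1500):
        """Truncate long sequences into overlapping windows; drop short ones.

        max_len and min_len follow the quoted thresholds; stride is an
        assumption -- the manuscript excerpt does not state the step size.
        """
        chunks = []
        for seq in sequences:
            if len(seq) < min_len:
                continue  # sequences shorter than 200 bp are excluded
            if len(seq) <= max_len:
                chunks.append(seq)
                continue
            # Slide a 3000-bp window along the sequence with the assumed
            # stride, so each base may appear in several overlapping windows.
            for start in range(0, len(seq), stride):
                window = seq[start:start + max_len]
                if len(window) >= min_len:  # apply the cutoff to tail windows
                    chunks.append(window)
        return chunks
    ```
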
  2. 91.7 million (M) nucleotide sequences

    It would be super helpful at this point to break down a bit more the type of data this model was trained on -- genomes, transcriptomes, things in GenBank, etc.

  3. Existing models, such as Enformer and TIGER 7,8, have made contributions to specific genome tasks.

    Would it make sense to include citations to Nucleotide Transformer and DNABERT as well? Unless I'm misunderstanding the application space, these seem like seminal efforts in this area.