OmniNA: A foundation model for nucleotide sequences
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Foundation models have demonstrated exceptional efficacy across diverse downstream tasks. However, within the realms of genomics and transcriptomics, a notable gap persists in the availability of models that afford a comprehensive understanding of nucleotide sequence principles across various species. Here, we present OmniNA, a generative foundation model designed for comprehensive nucleotide sequence learning. The model was pre-trained on 91.7 million nucleotide sequences and the corresponding annotations, encompassing 1076.2 billion bases and 197 million words spanning a multitude of species. By analyzing the learned representations of the pre-trained model, we demonstrated that OmniNA gains the capacity to understand the semantics of nucleotide sequences and their textual annotations. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural language paradigms. We demonstrate that OmniNA-1.7B surpasses or rivals state-of-the-art methods on 17 nucleotide tasks, encompassing nucleotide sequence detection and species classification. The model's understanding of nucleotide grammar enhances its capability to reveal the effects of mutations on DNA and RNA processing. We hereby release the OmniNA-1.7B model as an open-source contribution to the research community. This foundation model signifies a step toward advancing our comprehension of nucleotide sequences across diverse species and holds substantial promise for facilitating genomics and transcriptomics research.
Article activity feed
> The data for pre-training is openly available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/.

Can you provide the date on which you accessed this data?
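The request could be made concrete at download time. Below is a minimal Python sketch (an assumption about tooling, not the authors' pipeline) that fetches the directory listing over HTTPS and logs the access date alongside it:

```python
# Minimal sketch (not from the paper): retrieve the FASTA directory
# listing over HTTPS and log the access date, so the retrieval date can
# be reported with the data. Individual files shown in the listing
# (e.g. nt.gz) would be downloaded the same way via urlretrieve.
import datetime
import urllib.request

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/"

accessed = datetime.date.today().isoformat()
with urllib.request.urlopen(BASE_URL) as response:
    listing = response.read().decode("utf-8", errors="replace")

print(f"Accessed {BASE_URL} on {accessed}")
print(listing[:500])  # first part of the HTML index page
```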
> Employing a sliding window approach, we truncate sequences longer than 3000 bp to 3000 bp. Sequences with length less than 200 bp are excluded.
I'm curious about the method employed here and the rationale behind it.

- What was the size of the sliding window, and what stride was used? Does that mean a nucleotide at position X in a genome would be captured many times, once in each of a series of overlapping windows spanning the 3 kb before and after it? (See the sketch after this list.)
- Why eliminate sequences less than 200 bp in length if they are "real"? Many non-coding RNAs or peptides are encoded by short sequences. Does this decision limit your embedding space to sequences > 200 bp long?
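To make the first question concrete, here is a minimal Python sketch of the two possible readings. WINDOW and MIN_LEN come from the quoted text; STRIDE and the helper `chunk_sequence` are hypothetical, since the paper reports neither a stride nor its chunking code:

```python
# Minimal sketch (assumptions, not the authors' code) of two readings of
# "sliding window" truncation: with STRIDE < WINDOW a base appears in
# several overlapping chunks; with STRIDE == WINDOW chunks are disjoint
# and each base appears once.
WINDOW = 3000   # stated maximum sequence length (bp)
MIN_LEN = 200   # stated minimum; shorter sequences are excluded
STRIDE = 1500   # hypothetical stride; the paper does not report one

def chunk_sequence(seq: str, window: int = WINDOW, stride: int = STRIDE,
                   min_len: int = MIN_LEN) -> list[str]:
    """Split one nucleotide sequence into windows of at most `window` bp."""
    if len(seq) < min_len:
        return []      # drop short sequences entirely
    if len(seq) <= window:
        return [seq]   # short enough to keep whole
    chunks = [seq[i:i + window] for i in range(0, len(seq), stride)]
    return [c for c in chunks if len(c) >= min_len]

# Example: a 7000 bp sequence yields overlapping 3000 bp windows.
print([len(c) for c in chunk_sequence("ACGT" * 1750)])
# -> [3000, 3000, 3000, 2500, 1000]
```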
> 91.7 million (M) nucleotide sequences
It would be super helpful at this point to break down in a bit more detail the type of data this model was trained on -- genomes, transcriptomes, things in GenBank, etc.
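To illustrate the kind of breakdown being requested, here is a hypothetical Python sketch that tallies FASTA records by a coarse category parsed from their description lines; the keyword rules and the `tally_fasta` helper are illustrative only, not taken from the paper:

```python
# Hypothetical sketch: count FASTA records per coarse source category by
# scanning description lines. Real GenBank-style headers would need more
# careful parsing than these illustrative keyword rules.
from collections import Counter

def classify_header(description: str) -> str:
    d = description.lower()
    if "mrna" in d or "transcript" in d:
        return "transcript"
    if "chromosome" in d or "genome" in d:
        return "genomic"
    return "other"

def tally_fasta(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):   # FASTA description line
                counts[classify_header(line[1:])] += 1
    return counts

# e.g. tally_fasta("nt.fasta") might yield
# Counter({"genomic": ..., "transcript": ..., "other": ...})
```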
> Existing models, such as Enformer and TIGER 7,8, have made contributions to specific genome tasks.
Would it make sense to include citations to Nucleotide Transformer and DNABERT as well? Unless I'm misunderstanding the application space, these seem like seminal efforts in this area.