Genomic Tokenizer: Toward a biology-driven tokenization in transformer models for DNA sequences
Abstract
Summary
Transformer models are revolutionizing sequence analysis across various domains, from natural language processing to genomics. These models rely on tokenizers to split input sequences into manageable chunks — a straightforward task in natural language but more challenging for long DNA sequences that lack distinct “words.” Most biological tokenizers are data-driven and do not align with the “central dogma of molecular biology”: DNA is transcribed into RNA, which is then translated into proteins, with each three-letter codon specifying a particular amino acid; multiple synonymous codons can encode the same amino acid. Start codons signal the beginning of protein synthesis, while stop codons signal its termination. The Genomic Tokenizer (GT) incorporates this biological process flow into a standard tokenizer interface within the HuggingFace transformers package. GT can be used to pre-train foundational transformer models on DNA sequences. We compare the performance of GT with two alternative tokenization strategies and discuss its potential applications.
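The codon-based splitting described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not GT's actual implementation (GT exposes a full HuggingFace-style tokenizer interface); the function name `codon_tokenize` is hypothetical.

```python
# Minimal sketch of codon-aware tokenization: split a DNA sequence into
# non-overlapping three-letter codons, the units that the central dogma
# maps to amino acids. Hypothetical helper, not GT's real API.

START_CODON = "ATG"                      # signals the start of translation
STOP_CODONS = {"TAA", "TAG", "TGA"}      # signal the end of translation

def codon_tokenize(seq: str) -> list[str]:
    """Split a DNA sequence into 3-base codon tokens, reading frame 0.

    Any trailing partial codon (1-2 bases) is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

tokens = codon_tokenize("ATGGCCTAA")
print(tokens)  # → ['ATG', 'GCC', 'TAA']
assert tokens[0] == START_CODON and tokens[-1] in STOP_CODONS
```

In a real pipeline, each codon token would then be mapped to a vocabulary ID (at most 4³ = 64 codons plus special tokens), which is what makes a biology-driven vocabulary far smaller than data-driven subword vocabularies.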
Availability and implementation
The source code of GT is available at https://github.com/dermatologist/genomic-tokenizer under the MPL-2.0 license. It can be installed from the Python Package Index (PyPI) and used as a tokenizer in transformer model training pipelines.