Genomic Tokenizer: Toward a biology-driven tokenization in transformer models for DNA sequences
Abstract
Summary
Transformer models are revolutionizing sequence analysis across various domains, from natural language processing to genomics. These models rely on tokenizers to split input sequences into manageable chunks — a straightforward task in natural language but more challenging for long DNA sequences that lack distinct “words.” Most biological tokenizers are data-driven and do not align with the “central dogma of molecular biology”: DNA is transcribed into RNA, which is then translated into proteins, with each three-letter codon specifying a particular amino acid; multiple synonymous codons can encode the same amino acid. Start codons signal the beginning of protein synthesis, while stop codons signal its termination. The Genomic Tokenizer (GT) incorporates this biological process flow into a standard tokenizer interface within the HuggingFace transformers package. GT can be used to pre-train foundational transformer models on DNA sequences. We compare the performance of GT with two alternative tokenization strategies and discuss its potential applications.
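The codon-based splitting described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not GT's actual implementation (GT exposes a full HuggingFace-style tokenizer interface); the function name `codon_tokenize` is hypothetical.

```python
# Minimal sketch of codon-aware tokenization: split a DNA sequence into
# non-overlapping three-letter codons, the units that the central dogma
# maps to amino acids. Hypothetical helper, not GT's real API.

START_CODON = "ATG"                      # signals the start of translation
STOP_CODONS = {"TAA", "TAG", "TGA"}      # signal the end of translation

def codon_tokenize(seq: str) -> list[str]:
    """Split a DNA sequence into 3-base codon tokens, reading frame 0.

    Any trailing partial codon (1-2 bases) is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

tokens = codon_tokenize("ATGGCCTAA")
print(tokens)  # → ['ATG', 'GCC', 'TAA']
assert tokens[0] == START_CODON and tokens[-1] in STOP_CODONS
```

In a real pipeline, each codon token would then be mapped to a vocabulary ID (at most 4³ = 64 codons plus special tokens), which is what makes a biology-driven vocabulary far smaller than data-driven subword vocabularies.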
Availability and implementation
The source code of GT is available at https://github.com/dermatologist/genomic-tokenizer under the MPL-2.0 license. It can be installed from the Python Package Index (PyPI) and used as a tokenizer in transformer model training pipelines.