PatchDNA: A Flexible and Biologically-Informed Alternative to Tokenization for DNA

Alice Del Vecchio
Chantriolnt-Andreas Kapourani
Abdullah M. Athar
Agnieszka Dobrowolska
Andrew Anighoro
Benjamin Tenmann
Lindsay Edwards
Cristian Regep

Abstract

DNA language models are emerging as powerful tools for representing genomic sequences, with recent progress driven by self-supervised learning. However, performance on downstream tasks is sensitive to the tokenization strategy, reflecting the complex encodings in DNA, where both regulatory elements and single-nucleotide changes can be functionally significant. Yet existing models are fixed to their initial tokenization strategy: single-nucleotide encodings result in long sequences that challenge transformer architectures, while fixed multi-nucleotide schemes like byte pair encoding struggle with character-level modeling. Drawing inspiration from the Byte Latent Transformer, which combines bytes into patches, we propose that ‘patching’ provides a competitive and more efficient alternative to tokenization for DNA sequences. Furthermore, patching eliminates the need for a fixed vocabulary, which offers unique advantages for DNA. Leveraging this, we propose a biologically informed strategy that uses evolutionary conservation scores as a guide for ‘patch’ boundaries. By prioritizing conserved regions, our approach directs computational resources to the most functionally relevant parts of the DNA sequence. We show that models up to an order of magnitude smaller surpass current state-of-the-art performance on existing DNA benchmarks. Importantly, our approach provides the flexibility to change patching without retraining, overcoming a fundamental limitation of current tokenization methods.
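
To make the idea of conservation-guided patch boundaries concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes per-base conservation scores are already available as a list of floats (the values below are made up), and it assumes one simple rule: bases at or above a threshold become their own single-base patch, while less conserved bases are grouped into longer patches. The function name `conservation_patches` and the parameters `threshold` and `max_patch_len` are hypothetical.

```python
from typing import List, Tuple


def conservation_patches(
    scores: List[float],
    threshold: float = 0.5,
    max_patch_len: int = 8,
) -> List[Tuple[int, int]]:
    """Split positions 0..len(scores)-1 into half-open (start, end) patches.

    Assumed rule (illustrative only): a boundary is placed before and after
    every position whose score is >= threshold, so conserved bases end up in
    single-base patches. Runs of less conserved bases are grouped together,
    up to max_patch_len bases per patch.
    """
    patches: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(scores)):
        conserved = scores[i] >= threshold
        prev_conserved = scores[i - 1] >= threshold
        too_long = (i - start) >= max_patch_len
        if conserved or prev_conserved or too_long:
            patches.append((start, i))
            start = i
    if scores:
        # Close the final patch.
        patches.append((start, len(scores)))
    return patches


# Toy example with invented scores: two conserved bases in the middle.
sequence = "ACGTACG"
toy_scores = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1]
boundaries = conservation_patches(toy_scores, threshold=0.5, max_patch_len=2)
patch_strings = [sequence[s:e] for s, e in boundaries]
print(boundaries)
print(patch_strings)
```

Under this assumed rule, changing `threshold` or `max_patch_len` changes the patching without touching a vocabulary, which is the kind of flexibility the abstract attributes to patching. How the actual model chooses boundaries is described in the full article.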

Version published to 10.1101/2025.11.28.691095 on bioRxiv
Nov 29, 2025

Related articles

Explicit Dynamic Cross-Strand Interactions for DNA Sequence Language Modeling

This article has 12 authors:
1. Xiao Luo
2. Cheng Yang
3. Yuansheng Liu
4. Lei Ling
5. Fengxin Li
6. Changjian Chen
7. Long Wang
8. Feng Yu
9. Liang Qiao
10. Xiangxiang Zeng
11. Kenli Li
12. Alexander Schönhuth
This article has no evaluations
Latest version Jan 8, 2026

In-Context Learning in Genomic Language Models as a Biological Evaluation Task

This article has 2 authors:
1. Aadit Kapoor
2. Wendy Lee
This article has no evaluations
Latest version Dec 9, 2025

DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition

This article has 4 authors:
1. Hua-Lin Xu
2. Xiu-Jun Gong
3. Hua Yu
4. Ying-Kai Wang
This article has no evaluations
Latest version Dec 28, 2025
