Genomic Data Classification via Universal Compression
Abstract
Efficient and accurate DNA sequence classification is a crucial task in genomic data analysis. In this work, we construct a lightweight DNA classifier based on the LZ78 lossless universal compressor and optimize its performance through hyperparameter tuning. This classifier outperforms DNABERT-2, a state-of-the-art, parameter-efficient genomic language model (gLM), on the Genome Understanding Evaluation (GUE) suite while drastically reducing computational costs. Unlike DNABERT-2, which requires two weeks of multi-GPU training, our classifier can be trained in about 30 minutes or less on a modern CPU with a fraction of the training data, and its inference is up to 128× faster. Across GUE, Genomic Benchmarks, BEND, DART-Eval, and GUE+, the classifier is competitive on a broad range of tasks and consistently surpasses leading gLMs by large margins on the challenging Epigenetic Mark Prediction (EMP) tasks. These results establish our LZ78-based classifier as a fast, data-frugal, CPU-only alternative for composition-driven genomic classification, one that complements gLMs and reserves their capacity for sparse, position-specific, motif-dominated tasks. We also open-source our pipeline for compression-based classification. Future work aims to enhance its robustness and extend its applicability to more complex genomic tasks.
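
The abstract does not describe how the LZ78-based classifier scores a sequence, so the following is only a minimal sketch of one generic form of compression-based classification, not the authors' pipeline. It assumes one LZ78 phrase trie is grown per class from that class's training sequences, and that a test sequence is assigned to the class whose frozen trie parses it into the fewest phrases (that is, compresses it best). The function names `build_lz78_trie`, `phrase_count`, and `classify`, the scoring rule, and the toy class labels are all hypothetical choices made for illustration.

```python
def build_lz78_trie(sequences):
    """Grow one LZ78 phrase trie from all training sequences of a class.

    Assumption: the trie is shared across the class's sequences, and each
    sequence starts parsing from the root.
    """
    trie = {}  # nested dicts: each node maps a symbol to its child node
    for seq in sequences:
        node = trie
        for symbol in seq:
            if symbol in node:
                node = node[symbol]   # keep extending the current phrase
            else:
                node[symbol] = {}     # unseen extension: record a new phrase
                node = trie           # and restart parsing from the root
    return trie


def phrase_count(trie, seq):
    """Parse seq against a frozen trie and return the number of phrases.

    Fewer phrases means the class model compresses the sequence better.
    """
    count, node = 0, trie
    for symbol in seq:
        if symbol in node:
            node = node[symbol]
        else:
            count += 1                # phrase ends at the first unseen symbol
            node = trie
    if node is not trie:
        count += 1                    # count a trailing, unfinished phrase
    return count


def classify(tries, seq):
    """Return the label whose trie yields the fewest phrases (hypothetical rule)."""
    return min(tries, key=lambda label: phrase_count(tries[label], seq))


# Toy usage with made-up, composition-driven classes.
tries = {
    "at_rich": build_lz78_trie(["AT" * 10]),
    "gc_rich": build_lz78_trie(["GC" * 10]),
}
print(classify(tries, "ATATATAT"))
print(classify(tries, "GCGCGCGC"))
```

Training here is a single pass over the data with dictionary lookups only, which is consistent in spirit with the CPU-only, data-frugal claims above. A practical classifier would likely also need to handle class imbalance, sequence-length normalization, and the hyperparameters the abstract mentions, none of which this sketch attempts.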