Genomic Data Classification via Universal Compression
Abstract
Efficient and accurate DNA sequence classification is a crucial task in genomic data analysis. In this work, we construct a lightweight DNA classifier based on the LZ78 lossless universal compressor and optimize its performance through hyperparameter tuning. This classifier outperforms the state-of-the-art DNABERT-2 model on the Genome Understanding Evaluation (GUE) suite while drastically reducing computational costs. Unlike DNABERT-2, which requires two weeks of multi-GPU training, our classifier can be trained in 30 minutes or less on a modern CPU, using a fraction of the training data. It also offers an inference-time speedup of up to 128×. These results highlight the potential of LZ78 for scalable and efficient genomic classification, particularly in resource-constrained environments. Additionally, we open-source our pipeline for compression-based classification. Future work aims to enhance its robustness and extend its applicability to more complex genomic tasks.
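As a rough illustration of compression-based classification, the sketch below builds one LZ78 phrase tree per class and assigns a query sequence to the class whose frozen tree parses it into the fewest phrases. The function names, the phrase-count score, and the toy sequences are assumptions made for illustration only; they are not taken from the open-sourced pipeline, which may score sequences differently (for example, with a probabilistic log-loss).

```python
def build_lz78_trie(sequences):
    """Build an LZ78 phrase trie (nested dicts) from a class's training sequences."""
    root = {}
    for seq in sequences:
        node = root
        for symbol in seq:
            if symbol in node:
                node = node[symbol]   # keep extending the current phrase
            else:
                node[symbol] = {}     # unseen extension: record a new phrase
                node = root           # and start the next phrase at the root
    return root


def count_phrases(trie, seq):
    """Parse seq against a frozen trie; fewer phrases means better compression."""
    phrases = 0
    node = trie
    for symbol in seq:
        if symbol in node:
            node = node[symbol]
        else:
            phrases += 1              # phrase ends with this unmatched symbol
            node = trie
    if node is not trie:
        phrases += 1                  # count the unfinished final phrase
    return phrases


def classify(tries, seq):
    """Return the label whose trie parses seq into the fewest phrases."""
    return min(tries, key=lambda label: count_phrases(tries[label], seq))


# Toy, made-up training data (hypothetical labels and sequences).
training = {
    "at_rich": ["ATATATATATATATAT", "AATTAATTAATTAATT"],
    "gc_rich": ["GCGCGCGCGCGCGCGC", "GGCCGGCCGGCCGGCC"],
}
tries = {label: build_lz78_trie(seqs) for label, seqs in training.items()}

print(classify(tries, "ATATATAT"))
print(classify(tries, "GCGCGCGC"))
```

Training here is a single pass over each class's sequences and inference is one tree walk per class, which is the kind of CPU-only workload the abstract describes. A practical classifier would also need tie-breaking and the tuned hyperparameters that this sketch omits.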