Genomic Data Classification via Universal Compression
Abstract
Efficient and accurate DNA sequence classification is a crucial task in genomic data analysis. In this work, we construct a lightweight DNA classifier based on the LZ78 lossless universal compressor and optimize its performance through hyperparameter tuning. This classifier outperforms the state-of-the-art DNABERT-2 model on the Genome Understanding Evaluation (GUE) suite while drastically reducing computational costs. Unlike DNABERT-2, which requires two weeks of multi-GPU training, our classifier can be trained in 30 minutes or less on a modern CPU with a fraction of the training data. It also offers up to a 128× speedup in inference time. These results highlight the potential of LZ78 for scalable and efficient genomics classification, particularly in resource-constrained environments. Additionally, we open-source our pipeline for compression-based classification. Future work aims to enhance its robustness and extend its applicability to more complex genomic tasks.
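To make the general idea of compression-based classification concrete, the following is a minimal sketch and not the pipeline described in the abstract. It assumes one LZ78 phrase tree is built per class from that class's training sequences, and that a test sequence is assigned to the class whose frozen tree parses it into the fewest phrases. The phrase count is used here only as a simple stand-in for compressed length; the function names (build_lz78_trie, count_phrases, classify) and the toy data are hypothetical.

```python
def build_lz78_trie(sequences):
    """Build an LZ78 phrase trie (nested dicts) from training sequences.

    Illustrative assumption: one trie per class, with parsing restarted
    at the root at the start of every training sequence.
    """
    trie = {}
    for seq in sequences:
        node = trie
        for sym in seq:
            if sym in node:
                node = node[sym]      # extend the current phrase
            else:
                node[sym] = {}        # new phrase ends with this symbol
                node = trie           # LZ78 restarts at the root
    return trie


def count_phrases(trie, seq):
    """Parse seq against a frozen trie and return the number of phrases.

    Fewer phrases means the sequence is better explained by the trie.
    """
    node = trie
    phrases = 0
    for sym in seq:
        if sym in node:
            node = node[sym]
        else:
            phrases += 1              # phrase ends at the unseen symbol
            node = trie
    if node is not trie:
        phrases += 1                  # count a trailing partial phrase
    return phrases


def classify(models, seq):
    """Return the label whose trie parses seq into the fewest phrases."""
    return min(models, key=lambda label: count_phrases(models[label], seq))


# Toy data (hypothetical): two classes with very different composition.
training = {
    "AT": ["ATATATATATATATAT", "TATATATATATA"],
    "GC": ["GCGCGCGCGCGC", "CGCGCGCGCGCG"],
}
models = {label: build_lz78_trie(seqs) for label, seqs in training.items()}
```

Calling classify(models, "ATATATAT") on the toy data selects the "AT" model, because that trie already contains long matching phrases while the "GC" trie has to start a new phrase at every symbol. A fuller system would likely score sequences with a proper code length or log loss rather than a raw phrase count, and would expose tunable choices such as how training data is ordered and how many passes are made over it.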