Genomic Data Classification via Universal Compression

Yasmine Omri
Naomi Sagan
Eugene Min
Heewoong Choi
Taesup Moon
Tsachy Weissman

Abstract

Efficient and accurate DNA sequence classification is a crucial task in genomic data analysis. In this work, we construct a lightweight DNA classifier based on the LZ78 lossless universal compressor and optimize its performance through hyperparameter tuning. This classifier outperforms DNABERT-2, a state-of-the-art, parameter-efficient genomic language model (gLM), on the Genome Understanding Evaluation (GUE) suite while drastically reducing computational costs. Unlike DNABERT-2, which requires two weeks of multi-GPU training, our classifier can be trained in about 30 minutes or less on a modern CPU with a fraction of the training data, and it offers up to a 128× inference-time speedup. Across GUE, Genomic Benchmarks, BEND, DART-Eval, and GUE+, this classifier is competitive on a broad range of tasks and consistently surpasses leading gLMs by large margins on the challenging Epigenetic Mark Prediction (EMP) tasks. We establish that our LZ78-based classifier provides a fast, data-frugal, CPU-only alternative for composition-driven genomic classification, complementing gLMs and reserving their capacity for sparse, position-specific, motif-dominated tasks. Additionally, we open-source our pipeline for compression-based classification. Future work aims to enhance its robustness and extend its applicability to more complex genomic tasks.
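
As a rough illustration of how a compression-based classifier of this kind can work, the sketch below grows one LZ78 phrase trie per class from training sequences and assigns a query to the class whose frozen trie parses it into the fewest phrases. It is not the authors' pipeline: the function names (build_lz78_trie, phrase_count, classify), the phrase-count scoring rule, and the toy sequences are all assumptions made for illustration, and the hyperparameter tuning mentioned in the abstract is not modeled.

```python
def build_lz78_trie(sequences):
    """Grow an LZ78 phrase trie (nested dicts) over the training sequences."""
    root = {}
    for seq in sequences:
        node = root
        for symbol in seq:
            if symbol in node:
                node = node[symbol]   # keep extending the current phrase
            else:
                node[symbol] = {}     # unseen extension: store a new phrase
                node = root           # and restart parsing from the root
    return root


def phrase_count(trie, seq):
    """Count the phrases needed to parse seq against a frozen trie.

    Fewer phrases means the class dictionary "explains" the sequence better.
    The trie is not updated while scoring (an assumption of this sketch).
    """
    count = 0
    node = trie
    for symbol in seq:
        if symbol in node:
            node = node[symbol]
        else:
            count += 1                # phrase ends at the first unseen symbol
            node = trie
    if node is not trie:
        count += 1                    # unfinished phrase at the end of seq
    return count


def classify(tries, seq):
    """Return the label whose trie yields the smallest phrase count."""
    return min(tries, key=lambda label: phrase_count(tries[label], seq))


# Toy usage with made-up, composition-driven classes (illustrative only).
tries = {
    "at_rich": build_lz78_trie(["AT" * 32]),
    "gc_rich": build_lz78_trie(["GC" * 32]),
}
print(classify(tries, "ATATATAT"))
```

Both training and scoring in this sketch take a single pass over each sequence with dictionary lookups only, which gives some intuition for why an approach of this kind can run on a CPU without specialized hardware.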

Version published to 10.21203/rs.3.rs-6363017/v2 on Research Square
Mar 3, 2026
Version published to 10.21203/rs.3.rs-6363017/v1 on Research Square
Apr 9, 2025

Related articles

Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors

This article has 2 authors:
1. Nathalie Gocht
2. Alexander Schliep
This article has no evaluations. Latest version Feb 26, 2026

Spark4VCF: A Novel Big Data Framework to Accelerate Genomics Analysis

This article has 10 authors:
1. Vinh Chi Duong
2. Thien Khac Nguyen
3. Giang Minh Vu
4. Sang Van Nguyen
5. Quan Nguyen
6. Vu Hoang Pham
7. Cuong Dinh Le
8. Toan Dang Dao
9. Nam Sy Vo
10. Tham Hong Hoang
This article has no evaluations. Latest version Mar 2, 2026

Oncogene and Tumor Suppressor Gene Classification Using Protein Language Model Embeddings and a Novel Optimization Strategy

This article has 1 author:
1. Ahmet Emir Şaşmazlar
This article has no evaluations. Latest version Mar 11, 2026