Optimizing Genomic Data Compression with Genetic Algorithms

Abstract

The rapid growth of genomic data, especially in clinical settings, has highlighted the need for efficient lossless compression. Preserving full sequence information is critical, as even small data losses can affect analyses or diagnoses. As datasets grow in size and complexity, optimized compression becomes vital for storage, transmission, and processing. Fixed-parameter compression tools often fall short across diverse genomes, which motivates adaptive tuning strategies tailored to each dataset. In this paper, we investigate the use of metaheuristic search algorithms, particularly Genetic Algorithms (GAs), to optimize parameter configurations for JARVIS3, a genomic sequence compression tool. Because JARVIS3's parameter space is large and complex, exhaustive search is infeasible. To address this, we developed OptimJV3, a modular framework that integrates multiple GA variants, including a multi-objective approach that also considers computational time. We evaluated OptimJV3 on several genomic datasets in FASTA format, namely Human Chromosome Y (CY), Cassava, and the full Human Genome (HG), and observed notable compression gains. For large sequences such as Cassava and HG, we applied sampling-based optimization: optimal parameters were first identified on smaller segments and then used to compress the full sequence. These small segments proved effective for guiding parameter tuning. Further improvements were achieved by increasing the number of GA generations, especially for HG. For example, using a 10 MB sample and 500 generations, OptimJV3 reached a compression rate of 1.431 bits per base on HG, among the best results reported so far.
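
The sampling-based tuning loop described in the abstract can be illustrated with a minimal sketch. It is not OptimJV3 and does not call JARVIS3: Python's standard `zlib` stands in for the compressor, its four settings stand in for the tool's parameter space, and the synthetic sequence, the function names, and the GA settings (tournament selection, uniform crossover, elitism) are all assumptions chosen for illustration.

```python
import random
import zlib

# Hypothetical search space: zlib settings stand in for a real tool's parameters.
SPACE = {
    "level": list(range(1, 10)),
    "wbits": list(range(9, 16)),
    "mem_level": list(range(1, 10)),
    "strategy": [zlib.Z_DEFAULT_STRATEGY, zlib.Z_FILTERED, zlib.Z_HUFFMAN_ONLY],
}

def bits_per_base(seq: bytes, p: dict) -> float:
    """Fitness: compressed size in bits divided by the number of bases."""
    comp = zlib.compressobj(p["level"], zlib.DEFLATED, p["wbits"],
                            p["mem_level"], p["strategy"])
    return 8 * len(comp.compress(seq) + comp.flush()) / len(seq)

def random_individual(rng: random.Random) -> dict:
    return {k: rng.choice(v) for k, v in SPACE.items()}

def crossover(a: dict, b: dict, rng: random.Random) -> dict:
    # Uniform crossover: each parameter is inherited from either parent.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}

def mutate(ind: dict, rng: random.Random, rate: float = 0.2) -> dict:
    # Each parameter is resampled from its range with probability `rate`.
    return {k: (rng.choice(SPACE[k]) if rng.random() < rate else v)
            for k, v in ind.items()}

def tournament(pop, scores, rng, k=3):
    # Pick k random individuals and return the one with the lowest score.
    picks = rng.sample(range(len(pop)), k)
    return pop[min(picks, key=lambda i: scores[i])]

def optimize(sample: bytes, generations=15, pop_size=12, seed=0):
    """Single-objective GA minimizing bits per base on a sample."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    best, best_score = None, float("inf")
    for _ in range(generations):
        scores = [bits_per_base(sample, ind) for ind in pop]
        i = min(range(pop_size), key=lambda j: scores[j])
        if scores[i] < best_score:
            best, best_score = dict(pop[i]), scores[i]
        children = [dict(best)]  # elitism: carry the best configuration over
        while len(children) < pop_size:
            child = crossover(tournament(pop, scores, rng),
                              tournament(pop, scores, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    return best, best_score

def synthetic_dna(n: int, seed: int = 1) -> bytes:
    """Toy sequence mixing a repeated motif with random bases."""
    rng = random.Random(seed)
    motif = [rng.choice("ACGT") for _ in range(64)]
    bases = []
    while len(bases) < n:
        bases.extend(motif if rng.random() < 0.3 else rng.choices("ACGT", k=64))
    return "".join(bases[:n]).encode("ascii")

if __name__ == "__main__":
    full = synthetic_dna(200_000)
    sample = full[:20_000]            # tune on a small segment only
    params, sample_bpb = optimize(sample)
    print("tuned on sample:", params, round(sample_bpb, 3))
    print("full sequence bpb:", round(bits_per_base(full, params), 3))
```

The design choice worth noting is that the expensive fitness evaluation runs only on the small segment, while the full sequence is compressed once with the winning configuration. A multi-objective variant would additionally measure compression time per evaluation and rank individuals on both values, which this sketch omits.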
