BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences

Abstract

Modern genomics produces billions of sequencing records per run, which are typically stored as gzip-compressed FASTQ files. Although this format is widely used, it is poorly suited to high-throughput processing because it relies on single-threaded decompression and sequential parsing of irregularly sized records. This limitation is particularly problematic for applications that would benefit from parallel processing, such as read mapping, variant calling, and de novo assembly. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. The BINSEQ family consists of two complementary implementations: BQ, optimized for fixed-length reads using a two-bit or four-bit encoding scheme with true random record access capability, and VBQ, designed for variable-length sequences with optional quality scores and block-based compression. We demonstrate that parallel processing of BINSEQ files is up to 90x faster than that of compressed FASTQ and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly. To facilitate adoption, we provide high-performance libraries for reading and writing BINSEQ formats, native parallelization strategies with convenient APIs, and a command-line tool for conversion to and from traditional formats.
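
As an illustration of why a two-bit encoding of fixed-length reads permits random record access, the following minimal sketch packs four bases per byte and locates any record by arithmetic alone. It is an assumption-laden toy, not the BQ file layout: the bit order, the absence of a header, and the helper names `pack_2bit`, `unpack_2bit`, `record_size`, and `get_record` are all hypothetical and are not taken from the BINSEQ libraries.

```python
# Toy illustration only: this is NOT the BQ on-disk format.
# Bit order, code assignment, and all function names are assumptions.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a nucleotide string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Place each 2-bit code at its slot within the byte (low bits first).
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(buf: bytes, n: int) -> str:
    """Decode the first n bases from a 2-bit packed buffer."""
    return "".join(BASES[(buf[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def record_size(read_len: int) -> int:
    """Bytes occupied by one packed record of read_len bases."""
    return (read_len + 3) // 4


def get_record(blob: bytes, index: int, read_len: int) -> str:
    """Fetch record `index` directly: every record has the same byte size,
    so its offset is index * record_size, with no scan of earlier records."""
    size = record_size(read_len)
    start = index * size
    return unpack_2bit(blob[start:start + size], read_len)


# Three fixed-length reads stored back to back.
reads = ["ACGTACGT", "TTTTCCCC", "GATTACAG"]
blob = b"".join(pack_2bit(r) for r in reads)
third = get_record(blob, 2, read_len=8)
```

Because each record's offset is computable from its index, independent workers could in principle be assigned disjoint index ranges without coordinating on record boundaries, which is the property that sequential parsing of variable-size text records lacks.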

Author Summary

Modern sequencing technologies routinely generate billions of reads per experiment, yet the methods for storing and accessing these data have not kept pace. Sequencing reads remain predominantly stored in FASTQ, a text-based format designed for far smaller datasets. FASTQ’s sequential parsing requirements and practical need for compression create a fundamental mismatch with modern multi-core architectures, where data access rather than computation has become the primary bottleneck. We address this problem with BINSEQ, a family of binary formats engineered for random access and native parallelization. Systematic benchmarking across applications of varying computational complexity demonstrates that BINSEQ achieves up to 90-fold improvements in data access and maintains substantial advantages in compute-intensive tasks such as genome alignment, reducing runtimes from hours to minutes. We present two complementary implementations: BQ, optimized for simplicity and maximal throughput, and VBQ, designed for flexibility while maintaining high performance. By reconsidering the relationship between storage architecture and parallel processing capabilities, BINSEQ provides a practical solution to a critical infrastructure challenge in high-throughput genomics.
