BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences

Abstract

Modern genomics produces billions of sequencing records per run, typically stored as gzip-compressed FASTQ files. Although widely used, this format is poorly suited to high-throughput processing because it relies on single-threaded decompression and sequential parsing of irregularly sized records. This limitation is particularly problematic for applications that would benefit from parallel processing, such as read mapping, variant calling, and de novo assembly. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. The BINSEQ family consists of two complementary implementations: BINSEQ, optimized for fixed-length reads using a two-bit encoding scheme with true random record access, and VBINSEQ, designed for variable-length sequences with optional quality scores and block-based organization. We demonstrate that parallel processing of BINSEQ files is up to 32x faster than that of compressed FASTQ and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly. To facilitate adoption, we provide high-performance libraries for reading and writing BINSEQ formats, native parallelization strategies with convenient APIs, and a command-line tool for conversion to and from traditional formats.
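
To make the idea of two-bit encoding with random record access concrete, the following is a minimal Python sketch. It is not the BINSEQ specification or library API: the code assignment (A=0, C=1, G=2, T=3), the bit order within a byte, the absence of a file header, and the helper names `encode`, `decode`, `record_size`, and `get_record` are all assumptions chosen for illustration.

```python
# Illustrative only: code assignment, bit order, and record layout are
# assumptions for this sketch, not the BINSEQ specification.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq: str) -> bytes:
    """Pack a nucleotide string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Raises KeyError for anything outside A/C/G/T (e.g. N).
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def decode(buf: bytes, length: int) -> str:
    """Unpack `length` bases from a 2-bit packed buffer."""
    return "".join(
        BASES[(buf[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


def record_size(read_len: int) -> int:
    """Bytes occupied by one fixed-length record in this sketch."""
    return (read_len + 3) // 4


def get_record(data: bytes, index: int, read_len: int) -> str:
    """Fetch record `index` directly by computing its byte offset."""
    size = record_size(read_len)
    start = index * size
    return decode(data[start:start + size], read_len)


# Three fixed-length reads packed back to back, with no header.
reads = ["ACGTAC", "TTGACA", "GGGCCC"]
packed = b"".join(encode(r) for r in reads)
third = get_record(packed, 2, read_len=6)
```

Because every record occupies the same number of bytes, the offset of record `index` is a simple multiplication, so no earlier record has to be parsed to reach it. The same property is what would let independent workers each take a disjoint range of record indices.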
