BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences

Noam Teyssier
Alexander Dobin

Abstract

Modern genomics produces billions of sequencing records per run, which are typically stored as gzip-compressed FASTQ files. While this format is widely used, it is not optimal for high-throughput processing because it relies on single-threaded decompression and sequential parsing of irregularly sized records. This limitation is particularly problematic for applications that would benefit from parallel processing, such as read mapping, variant calling, and de novo assembly. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. The BINSEQ family consists of two complementary implementations: BINSEQ, optimized for fixed-length reads using a two-bit encoding scheme with true random record access capability, and VBINSEQ, designed for variable-length sequences with optional quality scores and block-based organization. We demonstrate that parallel processing of BINSEQ files is up to 32x faster than that of compressed FASTQ and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly. To facilitate adoption, we provide high-performance libraries for reading and writing BINSEQ formats, native parallelization strategies with convenient APIs, and a command-line tool for conversion to and from traditional formats.
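
To make the idea of two-bit encoding with random record access concrete, the following is a minimal Python sketch. It is not the BINSEQ file layout or its library API: the function names, the bit ordering within each byte, the A/C/G/T code assignment, and the header-less concatenation of records are all assumptions made for illustration. It only shows why fixed-length, fixed-width records allow any record to be located by arithmetic on its index rather than by scanning the file.

```python
# Illustrative sketch only: NOT the actual BINSEQ format or API.
# Assumed for this example: codes A=0, C=1, G=2, T=3, packed four bases
# per byte starting from the least-significant bits, with no file header.

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def encode_2bit(seq: str) -> bytes:
    """Pack a nucleotide string into 2 bits per base (4 bases per byte).

    Raises KeyError for any symbol other than A, C, G, T (e.g. N);
    how such symbols should be handled is left out of this sketch.
    """
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(out)


def decode_2bit(buf: bytes, n_bases: int) -> str:
    """Unpack the first n_bases bases from a 2-bit packed buffer."""
    return "".join(
        _BASE[(buf[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


def record_size(read_len: int) -> int:
    """Bytes occupied by one packed record of read_len bases."""
    return (read_len + 3) // 4


def get_record(data: bytes, index: int, read_len: int) -> str:
    """Fetch record number `index` directly by computing its byte offset."""
    size = record_size(read_len)
    start = index * size
    return decode_2bit(data[start:start + size], read_len)


# Three hypothetical fixed-length reads, packed back to back.
reads = ["ACGTACGT", "TTTTCCCC", "GATTACAG"]
data = b"".join(encode_2bit(r) for r in reads)

# Any record can be read without touching the ones before it.
third = get_record(data, 2, read_len=8)
```

Because every record occupies the same number of bytes, independent workers could each be handed a disjoint range of record indices and start decoding immediately, which is the property that sequentially parsed, gzip-compressed FASTQ lacks.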

Version published to 10.1101/2025.04.08.647863v1 on bioRxiv
Apr 15, 2025

Related articles

polars-bio – fast, scalable and out-of-core operations on large genomic interval datasets

This article has 4 authors:
1. Marek Wiewiórka
2. Pavel Khamutou
3. Marek Zbysiński
4. Tomasz Gambin
This article has no evaluations
Latest version Mar 25, 2025

Columba: Fast Approximate Pattern Matching with Optimized Search Schemes

This article has 4 authors:
1. Luca Renders
2. Lore Depuydt
3. Travis Gagie
4. Jan Fostier
This article has no evaluations
Latest version Mar 31, 2025

cONcat: Computational reconstruction of concatenated fragments from long Oxford Nanopore reads

This article has 5 authors:
1. Alexander J. Petri
2. Mai Thi-Huyen Nguyen
3. Anjali Rajwar
4. Erik Benson
5. Kristoffer Sahlin
This article has no evaluations
Latest version Mar 14, 2025
