Powerful read processing with matchbox

Jakob Schuster
Kathleen Zeglinski
Lucinda Xiao
Olivia Voulgaris
Sarahi Mendoza Rivera
Stephin J. Vervoort
Matthew E. Ritchie
Quentin Gouil
Michael B. Clark

Abstract

The wide variety of protocols and applications for DNA and RNA sequencing makes flexible read processing tools an important part of sequence analysis. Beyond trimming and demultiplexing, custom read-level processing is commonly required for data exploration, QC and analysis. Existing tools are often task-specific and don’t generalise to new bioinformatic problems. Thus, there is a need for a tool flexible enough to handle the full variety of read processing tasks, and fast and scalable enough to retain high performance on growing sequencing datasets. We introduce matchbox, a read processor that enables fluent manipulation and analysis of FASTA/FASTQ/SAM/BAM files. With a lightweight scripting language designed around error-tolerant pattern-matching, users can write their own matchbox scripts to tackle a wide variety of bioinformatic problems, and incorporate them into existing pipelines and workflows. We demonstrate matchbox’s versatility in a number of contexts: demultiplexing long-read scRNA-seq data with 10X or SPLiT-seq barcodes; restranding RNA-seq reads; assessing CRISPR editing efficiency; and haplotyping macrosatellite repeat regions. matchbox achieves computational performance comparable to existing tools while addressing a broader range of bioinformatic needs, representing a new state of the art in sequence processing. matchbox is implemented in Rust and available open-source at https://github.com/jakob-schuster/matchbox.
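
To make the idea of error-tolerant pattern-matching concrete, here is a minimal sketch in plain Python. It is not matchbox's scripting language and not the authors' implementation: the function names (`hamming`, `find_fuzzy`, `trim_after_adapter`) are hypothetical, and the sketch assumes that only substitutions are tolerated (no insertions or deletions), which may differ from what matchbox supports.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(read: str, pattern: str, max_mismatches: int = 1) -> int:
    """Return the start of the first window of `read` that matches `pattern`
    with at most `max_mismatches` substitutions, or -1 if there is none."""
    k = len(pattern)
    for start in range(len(read) - k + 1):
        if hamming(read[start:start + k], pattern) <= max_mismatches:
            return start
    return -1


def trim_after_adapter(read: str, adapter: str, max_mismatches: int = 1) -> str:
    """Keep the sequence downstream of the adapter (hypothetical helper).
    The read is returned unchanged when the adapter is not found."""
    start = find_fuzzy(read, adapter, max_mismatches)
    return read if start < 0 else read[start + len(adapter):]


if __name__ == "__main__":
    # The adapter occurs in the read with a single substitution (C -> G).
    print(trim_after_adapter("TTACGTAGGATTACA", "ACGTAC"))
```

A sliding-window comparison like this is quadratic in the worst case and is meant only to illustrate the concept; a production tool would typically use a faster approximate-matching algorithm.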

Version published to 10.1101/2025.11.09.685711 on bioRxiv
Nov 11, 2025
