Sassy: Searching Short DNA Strings in the 2020s

Rick Beeloo
Ragnar Groot Koerkamp

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Motivation

Approximate string matching (ASM) is the problem of finding all occurrences of a pattern P in a text T while allowing up to k errors. ASM was researched extensively around the 1990s, but with the rise of large-scale datasets, focus shifted towards inexact approaches based on seed-chain-extend . These methods often provide large speedups in practice, but do not guarantee finding all matches with ≤ k errors. However, many applications, such as CRISPR off-target detection, require exhaustive results with no false negatives.

Methods

We introduce Sassy, a library and tool for ASM of short (up to ≈1000 bp) patterns in large texts. Sassy builds on earlier tools that use Myers’ bitpacking, such as Edlib. Algorithmically, the two main novelties are to split each sequence into 4 parts that are searched in parallel, and to use bitvectors in the text direction ( horizontally ) rather than the pattern direction ( vertically ). This allows significant speedups for short queries, especially when k is small, as has complexity O ( k⌈n/W⌉ ) when searching random text, where W = 256 is the SIMD width. Practically speaking, Sassy is the only recent index-free tool that is designed specifically for ASM, rather than the more common semi-global alignment. In addition, Sassy supports the IUPAC alphabet, which is essential for primer design and for matching ambiguous bases in assemblies. Separately, we also introduce the concept of overhang cost : a variant of ‘overlap’ alignment where e.g. a suffix of the pattern is matched against a prefix of the text, where each character of the pattern that extends beyond the text incurs a cost of e.g. α = 0.5. This is important when matches are near contig or read ends.

Results

Compared to Edlib, Sassy is 4 × to 15 × faster for sequences up to length 1000, and has throughputs exceeding 2 GB/s, whereas Edlib remains below 130 MB/s. Likewise, Sassy is up to 10 × faster than parasail when searching short strings. Sassy is also readily applicable to biological problems such as CRISPR off-target detection. When searching 61 guide sequences in a human genome, Sassy is 100 × faster than SWOffinder and only slightly slower (for k ≤ 3) than CHOPOFF, for which building its index takes 20 minutes. Sassy also scales well to larger k ∈ {4, 5}, unlike CHOPOFF whose index took over 10 hours to build.

Availibility

Sassy is available as Rust library and binary at https://github.com/RagnarGrootKoerkamp/sassy .

Version published to 10.1101/2025.07.22.666207 on bioRxiv
Jul 26, 2025

FastGA: Fast Genome Alignment

This article has 3 authors:
1. Gene Myers
2. Richard Durbin
3. Chenxi Zhou
This article has no evaluationsLatest version Jun 19, 2025
High-Accuracy, Ultrafast DNA Barcode Identification via Statistical Sketching and Approximate Nearest Neighbor Search

This article has 1 author:
1. Justin Boone
This article has no evaluationsLatest version Jul 17, 2025
Basecalling for DNA Storage

This article has 4 authors:
1. Advait Menon
2. Samira Brunmayr
3. Omer Sella
4. Thomas Heinis
This article has no evaluationsLatest version Jul 31, 2025

Listed in

Abstract

Motivation

Methods

Results

Availibility

Article activity feed

Related articles

FastGA: Fast Genome Alignment

High-Accuracy, Ultrafast DNA Barcode Identification via Statistical Sketching and Approximate Nearest Neighbor Search

Basecalling for DNA Storage