Sassy: Searching Short DNA Strings in the 2020s

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Motivation

Approximate string matching (ASM) is the problem of finding all occurrences of a pattern P in a text T while allowing up to k errors. ASM was researched extensively around the 1990s, but with the rise of large-scale datasets, focus shifted towards inexact approaches based on seed-chain-extend . These methods often provide large speedups in practice, but do not guarantee finding all matches with ≤ k errors. However, many applications, such as CRISPR off-target detection, require exhaustive results with no false negatives.

Methods

We introduce Sassy, a library and tool for ASM of short (up to ≈1000 bp) patterns in large texts. Sassy builds on earlier tools that use Myers’ bitpacking, such as Edlib. Algorithmically, the two main novelties are to split each sequence into 4 parts that are searched in parallel, and to use bitvectors in the text direction ( horizontally ) rather than the pattern direction ( vertically ). This allows significant speedups for short queries, especially when k is small, as has complexity O ( k⌈n/W⌉ ) when searching random text, where W = 256 is the SIMD width. Practically speaking, Sassy is the only recent index-free tool that is designed specifically for ASM, rather than the more common semi-global alignment. In addition, Sassy supports the IUPAC alphabet, which is essential for primer design and for matching ambiguous bases in assemblies. Separately, we also introduce the concept of overhang cost : a variant of ‘overlap’ alignment where e.g. a suffix of the pattern is matched against a prefix of the text, where each character of the pattern that extends beyond the text incurs a cost of e.g. α = 0.5. This is important when matches are near contig or read ends.

Results

Compared to Edlib, Sassy is 4 × to 15 × faster for sequences up to length 1000, and has throughputs exceeding 2 GB/s, whereas Edlib remains below 130 MB/s. Likewise, Sassy is up to 10 × faster than parasail when searching short strings. Sassy is also readily applicable to biological problems such as CRISPR off-target detection. When searching 61 guide sequences in a human genome, Sassy is 100 × faster than SWOffinder and only slightly slower (for k ≤ 3) than CHOPOFF, for which building its index takes 20 minutes. Sassy also scales well to larger k ∈ {4, 5}, unlike CHOPOFF whose index took over 10 hours to build.

Availibility

Sassy is available as Rust library and binary at https://github.com/RagnarGrootKoerkamp/sassy .

Article activity feed