Improved pangenomic classification accuracy with chain statistics

Nathaniel K. Brown
Vikram S. Shivakumar
Ben Langmead

Abstract

Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches, but these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity (“chain”) statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be “tunneled,” and mark these with the corresponding multi-MUM identifiers. This yields an O(r + n/d)-space index for a collection of d sequences having a length-n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.
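
To make two quantities in the abstract concrete, here is a minimal Python sketch. It is not the authors' index or algorithm: it computes matching statistics by brute-force substring search (quadratic or worse, unlike the linear-time method described above) and counts the maximal equal-character runs r of a BWT built by naively sorting rotations. The function names `matching_statistics`, `bwt`, and `count_runs`, and the `$` terminator, are assumptions made for illustration only.

```python
from itertools import groupby


def matching_statistics(query, references):
    """For each query position i, return the length of the longest prefix
    of query[i:] that occurs as a substring of any reference string.
    Brute force, for illustration only."""
    stats = []
    for i in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs
        # somewhere in the reference collection.
        while i + length < len(query) and any(
            query[i:i + length + 1] in ref for ref in references
        ):
            length += 1
        stats.append(length)
    return stats


def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only).
    Assumes the terminator does not occur in text and sorts before
    every other character."""
    s = text + terminator
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal equal-character runs (the quantity r)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    print(matching_statistics("ACGT", ["ACG"]))  # [3, 2, 1, 0]
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))  # annb$aa 5
```

On repetitive collections such as pangenomes, r is typically much smaller than n, which is why a space bound stated in terms of r is attractive. The chain statistics themselves depend on the multi-MUM and tunneling machinery described in the full article and are not reproduced here.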

Version published to 10.1101/2024.10.29.620953v1 on bioRxiv
Nov 2, 2024

PARMIK: PArtial Read Matching with Inexpensive K-mers

This article has 3 authors:
1. Morteza Baradaran
2. Ryan M Layer
3. Kevin Skadron
This article has no evaluations
Latest version Oct 17, 2024

FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)

This article has 3 authors:
1. Ondřej Sladký
2. Pavel Veselý
3. Karel Břinda
This article has no evaluations
Latest version Nov 3, 2024

Integer programming framework for pangenome-based genome inference

This article has 6 authors:
1. Ghanshyam Chandra
2. Md Helal Hossen
3. Stephan Scholz
4. Alexander T Dilthey
5. Daniel Gibney
6. Chirag Jain
Reviewed by Arcadia Science

This article has 2 evaluations
Appears in 1 list
Latest version Oct 29, 2024
Latest activity Nov 1, 2024
