Sequence similarity estimation by random subsequence sketching

Ke Chen
Vinamratha Pattar
Mingfu Shao

Abstract

Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings, or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.
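
The abstract does not spell out how the integer vector is built, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that each entry records how long a prefix of a random test sequence can be matched as a subsequence of the input, which gives a "dynamic length" per entry, and that two sketches are then compared by cosine similarity. The function names, the test-sequence generator, and all parameters are hypothetical.

```python
import math
import random


def longest_prefix_as_subsequence(test_seq, seq):
    """Length of the longest prefix of test_seq that occurs as a subsequence of seq.

    A single greedy left-to-right scan is enough: matching each character
    at its earliest possible position never shortens the matchable prefix.
    """
    matched = 0
    for ch in seq:
        if matched == len(test_seq):
            break
        if ch == test_seq[matched]:
            matched += 1
    return matched


def make_test_sequences(count, length, alphabet="ACGT", seed=0):
    """Hypothetical helper: draw `count` random test sequences of a given length."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(count)]


def sketch(seq, test_seqs):
    """Map a sequence to an integer vector, one entry per test sequence (assumed scheme)."""
    return [longest_prefix_as_subsequence(t, seq) for t in test_seqs]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, two sequences would be compared with `cosine_similarity(sketch(a, tests), sketch(b, tests))`, where `tests` is a shared list from `make_test_sequences`. Both sequences must be sketched against the same test sequences for the vectors to be comparable.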

Version published to 10.1101/2025.02.05.636706v1 on bioRxiv
Feb 8, 2025

Related articles

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

This article has 3 authors:
1. Ali Ghaffaari
2. Alexander Schönhuth
3. Tobias Marschall
This article has no evaluations
Latest version Feb 17, 2025

EvANI benchmarking workflow for evolutionary distance estimation

This article has 4 authors:
1. Sina Majidian
2. Stephen Hwang
3. Mohsen Zakeri
4. Ben Langmead
This article has no evaluations
Latest version Feb 23, 2025

CADENCE: Clustering Algorithm - Density-based Exploration and Novelty Clustering with Efficiency

This article has 3 authors:
1. Lexin Chen
2. Daniel R. Roe
3. Ramón Alain Miranda-Quintana
This article has no evaluations
Latest version Feb 28, 2025
