Efficient Identification of Short Tandem Repeats via Context-Aware Motif Discovery and Ultra-Fast Sequence Alignment

Abstract

Tandem repeats (TRs) are highly polymorphic genomic elements that are associated with diverse molecular traits and implicated in numerous human diseases. However, large-scale analysis of TRs has been limited by computational challenges, including motif recognition, detection in complex regions, and excessive computational cost. Here we present FastSTR, a computationally efficient tool for precise detection and characterization of TRs. FastSTR integrates a context-aware N-gram motif model with a segmented global alignment algorithm to enable accurate motif identification and boundary definition, even for repeat units up to 8 bp. Across 13 species, FastSTR achieved >90% recall and 99% precision, running several times faster than existing methods while outperforming them in both sensitivity and accuracy. Applied to the human genome, FastSTR uncovered previously unannotated HSATII elements, resolved population-specific TRs, and identified recurrent STR alterations in lung cancer. These results demonstrate FastSTR as a versatile framework for TR annotation and discovery, advancing studies of genome evolution, genetic diversity, and disease.
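
To make the detection task concrete, the following is a minimal sketch of a naive exact-match scan for perfect tandem repeats with unit lengths of 1 to 8 bp. It is not FastSTR's algorithm: it has no context-aware N-gram model and no alignment step, so it misses interrupted or imperfect repeats. The function name `find_perfect_strs` and the thresholds `min_copies` and `min_len` are hypothetical choices made for illustration.

```python
def find_perfect_strs(seq, max_unit=8, min_copies=3, min_len=12):
    """Naive baseline: report perfect tandem repeats in a DNA string.

    Returns a list of (start, end, motif, copies) with `end` exclusive.
    Illustrative only; thresholds are assumptions, not FastSTR defaults.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            if i + k > n:
                break
            # Extend the run while each base equals the base k positions back.
            j = i + k
            while j < n and seq[j] == seq[j - k]:
                j += 1
            length = j - i          # may include a trailing partial copy
            copies = length // k    # number of complete copies of the unit
            if copies >= min_copies and length >= min_len:
                # Strict '>' keeps the shortest unit when spans tie.
                if best is None or length > best[1] - best[0]:
                    best = (i, j, seq[i:i + k], copies)
        if best is not None:
            hits.append(best)
            i = best[1]             # resume scanning after the reported repeat
        else:
            i += 1
    return hits


if __name__ == "__main__":
    example = "TTGAC" + "CAG" * 6 + "GTTCA"
    for start, end, motif, copies in find_perfect_strs(example):
        print(start, end, motif, copies)
```

A scan of this kind only tolerates exact copies, which is one reason tools in this space pair motif discovery with an alignment step to place repeat boundaries in the presence of substitutions and indels.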
