Ribbit: Accurate identification and annotation of complex tandem repeat sequences in genomes
Abstract
DNA tandem repeats (TRs) are crucial for genomic functions such as protein binding, chromatin modulation, splicing, and gene regulation. Abnormal length variations in TRs, especially expansions, are associated with over 60 neurodegenerative diseases. The function and stability of a TR locus depend on its sequence composition and purity. Recent studies report the disease-causing propensity of non-canonical motif expansions in TR loci and highlight the intricate polymorphism dynamics in complex loci encompassing adjacent, overlapping, and nested TRs. These reports emphasize the need for precise definition and motif decomposition of TR loci. To address this, we present Ribbit, a tool that accurately and efficiently identifies and annotates TR loci in a genome. Ribbit uses a 2-bit representation of DNA sequences to rapidly identify TRs with motif sizes of 2–100 bp and resolves complex TR structures. Ribbit efficiently handles imperfections such as indels and substitutions, and provides insights into nested and compound TR relationships through detailed motif decomposition. Comparative analyses using simulated data show that Ribbit outperforms existing tools such as Dot2dot and TRF in runtime and accuracy. Ribbit reports TR loci in the human genome with lower redundancy than TRF and provides resolved TR regions comparable to the variation clusters reported in recent catalogues. Therefore, Ribbit can be leveraged to understand the evolution and biology of complex TR regions in large genomes.
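To illustrate why a 2-bit representation lends itself to fast repeat detection, the following is a minimal sketch and not Ribbit's actual implementation. It assumes a hypothetical base-to-bits mapping (A=00, C=01, G=10, T=11) and handles only perfect repeats: the packed sequence is XORed against a copy of itself shifted by one candidate motif length, so every position where the base equals the base one motif length downstream yields a zero bit pair. The function names are illustrative, and indels and substitutions, which Ribbit handles, are out of scope here.

```python
# Hypothetical 2-bit code; this mapping is an assumption, not taken from Ribbit.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_2bit(seq):
    """Pack a DNA string into one integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def shift_matches(seq, period):
    """Return booleans where entry i is True if seq[i] == seq[i + period].

    The comparison is done with a single XOR of the packed sequence
    against itself shifted by `period` bases.
    """
    n = len(seq)
    if period <= 0 or period >= n:
        return []
    packed = encode_2bit(seq)
    overlap = n - period
    mask = (1 << (2 * overlap)) - 1
    prefix = packed >> (2 * period)  # bases 0 .. n-period-1
    suffix = packed & mask           # bases period .. n-1
    diff = prefix ^ suffix           # zero bit pair means identical bases
    matches = []
    for i in range(overlap):
        pair = (diff >> (2 * (overlap - 1 - i))) & 0b11
        matches.append(pair == 0)
    return matches


def longest_perfect_run(seq, period):
    """Return (start, length) of the longest tract that repeats perfectly with `period`."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, ok in enumerate(shift_matches(seq, period)):
        if ok:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        return (0, 0)
    # A run of k matching positions spans k + period bases of sequence.
    return (best_start, best_len + period)


if __name__ == "__main__":
    example = "TTCAGCAGCAGCAGTT"
    start, length = longest_perfect_run(example, 3)
    print(start, length, example[start:start + length])
```

In this sketch the motif length must be supplied by the caller; a scan over all motif sizes from 2 to 100 bp, as the abstract describes, would repeat the same shift-and-XOR step once per candidate size.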