StrPhaser constructs tandem repeat alleles from VCF data

Abstract

Variant calling is a ubiquitous genomic technique that underpins many scientific disciplines. From a computational perspective, variant calling is a form of logical compression; neglecting large variation, a person’s genome can be losslessly described as a set of differences (SNP and small InDel alleles) relative to the reference sequence. Another common genomic technique is haplotype phasing, wherein alleles are partitioned into their paternal and maternal components (as haplotypes). Some classes of alleles are more difficult to describe than others, e.g., short tandem repeats (STRs). STRs serve as a critical marker for many genetic assays. However, STRs tend not to be explicitly reported in most genomic workflows. Here, we present StrPhaser, a novel algorithm that leverages phased variant calling datasets in the VCF file format to construct STR alleles. We evaluated StrPhaser on ∼10,000 STR alleles from 284 human genomes, achieving an average allele accuracy of 91%. In addition, StrPhaser better recovers longer STR alleles than competing approaches; in principle, STR alleles that are longer than the maximum read length can be characterized. This capability, combined with its user-friendly interface, speed, and generation of both STR genotypes and visualizations, makes StrPhaser a valuable tool for a wide range of genomic studies.
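The idea of constructing STR alleles from phased variant calls can be illustrated with a minimal sketch. It is not StrPhaser's actual implementation, which the abstract does not detail. It assumes the phased records for one STR locus are already parsed into tuples and that the repeat motif is known. The names `build_haplotype` and `longest_repeat_run`, the toy reference sequence, and the example variant are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical record layout, loosely following VCF columns:
# (1-based POS, REF allele, list of ALT alleles, phased GT such as "0|1")
Variant = Tuple[int, str, List[str], str]


def build_haplotype(ref_seq: str, region_start: int,
                    variants: List[Variant], hap: int) -> str:
    """Apply the alleles carried by one haplotype (0 or 1) to a reference slice.

    ref_seq is the reference sequence of the region; region_start is the
    1-based coordinate of its first base.
    """
    pieces = []
    cursor = region_start  # 1-based position of the next reference base to copy
    for pos, ref, alts, gt in sorted(variants, key=lambda v: v[0]):
        allele_idx = int(gt.split("|")[hap])
        if allele_idx == 0 or pos < cursor:
            continue  # reference allele, or overlaps a variant already applied
        pieces.append(ref_seq[cursor - region_start:pos - region_start])
        pieces.append(alts[allele_idx - 1])
        cursor = pos + len(ref)  # skip the reference bases the variant replaces
    pieces.append(ref_seq[cursor - region_start:])
    return "".join(pieces)


def longest_repeat_run(seq: str, motif: str) -> int:
    """Return the largest number of consecutive motif copies found in seq."""
    best = 0
    n = len(motif)
    for i in range(len(seq) - n + 1):
        run, j = 0, i
        while seq[j:j + n] == motif:
            run += 1
            j += n
        best = max(best, run)
    return best


# Toy locus: flanks "TT" and "CC" around three GATA units, starting at position 101.
reference = "TT" + "GATA" * 3 + "CC"
# One multi-allelic indel: haplotype 0 loses a unit, haplotype 1 gains one.
calls = [(102, "TGATA", ["T", "TGATAGATA"], "1|2")]

hap0 = build_haplotype(reference, 101, calls, 0)
hap1 = build_haplotype(reference, 101, calls, 1)
genotype = (longest_repeat_run(hap0, "GATA"), longest_repeat_run(hap1, "GATA"))
print(genotype)
```

Because each allele is assembled from variant calls rather than read directly from a single sequencing read, its length in this sketch is not bounded by read length, which is the property the abstract highlights.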

Availability

StrPhaser is publicly available at https://github.com/XuewenWangUGA/StrPhaser.
