Haplotype-based Parallel PBWT for Biobank Scale Data

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Durbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of O ( MN ) for reporting all vs all haplotype matches in a population panel with M haplotypes and N variant sites. However, even this efficiency may still be too slow when the number of haplotypes reaches millions. To further reduce the run time, in this paper, a parallel version of the PBWT algorithms is introduced for all versus all haplotype matching, which is called HP-PBWT (haplotype-based parallel PBWT). HP-PBWT parallelly executes the PBWT by splitting a haplotype panel into blocks of haplotypes. HP-PBWT algorithms achieve parallelization for PBWT construction, reporting all versus all L-long matches, and reporting all versus all set-maximal matches while maintaining memory efficiency. HP-PBWT has an time complexity in PBWT construction, and an time complexity for reporting all versus all L-long matches and reporting all versus all set-maximal matches, where T is the number of threads and c* is the maximum number of matches (of length L or maximum divergence value for L-long matches and set-maximal matches, re-spectively) per haplotype per site. HP-PBWT achieves 4-fold speed-up in UK Biobank genotyping array data with 30 threads in the IO-included benchmarks. When applying HP-PBWT to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in the IO-excluded benchmarks, it can achieve a 22-fold speed-up with 60 cores on the Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.

Article activity feed