Haplotype-based Parallel PBWT for Biobank Scale Data
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Durbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of O ( MN ) for reporting all vs all haplotype matches in a population panel with M haplotypes and N variant sites. However, even this efficiency may still be too slow when the number of haplotypes reaches millions. To further reduce the run time, in this paper, a parallel version of the PBWT algorithms is introduced for all versus all haplotype matching, which is called HP-PBWT (haplotype-based parallel PBWT). HP-PBWT parallelly executes the PBWT by splitting a haplotype panel into blocks of haplotypes. HP-PBWT algorithms achieve parallelization for PBWT construction, reporting all versus all L-long matches, and reporting all versus all set-maximal matches while maintaining memory efficiency. HP-PBWT has an time complexity in PBWT construction, and an time complexity for reporting all versus all L-long matches and reporting all versus all set-maximal matches, where T is the number of threads and c* is the maximum number of matches (of length L or maximum divergence value for L-long matches and set-maximal matches, re-spectively) per haplotype per site. HP-PBWT achieves 4-fold speed-up in UK Biobank genotyping array data with 30 threads in the IO-included benchmarks. When applying HP-PBWT to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in the IO-excluded benchmarks, it can achieve a 22-fold speed-up with 60 cores on the Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.