Haplotype-based Parallel PBWT for Biobank Scale Data

Kecong Tang
Ahsan Sanaullah
Degui Zhi
Shaojie Zhang

Abstract

Durbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of O(MN) for reporting all versus all haplotype matches in a population panel with M haplotypes and N variant sites. However, even this efficiency may be too slow when the number of haplotypes reaches the millions. To further reduce the run time, this paper introduces a parallel version of the PBWT algorithms for all versus all haplotype matching, called HP-PBWT (haplotype-based parallel PBWT). HP-PBWT executes the PBWT in parallel by splitting a haplotype panel into blocks of haplotypes. The HP-PBWT algorithms parallelize PBWT construction, reporting of all versus all L-long matches, and reporting of all versus all set-maximal matches while maintaining memory efficiency. HP-PBWT has an time complexity in PBWT construction, and an time complexity for reporting all versus all L-long matches and reporting all versus all set-maximal matches, where T is the number of threads and c* is the maximum number of matches (of length L or maximum divergence value for L-long matches and set-maximal matches, respectively) per haplotype per site. HP-PBWT achieves a 4-fold speed-up on UK Biobank genotyping array data with 30 threads in the IO-included benchmarks. When applied to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in the IO-excluded benchmarks, it achieves a 22-fold speed-up with 60 cores on an Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.
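
For readers unfamiliar with the sequential baseline that the abstract refers to, the following is a minimal Python sketch of the standard PBWT construction step (the positional prefix array and divergence array update described by Durbin). It is not the HP-PBWT parallel scheme, whose block-splitting details are not given in the abstract. The function names `pbwt_step` and `build_pbwt` and the toy panel are illustrative assumptions.

```python
def pbwt_step(column, ppa, div, k):
    """Advance the prefix array (ppa) and divergence array (div) across site k.

    column[h] is the 0/1 allele of haplotype h at site k.
    """
    zeros, ones = [], []          # haplotype indices carrying allele 0 / 1 at site k
    div_zeros, div_ones = [], []  # divergence values aligned with the lists above
    p = q = k + 1                 # running match starts for the 0 and 1 groups
    for hap, start in zip(ppa, div):
        p = max(p, start)
        q = max(q, start)
        if column[hap] == 0:
            zeros.append(hap)
            div_zeros.append(p)
            p = 0
        else:
            ones.append(hap)
            div_ones.append(q)
            q = 0
    # Stable partition: all allele-0 haplotypes first, then allele-1 haplotypes.
    return zeros + ones, div_zeros + div_ones


def build_pbwt(panel):
    """Sweep all N sites of an M x N binary panel; O(MN) time overall."""
    m = len(panel)
    n = len(panel[0]) if m else 0
    ppa = list(range(m))
    div = [0] * m
    for k in range(n):
        column = [panel[h][k] for h in range(m)]
        ppa, div = pbwt_step(column, ppa, div, k)
    return ppa, div


# Hypothetical toy panel: 4 haplotypes, 3 sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
final_ppa, final_div = build_pbwt(panel)
```

In this sketch each site costs one pass over the M haplotypes, which is where the O(MN) total comes from. In the toy panel, haplotypes 0 and 3 are identical, so they end up adjacent in the final prefix array with a divergence value of 0.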

Version published to 10.1101/2025.02.04.636317v1 on bioRxiv
Feb 8, 2025

FastGA: Fast Genome Alignment

This article has 3 authors:
1. Gene Myers
2. Richard Durbin
3. Chenxi Zhou
This article has no evaluations
Latest version Jun 19, 2025

Human readable compression of GFA paths using grammar-based code

This article has 2 authors:
1. Peter Heringer
2. Daniel Doerr
This article has no evaluations
Latest version May 27, 2025

Movi Color: fast and accurate long-read classification with the move structure

This article has 4 authors:
1. Steven Tan
2. Sina Majidian
3. Ben Langmead
4. Mohsen Zakeri
This article has no evaluations
Latest version May 27, 2025
