Dynamic µ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data

Pramesh Shakya
Ahsan Sanaullah
Degui Zhi
Shaojie Zhang

Abstract

Durbin’s positional Burrows-Wheeler transform (PBWT) supports efficient haplotype matching and queries given a panel of haplotypes. It has been widely used for statistical phasing, imputation, and identity-by-descent (IBD) detection. However, the original PBWT does not support dynamic updates when haplotypes need to be added to or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem, but it is not memory efficient. While the memory constraints of the PBWT have been tackled by Syllable-PBWT and µ-PBWT, these are static data structures that do not allow updates. Additionally, Syllable-PBWT supports only long-match queries and µ-PBWT supports only set-maximal match queries, which limits their functionality in compressed form. In this paper, we present Dynamic µ-PBWT (which can also be seen as a compressed d-PBWT), which is memory efficient and supports dynamic updates. We run-length compress the PBWT to achieve a better compression rate and store the runs in self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel, and that the updates can be made without decompressing the index. In addition, we use orders of magnitude less memory than d-PBWT. We also provide a long match query algorithm that can easily be extended back to the original µ-PBWT. Overall, the flexibility and space efficiency of Dynamic µ-PBWT make it a potential index data structure for biobank-scale genetic data analyses. The source code for Dynamic µ-PBWT is available at https://github.com/ucfcbb/Dynamic-mu-PBWT.
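
To make the run-length idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It builds the PBWT-ordered allele column at each site of a small biallelic panel using Durbin's stable zeros-then-ones partition, then run-length encodes a column. The function names `pbwt_columns` and `run_length_encode` and the toy panel are assumptions made for this example. The sketch is static and uses plain Python lists; it does not model the self-balancing trees that Dynamic µ-PBWT uses to support insertions and deletions.

```python
from itertools import groupby


def pbwt_columns(panel):
    """Yield, for each site k, the alleles at site k listed in PBWT order.

    `panel` is a list of equal-length haplotypes, each a list of 0/1 alleles.
    PBWT order at site k sorts haplotypes by their reversed prefix before k.
    """
    if not panel:
        return
    n_sites = len(panel[0])
    order = list(range(len(panel)))  # initial order: input order
    for k in range(n_sites):
        yield [panel[i][k] for i in order]
        # Stable partition: haplotypes with allele 0 first, then allele 1.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


# Hypothetical toy panel: 4 haplotypes, 3 sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
columns = list(pbwt_columns(panel))
runs = [run_length_encode(col) for col in columns]
```

In real panels, haplotypes that share long matches end up adjacent in PBWT order, so columns tend to contain few long runs; that is the property run-length compression exploits.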

Version published to 10.1101/2025.02.04.636479v1 on bioRxiv
Feb 8, 2025

Haplotype-based Parallel PBWT for Biobank Scale Data

This article has 4 authors:
1. Kecong Tang
2. Ahsan Sanaullah
3. Degui Zhi
4. Shaojie Zhang
This article has no evaluations
Latest version Feb 8, 2025

Haplotype Matching with GBWT for Pangenome Graphs

This article has 4 authors:
1. Ahsan Sanaullah
2. Seba Villalobos
3. Degui Zhi
4. Shaojie Zhang
This article has no evaluations
Latest version Feb 7, 2025

Measuring Genomic Data with PFP

This article has 3 authors:
1. Zsuzsanna Lipták
2. Simone Lucà
3. Francesco Masillo
This article has no evaluations
Latest version Feb 27, 2025
