Dynamic µ -PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Durbin’s positional Burrows-Wheeler transform (PBWT) supports efficient haplotype matching and queries given a panel of haplotypes. It has been widely used for statistical phasing, imputation and identity-by-descent (IBD) detection. However, the original PBWT panel doesn’t support dynamic updates when haplotypes need to be added or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem but it is not memory efficient. While the memory constraint problem of the PBWT has been tackled by Syllable-PBWT and µ -PBWT, these are static data structures that do not allow updates. Additionally, Syllable-PBWT only supports long-match query and µ -PBWT only supports set-maximal match query, limiting their functionality in the compressed form. In this paper, we present Dynamic µ -PBWT (which can also be seen as compressed d-PBWT) that is memory efficient and supports dynamic updates. We run-length compress PBWT to achieve better compression rate and store the runs in the self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel and the updates can be made without decompressing the index. In addition, we use orders of magnitude less memory than d-PBWT. We also provide a long match query algorithm that can easily be extended back to the original µ -PBWT. Overall, the flexibility and space-efficiency of Dynamic µ -PBWT makes it a potential index data structure for biobank scale genetic data analyses. The source code for Dynamic µ -PBWT is available at https://github.com/ucfcbb/Dynamic-mu-PBWT .

Article activity feed