Billi: Provably Accurate and Scalable Bubble Detection in Pangenome Graphs

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

A key application of pangenome graphs is the characterization of small and large genomic variants represented as bubbles within the graph. Although bubbles have been extensively studied in directed graphs in the context of genome assembly, there remains a need for a rigorous definition and systematic analysis of bubbles in bidirected graphs, which is the predominant data structure used to represent pangenomes. We show that existing bubble definitions for bidirected graphs do not fully meet the requirements for representing genetic sites and alleles; in particular, overlapping bubbles may not exhibit strict nesting. To address this, we introduce a new subgraph abstraction called panbubble and prove that it satisfies the desired structural properties for variant characterization. We then present an exact algorithm with runtime O(V^2(V + E)) for detecting all panbubbles in a bidirected graph G = (V, E). In addition, we propose a heuristic algorithm that produces identical output as the exact algorithm in practice and scales to large graphs, including both the first and second releases of the Human Pangenome Reference Consortium (HPRC). We implemented our algorithms in the tool Billi (github.com/at-cg/billi). On our largest dataset, Billi is more than 15x faster and uses over 5x less memory than VG.

Article activity feed