vMUS-dBG: A Novel De Bruijn Graph Model for De Novo Genome Assembly Using Variable-Length Minimum Unique Substrings

Abstract

De novo genome assembly using de Bruijn graphs (DBGs) typically relies on fixed-length k-mers as the nodes of the graph. This approach is effective but presents a fundamental trade-off: smaller k values tend to collapse repeats, whereas larger k values can fragment the graph, particularly in low-coverage regions. Multi-k and variable-order methods help mitigate these issues, but they still rely on fixed-length topology or heuristic parameter selection. In this work, we introduce a de Bruijn graph constructed from Minimum Unique Substrings (MUSs), substrings that appear exactly once within the genome. We refer to this graph as the variable-length MUS de Bruijn graph (vMUS-dBG). In the vMUS-dBG, nodes are defined by read-extracted MUS anchors, and directed edges represent read-supported transitions between successive occurrences of MUSs. Each edge also carries instance-level metadata that preserves positional weights (repeats) and support counts. This design eliminates the need to select a global k-mer size, while producing a concrete, repeat-aware graph construction that operates differently from the abstract manifold-style DBG model. In experiments on real 24x E. coli K-12 HiFi data, a prototype implementation of our approach achieves contiguity and accuracy comparable to those of a fixed-k method. These results establish MUS-based variable-length graph construction as a principled and biologically grounded alternative to fixed-k de Bruijn graph assembly that merits further exploration.
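
To make the two central ideas concrete, the following toy sketch in Python may help. It is not the authors' implementation. It assumes the common definition of a minimal unique substring (a substring that occurs exactly once and whose proper substrings all repeat), finds such substrings by brute force on a short reference string rather than extracting them from reads, and then counts transitions between successive anchor hits within each read. The function names `count_occurrences`, `minimal_unique_substrings`, and `anchor_transitions` are hypothetical, and the instance-level edge metadata described in the abstract is reduced here to a plain support count.

```python
from collections import Counter


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    return sum(
        1
        for i in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, i)
    )


def minimal_unique_substrings(text):
    """Brute-force search for minimal unique substrings of text.

    Returns (start, substring) pairs. Assumption: "minimal" means the
    substring is unique and every proper substring occurs more than once.
    Only suitable for toy inputs.
    """
    result = []
    n = len(text)
    for start in range(n):
        for length in range(1, n - start + 1):
            candidate = text[start:start + length]
            if count_occurrences(text, candidate) != 1:
                continue
            # candidate is the shortest unique substring beginning at
            # start, so dropping its last character already repeats.
            # It is minimal if dropping its first character repeats too.
            suffix = candidate[1:]
            if not suffix or count_occurrences(text, suffix) > 1:
                result.append((start, candidate))
            break
    return result


def anchor_transitions(reads, anchors):
    """Count read-supported transitions between successive anchor hits.

    Each read is scanned for every anchor; hits are ordered by position,
    and each adjacent pair of hits adds one unit of support to an edge.
    """
    edges = Counter()
    for read in reads:
        hits = sorted(
            (pos, anchor)
            for anchor in anchors
            for pos in range(len(read) - len(anchor) + 1)
            if read.startswith(anchor, pos)
        )
        for (_, src), (_, dst) in zip(hits, hits[1:]):
            edges[(src, dst)] += 1
    return edges


if __name__ == "__main__":
    genome = "ACGTACGA"  # toy sequence, not real data
    anchors = [s for _, s in minimal_unique_substrings(genome)]
    reads = ["CGTACGA", "GTACGA"]  # toy reads drawn from the toy sequence
    print(anchors)
    print(dict(anchor_transitions(reads, anchors)))
```

Note that the anchors in this sketch have different lengths, which is the sense in which no single global k has to be chosen. A practical implementation would replace the quadratic scans with an index such as a suffix array and would have to cope with sequencing errors and reverse complements, none of which is modelled here.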
