Genome Reconstruction with De Bruijn Graph Networks

Abstract

Short-read genome assembly still struggles with repeats and sequencing noise, where heuristic graph traversals often misresolve branches and break contigs. To test whether a learning-guided approach that scores edges directly on the unitig graph can select paths more reliably without paired-end or long-read scaffolding, we introduce COGRAM, a graph-learning assembly pipeline that integrates a compacted de Bruijn unitig graph with a GCN-guided hybrid search to score edges and reconstruct paths. On Escherichia coli, the method achieves a strong F1 and high global genome coverage, with behavior that varies by local graph complexity: large sampled regions allow the greedy phase to traverse most nodes before beam expansion, yielding high F1; medium-complexity fragments can trap the beam search and truncate recall and coverage; very small regions are trivially solved. These observations motivate three practical tuning levers (expanding the greedy horizon, widening the beam, and increasing the top-k retained at branch points) that trade additional computation for robustness. Eulerian assemblers such as SPAdes, Velvet, and ABySS combine bubble popping, tip trimming, and paired-end scaffolding and routinely exceed 99% genome fraction with long, low-error contigs. COGRAM purposefully takes a different route: it poses a Hamiltonian reconstruction on the unitig graph and decodes with a greedy-plus-beam strategy. In early testing, the unitig graph covers 97.9% of nodes with 94.7% recall and forms one dominant path, and the model attains approximately 95% overlap without paired-end or long-read information, evidence that the GCN learns local overlap patterns. Improving contiguity (N50), reducing misassemblies, and lowering base-error rates are deferred to future work; the present results establish COGRAM as a promising proof of concept that bridges learning-based edge inference with classical DBG assembly mechanics.
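The greedy-plus-beam decoding and its three tuning levers can be illustrated with a minimal Python sketch. This is not COGRAM's implementation: the function `greedy_then_beam`, its parameters `greedy_steps`, `beam_width`, and `top_k`, the toy graph, and the edge scores are all hypothetical, and the scores simply stand in for whatever a trained GCN would assign to unitig-graph edges. Path scores are assumed to be plain sums of edge scores.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: node -> list of (successor, edge score).
# In a real pipeline the scores would come from a trained model.
Graph = Dict[str, List[Tuple[str, float]]]


def greedy_then_beam(graph: Graph, start: str, greedy_steps: int = 3,
                     beam_width: int = 2, top_k: int = 2) -> List[str]:
    """Follow the best-scored edge for `greedy_steps`, then beam-search onward."""
    # Phase 1: greedy walk, always taking the highest-scoring unvisited successor.
    path, score = [start], 0.0
    for _ in range(greedy_steps):
        candidates = [(n, w) for n, w in graph.get(path[-1], []) if n not in path]
        if not candidates:
            break
        nxt, w = max(candidates, key=lambda c: c[1])
        path.append(nxt)
        score += w

    # Phase 2: beam search from the end of the greedy prefix.
    beam = [(score, path)]
    best = (score, path)
    while beam:
        expansions = []
        for sc, p in beam:
            # Keep only the top_k unvisited successors at each branch point.
            succ = [(n, w) for n, w in graph.get(p[-1], []) if n not in p]
            succ.sort(key=lambda c: c[1], reverse=True)
            for n, w in succ[:top_k]:
                expansions.append((sc + w, p + [n]))
        if not expansions:
            break
        expansions.sort(key=lambda e: e[0], reverse=True)
        beam = expansions[:beam_width]
        # Prefer paths that visit more nodes; break ties by total score.
        cand = max(beam, key=lambda e: (len(e[1]), e[0]))
        if (len(cand[1]), cand[0]) > (len(best[1]), best[0]):
            best = cand
    return best[1]


# Toy graph in which the locally best edge (A -> B) skips node C.
toy_graph: Graph = {
    "A": [("B", 0.9), ("C", 0.8)],
    "B": [("D", 0.9)],
    "C": [("B", 0.7), ("D", 0.2)],
    "D": [],
}

narrow = greedy_then_beam(toy_graph, "A", greedy_steps=0, beam_width=1)
wide = greedy_then_beam(toy_graph, "A", greedy_steps=0, beam_width=2)
print(narrow)  # ['A', 'B', 'D']       (node C is never visited)
print(wide)    # ['A', 'C', 'B', 'D']  (all four nodes visited)
```

In this toy case a beam of width 1 commits to the locally best edge and misses a node, while a beam of width 2 keeps the alternative branch alive long enough to find the path through every node. That is the kind of computation-for-robustness trade the tuning levers describe, shown here on invented data only.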
