Genome Reconstruction with De Bruijn Graph Networks
Abstract
Short-read genome assembly still struggles with repeats and sequencing noise, where heuristic graph traversals often misresolve branches and break contigs. To test whether a learning-guided approach that scores edges directly on the unitig graph can select paths more reliably without paired-end or long-read scaffolding, we introduce COGRAM, a graph-learning assembly pipeline that integrates a compacted de Bruijn unitig graph with a GCN-guided hybrid search to score edges and reconstruct paths. On Escherichia coli, the method achieves a strong F1 and high global genome coverage, with behavior that varies by local graph complexity: large sampled regions allow the greedy phase to traverse most nodes before beam expansion, yielding high F1; medium-complexity fragments can trap the beam search and truncate recall and coverage; very small regions are trivially solved. These observations motivate three practical tuning levers (expanding the greedy horizon, widening the beam, and increasing the top-k retained at branch points) that trade additional computation for robustness. Eulerian assemblers such as SPAdes, Velvet, and ABySS combine bubble popping, tip trimming, and paired-end scaffolding to routinely exceed 99% genome fraction with long, low-error contigs. COGRAM purposefully takes a different route: it poses a Hamiltonian reconstruction on the unitig graph and decodes with a greedy-plus-beam strategy. In early testing, the unitig graph covers 97.9% of nodes with 94.7% recall and forms one dominant path, and the model attains approximately 95% overlap without paired-end or long-read information, evidence that the GCN learns local overlap patterns. Improving contiguity (N50), reducing misassemblies, and lowering base-error rates are deferred to future work; the present results establish COGRAM as a promising proof of concept that bridges learning-based edge inference with classical DBG assembly mechanics.
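To make the greedy-plus-beam decoding and its tuning levers concrete, the following is a minimal sketch and not COGRAM's actual implementation. It assumes the unitig graph is given as a dictionary of successor scores, with the numbers standing in for GCN edge scores. The function name `greedy_then_beam`, the parameters `greedy_steps`, `beam_width`, and `top_k`, the additive path score, and the toy graph are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: node -> {successor: edge score}.
# The scores stand in for GCN outputs; they are invented for this example.
Graph = Dict[str, Dict[str, float]]


def greedy_then_beam(graph: Graph, start: str, greedy_steps: int = 3,
                     beam_width: int = 2, top_k: int = 2) -> List[str]:
    """Decode a simple path: a greedy phase followed by beam search."""
    path = [start]
    visited = {start}

    # Greedy phase: follow the best-scoring unvisited successor.
    for _ in range(greedy_steps):
        candidates = {v: s for v, s in graph.get(path[-1], {}).items()
                      if v not in visited}
        if not candidates:
            break
        nxt = max(candidates, key=candidates.get)
        path.append(nxt)
        visited.add(nxt)

    # Beam phase: each beam is (cumulative score, path so far).
    beams: List[Tuple[float, List[str]]] = [(0.0, path)]
    best = beams[0]
    while beams:
        expanded = []
        for score, p in beams:
            seen = set(p)
            # Keep only the top_k unvisited successors at each branch point.
            succ = sorted(((s, v) for v, s in graph.get(p[-1], {}).items()
                           if v not in seen), reverse=True)[:top_k]
            for s, v in succ:
                expanded.append((score + s, p + [v]))
        if not expanded:
            break  # every surviving beam has reached a dead end
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
        # Prefer longer paths; break ties by cumulative score.
        cand = max(beams, key=lambda b: (len(b[1]), b[0]))
        if (len(cand[1]), cand[0]) > (len(best[1]), best[0]):
            best = cand
    return best[1]


# Toy graph: at B, the higher-scoring edge leads to a dead-end tip (C),
# while the lower-scoring edge continues through D to E.
toy_graph: Graph = {
    "A": {"B": 0.9},
    "B": {"C": 0.8, "D": 0.7},
    "C": {},
    "D": {"E": 0.9},
    "E": {},
}

narrow = greedy_then_beam(toy_graph, "A", greedy_steps=1, beam_width=1, top_k=2)
wide = greedy_then_beam(toy_graph, "A", greedy_steps=1, beam_width=2, top_k=2)
print(narrow)
print(wide)
```

In this toy case, a beam of width 1 commits to the dead-end tip, whereas a beam of width 2 with `top_k=2` keeps the lower-scoring branch alive long enough to recover the longer path. This is the kind of computation-for-robustness trade that widening the beam or raising the top-k is meant to offer.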