GnnDebugger: GNN based error correction in De Bruijn Graphs
Abstract
Motivation
Modern sequencing technologies have enabled the reconstruction of complete mammalian genomes from telomere to telomere. However, scaling this achievement to thousands of species and to population-level studies remains a challenge. Key bottlenecks include the low quality of draft assemblies and high coverage requirements. Reconstructing complete and accurate sequences of both haplotypes in diploid genomes is especially difficult, because the sequencing depth is not always sufficient to properly reconstruct diverged regions. Inspired by the success of neural networks in extracting patterns from data at massive scale, we introduce a method for correcting errors in De Bruijn graphs using Graph Neural Networks.
Results
Our model reliably classifies edges as correct or erroneous, especially for diploid genomes with a coverage depth of 35 and lower. We demonstrate that these predictions can guide the downstream read error correction algorithm and the genome assembler, ultimately yielding more accurate assemblies.
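To make the edge classification task concrete, the following is a minimal sketch of a De Bruijn graph whose edges carry read coverage, together with a naive coverage-threshold rule for flagging suspect edges. It is not the GnnDebugger model: the threshold rule is only a simple baseline of the kind a learned classifier would be compared against, and the function names, the toy reads, and the values of `k` and `min_coverage` are all assumptions made for illustration.

```python
from collections import Counter


def de_bruijn_edge_coverage(reads, k):
    """Count how often each edge (a (k+1)-mer linking two k-mers) occurs in the reads."""
    coverage = Counter()
    for read in reads:
        # Slide a window of length k+1; its prefix and suffix k-mers are the edge endpoints.
        for i in range(len(read) - k):
            window = read[i:i + k + 1]
            coverage[(window[:-1], window[1:])] += 1
    return coverage


def flag_low_coverage_edges(coverage, min_coverage):
    """Naive baseline (hypothetical): mark an edge as suspect if its coverage is below a threshold."""
    return {edge: count < min_coverage for edge, count in coverage.items()}


# Toy input: three identical reads and one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
coverage = de_bruijn_edge_coverage(reads, k=3)
suspect = flag_low_coverage_edges(coverage, min_coverage=2)

for (source, target), count in sorted(coverage.items()):
    label = "suspect" if suspect[(source, target)] else "kept"
    print(f"{source} -> {target}  coverage={count}  {label}")
```

In this toy case the substitution creates a low-coverage branch that the threshold flags. A fixed threshold of this kind becomes unreliable when true edges are themselves sparsely covered, as in diverged regions of a diploid genome sequenced at low depth, which is the setting the paper targets with a graph-based classifier.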
Availability and implementation
Both GnnDebugger (https://github.com/m5imunovic/gnndebugger) and LJA (https://github.com/AntonBankevich/LJA/tree/gnndebugger) are available on GitHub. The datasets used for training and testing the ML model are available on Zenodo: https://doi.org/10.5281/zenodo.15073168. The HG002 reference and reads are available at https://github.com/marbl/HG002. The primate references and reads are available at https://github.com/marbl/Primates.