De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing

Florestan De Moor
Olivier Boullé
Dominique Lavenier

Abstract

Motivation

DNA-based data storage offers a compelling solution for long-term, high-density archiving. In this framework, accurately reconstructing high-quality encoded sequences after sequencing is critical, as it has a direct impact on the design of error-correcting codes optimized for DNA storage. Furthermore, efficient and scalable processing is essential to manage the large volumes of data expected in such applications.

Results

We introduce a novel method based on de Bruijn graph partitioning. It enables fast and accurate processing of sequencing data regardless of the underlying sequencing technology, and it requires no prior knowledge of the information encoded in the oligonucleotides. Evaluated on both synthetic and real datasets, the method achieves excellent precision and recall. It is implemented in C++ in the software ConCluD and optimized for multi-core servers. In our experiments, a dataset of 55 million reads, corresponding to a 135 MB binary file, was processed in less than 10 minutes on a server with 16 hyper-threaded cores.
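The abstract does not spell out how ConCluD partitions the graph, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes an edge-centric de Bruijn graph (nodes are (k-1)-mers, each observed k-mer is an edge) and groups reads by connected component using a union-find structure. The names `DisjointSet` and `partition_reads` are hypothetical and do not come from ConCluD, and the sketch does no error filtering, which real sequencing data would likely need.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Disjoint-set (union-find) over integer node ids.
struct DisjointSet {
    std::vector<int> parent;

    // Create a new singleton set and return its id.
    int add() {
        parent.push_back(static_cast<int>(parent.size()));
        return parent.back();
    }

    // Find the representative of x, with path halving.
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Hypothetical helper (not ConCluD's API): returns one partition label per
// read. Reads shorter than k get the label -1. Assumes k >= 2.
std::vector<int> partition_reads(const std::vector<std::string>& reads,
                                 std::size_t k) {
    assert(k >= 2);
    std::unordered_map<std::string, int> node_id;  // (k-1)-mer -> node id
    DisjointSet ds;

    // Look up a (k-1)-mer, creating a new graph node on first sight.
    auto id_of = [&](const std::string& s) {
        auto it = node_id.find(s);
        if (it != node_id.end()) return it->second;
        int id = ds.add();
        node_id.emplace(s, id);
        return id;
    };

    // Each k-mer is an edge joining its (k-1)-mer prefix and suffix.
    for (const auto& r : reads) {
        if (r.size() < k) continue;
        for (std::size_t i = 0; i + k <= r.size(); ++i) {
            int u = id_of(r.substr(i, k - 1));
            int v = id_of(r.substr(i + 1, k - 1));
            ds.unite(u, v);
        }
    }

    // A read lies entirely within one component, so its first (k-1)-mer
    // is enough to identify the partition.
    std::vector<int> labels;
    labels.reserve(reads.size());
    for (const auto& r : reads) {
        labels.push_back(r.size() < k
                             ? -1
                             : ds.find(node_id.at(r.substr(0, k - 1))));
    }
    return labels;
}
```

Calling `partition_reads(reads, 4)` on a small vector of reads gives equal labels to reads that share a 3-mer, directly or through a chain of other reads. In practice, sequencing errors and sequences common to many oligonucleotides would tend to merge unrelated components, so a real tool would need additional filtering that this sketch leaves out.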

Version published to 10.1101/2025.05.19.654814 on bioRxiv
May 23, 2025

Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

This article has 4 authors:
1. Alejandro Jaime
2. Veronica Gil-Costa
3. Marcelo Errecalde
4. Leticia Cagnina
This article has no evaluations
Latest version Jan 19, 2026

Lossless Pangenome Indexing Using Tag Arrays

This article has 3 authors:
1. Parsa Eskandar
2. Benedict Paten
3. Jouni Sirén
This article has no evaluations
Latest version Jan 18, 2026

GPU-accelerated modeling of biological regulatory networks

This article has 7 authors:
1. Joyce Reimer
2. Pranta Saha
3. Chris Chen
4. Neeraj Dhar
5. Brook Byrns
6. Steven Rayan
7. Gordon Broderick
This article has no evaluations
Latest version Jan 5, 2026
