Maptcha: An efficient parallel workflow for hybrid genome scaffolding

Oieswarya Bhowmik
Tazin Rahman
Ananth Kalyanaraman

Abstract

Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. Constructing accurate genomic scaffolds in hybrid workflows is complicated by scale, by the diversity of sequencing technologies (e.g., short vs. long reads, contigs or partial assemblies), and by repetitive regions within the target genome.

Results: In this paper, we present a new parallel workflow for hybrid genome scaffolding that allows pre-constructed partial assemblies to be combined with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, aims to generate long scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a <contig,contig> graph using long reads as linking information. This graph is then used to generate scaffolds. We present and evaluate a graph-theoretic "wiring" heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler of choice for generating the final scaffolds.

Conclusions: Our experiments with Maptcha on a variety of input genomes, and a comparison against a state-of-the-art hybrid scaffolder (LRScaf), demonstrate that Maptcha generates longer and more accurate scaffolds with significantly shorter runtimes. For instance, Maptcha produces scaffolds with an NG50 length of 4.8Mbp (compared to 171Kbp by LRScaf) for T. crassiceps, and 81Mbp (compared to 4.5Mbp by LRScaf) for Human chr 7, while reducing the runtime from hours to minutes in several cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low coverage settings (1x to 10x).
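
The idea of a <contig,contig> graph that uses long reads as linking information can be illustrated with a minimal sketch. This is not Maptcha's implementation. It assumes the mapping step has already produced, for each long read, the ordered list of contigs that read maps to. The function name `build_contig_graph`, the `min_support` threshold, and the toy read and contig identifiers are all hypothetical.

```python
from collections import defaultdict


def build_contig_graph(read_to_contigs, min_support=1):
    """Build an undirected contig-contig graph from long-read mappings.

    read_to_contigs: dict mapping a long-read id to the list of contig ids
    it maps to, ordered along the read (an assumed input format).
    Returns a dict {(contig_a, contig_b): number_of_supporting_reads}.
    """
    support = defaultdict(int)
    for contigs in read_to_contigs.values():
        # A read that spans two consecutive contigs is evidence that
        # those contigs are adjacent in the genome.
        for a, b in zip(contigs, contigs[1:]):
            if a != b:
                edge = tuple(sorted((a, b)))
                support[edge] += 1
    # Keep only edges supported by at least min_support reads.
    return {edge: n for edge, n in support.items() if n >= min_support}


# Toy input: three long reads mapped to four contigs (hypothetical data).
mappings = {
    "read1": ["c1", "c2", "c3"],
    "read2": ["c2", "c3"],
    "read3": ["c3", "c4"],
}
graph = build_contig_graph(mappings)
for (a, b), n in sorted(graph.items()):
    print(a, b, n)
```

In this toy case the edge between c2 and c3 is supported by two reads and the other edges by one each. A scaffolder would then walk such a graph to order contigs; the abstract's "wiring" heuristic is one way to do that and is not reproduced here.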

Version published to 10.1101/2024.03.25.586701v1 on bioRxiv
Mar 27, 2024

Related articles

Hybrid Sequencing Facilitates Robust De Novo Plasmid Assembly

This article has 5 authors:
1. Sarah I. Hernandez
2. Casey-Tyler Berezin
3. Katie M. Miller
4. Samuel J. Peccoud
5. Jean Peccoud
This article has no evaluations
Latest version Mar 26, 2024

GCI: a continuity inspector for complete genome assembly

This article has 4 authors:
1. Quanyu Chen
2. Chentao Yang
3. Guojie Zhang
4. Dongya Wu
This article has no evaluations
Latest version Apr 9, 2024

Building better genome annotations across the tree of life

This article has 2 authors:
1. Adam H Freedman
2. Timothy B Sackton
This article has no evaluations
Latest version Apr 15, 2024
