Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment
This article has been reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that primarily focus on either speed or accuracy. Widely used read mappers (minimap2 and NGMLR) implement various heuristics and scoring schemes to optimize for speed or accuracy, and their performance varies across genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to a single gap penalty across distinct mutational hot spots reduces read alignment accuracy and impedes structural variant detection.
Findings
We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan uses the normalized edit distance of reads mapped by minimap2 to identify poorly aligned reads and realigns them with NGMLR, a more accurate yet computationally more expensive long-read mapper. In support of our hypothesis, we show that Vulcan improves the alignments of Oxford Nanopore Technologies long reads for both simulated and real datasets. These improvements, in turn, lead to more accurate structural variant calling on human genome datasets compared to either read mapper alone.
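The read-selection step described above can be illustrated with a minimal sketch. It is not Vulcan's actual implementation: the `Alignment` record, the normalization by aligned length, and the `keep_fraction` cutoff of 0.9 are all assumptions made for illustration, since the abstract does not state how the distance is normalized or where the threshold lies. In practice the edit distance would come from the `NM` tag of the minimap2 SAM/BAM output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Alignment:
    """Hypothetical summary of one minimap2 alignment (not a Vulcan type)."""
    read_id: str
    edit_distance: int   # e.g. the NM tag of the SAM record
    aligned_length: int  # number of aligned bases (assumed normalizer)


def normalized_edit_distance(aln: Alignment) -> float:
    """Edit distance per aligned base; empty alignments count as worst case."""
    if aln.aligned_length <= 0:
        return 1.0
    return aln.edit_distance / aln.aligned_length


def split_for_realignment(
    alignments: List[Alignment], keep_fraction: float = 0.9
) -> Tuple[List[Alignment], List[Alignment]]:
    """Split alignments into (kept, to_realign).

    Reads are ranked by normalized edit distance. The best `keep_fraction`
    keep their first-pass alignment; the rest are flagged for realignment
    with the slower, more accurate mapper. The cutoff is an assumed value.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    ranked = sorted(alignments, key=normalized_edit_distance)
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep], ranked[n_keep:]


if __name__ == "__main__":
    reads = [
        Alignment("a", 5, 1000),
        Alignment("b", 300, 1000),
        Alignment("c", 20, 1000),
        Alignment("d", 150, 1000),
    ]
    kept, to_realign = split_for_realignment(reads, keep_fraction=0.5)
    print("keep:", [r.read_id for r in kept])
    print("realign:", [r.read_id for r in to_realign])
```

Ranking by a percentile rather than a fixed distance value keeps the share of realigned reads predictable across datasets with different error rates, which is one plausible way to bound the cost of the second, slower pass.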
Conclusions
Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.
Article activity feed
Now published in GigaScience doi: 10.1093/gigascience/giab063
Yilei Fu (1), Medhat Mahmoud (2, 3), Viginesh Vaibhav Muraliraman (1), Fritz J. Sedlazeck (2), Todd J. Treangen (1)
1 Department of Computer Science, Rice University, Houston, TX 77005, USA
2 Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, United States of America
3 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, United States of America
For correspondence: Fritz.Sedlazeck@bcm.edu, treangen@rice.edu
A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab063 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
These peer reviews were as follows:
Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102841
Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102842