Error correcting optical mapping data

Abstract

Optical mapping is a unique system capable of producing high-resolution, high-throughput genomic map data that give information about the structure of a genome. Recently it has been used for scaffolding contigs and for assembly validation in large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment to the use of these data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error-correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high precision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Of the deletion errors corrected, 98.26% are true errors. Similarly, of the insertion errors corrected, 82.19% are true errors. cOMet also scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error-corrected Rmap data result in an assembly that is more contiguous and covers a larger fraction of the genome.
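To make the two error types named in the abstract concrete, the following is a minimal sketch, not cOMet itself. It assumes an Rmap can be represented as an ordered list of fragment sizes between consecutive cut sites, that an insertion error is a spurious cut site splitting one fragment in two, and that a deletion error is a missed cut site merging two adjacent fragments. The function names and the fragment sizes (in kbp) are hypothetical and chosen only for illustration.

```python
from typing import List


def add_cut_site(fragments: List[float], index: int, offset: float) -> List[float]:
    """Simulate an insertion error: a spurious cut splits one fragment in two."""
    size = fragments[index]
    if not 0 < offset < size:
        raise ValueError("offset must fall strictly inside the fragment")
    return fragments[:index] + [offset, size - offset] + fragments[index + 1:]


def miss_cut_site(fragments: List[float], index: int) -> List[float]:
    """Simulate a deletion error: a missed cut merges fragments index and index + 1."""
    if not 0 <= index < len(fragments) - 1:
        raise ValueError("index must refer to a fragment that has a right neighbour")
    merged = fragments[index] + fragments[index + 1]
    return fragments[:index] + [merged] + fragments[index + 2:]


# Hypothetical error-free Rmap: fragment sizes in kbp (assumed values).
true_rmap = [12.5, 4.0, 30.0, 8.5]

# A spurious cut 10 kbp into the third fragment yields one extra fragment.
with_insertion = add_cut_site(true_rmap, 2, 10.0)

# Missing the cut between the first two fragments yields one fewer fragment.
with_deletion = miss_cut_site(true_rmap, 0)

print(with_insertion)
print(with_deletion)
```

In this simplified model both error types preserve the total length of the Rmap while changing the number and sizes of its fragments, which is one way to see why the abstract compares the problem to handling insertions and deletions in long-read alignment. Real Rmaps also carry fragment sizing error, which this sketch leaves out.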

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giy061

    Kingshuk Mukherjee (1), Darshan Washimkar (2), Martin D. Muggli (2), Leena Salmela (3), Christina Boucher (1)

    1 Department of Computer and Information Science and Engineering, University of Florida, Gainesville
    2 Department of Computer Science, Colorado State University, Fort Collins
    3 Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki

    For correspondence: kingdgp@ufl.edu

    A version of this preprint has been published in the Open Access journal GigaScience (see https://doi.org/10.1093/gigascience/giy061), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    The peer reviews are as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101178
    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101179
    Reviewer 3: http://dx.doi.org/10.5524/REVIEW.101180