Fast-SG: an alignment-free algorithm for hybrid assembly

Abstract

Background

Long-read sequencing technologies are the ultimate solution for genome repeats, allowing near reference-level reconstructions of large genomes. However, long-read de novo assembly pipelines are computationally intensive and require considerable coverage, which hinders their broad application to the assembly of large genomes. Alternatively, hybrid assembly methods that combine short- and long-read sequencing technologies can reduce the time and cost required to produce de novo assemblies of large genomes.

Results

Here, we propose a new method, called Fast-SG, that uses a new ultrafast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures. Fast-SG can construct the graph from either short or long reads. This allows the reuse of efficient algorithms designed for short-read data and permits the definition of novel modular hybrid assembly pipelines. Using comprehensive standard datasets and benchmarks, we show how Fast-SG outperforms the state-of-the-art short-read aligners when building the scaffolding graph and can be used to extract linking information from either raw or error-corrected long reads. We also show how a hybrid assembly approach using Fast-SG with shallow long-read coverage (5X) and moderate computational resources can produce long-range and accurate reconstructions of the genomes of Arabidopsis thaliana (Ler-0) and human (NA12878).
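
The abstract does not spell out the algorithm, so the following is only a generic illustration of how linking information for a scaffolding graph can be collected without alignment, using exact k-mer lookups. It is not the Fast-SG implementation. The function names, the value of k, and the toy sequences are all assumptions made for this sketch, and it deliberately ignores reverse complements, read orientation, and insert-size estimation.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def index_unique_kmers(contigs, k):
    """Map each k-mer that occurs exactly once across all contigs to its contig."""
    counts = Counter()
    owner = {}
    for name, seq in contigs.items():
        for km in kmers(seq, k):
            counts[km] += 1
            owner[km] = name
    return {km: name for km, name in owner.items() if counts[km] == 1}


def assign_contig(read, index, k):
    """Pick the contig supported by the most unique k-mers of the read."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    return hits.most_common(1)[0][0] if hits else None


def scaffolding_edges(contigs, read_pairs, k=5):
    """Count read pairs whose two mates land on different contigs."""
    index = index_unique_kmers(contigs, k)
    edges = defaultdict(int)
    for left, right in read_pairs:
        a = assign_contig(left, index, k)
        b = assign_contig(right, index, k)
        if a and b and a != b:
            edges[tuple(sorted((a, b)))] += 1
    return dict(edges)


# Hypothetical toy data: two contigs and two read pairs spanning them.
contigs = {"c1": "ACGTACGGTCA", "c2": "TTGACCATGGA"}
read_pairs = [("ACGTACGG", "GACCATGG"), ("CGGTCA", "TTGACC")]
print(scaffolding_edges(contigs, read_pairs, k=5))  # {('c1', 'c2'): 2}
```

Each edge weight is the number of pairs supporting a link between two contigs. In a real pipeline, a scaffolder would consume such weighted edges together with orientation and distance estimates, which this sketch omits.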

Conclusions

Fast-SG opens the door to accurate hybrid long-range reconstructions of large genomes with low effort, high portability, and low cost.

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giy048

    Authors: Alex Di Genova (1, 2, 3, 4, 5), Gonzalo A. Ruz (1, 6), Marie-France Sagot (3, 4), Alejandro Maass (2, 5, 7)

    Affiliations:

    1. Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile.
    2. Mathomics Bioinformatics Laboratory, Center for Mathematical Modeling, University of Chile, Av. Blanco Encalada 2120, 7th floor, Santiago, Chile.
    3. Inria Grenoble Rhône-Alpes, 655, Avenue de l’Europe, 38334 Montbonnot, France.
    4. CNRS, UMR5558, Université Claude Bernard Lyon 1, 43, Boulevard du 11 Novembre 1918, 69622 Villeurbanne, France.
    5. Fondap Center for Genome Regulation, Av. Blanco Encalada 2085, 3rd floor, Santiago, Chile.
    6. Center of Applied Ecology and Sustainability (CAPES), Santiago, Chile.
    7. Department of Mathematical Engineering, University of Chile, Av. Blanco Encalada 2120, 5th floor, Santiago, Chile.

    For correspondence: marie-france.sagot@inria.fr, amaass@dim.uchile.cl

    A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy048), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101129

    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101130