When do longer reads matter? A benchmark of long read de novo assembly tools for eukaryotic genomes
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.
Results
We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To evaluate the assemblers, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing. We include five commonly used long-read assemblers in our benchmark: Canu, Flye, Miniasm, Raven, and Redbean. The evaluation categories are reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality.
Conclusions
Our benchmark concludes that no assembler performs best in all the evaluation categories. However, our results show that overall Flye is the best-performing assembler, on both real and simulated data. In addition, the benchmarking with longer reads shows that increased read length improves assembly quality, but the extent of the improvement depends on the size and complexity of the reference genome.
Article activity feed
- 
      Competing Interest Statement: The authors have declared no competing interest.

**Reviewer 2: Katharina Scherf**

General comments

This paper is a very thorough report on large-scale proteomics mapping of ca. 4000 wheat samples and several challenges related to sample preparation, measurement and data analysis. It is the first paper reporting such an extensive dataset and tools for analysis. Overall, I think that the authors have done in-depth work and it is also described in a way that can be understood well. The descriptions of how the authors arrived at the final workflow will also be useful to other groups attempting to do proteomics of wheat or other grains. I have only a few comments for improvement.

Note: line numbers would have been helpful.

Specific comments

- Abstract, Results: "LMA expression greatly impacted grain starch and other carbohydrates …" and then alpha-gliadins and LMW glutenin are mentioned. However, these are proteins and their relation to starch/carbohydrates is not clear.
- Introduction overall: Please harmonize the use of alpha-amylase and a-amylase; alpha-amylase is recommended, or else the Greek letter.
- p3, L1: "great source of protein": In terms of quantity, this is true. However, you should also include a brief statement about protein quality, which is not ideal, especially when considering gluten proteins.
- Section 2.1: Please state whether all samples were grown together at the same place in one year (or not); i.e. include the information from section 3.1.1 already here.
- 
      This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad084), and the reviews have been published under the same license. These are as follows.

**Reviewer 1: Nobuaki Takemori**

The large proteome dataset for wheat, a representative grain, presented in this manuscript is valuable not only for agricultural science but also for basic plant science. Unfortunately, the manuscript is too wordy and too detailed in its descriptions. Of course, a detailed description of the experimental methods and data generation process is an important component of reproducibility, but excessive information in the main text may have the unintended effect of hindering the reader's understanding of the manuscript. The volume of the main text should be reduced to 1/2 or even 1/3 of the original by referring to the following suggested revisions.

- Title: It looks rather like the title of a review article and is not appropriate for an original research paper. An abbreviation is also used, making it difficult to understand. It should be changed to a title that more specifically and pragmatically reflects the content of the paper.
- Materials and Methods 2.3: The sample pretreatment used in this experiment has already been described in Ref. 41, so a detailed description in this text is unnecessary. Also, Figure 1, which visualizes the experimental process, is too packed with information and is difficult to read in its small font. In addition, many extraneous photographs of LC-MS instruments and other common equipment are included. Sample pretreatment should be described very briefly in the text, and only the differences from previous reports should be mentioned. If the authors wish to describe the details of the experiment to ensure reproducibility, it is recommended to describe them in the form of an experimental protocol and include it in the Supplementary Information.
- Materials and Methods 2.5: The 11 different paths the authors have set up for LC-MS/MS analysis are difficult to follow in text. They could be summarized in a table or visualized using a flowchart.
- Materials and Methods 2.6 to 2.9: It is recommended that only the essentials be described in the text and the minute details be moved to the Supplementary Information.
- Results 3.2 (p 26, line 11-20): The description should be moved to the introduction.
- Results 3.1.3-3.1.4: Too detailed and too long. Only the main points should be mentioned. It would be effective to use concise figures where possible.
- Figure 6: Too much information; A, B, F, and G should be supplemental information.
- Figure 8: The wheat cartoon is unnecessary. The font is too small. This information should be in a table.
