Comparison of long-read methods for sequencing and assembly of a plant genome

This article has been reviewed by the following groups

Abstract

Background

Sequencing technologies have advanced to the point where it is possible to generate high-accuracy, haplotype-resolved, chromosome-scale assemblies. Several long-read sequencing technologies are available, and a growing number of algorithms have been developed to assemble the reads generated by those technologies. When starting a new genome project, it is therefore challenging to select the most cost-effective sequencing technology, as well as the most appropriate software for assembly and polishing. It is thus important to benchmark different approaches applied to the same sample.

Results

Here, we report a comparison of 3 long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. We have generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies for the same sample. Several assemblers were benchmarked in the assembly of Pacific Biosciences and Nanopore reads. Results obtained from combining long-read technologies or short-read and long-read technologies are also presented. The assemblies were compared for contiguity, base accuracy, and completeness, as well as sequencing costs and DNA material requirements.

Conclusions

The 3 long-read technologies produced highly contiguous and complete genome assemblies of M. jansenii. At the time of sequencing, the cost associated with each method was significantly different, but continuous improvements in technologies have resulted in greater accuracy, increased throughput, and reduced costs. We propose updating this comparison regularly with reports on significant iterations of the sequencing technologies.

Article activity feed

  1. doi: 10.1093/gigascience/giaa146

    **Reviewer 2. Mile Šikić** Reviewer Comments to Author: In their paper, Murigneux et al. compare three long-read sequencing technologies applied to the de novo assembly of a plant genome, Macadamia jansenii. They generated sequencing data using Pacific Biosciences (Sequel I), Oxford Nanopore Technologies (PromethION), and BGI (single-tube Long Fragment Read) technologies. The sequenced data are assembled using several state-of-the-art long-read assemblers and the hybrid MaSuRCA assembler. Although the paper is easy to follow, and this kind of analysis is more than welcome, I have several major and minor concerns.

    Major concerns:

    1. The authors use 780 Mbp as the estimated size of the genome, yet this is not supported by the data. In the section "Genome size estimation", they present genome size estimates based on k-mer counting, but these are 650 Mbp or less.
    2. Since the real size of the genome is unknown, it would be worthwhile if the authors provided analyses such as those enabled by KAT (Mapleson et al., 2017), which compares the k-mer spectrum of the assembly to the k-mer spectrum of the reads (preferably Illumina). To control for misassembled contigs, the authors might also align the larger contigs obtained with different tools to compare their similarity (e.g., using tools such as Gepard or similar).
    3. The authors compare assemblies with an "Illumina assembly", but it is not clear what that means or why they consider it a valid comparison.
    4. Although they started the ONT data analysis with four tools, they perform further analysis on just two (Flye and Canu). In addition, for the PacBio data, they use three tools (Redbean, Flye, and Canu). It is not clear why the authors chose these tools. Canu and Flye have larger N50s, larger total lengths, and the longest contigs. However, this does not take into account possible misassemblies. Assemblers can have problems with uncollapsed haplotypes, which can result in assemblies larger than expected. In a recent manuscript, Guiglielmoni et al. (https://doi.org/10.1101/2020.03.16.993428) showed that Canu is prone to uncollapsed haplotypes. In this manuscript, too, Canu produces a much longer assembly from the PacBio data than the other tools (1.2 Gbp). A larger total assembly size therefore does not guarantee a better genome. Furthermore, on the ONT data Raven has the second-best initial BUSCO score (before polishing), and its assembled genome consists of the fewest contigs. I therefore deem that the full analysis needs to be performed using all tools for both Nanopore and PacBio data.
    5. It would be of interest to a broad community if the authors added the computational costs to the total cost per genome for each sequencing technology. They might compare their machines with AWS or other specified cloud configurations. Besides, it is not clear which types of machines they used; information from the supplementary materials such as GPU, large memory, or HPC is not descriptive enough.

    Minor comments:
    6. The authors use the published reference genome of Macadamia integrifolia v2 for comparison. It would be interesting if they could provide information about the sequencing technology used for that assembly.
    7. The authors mention the newer generation of PacBio sequencing technology (Sequel II), which provides higher accuracy and lower costs. It would also be worth mentioning the newer generations of assembly tools, such as Canu 2.0, Raven v1.1.5, or Flye v2.7.1. It is worth considering Racon for polishing with Illumina reads too. Yet this is not a requirement, because the authors already use state-of-the-art tools.
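The genome-size question raised in point 1 hinges on standard k-mer arithmetic: genome size is estimated as the total number of error-filtered k-mers divided by the depth of the homozygous coverage peak. A minimal sketch of that calculation, using a hypothetical toy histogram (the numbers are illustrative, not taken from the paper):

```python
# Estimate genome size from a k-mer depth histogram (e.g. jellyfish "histo" output).
# Each entry is (depth, number of distinct k-mers observed at that depth).
# Genome size ~ (total k-mers above the error tail) / homozygous peak depth.

def estimate_genome_size(histogram, error_cutoff):
    """histogram: list of (depth, count) pairs; depths below error_cutoff
    are treated as sequencing errors and excluded."""
    usable = [(d, c) for d, c in histogram if d >= error_cutoff]
    total_kmers = sum(d * c for d, c in usable)
    peak_depth = max(usable, key=lambda dc: dc[1])[0]  # modal (homozygous) depth
    return total_kmers // peak_depth

# Illustrative toy histogram: error tail at depths 1-2, coverage peak at depth 40.
toy = [(1, 5_000_000), (2, 1_000_000), (39, 800_000), (40, 15_000_000), (41, 700_000)]
print(estimate_genome_size(toy, error_cutoff=5))  # → 16497500
```

In practice, tools fit a model to the histogram rather than taking the raw mode, which is one reason estimates from different k values and tools can disagree, as they do between the 650 Mbp estimates and the 780 Mbp figure discussed above.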
  2. Now published in GigaScience

    **Reviewer 1. Cecile Monat** Reviewer Comments to Author:

    Introduction part:

    • It would be nice to give the genome size and to indicate the reference genome that has already been sequenced and assembled for Macadamia, to provide context for readers who are not familiar with Macadamia.

    Methods part:
    • ONT library preparation and sequencing part:
      • What was the reason to use both MinION and PromethION and not only PromethION?
      • For what reason didn't you use the same version of MinKNOW for the MinION (MinKNOW v1.15.4) and PromethION (MinKNOW v3.1.23) data?
    • Assembly of genomes part:
      • Is there a reason for doing 4 iterations of Racon, and not 3 or 5?
      • Maybe you should specify that Racon is used as an error-correction module and Medaka to create the consensus sequence.
      • "Hybrid assembly was generated with MaSuRCA v3.3.3 (MaSuRCA, RRID:SCR_010691) [32] using the Illumina and the ONT or PacBio reads and using Flye v2.5 to perform the final assembly of corrected mega-reads" — this sentence is not very clear to me. Does it mean that you first used the ONT/PacBio data plus the Illumina data with the MaSuRCA software to generate what they call "super-reads", and then from these used Flye to get the final assemblies?
      • As I understand it, stLFR is similar to 10x Genomics; why not compare with that technology's data too?
    • Assembly comparison part:
      • "We compared the assemblies with the published reference genome of Macadamia integrifolia v2 (Genbank accession: GCA_900631585.1)." First, I think it is important to add the reference paper. Secondly, I cannot see where you compared your assemblies with the published one. As far as I can tell, you compared all your assemblies with each other, but I cannot find any other assembly.
      • When you say "Illumina assembly", do you refer to the Macadamia integrifolia assembly? If so, please clarify it in the rest of the paper, and add the data for this reference genome to your figures.

    Results part:
    • ONT genome assembly part:
      • Is there any interest in combining the MinION and PromethION data? Are there any advantages to combining them?
      • "The genome completeness was slightly better after two iterations of NextPolish (95.5%) than after two iterations of Pilon (95.2%) (Sup Table 1)." Here I would specify that this is the case for the Flye assembly, but surprisingly (at least for me), after two iterations of NextPolish on the Canu assembly the results were slightly worse than with one iteration. So, depending on the assembler used, the number of iterations needed might differ.
      • "As an estimation of the base accuracy, we computed the number of mismatches and indels as compared to the Illumina assembly." Here I am not sure which assembly you refer to with the term "Illumina assembly". Do you refer to the Macadamia integrifolia assembly or to the MaSuRCA hybrid assembly? If you refer to the latter, I would suggest using the term "hybrid assembly" instead of "Illumina assembly", as it might be confusing.
      • Why not use the Pilon and NextPolish steps on the ONT+Illumina (MaSuRCA) assembly, since they are tools dedicated to polishing with long and short reads?
    • PacBio genome assembly part:
      • Why did you use FALCON as the assembler for PacBio but not for ONT? If I am correct, it is not built uniquely for PacBio data but works for all long-read technologies.
      • "Two subsets of reads corresponding to 4 SMRT cells and equivalent to a 43× and 39× coverage were assembled using Flye." Why choose Flye for this analysis? I am also wondering whether this part is necessary, since afterwards you do the ONT equivalent-coverage analysis, which is more interesting for the comparison of the technologies.
      • Comment on the structure: for this paragraph, I would prefer to have first the results with the same assemblers as with the ONT data, then an explanation of why you chose to also perform a test with FALCON, and then the FALCON results.
    • stLFR genome assembly part:
      • Supernova might have been used on the PacBio data as well; why not?
      • Why not try to complement the PacBio data with stLFR as you did with ONT? Are there any incompatibilities?

    Discussion part:
    • "The amount of sequencing data produced by each platform corresponds to approximately 84× (PacBio Sequel), 32× (ONT) and 96× (BGI stLFR) coverage of the macadamia genome" I would have put this information into the Results part, but it's only my preference.
    • "For both ONT and PacBio data, the highest assembly contiguity was obtained with a long-read only assembler as compared to an hybrid assembler incorporating both the short and long reads." I would suggest using the term "long-read polished" instead of "long-read only", since the assembly with the best contiguity integrates the Illumina data for the polishing.

    Tables and figures:
    • Table 2:
      • For this table, if I understood properly, you have chosen the best assembly for each technology. If so, please specify this in the table caption.
    • Figure 1:
      • If I understood properly, when you write "Base accuracy of assemblies as compared to Illumina assembly" you refer to the Macadamia integrifolia assembly; if so, I would add the Macadamia integrifolia assembly to this figure, and maybe put a dotted line at its level for each category (indels and mismatches), so it is easier for the reader to compare with it.
    • Figure 2:
      • Here I would include all the assemblies you had in Figure 1.
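Both reviewers question what the mismatch and indel counts are measured against. Whatever reference is used, such counts are conventionally summarized as a Phred-scaled consensus quality value (QV). A minimal sketch of that conversion (the error count and assembly length below are illustrative, not taken from the paper):

```python
import math

def phred_qv(num_errors, assembly_length):
    """Phred-scaled consensus quality: QV = -10 * log10(error rate)."""
    if num_errors == 0:
        return float("inf")  # no observed errors: quality is unbounded
    return -10 * math.log10(num_errors / assembly_length)

# Illustrative: 7,800 mismatches + indels over a 780 Mbp assembly
# is an error rate of 1e-5, i.e. roughly one error per 100 kbp.
qv = phred_qv(7_800, 780_000_000)
print(round(qv))  # → 50
```

Reporting a QV alongside the raw mismatch and indel counts would make the base-accuracy comparison in Figure 1 independent of assembly length.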