Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies

Abstract

High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences, but has high error rates, which make sequence assembly and analysis difficult as genome size and complexity increase. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have addressed eukaryotic organisms. Here, we present novel results using simulated and empirical ONT and DNA libraries to identify best practices for sequencing and assembly for several model species. We find that the unique error structure of ONT libraries causes errors to accumulate and assembly statistics to plateau as sequence depth increases. High-quality assembled eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.

This article is a preprint and has not been certified by peer review.

John M. Sutton, Joshua D. Millwood, A. Case McCormack, Janna L. Fierst

Department of Biological Sciences, The University of Alabama, Tuscaloosa, AL 35487-0344

For correspondence: janna.l.fierst@ua.edu

This work has been peer reviewed in GigaByte (https://doi.org/10.46471/gigabyte.27), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and are as follows:

**Reviewer 1. Zhao Chen**

The authors should clarify why only Canu and Flye were selected rather than other long-read assemblers such as Raven, Redbean, Shasta, and Miniasm. A rationale should be given for the choice of these two assemblers. The same applies to MaSuRCA, which appears to have been used for hybrid assembly. Unicycler also contains a commonly used hybrid assembly pipeline, so the authors should explain why MaSuRCA was selected for this study.

A flow chart including all bioinformatics tools should be provided to show more clearly how the entire study was carried out, including assembly, error correction, and post-assembly analysis.

More information about the quality of the long reads should be provided, such as Phred quality scores, percentage of reads with Q30 or above, and average read lengths. QUAST should suffice for these quality analyses.

Testing only simulated reads is not sufficient to support a solid conclusion, since simulated reads cannot be treated as equal to real reads and do not reflect the basecalling errors in real reads. Since real reads are readily available on NCBI, they should also be tested. The title does not mention that the study was based solely on simulated reads, and the stated objective was to optimize the bioinformatic pipeline for processing Oxford Nanopore long reads, so the experiments should include all conditions. Testing real reads could significantly improve the scientific quality of this study.

Line 12-13: This may not be true, since many studies have been published on how to assemble and error-correct Oxford Nanopore long reads to produce accurate genomes. The authors should describe why the present study is novel and what new findings are reported.

Recommendation: Major Revisions
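
Reviewer 1 asks for read-quality summaries such as Phred scores, the share of reads at or above a quality threshold, and average read length. As a rough illustration only (not the authors' or the reviewer's method), the sketch below computes these from FASTQ records in plain Python. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names `parse_fastq`, `mean_read_quality`, and `summarize_reads`, and the two toy reads, are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield seq, qual


def mean_read_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of one read, averaged over error probabilities.

    Averaging probabilities (not raw Phred values) keeps a few very poor
    bases from being hidden by many good ones. Assumes Phred+33 encoding.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def summarize_reads(records: Iterable[Tuple[str, str]],
                    q_threshold: float = 30.0) -> dict:
    """Read count, mean read length, and fraction of reads >= q_threshold."""
    lengths, quals = [], []
    for seq, qual in records:
        lengths.append(len(seq))
        quals.append(mean_read_quality(qual))
    n = len(lengths)
    if n == 0:
        raise ValueError("no reads found")
    return {
        "reads": n,
        "mean_length": sum(lengths) / n,
        "frac_at_or_above_q": sum(q >= q_threshold for q in quals) / n,
    }


# Toy input (hypothetical): 'I' encodes Q40 and '+' encodes Q10 in Phred+33.
fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "++++++"]
stats = summarize_reads(parse_fastq(fastq))
print(stats)
```

The threshold is a parameter because a cutoff suited to one sequencing platform may not suit another; for a real dataset, `parse_fastq` would be fed the lines of an opened FASTQ file rather than a list.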

**Reviewer 2. Shanlin Liu**

De novo genome assembly based on third-generation sequencing (ONT in the current work) has been widely applied to many organisms, including bacteria, plants and animals with various genome sizes, e.g. the two recently published lungfish genomes (genome size > 30 Gb) in Nature and Cell, and genomes of a broad range of species published in GigaScience, Scientific Data, Molecular Ecology Resources, and other journals. It is easy to find the analysis pipelines or datasets that were used to obtain high-quality genome assemblies in those published works. The authors generated multiple genome assemblies for four model species using different simulated datasets with varied sequencing depths and different assembly tools, and tried to provide useful guidance for those who are new to genome assembly. However, I am afraid that the current study contains some limitations in its results and conclusions that may mislead readers, and I recommend that the authors restructure the manuscript and address those issues before publication.

First, routine practice for genome assembly with long reads (either ONT or PacBio) includes a polishing step based on the long reads themselves, using tools such as Nanopolish, Medaka, and Racon. The authors skipped this step in all of their analyses and directly evaluated the assembly errors based on the outputs generated from different combinations of datasets and software. Whatever the results show, they have little practical value.

Second, the four model species included in the current work can hardly represent a broad range of organisms: all have a genome size < 200 Mb and a low level of repetitive elements (< 30%). Hence, the results offer scant guidance to those who work on organisms such as plants, fishes, insects, and mammals. For example, computing resources become the first hurdle for genome assembly when working with > 100X ONT reads for species with large genomes, even if the sequencing is affordable. Researchers would therefore generate less data or prefer assemblers such as WTDBG, NextDenovo, or Falcon to obtain their genome assembly.

In addition, the authors deem the Caenorhabditis species to have a highly heterozygous genome (0.7% according to their calculation), which is also open to question. The genomes of many insects and plants have a much higher level of heterozygosity.

Furthermore, the authors may want to pay attention to the news regarding the Sequel II sequencing platform recently released by PacBio Tech. As far as I know, it can provide inexpensive long-read sequencing thanks to its huge improvement in sequencing throughput. It also has a newly released library preparation kit that can work with low amounts of input DNA. If so, what is stated in the introduction section may be incorrect.

Besides the major issues mentioned above, there are some minor ones, listed as follows:

Line 89: The authors may want to provide common names for the model species to improve the readability of the manuscript.

Line 119: Genome references and ONT reads were derived from different individuals or strains, and the ONT reads for E. coli have very low coverage. I am not sure whether those factors will influence the quality of the simulation. The authors may add a caution to clarify these concerns.

Line 24: A combination of experimental techniques? It is better to specify which experimental techniques.

Line 128: Incorrect word format, and C. latens is missing.

Line 141: How is the best performance defined? As the most contiguous assembly?

Line 137: When you say "failed to produce an assembly", does it mean that the software failed to generate outputs, or that it produced unexpected assembly results?

Line 287: Supplement the BUSCO value of the reference TAIR10.

Line 287: What do you mean by "combined approach"? Do you mean the method that corrects reads using Canu and assembles them using Flye?

Lines 233-241: The "corrected" and "selected" datasets used in the Nematoda test were not applied to other organisms.

Line 241: Canu correction could truncate some low-quality reads or cut long reads into multiple pieces for suspected chimeric reads. I don't think you can conclude from the current results that read length influences assembly quality.

Line 242: Please rephrase this sentence and put Figure 5 and reference #36 in better positions to avoid misunderstanding.

Lines 337-341: These duplicate the content of lines 308-312, and the two passages conflict with each other.

Line 355: All the tested organisms have genome sizes < 200 Mb; please specify this limitation instead of referring to a broad range of organisms.

Line 368: The statement about low coverage may mislead readers; the authors cannot reach such a conclusion based on a single test.

Line 461: Which model was used: high accuracy or flip-flop?

Table 1: The header is too long; some of the content could be moved to table notes.

Recommendation: Major Revisions
