Long-read and chromosome-scale assembly of the hexaploid wheat genome achieves high resolution for research and breeding

Abstract

The sequencing of the wheat ( Triticum aestivum ) genome has been a methodological challenge for many years due to its large size (15.5 Gb), repeat content, and hexaploidy. Many initiatives aiming at obtaining a reference genome of cultivar Chinese Spring have been launched in the past years and it was achieved in 2018 as the result of a huge effort to combine short-read sequencing with many other resources. Reference-quality genome assemblies were then produced for other accessions but the rapid evolution of sequencing technologies offers opportunities to reach high-quality standards at lower cost. Here, we report on an optimized procedure based on long-reads produced on the ONT (Oxford Nanopore Technology) PromethION device to assemble the genome of the French bread wheat cultivar Renan. We provide the most contiguous and complete chromosome-scale assembly of a bread wheat genome to date. Coupled with an annotation based on RNA-Seq data, this resource will be valuable for the crop community and will facilitate the rapid selection of agronomically important traits. We also provide a framework to generate high-quality assemblies of complex genomes using ONT.

Article activity feed

  1. sequencing

    **Reviewer 3. Murukarthick Jayakodi **

    Aury et al. have assembled the French bread wheat cv. Renan using Oxford Nanopore long-read technology, optical maps and Hi-C. They achieved a decent N50 of 2.2 Mb and constructed pseudomolecules with a reference-guided approach. The assembly was corrected with the Hi-C map. They annotated ~84% of repeats and projected gene models from the previously assembled Chinese Spring reference genome. The assembly quality was validated with a standard approach. The Renan assembly showed good collinearity with existing short-read wheat assemblies and pinpointed some large (> 1 Mb) inversions. There is potential to catalogue structural variants, i.e. large INDELs. However, many false positives are expected when long- and short-read assemblies are compared. Nevertheless, they compared a complex tandem repeat region. They used appropriate tools for assembly and downstream analysis. This is an improved additional genome resource for the wheat community.

  2. The

    **Reviewer 2. Gabriel Keeble-Gagnere **

    The authors report on a new assembly of a French wheat variety, Renan, using Oxford Nanopore sequencing technology combined with short read polishing, Bionano optical maps and Hi-C to validate chromosome-level ordering after anchoring to IWGSC RefSeq v2.1. This is the first study I know of to use Oxford Nanopore to assemble a complete wheat genome, and the results demonstrate that this technology (together with short read polishing, Bionano, Hi-C, etc.) can be successfully applied to such a complex genome. Evidence is presented to support the quality of the assembly, but it is mostly at the global statistics level (e.g. contig N50, total size of gaps) or macro-scale (whole chromosome dotplots). One detailed comparison between Renan and Chinese Spring of a biologically important region is presented. The assembly is clearly of a high standard and is a valuable addition to the growing set of wheat varieties assembled to chromosome-scale. However, given the high quality of the IWGSC RefSeq v2.1 assembly (Zhu et al. (2021)), the claim that this assembly "achieves higher resolution for research and breeding" is quite strong and needs to be supported by more evidence. Given what is presented here, a more accurate statement might be "achieves higher contiguity and local completeness". The high contig N50 of 2.2 Mb is highlighted but I feel that more work is needed to demonstrate that the sequence is free of artefacts. The authors show in Figure 2 that this assembly has the lowest (though only slightly) complete BUSCO score out of the wheat genomes they compare with. Is it possible that some regions cause problems for the Oxford Nanopore technology and are either fragmented or completely absent from the assembly? Bionano maps were used but no evidence is presented to show the level of agreement between the assembled sequence and the Bionano maps, as is done in Zhu et al. (2021).
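To illustrate why a global statistic such as contig N50 cannot by itself show that an assembly is locally accurate, here is a minimal sketch of how N50 is computed. The contig lengths are invented toy values, not data from the Renan assembly, and the function name `n50` is purely illustrative.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in bp (hypothetical, for illustration only).
toy_contigs = [2_200_000, 1_500_000, 900_000, 400_000, 100_000]

# Same bases, but the two largest contigs have been wrongly joined.
toy_misjoined = [3_700_000, 900_000, 400_000, 100_000]

print(n50(toy_contigs))    # 1500000
print(n50(toy_misjoined))  # 3700000
```

In this toy case an erroneous join more than doubles the N50, which is why independent evidence such as optical map alignments is needed alongside contiguity statistics.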

    In summary I think there are two key things to address:

    1. More evidence supporting that the assembly is locally accurate, especially validation with alignment to Bionano maps;
    2. Some results presented to relate this assembly to the existing chromosome-scale assemblies of wheat genomes.

    To address these points, I think the following would greatly enhance the paper:

    a) Using any method (eg: the method in Brinton et al. (2020)), identify identical-by-state haplotypes between Renan and Chinese Spring and the chromosome-scale assemblies from Walkowiak et al. (2020). This analysis would essentially produce a table which would be valuable supplementary data. A figure similar to Figure 3 (b) from Walkowiak et al. (2020) for a single chromosome, showing the regions of the existing wheat genomes sharing haplotypes with Renan would help place this genome into context.

    b) This then defines large regions of the Renan assembly that can be directly compared at the base level to other assemblies. Select 2 or 3 examples to show how the Renan sequence compares to the equivalent region in other assemblies, and show the Bionano validation of Renan sequence together with presence of genes and gaps in each assembly being compared. Since the sequences being compared here should be the same (based on the previous step above), the genes from the Renan annotation can be mapped across and directly compared. This would provide direct evidence for the higher quality assembly being claimed. Figure 5 is a good comparison of a biologically important region, but it is unclear if the region in Chinese Spring and Renan is the same haplotype or not. This needs to be clarified at the start of this section. If the same, then the comparison is of two regions expected to be basically identical (and could be one of the examples used in the proposed comparison analysis above); if different, then that needs to frame the discussion since the region in Chinese Spring could theoretically contain different genes or more repeats, for example.

    Centromeres are not mentioned, though it is known to be a particularly difficult region in wheat genome assemblies. How do the centromeres look in this assembly and how do they compare to previous wheat assemblies? Do the Bionano maps validate the assembly in the centromere region? The analysis in point a) above would identify centromeres in common with other assemblies. Likewise, the distal ends of chromosome arms, including the telomere sequences, are known to cause problems for Hi-C ordering and orientation. Again, the Bionano alignments demonstrating correct ordering would be particularly valuable.

    Figure 2 should be a supplementary figure.

  3. Abstract

    This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac034), and the reviews have been published under the same license. These reviews were as follows.

    **Reviewer 1. Sean Walkowiak **

    First review: Comment 1: The authors could more clearly and accurately present and discuss sequencing and assembly approaches, including the advantages and limitations of the ONT assembly presented here

    While the standards of 'quality' for assemblies are evolving, there are standard sets of 'science-based' criteria for considering the quality of a genome, such as the 14 criteria listed in the manuscript here: https://www.nature.com/articles/s41586-021-03451-0#Tab1. Many of these criteria are ambitious, particularly for wheat due to its size and complexity, and many criteria are not met using previous assembly approaches, or the approaches used in this study. It is true that CS and 10+ Wheat Genomes do not use long reads; however, these assemblies are valuable and have been rigorously validated using 10X Genomics, Hi-C, and long read data. They also perform well for TE content, BUSCO (as outlined by Tables 1 and 2 and Fig 3 in this manuscript), and they were actually used in this MS as a reference for guiding the ONT assembly. I would also expect that they have a better base pair accuracy than the assembly presented here. I therefore suggest that the authors revise their statement "these assemblies have been produced using short-read technologies and are therefore not up to the quality standard of current genome assemblies". If the authors wish to discuss assembly quality, which I recommend they should, I suggest focusing on advantages and limitations of each technology and assembly approach in a constructive way, perhaps with a stronger focus on the value of the ONT resource developed here. In regards to base pair accuracy, ONT is at a disadvantage to short reads or to PacBio. This is particularly true in the context of HiFi reads, which have increased accuracy over ONT and Illumina and have greater lengths than Illumina, but PacBio and HiFi are not discussed. This is not to say that PacBio is superior in every way, the reads from ONT are longer and these hold a significant value. 
    As an example of differences between PacBio and ONT that might provide useful context to describe the differences between ONT and PacBio approaches, please see: https://pubmed.ncbi.nlm.nih.gov/33319909/, for differences between short read (TriTex) and PacBio, please see https://www.nature.com/articles/s41586-020-2947-8 . All of these approaches are valuable but have both advantages and limitations, with ONT also having many clear advantages and disadvantages. But these need to be clearly communicated and supported, either through the results of this study or through the literature. For example, in the discussion, the authors state that "ONT devices HAVE a real advantage over other long-read technologies". There is only one other long read sequencing technology, so if you are saying that ONT HAS a 'real advantage' over PacBio based on read length, this is valid, but can be stated more explicitly and with examples of the read lengths from this study and the literature. It is then stated that the "error rate is drastically reduced for nanopore"; again, this is valuable and a great advancement in regards to ONT, but it would be wise not to dismiss that this error rate is still higher than PacBio HiFi, which again can be stated explicitly with support from the literature. While both of these concepts are important, after they are stated, they are not actually discussed or framed to highlight the work from this study. The true advantage of ONT, even over PacBio HiFi, is that the long reads can resolve more complex regions that span greater distances, which are abundant in wheat (see reference from above). The authors are presenting an exciting and valuable resource with this genome assembly and this assembly has advantages due to the application of ONT, for the reasons mentioned above regarding long complex regions, but these are not fully highlighted and the authors do not take full advantage of what this assembly has to offer.
    I think the authors should provide additional context and support related to the value and drawbacks of their ONT assembly. The advantages are discussed superficially at the gene level through a couple of examples (Fig 5), though none of these examples are supported with any significant biological data or validation analysis. There are many interesting features of genomes that are captured by ONT that are not captured well by short reads or PacBio, and it is unfortunate that these are not explored in any significant depth in the manuscript.

    Comment 2: Some of the 'highlighted features' in the manuscript could be better selected/executed

    This comment relates to the previous comment on having little detail on what the ONT genome is uniquely capable of providing over other approaches. Instead, the authors focus on some anomalies in the D genome as well as differences in the nanopore software for base calling. It is unclear to me what the objective is of the report on the D genome. I suspect that this may be due to differences in repeat content between D and the other subgenomes, or an artifact of the tools and analyses used. Page 6, Figures S1 and S2, may be a consequence of poor read filtering for reads that align ambiguously - i.e., perhaps reads from A and B may crossmap at a greater likelihood than those from D due to differences/similarities in repeat content between subgenomes. Once reads are aligned, the alignments should be properly filtered using standard 'best practices for NGS' - I do not see that any filtering or analysis of cross mapping was performed, but I may have missed it. Once the alignments are filtered, read coverage dips and peaks can then be assessed statistically using tools such as CNVnator and cn.mops, which are designed specifically for comparative read depth analysis since depth may not be normally distributed, rather than arbitrarily looking at 2 times the median. There may be differences between genes and intergenic regions in terms of mapping accuracy, so it may be ideal to interrogate read depth for those separately. The increased number of gaps is also interesting and I wonder if this could be due to the read accuracy of ONT and read mapping and assembly biases when having similar subgenomes.
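The order of operations suggested above (filter ambiguous alignments first, then assess depth) can be sketched as follows. This is not CNVnator or cn.mops; it is a much simpler stand-in that uses a robust z-score (median and median absolute deviation) so that non-normal depth distributions are handled less arbitrarily than with a fixed multiple of the median. The MAPQ cutoff of 30, the z-score cutoff of 3.5, the record layout and all toy values are assumptions made for illustration.

```python
from statistics import median


def filter_by_mapq(alignments, min_mapq=30):
    """Keep alignments whose mapping quality is at least min_mapq (assumed cutoff)."""
    return [a for a in alignments if a["mapq"] >= min_mapq]


def window_depth(alignments, window_size, n_windows):
    """Count alignments per fixed-size window, keyed on alignment start."""
    depth = [0] * n_windows
    for a in alignments:
        idx = a["start"] // window_size
        if 0 <= idx < n_windows:
            depth[idx] += 1
    return depth


def flag_outliers(depth, z_cutoff=3.5):
    """Return indices of windows whose robust z-score exceeds z_cutoff."""
    med = median(depth)
    mad = median([abs(d - med) for d in depth])
    if mad == 0:
        return [i for i, d in enumerate(depth) if d != med]
    return [i for i, d in enumerate(depth)
            if abs(0.6745 * (d - med) / mad) > z_cutoff]


# Toy data: six 1 kb windows; window 4 carries a genuine depth peak,
# window 2 is inflated only by ambiguously mapped (MAPQ 0) reads.
toy = []
for idx, n in enumerate([10, 11, 9, 10, 30, 10]):
    toy += [{"start": idx * 1000 + 5, "mapq": 60}] * n
toy += [{"start": 2005, "mapq": 0}] * 15

raw_depth = window_depth(toy, 1000, 6)
kept_depth = window_depth(filter_by_mapq(toy), 1000, 6)

print(flag_outliers(raw_depth))   # [2, 4]
print(flag_outliers(kept_depth))  # [4]
```

In the toy data, the apparent peak in window 2 disappears once ambiguous alignments are removed, whereas the peak in window 4 survives filtering.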

    Nevertheless, the results and discussion on the D genome are interesting but distracting and likely reflect that the authors should take more time to explore their data and its biases before presenting this information. In summary, I believe that additional work is needed to bring value to the read depth and D genome analysis should the authors choose to include this in the manuscript. While I agree that it would be useful to communicate that a significant gain was observed when basecalling using the more accurate basecaller, the emphasis on this is disproportionate to its value in the manuscript. The observation of a better assembly when using reads from a more advanced basecaller is not something new. As for the error rate of the ONT between organisms (yeast and wheat), with a sample size of 2, I do not think that this is worth presenting or discussing in any detail. While this may just be an artifact of the DNA quality itself from two experiments, I suspect that this may be a valid result from the manuscript and due to sequencing repeats, which are more abundant in wheat, in combination with how these basecallers self train to be more accurate. While this is certainly valid, it is not novel or interesting. This result comparing species was not tested with sufficient scientific rigor/evidence, it distracts from the central result of the manuscript, and just reaffirms something that we already know about the basecalling software and challenges of sequencing homopolymers and the importance of getting accurate reads using the more advanced basecalling methods.

    Comment 3: Why Renan? This comment relates to the other two comments on the selected areas of focus. The biological story, which was on gliadins, was of some value and highlighted some of the advantages of an ONT assembly, but this was not supported by any significant biological data. Renan is a well-known cultivar with abundant genomic data, mapping populations, trait data for diseases, etc. It is unfortunate that the authors could not use the genome to dig deeper to more thoroughly demonstrate the value of this assembly specifically in the context of ONT and genomics of wheat or the biology of wheat and Renan, specifically. With abundant QTL data available specifically for Renan, these could have easily been anchored to the assembly to highlight novel transcripts from the RNAseq from this study, just as an example. Even the comparisons of the Renan assembly to other available assemblies were mostly superficial and did not highlight in significant detail the value of having an ONT assembly or the value of having data specifically for Renan. While a detailed 'biological story' may be beyond the scope of this manuscript, there was minimal effort to highlight the value of the assembly, and this comment is more of a larger reflection that more could have been done to highlight the value of the genome to support the authors' vague claims that the genome "will benefit the wheat community and help breeding programs".

    Minor Comments: The absence of numbered lines made it difficult to provide more detailed feedback, but there are minor items throughout, so I suggest numbering the lines and also giving the manuscript a thorough review. I appreciate that the authors present and suggest methods for future assembly of complex genomes using ONT, but, contrary to the abstract's statement that 'we also provide the methodological standards to generate high-quality assemblies of complex genomes', I would argue that the standards used for ONT assembly are known and are not established here. I would also suggest caution when stating that the methods here should be considered the 'standard' for the reasons indicated in Comment 1 regarding other approaches used to assemble complex genomes, such as PacBio/HiFi, and the lack of a proper investigation/discussion/comparison of assembly quality.

    Page 2: last line - what is the abbreviation ca. ? Table 1: BUSCO is presented twice with different values. Table 1 and 2 use different versions of RefSeq, I would stick to one version. It is unclear to me what trend or result the authors are trying to present in Figure 1, which I would say is common for circos plots. Presenting data 'for the sake of presenting it' is not terribly valuable and I would encourage the authors to use the figures to present a trend or result that is impactful. In addition, the data that is presented is not presented clearly, and is cryptic. The roman numerals in the figure caption for Figure 1 are not actually in the figure. The caption also indicates that the dots indicate lower and higher values, but not of what - perhaps density of gaps? The color scales are not presented for each track. Two of the color scale palettes also look similar.

    Page 6: 62% of exons were identical, which means 48% had SNPs, so the authors argue that SNPs are therefore rare at 48% of exons? I do not think that 48% of exons having SNPs is rare; I think that this would mean that nearly half of exons have SNPs, so this is therefore common. Perhaps this statistic is misleading and the focus should instead be on the 0.7% divergence. How does this value compare with other within-species comparisons of gene content and could this be an artifact of ONT accuracy? This question relates to a general comment that the authors could do better at bringing relevant comparisons or parallels in from the literature throughout the manuscript to bring value to any findings or insights they are presenting, particularly in the context of other ONT assemblies.

    Page 7, capitalize the T for technology, it is part of the name of the company and is a proper noun. This is repeated elsewhere.

    Page 7: 'on wheat'? This statement could be written more clearly. The way that the text is worded, it sounds like the basis for selecting the SmartDenovo assembly was the number of unknown bases, when I suspect it was actually a multitude of factors (BUSCO, gene or TE content, assembly stats, etc). While I do not question the selection of the assembly, I do suggest a clearer presentation of the information. I appreciate that the authors presented the data from multiple assemblers; one of the concerns with ONT is that the read accuracy is low and this may lead to issues in assembly of complex polyploids with similar subgenomes. I suspect that based on this study, it is clear that this is a valid concern for some assemblers, but may have been overcome in others, though none of this is explored or discussed. Again, is there any information in the literature contrasting assemblers that could provide insights into what you observed?

    Searches at 90% identity and coverage for genes and TEs are not strict, especially with genomes that have highly identical subgenomes. If you reduce your thresholds enough, all features will map to your genome...
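A small sketch of the point about thresholds: with highly similar subgenomes, a 90% identity and 90% coverage cutoff can accept hits on all three homoeologous copies of a gene, whereas a stricter cutoff keeps only the true locus. The gene name, chromosome labels, identity values and the stricter 99% cutoff are invented for illustration and are not taken from the manuscript.

```python
def passes(hit, min_identity, min_coverage):
    """True if a hit meets both the percent identity and percent coverage cutoffs."""
    coverage = 100.0 * hit["aligned_len"] / hit["query_len"]
    return hit["identity"] >= min_identity and coverage >= min_coverage


def accepted_targets(hits, min_identity, min_coverage):
    """Map each query to the list of targets whose hits pass the cutoffs."""
    kept = {}
    for hit in hits:
        if passes(hit, min_identity, min_coverage):
            kept.setdefault(hit["query"], []).append(hit["target"])
    return kept


# Hypothetical hits of one gene against three homoeologous chromosomes.
toy_hits = [
    {"query": "geneX", "target": "chr1A", "identity": 100.0,
     "aligned_len": 1500, "query_len": 1500},
    {"query": "geneX", "target": "chr1B", "identity": 96.0,
     "aligned_len": 1470, "query_len": 1500},
    {"query": "geneX", "target": "chr1D", "identity": 95.0,
     "aligned_len": 1455, "query_len": 1500},
]

print(accepted_targets(toy_hits, 90.0, 90.0))  # {'geneX': ['chr1A', 'chr1B', 'chr1D']}
print(accepted_targets(toy_hits, 99.0, 99.0))  # {'geneX': ['chr1A']}
```

At the looser cutoff the same gene is counted as present at three loci, so a high recovery rate at 90% says little about whether the correct copy was assembled.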

    The choice of language is often subjective or not representative of the results. For example, the 'extremely' similar TE content between Renan and CS. Why not say it is similar and actually report a value or a % difference? This would be more concise and informative than using vague and overzealous language. Page 8, short reads (dash or no dash?). The font sizes in Figure 2 are very small.

    The RNAseq is not really presented at all, except in the Materials and Methods. I thought the genes were ab initio predicted until I saw RNAseq in the materials and methods. I suggest at least making a note of RNAseq into the results and/or discussion since this additional effort does bring added value to the annotations and the manuscript. The discussion says de novo annotations, but I suggest explicitly stating that RNAseq was performed.

    Figure 3 C and D do not have horizontal axis labels, the top should be labelled as subgenome, bottom as chromosome, and the vertical axis (not the top) should be labelled as number of gaps and chromosome length. Same comment for labelling of vertical axis for panels A and B, horizontal axis should be labelled as genome assemblies, which are reflected in the palette/legend. Note that many of the colours in this palette are similar and difficult to differentiate, it may actually take less space to label the bars with each wheat line to make it less cryptic.

    How were the dotplots in Figure 4 generated? Perhaps I missed it in the materials and methods. Also, none of the axes have labels or units, etc.

    Much of the text in Figure 5 is too small and illegible.

    Page 10: The discussion is superficial and vague and should provide an accurate and pragmatic discussion of the results in the context of the literature. For example, the manuscript boasts a 'higher resolution'... but of what? Perhaps 'complex repetitive regions'? To reiterate my previous comment on the lack of literature support throughout the manuscript - were these 'higher resolutions' comparable to what was observed in the literature when ONT was applied to other systems? Again, these advantages of ONT and the assembly could be more thoroughly

    Re-review:

    The revised manuscript addresses the major concerns/comments. The assembly and its report are an exciting new resource for the wheat community. I only have one very minor comment below:

    When writing variety names in text and figures, it is important to be exact because there are many varieties with similar names internationally. CDC Stanley, not "Stanley"; CDC Landmark, not "Landmark"; "LongReach Lancer", not "Lancer", not "LongRead Lancer" - typo on line 308. I suggest performing a thorough check throughout.