Structural variation discovery in wheat using PacBio high-fidelity sequencing

This article has been Reviewed by the following groups


Abstract

Background

Structural variations (SVs) pervade plant genomes and contribute substantially to the phenotypic diversity. However, most SVs were ineffectively assayed because of their complex nature and the limitations of early genomic technologies. The recent advance in third-generation sequencing, particularly the PacBio high-fidelity (HiFi) sequencing technology, produces highly accurate long-reads and offers an unprecedented opportunity to characterize SVs’ structure and functionality. As HiFi sequencing is relatively new to population genomics, it is imperative to evaluate and optimize HiFi sequencing based SV detection before applying the technology at scale.

Results

We sequenced wheat genomes using HiFi reads, followed by a comprehensive evaluation of mainstream long-read aligners and SV callers in SV detection. The results showed that the accuracy of deletion discovery is markedly influenced by callers, which account for 87.73% of the variance, while both aligners (38.25%) and callers (49.32%) contributed substantially to the accuracy variance for insertions. Among the aligners, Winnowmap2 and NGMLR excelled in detecting deletions and insertions, respectively. For SV callers, SVIM achieved the best performance. We demonstrated that combining the aligners and callers mentioned above is optimal for SV detection. Furthermore, we evaluated the effect of sequencing depth on the accuracy of SV detection, showing that low-coverage HiFi sequencing is sufficiently robust for high-quality SV discovery.

Conclusions

This study thoroughly evaluated SV discovery approaches using HiFi reads, establishing optimal workflows to investigate structural variations in the wheat genome. The notable accuracy of SV discovery from low-coverage HiFi sequencing indicates that skim HiFi sequencing is effective and preferable to characterize SVs at the population level. This study will help advance SV discovery and decipher the biological functions of SVs in wheat and many other plants.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/10783001.

    Overall, this is an excellent paper on structural variant (SV) detection in polyploid wheat and Aegilops tauschii from long-read PacBio HiFi sequencing. It provides detailed comparisons of mainstream tools and an evaluation of coverage requirements, both of which appear to fill extant gaps in the literature. The structure and direction of the paper are rational, and we found it relatively straightforward to grasp the main points being conveyed. These multiple positives suggest that the paper will become a valuable reference for those working on SVs in cereals, and potentially for plant scientists more broadly.

    Upon reading and commenting on this preprint we have several suggestions which we hope will be of value to the authors. These comments are intended to be constructive, and we hope that our review contributes to an improved version of an already very strong manuscript.

    1. The paper presents some excellent findings and will be a valuable contribution to the field of SV exploration. However, in places the use of overstatement and subjective language could undermine the impact of the work. A more nuanced reflection of the paper's findings would help greatly. Some specific areas are highlighted below:

      • The statement (L35-36) that the results have "demonstrated that combining the aligners and callers mentioned above is optimal for SV detection" is perhaps too broad.

        • The authors have demonstrated clearly that Winnowmap2/SVIM and NGMLR/SVIM are optimal for SV detection for this particular dataset and the set of mappers and callers selected.

        • However, given that no methodology is provided to explain how the mappers and callers were chosen, it seems plausible that ACCs comprised of other mappers and/or callers could outperform those described here. For a similar reason, the claim (L29) to have performed a "comprehensive evaluation of mainstream long-read aligners and SV callers" is also somewhat overstated.

        • The statement is also sequencing-technology agnostic, but the work only compares ACCs' performance for PacBio HiFi reads, not for Oxford Nanopore. It may be that alternative ACCs are more accurate for ONT reads.

        • A minor point is that technically the authors have not shown that their highest performing single ACCs are better than all possible ensembles. Given that there are 1,048,555 possible ensembles when choosing 2-20 ACCs from a set of 20, only a tiny fraction (0.15%) of the combinatoric space has been explored. Testing 1,600 ensembles is incredibly impressive, but the claims of finding the optimal solution are perhaps overstated.
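
For transparency, the combinatoric count quoted above can be reproduced directly (a quick back-of-the-envelope check, not taken from the manuscript):

```python
from math import comb

# Number of distinct ensembles when choosing between 2 and 20 ACCs from a pool of 20
n_ensembles = sum(comb(20, k) for k in range(2, 21))  # equals 2**20 - 21 = 1,048,555
tested = 1_600
fraction_explored = tested / n_ensembles  # roughly 0.15% of the space
```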

        • The statement could also be caveated "…for SV detection in wheat" as other plant groups are not within the scope of this work.

      • Similarly, the statement (L41) regarding identifying "optimal workflows to investigate structural variations in the wheat genome" could be conditioned.

        • Given that a single accession has been tested for each species of wheat or wheat relative, this claim is quite broad. It may well be the case that different ACCs are optimal for different subspecies or accessions.

        • One way to advance this could be to separately evaluate which ACCs are best for each species rather than pooling all results together into a single F-score. Effectively, the different species could be used as replications. If the same ACC is best for each species independently, this would go some way to supporting the conclusion.

        • This additional analysis could also be quite complementary to the suggestion by the authors that "adopting a tailored approach, which employs the most efficacious methods for detecting each SV type separately, could be a good practice to enhance the accuracy of SV discovery in genomic studies." That is, if different SV types have different 'best' ACCs, why not also see if different species or varieties have different 'best' ACCs?

      • In L201-203 it is stated that "we consider that the optimal individual method is efficient, given that ensemble approach presents a clear precision-recall trade-off".

        • This does not necessarily follow, as the authors make the same statement in the paragraph immediately above (L193-194).

      • Given the above points, other uses of hyperbolic language could be altered.

        • L36, L41, L89, L202 "optimal"

        • L29, L87, L100 "comprehensive"

        • L233 "exhaustive"

      • There are also a few places where subjective language is used and more neutral terms could better allow the reader to form their own conclusions.

        • L76 "strikes the perfect balance"

        • L230 "pressing necessity"

        • L271 "appealing solution"

        • L277 "ideal approach"

    2. ABSTRACT: The aligner/caller ensemble work is not discussed in the abstract – this was a major endeavour and a very valuable result so it could be worth including!

    3. BACKGROUND: It would be useful to lay out a very rough pipeline for identifying SVs from long reads. This could even be as basic as explaining what mappers and SV callers actually do and noting that mapping comes before SV calling. This would be useful for people new to analysing sequencing data and could fit nicely between L77 and L78.

    4. BACKGROUND & RESULTS: The background currently only introduces bread wheat, describing it as hexaploid. The existence of tetraploid durum wheat, diploid goat grass, and the concept of the A, B, and D genomes is only brought up in the Results section, which we believe could be confusing for readers not familiar with the evolutionary history of wheat. One alternative could be to introduce the three species in the Background section to facilitate a smoother Results section. L111-118 could be tweaked slightly and moved into the Background section.

    5. RESULTS: We felt that the figure legends are lacking in detail throughout the paper.

    6. RESULTS: Post-verification, the 'truth set' comprised 1.0M deletions and 1.5M insertions, but it is not stated how many were initially discovered. This gives the reader no sense of how good the initial SV discovery by PacBio HiFi actually was.

    7. RESULTS & METHODS: It is not disclosed in the main paper that the PCR-based validation was only conducted on the Aegilops tauschii (DD genome) SVs. While this information can be found in Supplementary Table 4, we felt that this should be made clear in the Results and Methods sections.

      • We also noted that 7/10 of the insertions validated fall on chromosome 11 and wondered if there was a particular reason for this?

      • Additionally, in Supplementary Table 4, the authors could also provide the IWGSC Chinese Spring chromosome name and coordinates to help readers.

    8. RESULTS: L153 states that 91.76% of deletions and 81.71% of insertions in the truth set were confirmed by the assembly-based method. However, it would be great to also include some description of how many new SVs discovered by the assembly approach are not in the truth set.

      • These statistics can be seen in Fig. 3c and Fig. 3f but aren't discussed. For insertions the assembly method produced > 6-fold the number of SVs.

      • It would be helpful to mention this discrepancy in the Results and then examine why it might have occurred in the Discussion.

      • Also, given this, it might be worth saying that the recall statistics for ACCs or ensembles are only in relation to the specific truth set derived from HiFi + NGS reads.

    9. RESULTS: The assembly-based method is described as an 'independent approach' for truth set validation. However, the assembly and the truth set were constructed using the same PacBio reads. It is therefore not clear that this is a truly independent method.

    10. RESULTS: The formulae for precision, recall, and F-score are given in the Methods section, but it would be helpful to also include an explanation of these metrics in the main Results section to increase reading accessibility for non-informaticians/non-statisticians. This could be placed just before the section that compares the various ACCs. Perhaps some simple examples could be provided too? For example, "If a given method detects 10 SVs and 8 of these are true positives, the method's precision is 0.8". Explaining why the F-score is a good overall metric of an ACC's performance would be helpful too.
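
To illustrate what such an explanation might look like, here is a minimal worked example (counts are hypothetical, chosen only for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """precision = TP/(TP+FP), recall = TP/(TP+FN);
    the F-score is the harmonic mean of precision and recall,
    so it is only high when both are high."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A method calls 10 SVs, 8 of which are true positives (precision 0.8),
# while 4 truth-set SVs are missed entirely (recall 8/12).
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
```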

    11. RESULTS and DISCUSSION: ACC comparison is all done relatively. What are the absolute differences in performance between ACCs? Are they all more or less acceptable or are some combinations really bad? This would give the reader a sense of how robust SV detection from HiFi reads is to the choice of mappers and callers. Referencing Supplementary Figure 6 or even including it in one of the main figures would really help with this point.

    12. RESULTS: In the low-coverage HiFi analysis represented in Fig. 5, it is unusual that the recall for the SV caller SVDSS steadily decreases at higher coverage depths. This could be addressed in the Results and Discussion.

    13. RESULTS: The low-coverage HiFi section would be more robust if it was also tested on the AABB and AABBDD genomes for a few coverage levels and ACCs (i.e. no need for it to be as comprehensive as the impressive testing on the DD genome).

    14. DISCUSSION: It is stated on L243 that it is inherently more difficult to identify insertions than deletions. It would be excellent if the authors could provide more detail and expert opinion on this matter as it is not apparent to the reader why this is the case. Also, it might be worth mentioning again the large number of insertions found in the assembly-based SV set that are absent from the truth set.

    15. DISCUSSION: This section could be elevated by further discussion of the relative pros and cons of using PacBio HiFi for SV discovery. For example, if the authors trusted moderate-depth NGS to generate the truth set, why not just always use NGS? Is low-coverage PacBio more cost-effective? More accurate? Is the workflow simpler?

    16. METHODS: It is very exciting that the authors are able to achieve high precision with the minimum supporting-read threshold set to 1. However, it would also be interesting to see how the results change if this threshold is set to 2 or 3. It would certainly reduce recall, but would precision improve? Given the stated accuracy of HiFi reads (99.9%) and the mean read lengths of 13.0, 17.2, and 12.9 kb for the three wheat species, each read will carry, on average, 13, 17, and 13 base-call errors. Exploring how raising the minimum supporting-read threshold could tackle this issue would further increase the value of the paper's analyses.
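
The expected-error arithmetic above is simply the stated per-base error rate times the mean read length:

```python
# Expected base-call errors per read at the stated 99.9% HiFi accuracy
error_rate = 1 - 0.999  # 0.1% per base
mean_read_lengths_bp = [13_000, 17_200, 12_900]  # mean read lengths quoted in the manuscript
expected_errors = [round(n * error_rate, 1) for n in mean_read_lengths_bp]
# i.e. about 13, 17, and 13 errors per read
```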

    17. METHODS: Given the potential of the manuscript to be a reference for the community, we felt that the sampling and sequencing sections lack detail.

      • Sampling

        • The names of the accessions used are not given.

        • The age of plants at the time of sampling and the amount of tissue collected is not given.

        • It would also help to know the number of plants sampled per accession. Was the tissue pooled, were there any biological or technical replications, etc?

      • Sequencing

        • How was HMW DNA extracted and what was the final concentration and elution volume?

        • It would also be great to see some quality metrics for the HMW DNA – e.g. fragment-length metrics from a TapeStation or Femto Pulse, or absorbance ratios from a NanoDrop or equivalent. Such benchmarks could help others judge whether the quality of their DNA is suitable for PacBio HiFi.

        • The number of flow cells used per sample is mentioned in the Results but is not described in the Methods.

        • Additional details of Illumina sequencing would be helpful (chemistry, read lengths, single- or paired-end).

    18. METHODS: The bioinformatics sections are more detailed but further explanation would be helpful.

      • How were "mainstream" long-read mappers and SV callers selected for the analyses?

      • Details of how the components of variance were calculated for aligners vs SV callers could be useful.

      • Details of how the Poisson distribution tests for deletions were performed could be useful as these are only briefly mentioned in Fig. 1 Step 2.
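
To make this request concrete, one common read-depth formulation is sketched below; this is our generic illustration, not necessarily the authors' implementation, which is exactly what the Methods could clarify:

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), from the series definition."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def depth_drop_pvalue(observed_depth: int, mean_coverage: float) -> float:
    """Lower-tail p-value: probability of seeing a depth this low over a
    candidate deletion if reads followed the genome-wide mean coverage."""
    return poisson_cdf(observed_depth, mean_coverage)

# At 20x mean coverage, zero depth over a region is essentially impossible by chance
p_zero = depth_drop_pvalue(0, 20.0)
```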

      • Details of the Samtools settings used for sub-sampling for the low-coverage HiFi analyses could be useful.
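
For example, if `samtools view -s` was used, both the seed and the fraction matter for reproducibility (command sketched with hypothetical file names; this is our guess at the step, not the authors' documented command):

```shell
# Keep ~25% of reads: the integer part of -s is the random seed,
# the fractional part the proportion of reads retained.
samtools view -b -s 42.25 aligned_hifi.bam -o hifi_subsampled.bam
```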

      • Supplementary Table 9 could be referenced in Methods as well as Results.

      • We felt that the following sections could be improved with further explanations: "The SV truth set construction", "Validation of SV truth set by genome assembly, PCR amplification and Sanger sequencing", and "Evaluation of the ACCs and ensemble approach".

        • For example, for SV truth set construction, the authors could explain why setting Depth = 0 is the 'gold standard'. We agree this is the case, but it took us quite a while and an examination of the supplementary figures to deduce why.

        • Referring to Supplementary Figure 2 could help the authors explain the "read depth method" more clearly.

      • The resource and run-time data presented in Supplementary Figure 10 is very valuable for end-users, thank you for including this!

    Minor suggestions

    General

    • Grammar errors were noticed throughout the paper, but there is a higher frequency in the Methods and the Discussion. A few examples:

      • Inappropriate use of the definite article 'the' (e.g. L231, L249, L266)

      • Inappropriate tenses (e.g. L262: "recommended" should be "recommend") (e.g. L302 and L306: "detailed parameters were listed" should be "are listed")

    • It would be preferable to use consistent language throughout to improve clarity.

      • For example, "goatgrass" is also referred to as "strangulata" or just "the DD genome".

      • Another example is the use of the terms "low-coverage HiFi" and "skim HiFi" to mean the same thing. This occurs in the Abstract, Discussion, and Conclusion.

    Abstract

    • L21-22 – Just using the word "were" is vague and the sentence structure is not quite right. The authors could perhaps alter to "However, SVs have historically been ineffectively assayed…"

    • L21-22 – "Complex nature" is rather vague and uninformative. Perhaps the authors could state more clearly that short-read technologies cannot span large INDELs, leading to poor reconstruction.

    • L29 – "wheat genomes" is perhaps not enough detail. It might be good to state that you sequenced one accession of each of Triticum aestivum (hexaploid, AABBDD), Triticum turgidum (tetraploid, AABB), and Aegilops tauschii (diploid, DD).

    • L29-30 – consider changing to "…callers for SV detection", not "in".

    • L37 – perhaps indicate roughly what is meant by "low coverage" so someone just reading the abstract still gets the main idea. At this point, it is unclear if "low" means a coverage of 0.01x, 0.1x, 1x, 10x, etc.

    • L42 – This sentence could be shortened to "The accuracy of SV discovery from low-coverage HiFi sequencing suggests it is effective and preferable to characterize SVs at the population level."

    • L37-38 – "sufficiently robust for high-quality SV discovery": this statement could also acknowledge the fact that the recall is substantially reduced.

    • L43 – "preferable" is quite subjective. It depends what the user's end goals are. Yes, low-coverage HiFi is more cost-effective, but it also produces less complete results – recall is lower.

    Background

    • L75 – The term "SV breakpoint" is not defined in the paper. It is used again in Figure 3.

    • L71-75 – Here the "initial form of long-read sequencing" is introduced and both methods are described to have high error rates. The authors go on to show that PacBio HiFi has much better accuracy than these methods. For completeness, it would be good to also show how the current generation of Oxford Nanopore compares to PacBio HiFi.

    Results

    • L120 – It might be more appropriate to quote the N50s here rather than average lengths.

    • L164 – "Ground truth" is being used here as a synonym for the truth set. Suggest just saying "truth set" here as well to keep things obvious.

    • L117-118 – The full scientific names are already given in L112 and L113.

    • L220-221 – suggest altering this to "… while low-coverage HiFi sequencing recovers fewer SVs overall, the validity of detected SVs is not compromised". The term "technological missing data" is a convoluted way of justifying that lower coverage produces lower recall.

    • In many figures some text is too small for easy reading.

    • Fig. 1 – Step 2b. We didn't feel we understood the Paragraph method from this figure or from the Methods section. Also, the arrowheads on the arrows in this panel are a bit strange.

    • Fig. 2

      • Ease of comprehension could be boosted by swapping the axes of the count plots – i.e. making the bars go horizontal. That way the mapper/caller combinations are on the same line for both the count plots and length distribution plots! This would make scanning the figure much easier.

      • The white bar for SVDSS is hard to distinguish.

      • In the bar plots, colours represent aligners, whereas in the length-distribution plots, colours represent callers. It could be better to use consistent colour keys for both types of figure. The grouping of the plots could be changed so that they can all be coloured by caller.

      • The y-axis labels on the length distribution plots are very small.

    • Fig. 3

      • No y-axis scale on plots in (a) and (d) panels.

      • The y-axis labels on panels (b) and (e) are very small.

      • On panels (b) and (e), it might be good to add a central line to each sub-panel to show the centromere, to give a better impression of the low density of structural variations around the centromere.

      • Panels (g) and (h) could be improved by showing less of the sequence surrounding the SV of interest. The size of the remaining SV sequence not depicted could be indicated using square brackets as we were confused at first when seeing the "…".

      • In panel (h) the red text on the green arrow could be difficult for some readers to distinguish. Perhaps consider using a colour-blind-friendly palette.

    • Fig. 4

      • The y-axis labels mis-spell "precision" in panels (a), (b), (c), and (d).

      • The x-axis labels on panels (c) and (f) are very small.

    • Fig. 5 – It was hard to follow the meaning of the different colours and shapes. Perhaps consider including a Supplementary Figure that shows each caller on a separate graph.

    Conclusion

    • L285 – "with much less computational resources" could be changed to "while requiring far fewer computational resources"

    Competing interests

    The authors declare that they have no competing interests.