Independent assessment and improvement of wheat genome sequence assemblies using Fosill jumping libraries

Abstract

Background

The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics.

Results

Here we use a Fosill 38-kb jumping library to assess the medium- and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values.
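For readers unfamiliar with the N50 statistic mentioned above, the following is a minimal sketch of how it is commonly computed: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the toy scaffold lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy example (not taken from the paper): joining six
# 10-unit contigs into two 30-unit scaffolds triples the N50.
before_scaffolding = [10, 10, 10, 10, 10, 10]
after_scaffolding = [30, 30]
fold_increase = n50(after_scaffolding) / n50(before_scaffolding)
```

Because scaffolding joins sequences without adding bases (gaps aside), the total length stays roughly constant while the N50 rises, which is why N50 is a convenient summary of contiguity but says nothing on its own about correctness.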

Conclusions

Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.

Article activity feed

  1. Now published in GigaScience doi: 10.1093/gigascience/giy053

    Fu-Hao Lu (1), Neil McKenzie (1), George Kettleborough (2), Darren Heavens (2), Matthew D. Clark (2), Michael W. Bevan (1)

    (1) John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
    (2) The Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK

    A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy053 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    Reviewer 1: http://dx.doi.org/10.5524/REVIEW.101147

    Reviewer 2: http://dx.doi.org/10.5524/REVIEW.101148