Genome assembly of the hybrid grapevine Vitis ‘Chambourcin’

Curation statements for this article:
  • Curated by GigaByte

    Editor’s Assessment

    Hybrid genomes are tricky to assemble, and few genomic resources are available for hybrid grapevines such as ‘Chambourcin’, a French-American interspecific hybrid grape grown in the eastern and midwestern United States. Here the authors assemble ‘Chambourcin’ using a combination of PacBio HiFi long reads, Bionano optical maps, and Illumina short-read sequencing, producing an assembly with 26 scaffolds, an N50 length of 23.3 Mb, and an estimated BUSCO completeness of 97.9% that can be used for genome comparisons, functional genomic analyses, and genome-assisted breeding research. Error correction and Pilon polishing were a challenge with this hybrid assembly, but the authors tried a few different approaches during the review process and improved it. Because they have documented what they did and are clear about the final metrics, users can assess the quality themselves.

    This assessment refers to version 2 of this preprint.

Abstract

Background

‘Chambourcin’ is a French-American interspecific hybrid grape variety grown in the eastern and midwestern United States and used for making wine. Currently, there are few genomic resources available for hybrid grapevines like ‘Chambourcin’.

Results

We assembled the genome of ‘Chambourcin’ using PacBio HiFi long-read sequencing, Bionano optical map sequencing and Illumina short read sequencing. We produced an assembly for ‘Chambourcin’ with 26 scaffolds with an N50 length of 23.3 Mb and an estimated BUSCO completeness of 97.9%. 33,791 gene models were predicted, of which 81% (27,075) were functionally annotated using Gene Ontology and KEGG pathway analysis. We identified 16,056 common orthologs between ‘Chambourcin’ gene models, V. vinifera ‘PN40024’ 12X.v2, VCOST.v3, Shine Muscat ( Vitis labruscana x V. vinifera ) and V. riparia Gloire. A total of 1,606 plant transcription factors representing 58 different gene families were identified in ‘Chambourcin’. Finally, we identified 304,571 simple sequence repeats (SSRs), repeating units of 1-6 base pairs in length in the ‘Chambourcin’ genome assembly.
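The N50 length quoted above is the scaffold length at which the running total of scaffold lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal Python sketch of that calculation follows; the function name `n50` and the toy scaffold lengths are illustrative assumptions, not values or code from the manuscript.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and summed until the
    running total reaches at least half of the overall total; the
    length at which that happens is the N50.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), not the real assembly.
toy_scaffolds = [40, 30, 20, 10]
print(n50(toy_scaffolds))
```

For the toy input, the total is 100, and the running sum first reaches 50 at the second-longest scaffold, so the function returns 30.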

Conclusions

We present the genome assembly, genome annotation, protein sequences and coding sequences reported for ‘Chambourcin’. The ‘Chambourcin’ genome assembly provides a valuable resource for genome comparisons, functional genomic analysis and genome-assisted breeding research.

Article activity feed

  1. Editor’s Assessment

    Hybrid genomes are tricky to assemble, and few genomic resources are available for hybrid grapevines such as ‘Chambourcin’, a French-American interspecific hybrid grape grown in the eastern and midwestern United States. Here the authors assemble ‘Chambourcin’ using a combination of PacBio HiFi long reads, Bionano optical maps, and Illumina short-read sequencing, producing an assembly with 26 scaffolds, an N50 length of 23.3 Mb, and an estimated BUSCO completeness of 97.9% that can be used for genome comparisons, functional genomic analyses, and genome-assisted breeding research. Error correction and Pilon polishing were a challenge with this hybrid assembly, but the authors tried a few different approaches during the review process and improved it. Because they have documented what they did and are clear about the final metrics, users can assess the quality themselves.

    This assessment refers to version 2 of this preprint.

  2. Background: ‘Chambourcin’ is a French-American interspecific hybrid grape variety grown in the eastern and midwestern United States and used for making wine. Currently, there are few genomic resources available for hybrid grapevines like ‘Chambourcin’. Results: We assembled the genome of ‘Chambourcin’ using PacBio HiFi long-read sequencing and Bionano optical map sequencing. We produced an assembly for ‘Chambourcin’ with 27 scaffolds with an N50 length of 23.3 Mb and an estimated BUSCO completeness of 98.2%. 33,265 gene models were predicted, of which 81% (26,886) were functionally annotated using Gene Ontology and KEGG pathway analysis. We identified 16,501 common orthologs between ‘Chambourcin’ gene models, V. vinifera ‘PN40024’ 12X.v2, VCOST.v3, V. riparia ‘Manitoba 37’ and V. riparia Gloire. A total of 1,589 plant transcription factors representing 58 different gene families were identified in ‘Chambourcin’. Finally, we identified 310,963 simple sequence repeats (SSRs), repeating units of 1-6 base pairs in length in the ‘Chambourcin’ genome assembly. Conclusions: We present the genome assembly, genome annotation, protein sequences and coding sequences reported for ‘Chambourcin’. The ‘Chambourcin’ genome assembly provides a valuable resource for genome comparisons, functional genomic analysis, and genome-assisted breeding research.

    This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.84), and the reviews are published under the same license. They are as follows.

    **Reviewer 1. Lingfei Shangguan** Reviewer's Comments: Grapevine is one of the most important fruit crops in the world, and ‘Chambourcin’ is a hybrid wine grape variety derived from crosses between North American and European Vitis species. The authors have sequenced the genome of ‘Chambourcin’ and obtained repeat sequence and gene annotation information. However, the sequencing depth was too low for a grape genome, especially given its high heterozygosity. They also did not apply Illumina sequencing for sequence correction.

    Re-review: Although the authors have made some corrections and improvements, the genome quality is still low, and the manuscript has not improved significantly. The authors should provide the haplotype sequences and describe the genome assembly and correction steps more clearly. Moreover, the novelty of the article is insufficient. I suggest rejection.

    **Reviewer 2. Pablo Carbonell-Bejerano**

    Are all data available and do they match the descriptions in the paper? No. Access to the raw data for the RNA-seq dataset that was used for gene predictions is not indicated.

    Are the data and metadata consistent with relevant minimum information or reporting standards?

    No. A description of the RNA-seq dataset, its origin, and its features is entirely missing. I also could not find other data that would be required according to the guidelines at http://gigadb.org/site/guide:

    • Full (not summary) BUSCO results output files (text)
    • readme.txt including all file names with a brief description of each
    • sample metadata that complies with the Genomic Standards Consortium.

    Is the data acquisition clear, complete and methodologically sound?

    Yes. Sequencing and bioinformatic methods followed are generally sound.

    Is there sufficient detail in the methods and data-processing steps to allow reproduction? No.

    1. Availability of the scripts used in bioinformatic analyses and data plotting is generally missing.

    2. L124. The authors describe that minimap2 was used to obtain the dotplot. However, minimap2 alone does not produce dotplots.

    3. L131. It is unclear how the ‘PN40024’ 12X.v2, VCost.v3 protein annotations were used as input for BRAKER2. Do the authors mean protein sequences instead? Where were these protein data retrieved from? How were the proteins aligned to the assembly? Was BRAKER run on the masked or the unmasked assembly?

    Is there sufficient data validation and statistical analyses of data quality? No.

    1. Validation that the original material is true to type for the 'Chambourcin' cultivar genotype is not mentioned, nor is the number of different plants used for DNA extraction. While post-assembly validation of the Chambourcin genome assembly genotype from the mapped Chambourcin rhAmpSeq markers may be possible, such genotype validation is not mentioned in the text either.

    2. In general, the quality and the genome variation represented in the Chambourcin genome assembly produced here could have been further tested. For instance, the 2% BUSCO duplication and the 501.5 Mb primary assembly size, compared with the 481.5 Mb haploid genome size that can be inferred from the k-mer analysis presented by the authors, suggest that further duplication purging of the primary assembly is likely needed. This issue could be addressed by looking for assembly regions with reduced alignment depth when all HiFi reads are mapped to the primary assembly. Duplicated regions to be purged could also be supported by co-linear assembly segments sharing BUSCO duplicated genes. For assembly reliability assessment, 10X, rhAmpSeq, or Illumina WGS data that are available for Chambourcin could also be used to validate genome variants represented in this Chambourcin assembly when comparing the inter-haplotype variants detected between the primary and haplotig assemblies, or when comparing the haplotypes with genome assemblies from other genotypes.
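The size comparison in point 2 can be made concrete with a short calculation. The sketch below uses only the two figures quoted by the reviewer (501.5 Mb primary assembly, 481.5 Mb k-mer-based haploid estimate); the function name `excess_fraction` is an illustrative assumption, and the result is only a rough indicator of possibly retained duplication, not a measurement of it.

```python
def excess_fraction(assembly_mb, haploid_estimate_mb):
    """Fraction by which an assembly exceeds an estimated haploid size."""
    if haploid_estimate_mb <= 0:
        raise ValueError("haploid estimate must be positive")
    return (assembly_mb - haploid_estimate_mb) / haploid_estimate_mb


# Figures quoted in the review above.
primary_mb = 501.5       # primary assembly size
kmer_haploid_mb = 481.5  # haploid size inferred from k-mer analysis

extra_mb = primary_mb - kmer_haploid_mb
frac = excess_fraction(primary_mb, kmer_haploid_mb)
print(f"{extra_mb:.1f} Mb above the k-mer estimate ({frac:.1%})")
```

The primary assembly is 20 Mb larger than the k-mer estimate, an excess of roughly 4%.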

    Is the validation suitable for this type of data? Yes. The validation is suitable, although it might not suffice in all cases.

    Is there sufficient information for others to reuse this dataset or integrate it with other data? No. As described above, information is missing in several places, such as the origin of the RNA-seq data.

    Additional Comments:

    1. L171. Is it correct that the total length of the Bionano maps was as small as 962,964 bp? Or do the authors mean kb instead of bp in that sentence?

    2. Could the mapping of Chambourcin rhAmpSeq markers have been further exploited to phase contig haplotypes before purging haplotypes and scaffolding the assembly?

    3. For the Conclusion in L254, it is arguable whether the presented Chambourcin genome assembly is the first genome assembly of a complex interspecific hybrid. For instance, 'Shine Muscat' might also be considered a complex interspecific hybrid grape cultivar, and its genome assembly has been published: https://academic.oup.com/dnaresearch/article/29/6/dsac040/6808674 It is even arguable whether the assembly presented in this publication is the first Chambourcin genome assembly, as a 10X Genomics-based assembly is available for Chambourcin: https://www.nature.com/articles/s41467-019-14280-1

    Re-review: Efforts to improve the accuracy of the MS and the availability of data are clear in the revised version. The authors have included descriptions of M&M procedures and information about the origin of several datasets that were missing. They also added files with commands and original results to the FTP server. In addition, they did further de-duplication of the assembly, added Illumina sequencing for assembly polishing, and included further QC stats and comparisons to another recently published hybrid grapevine genome assembly.

    Most revision actions were successful. However, polishing HiFi assemblies with Illumina reads is not recommended, as in most cases it harms the consensus quality more than it improves it; this is particularly true for repetitive and highly heterozygous genomes like that of the Chambourcin grapevine cultivar. In fact, the BUSCO completeness of 97.9% after Pilon short-read polishing, compared with 98.2% in the former version, indicates that polishing with Illumina short reads is indeed harmful in this revised version. I agree with the authors that 28x depth of PacBio HiFi reads should suffice to produce a quality genome assembly without using more depth or other sequencing technologies, as they indicate in their response. I would recommend removing the Pilon polishing from the final assembly version, as such polishing is only recommended for error-prone PacBio CLR or Nanopore assemblies. Instead, the authors could use the Illumina reads for k-mer analysis of assembly consensus quality and completeness.

    **Editorial Board Member adjudication:**

    Comment 1. How many times did you do the polishing with Pilon? This is not clear in the documents provided. It could be 1 round or many. Many would be a concern. When we run error correction on genomes, we monitor BUSCO and when it drops, roll back one iteration.

    Comment 2. How many sites were corrected in the polishing of the primary and haplotig assembly?

    Comment 3. Can you run KAT (“KAT: A K-Mer Analysis Toolkit to Quality Control NGS Datasets and Genome Assemblies.” Bioinformatics 33 (4): 574–76) to check the diploid, primary and haplotig assemblies?

    Comment 4. Can you align the mRNAseq and whole genome shotgun reads to diploid, primary and haplotig assemblies and report the percent mapping including the properly paired?
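The stopping rule described in Comment 1 (monitor BUSCO after each polishing round and roll back one iteration when it drops) can be sketched as a small selection function. This is a minimal illustration only: the function name `select_polishing_round` and the second list of scores are hypothetical, and running Pilon and BUSCO themselves is outside the scope of the sketch. The first example reuses the two completeness values quoted in Reviewer 2's re-review purely as illustrative input.

```python
def select_polishing_round(busco_by_round):
    """Pick the polishing round to keep.

    busco_by_round[0] is the BUSCO completeness of the unpolished
    assembly and busco_by_round[i] the completeness after i rounds.
    Polishing stops at the first round where completeness drops, and
    the previous round is kept (a one-iteration rollback).
    """
    if not busco_by_round:
        raise ValueError("need at least the unpolished BUSCO score")
    keep = 0
    for i in range(1, len(busco_by_round)):
        if busco_by_round[i] < busco_by_round[i - 1]:
            break  # completeness dropped: roll back to the previous round
        keep = i
    return keep


# Values quoted in the re-review: 98.2% before and 97.9% after polishing.
print(select_polishing_round([98.2, 97.9]))

# Hypothetical multi-round series, for illustration only.
print(select_polishing_round([97.0, 97.5, 97.6, 97.4]))
```

With the first input the function keeps round 0, the unpolished assembly; with the hypothetical series it keeps round 2, the last round before completeness fell.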