Haplogenome assembly reveals structural variation in Eucalyptus interspecific hybrids

Abstract

Background

De novo phased (haplo)genome assembly using long-read DNA sequencing data has improved the detection and characterization of structural variants (SVs) in plant and animal genomes. Able to span across haplotypes, long reads allow phased haplogenome assembly in highly outbred organisms such as forest trees. Eucalyptus tree species and interspecific hybrids are the most widely planted hardwood trees, with F1 hybrids of Eucalyptus grandis and E. urophylla forming the bulk of fast-growing pulpwood plantations in subtropical regions. The extent of structural variation and its effect on interspecific hybridization are unknown in these trees. As a first step towards elucidating the extent of structural variation between the genomes of E. grandis and E. urophylla, we sequenced and assembled the haplogenomes contained in an F1 hybrid of the two species.

Findings

Using Nanopore sequencing and a trio-binning approach, we assembled the separate haplogenomes (566.7 Mb and 544.5 Mb) to 98.0% BUSCO completion. High-density SNP genetic linkage maps of both parents allowed scaffolding of 88.0% of the haplogenome contigs into 11 pseudo-chromosomes (scaffold N50 of 43.8 Mb and 42.5 Mb for the E. grandis and E. urophylla haplogenomes, respectively). We identify 48,729 SVs between the two haplogenomes, providing the first detailed insight into genome structural rearrangement in these species. The two haplogenomes have similar gene content (35,572 and 33,915 functionally annotated genes), of which 34.7% are contained in genome rearrangements.
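For readers less familiar with the scaffold N50 statistic quoted above, the following is a minimal sketch of how an N50 is conventionally computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the authors' scripts or data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy scaffold lengths (arbitrary units), not real data.
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints 40
```

With the toy lengths, the total is 150, so the running sum first reaches half of it (75) at the second-longest scaffold, giving an N50 of 40.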

Conclusions

Knowledge of SV and haplotype diversity in the two species will form the basis for understanding the genetic basis of hybrid superiority in these trees.

Article activity feed

  1. De novo

    Xupo Ding

    1. The CDS and protein sequences could not be extracted from the masked.fasta file with the gff3 file when verifying the accuracy of gene loci and related proteins. The extraction software was gffread in cufflinks 2.1.1. Please confirm the final assembly file that will be uploaded to GigaDB.
    2. Please confirm the accuracy of gene prediction, especially for Ks calculation.
    3. Before repeat masking with RepeatMasker, the final sequences were scanned with LTR_retriever and the LAI values have been generated in this folder. The LAI values were 20.55 and 18.06, which would classify the haplogenome assemblies as reference or gold level. Please describe the LAI values after the BUSCO completeness in the revised manuscript.
    4. The percentages of the two largest LTR subfamilies, Gypsy and Copia, were not presented in Supplementary Table S5.
    5. Two Eucalyptus genomes have been published (Nature 2014; GigaScience 2020), and neither analysed LTR insertion times in detail. The insertion times of all TEs, Gypsy and Copia would strengthen this manuscript, especially as the basic data are already present in the *.list files from the LTR_harvest and LTR_retriever scans.
    6. Were the genes specific to each haplogenome classified? In which pathways or GO terms are they enriched?
    7. Some SVs may be associated with plant traits. The genes located in regions of the different SV types should be further identified and tested for GO and KEGG enrichment.
    8. "Syntenic gene pairs between the E. grandis and E. urophylla haplogenomes were identified using a python version of MCScan, JCVI v1.1.18." The syntenic gene pairs in Figure 4 seem to come only from JCVI, not from MCScan.
    9. The reference citation style should be consistent; for example, "Candotti et al" in the Genome scaffolding section should be revised.
    10. The language should be improved and revised by an academic editor.
  2. Summary

    Chao Bian: This study, entitled "Haplogenome assembly reveals interspecific structural variation in Eucalyptus hybrids", reports two haplotypes from Eucalyptus grandis and E. urophylla. Both genomes are of high quality and high completeness. Nevertheless, why not sequence Eucalyptus grandis and E. urophylla directly and separately, and assemble each genome separately? In this way, the authors would not need so many assembly steps to distinguish the haplogenomes.

    On the other hand, the authors have written a long paragraph on the SVs and SNPs between the two Eucalyptus species. However, they only show the numbers of SVs and SNPs and do not show any relationship between the SVs and biological characters. Could some SVs and SNPs that involve or affect genes explain some of the biological differences between Eucalyptus grandis and Eucalyptus urophylla? In my view, showing only the numbers of SVs and SNPs does little for the wider interest of this study. Some biological stories should be reported in a genome study.

    Please provide new figures with higher resolution; the current figures are too unclear.

    Please use the newer BUSCO v5.2.2 and indicate the library used.

    What is the QUAST assessment result in this study?

    The English of this paper needs substantial polishing. There are too many spelling and other mistakes in the manuscript.

    Some minor suggestions:

    1. The decimal places should be uniform, such as in "(567 Mb and 545 Mb) to 97.9% BUSCO completion" and "scaffold N50 of 43.82 Mb and 42.45 Mb for the E. grandis and E. urophylla haplogenomes, respectively".
    2. In 'All scripts used in this study is available on github.', 'is' should be 'are'.
    3. The language of this sentence should be revised: "Illumina short-reads were used for k-mer based genome size estimation was performed using Jellyfish v2.2.6 (Jellyfish, RRID:SCR_005491) [25] for 21-mers and visualised with GenomeScope v2.0".
    4. In the scaffolding step, why did the authors remove all contigs smaller than 3 kb?
    5. 'The predicted gene space was' should be 'The predicted gene spaces were'.
    6. In "a contig N50 of 3.91 Mb 1." and 'was greater than 88.0% 2', what do the trailing '1' and '2' mean?
    7. In the sentence 'Approximately 3.3 μg of HMW DNA from was used without', 'from' what?
    8. "a BUSCO completeness score of at least 95.3% was obtained for contigs anchored to one of the eleven chromosomes." Why 'one of the eleven chromosomes'? Why were contigs anchored to only one chromosome?
    9. Revise 'markers each.,'.
    10. In "BUSCO completeness scores of 94.6% and 95.8% was obtained", 'was' should be 'were'.
    11. In "Although there is a greater number of local variants compared to SVs", 'there is' should be 'there are'.
    12. "respectively, Supplementary Table S3)" should be revised to 'respectively, (Supplementary Table S3)'.
    13. 'Mbp' should be revised to 'Mb'.
    14. 'assemblies was' should be 'assemblies were'.