A New and Improved Genome Sequence of Cannabis sativa

Abstract

Cannabis is a diploid species (2n = 20); the haploid genome sizes of female and male plants, estimated by flow cytometry, are 818 and 843 Mb, respectively. Although the genome of Cannabis has been sequenced (from hemp, wild and high-THC strains), all assemblies have significant gaps. In addition, there are inconsistencies in chromosome numbering, which limit their use. A new comprehensive draft genome sequence assembly (~900 Mb) has been generated using long-read sequencing from the medicinal cannabis strain Cannbio-2, which produces a balanced ratio of cannabidiol and delta-9-tetrahydrocannabinol. The assembly was subsequently analysed for completeness by ordering the contigs into chromosome-scale pseudomolecules using a reference genome assembly approach, then annotated and compared to other existing reference genome assemblies. Based on nucleotides assembled and BUSCO evaluation, the Cannbio-2 assembly was found to be the most complete Cannabis sativa genome sequence available, and it has a comprehensive genome annotation. The new draft genome sequence is an advancement in Cannabis genomics, permitting pan-genome analysis, genomic selection and genome editing.

Now published in Gigabyte doi: 10.46471/gigabyte.10

**Reviewer 2. Ramil Mauleon**

Are all data available and do they match the descriptions in the paper? No. Additional Comments: BioProject PRJNA667278 in NCBI appears to be still embargoed; a reviewer link would be helpful.

Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide No. Additional Comments: Sample provenance / passport information is lacking for the Cannbio-2 material. An explicit mention of the source of the RNAseq and TSA data in the methods would be helpful. The same comment as above applies to the GenBank BioProject.

Is the data acquisition clear, complete and methodologically sound? No. Additional Comments: It's mostly clear for the DNA extraction, PacBio sequencing and primary assembly. The description of anchoring the assembled contigs into pseudochromosomes using another published genome lacks detail and only broadly mentions the software used (RaGOO). This is a very critical step that will determine whether the Cannbio-2 assembly is an improvement over the mentioned genome assemblies (esp. cs10, PK); it's a circular argument if the genome assembly is ascertained against existing assemblies from other cannabis accessions and then declared improved. As noted by the authors, there are differences (rather than inconsistencies) between the compared published genomes, and these may be inherent in each genome; any analyses on an assembly based on these would cause ascertainment bias.

Is there sufficient detail in the methods and data-processing steps to allow reproduction? No. Additional Comments: The previous comment regarding anchoring of contigs to an existing genome applies here as well. Regarding genome annotation, is there any basis for the choice of annotation method, i.e. the annotator software (Augustus), the consensus builder (EVM), and PASA? MAKER (MAKER-P) and BRAKER are available pipelines, both reported as good for plants, and GeneMark is a prediction software suite that excels in plant genome annotation. Regarding evidence for annotation, it appears that de novo transcript assemblies were used, but the RNAseq data was not incorporated in the prediction step. No orthologous protein databases appear to have been used as hints for gene prediction. These are just observations/suggestions to further improve the annotation quickly. In general, the annotation steps would benefit from a bit more detail for reproducibility, but I would say the annotation, if done at the contig level, would be very solid.

Is there sufficient data validation and statistical analyses of data quality? No. Additional Comments: On the assembly itself, since there was no mention of the method for anchoring contigs into chromosomes, there is no information on how scaffolds are spaced along the genome: is it padding with a fixed number of Ns? Are all assembled contigs anchored, or are there unanchored ones? Again on the point of anchoring and ordering of contigs, ideally evidence from the same sequenced material would be the best to use (for example, a genetic linkage map with sequence-based markers). Plant genomes are notorious for rearrangements (inversions, insertions, translocations, tandem repeats etc.) even within species, and how the contigs were anchored into chromosomes appears to be the weakest evidence in this paper. Regarding gene annotation, you can run BUSCO on the predicted genes and report those results as well. Again, results will reflect the outcome of the annotation method used. For BUSCO in general, I'd be cautious in comparing results across published genomes; it is more informative when optimizing the assembly methodology or testing different assembly methods (checking whether you are improving the assembly of the same underlying dataset). On this same topic, were the unmapped contigs from the other assemblies used? The same question applies to the assembly done by the authors.
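The two questions above (fixed-length N padding, and anchored versus unanchored sequence) can be checked directly from an assembly FASTA file. The following is only a minimal sketch and not the authors' or RaGOO's method: the function names, the `chr` prefix used to recognise anchored sequences, and the toy sequences are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the first token of the header
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def gap_length_counts(records):
    """Count how often each run length of N/n occurs across all sequences.

    A single dominant run length suggests fixed-size padding between contigs.
    """
    counts = Counter()
    for seq in records.values():
        for match in re.finditer(r"[Nn]+", seq):
            counts[len(match.group())] += 1
    return counts


def anchored_fraction(records, is_anchored):
    """Fraction of non-N bases found in sequences accepted by is_anchored(name)."""
    total = anchored = 0
    for name, seq in records.items():
        bases = len(seq) - seq.upper().count("N")
        total += bases
        if is_anchored(name):
            anchored += bases
    return anchored / total if total else 0.0


# Toy input; sequence names are hypothetical, not taken from any real assembly.
example = """>chr1_hypothetical
ACGTNNNNNACGTACNNNNNGG
>unplaced_contig_1
ACGTAC
"""
records = read_fasta(example)
gaps = gap_length_counts(records)
frac = anchored_fraction(records, lambda n: n.startswith("chr"))
```

On a real assembly, the text would be read from the FASTA file, and the `is_anchored` rule would need to match whatever naming scheme the pseudochromosomes actually use.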

Is the validation suitable for this type of data? No. Additional Comments: Mostly yes for the primary genome assembly. The validation of the pseudochromosome assembly is not convincing. If done at the contig level, the genome annotation would be solid.

Is there sufficient information for others to reuse this dataset or integrate it with other data? No. Additional Comments: To recap, the biomaterial information and information on the pseudochromosome assembly are missing, and explicit mention of the GenBank IDs for the transcript assembly and RNAseq data used in annotation (instead of leaving them in the reference) would improve re-use and integration. On chromosome nomenclature, I don't understand why the authors do not mention the nomenclature currently being used by the community, as reported in the NCBI cs10 RefSeq release.

Any Additional Overall Comments to the Author: I believe reporting results based on the main evidence generated by the authors (in this current work and the previous one on the transcriptome) would make this a stronger data release, i.e. the contig/scaffold assemblies and their annotation based on your own RNAseq data. On a related note, have you tried using your short-read data during assembly? Could your assembly have been improved if you had used the Illumina data during assembly itself (hybrid assembly, scaffolding)? Cannabis genomes are known to be highly heterozygous; a report of this would be easy to produce by comparing your assembly against your read datasets, especially the short reads, and would be an important finding.

Recommendation: Major Revision
