A reference genome for the critically endangered woylie, Bettongia penicillata ogilbyi



Biodiversity is declining globally, and Australia has one of the worst extinction records for mammals. The development of sequencing technologies means that genomic approaches are now available as important tools for wildlife conservation and management. Despite this, genome sequences are available for only 5% of threatened Australian species. Here we report the first reference genome for the woylie (Bettongia penicillata ogilbyi), a critically endangered marsupial from Western Australia, and the first genome within the Potoroidae family. The woylie reference genome was generated using Pacific Biosciences HiFi long-reads, resulting in a 3.39 Gbp assembly with a scaffold N50 of 6.49 Mbp and 86.5% complete mammalian BUSCOs. Assembly of a global transcriptome from pouch skin, tongue, heart and blood RNA-seq reads was used to guide annotation with Fgenesh++, resulting in the annotation of 24,655 genes. The woylie reference genome is a valuable resource for conservation, management and investigations into disease-induced decline of this critically endangered marsupial.

Article activity feed

  1. Abstract

    This paper has been published by GigaByte, which openly shares its peer reviews under a CC-BY 4.0 licence.

    **Reviewer 1. Qiye Li** Are all data available and do they match the descriptions in the paper? No. The availability of the raw sequencing data generated in this study is not stated. It would also be appreciated if the authors could provide a table summarizing all the sequencing data generated in this study.

    Is the data acquisition clear, complete and methodologically sound? Yes. Could you also provide the gender information for woy03?

    Is there sufficient detail in the methods and data-processing steps to allow reproduction? No

    L145-146: It is unclear how the authors determined full-length protein-coding genes by BLAST against the Swiss-Prot non-redundant database. It would be appreciated if the authors could provide more details here.

    L183: The authors indicated that 15,904 of the 24,655 protein-coding genes were supported by mRNA evidence and 1,309 by protein evidence. Does the mRNA evidence come from the RNA-seq data? Where does the protein evidence come from?

    Is there sufficient data validation and statistical analyses of data quality? No

    L233: Contaminating sequences in the reference genome are noteworthy, as the DNA for genome sequencing was extracted from wild animals that were dead before sampling. However, I would say that high mapping rates do not necessarily indicate low levels of contaminating DNA, because contaminating DNA (e.g. from bacteria), if it exists in your dataset, might have been assembled as part of the woylie reference genome. It is unclear whether the authors have submitted the genome to NCBI. If so, I think they should have received a contamination report from NCBI.

    It would be appreciated if the authors could provide some more statistics for the protein-coding genes (e.g. mean gene size, mean exon number per gene, mean exon length and mean intron length) and compare these metrics with those of other marsupials. This would help in judging the quality of the gene models.
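
    For illustration, the gene-model metrics requested above can be derived from exon coordinates alone. The following is a minimal sketch, not the authors' pipeline: the function name `gene_structure_stats`, the input layout (gene ID mapped to 1-based inclusive exon intervals) and the toy coordinates are all assumptions made for this example.

```python
from statistics import mean


def gene_structure_stats(genes):
    """Summarise gene models given as {gene_id: [(exon_start, exon_end), ...]}.

    Coordinates are assumed to be 1-based and inclusive, as in GFF3.
    """
    gene_sizes, exon_counts, exon_lengths, intron_lengths = [], [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        # Gene size: span from the first exon start to the last exon end.
        gene_sizes.append(exons[-1][1] - exons[0][0] + 1)
        exon_counts.append(len(exons))
        exon_lengths.extend(end - start + 1 for start, end in exons)
        # Introns are the gaps between consecutive exons.
        intron_lengths.extend(
            nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])
        )
    return {
        "mean_gene_size": mean(gene_sizes),
        "mean_exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0,
    }


# Hypothetical toy gene models, not taken from the woylie annotation.
toy_genes = {
    "geneA": [(101, 200), (301, 400), (601, 700)],
    "geneB": [(1001, 1300)],
}
stats = gene_structure_stats(toy_genes)
print(stats)
```

    In practice the exon intervals would be read from the annotation file, and the same summary would be computed for each marsupial genome being compared.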

    Is the validation suitable for this type of data? Yes

    Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

    Recommendation: Minor Revision

    **Reviewer 2. Parwinder Kaur** Well-presented document with good data and analysis practices.

    Recommendation: Accept

    **Reviewer 3: Walter Wolfsberger** Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes

    The submission body and Table 1 report the following assembly statistics, which seem to indicate some potential issues:

    - Genome size (Gb): 3.39
    - No. scaffolds: 1,116
    - No. contigs: 3,016
    - Scaffold N50 (Mb): 6.94
    - Contig N50 (Mb): 1.99

    The main issue for me is the scaffold N50 in relation to the other parameters, when compared with assemblies produced using a similar methodological approach.

    This can be either good or bad: these numbers might indicate an issue during scaffolding, or the presence of a few very long scaffolds at the top of the assembly (which is great). I believe the submission would benefit significantly if this information were mentioned and discussed.
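
    As background to this comment, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below shows one way to compute it and to relate the figures quoted in this report to one another. It is illustrative only: the function name `n50` and the toy lengths are assumptions, and only the scaffold count, contig count and N50 values come from the report above.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half is the N50.
        if running >= half_total:
            return length


# Hypothetical toy lengths (arbitrary units), not woylie data.
toy_scaffolds = [8, 6, 4, 3, 2, 1]
toy_contigs = [4, 4, 3, 3, 2, 2, 2, 2, 1, 1]
print(n50(toy_scaffolds), n50(toy_contigs))

# Figures quoted in the review above.
n_scaffolds, n_contigs = 1116, 3016
scaffold_n50_mb, contig_n50_mb = 6.94, 1.99

contigs_per_scaffold = n_contigs / n_scaffolds
n50_ratio = scaffold_n50_mb / contig_n50_mb
print(round(contigs_per_scaffold, 2), round(n50_ratio, 2))
```

    With the quoted figures, scaffolds contain about 2.7 contigs on average while the scaffold N50 is about 3.5 times the contig N50. This is the kind of relationship between parameters that the comment asks the authors to discuss.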

    The approach used to generate the assembly seems to use 10x PE sequences to scaffold the assembly. Hybrid assembly approaches are available that leverage short reads to improve assembly quality, given the somewhat limited coverage of the PacBio HiFi reads (approx. 12x).

    Recommendation: Minor Revision

    Re-review: The authors addressed all my assembly-related comments in a sufficient manner and provided updates that will benefit the manuscript and the data released with it.