A high-quality genome and comparison of short- versus long-read transcriptome of the palaearctic duck Aythya fuligula (tufted duck)


Abstract

Background

The tufted duck is a non-model organism that experiences high mortality in highly pathogenic avian influenza outbreaks. It belongs to the same bird family (Anatidae) as the mallard, one of the best-studied natural hosts of low-pathogenic avian influenza viruses. Studies in non-model bird species are crucial to disentangle the role of the host response in avian influenza virus infection in the natural reservoir. Such an endeavour requires a high-quality genome assembly and transcriptome.

Findings

This study presents the first high-quality, chromosome-level reference genome assembly of the tufted duck using the Vertebrate Genomes Project pipeline. We sequenced RNA (complementary DNA) from brain, ileum, lung, ovary, spleen, and testis using Illumina short-read and Pacific Biosciences long-read sequencing platforms, which were used for annotation. We found 34 autosomes plus Z and W sex chromosomes in the curated genome assembly, with 99.6% of the sequence assigned to chromosomes. Functional annotation revealed 14,099 protein-coding genes that generate 111,934 transcripts, which implies a mean of 7.9 isoforms per gene. We also identified 246 small RNA families.
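
The mean number of isoforms per gene quoted above follows directly from the two reported counts. A minimal sketch of that arithmetic, assuming nothing beyond the figures in the Findings paragraph (the variable names are illustrative, not taken from the study):

```python
# Counts reported in the Findings paragraph of the abstract.
protein_coding_genes = 14_099
transcripts = 111_934

# Mean number of transcripts (isoforms) per protein-coding gene.
mean_isoforms_per_gene = transcripts / protein_coding_genes

# Rounded to one decimal place, as in the abstract.
print(f"{mean_isoforms_per_gene:.1f} isoforms per gene on average")
```

Note that this is a mean over all annotated protein-coding genes; it says nothing about how isoforms are distributed among individual genes.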

Conclusions

This annotated genome contributes to continuing research into the host response in avian influenza virus infections in a natural reservoir. Our findings from a comparison between short-read and long-read reference transcriptomics contribute to a deeper understanding of these competing options. In this study, both technologies complemented each other. We expect this annotation to be a foundation for further comparative and evolutionary genomic studies, including many waterfowl relatives with differing susceptibilities to avian influenza viruses.

Article activity feed

  1. Abstract

    A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab081), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

    These peer reviews were as follows:

    **Reviewer 1. Qi Zhou**

    Mueller et al. presented a high-quality genome and annotation of tufted duck with combined long-read and short-read techniques here. Tufted duck shows a different susceptibility to avian influenza A viruses (AIV) compared to mallards that share the habitat. So besides adding a new avian genomic resource, tufted duck genome may facilitate the research into the genetic basis of AIV infection. Overall, I think the genome is of high quality, but I do have several comments below:

    The introduction is largely devoted to the great advantage of PacBio over Illumina techniques in elucidating the genome features of non-model species. This is not needed for the readers of GigaScience. I suggest the authors provide more information on the tufted duck. From previous studies, how diverged are the tufted duck and the mallard, in millions of years and at the sequence level? What is the phylogenetic position of the tufted duck in Galloanserae? Are there any lab or field studies of the tufted duck's susceptibility to AIV? What is the potential genetic cause? Also, since it is known in mallards that RIG-I is responsible for the AIV response, is this gene intact, and how is it expressed, in the newly presented tufted duck genome?

    The analyses part needs to show the repeat content of the tufted duck and its comparison to other avian genomes, particularly the repeat content of the Z and W chromosomes. Did the authors look into the centromere or telomere sequences?

    Are the Z and W chromosomes assembled into two intact sequences? If so, evidence is needed to show that there is no chimeric assembly between Z and W, or other autosome sequences, as it is mentioned in the paper that 'most of the genome separated into haplotypes'.

    Tissue-specific expression part: Here, what does 'supported' gene mean exactly? Just to make sure, do the authors mean 'genes' or 'transcripts'?

    'Stringtie2 may discard single-exon transcript model..' Did the authors find that the Stringtie2 results generally have a much lower proportion of single-exon transcripts compared to, say, Iso-Seq data?

    'The average number of transcripts in the long-read pipeline often almost matched..': I am confused here, because Figure 5 shows the opposite result: the supported PacBio genes are much lower in number than those produced by Illumina reads. Are there any results to support this claim?

    'This distribution is much more balanced in the long-read pipeline..' Here the authors may suggest that PacBio Iso-Seq recovers more alternative splicing transcripts than Illumina data. But it is unclear why the supported genes of Iso-Seq are so much lower in number than those of Illumina; could this be caused by the relatively lower coverage of Iso-Seq? So I would conclude that, at least in this study, both techniques are complementary to each other, rather than one performing better than the other.

    How are the TEs annotated by these small RNAs? Apart from miRNAs, there should be a large proportion of piRNAs mapping to TEs.

    Re-Review: For question 3: The authors need to explain more about why they think Figure S1 shows there is no chimeric assembly between the Z and W chromosomes. Figure S1 is just a Hi-C matrix plot, and among the submitted materials I also cannot find the legend explaining the figure.

  2. Background

    **Reviewer 2. Joshua Peñalba**

    This data note by Mueller et al. describes the high-quality, chromosome-scale assembly of the tufted duck and the gene annotation using both Illumina and PacBio sequencing. The authors present and compare the resulting annotations from the different sequencing platforms, which is useful for researchers intending to do RNAseq for annotation.

    I think the details in the note focus primarily on the gene annotation comparison, and the genome assembly has not received adequate attention. Since this will likely be the data note that reports on the genome assembly, it should probably have more details on the chromosomes (in detail below). I understand that one can go into an exhaustive description, but I think these, as a minimum, will give the reader a good idea about the quality of the genome assembly:

    Are the 34 autosomes + sex chromosomes expected? Was there an a priori expectation based on the karyotype or based on the mallard genome assembly? Was this expectation provided during the scaffolding using HiC? Does the assembly size match the expected genome size based on an independent estimate?

    Since this is a chromosome-scale assembly, what are the metrics of individual chromosomes? I see in NCBI that the chromosome numbers have been assigned; is this based on size or homology to chicken chromosomes? What are the lengths of each chromosome? GC content? Gene content? Gaps? How many contigs were scaffolded by HiC to build each chromosome?

    More detail is needed regarding how the manual curation is done so it can be repeated by other researchers. What is the sequencing effort (# lanes, # SMRT cells, etc.) and resulting coverage of the genome from each technology? This doesn't have to be very detailed if it is reported elsewhere, but some idea for the reader will be helpful. I am aware that the VGP pipeline was used for the assembly; is there a GitHub repository for the pipeline that can be included, which has the specific commands and flags for each step?

    Since the annotation comparison was exhaustive, I don't have as many comments on it. The authors may not be explicitly making a recommendation on which approach to use, but what is a good metric that the readers can use to compare the results, considering the orders of magnitude difference in sequencing coverage?

    Regarding the Illumina and PacBio annotation comparisons, since the coverage is substantially different, in what metric are they comparable? Is it similar in sequencing cost? Would the PacBio still underperform in terms of recovered genes if it had the same coverage as the Illumina libraries? What was the sequencing effort for the PacBio IsoSeq?