Long-read HiFi sequencing correctly assembles repetitive heavy fibroin silk genes in new moth and caddisfly genomes

Abstract

Insect silk is a versatile biomaterial. Lepidoptera and Trichoptera display some of the most diverse uses of silk, with varying strength, adhesive qualities, and elastic properties. Silk fibroin genes are long (>20 Kbp), with many repetitive motifs that make them challenging to sequence. Most research thus far has focused on conserved N- and C-terminal regions of fibroin genes because a full comparison of repetitive regions across taxa has not been possible. Using the PacBio Sequel II system and SMRT sequencing, we generated high fidelity (HiFi) long-read genomic and transcriptomic sequences for the Indianmeal moth (Plodia interpunctella) and genomic sequences for the caddisfly Eubasilissa regina. Both genomes were highly contiguous (N50 = 9.7 Mbp/32.4 Mbp, L50 = 13/11) and complete (BUSCO complete = 99.3%/95.2%), with complete and contiguous recovery of silk heavy fibroin gene sequences. We show that HiFi long-read sequencing is helpful for understanding genes with long, repetitive regions.
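As a reading aid for the contiguity figures quoted above, the following minimal sketch shows how N50 and L50 are conventionally derived from a list of contig lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not taken from the manuscript or its pipeline.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50 is the length of the contig at which the running sum of
    lengths (longest first) reaches half of the total assembly size;
    L50 is the number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly, not real data from either genome.
toy_lengths = [50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```

A higher N50 together with a lower L50 indicates that fewer, longer contigs cover half of the assembly.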

This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.64), and the journal has published the reviews under the same license. The reviews are as follows.

**Reviewer 1. Peter Mulhair**

Is the language of sufficient quality?

This manuscript is clear and concise. However, there are some issues with consistency in species names used throughout the manuscript. First, on line 99 Eubasilissa regina should be italicised. Secondly, I would recommend after the initial use of the full names of the species (Plodia interpunctella and Eubasilissa regina) that these be referred to as P. interpunctella and E. regina in the rest of the text. There is inconsistent use of full species names, shortened species names and genus name alone which may cause confusion. Please read through and correct these inconsistencies throughout the manuscript text.

Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples http://gigadb.org/site/guide

No. Missing items from the metadata checklist include (1) Coding gene annotations (GFF), Coding gene nucleotide sequences and Coding gene translated sequences (fasta) and (2) Full (not summary) BUSCO results output files (text).

Is the data acquisition clear, complete and methodologically sound?

Yes. Is there a specific reason why fifth instar larvae were used for RNA sequencing of silk glands of P. interpunctella? If this stage is biologically important, then it may be worth stating why this specific stage was used.

Is there sufficient detail in the methods and data-processing steps to allow reproduction?

Yes. However, the code used for Heavy fibroin gene annotation could be made publicly available to enable reproducibility of this analysis (for example, to apply it to other species or to annotate other repeat-rich genes). This could be uploaded alongside the rest of the relevant code at https://github.com/AshlynPowell/silk-gene-visualization

Is there sufficient information for others to reuse this dataset or integrate it with other data?

Yes. One point worth making is that on Line 161 you state that "The assembly for E. regina is the most contiguous Trichoptera genome assembly to date.". However, there are currently 3 chromosome level assemblies available for Trichoptera on NCBI. I would recommend removing this statement, or changing it by also pointing to these other genomes available.

Other comments:

This work was carried out to a very high quality and I am particularly happy to see more high quality genomic and transcriptomic data for these groups of insects. I also think that annotation of the Heavy fibroin genes is of particular importance and relevance to researchers interested in silk evolution and evolution and annotation of repeat rich proteins.

Recommendation: Minor revision

**Reviewer 2. Reuben W Nowell**

Are all data available and do they match the descriptions in the paper?

No. I wasn't able to access the data with the FTP link provided.

Additional comments:

A very nice piece of work, I have only a few minor comments:

Line 140: "with the k-mer length set to 1" - do you mean 21?
Line 164: great that you provide a link to the GenomeScope html but I recommend to add these kmer plots as additional supplemental figures, they are extremely useful. Just a screenshot of the GenomeScope plot would be fine.
Line 164: in relation to the kmer distributions, in fact both plots look a little bit multimodal to me... especially the Eubasilissa, with peaks at 1n (20x), 2n (40x) and 4n (~80x) coverage. This might indicate tetraploidy, which might explain the large increase in genome span and gene number for this species too. You could run OrthoFinder and look at the distribution of OG membership size, for diploid assemblies it peaks at 2, but you might find a peak at 2 and 4 for Eubasilissa if it is tetraploid.
Line 167: how many contaminant contigs were identified, and where did they come from?
Line 168: the coverage for both species is roughly the same, but the species with the much larger genome is the more contiguous one - any ideas why this is the case?
Line 184: maybe this is a silly question, but how do you know they are full-length? Based on the B. mori BAC sequence?
Line 192: a unit for molecular weight, Da?
Line 224: would be useful to know how many genes are in the Insecta core BUSCO db (i.e., where the 95% comes from).
Line 233: is there a possibility that RepeatModeler has also classified the repeat-rich fibroin genes as 'repeats', and so these are masked in the assemblies?
Line 243: this is a huge difference in gene number! Why? Is the E. regina assembly actually a diploid assembly? Or ploidy > 2? [See above comment on kmer plots].
Line 265: "insects have generally been neglected with respect to genome sequencing efforts" - quite a bold statement and I'm not sure I agree, there has been a huge focus on lepidopteran genomics and much of the early sequencing from initiatives such as Darwin Tree of Life have been on insects (also i5k).
Line 457: Table 2: any idea why the P. interpunctella HiFi assembly is ~60 Mb shorter than the two Illumina assemblies?
Line 475: Figures 2 and 3: these are nice figures but I don't quite follow what the two coloured panels on the left are showing, specifically, why are there two panels? A bit more clarification in the legend needed perhaps.
Line 476: N and C capitalised
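
The orthogroup check suggested in the Line 164 comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be a tab-separated table with one row per orthogroup and one gene-count column per species, and the column names, orthogroup IDs and counts below are hypothetical rather than real results for either genome.

```python
from collections import Counter
import csv
import io


def og_size_distribution(gene_count_tsv, species):
    """Count how many orthogroups hold 1, 2, 3, ... genes from `species`.

    `gene_count_tsv` is the text of a tab-separated table whose header
    names an orthogroup column followed by one count column per species
    (an assumed layout, see the note above).
    """
    reader = csv.DictReader(io.StringIO(gene_count_tsv), delimiter="\t")
    sizes = Counter()
    for row in reader:
        n_genes = int(row[species])
        if n_genes > 0:  # skip orthogroups with no member from this species
            sizes[n_genes] += 1
    return dict(sorted(sizes.items()))


# Hypothetical toy table, not real OrthoFinder output.
toy_table = (
    "Orthogroup\tE_regina\tP_interpunctella\n"
    "OG0000000\t4\t2\n"
    "OG0000001\t2\t1\n"
    "OG0000002\t4\t2\n"
    "OG0000003\t0\t1\n"
)
dist = og_size_distribution(toy_table, "E_regina")
print(dist)
```

Plotting the resulting size-to-frequency mapping for each species would show whether membership sizes cluster around a single value or around several, which is the pattern the reviewer proposes to inspect.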

Recommendation: Minor revision

**Reviewer 3. Martin Pippel**

Is there sufficient information for others to reuse this dataset or integrate it with other data?

Yes. (partly) : To make the study fully reproducible the authors need to upload the PacBio HiFi data (e.g. to NCBI). Otherwise the genome assemblies cannot be reproduced with the available raw data in GenBank.

Any additional comments:

The manuscript entitled “Long-read HiFi sequencing correctly assembles repetitive heavy fibroin silk genes in new moth and caddisfly genomes” from Kawahara et al. describes the de novo assembly and gene annotation of two silk-producing insect species, Plodia interpunctella and Eubasilissa regina. The manuscript is well structured and well written. Sequencing data, assemblies and genome annotations are publicly available and can be reused by the scientific community. Both contig assemblies show very high contiguity and good BUSCO scores. Indeed, several of the 118 P. interpunctella and 53 E. regina contigs show telomere repeat sequence at both ends, indicating that those represent full chromosomes. Furthermore, the authors showed that even long repetitive genes such as silk fibroin genes were assembled without gaps. I consider the manuscript a valuable contribution for the scientific community and have only some minor comments and suggestions:

line 129: which CCS version was used?
line 140: k-mer length was set to 1? Not 21?
line 148: Typo: obd10 reference endopterygota.
line 148: In order to make the BUSCO scores more comparable to other recent Lepidoptera assemblies, it would be better to provide the BUSCO scores for P. interpunctella based on the lepidoptera lineage.
line 158: CCS data should be added to GenBank as well. Usually the raw data (subreads.bam) is lossy converted into fastq files from NCBI, which makes it impossible to reproduce the consensus step with pbCCS or even the assembly.
line 159: Both read coverages are quite high, and the heterozygosity rates of 0.7 (Eubasilissa) and 0.36 (Plodia) are high as well. I was wondering if the alternate assemblies were also of a decent quality and if those are published as well?
line 265: As of today, there are at least 3 other HiFi assemblies available: (GCA_917563855.2, GCA_929108145.1, GCA_917880885.1)
line 457: Table 2 states that E. regina was assembled into 53 contigs. However, the assembly available at NCBI GCA_022840565.1 has 123 contigs!?
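
Reviewer 3's observation that several contigs carry telomere repeats at both ends could be checked with something like the sketch below. It is an assumption-laden illustration, not the reviewer's method: the motif `TTAGG` (the repeat commonly reported for insect telomeres), the window size, the minimum copy number and all function names are choices made here for the example, and the sequences are synthetic.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(end_seq, motif="TTAGG", min_copies=3):
    """True if `end_seq` contains `min_copies` tandem copies of the motif
    on either strand (the motif itself is an assumption, see above)."""
    seq = end_seq.upper()
    return (motif * min_copies in seq) or (revcomp(motif) * min_copies in seq)


def telomere_at_both_ends(contig, window=1000, motif="TTAGG", min_copies=3):
    """Check the first and last `window` bases of a contig for the repeat."""
    start, end = contig[:window], contig[-window:]
    return (has_telomere(start, motif, min_copies)
            and has_telomere(end, motif, min_copies))


# Synthetic toy contigs, not real assembly sequence.
both_ends = "CCTAA" * 5 + "ACGTACGTAC" * 10 + "TTAGG" * 5
one_end = "ACGT" * 50 + "TTAGG" * 5
print(telomere_at_both_ends(both_ends, window=50))
print(telomere_at_both_ends(one_end, window=50))
```

The window should be kept much shorter than the contig; otherwise the two terminal slices overlap and a single repeat array could be counted for both ends.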
