Genome size evolution in the diverse insect order Trichoptera

Abstract

Background

Genome size is implicated in the form, function, and ecological success of a species. Two principally different mechanisms are proposed as major drivers of eukaryotic genome evolution and diversity: polyploidy (i.e., whole-genome duplication) or smaller duplication events and bursts in the activity of repetitive elements. Here, we generated de novo genome assemblies of 17 caddisflies covering all major lineages of Trichoptera. Using these and previously sequenced genomes, we use caddisflies as a model for understanding genome size evolution in diverse insect lineages.

Results

We detect a ∼14-fold variation in genome size across the order Trichoptera. We find strong evidence that repetitive element expansions, particularly those of transposable elements (TEs), are important drivers of large caddisfly genome sizes. Using an innovative method to examine TEs associated with universal single-copy orthologs (i.e., BUSCO genes), we find that TE expansions have a major impact on protein-coding gene regions, with TE-gene associations showing a linear relationship with increasing genome size. Intriguingly, we find that expanded genomes preferentially evolved in caddisfly clades with a higher ecological diversity (i.e., various feeding modes, diversification in variable, less stable environments).

Conclusion

Our findings provide a platform to test hypotheses about the potential evolutionary roles of TE activity and TE-gene associations, particularly in groups with high species, ecological, and functional diversities.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac011), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Gregg Thomas

    This paper presents 17 new insect genomes from the order of caddisflies (Trichoptera). The authors combine these genomes with 9 previously sequenced genomes to analyze genome size evolution across the order. They find that genome size tends to correlate with evolution of repeat elements, specifically expansion of transposable elements (TEs). Interestingly, the authors notice that TE expansions also correlate with gene copy-number (or gene fragment copy-number), even of highly conserved genes used to assess genome completeness. Overall, I find this paper very well written and easy to follow. The genomes and analyses presented provide novel resources and findings for insects in the order Trichoptera, with potential implications beyond. I have only minor suggestions before publication, outlined below.

    1. Regarding the TE and BUSCO gene fragment associations, while I think this is a really interesting analysis, I found the underlying models a bit difficult to understand. Line 236 reads, "To test whether repetitive fragments were due to TE insertions near or in the BUSCO genes or, conversely, due to the proliferation of 'true' BUSCO protein-coding gene fragments…" Is the idea that a BUSCO gene has been duplicated itself and then one copy is either fragmented by a TE insertion or hitch-hikes with a TE (as mentioned on line 501)? Or are these fragments only of BUSCO genes that didn't match a full BUSCO gene at all, but the fragments that did match had unexpectedly high coverage? I guess I'm just confused as to whether a gene duplication needs to precede the TE insertions/hitch-hiking, which is subsequently pseudogenized either prior to or because of the TE activity, or if these are gene losses. I understand how the TE could inflate the coverage of these fragments, but I guess I'm still not clear on how these fragments arise in the first place. Any clarification would be helpful! Also, if the case is that these are fragments of BUSCO genes that have no full matches in the genome, how might assembly contiguity or quality be affecting these matches?

    2. One thing that I noticed throughout the figures is that branch B1, leading to A. sexmaculata, the branch leading to clade A, and the branch leading to clade B (as labeled in Figures 1 and 2) appear to form a polytomy. I don't find this mentioned in the text and am wondering why this relationship remains unresolved with these data. I don't think this has any bearing on the results, since all analyses are done on the tips of the tree, but I think readers looking at these trees will want to know what is going on at that node.

    3. The authors use custom scripts for their BUSCO-TE correlation analysis and provide a link to a Box folder on line 514. I would request that these scripts be put somewhere more stable and accessible (e.g., GitHub). Not only was I asked to log in when clicking the link, but after I had done so the linked folder didn't seem to exist.

    Minor/editorial points

    1. Would the authors be able to report concordance factors for the species tree? I think this should be easy enough with IQ-TREE and is something I ask everyone to do. This may also help answer my question about the polytomy.

    2. The authors do a good job of mentioning and citing programs used throughout the manuscript but seem to skip this in the Assembly section (starting on Line 398). "First, we applied a long-read assembly method…" Which one? Same for "de novo hybrid assembly approaches." I see that assembly is covered in detail in the Supplement, but I think the main programs used (wtdbg2 and MaSuRCA) should be named in the main text.

    3. Lines 281-282: I think some of the brackets and parentheses here are mismatched or unclosed.

  2. Reviewer 1: Julie Blommaert

    Summary of the paper and overall impression

    In their paper, "Genome size evolution in the diverse insect order Trichoptera", Heckenhauer et al. report a 14-fold variation in genome size in caddisflies. The authors find evidence for increases in transposable elements associated with larger genomes, and report that in caddisflies living in less stable environments, some genes are replicated in association with transposable elements. Overall, this paper represents a comprehensive collection of data. However, I have some concerns about the reporting of methods, some analyses, and some conclusions. To support some of the conclusions, namely that WGD or large-scale duplications do not play a role in caddisfly genomes, I believe the authors could perform additional analyses. Further, I was left confused by the descriptions of the methods, especially around the replicated BUSCO gene analyses. Please see my comments below.

    Main comments:

    1. The authors report that their gene-age distribution analyses do not support the hypothesis of a WGD, but given previous suggestions that WGDs are important in these species, the authors should conduct additional analyses (e.g., Smudgeplot, minor allele frequency distributions in single-copy genes) to rule out this possibility. While it can indeed be difficult to find a balance between the evidence of absence and an absence of evidence, more effort should go into resolving the matter of WGD in caddisflies. Some of the GenomeScope peaks, and some of the coverage peaks from the backmap approach, seem to at least hint at large-scale duplications or variations in copy number. Further analyses should also consider whether assembled gene copies may be collapsed duplicates.

    2. I admit I am confused by the terminology around the TE-associated BUSCO genes. Are these cases where BUSCO has reported a high number of duplicates? Or where BUSCO-annotated regions have a high coverage? Two things need to be clarified here: what made them stick out in the first place (coverage? duplications?), and what are they really (TEs that BUSCO mistook for BUSCOs? fragments of real BUSCOs attached to TEs?).

    Minor comments:

    1. Lines 53-57: "Genome size can vary widely among closely and distantly related species, though our knowledge is still scarce for many non-model groups. This is especially true for insects, which account for much of the earth's species diversity. To date 1,345 insect genome size estimates have been published, representing less than 0.15% of all described insect species." While I appreciate the authors' point that relatively little genome size data is available and that only a small proportion of non-model insects are in the Animal Genome Size Database, this is the case for all groups, and insects actually represent the largest group of invertebrates in the AGSDb. However, this does not mean insects, or chironomids, are a poor system to study this in, so the authors could reframe this first sentence to justify the study system with something more than highlighting how understudied this is in insects.

    2. Line 76: correct to "In insects, the KNOWN ranges of genomic repeat proportion are…"

    3. Lines 89-91: Why are species rich groups a better system to study RE evolution and environmental interactions than e.g. populations, species complexes, recently diverged species, or groups in the process of speciation?

    4. Lines 113-115: The data description does not, in my opinion, need to justify the species selection, since this is done in the intro.

    5. Genome size estimates: sequencing-based estimates can also be affected by GC content, especially in libraries that were produced using PCR. This may be a useful point regarding the differences between FCM and sequencing-based estimates.

    6. RepeatModeler versions are inconsistent: Line 463 says v2, but an earlier line says v1.

    7. Lines 468-469: What did you use to merge the RepeatMasker .out files?

    8. All read-based analyses: were they run on decontaminated read libraries? If so, please briefly clarify this in the main manuscript. Genome size with GenomeScope: Lines 444-448; RepeatExplorer: Lines 471-479.

    9. Why only use dnaPipeTE for repeat divergences and not also abundances? Does dnaPipeTE agree with RepeatExplorer?

    10. Line 495: What is meant by "BUSCO genes showed regions of unexpected high copy number…"? Are these genes reported by BUSCO as duplicated or is this referring to increased coverage?

    11. Lines 506-507: "We used copy number profiles to identify BUSCO genes with repetitive sequences based on coverage profiles". The meaning of this is unclear. The reported copy number from BUSCO? Coverage of mapped reads?

    12. Table 1: Please report the full BUSCO summary (e.g. C:39.7%[S:39.2%,D:0.5%],F:35.8%,M:24.5%,n:2442) for each species. Lumping complete and fragmented together is unnecessary, and readers are usually interested enough in the full complement of BUSCOs that it should be in the main paper rather than in the supplements.

    13. Coverages from the backmap method can and should be compared to GenomeScope kcov estimates (while correcting for k-mer size; see here for a brief explanation: https://www.biostars.org/p/221672/). This will validate both approaches and offer further evidence when considering polyploidy.

    14. In the supplementary note about TAGC plots, Figures S31, S36, S38, S44, S45, S46, and S47 don't list contaminant exclusion criteria. If contaminants weren't removed, this needs to be stated and, in some cases, especially those where there are different "blobs" (e.g., S47), justified.

    15. Supplementary note 9: Figure reference is wrong?

    16. Supplementary note 10: Can coverage comparisons using average BUSCO coverage be re-run using corrected kcov estimates? This would validate the BUSCO coverage approach.

    17. Supp Data 1: Coverage estimates would be more accurate if based on FCM measurements and total sequenced bp (before and after decontamination) and can also be compared to corrected kcov estimates

    18. Limnephilus lunatus has too low coverage to get reliable genomescope
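
Minor comments 13, 16, and 17 above rest on the same arithmetic. A k-mer coverage estimate has to be rescaled by read length and k before it is comparable to a base coverage, and an independent expected coverage can be computed from total sequenced bases and a flow-cytometry (FCM) genome size. The following is a minimal Python sketch of that arithmetic, not the authors' or GenomeScope's implementation. The function names and all input values are hypothetical, and it assumes a single representative read length. How GenomeScope's kcov maps onto haploid versus diploid coverage is not handled here and should be checked against GenomeScope's own documentation.

```python
def kmer_to_base_coverage(kcov: float, read_length: float, k: int) -> float:
    """Rescale k-mer coverage to base coverage: C = Ck * L / (L - k + 1).

    A read of length L contributes L - k + 1 k-mers, so k-mer coverage is
    lower than base coverage by the factor (L - k + 1) / L.
    """
    if k < 1 or read_length < k:
        raise ValueError("need 1 <= k <= read_length")
    return kcov * read_length / (read_length - k + 1)


def expected_base_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Expected coverage from sequencing yield and an independent genome size
    estimate (for example, one measured by flow cytometry)."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


if __name__ == "__main__":
    # Hypothetical inputs for illustration only; none come from the manuscript.
    kcov = 20.0            # k-mer coverage reported by a k-mer profile fit
    read_length = 150      # representative read length in bp
    k = 21                 # k-mer size used to build the profile
    total_bases = 30e9     # total sequenced bases after decontamination
    fcm_genome_size = 1e9  # genome size in bp from flow cytometry

    corrected = kmer_to_base_coverage(kcov, read_length, k)
    expected = expected_base_coverage(total_bases, fcm_genome_size)
    print(f"corrected k-mer-based coverage: {corrected:.2f}x")
    print(f"expected coverage from yield and FCM size: {expected:.2f}x")
```

Comparing the two numbers per species, alongside the mean BUSCO coverage from read mapping, is one way to see whether the approaches agree. A large, consistent discrepancy would be worth following up when considering polyploidy or collapsed duplicates.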