GenomeDelta: detecting recent transposable element invasions without repeat library

Abstract

To evade repression by the host defense, transposable elements (TEs) are occasionally horizontally transferred (HT) to naive species. TE invasions triggered by HT may be much more abundant than previously thought. For example, previous studies in Drosophila melanogaster found 11 TE invasions over the past 200 years. A major limitation of current approaches for detecting recent invasions is the need for a repeat library, which is notoriously difficult to generate. To address this, we developed GenomeDelta, a novel approach for identifying sample-specific sequences, such as recently invading TEs, without prior knowledge of the sequence. It can thus be used with model and non-model organisms. As input, GenomeDelta requires a long-read assembly and short-read data. It finds sequences in the assembly that are not represented in the short-read data. Beyond identifying recent TE invasions, GenomeDelta can detect sequences with spatially heterogeneous distributions, recent insertions of viral elements and recent lateral gene transfers. We thoroughly validated GenomeDelta with simulated and real data from extant and historical specimens. Finally, we demonstrate that GenomeDelta can reveal novel biological insights: we discovered the three most recent TE invasions in Drosophila melanogaster and a novel TE with a geographically heterogeneous distribution in Zymoseptoria tritici.
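The core idea of finding assembly sequences that are not represented in the short-read data can be illustrated with a small sketch. It assumes the short reads have already been mapped to the assembly and summarised as a per-base coverage list for one contig; that preprocessing step, the function name `uncovered_regions`, and the `min_cov` and `min_len` thresholds are all hypothetical choices made for illustration, not GenomeDelta's actual implementation or parameters.

```python
from typing import List, Tuple


def uncovered_regions(
    coverage: List[int], min_cov: int = 1, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals whose coverage stays below
    min_cov for at least min_len consecutive bases.

    Both thresholds are illustrative assumptions, not GenomeDelta defaults.
    """
    regions = []
    start = None  # start of the current low-coverage run, if any
    for pos, depth in enumerate(coverage):
        if depth < min_cov:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at pos; keep it only if it is long enough.
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    # Handle a run that extends to the end of the contig.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions


# Toy per-base coverage for a single contig (made-up numbers).
coverage = [12, 10, 0, 0, 0, 0, 11, 9, 0, 13, 0, 0, 0]
print(uncovered_regions(coverage))  # [(2, 6), (10, 13)]
```

In this toy example the single uncovered base at position 8 is ignored because it is shorter than `min_len`, which mirrors the intuition that isolated coverage dropouts are more likely noise than sample-specific sequence.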

Article activity feed

  1. GenomeDelta

    Thank you for providing a wonderful tool and some really interesting insights from Drosophila! I noticed that the GitHub repo contains instructions for installing on multiple systems, and I'm wondering whether providing a Nextflow- or Snakemake-based approach would be useful for general users. I could see this being especially useful for the multi-sample input (both many reads vs. a single assembly and many reads vs. many assemblies), so users could provide a single input folder and avoid writing nested for-loops.

  2. recent lateral gene transfer

    This is the first use case I thought of while reading the manuscript - finding recent HGT events beyond coding regions could be an extremely powerful use of this tool. I'd love to see an example of this!

  3. high quality genome assembly

    Would you be able to provide insight on what "high quality" means in this case? I assume the bar to confidently call small gaps against fragmented reads is much higher than in some other cases, so some guidance on what level of quality is warranted here would be great.

  4. Since the three novel sequences have a low bias (i.e. close to zero), they may be considered promising candidates

    Oh, I see this section guides users through finding candidates. I wonder if this could be incorporated into the earlier section (or point to this section), since it's helpful.

  5. The most promising sequence (low coverage bias, high copy number and substantial length) corresponds to KoRV

    I'm a little confused about how I'd find the most promising sequence from looking at Figure 2D without knowing KoRV was present for that particular sequence. Are the three measures mentioned the best for gauging this, and is there a weighting to apply to each? For example, is low coverage bias more important than substantial length, or should they be considered equally? Also, what defines "substantial length" in the context of a TE?

  6. This can be explained by the fact that degraded fragments of these TEs, likely the remnants of ancient invasions, are present in all genomes, including the genome of the strain sampled at 1815

    Could you comment more on the impact of degraded/endogenized transposable element sequences on the predictive accuracy of GenomeDelta? I imagine that remnants generated over decades or centuries would significantly obfuscate the results for that particular element, but I'm not sure whether that is accurate.