GenomeDelta: detecting recent transposable element invasions without repeat library

Abstract

We present GenomeDelta, a novel tool for identifying sample-specific sequences, such as recent transposable element (TE) invasions, without requiring a repeat library. GenomeDelta compares high-quality assemblies with short-read data to detect sequences in the assembly that are absent from the short reads. It is applicable to both model and non-model organisms and can identify recent TE invasions, spatially heterogeneous sequences, viral insertions, and horizontal gene transfers. GenomeDelta was validated with simulated and real data and used to discover three recent TE invasions in Drosophila melanogaster and a novel TE with geographic variation in Zymoseptoria tritici.
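
The core idea in the abstract, finding stretches of an assembly that the short reads do not cover, can be illustrated with a minimal sketch. This is not GenomeDelta's actual implementation: the function name `find_coverage_gaps`, the parameters `max_cov` and `min_len`, and the toy coverage values are all assumptions made for illustration. The sketch assumes per-base read depth along one contig has already been computed by mapping the short reads to the assembly.

```python
from typing import List, Sequence, Tuple


def find_coverage_gaps(
    coverage: Sequence[int], max_cov: int = 0, min_len: int = 1000
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where depth <= max_cov
    for at least min_len consecutive bases.

    Hypothetical helper: thresholds are illustrative, not the tool's defaults.
    """
    gaps: List[Tuple[int, int]] = []
    start = None  # start of the current low-coverage run, if any
    for pos, depth in enumerate(coverage):
        if depth <= max_cov:
            if start is None:
                start = pos
        else:
            # The run (if any) ends here; keep it only if long enough.
            if start is not None and pos - start >= min_len:
                gaps.append((start, pos))
            start = None
    # Handle a run that extends to the end of the contig.
    if start is not None and len(coverage) - start >= min_len:
        gaps.append((start, len(coverage)))
    return gaps


# Toy per-base depth for one contig (made-up values).
coverage = [30] * 5 + [0] * 8 + [28] * 4 + [0] * 2 + [31] * 3
gaps = find_coverage_gaps(coverage, max_cov=0, min_len=5)
print(gaps)  # [(5, 13)]
```

In this toy example the 8-base uncovered run is reported, while the 2-base run is discarded as too short. A real analysis would also need to handle mapping ambiguity and many contigs, which this sketch ignores.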

Article activity feed

  1. GenomeDelta

    Thank you for providing a wonderful tool and some really interesting insights from Drosophila! I noticed that the GitHub repo contains instructions for installing on multiple systems, and I'm wondering whether a Nextflow or Snakemake-based workflow would be useful for general users. I could see this being especially useful for multi-sample input (both many reads vs. a single assembly and many reads vs. many assemblies), so users could provide a single input folder without having to write nested for loops.

  2. recent lateral gene transfer

    This is the first use case I thought of while reading the manuscript - finding recent HGT events beyond coding regions could be an extremely powerful use of this tool. I'd love to see an example of this!

  3. high quality genome assembly

    Would you be able to provide insight on what "high quality" means in this case? I assume the bar to confidently call small gaps against fragmented reads is much higher than in some other cases, so some guidance on what level of quality is warranted here would be great.

  4. Since the three novel sequences have a low bias (i.e. close to zero), they may be considered promising candidates

    Oh, I see this section guides users through finding candidates. I wonder if this could be incorporated into the earlier section (or point to this section), since it's helpful.

  5. The most promising sequence (low coverage bias, high copy number and substantial length) corresponds to KoRV

    I'm a little confused about how I'd find the most promising sequence from looking at Figure 2D without knowing KoRV was present for that particular sequence. Are the three measures mentioned the best for gauging this, and is there a weighting to apply to each? For example, is low coverage bias more important than substantial length, or should they be considered equally? Also, what defines "substantial length" in the context of a TE?

  6. This can be explained by the fact that degraded fragments of these TEs, likely the remnants of ancient invasions, are present in all genomes, including the genome of the strain sampled at 1815

    Could you comment more on the impact of degraded/endogenized transposable element sequences on the predictive accuracy of GenomeDelta? I imagine that remnants generated over decades or centuries would significantly obfuscate the results for that particular element, but I'm not sure if that is accurate?