The genetic control of rapid genome content divergence in Arabidopsis thaliana

Christopher J Fiscus
Daniel Koenig

Curated by eLife

eLife Assessment

This important study systematically investigates repeat expansion in the plant Arabidopsis thaliana using a new k-mer-based method, expanding on smaller studies to more comprehensively identify cis- and trans-acting loci associated with repeat dynamics. The approach is methodologically sound and broadly applicable to large-scale short-read datasets for assessing copy number variation and genomic repeat content. While convincing in its scope and novelty, the findings would be further strengthened with exploratory analyses of datasets from other species with more or fewer repeats in their genomes.


Abstract

Genome evolution in eukaryotes is predominantly driven by the dynamics of repetitive sequences, which vary widely in both copy number and sequence composition. The rate of repeat evolution varies both between and within species and is likely modulated by both genetics and environment. To uncover the factors shaping the rate of genome content evolution, we analyzed 1,142 resequenced Arabidopsis thaliana genomes using a novel k-mer-based approach. With this dataset, we characterized genome content variation and identified hypervariable regions that contribute to major differences in repeat abundance. We then treated repeat abundance as a quantitative trait and performed genome-wide association studies to map the genetic basis of copy number variation across more than 400 repeat families. We jointly analyzed these results using a meta-GWAS approach, revealing both cis-acting variants and over 50 trans-acting loci that regulate repeat abundance genome-wide. Finally, we found that purifying selection acts against mutations that increase the rate of genome content divergence, favoring alleles that limit repeat expansion. Together, our results provide new insights into the genetic architecture and evolutionary forces shaping genome evolution in plants.

Version published to 10.7554/elife.108238.1 on eLife
Oct 21, 2025
Version published to 10.7554/elife.108238 on eLife
Oct 21, 2025
eLife
Oct 20, 2025

Reviewer #1 (Public review):

Summary:

Overall, this study is an excellent and systematic investigation of the expansion of repeat sequences in Arabidopsis thaliana, and the genetic mechanisms underlying these expansions. Many of the key findings here confirm smaller studies of both repeat sequence variation and the individual genes associated with the expansion of various repeat classes. The authors present a highly effective and practical approach that requires datasets that are far more readily available than the multiple reference genomes used to annotate repeat variation in recent works. Therefore, they provide an approach that shows significant promise in non-model systems in which far less is known of repeat variation and its underlying drivers.

Strengths:

This is a very methodologically sound study that extends the relatively well-studied Arabidopsis thaliana repeat landscape with more systematic sampling, highlights the loci associated with repeat expansions (many of which were previously identified in a piecemeal manner), and provides some evolutionary inference on these.

Weaknesses:

Regarding cis-QTLs: I foresee at least two causes of these associations: non-repetitive cis-acting sequences that promote or permit the expansion of local repeats, and variation within the repeat sequences that directly tags the expanding sequence. It is arguable whether these are truly two distinct classes, but an attempt to discriminate between them may provide some insight into the local factors that allow for repeat expansion, beyond the mere presence of a repeat sequence. One way to discriminate between them could be to map the ~1300 12-mer frequency profiles onto the reference genome, and to filter any SNPs with elevated 12-mer frequency from the GWAS (or to categorize them independently).
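The filtering idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: it counts k-mers directly in a reference string instead of using the manuscript's 12-mer profiles, it ignores the reverse strand, and the names `kmer_counts`, `max_kmer_count_at`, `split_snps`, the `threshold` parameter, and the toy genome are all hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (forward strand only, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def max_kmer_count_at(seq, pos, k, counts):
    """Highest reference-wide count among the k-mers overlapping 0-based position pos."""
    first = max(0, pos - k + 1)          # leftmost k-mer start that still covers pos
    last = min(pos, len(seq) - k)        # rightmost k-mer start that still covers pos
    return max(counts[seq[i:i + k]] for i in range(first, last + 1))

def split_snps(seq, snp_positions, k=12, threshold=2):
    """Split SNP positions into (unique-context, repeat-context) lists.

    A SNP is called repeat-context if any k-mer covering it occurs at least
    `threshold` times in seq. The threshold is an assumed, tunable cutoff.
    """
    counts = kmer_counts(seq, k)
    unique, repetitive = [], []
    for pos in snp_positions:
        if max_kmer_count_at(seq, pos, k, counts) >= threshold:
            repetitive.append(pos)
        else:
            unique.append(pos)
    return unique, repetitive

# Toy reference: the motif GATTACA occurs twice, flanked by other sequence.
toy_genome = "CCGTAAGC" + "GATTACA" + "TTGGCACT" + "GATTACA" + "AGCCTGAA"
# k=4 keeps the toy example readable; the study itself uses k=12.
unique_snps, repeat_snps = split_snps(toy_genome, [2, 10, 18], k=4, threshold=2)
print(unique_snps, repeat_snps)
```

SNPs landing in the second list could either be dropped from the GWAS or analyzed as a separate category, which are the two options the reviewer raises.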

I also have a question regarding the choice of k=12 in the k-mer profile analyses. Did the authors perform any GWAS with other values of k? If so, how did the results change? I would expect that as k is increased, the associations would become more specific to individual repeat families, possibly to the point where only cis-acting loci are detected. The authors show convincing evidence that k=12 is appropriate; however, I would be interested to see if and how GWAS results vary among, e.g., k=10, 12, 15, and 18.
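The expectation that larger k makes k-mers more family-specific can be illustrated with a toy calculation. The sketch below is an assumption-laden example rather than anything from the manuscript: the two sequences are invented stand-ins for repeat-family consensus sequences that share a 7 bp motif, only the forward strand is considered, and `shared_kmer_counts` is a hypothetical helper.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_counts(seq_a, seq_b, ks):
    """For each k, count distinct k-mers present in both sequences."""
    return {k: len(kmers(seq_a, k) & kmers(seq_b, k)) for k in ks}

# Invented consensus sequences for two families; both contain ACGTACG.
family_a = "ACGTACGTTAGC"
family_b = "TTGACGTACGGA"

shared = shared_kmer_counts(family_a, family_b, ks=[5, 6, 7, 8])
for k, n in shared.items():
    print(f"k={k}: {n} shared k-mers")
```

In this toy case the number of shared k-mers falls as k grows and reaches zero once k exceeds the length of the shared motif, so abundance profiles at larger k would track each family more independently. Whether real GWAS results shift toward cis-acting loci in the same way is the open question the reviewer poses.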

Reviewer #2 (Public review):

Summary:

The authors introduce a K-mer-based method for profiling repeat content within a species, applied here to 1,142 A. thaliana genomes sequenced with short reads. This approach allowed them to bypass the challenges of genome assembly, particularly for repetitive regions, while still quantifying copy number variation. Their analysis identified >50 trans-acting loci regulating repeat abundance, enriched for genes involved in DNA repair, replication, and methylation. They also speculate on the role of selection in shaping genome repeat content, arguing that purifying selection tends to suppress alleles that promote repeat expansion.

The work presents a scalable way to extract meaningful insights from the large quantities of short-read datasets available. However, I have several concerns regarding the methodology, scope of claims, and interpretation of results.

Strengths:

The authors leverage a large dataset of >1,100 A. thaliana samples. The scale of the study is impressive and clearly bolsters their findings. Additionally, this provides a framework for future, large-scale studies and offers a solid foundation for hypothesis generation. The k-mer-based method is generally practical for large-scale analysis and should be transferable to other datasets. Finally, the authors are commendably upfront about many of the project's limitations.

Weaknesses:

The decision to use k=12 is loosely justified. While the authors performed a sweep of k-mer lengths (from 5 to 20) and noted computational constraints, the choice is highly dataset-specific. Benchmarking across different k values with additional datasets (especially including other species) would strengthen confidence in the robustness of the method.

All analyses rely exclusively on the TAIR10 reference genome, which is incomplete and known to collapse certain repetitive regions. This dependence raises concerns that some repeats (especially recently expanded or highly variable ones) are systematically undercounted. With improved A. thaliana assemblies now available, testing the method against a more complete reference would alleviate these concerns.

The manuscript's conclusions are framed in very broad terms (e.g., "shaping genome evolution in plants"). However, the study is restricted to a single species, A. thaliana, which may not represent other plants. While the findings may suggest general principles, the claims in the abstract and conclusion should be moderated to reflect the study system more accurately.

The identification of >50 trans-acting loci enriched for DNA repair and replication genes is compelling, but the conclusions remain correlational.

Version published to 10.1101/2025.06.11.659220 on bioRxiv
Jun 16, 2025
