The genetic control of rapid genome content divergence in Arabidopsis thaliana
Curation statements for this article:
  Curated by eLife
  eLife Assessment: This important study systematically investigates repeat expansion in the plant Arabidopsis thaliana using a new k-mer-based method, expanding on smaller studies to more comprehensively identify cis- and trans-acting loci associated with repeat dynamics. The approach is methodologically sound and broadly applicable to large-scale short-read datasets for assessing copy number variation and genomic repeat content. While convincing in its scope and novelty, the findings would be further strengthened with exploratory analyses of datasets from other species with more or fewer repeats in their genomes.
Listed in
- Evaluated articles (eLife)
Abstract
Genome evolution in eukaryotes is predominantly driven by the dynamics of repetitive sequences, which vary widely in both copy number and sequence composition. The rate of repeat evolution varies both between and within species and is likely modulated by both genetics and environment. To uncover the factors shaping the rate of genome content evolution, we analyzed 1,142 resequenced Arabidopsis thaliana genomes using a novel k-mer-based approach. With this dataset, we characterized genome content variation and identified hypervariable regions that contribute to major differences in repeat abundance. We then treated repeat abundance as a quantitative trait and performed genome-wide association studies to map the genetic basis of copy number variation across more than 400 repeat families. We jointly analyzed these results using a meta-GWAS approach, revealing both cis-acting variants and over 50 trans-acting loci that regulate repeat abundance genome-wide. Finally, we found that purifying selection acts against mutations that increase the rate of genome content divergence, favoring alleles that limit repeat expansion. Together, our results provide new insights into the genetic architecture and evolutionary forces shaping genome evolution in plants.
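As a rough illustration of how repeat abundance can be read off short reads without an assembly, the sketch below counts k-mers in a set of reads and converts the counts to relative abundances that could serve as a quantitative trait per accession. It is not the authors' pipeline: the function names (`canonical`, `count_kmers`, `relative_abundance`), the use of canonical (strand-collapsed) k-mers, and the normalization by total k-mer count are all assumptions made for this example.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=12):
    """Count canonical k-mers across reads, skipping windows with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def relative_abundance(counts):
    """Normalize counts by the total so samples of different depth are comparable
    (an assumed normalization, chosen only for illustration)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Applied per accession, the resulting abundance of a given k-mer (or of a group of k-mers assigned to one repeat family) would be the phenotype passed to an association test.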
Article activity feed
- 
      Reviewer #1 (Public review):
Summary: Overall, this study is an excellent and systematic investigation of the expansion of repeat sequences in Arabidopsis thaliana, and the genetic mechanisms underlying these expansions. Many of the key findings here confirm smaller studies of both repeat sequence variation and the individual genes associated with the expansion of various repeat classes. The authors present a highly effective and practical approach that requires datasets that are far more readily available than the multiple reference genomes used to annotate repeat variation in recent works. Therefore, they provide an approach that shows significant promise in non-model systems in which far less is known of repeat variation and its underlying drivers.
Strengths: This is a very methodologically sound study that extends the relatively well-studied Arabidopsis thaliana repeat landscape with more systematic sampling, highlights the loci associated with repeat expansions (many of which were previously identified in a piecemeal manner), and provides some evolutionary inference on these.
Weaknesses: Regarding cis-QTLs: I foresee at least two causes of these associations: non-repetitive cis-acting sequences that promote or permit the expansion of local repeats, and variation within the repeat sequences themselves that directly tags the expanding sequence. It is arguable whether these are truly two distinct classes, but an attempt to discriminate between them may provide some insight into the local factors that allow for repeat expansion, beyond the mere presence of a repeat sequence. One way to discriminate between them could be to map the ~1300 12-mer frequency profiles onto the reference genome and filter any SNPs with elevated 12-mer frequency from the GWAS (or categorize them independently).
I also have a question regarding the choice of k=12 in the k-mer profile analyses. Did the authors perform any GWAS with other values of k? If so, how did the results change? I would expect that as k is increased, the associations would become more specific to individual repeat families, possibly to the point where only cis-acting loci are detected. The authors show convincing evidence that k=12 is appropriate; however, I would be interested to see whether and how GWAS results vary among, e.g., k=10, 12, 15, and 18.
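The filtering idea in the cis-QTL comment can be sketched in a few lines. The example below is only an illustration of the general technique, not something taken from the manuscript: it counts k-mers in a reference sequence and separates SNP positions into those overlapped only by low-copy k-mers and those overlapped by at least one multi-copy k-mer. The function names, the `max_copies` threshold, the 0-based coordinates, and the single-strand counting (no reverse complements) are all assumptions.

```python
from collections import Counter


def reference_kmer_counts(reference: str, k: int) -> Counter:
    """Count every k-mer on the forward strand of the reference (simplification)."""
    ref = reference.upper()
    return Counter(ref[i:i + k] for i in range(len(ref) - k + 1))


def split_snps_by_repeat_context(reference, snp_positions, k=12, max_copies=1):
    """Split 0-based SNP positions into (unique_context, repetitive_context).

    A SNP is called repetitive if any reference k-mer overlapping it occurs
    more than `max_copies` times in the reference (hypothetical threshold).
    """
    ref = reference.upper()
    counts = reference_kmer_counts(ref, k)
    unique, repetitive = [], []
    for pos in snp_positions:
        # Start indices of all k-mer windows that cover this position.
        first = max(0, pos - k + 1)
        last = min(pos, len(ref) - k)
        copies = [counts[ref[i:i + k]] for i in range(first, last + 1)]
        if copies and max(copies) > max_copies:
            repetitive.append(pos)
        else:
            unique.append(pos)
    return unique, repetitive
```

SNPs in the second list could then be excluded from the GWAS or analyzed as a separate category, which is the distinction the comment asks for.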
- 
      Reviewer #2 (Public review):
Summary: The authors introduce a k-mer-based method for profiling repeat content within a species, applied here to 1,142 A. thaliana genomes sequenced with short reads. This approach allowed them to bypass the challenges of genome assembly, particularly for repetitive regions, while still quantifying copy number variation. Their analysis identified >50 trans-acting loci regulating repeat abundance, enriched for genes involved in DNA repair, replication, and methylation. They also speculate on the role of selection in shaping genome repeat content, arguing that purifying selection tends to suppress alleles that promote repeat expansion. The work presents a scalable way to extract meaningful insights from the large quantities of short-read datasets available. However, I have several concerns regarding the methodology, scope of claims, and interpretation of results.
Strengths: The authors leverage a large dataset of >1,100 A. thaliana samples. The scale of the study is impressive and clearly bolsters their findings. Additionally, this provides a framework for future, large-scale studies and offers a solid foundation for hypothesis generation. The k-mer-based method is generally practical for large-scale analysis and should be transferable to other datasets. Finally, the authors are commendably upfront about many of the project's limitations.
Weaknesses: The decision to use k=12 is loosely justified. While the authors performed a sweep of k-mer lengths (from 5 to 20) and noted computational constraints, the choice is highly dataset-specific. Benchmarking across different k values with additional datasets (especially including other species) would strengthen confidence in the robustness of the method.
All analyses rely exclusively on the TAIR10 reference genome, which is incomplete and known to collapse certain repetitive regions. This dependence raises concerns that some repeats (especially recently expanded or highly variable ones) are systematically undercounted. With improved A. thaliana assemblies now available, testing the method against a more complete reference would alleviate these concerns.
The manuscript's conclusions are framed in very broad terms (e.g., "shaping genome evolution in plants"). However, the study is restricted to a single species, A. thaliana, which may not represent other plants. While the findings may suggest general principles, the claims in the abstract and conclusion should be moderated to reflect the study system more accurately.
The identification of >50 trans-acting loci enriched for DNA repair and replication genes is compelling, but the conclusions remain correlational.
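The concern about the choice of k can be explored on a toy scale with a sweep like the one below. This is a minimal sketch under stated assumptions, not the benchmarking the authors performed: the function name `kmer_sweep`, the default list of k values, and the summary statistics (distinct k-mers and the fraction of distinct k-mers seen more than once) are chosen only for illustration. The one fixed fact it relies on is that there are 4^k possible DNA k-mers.

```python
from collections import Counter


def kmer_sweep(sequence: str, ks=(5, 8, 12, 16, 20)):
    """Summarize how k-mer specificity changes with k for one sequence.

    For each k, report the number of possible k-mers (4**k), the number of
    distinct k-mers observed, and the fraction of distinct k-mers that occur
    more than once. Values of k longer than the sequence are skipped.
    """
    seq = sequence.upper()
    rows = []
    for k in ks:
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue
        counts = Counter(seq[i:i + k] for i in range(n_windows))
        repeated = sum(1 for n in counts.values() if n > 1)
        rows.append({
            "k": k,
            "possible": 4 ** k,
            "distinct": len(counts),
            "repeated_fraction": repeated / len(counts),
        })
    return rows
```

As k grows, fewer k-mers recur by chance, so a repeated k-mer is more likely to mark a specific repeat family; the cost is a k-mer space that grows fourfold with each added base, which is consistent with the computational constraints mentioned in the review.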