Genomic Foundation Models Reveal Chromatin-Domain-Scale Transposable Element Impacts on Rice Genome Architecture
Abstract
Alignment-based detection of transposable element (TE) insertion polymorphisms suffers from reference bias and multi-mapping errors in repetitive genomic regions, creating a fundamental validation bottleneck for population-scale structural variant catalogs. Here, we demonstrate that the OneGenome-Rice (OGR) genomic foundation model (GFM), a 1.25-billion-parameter Mixtral architecture trained on 422 rice genomes without TE annotations, provides an orthogonal, alignment-free approach that resolves TE-mediated structural divergence at chromatin-domain resolution. At the CTB4a cold-tolerance locus on chromosome 4, OGR embeddings revealed that the aus subpopulation (NONA_BOKRA) carries 2.2-fold higher structural divergence from indica than japonica, consistent with its 728 subpopulation-exclusive cold-protective TE insertions. Sliding-window analysis across 4.4 megabases identified a 25.6-fold divergence enhancement at TE clusters relative to the conserved CTB4a gene body. Critically, the minimal effective resolution was established at approximately 20 kilobases, which corresponds to the median size of topologically associating domains (TADs) in the rice genome, whereas individual TE sites at 500 base pairs were undetectable (P = 0.94). Non-neural baselines confirmed that the signal derives from learned representations of genomic context rather than from simple nucleotide statistics. These findings establish GFMs as orthogonal validation tools for population-scale TE genotyping and provide computational evidence that TE functional effects are organized at the chromatin-domain level, with direct implications for prioritizing functional TE variants in crop breeding.
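To make the sliding-window comparison concrete, the following is a minimal sketch of how a per-window embedding divergence and a fold enhancement at TE clusters might be computed. It is not the authors' pipeline: the abstract does not state the divergence metric, so cosine distance is an assumption, and the function names `cosine_divergence` and `fold_enhancement`, the window count, and the embedding dimension are all hypothetical. The arrays are synthetic and carry no biological meaning.

```python
import numpy as np


def cosine_divergence(a, b):
    """Row-wise cosine distance (1 - cosine similarity).

    a, b: arrays of shape (n_windows, dim), one embedding per genomic window
    for each of two accessions. Cosine distance is an assumed metric here.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - dot / norms


def fold_enhancement(divergence, te_mask, background_mask):
    """Mean divergence in TE-cluster windows divided by the mean in background windows."""
    divergence = np.asarray(divergence, dtype=float)
    return float(divergence[te_mask].mean() / divergence[background_mask].mean())


# Synthetic toy data (hypothetical): 8 windows, 16-dimensional embeddings.
rng = np.random.default_rng(0)
n_windows, dim = 8, 16
ref = rng.normal(size=(n_windows, dim))
alt = ref + 0.01 * rng.normal(size=(n_windows, dim))  # nearly identical accession

# Mark two windows as "TE clusters" and perturb them more strongly.
te_mask = np.zeros(n_windows, dtype=bool)
te_mask[[2, 3]] = True
alt[te_mask] += rng.normal(size=(int(te_mask.sum()), dim))
background_mask = ~te_mask  # stands in for conserved gene-body windows

div = cosine_divergence(ref, alt)
print("fold enhancement (synthetic):", fold_enhancement(div, te_mask, background_mask))
```

In a real analysis the embeddings would come from the model for matched windows across accessions, and the window size would be varied (for example from 500 base pairs up to tens of kilobases) to find the scale at which the signal becomes detectable.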