Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models
Listed in: Evaluated articles (Arcadia Science)
Abstract
Genomic language models (gLMs) have emerged as a powerful approach for learning genome-wide functional constraints directly from DNA sequences. However, standard gLMs adapted from natural language processing often require extremely large model sizes and computational resources, yet still fall short of classical evolutionary models in predictive tasks. Here, we introduce GPN-Star (Genomic Pretrained Network with Species Tree and Alignment Representation), a biologically grounded gLM featuring a phylogeny-aware architecture that leverages whole-genome alignments and species trees to model evolutionary relationships explicitly. Trained on alignments spanning vertebrate, mammalian, and primate evolutionary timescales, GPN-Star achieves state-of-the-art performance across a wide range of variant effect prediction tasks in both coding and non-coding regions of the human genome. Analyses across timescales reveal task-dependent advantages of modeling more recent versus deeper evolution. To demonstrate its potential to advance human genetics, we show that GPN-Star substantially outperforms prior methods in prioritizing pathogenic and fine-mapped GWAS variants; yields unprecedented enrichments of complex trait heritability; and improves power in rare variant association testing. Extending beyond humans, we train GPN-Star for five model organisms – Mus musculus, Gallus gallus, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana – demonstrating the robustness and generalizability of the framework. Taken together, these results position GPN-Star as a scalable, powerful, and flexible new tool for genome interpretation, well suited to leverage the growing abundance of comparative genomics data.
Article activity feed
A diagram of the GPN-Star model architecture. The input to the model is a whole-genome alignment window. The target sequences and source sequences are constructed from the alignment window. The source sequences are compressed into clade-level embeddings via attention pooling following the species tree. The target sequences are encoded through a stack of GPN-Star encoder blocks, where the phylogeny-informed cross-attention module integrates information from the source sequences guided by evolutionary distances between species based on the species tree.
I think this is a fantastic contribution, and it makes absolute sense that accounting for evolutionary relationships using the attention mechanism would afford such a performance boost!
I do wonder though, taking this idea one step further, how much might be lost by using a fixed species tree topology to bias clade-level attention, as opposed to using per-window trees to bias per-sequence attention. Obviously the latter would come with significant additional computational cost, but I can't help but suspect that meaningful signal is being lost by not accounting for genomic heterogeneity in evolutionary relationships.
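To make the contrast concrete, here is a minimal NumPy sketch of cross-attention with a distance-based bias. It is not the GPN-Star implementation: the function name, the `penalty` parameter, the choice to subtract distance from the attention scores, and all distance values are assumptions made purely for illustration. The only difference between the "fixed tree" and "per-window tree" cases is where the distance vector comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_biased_cross_attention(target, sources, distances, penalty=1.0):
    """Hypothetical sketch: each target position attends over the sources
    at the same alignment column, with more distant sources down-weighted.

    target:    (L, d)    target-sequence embeddings for one window
    sources:   (S, L, d) source embeddings (clades or individual species)
    distances: (S,)      evolutionary distance from the target to each source
    """
    d = target.shape[-1]
    # scores[l, s]: scaled dot product between target position l and source s.
    scores = np.einsum("ld,sld->ls", target, sources) / np.sqrt(d)
    # Assumed form of the phylogenetic bias: subtract a distance penalty.
    scores = scores - penalty * distances[None, :]
    weights = softmax(scores, axis=-1)                  # (L, S), rows sum to 1
    pooled = np.einsum("ls,sld->ld", weights, sources)  # (L, d)
    return pooled, weights

rng = np.random.default_rng(0)
L, S, d = 4, 3, 8
target = rng.normal(size=(L, d))
sources = rng.normal(size=(S, L, d))

# Fixed species tree: one distance vector reused for every window (made-up values).
fixed_tree_dist = np.array([0.1, 0.5, 2.0])
# Per-window tree: distances re-estimated for this window (made-up values).
window_tree_dist = np.array([0.6, 0.2, 2.0])

pooled_fixed, w_fixed = distance_biased_cross_attention(target, sources, fixed_tree_dist)
pooled_window, w_window = distance_biased_cross_attention(target, sources, window_tree_dist)
```

In this toy setup, switching to the per-window distances changes which source receives the most attention, which is the kind of local signal a single genome-wide topology cannot express. The added cost is that a tree (or at least a distance vector) has to be estimated for every window.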