Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models

Chengzhong Ye
Gonzalo Benegas
Carlos Albors
Jianan Canal Li
Sebastian Prillo
Peter D. Fields
Brian Clarke
Yun S. Song

Abstract

Genomic language models (gLMs) have emerged as a powerful approach for learning genome-wide functional constraints directly from DNA sequences. However, standard gLMs adapted from natural language processing often require extremely large model sizes and computational resources, yet still fall short of classical evolutionary models in predictive tasks. Here, we introduce GPN-Star (Genomic Pretrained Network with Species Tree and Alignment Representation), a biologically grounded gLM featuring a phylogeny-aware architecture that leverages whole-genome alignments and species trees to model evolutionary relationships explicitly. Trained on alignments spanning vertebrate, mammalian, and primate evolutionary timescales, GPN-Star achieves state-of-the-art performance across a wide range of variant effect prediction tasks in both coding and non-coding regions of the human genome. Analyses across timescales reveal task-dependent advantages of modeling more recent versus deeper evolution. To demonstrate its potential to advance human genetics, we show that GPN-Star substantially outperforms prior methods in prioritizing pathogenic and fine-mapped GWAS variants; yields unprecedented enrichments of complex trait heritability; and improves power in rare variant association testing. Extending beyond humans, we train GPN-Star for five model organisms – Mus musculus, Gallus gallus, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana – demonstrating the robustness and generalizability of the framework. Taken together, these results position GPN-Star as a scalable, powerful, and flexible new tool for genome interpretation, well suited to leverage the growing abundance of comparative genomics data.

Arcadia Science
Sep 25, 2025

A diagram of the GPN-Star model architecture. The input to the model is a whole-genome alignment window. The target sequences and source sequences are constructed from the alignment window. The source sequences are compressed into clade-level embeddings via attention pooling following the species tree. The target sequences are encoded through a stack of GPN-Star encoder blocks, where the phylogeny-informed cross-attention module integrates information from the source sequences guided by evolutionary distances between species based on the species tree.

I think this is a fantastic contribution, and it makes absolute sense that accounting for evolutionary relationships using the attention mechanism would afford such a performance boost!

I do wonder, though, taking this idea one step further, how much might be lost by using a fixed species tree topology to bias clade-level attention, as opposed to using per-window trees to bias per-sequence attention. Obviously, the latter would come with significant additional computational cost, but I can't help but suspect that meaningful signal is being lost by not accounting for genomic heterogeneity in evolutionary relationships.
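
To make the question above concrete, here is a minimal sketch of attention biased by evolutionary distance. It is not the GPN-Star implementation: the subtractive distance penalty, the function names, the toy embeddings, and both sets of distances are assumptions invented for illustration. It only shows how swapping a fixed species-tree distance vector for a hypothetical per-window one would shift the attention weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(query, keys, values, distances, bias_scale=1.0):
    """Attend from one target embedding to several clade embeddings.

    ASSUMPTION: the bias is a plain subtractive penalty proportional to
    evolutionary distance. The model's actual bias function is not
    described in the text above.
    """
    dim = len(query)
    scores = []
    for key, dist in zip(keys, distances):
        similarity = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        scores.append(similarity - bias_scale * dist)
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output coordinate at a time.
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return output, weights


# Toy example: one target position and three hypothetical clade embeddings.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical on purpose
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Made-up distances from a single fixed species tree ...
fixed_tree = [0.1, 0.5, 2.0]
# ... and from a made-up local tree in which clades 0 and 1 swap places.
local_tree = [0.5, 0.1, 2.0]

out_fixed, w_fixed = distance_biased_attention(query, keys, values, fixed_tree)
out_local, w_local = distance_biased_attention(query, keys, values, local_tree)
print("fixed tree weights:", [round(w, 3) for w in w_fixed])
print("local tree weights:", [round(w, 3) for w in w_local])
```

Because the keys are identical, any difference between the two weight vectors comes from the distance bias alone, which is the effect the comment asks about.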

Version published to 10.1101/2025.09.21.677619 on bioRxiv
Sep 21, 2025

Related articles

Protein Language Models Capture Structural and Functional Epistasis in a Zero-Shot Setting

This article has 5 authors:
1. Ananthan Nambiar
2. Sayantani B. Littlefield
3. Carlos Cuellar
4. Rohit Khorana
5. Sergei Maslov
This article has no evaluations. Latest version Sep 17, 2025

NucleicBERT: Deciphering the language of nucleic acids by a large-language model

This article has 4 authors:
1. Utkarsh Upadhyay
2. Julian Herold
3. Markus Götz
4. Alexander Schug
This article has no evaluations. Latest version Sep 6, 2025

Pretrained protein language models choose between sequence novelty and structural completeness

This article has 3 authors:
1. Arjuna M. Subramanian
2. Zachary A. Martinez
3. Matt Thomson
This article has no evaluations. Latest version Oct 3, 2025
