Coalescence and Translation: A Language Model for Population Genetics
This article has been Reviewed by the following groups
Listed in:
- Evaluated articles (Arcadia Science)
Abstract
Probabilistic models such as the sequentially Markovian coalescent (SMC) have long provided a powerful framework for population genetic inference, enabling reconstruction of demographic history and ancestral relationships from genomic data. However, these methods are inherently specialized, constrained by predefined assumptions and/or limited scalability. Recent advances in simulation and deep learning provide an alternative approach: learning to generalize directly from synthetic genetic data in order to infer specific hidden evolutionary processes. Here we reframe the inference of coalescence times as a problem of translation between two biological languages: the sparse, observable patterns of mutation along the genome and the unobservable ancestral recombination graph (ARG) that gave rise to them. Inspired by large language models, we develop cxt, a decoder-only transformer that autoregressively predicts coalescent events conditioned on local mutational context. We show that cxt performs on par with state-of-the-art MCMC-based likelihood models across a broad range of demographic scenarios, including both in-distribution and out-of-distribution settings. Trained on simulations spanning the stdpopsim catalog, the model generalizes robustly and enables efficient inference at scale, producing over a million coalescence predictions in minutes. In addition, cxt produces a well-calibrated approximate posterior distribution over its predictions, enabling principled uncertainty quantification. Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data.
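One way to write down the autoregressive setup described in the abstract is the generic factorization below; the notation is illustrative (t_i for the coalescence time in the i-th genomic window, x for the observed local mutational context) and is not quoted from the preprint:

```latex
% Illustrative autoregressive factorization (notation not taken from the preprint):
% t_i : coalescence time associated with the i-th genomic window
% x   : observed local mutational context along the genome
p(t_1, \dots, t_n \mid x) \;=\; \prod_{i=1}^{n} p\!\left(t_i \mid t_{<i},\, x\right)
```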
Article activity feed
- "robustness"
  One more dimension of robustness that I think could be useful to explore is sources of error in the data (e.g. genotyping and phasing errors). Adding a small amount of noise to genotypes during training could make cxt quite robust to real-dataset errors, giving it a further performance edge over alternatives like Singer; a rough sketch of the idea follows.
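  A minimal sketch of this kind of training-time augmentation, assuming genotypes are stored as a 0/1 haplotype matrix; the function name and error rate are illustrative and not part of cxt:

  ```python
  import numpy as np

  def add_genotype_noise(haplotypes, error_rate=1e-3, rng=None):
      """Flip a small random fraction of 0/1 genotype calls to mimic
      genotyping error; could be applied independently to each training batch."""
      rng = np.random.default_rng() if rng is None else rng
      flips = rng.random(haplotypes.shape) < error_rate
      return np.where(flips, 1 - haplotypes, haplotypes)

  # Example: corrupt a simulated (samples x sites) haplotype matrix.
  sim = np.random.default_rng(0).integers(0, 2, size=(50, 200))
  noisy = add_genotype_noise(sim, error_rate=0.002)
  ```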
- "the model autoregressively learns the conditional distribution:"
  In principle, do you reckon it would be possible to use a random-masking approach (as in models like ESM2) for this problem? Currently, one (entire) side of a region informs the model's prediction for the focal window, while in principle the most informative regions are those immediately to the left and right of the focal window. Random masking as a strategy could allow the model to leverage this information bidirectionally, though it could be technically more challenging; the sketch below illustrates the contrast.
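  A toy contrast between the two training objectives, assuming tokenized per-window inputs; the shapes, mask fraction, and function names are illustrative rather than anything from the preprint:

  ```python
  import torch

  def causal_mask(seq_len):
      """Decoder-style attention mask: position i attends only to
      positions <= i, so only the left context informs each prediction."""
      return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

  def random_masking(tokens, mask_id, mask_frac=0.15):
      """ESM2/BERT-style objective: hide a random subset of positions and
      predict them from the full bidirectional (left and right) context."""
      mask = torch.rand(tokens.shape) < mask_frac
      corrupted = tokens.clone()
      corrupted[mask] = mask_id
      return corrupted, mask  # train to recover tokens[mask] from corrupted input

  # With random masking, a focal window could condition on flanking
  # sequence to both its left and right rather than on the left side only.
  ```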
- "Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data."
  We greatly enjoyed reading this preprint for our internal journal club. This seems like a very principled and useful application of the transformer architecture in biology.
- "Figure 5:"
  What do the different colors correspond to here (the figure is missing a legend)? The text states that the black line is the expectation and describes only one other line, the inference limit, but it's unclear which color that is. I suspect it is the single red line, with the approximate posterior distribution estimated by cxt shown in blue?