Coalescence and Translation: A Language Model for Population Genetics
This article has been Reviewed by the following groups
Listed in:
- Evaluated articles (Arcadia Science)
Abstract
Probabilistic models such as the sequentially Markovian coalescent (SMC) have long provided a powerful framework for population genetic inference, enabling reconstruction of demographic history and ancestral relationships from genomic data. However, these methods are inherently specialized, constrained by predefined assumptions and/or limited scalability. Recent advances in simulation and deep learning provide an alternative approach: learning to generalize directly from synthetic genetic data in order to infer specific hidden evolutionary processes. Here we reframe the inference of coalescence times as a problem of translation between two biological languages: the sparse, observable patterns of mutation along the genome and the unobservable ancestral recombination graph (ARG) that gave rise to them. Inspired by large language models, we develop cxt, a decoder-only transformer that autoregressively predicts coalescent events conditioned on local mutational context. We show that cxt performs on par with state-of-the-art MCMC-based likelihood models across a broad range of demographic scenarios, including both in-distribution and out-of-distribution settings. Trained on simulations spanning the stdpopsim catalog, the model generalizes robustly and enables efficient inference at scale, producing over a million coalescence predictions in minutes. In addition, cxt produces a well-calibrated approximate posterior distribution over its predictions, enabling principled uncertainty quantification. Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data.
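One way to write down the autoregressive setup described in the abstract is the generic factorization below; the notation is illustrative (t_i for the coalescence time in the i-th genomic window, x for the observed local mutational context) and is not quoted from the preprint:

```latex
% Illustrative autoregressive factorization (notation not taken from the preprint):
% t_i : coalescence time associated with the i-th genomic window
% x   : observed local mutational context along the genome
p(t_1, \dots, t_n \mid x) \;=\; \prod_{i=1}^{n} p\!\left(t_i \mid t_{<i},\, x\right)
```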
Article activity feed
- "robustness"
  One more dimension of robustness that I think could be useful to explore is sources of error in the data (e.g. genotyping and phasing errors). Adding a small amount of noise to genotypes during training could make cxt quite robust to real-dataset errors, giving it a further performance edge over alternatives like Singer; a rough sketch of the idea follows.
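  A minimal sketch of this kind of training-time augmentation, assuming genotypes are stored as a 0/1 haplotype matrix; the function name and error rate are illustrative and not part of cxt:

  ```python
  import numpy as np

  def add_genotype_noise(haplotypes, error_rate=1e-3, rng=None):
      """Flip a small random fraction of 0/1 genotype calls to mimic
      genotyping error; could be applied independently to each training batch."""
      rng = np.random.default_rng() if rng is None else rng
      flips = rng.random(haplotypes.shape) < error_rate
      return np.where(flips, 1 - haplotypes, haplotypes)

  # Example: corrupt a simulated (samples x sites) haplotype matrix.
  sim = np.random.default_rng(0).integers(0, 2, size=(50, 200))
  noisy = add_genotype_noise(sim, error_rate=0.002)
  ```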
- "the model autoregressively learns the conditional distribution:"
  In principle, do you reckon it would be possible to use a random-masking approach (as in models like ESM2) for this problem? Currently, one (entire) side of a region informs the model's prediction for the focal window, while in principle the most informative regions are those immediately to the left and right of the focal window. Random masking as a strategy could allow the model to leverage this information bidirectionally, though it could be technically more challenging; the sketch below illustrates the contrast.
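  A toy contrast between the two training objectives, assuming tokenized per-window inputs; the shapes, mask fraction, and function names are illustrative rather than anything from the preprint:

  ```python
  import torch

  def causal_mask(seq_len):
      """Decoder-style attention mask: position i attends only to
      positions <= i, so only the left context informs each prediction."""
      return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

  def random_masking(tokens, mask_id, mask_frac=0.15):
      """ESM2/BERT-style objective: hide a random subset of positions and
      predict them from the full bidirectional (left and right) context."""
      mask = torch.rand(tokens.shape) < mask_frac
      corrupted = tokens.clone()
      corrupted[mask] = mask_id
      return corrupted, mask  # train to recover tokens[mask] from corrupted input

  # With random masking, a focal window could condition on flanking
  # sequence to both its left and right rather than on the left side only.
  ```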
- "Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data."
  We greatly enjoyed reading this preprint for our internal journal club. This seems like a very principled and useful application of the transformer architecture in biology.
- "Figure 5:"
  What do the different colors correspond to here (the figure is missing a legend)? The text states that the black line is the expectation and describes only one other line, the inference limit, but it's unclear which color that is. I suspect it is the single red line, with the approximate posterior distribution estimated by cxt shown in blue?