Modeling the spatiotemporal spread of beneficial alleles using ancient genomes

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This manuscript is of broad interest for evolutionary biologists who seek to understand the dynamics of strongly advantageous mutations across time and space. It presents an elegant framework for inferring the strength of natural selection and the spread of adaptive variants that accounts for spatial and temporal patterns of genetic variation. The authors extend a previously developed statistical inference method, test its performance on simulated data, and apply it to two well-known targets of selection. The development of the method is timely given the growing availability of ancient DNA collections, which have the power to greatly increase the accuracy of selection inferences and parameter estimates.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their name with the authors.)

Abstract

Ancient genome sequencing technologies now provide the opportunity to study natural selection in unprecedented detail. Rather than making inferences from indirect footprints left by selection in present-day genomes, we can directly observe whether a given allele was present or absent in a particular region of the world at almost any period of human history within the last 10,000 years. Methods for studying selection using ancient genomes often rely on partitioning individuals into discrete time periods or regions of the world. However, a complete understanding of natural selection requires more nuanced statistical methods which can explicitly model allele frequency changes in a continuum across space and time. Here we introduce a method for inferring the spread of a beneficial allele across a landscape using two-dimensional partial differential equations. Unlike previous approaches, our framework can handle time-stamped ancient samples, as well as genotype likelihoods and pseudohaploid sequences from low-coverage genomes. We apply the method to a panel of published ancient West Eurasian genomes to produce dynamic maps showcasing the inferred spread of candidate beneficial alleles over time and space. We also provide estimates for the strength of selection and diffusion rate for each of these alleles. Finally, we highlight possible avenues of improvement for accurately tracing the spread of beneficial alleles in more complex scenarios.
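The three data types named in the abstract (called genotypes, pseudohaploid calls, and genotype likelihoods) can all be scored against a model allele frequency at a sample's location and date. The following is a minimal sketch of that idea, not the manuscript's implementation: the record layout, the function names, and the Hardy-Weinberg assumption are illustrative choices, and `p` is assumed to lie strictly between 0 and 1.

```python
import math

def genotype_prior(p):
    """Hardy-Weinberg probabilities of carrying 0, 1 or 2 copies (an assumption)."""
    return ((1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2)

def sample_log_likelihood(sample, p):
    """Log-likelihood of one sample given the model frequency p at its place and date."""
    if sample["kind"] == "pseudohaploid":
        # A single randomly drawn allele: present (1) or absent (0).
        return math.log(p if sample["allele"] == 1 else 1.0 - p)
    if sample["kind"] == "diploid":
        # A confidently called genotype with 0, 1 or 2 copies.
        return math.log(genotype_prior(p)[sample["genotype"]])
    # Genotype likelihoods: weight P(reads | genotype) by the genotype prior.
    prior = genotype_prior(p)
    return math.log(sum(gl * pr for gl, pr in zip(sample["gl"], prior)))

# Hypothetical samples; in a real analysis p would differ per sample.
samples = [
    {"kind": "pseudohaploid", "allele": 1},
    {"kind": "diploid", "genotype": 1},
    {"kind": "gl", "gl": (0.7, 0.25, 0.05)},
]
total = sum(sample_log_likelihood(s, 0.25) for s in samples)
print(total)
```

Summing these terms over all samples, with `p` taken from the model surface at each sample's coordinates and age, gives an objective that a parameter search could maximize.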

Article activity feed

  1. Author Response

    Reviewer #2 (Public Review):

    Dr Muktupavela et al. present a novel likelihood-based method for inferring the strength of natural selection and basic demographic parameters, such as mobility rates, from time-stamped ancient DNA data in a spatially explicit framework. This is an elegant method that is, in many ways, a natural extension to a spatial setting of previous work in the field, which has focussed mainly on inferring natural selection from temporal data. In addition to the simplest scenarios of isotropic dispersal, the authors also consider models with different dispersal rates in longitudinal and latitudinal directions, as well as biased dispersal. Selection strength, dispersal rates and bias are assumed to be constant across space and piecewise constant in time (but it would be very straightforward to relax these assumptions). The bias component of the model is an interesting addition that, in principle, allows one to broadly account for the effect of long-range dispersals, such as the spread of agriculture across Europe from the Fertile Crescent and Bronze Age migrations from the Asian steppes, on the spatiotemporal pattern of allele frequencies.

    Although the main idea is clearly communicated, there is room for improvement of the manuscript regarding investigating the properties of the model and presenting the results. Notably, the authors assume that the age of the mutation is known and correct in their assessment of the performance of the model on simulated data (which may inflate the reported accuracy of the reconstructions) and use estimates from the literature when the method is applied to empirical data. Although it is necessary to specify the age of the allele, this could easily have been treated as a free parameter in the framework. I would like to see a discussion of why the method may not be suitable for this, and a more systematic test for the sensitivity of the method to misspecification of the age (which could be very substantial, especially if the population history has been complex). In the cases where the model is run for different allele age estimates in the manuscript, such as for the lactase persistence scenario, the authors should present the (approximate maximum) likelihoods for the different scenarios in the text.

    An explanation as to why we do not infer the age of the allele (see text below) has been added to the main text under section “Parameter search” (lines 531-533). Briefly, we chose to construct our method so that it takes the age of the allele as an input parameter rather than estimating it, since multiple combinations of allele age and selection coefficient give equally likely solutions. This is demonstrated in Appendix A3.
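    One simple intuition for this trade-off, separate from the Appendix A3 analysis, comes from the non-spatial logistic model dp/dt = s p (1 - p), in which the frequency reached depends only on the product of the selection coefficient and the elapsed time. The sketch below uses hypothetical values for the starting frequency, selection coefficients and ages; it ignores space, drift and sampling.

```python
import math

def logistic_frequency(p0, s, generations):
    """Deterministic allele frequency after `generations` under dp/dt = s*p*(1-p)."""
    growth = math.exp(s * generations)
    return p0 * growth / (1.0 - p0 + p0 * growth)

P0 = 0.001  # hypothetical starting frequency

# Two hypothetical (selection coefficient, age in generations) pairs
# chosen so that the product s * age is the same.
young_strong = logistic_frequency(P0, s=0.04, generations=250)
old_weak = logistic_frequency(P0, s=0.02, generations=500)

print(young_strong, old_weak)
```

    A younger allele under stronger selection and an older allele under weaker selection end at the same frequency here, which is why fixing the age as an input constrains the fit.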

    We also added a description of log-likelihood values when we vary the allele ages under section “Robustness of parameters to the assumed age of the allele” in lines 324-329, the results of which are presented in supplementary Figure 6–Figure Supplement 9 and Figure 8–Figure Supplement 6.

    Briefly, we assessed the likelihood of the best-fitting models while varying the ages of the rs4988235(T) and rs1042602(A) alleles. For the rs4988235(T) allele, the age used in this study (7,441 years) gives the most likely solution among the explored ages. For the rs1042602(A) allele, we found multiple nearly equally likely ages among those at least as old as 15,000 years.

    A further weakness of the method is that it uses the Fisher information matrix to estimate uncertainty. While this works well if the posterior distribution is narrow, it can severely underestimate the uncertainty if this is not the case, in particular if the distribution is non-Gaussian in the tails. It would be better, but perhaps computationally prohibitively expensive, to report Bayesian posterior distributions for the parameters as well as Bayes factors that could be used to formally compare the fit of different models to the data.

    We agree with the reviewer that implementing Bayesian parameter fitting would likely provide a more robust understanding of the uncertainty of the estimates as well as an opportunity to formally compare different models using Bayes factors (although at the cost of an increase of computational intensity). Changing the inference engine of our method in this manner (while keeping it computationally feasible) is something we are currently investigating and hope to release as part of a future Bayesian version of our method. In the meantime, we have added a discussion of this caveat in our manuscript (sixth paragraph).
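    For readers unfamiliar with the Fisher information approach discussed here, the sketch below shows it on a toy one-parameter problem: a binomial allele frequency with hypothetical counts. The curvature of the negative log-likelihood at the maximum gives the information, and its inverse square root gives an asymptotic standard error. This is an illustration only, not the manuscript's code; with several parameters the same idea uses the inverse of the Hessian matrix.

```python
import math

def neg_log_likelihood(p, derived, total):
    """Binomial negative log-likelihood (constant term dropped)."""
    return -(derived * math.log(p) + (total - derived) * math.log(1.0 - p))

def observed_information(p_hat, derived, total, h=1e-4):
    """Second derivative of the negative log-likelihood at p_hat,
    approximated with a central finite difference."""
    f = lambda p: neg_log_likelihood(p, derived, total)
    return (f(p_hat + h) - 2.0 * f(p_hat) + f(p_hat - h)) / (h * h)

derived, total = 30, 100            # hypothetical allele counts
p_hat = derived / total             # maximum-likelihood estimate
info = observed_information(p_hat, derived, total)
std_err = 1.0 / math.sqrt(info)     # asymptotic standard error
ci_95 = (p_hat - 1.96 * std_err, p_hat + 1.96 * std_err)
print(p_hat, std_err, ci_95)
```

    The interval is symmetric by construction, which is the Gaussian approximation the reviewer cautions about when the likelihood surface is wide or skewed.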

    Finally, although the rationale behind the model is clearly described, the detailed descriptions of the model and the numerical implementation have some shortcomings. First, there are typos in the appendix where the continuous model is derived from a discrete approximation (the right-hand side of Eq. (8) should not contain the term p(x,y,t) for it to be consistent with Eqs. (9) and (10)). Second, any differential equation model is incomplete without specifying the boundary conditions. This is especially important here as the assumption of uniform diffusion and advection on the grid is violated by the constraints imposed by the land mask, where the population is assumed to vanish on water areas (suggesting an absorbing boundary condition). Further down in the methods, details are also missing on how Eq. (10) was solved numerically, merely that it was discretized at a certain resolution.

    Looking more closely at Eq. (8), we do believe that the term p(x,y,t) should be there, since it is moved to the left-hand side of Eq. (9) by a simple algebraic rearrangement of the terms of the equation.
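    The rearrangement referred to here can be sketched generically. The equations below assume that Eq. (8) has the usual one-step update form; the manuscript's equations are not reproduced on this page, and F[p] is a hypothetical placeholder for the combined diffusion, advection and selection terms.

```latex
\begin{align}
p(x,y,t+\Delta t) &= p(x,y,t) + \Delta t \, F[p](x,y,t) + o(\Delta t) \\
\frac{p(x,y,t+\Delta t) - p(x,y,t)}{\Delta t} &= F[p](x,y,t) + \frac{o(\Delta t)}{\Delta t} \\
\frac{\partial p}{\partial t} &= F[p](x,y,t) \qquad (\Delta t \to 0)
\end{align}
```

    Under that assumption, p(x,y,t) appears on the right-hand side of the discrete update and is absorbed into the time derivative once it is moved to the left-hand side.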

  2. Evaluation Summary:

    This manuscript is of broad interest for evolutionary biologists who seek to understand the dynamics of strongly advantageous mutations across time and space. It presents an elegant framework for inferring the strength of natural selection and the spread of adaptive variants that accounts for spatial and temporal patterns of genetic variation. The authors extend a previously developed statistical inference method, test its performance on simulated data, and apply it to two well-known targets of selection. The development of the method is timely given the growing availability of ancient DNA collections, which have the power to greatly increase the accuracy of selection inferences and parameter estimates.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    Muktupavela and colleagues extended a model based on two-dimensional partial differential equations to infer the allele frequency trajectories of recent beneficial mutations across space and time. Specifically, the approach fits the model to the allele frequencies computed from a series of ancient DNA samples associated with their corresponding radiocarbon dates, in order to infer the dispersal-related parameters as well as the selection strength under which the allele has been evolving. The authors test the performance of their statistical approach using deterministic and spatially explicit simulations and conclude that the method provides accurate estimates of selection coefficients and relatively acceptable estimates of diffusion and advection parameters. When applied to mutations for which there is currently strong evidence of recent positive selection (the lactase-persistence allele and one of the alleles associated with skin pigmentation, at LCT/MCM6 and TYR, respectively), the statistical method developed by Muktupavela et al. appears to provide estimates of the selection coefficients of the two mutations that are generally in agreement with previous estimates.

    Strengths:

    1. Bringing in genotype information from ancient DNA collections to fit the model is definitely the major improvement of the method proposed by Muktupavela et al. in comparison with its previous version. It is an invaluable source of information to build maps of allele frequency trajectories over the recent history of European populations. Such maps can thus be used together with archaeological and historical data to pinpoint the events underlying some of the inferred changes in allele frequencies.

    2. Another important aspect of their new statistical approach is the additional layer of realism that comes from including advection parameters in the model, which account for the effect of population movements in changing the location of highest allele frequency, and from allowing the parameters to be fit to different time periods. Such different time periods may represent, for instance, differences in population mobility (such as the ones the authors explain) or differences in the strength of selection.

    3. The approach appears to perform relatively well with respect to inferring selection coefficients and diffusion parameters for time periods for which there is considerable ancient DNA information.

    4. The method appears to be very promising as it can be further extended to accommodate novel features (as discussed by the authors) and thus increase its scope.

    Limitations:

    1. While the method might become a powerful tool to study the evolution of allele frequencies over time and space, its current version is tested under the best-case scenario. The authors select alleles that are currently deemed to be among the strongest cases of positive selection in humans, which implicitly also means using ancient DNA collections from the most studied region, Western Eurasia. We now know that cases of positive selection such as the ones studied by the authors are rare (Hernandez RD, et al 2011 Science). Therefore, it remains unclear to what extent the current version of the method can be applied to other alleles, to other regions and even to other species.

    2. Modelling the spread of beneficial alleles continuously across time and space is clearly an advantage in relation to other available methods. However, spatially explicit approaches are sensitive to sampling heterogeneity. While it is true that ancient DNA is accumulating very quickly and temporal/spatial sampling gaps will become less of an issue, that is not the case for a large part of the globe. Yet the authors do not assess the impact of such data limitations on the accuracy of the parameter estimates.

    3. Finally, although parameter estimates such as selection coefficients, diffusion parameters and the geographical origin of the allele are in general well estimated, it is important to keep in mind they might not be robust to the misspecification of the allele age or incorrect inference of the geographical origin of the allele.

  4. Reviewer #2 (Public Review):

    Dr Muktupavela et al. present a novel likelihood-based method for inferring the strength of natural selection and basic demographic parameters, such as mobility rates, from time-stamped ancient DNA data in a spatially explicit framework. This is an elegant method that is, in many ways, a natural extension to a spatial setting of previous work in the field, which has focussed mainly on inferring natural selection from temporal data. In addition to the simplest scenarios of isotropic dispersal, the authors also consider models with different dispersal rates in longitudinal and latitudinal directions, as well as biased dispersal. Selection strength, dispersal rates and bias are assumed to be constant across space and piecewise constant in time (but it would be very straightforward to relax these assumptions). The bias component of the model is an interesting addition that, in principle, allows one to broadly account for the effect of long-range dispersals, such as the spread of agriculture across Europe from the Fertile Crescent and Bronze Age migrations from the Asian steppes, on the spatiotemporal pattern of allele frequencies.

    Although the main idea is clearly communicated, there is room for improvement of the manuscript regarding investigating the properties of the model and presenting the results. Notably, the authors assume that the age of the mutation is known and correct in their assessment of the performance of the model on simulated data (which may inflate the reported accuracy of the reconstructions) and use estimates from the literature when the method is applied to empirical data. Although it is necessary to specify the age of the allele, this could easily have been treated as a free parameter in the framework. I would like to see a discussion of why the method may not be suitable for this, and a more systematic test for the sensitivity of the method to misspecification of the age (which could be very substantial, especially if the population history has been complex). In the cases where the model is run for different allele age estimates in the manuscript, such as for the lactase persistence scenario, the authors should present the (approximate maximum) likelihoods for the different scenarios in the text.

    A further weakness of the method is that it uses the Fisher information matrix to estimate uncertainty. While this works well if the posterior distribution is narrow, it can severely underestimate the uncertainty if this is not the case, in particular if the distribution is non-Gaussian in the tails. It would be better, but perhaps computationally prohibitively expensive, to report Bayesian posterior distributions for the parameters as well as Bayes factors that could be used to formally compare the fit of different models to the data.

    Finally, although the rationale behind the model is clearly described, the detailed descriptions of the model and the numerical implementation have some shortcomings. First, there are typos in the appendix where the continuous model is derived from a discrete approximation (the right-hand side of Eq. (8) should not contain the term p(x,y,t) for it to be consistent with Eqs. (9) and (10)). Second, any differential equation model is incomplete without specifying the boundary conditions. This is especially important here as the assumption of uniform diffusion and advection on the grid is violated by the constraints imposed by the land mask, where the population is assumed to vanish on water areas (suggesting an absorbing boundary condition). Further down in the methods, details are also missing on how Eq. (10) was solved numerically, merely that it was discretized at a certain resolution.
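    To make concrete what specifying the boundary conditions and the numerical scheme would involve, here is a minimal sketch of one explicit finite-difference step for a generic diffusion-advection-selection equation on a masked grid. Everything in it is an assumption for illustration: the equation form and the sign convention of the advection term, the symbols (`diffusion`, `velocity`, `s`), the grid values, and the absorbing treatment of water cells. It is not the scheme used in the manuscript.

```python
import numpy as np

def step(p, land, dt, dx, diffusion, velocity, s):
    """One explicit Euler step of
        dp/dt = D * laplacian(p) - v . grad(p) + s * p * (1 - p)
    with p forced to zero on water cells and outside the grid (absorbing)."""
    padded = np.pad(p, 1, mode="constant", constant_values=0.0)
    center = padded[1:-1, 1:-1]
    row_prev, row_next = padded[:-2, 1:-1], padded[2:, 1:-1]
    col_prev, col_next = padded[1:-1, :-2], padded[1:-1, 2:]
    laplacian = (row_prev + row_next + col_prev + col_next - 4.0 * center) / dx**2
    grad_x = (col_next - col_prev) / (2.0 * dx)   # along columns
    grad_y = (row_next - row_prev) / (2.0 * dx)   # along rows
    vx, vy = velocity
    dpdt = (diffusion * laplacian - vx * grad_x - vy * grad_y
            + s * center * (1.0 - center))
    updated = np.clip(center + dt * dpdt, 0.0, 1.0)
    return np.where(land, updated, 0.0)

# Hypothetical 20 x 20 landscape with water on the last five columns.
land = np.ones((20, 20), dtype=bool)
land[:, 15:] = False
p = np.zeros((20, 20))
p[10, 10] = 0.1                                   # hypothetical origin of the allele

# dt * D / dx**2 = 0.1, below the 0.25 stability limit of this explicit scheme.
for _ in range(50):
    p = step(p, land, dt=0.1, dx=1.0, diffusion=1.0, velocity=(0.1, 0.0), s=0.1)
print(p.max(), p[~land].sum())
```

    A reflecting (zero-flux) boundary would instead copy the neighbouring land value into each water cell before differencing, and the choice between the two changes how much of the allele is lost at coastlines.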

    In summary, this is an elegant framework that accounts for spatial and temporal patterns and is a welcome addition to the existing range of tools for evolutionary inference from ancient DNA data. Although the manuscript in its current incarnation has some shortcomings, I am sure these can be easily overcome, and I look forward to the next version of the manuscript.