Broad-scale variation in human genetic diversity levels is predicted by purifying selection on coding and non-coding elements

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This paper uses state-of-the-art methods and the latest data to answer the question of whether variation in polymorphism levels along the human genome is mostly driven by linked purifying selection or selective sweeps. It makes a very strong case for the former. The paper is exceptionally well written, and should be of interest to anyone wishing to understand patterns of polymorphism.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

This article has been Reviewed by the following groups

Abstract

Analyses of genetic variation in many taxa have established that neutral genetic diversity is shaped by natural selection at linked sites. Whether the mode of selection is primarily the fixation of strongly beneficial alleles (selective sweeps) or purifying selection on deleterious mutations (background selection) remains unknown, however. We address this question in humans by fitting a model of the joint effects of selective sweeps and background selection to autosomal polymorphism data from the 1000 Genomes Project. After controlling for variation in mutation rates along the genome, a model of background selection alone explains ~60% of the variance in diversity levels at the megabase scale. Adding the effects of selective sweeps driven by adaptive substitutions to the model does not improve the fit, and when both modes of selection are considered jointly, selective sweeps are estimated to have had little or no effect on linked neutral diversity. The regions under purifying selection are best predicted by phylogenetic conservation, with ~80% of the deleterious mutations affecting neutral diversity occurring in non-exonic regions. Thus, background selection is the dominant mode of linked selection in humans, with marked effects on diversity levels throughout autosomes.

Article activity feed

  1. Author Response:

    Reviewer #3 (Public Review):

    Murphy et al. further develop the linked selection model of Elyashiv et al. (2016) and apply it to human genetic variation data. This model is itself an extension of the McVicker et al. (2009) paper, which developed a statistical inference method around classic background selection (BGS) theory (Hudson and Kaplan, 1995, Nordborg et al., 1996). These methods fit a composite likelihood model to diversity data along the chromosome, where the level of diversity is reduced by a local factor from some initial "neutral" level π0 down to observed levels. The level of reduction is determined by a combination of both BGS and the expected reduction around substitutions due to a sweep (though the authors state that these models are robust to partial and soft sweeps). The expected reduction factor is a function of local recombination rates and genomic annotation (such as exonic and phylogenetically conserved sequences), as well as the selection parameters (i.e. mutation rates and selection coefficients for different annotation classes). Overall, this work is a nice addition to an important line of work using models of linked selection to differentiate selection processes. The authors find that positive selection around substitutions explains little of the variation in diversity levels across the genome, whereas a background selection model can explain up to 80% of the variance in diversity. Additionally, their model seems to have solved a mystery of the McVicker et al. (2009) paper: why the estimated deleterious mutation rate was unreasonably high. Throughout the paper, the authors are careful not only in their methodology but also in their interpretation of the results. For example, when interpreting the good fit of the BGS model, the authors correctly point out that stabilizing selection on a polygenic trait can also lead to BGS-like reductions.

    Furthermore, the authors have carefully chosen their model's exogenous parameters to avoid circularity. The concern here is that if the input data into the model - in particular the recombination maps and segments likely to be conserved - are estimated or identified using signals in genetic variation, the model's good fit to diversity may be spurious. For example, recombination maps are often estimated from linkage disequilibrium (LD) data which is itself obtained from variation along the chromosome. Murphy et al. use a recombination map based on ancestry switches in African Americans, which should prevent "information leakage" between the recombination map and the BGS model from leading to spuriously good fits. Likewise, the authors use phylogenetic conservation maps rather than those estimated from diversity reductions (such as McVicker et al.'s B maps) to avoid circularity between the conserved annotation track and diversity levels being modeled. Additionally, the authors have carefully assessed and modified the original McVicker et al. algorithm, reducing relative error (Figure A2).

    One could raise the concern that non-equilibrium demography confounds their results, but the authors have a very nice analysis in Section 7 of the supplementary material showing that their estimates are remarkably stable when the model is fit separately in different human populations (Figure A35). Supporting previous work that emphasizes the dependence between BGS and demography, the authors find evidence of such an interaction with a clever decomposition of variance approach (Figure A37). The consistency of BGS estimates across populations (e.g. Figures A35 and A36) is an additional strong bit of evidence that BGS is indeed shaping patterns of diversity; readers would benefit if some of these results were discussed in the main text.

    We appreciate the reviewer’s kind remarks. With regard to the results included in the main text vs the supplement, we attempted to strike a balance between keeping the main text accessible to a larger readership and providing experts with details they may find useful. We have, however, done our best to write the supplementary analyses clearly.

    I have three major concerns about this work. First, it's unclear how accurate the selection coefficient estimates are given the non-equilibrium demography of humans (pre-Out of Africa split, and thus not addressed by the separate population analyses). The authors do not make a big point about the selection coefficient estimates in the main section of the paper, so I don't find this to be a big problem. Still, some mention of this issue might be helpful to readers trying to interpret the results presented in the supplementary text.

    As the reviewer notes, we chose not to emphasize the inferred distributions of selection coefficients. Our main reason for this choice is the technical issue addressed in Appendix Section 1.5 (L561-564): “Second, thresholding potentially biases our estimates of the distribution of selection effects. While this bias is probably smaller than the bias without thresholding, its form and magnitude are not obvious. This is why we decided not to report the inferred distributions of selection effects in the Main Text.” We agree that if we were to focus on our estimates of the distribution of selection effects, the effects of demographic history would also need to be considered. This is, however, not the focus here.

    Second, I'm curious whether the composite likelihood BGS model could overfit any variance along the chromosome - even neutral variance. At some level, the composite likelihood approach may behave like a sort of smoothing algorithm, albeit with a functional form and parameters of a BGS model. The fact that there is information sharing across different regions with the same annotation class should in principle prevent overfitting to local noise. Still, I think there are two ways to address this overfitting concern. First, a negative neutral control could help - how much variation in diversity along the chromosome can this model explain in a purely neutral simulation? I imagine very little, likely less than 5%, but I think this paper would be much stronger with the addition of a negative control like this. Second, I think the main text should include the R² values from out-of-sample predictions, rather than just the R² estimates from the model fit on the entire data. For example, one could fit the model on 20 chromosomes and use the estimated θ_B parameters to predict variation on the remaining two. The authors do a sort of leave-one-out validation at the window level (Figure A31); however, this may not be robust to linkage disequilibrium between adjacent windows in the way leaving out an entire chromosome would be.

    The two requested analyses were done and their results are described above, in response to essential revisions (p. 2-3 here). In brief, there is no overfitting of neutral patterns or otherwise. We elaborate on why this finding is expected below.

    Finally, I feel like this paper would be stronger with realistic forward simulations. The deterministic simulations described in the supplementary materials show the implementation of the model is correct, but it's an exact simulation under the model - and thus not testing the accuracy of the model itself against realistic forward simulations. However, this is a sizable task and efforts to add selection to projects like Standard PopSim are ongoing.

    We agree that forward simulations would be a nice addition, but believe that it is a project in itself. Indeed, a major complication is that when, for computational tractability, purifying selection is simulated in small populations with realistic population-scaled parameters, the reduction in diversity due to selection at unlinked sites has a major effect on neutral diversity levels (see, e.g., Robertson 1961). We hope to address this issue in future work. Meanwhile, we note that the theory that we rely on has been tested against simulations in the past (e.g., Charlesworth et al., 1993; Hudson and Kaplan, 1995; Nordborg et al., 1996).

  2. Evaluation Summary:

    This paper uses state-of-the-art methods and the latest data to answer the question of whether variation in polymorphism levels along the human genome is mostly driven by linked purifying selection or selective sweeps. It makes a very strong case for the former. The paper is exceptionally well written, and should be of interest to anyone wishing to understand patterns of polymorphism.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    Since Kimura, it has been clear that most of the DNA sequence polymorphism in the population must, in general, be selectively neutral (or effectively so). At the same time, it has been clear that selection will affect the pattern of polymorphism through linkage. It will leave footprints. In particular, levels of polymorphism can be reduced by the constant elimination of deleterious mutations by purifying selection - and also by repeated adaptive substitutions ("selective sweeps"). Both could explain the observed pattern of less polymorphism in gene-dense regions. But because they are very different mechanisms, it is of interest to figure out which one is more important. Flipping the argument around could help us understand how prevalent and strong each is.

    This paper is a rigorous, state-of-the-art attempt to do this in the human genome. Using the latest data and computational methods, the paper convincingly argues that purifying selection must be the dominant force - indeed, it provides a surprisingly good predictive, mechanistic model, which is rare in population genetics. New insights include the major contribution of purifying selection on non-coding regions, which is evident from sequence conservation. The paper, which is exceptionally well written, should be of interest to anyone interested in molecular population genetics.

  4. Reviewer #2 (Public Review):

    Murphy et al. extend previous models of linked selection to evaluate diversity in the human genome. They find that patterns of diversity along the genome can be largely explained by models of background selection without the need to include selective sweeps or even much functional information beyond the presence of evolutionary constraints.

    This paper is excellent, one of the papers I have most enjoyed reading in some time. Many of the concerns or questions I had while reading were immediately answered by the authors, who have addressed almost every concern I can think of as well as many, many more. In multiple places, while reading the supplement I found myself thinking "wow that's a good point I never thought of that but I'm glad they did". The figures are clear and well explained. The supplement is a veritable trove of thoughtful background, careful consideration of caveats and concerns, and well-reasoned arguments for the choices the authors made.

    Given the above, it may not be surprising that my concerns with the paper are relatively few:

    The authors address polygenic adaptation, but I wouldn't mind additional discussion of how polygenic adaptation to a rapidly fluctuating optimum might or might not be captured by the background selection model.

    I would like to see a somewhat better introduction for readers not already steeped in linked selection. Inclusion of the basic equations from Hudson and Kaplan or Nordborg might go a long way to helping an average reader (who may not have the stamina for 80 pages of supplement) to nonetheless understand the basic parameters and model that the authors build on.

    Similarly, the last bit before the conclusions could be expanded a bit. How do the authors propose empiricists could use their results most effectively for e.g. demographic estimation? What are other areas/uses/implications of the results for other evolutionary genetic work?

  5. Reviewer #3 (Public Review):

    Murphy et al. further develop the linked selection model of Elyashiv et al. (2016) and apply it to human genetic variation data. This model is itself an extension of the McVicker et al. (2009) paper, which developed a statistical inference method around classic background selection (BGS) theory (Hudson and Kaplan, 1995, Nordborg et al., 1996). These methods fit a composite likelihood model to diversity data along the chromosome, where the level of diversity is reduced by a local factor from some initial "neutral" level π0 down to observed levels. The level of reduction is determined by a combination of both BGS and the expected reduction around substitutions due to a sweep (though the authors state that these models are robust to partial and soft sweeps). The expected reduction factor is a function of local recombination rates and genomic annotation (such as exonic and phylogenetically conserved sequences), as well as the selection parameters (i.e. mutation rates and selection coefficients for different annotation classes).
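
    As a rough illustration of the "local reduction factor" described above, the following minimal sketch computes a background selection factor B at one focal neutral site and scales a neutral diversity level π0 by it. It uses the classic approximation in the form commonly attributed to Hudson and Kaplan (1995) and Nordborg et al. (1996), with a single fixed selection coefficient. The function name, the single value of t, and all numbers are assumptions for illustration only; the model reviewed here fits distributions of selection coefficients per annotation class and also includes a sweep term, neither of which is shown.

```python
import math


def bgs_reduction(selected_sites, t):
    """Return B = pi / pi0 at a focal neutral site under classic BGS theory.

    selected_sites: iterable of (u, r) pairs, where u is the deleterious
        mutation rate per generation at one selected site and r is the
        recombination fraction between that site and the focal site.
    t: selection coefficient against heterozygotes (assumed to be one
        fixed value here, which is a simplification).
    """
    exponent = sum(u * t / (t + (1.0 - t) * r) ** 2 for u, r in selected_sites)
    return math.exp(-exponent)


# Hypothetical inputs: 10,000 selected sites, each with u = 1e-8.
tight = [(1e-8, 0.0)] * 10_000    # completely linked to the focal site
loose = [(1e-8, 0.01)] * 10_000   # recombination fraction 0.01 to the focal site

pi0 = 1.0e-3                      # assumed "neutral" diversity level
b_tight = bgs_reduction(tight, t=1e-3)
b_loose = bgs_reduction(loose, t=1e-3)

# Predicted diversity is the neutral level scaled by the local factor.
print(pi0 * b_tight, pi0 * b_loose)
```

    With complete linkage the exponent reduces to the total deleterious mutation rate divided by t, so diversity drops most where conserved sequence is dense and recombination is low; with looser linkage B moves back toward 1.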

    Overall, this work is a nice addition to an important line of work using models of linked selection to differentiate selection processes. The authors find that positive selection around substitutions explains little of the variation in diversity levels across the genome, whereas a background selection model can explain up to 80% of the variance in diversity. Additionally, their model seems to have solved a mystery of the McVicker et al. (2009) paper: why the estimated deleterious mutation rate was unreasonably high. Throughout the paper, the authors are careful not only in their methodology but also in their interpretation of the results. For example, when interpreting the good fit of the BGS model, the authors correctly point out that stabilizing selection on a polygenic trait can also lead to BGS-like reductions.

    Furthermore, the authors have carefully chosen their model's exogenous parameters to avoid circularity. The concern here is that if the input data into the model - in particular the recombination maps and segments likely to be conserved - are estimated or identified using signals in genetic variation, the model's good fit to diversity may be spurious. For example, recombination maps are often estimated from linkage disequilibrium (LD) data which is itself obtained from variation along the chromosome. Murphy et al. use a recombination map based on ancestry switches in African Americans, which should prevent "information leakage" between the recombination map and the BGS model from leading to spuriously good fits. Likewise, the authors use phylogenetic conservation maps rather than those estimated from diversity reductions (such as McVicker et al.'s B maps) to avoid circularity between the conserved annotation track and diversity levels being modeled. Additionally, the authors have carefully assessed and modified the original McVicker et al. algorithm, reducing relative error (Figure A2).

    One could raise the concern that non-equilibrium demography confounds their results, but the authors have a very nice analysis in Section 7 of the supplementary material showing that their estimates are remarkably stable when the model is fit separately in different human populations (Figure A35). Supporting previous work that emphasizes the dependence between BGS and demography, the authors find evidence of such an interaction with a clever decomposition of variance approach (Figure A37). The consistency of BGS estimates across populations (e.g. Figures A35 and A36) is an additional strong bit of evidence that BGS is indeed shaping patterns of diversity; readers would benefit if some of these results were discussed in the main text.

    I have three major concerns about this work. First, it's unclear how accurate the selection coefficient estimates are given the non-equilibrium demography of humans (pre-Out of Africa split, and thus not addressed by the separate population analyses). The authors do not make a big point about the selection coefficient estimates in the main section of the paper, so I don't find this to be a big problem. Still, some mention of this issue might be helpful to readers trying to interpret the results presented in the supplementary text.

    Second, I'm curious whether the composite likelihood BGS model could overfit any variance along the chromosome - even neutral variance. At some level, the composite likelihood approach may behave like a sort of smoothing algorithm, albeit with a functional form and parameters of a BGS model. The fact that there is information sharing across different regions with the same annotation class should in principle prevent overfitting to local noise. Still, I think there are two ways to address this overfitting concern. First, a negative neutral control could help - how much variation in diversity along the chromosome can this model explain in a purely neutral simulation? I imagine very little, likely less than 5%, but I think this paper would be much stronger with the addition of a negative control like this. Second, I think the main text should include the R² values from out-of-sample predictions, rather than just the R² estimates from the model fit on the entire data. For example, one could fit the model on 20 chromosomes and use the estimated θ_B parameters to predict variation on the remaining two. The authors do a sort of leave-one-out validation at the window level (Figure A31); however, this may not be robust to linkage disequilibrium between adjacent windows in the way leaving out an entire chromosome would be.
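
    The chromosome-level hold-out suggested above can be sketched as follows. This is only an outline of the validation logic, not the authors' procedure: a least-squares line stands in for the BGS model, the data are synthetic, and the names held_out_r2, fit_line and predict_line are hypothetical. R² is computed here as one minus the ratio of residual to total sum of squares on the held-out windows, which is one common definition and may differ from the one used in the paper.

```python
import numpy as np


def r_squared(observed, predicted):
    """Fraction of variance in `observed` explained by `predicted`."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def held_out_r2(chrom, x, y, test_chroms, fit, predict):
    """Fit on all windows outside `test_chroms`, score on those inside."""
    chrom, x, y = np.asarray(chrom), np.asarray(x), np.asarray(y)
    is_test = np.isin(chrom, list(test_chroms))
    params = fit(x[~is_test], y[~is_test])
    return r_squared(y[is_test], predict(params, x[is_test]))


# Stand-in model (an assumption): a straight line instead of the BGS model.
def fit_line(x_train, y_train):
    return np.polyfit(x_train, y_train, 1)


def predict_line(params, x_new):
    return np.polyval(params, x_new)


# Synthetic data: 22 autosomes with 50 windows each.
rng = np.random.default_rng(0)
chrom = np.repeat(np.arange(1, 23), 50)
x = rng.uniform(0.0, 1.0, size=chrom.size)               # stand-in predictor
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=chrom.size)  # stand-in diversity

# Fit on chromosomes 1-20, evaluate on chromosomes 21 and 22.
r2 = held_out_r2(chrom, x, y, {21, 22}, fit_line, predict_line)
print(round(r2, 3))
```

    Because whole chromosomes are withheld, no training window is in linkage disequilibrium with a test window, which is the property the window-level leave-one-out scheme lacks. The same function could be reused for a negative control by passing diversity values from a neutral simulation as y.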

    Finally, I feel like this paper would be stronger with realistic forward simulations. The deterministic simulations described in the supplementary materials show the implementation of the model is correct, but it's an exact simulation under the model - and thus not testing the accuracy of the model itself against realistic forward simulations. However, this is a sizable task and efforts to add selection to projects like Standard PopSim are ongoing.