Local adaptation and archaic introgression shape global diversity at human structural variant loci
Curation statements for this article:
Curated by eLife
Evaluation Summary:
The technical challenges of identifying and quantifying the frequency of structural variants (SV) on a population scale have been a major limitation to the study of recent human adaptation. This manuscript applies a recent graph-based genotyping method that leverages a library of SVs identified by long-read sequencing to identify SVs in large short-read based cohorts. This is a sensible and powerful approach that highlights several examples of likely adaptive SV evolution in different human populations. The key findings and examples are well supported by the data and methods used. However, the manuscript would benefit from further comparisons and context from previous studies, and deeper exploration of the biological significance. In addition to providing novel examples of adaptive SV evolution, this analysis may serve as a template for future analyses that merge long-read and short-read datasets.
(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation – a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome.
Article activity feed
Author Response:
Reviewer #1:
Yan et al. take a comprehensive look at structural variants in the 1000 Genomes Project high-coverage dataset, using recent developments that can link short- and long-read data. Combined with genomic simulations, they identify and characterize the timing and origin of a likely selected region in Southeast Asian populations. The combination of multiple data types adds depth to the interpretation.
The study is timely, combining recently released data and methods, and has interesting biological implications. Three main areas would help interpretation and robustness of the paper:
Thank you for sharing your enthusiasm for our work!
- Further context and interpretation of the original SV set found is needed, for example comparisons to previous work to identify clearer "positive controls" or sanity checks on the method, and to understand what the contribution of the method/dataset/paper is.
Thank you for this suggestion, which was shared with those of other reviewers. We agree that the previous version of the manuscript placed too much responsibility on readers to track down the relevant content in the references and that a more direct and transparent comparison is warranted. Our set of SVs was carefully curated based on PacBio long-read sequencing data from 15 diverse samples by Audano et al. (2019). We now provide a detailed comparison of these curated SVs to two sets of SVs discovered from short-read sequencing of diverse human samples (Almarri et al., 2020; Sudmant et al., 2015) (lines 80-97). We find that this long-read-discovered SV set includes 89,979 variants (83.4% of long-read SVs) that are not represented in the 1000 Genomes Project (1KGP) or the Human Genome Diversity Project (HGDP). These long-read-specific variants include 30,229 that are “common” (AF ≥ 0.05), or 72.3% of all common SVs. We were also able to rediscover a large proportion of the short-read-discovered SVs in these two datasets, including 66.0% and 17.7% of common SVs in 1KGP and HGDP, respectively (Fig. 1 - S2 and Fig. 1 - S3). These results are consistent with reports from previous studies (Zhao et al., 2021).
The overlap we describe above is notable given the much smaller size of the long-read sample set (15 individuals vs. 2,504 for 1KGP and 911 for HGDP), and given that the sample sets do not overlap completely (i.e., we expect that many rare or singleton SVs should not be represented in both datasets). We expect that the SVs unique to the short-read datasets reflect both differences in the discovery sample set (i.e., many of the long-read sequenced individuals are also in 1KGP, while none are in HGDP) and a high rate of false positives in short-read-based SV discovery (Nattestad et al., 2018).
Furthermore, we have released all of our code, along with the SV genotypes (among the long-read sequenced samples [i.e. the input set], as well as based on graph genotyping of the 1000 Genomes cohort). This will enable future work based on these SV genotype calls, while also ensuring reproducibility and facilitating improvements to the genotyping methods. Indeed, we are aware that the data that we released are already being used in several other studies and that the genotyping strategy that we outlined has motivated additional studies being proposed in grant applications by other groups.
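As an illustration of the kind of call-set comparison described in this response, the following is a minimal sketch of how one might count long-read SVs that have no match in a short-read call set. It is not the authors' released code. The `SV` record, the 50% reciprocal-overlap threshold, and the toy coordinates are all assumptions made for the example; the response does not state which matching criterion was used.

```python
from typing import NamedTuple, Sequence


class SV(NamedTuple):
    """Hypothetical minimal SV record (interval-like variants such as deletions)."""
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions; 0.0 if chromosome or type differ."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return min(inter / (a.end - a.start), inter / (b.end - b.start))


def fraction_unmatched(query: Sequence[SV], reference: Sequence[SV],
                       min_ro: float = 0.5) -> float:
    """Fraction of query SVs with no reference SV at >= min_ro reciprocal overlap.

    The 0.5 default is an assumed threshold, not one taken from the paper.
    Insertions (zero reference span) would need a different criterion.
    """
    unmatched = sum(
        1 for q in query
        if not any(reciprocal_overlap(q, r) >= min_ro for r in reference)
    )
    return unmatched / len(query)


# Toy call sets (invented coordinates, for illustration only).
long_read = [
    SV("chr1", 100, 200, "DEL"),
    SV("chr1", 500, 900, "DEL"),
    SV("chr2", 100, 150, "DEL"),
    SV("chr3", 10, 60, "DEL"),
]
short_read = [
    SV("chr1", 110, 210, "DEL"),
    SV("chr1", 850, 1200, "DEL"),
]

print(fraction_unmatched(long_read, short_read))
```

The quadratic scan is only suitable for toy inputs; a real comparison over roughly 100,000 SVs would sort by position or use an interval index.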
- The above is particularly important across ancestries/populations which differ in their LD levels. How do population-specific LD patterns impact the ability to detect these SV patterns, and therefore to make cross-population comparisons or infer differences in frequency that are central to the selection scan and the 220 highly differentiated SVs of interest? Perhaps this is in the original methods paper, but it is central to this paper, so it should at least be explained or analyzed.
The graph genotyping approach does not leverage LD per se, though it is feasible that multiple linked variants could be spanned by a single long read. Instead, the Paragraph genotyping algorithm relies on an on-the-fly realignment of the primary short read sequencing data to a graph encoding the reference genome as well as the variant sequence. There are, however, some interesting implications of the differences in LD across populations for the use of SV genotypes. We quantified the population differences in LD between SVs and nearby SNPs on lines 174-184 and in Fig. 1 - S7. One implication of this result, mirroring the situation for other classes of variation, is that the accuracy of imputation of SVs based on knowledge of SNPs will be lowest in African populations. Conversely, these low rates of LD may improve fine mapping in the same populations, allowing future studies to test whether SVs are enriched for causal effects on expression and other phenotypes. While detailed investigation of imputation and fine-mapping are outside of the scope of our current study, we now discuss these implications in the section that describes patterns of LD (lines 181-184).
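For readers unfamiliar with how LD between an SV and a nearby SNP can be quantified, here is a minimal sketch using the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of the alternate allele). This is a common approximation for unphased genotypes and is offered as an assumption; the response does not specify the exact LD statistic used. The function name and the toy genotype vectors are hypothetical.

```python
from typing import Sequence


def dosage_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors.

    Returns 0.0 when either variant is monomorphic in the sample,
    since the correlation is undefined in that case.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("dosage vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


# Toy dosages for six individuals (invented for illustration).
sv_dosage = [0, 1, 2, 0, 1, 2]
tagging_snp = [0, 1, 2, 0, 1, 2]      # perfectly tags the SV
unlinked_snp = [1, 0, 1, 1, 0, 1]     # uncorrelated with the SV

print(dosage_r2(sv_dosage, tagging_snp))
print(dosage_r2(sv_dosage, unlinked_snp))
```

Computing this statistic separately within each population, for each SV against its neighboring SNPs, gives the kind of per-population LD comparison discussed above: lower values imply that the SV is harder to impute from SNP genotypes.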
- The genomic simulations to infer the strength of selection were a nice addition, a step beyond common empirically-driven work. It would help to know how to interpret the ABC model in the context of the later finding that the region was introgressed from Neanderthals: the model seems not to include this aspect.
Thank you for appreciating the value of this section. We believe that introgression of the adaptive IGH haplotype from Neanderthals should not impact our ABC results within the time scale of our simulation. This is because our simulation begins after the introgression event has already occurred and the Neanderthal haplotype is segregating within the human population. A recent study showed that situations like these, in which introgressed variants persist at low frequencies and later undergo selection, may have occurred frequently in human evolutionary history (Yair et al., 2021).
However, we agree that the impact of the introgression event on our simulation requires clarification. We now discuss this point in the simulation section (lines 489-491), and also cite the paper above. We have additionally moved this section to the end of the paper to better emphasize that it focuses on the history of the Neanderthal haplotype in humans, rather than the introgression event itself.
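To make the logic of this response concrete, the following toy sketch shows ABC by rejection sampling for a selection coefficient, with the simulation starting from a haplotype that is already segregating at low frequency (that is, after introgression). It is not the authors' simulation: the haploid Wright-Fisher model, the uniform prior, the single summary statistic (final frequency), and every parameter value are assumptions chosen only to keep the example small.

```python
import random


def simulate_final_freq(s: float, n_hap: int, generations: int,
                        start_freq: float, rng: random.Random) -> float:
    """Haploid Wright-Fisher with selection; returns the final haplotype frequency."""
    freq = start_freq
    for _ in range(generations):
        if freq == 0.0 or freq == 1.0:
            break  # lost or fixed
        # Deterministic change due to selection (relative fitness 1 + s).
        expected = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
        # Binomial sampling of the next generation (genetic drift).
        count = sum(rng.random() < expected for _ in range(n_hap))
        freq = count / n_hap
    return freq


def abc_rejection(observed_freq: float, n_sims: int, tolerance: float,
                  rng: random.Random, s_max: float = 0.5, n_hap: int = 100,
                  generations: int = 40, start_freq: float = 0.05) -> list:
    """Draw s from Uniform(0, s_max); keep draws whose simulated final
    frequency falls within `tolerance` of the observed frequency."""
    accepted = []
    for _ in range(n_sims):
        s = rng.uniform(0.0, s_max)
        simulated = simulate_final_freq(s, n_hap, generations, start_freq, rng)
        if abs(simulated - observed_freq) <= tolerance:
            accepted.append(s)
    return accepted


rng = random.Random(1)
# Hypothetical observed frequency of 0.9; all other settings are the toy defaults.
posterior_sample = abc_rejection(observed_freq=0.9, n_sims=200,
                                 tolerance=0.1, rng=rng)
print(len(posterior_sample), "accepted draws")
```

Because the starting frequency is a model input, the origin of the haplotype (introgressed or not) does not enter the simulation, which mirrors the argument above. A realistic analysis would use a demographic model, richer summary statistics, and far more simulations.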
Reviewer #3 (Public Review):
This paper demonstrates the additional utility that can be extracted from short-read genome resources such as the genomes from the 1000 Genomes Project by leveraging variant discovery in long-read platforms. These genotyped variants can be used for eQTL studies, or to identify potential signatures of selection. Thus, low-coverage population-scale sequencing datasets such as the 1000 Genomes data can still be of use when coupled with other datasets.
One of the challenges I have with this manuscript however is clearly understanding the novel aspects of the reported results in the context of previous work in this field. Initially, it is unclear how many of the genotyped variants are already in the 1000 Genomes dataset; this should be clearly reported. Comparisons of LD to nearby SNPs do not take into account that the SV discovery in the 1000 Genomes Project was done separately from the SNP calling. Thus, while it is suggested as presented that most of these variants were previously intractable, this is insufficiently explored. Additionally, discussion of low LD with SVs is well documented in 1KG and elsewhere. Subsequently, the eQTL analyses are "broadly consistent" with previously reported eQTL analyses from both the 1000 Genomes Project and GTEx, but no direct comparison is performed. If the overall goal is to point out that using additional datasets can identify new variants that can be genotyped, it is important to perform comparisons to other population-scale datasets such as HGDP and SGDP (Almarri et al Cell, Hsieh et al Science, etc). In these cases, higher coverage sequencing allowed discovery of variants which could then be genotyped, similar to this paper's assertion that long-read sequencing provided a new discovery set for subsequent genotyping. Indeed, the two highly stratified variants selected for follow up are reported in gnomAD. The paper mostly focusses on the identification of highly stratified loci. Again, comparison to previously reported highly stratified loci (1KG, Sudmant et al 2015, and Almarri 2020, Hsieh et al) is necessary here.
Furthermore, while the analyses of the IGH haplotype are clearly presented and interesting, as noted in the manuscript, these have already been identified. The authors mention that this locus was already identified but suggest it was "not further examined" due to "stringent filtering"; however, this locus was reported as one of 11 "high frequency introgressed regions", so this description seems to mischaracterize Browning et al.'s recognition of the importance of this locus. The strongest part of the manuscript is the ABC modelling of the IGH haplotype elucidating the putatively extremely strong selective signatures at this locus. More focus on these results and the importance of following up and fully understanding such loci would benefit the manuscript. Broadly, this paper is well written and clearly presented; however, it would be very much strengthened by placing it more broadly in the context of previous work and focusing more on the novel modelling analyses of specific loci that are performed.
Reviewer #2 (Public Review):
The technical challenges of identifying and quantifying the frequency of structural variants (SV) on a population scale have been a major limitation to the study of recent human adaptation. This manuscript applies a recent graph-based genotyping method that leverages a library of SVs identified by long-read sequencing to identify SVs in large short-read based cohorts. This is a sensible and powerful approach that highlights several examples of likely adaptive SV evolution in different human populations. The key findings and examples are well supported by the data and methods used. However, the manuscript would benefit from: 1) testing more hypotheses rather than listing examples and 2) more framing of how the results and methods expand on several recent studies of SVs across populations. In addition to providing novel examples of adaptive SV evolution, I anticipate this analysis can serve as a template for future analyses that merge long-read and short-read datasets.
Reviewer #1 (Public Review):
Yan et al. take a comprehensive look at structural variants in the 1000 Genomes Project high-coverage dataset, using recent developments that can link short- and long-read data. Combined with genomic simulations, they identify and characterize the timing and origin of a likely selected region in Southeast Asian populations. The combination of multiple data types adds depth to the interpretation.
The study is timely, combining recently released data and methods, and has interesting biological implications. Three main areas would help interpretation and robustness of the paper:
Further context and interpretation of the original SV set found is needed, for example comparisons to previous work to identify clearer "positive controls" or sanity checks on the method, and to understand what the contribution of the method/dataset/paper is.
The above is particularly important across ancestries/populations which differ in their LD levels. How do population-specific LD patterns impact the ability to detect these SV patterns, and therefore to make cross-population comparisons or infer differences in frequency that are central to the selection scan and the 220 highly differentiated SVs of interest? Perhaps this is in the original methods paper, but it is central to this paper, so it should at least be explained or analyzed.
The genomic simulations to infer the strength of selection were a nice addition, a step beyond common empirically-driven work. It would help to know how to interpret the ABC model in the context of the later finding that the region was introgressed from Neanderthals: the model seems not to include this aspect.