Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations

Abstract

Advancements in long-read sequencing technology have accelerated the study of large structural variants (SVs). We created a curated, publicly available, multi-ancestry SV imputation panel by long-read sequencing 888 samples from the 1000 Genomes Project. This high-quality panel was used to impute SVs in approximately 500,000 UK Biobank participants. We demonstrated the feasibility of conducting genome-wide SV association studies at biobank scale using 32 disease-relevant phenotypes related to respiratory, cardiometabolic and liver diseases, in addition to 1,463 protein levels. This analysis identified thousands of genome-wide significant SV associations, including hundreds of conditionally independent signals, thereby enabling novel biological insights. Focusing on genetic association studies of lung function as an example, we demonstrate the added value of SVs for prioritising causal genes at gene-rich loci compared to traditional GWAS using only short variants. We envision that future post-GWAS gene-prioritisation workflows will incorporate SV analyses using this SV imputation panel and framework.

Article activity feed

  1. eLife Assessment

    This fundamental work significantly enhances our understanding of how structural variants influence human phenotypes. The conclusion is convincingly supported by rigorous analyses of long-read sequencing data. If the raw data are made publicly available, these high-quality datasets and findings will further advance our knowledge of genetic variation in the human population.

  2. Reviewer #1 (Public review):

    Summary:

    The authors sequenced 888 individuals from the 1000 Genomes Project using the Oxford Nanopore long-read sequencing method to achieve highly sensitive, genome-wide detection of structural variants (SVs) at the population level. They conducted solid benchmarking of SV calling and systematically characterized the identified SVs. While short-read sequencing methods, including those used in the 1000 Genomes Project, have been widely applied, they exhibit high accuracy in detecting single nucleotide variants (SNVs) and small insertions and deletions but have limited sensitivity for SV detection. This study significantly enhances SV detection capabilities, establishing it as a valuable resource for human genetic research. Furthermore, the authors constructed an SV imputation panel using the generated data and imputed SVs in 488,130 individuals from the UK Biobank. They then conducted a proof-of-principle genome-wide association study (GWAS) analysis based on the imputed SVs and selected traits within the UK Biobank. Their findings demonstrate that incorporating SV-GWAS analysis provides additional insights beyond conventional GWAS frameworks focusing on SNVs, particularly in improving fine mapping.

    Strengths:

    The authors constructed a high-sensitivity reference panel of genome-wide SVs at the population level, addressing a critical gap in the field of human genetics. This resource is expected to significantly advance research in human genetics. They demonstrated the imputation of SVs in individuals from the UK Biobank using this panel and conducted a proof-of-concept SV-based GWAS. Their findings highlight a novel and effective strategy for integrating SVs into GWAS, which will facilitate the analysis of human genetic data from the UK Biobank and other datasets. Their conclusions are supported by comprehensive analyses.

    Weaknesses:

    (1) Although the authors employ state-of-the-art analytical approaches for the identification of SVs, the overall accuracy remains suboptimal, as indicated by an F1 score of 74.0%, particularly in tandem repeat regions. To enhance accuracy, it would be beneficial to explore alternative SV detection methods or develop novel approaches. Given the value of the reference panel and the fact that improved SV accuracy would lead to more precise SV imputation and GWAS results, investing effort in methodological refinement is highly encouraged.

    (2) From the Methods section, it appears that the authors employed Beagle for both the "leave-one-out" imputation and the UK Biobank imputation. It would be better to state this explicitly in the Results section and to provide a detailed description of the corresponding procedures and parameters in the Methods section for both analyses, as this represents a key aspect of the study. Additionally, because Beagle is not specifically designed for SV imputation, the imputation quality of SVs is generally lower than that of SNVs. Exploring strategies to improve SV imputation, such as developing a novel method that makes fuller use of the reference panel data, may enhance performance. It is also important to assess how this reduced imputation quality may influence GWAS results. For instance, it would be useful to examine whether associated SVs exhibit higher imputation quality and whether SVs with lower quality are less likely to reach significant association signals. In addition, the lower imputation quality observed for INV, DUP, and BND variants (Figure 3) may be due to their greater lengths (Figure 2); the relationship between SV length and imputation quality should therefore be investigated.
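    A minimal sketch of how the suggested check could be run, assuming the imputed VCF carries Beagle's DR2 INFO field alongside SVTYPE/SVLEN annotations (the field names, file path and length bins below are illustrative assumptions, not the authors' pipeline):

    ```python
    # Sketch: mean Beagle imputation quality (DR2) per SV length bin.
    # Assumes DR2 and SVLEN are present in the imputed VCF's INFO column.
    import bisect
    from collections import defaultdict
    import pysam

    EDGES = [50, 500, 5_000, 50_000]
    LABELS = ["50-499 bp", "500-4,999 bp", "5,000-49,999 bp", ">=50,000 bp"]

    def mean_dr2_by_length(vcf_path):
        bins = defaultdict(list)
        with pysam.VariantFile(vcf_path) as vcf:
            for rec in vcf:
                dr2, svlen = rec.info.get("DR2"), rec.info.get("SVLEN")
                if dr2 is None or svlen is None:
                    continue
                # Beagle emits one DR2 value per ALT allele; take the first.
                dr2 = dr2[0] if isinstance(dr2, tuple) else dr2
                svlen = abs(svlen[0] if isinstance(svlen, tuple) else svlen)
                if svlen < EDGES[0]:
                    continue
                bins[LABELS[bisect.bisect_right(EDGES, svlen) - 1]].append(dr2)
        return {label: sum(v) / len(v) for label, v in bins.items()}

    # Hypothetical usage:
    # print(mean_dr2_by_length("ukb_imputed_svs.chr1.vcf.gz"))
    ```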

    (3) All examples presented in the manuscript focus on SVs that overlap with genes. It may also be valuable to investigate SVs that do not overlap with genes but intersect with enhancer regions. SVs can contribute to disease by altering regulatory elements, such as enhancers, which play a crucial role in gene expression. Including such analyses would further demonstrate the utility of SV-GWAS and provide deeper insights into the functional impact of SVs.
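    As an illustration of the kind of intersection being suggested, a minimal sketch follows; the gene and enhancer annotations (for example GENCODE genes and ENCODE cCRE enhancers) are assumed to be supplied as simple (chrom, start, end) intervals, and this is not the authors' workflow:

    ```python
    # Sketch: flag SVs that overlap no gene but do overlap an annotated enhancer.
    from collections import defaultdict

    def build_index(intervals):
        """Group (chrom, start, end) intervals by chromosome."""
        idx = defaultdict(list)
        for chrom, start, end in intervals:
            idx[chrom].append((start, end))
        return idx

    def overlaps(idx, chrom, start, end):
        """True if the half-open interval [start, end) hits any indexed interval."""
        return any(s < end and e > start for s, e in idx.get(chrom, []))

    def enhancer_only_svs(svs, genes, enhancers):
        """Return SVs intersecting an enhancer but no gene (linear scan; for
        genome-scale data an interval tree or bedtools would be preferable)."""
        gene_idx, enh_idx = build_index(genes), build_index(enhancers)
        return [sv for sv in svs
                if not overlaps(gene_idx, *sv) and overlaps(enh_idx, *sv)]
    ```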

    (4) The data availability link currently provides only a VCF file ("sniffles2_joint_sv_calls.vcf.gz") containing the identified SVs. It would be beneficial for the authors to make all raw sequencing data (FASTQ files) and key processed datasets (such as alignment results and merged SV and SNV files) available. Providing these resources would enable other researchers to develop improved SV detection and imputation methods or conduct further genetic analyses. Furthermore, establishing a dedicated website for data access, along with a genome browser for SV visualization, could significantly enhance the impact and accessibility of the study. Additionally, all code, particularly the SV imputation pipeline accompanied by a detailed tutorial, should be deposited in a public repository such as GitHub. This would support researchers in imputing SVs and conducting SV-GWAS on their own datasets.

  3. Reviewer #2 (Public review):

    Summary:

    The authors aimed to develop a novel and efficient method for SV detection, utilizing data from the 1000 Genomes Project (1KGP) for modeling and calibration. This method was subsequently validated using UK population data and applied to identify structural variants associated with specific disease phenotypes.

    Strengths:

    Third-generation single-molecule sequencing data offers several advantages over traditional high-throughput sequencing methods, particularly due to its long-read lengths, which provide valuable insights into significant forms of genomic variation. The authors have developed an efficient method for detecting structural variations and optimizing the utilization of genomic data. We hope that this method will continue to be refined, enabling researchers to more effectively leverage long-read data, high-throughput data, or even a synergistic combination of both.

    Weaknesses:

    Although this research contributes to our ability to use long-read and high-throughput data more effectively, there are some key issues that need to be addressed, both in the analysis of the specific results and in the writing of the article.

  4. Reviewer #3 (Public review):

    Summary:

    This study successfully identified genetic loci associated with various traits by generating large-scale long-read sequencing data from a diverse set of samples. This study is significant because it not only produces large-scale long-read genome sequencing data but also demonstrates its application in actual genetics research. Given its potential utility in various fields, this study is expected to make a valuable contribution to the academic community and to this journal. However, there are several critical aspects that could be improved. Below are specific comments for consideration.

    Strengths:

    Producing high-quality, large-scale variant datasets and imputation datasets

    Weaknesses:

    (1) Data availability

    Currently, it appears that only the Genomic Lens SV Panel is available on the webpage described in the Data Availability section. It is unclear whether the authors intend to release the raw sequencing data. Since the study utilized samples from the 1000 Genomes Project, there should be no restriction on making the data publicly accessible. Given this, would the authors consider making the raw sequencing reads publicly available? If so, NCBI SRA or EBI ENA would be the most appropriate repositories for data deposition. I strongly encourage the authors to consider public data release.

    Additionally, accessing the Genomic Lens SV Panel data does not seem straightforward. The manuscript should provide a more detailed description of how researchers can access and utilize these data. In my opinion, the best approach would be to upload the variant data (VCF files) to a public database such as the European Variation Archive (EVA) hosted by EBI.

    I strongly request that the authors publicly deposit the variant data. At a minimum:

    a) The joint genotype data for all 888 samples from the 1000 Genomes Project must be publicly available.
    b) For the UK Biobank samples, at least allele frequency data should be disclosed.

    Since eLife has a well-established data-sharing policy, compliance with these guidelines is essential for publication in this journal.

    (2) Long-read sequencing data quality

    While the manuscript presents N50 read length and mean or median read base quality for each sample in a table, it would be highly beneficial to visualize these data in figures as well. A violin plot or similar visualization summarizing these distributions would significantly improve data presentation.

    Notably, the base quality of ONT long-read sequencing data appears lower than expected. This may be attributed to the use of the R9.4.1 pore, but the unexpectedly low base quality still warrants attention. It would be helpful to include a small figure within Figure 2 to illustrate this point. A visual representation of read length distribution and base quality distribution would strengthen the manuscript.
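    A minimal sketch of the suggested visualisation, assuming the per-sample statistics are available as a tab-separated table (the file name and column names are assumptions):

    ```python
    # Sketch: violin plots of per-sample read N50 and mean base quality.
    import pandas as pd
    import matplotlib.pyplot as plt

    summary = pd.read_csv("per_sample_read_stats.tsv", sep="\t")  # hypothetical file

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    ax1.violinplot(summary["read_n50"], showmedians=True)
    ax1.set_ylabel("Read N50 (bp)")
    ax1.set_xticks([])
    ax2.violinplot(summary["mean_base_quality"], showmedians=True)
    ax2.set_ylabel("Mean read base quality (Phred)")
    ax2.set_xticks([])
    fig.suptitle("Per-sample ONT read length and base quality (n = 888)")
    fig.tight_layout()
    fig.savefig("read_quality_violin.png", dpi=300)
    ```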

    (3) Variant detection precision, recall, and F1 score

    This study focuses on insertions and deletions (indels) ≥50 bp, but it remains unclear how well variants <50 bp are detected. I am particularly interested in the precision, recall, and F1 score for variants of 5-49 bp.

    While ONT base quality is relatively low and single-base variants are therefore challenging to analyze, variants ≥5 bp should still be detectable, as read accuracy is approximately 90%, making such an analysis feasible. Given that Sniffles supports the detection of variants as small as 1 bp, I strongly encourage the authors to conduct this additional analysis.

    A simple two-category classification (e.g., 5-49 bp and ≥50 bp) should suffice. Additionally, a comparative analysis with HiFi and short-read sequencing data would be highly valuable. If possible, I strongly recommend that all detected variants ≥5 bp be made publicly available as VCF files.
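    The size-stratified benchmarking being requested reduces to computing precision, recall and F1 separately per size class; a minimal sketch follows, with placeholder counts, since the true TP/FP/FN values would come from a caller-versus-truth comparison such as Truvari:

    ```python
    # Sketch: precision/recall/F1 per variant size class from TP/FP/FN counts.
    def metrics(tp, fp, fn):
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # Placeholder counts; replace with real values from the benchmark.
    counts = {"5-49 bp": (0, 0, 0), ">=50 bp": (0, 0, 0)}

    for size_class, (tp, fp, fn) in counts.items():
        p, r, f1 = metrics(tp, fp, fn)
        print(f"{size_class}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
    ```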

    (4) Assembly-based methods

    Given the low read accuracy and low sequencing depth in this dataset, it is understandable that genome assembly is challenging. However, the latest high-quality human genome datasets, such as those produced by the Human Pangenome Reference Consortium (HPRC), demonstrate that assembly-based approaches provide significant advantages, particularly for resolving complex and long structural variants.

    Since HPRC data also utilize 1000 Genomes Project samples, it would be highly informative to compare the accuracy of ONT sequencing in this study with HPRC's assembly-based genome data. The recent publication on 47 HPRC samples provides a valuable reference for such a comparison. Given its relevance, the authors should consider providing a comparative analysis with HPRC data.
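    One simple way to perform such a comparison, sketched below under the assumption that both call sets are available as (chrom, pos, svtype, length) records, is to match calls by type, breakpoint window and reciprocal size similarity, in the spirit of Truvari's default matching; the thresholds are illustrative assumptions, not prescribed values:

    ```python
    # Sketch: match ONT SV calls against assembly-based (e.g. HPRC) calls
    # using a position window and a reciprocal size-similarity threshold.
    MAX_DIST = 500       # maximum breakpoint distance in bp (assumed threshold)
    MIN_SIZE_SIM = 0.7   # minimum reciprocal size similarity (assumed threshold)

    def is_match(a, b):
        """a, b: (chrom, pos, svtype, length) tuples for single SV calls."""
        if a[0] != b[0] or a[2] != b[2]:
            return False
        if abs(a[1] - b[1]) > MAX_DIST:
            return False
        smaller, larger = sorted((abs(a[3]), abs(b[3])))
        return larger > 0 and smaller / larger >= MIN_SIZE_SIM

    def concordance(ont_calls, assembly_calls):
        """Fraction of ONT calls with at least one assembly-based match."""
        matched = sum(any(is_match(o, a) for a in assembly_calls) for o in ont_calls)
        return matched / len(ont_calls) if ont_calls else 0.0
    ```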

    References:

    (1) A draft human pangenome reference
    https://www.nature.com/articles/s41586-023-05896-x

    (2) The Human Pangenome Project: a global resource to map genomic diversity
    https://www.nature.com/articles/s41586-022-04601-8

    (3) A pangenome reference of 36 Chinese populations
    https://www.nature.com/articles/s41586-023-06173-7

    (4) Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits
    https://www.nature.com/articles/s41588-021-00865-4

    (5) Increased mutation and gene conversion within human segmental duplications
    https://www.nature.com/articles/s41586-023-05895-y

    (6) Structural polymorphism and diversity of human segmental duplications
    https://www.nature.com/articles/s41588-024-02051-8

    (7) Highly accurate Korean draft genomes reveal structural variation highlighting human telomere evolution
    https://academic.oup.com/nar/article/53/1/gkae1294/7945385