Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups


Abstract

Background

Phenome-wide association studies (PheWASs) have been conducted on Asian populations, including Koreans, but many were based on chip or exome genotyping data. Such studies have limitations regarding whole genome–wide association analysis, making it crucial to have genome-to-phenome association information with the largest possible whole genome and matched phenome data to conduct further population-genome studies and develop health care services based on population genomics.

Results

Here, we present 4,157 whole genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It encompasses most of the variants with allele frequency >0.001 in Koreans, indicating that it sufficiently covered most of the common and rare genetic variants with commonly measured phenotypes for Koreans. Korea4K provides 45,537,252 variants, and half of them were not present in Korea1K (1,094 samples). We also identified 1,356 new genotype–phenotype associations that were not found by the Korea1K dataset. Phenomics analyses further revealed 24 significant genetic correlations, 14 pleiotropic associations, and 127 causal relationships based on Mendelian randomization among 37 traits. In addition, the Korea4K imputation reference panel, the largest Korean variants reference to date, showed a superior imputation performance to Korea1K across all allele frequency categories.

Conclusions

Collectively, Korea4K provides not only the largest Korean genome data but also corresponding health check-up parameters and novel genome–phenome associations. The large-scale pathological whole genome–wide omics data will become a powerful set for genome–phenome level association studies to discover causal markers for the prediction and diagnosis of health conditions in future studies.

Article activity feed

  1. Abstract: We present 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest whole genomic resource of Koreans. Korea4K provides 45,537,252 variants and encompasses most of the common and rare variants in Koreans. We identified 1,356 new geno-phenotype associations which were not found by the previous Korea1K dataset. Phenomics analyses revealed 24 genetic correlations, 1,131 pleiotropic variants, and 127 causal relationships from Mendelian randomization. Moreover, the Korea4K imputation reference panel showed a superior imputation performance to Korea1K. Collectively, Korea4K provides the most extensive genomic and phenomic data resources for discovering clinically relevant novel genome-phenome associations in Koreans.

    Competing Interest Statement: S.J., Y.J., H.R., Y.J.K., C.K., Yeonkyung K., Younghui K., Y.J.W., and B.C.K. are employees and Jong B. is the CEO of Clinomics Inc. The authors declare no other competing interests.

    Reviewer 2: Taras K Oleksyk, Ph.D.

    Comments to Author: The authors contribute 4,157 whole-genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It has likely characterized most of the common and very common genetic variants with commonly measured phenotypes for Koreans. It also discusses its applicability not only for the Korean population but also for other East Asian populations, and possibly to other national genome projects as well. This work makes a significant contribution of data that can be used in future genome-wide association studies in the context of the Korean population. The manuscript appears to cover a lot of ground: from methodological issues to the real-world applications of the dataset in healthcare. The authors adopt innovative methods like GREML, which have been reported to have higher accuracy compared to older methods. The authors are transparent about the limitations of their study, such as sample size and lack of sufficient data for rare diseases. They also acknowledge that phenomics analyses were not powerful enough for novel discoveries, indicating areas for future research. However, given the increasing importance of genomic data in healthcare and personalized medicine, the paper appears to be highly relevant. While the paper is well formulated, there are some issues that need to be addressed before it is accepted for publication. See below:

    1. You referred to the UK Biobank data for some of your analyses. Were there any limitations or caveats in comparing your dataset to the UK Biobank? What about other national genomic projects that are out there? How transferable do you think the Korea4K dataset would be to studies focusing on other populations outside East Asia?
    2. Could you expand on any ethical considerations that were taken into account, especially in terms of data privacy and informed consent?
    3. How was the data cleaned and preprocessed, and were there any missing data points? If so, how were these handled? How many reads were there before and after QC, and what other quality metrics do the sequenced reads have? What was the average coverage across the genome? What was the read length?
    4. How did you ensure the quality of the genomic data collected from different sources such as Korea1K and public data archives? The paper mentions mitigating batch effects through allele balance and manual checks. Could you provide more details on the methodology behind these checks and their efficiency?
    5. Could you provide more information about the control group? Was it matched for age, sex, or other variables? How was the sample size determined, and does it provide enough statistical power to support your conclusions?
    6. You mentioned that the statistical power of your study will increase with more participants. Would this have implications for other national genomes that are making similar projects? Please elaborate on how your sensitivity analysis could apply to other populations outside Korea.
    7. The paper acknowledges that the sample size was not large enough to detect weak association signals. Have you considered statistical methods that can boost power in small samples?
    8. Could you provide more details on the 107 clinical parameters used for the Korea4K phenome dataset? Were these parameters standardized across the different clinics and hospitals?
    9. What criteria were used for initial sample filtering, particularly for excluding kinship? Could you clarify the steps taken to identify and filter the 64,301,272 SNVs and 8,776,608 Indels? How did you correct for batch effects arising from different Illumina NGS platforms and library preparations? Did you use specialized SNV calling software, or only GATK?
    10. How were allele frequencies calculated and what considerations were made to interpret their biological significance? You mention that more than half of the singleton and doubleton variants were newly discovered. Could you elaborate on the methodology used to confirm these as novel variants?
    11. The section on phenotypic correlations mentions 2,274 trait-trait relationships. How would you address the potential for population stratification affecting the results of your genetic and phenotypic correlations? How did you account for multiple comparisons in determining significant genetic correlations, and what corrections were applied to maintain the FDR? What measures were taken to ensure that the traits considered in this section were not subject to confounding and/or collider biases?
    12. In your findings, Waist-Creatine showed opposite directions for genetic and phenotypic correlations. Could you elaborate on the potential implications or causes of this discrepancy?
    13. Were there any other surprising or unexpected correlations, and what are their potential implications?
    14. You mentioned that phenomics analyses were not powerful enough for novel discoveries. Could you elaborate more on what would be needed to make them more effective?
    15. For the future implications, in terms of healthcare and personalized medicine, what do you see as the most immediate applications of the Korea4K dataset?
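
    Points 5 to 7 above turn on statistical power. As a rough illustration only, the sketch below approximates the power of a two-sided, 1-degree-of-freedom association test for a single variant on a quantitative trait, using the standard non-centrality parameter N * q2 / (1 - q2), where q2 is the fraction of trait variance explained by the variant. The function name `gwas_power`, the q2 values, and the 10,000-sample scenario are assumptions made for this example; 2,685 is the number of participants with health check-up data cited elsewhere in these reviews. This is not the authors' power analysis.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def gwas_power(n_samples: int, var_explained: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided 1-df test for one variant.

    Assumes unrelated samples, a quantitative trait, and a variant that
    explains `var_explained` (q2) of the trait variance.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < var_explained < 1.0:
        raise ValueError("var_explained must lie strictly between 0 and 1")
    # Non-centrality parameter of the 1-df chi-square test statistic.
    ncp = n_samples * var_explained / (1.0 - var_explained)
    mu = ncp ** 0.5  # mean of the equivalent z statistic
    # Two-sided critical value; the lower tail is used for numerical stability.
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - mu)
    lower = _STD_NORMAL.cdf(-z_crit - mu)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical effect sizes (q2) and sample sizes, for illustration only.
    for n in (2_685, 10_000):
        for q2 in (0.001, 0.005, 0.01):
            print(f"N={n:>6}  q2={q2:.3f}  power={gwas_power(n, q2):.3f}")
```

    Under these assumptions, power depends on the product of sample size and variance explained, which is why variants with small effects need much larger cohorts to reach a threshold such as 5E-8.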

    Re-review: Thank you for providing extensive answers to my questions. I am happy to recommend your paper for publication in its revised form.

  2. This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae014), and the reviews have been published under the same license. They are as follows.

    Reviewer 1: Pui-Yan Kwok

    Comments to Author: This manuscript describes the second phase of the Korean Genome Project (KGP) with 4,157 sets of whole-genome data (designated Korea4K). After error correction and sequencing data curation, the whole-genome sequencing (WGS) data from 3,614 unrelated individuals were used in the analyses. The authors also analyzed 107 types of clinical traits from 2,685 healthy participants' health check-up reports over a 4-year period (2016-2019). They performed a range of analyses and claimed that this new data performed better than Korea1K, the first phase KGP dataset, in a number of ways. A larger Korean dataset adds to the global genome resource and provides further insights into the Korean population. However, the results are mostly descriptive and serve as a catalog without significant new insights. The results are as expected (Korea4K is a better imputation reference panel than Korea1K, new variants are identified in the population, new variants are found in association with various phenotypes, etc.) and this dataset is sufficiently large to capture all the common variants found in the homogeneous Korean population. The authors should address several issues:

    1. The use of whole genome sequencing data in GWAS. The Bonferroni correction the authors used in their analysis was that for SNP array studies. They must do a formal correction with the many more variants found in WGS data and use a statistically sound correction for their analysis. The severe penalty for multiple testing using WGS data for GWAS is why few such studies have been done. I suspect that many of the associations will not reach statistical significance after proper correction, as the dataset is quite small for most traits under study.
    2. The authors should use the new genome references for their variant calling (T2T reference and the Human Pangenome Reference), as the GRCh38 is no longer the gold standard and the results will be quite different with the most up-to-date references. Using the best human genome reference will make Korea4K more valuable.
    3. The authors should clarify how many of the participants who contributed clinical data are unrelated.
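
    To make the scale of point 1 concrete, the sketch below compares the conventional genome-wide threshold of 5E-8 with a naive Bonferroni threshold computed over every Korea4K variant. It assumes, purely for illustration, a family-wise error rate of 0.05 and that all 45,537,252 reported variants are tested as if independent. Real analyses usually test fewer variants after frequency filtering, and linkage disequilibrium makes the effective number of tests smaller, so this is an upper bound on the penalty rather than a recommended threshold.

```python
# Illustrative only: per-test thresholds under a naive Bonferroni correction.
ALPHA = 0.05                      # assumed family-wise error rate
N_VARIANTS_KOREA4K = 45_537_252   # variant count reported in the abstract
CONVENTIONAL_GWAS = 5e-8          # conventional genome-wide threshold


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Threshold if every reported variant counted as one independent test.
wgs_threshold = bonferroni_threshold(ALPHA, N_VARIANTS_KOREA4K)
# Number of independent tests that the conventional threshold corresponds to.
implied_tests = ALPHA / CONVENTIONAL_GWAS
# How many times stricter the naive WGS threshold is than 5e-8.
strictness_ratio = CONVENTIONAL_GWAS / wgs_threshold

if __name__ == "__main__":
    print(f"Naive WGS Bonferroni threshold: {wgs_threshold:.2e}")
    print(f"Independent tests implied by 5e-8: {implied_tests:,.0f}")
    print(f"Naive WGS threshold is {strictness_ratio:.1f}x stricter")
```

    The conventional threshold corresponds to roughly one million independent tests, so the gap between the two numbers shows why the choice of correction matters for sequencing-based association studies.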

    Re-review:

    The authors made attempts to address the issues raised previously but did not do so adequately.

    1. Using the same GWAS cutoff of P < 5E-8 and adding the FDR correction (Benjamini-Hochberg) does not solve the problem of multiple testing using whole genome sequencing data (where there are orders of magnitude more variants than those on typical SNP arrays) for the study. With clinical data available for only 2,262 samples, each phenotype under study will have a very small number of individuals, making the result of 2,314 variants from 30 clinical traits with significant association highly suspect. The authors should consult statisticians with experience using whole genome sequencing data for association studies to come up with a better statistical study design.
    2. The authors acknowledge that using the newer reference would be a good approach but decline to do so because the "T2T reference lacks enough annotation data". This is not an adequate response. The point is to have the best variant calls for the Korea4K data; annotation is irrelevant until variants with significant association are identified. Claiming that they will do so in future versions of the project diminishes the significance of the current manuscript.
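
    For readers unfamiliar with the Benjamini-Hochberg procedure mentioned in point 1 of the re-review, the following is a minimal sketch of the standard step-up rule, compared with a Bonferroni cut-off on the same inputs. The function name `benjamini_hochberg` and the eight toy p-values are invented for illustration and are not taken from the manuscript. The sketch only shows that an FDR procedure is no stricter than Bonferroni on a given set of tests, which is consistent with the reviewer's remark that adding it does not by itself address the larger number of tests in sequencing data.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one reject flag per p-value, controlling the FDR at level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Toy p-values, invented for this example.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh_flags = benjamini_hochberg(toy_p, q=0.05)
# Bonferroni on the same inputs: compare each p-value with alpha / m.
bonferroni_flags = [p <= 0.05 / len(toy_p) for p in toy_p]

if __name__ == "__main__":
    print("BH rejections:        ", sum(bh_flags))
    print("Bonferroni rejections:", sum(bonferroni_flags))
```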