Common, low-frequency, rare, and ultra-rare coding variants contribute to COVID-19 severity

Abstract

The combined impact of common and rare exonic variants in COVID-19 host genetics remains insufficiently understood. Here, common and rare variants from whole-exome sequencing data of about 4000 SARS-CoV-2-positive individuals were used to define an interpretable machine-learning model for predicting COVID-19 severity. First, variants were converted into separate sets of Boolean features, depending on the absence or presence of variants in each gene. An ensemble of LASSO logistic regression models was then used to identify the Boolean features most informative about the genetic bases of severity. The Boolean features selected by these logistic models were combined into an Integrated PolyGenic Score that offers a synthetic and interpretable index describing the contribution of host genetics to COVID-19 severity, as demonstrated through testing in several independent cohorts. Selected features belong to ultra-rare, rare, low-frequency, and common variants, including those in linkage disequilibrium with known GWAS loci. Notably, around one quarter of the selected genes are sex-specific. Pathway analysis of the selected genes associated with COVID-19 severity reflected the multi-organ nature of the disease. The proposed model might provide useful information for developing diagnostics and therapeutics, and might also help guide bedside disease management.
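
The first step described in the abstract, turning variant calls into per-gene Boolean features, can be illustrated with a minimal sketch. It is not the authors' code: the function name, the input layout (a list of `(sample, gene)` pairs, one per variant call), and the placeholder gene names are assumptions made for illustration only.

```python
def gene_presence_features(samples, genes, calls):
    """Build one Boolean per (sample, gene): True if the sample carries
    at least one variant in that gene, False otherwise.

    `calls` is an assumed input format: an iterable of (sample, gene)
    pairs, one pair per variant call.
    """
    carried = {sample: set() for sample in samples}
    for sample, gene in calls:
        carried[sample].add(gene)
    # Keep the gene order fixed so every sample yields a comparable vector.
    return {s: [g in carried[s] for g in genes] for s in samples}


# Hypothetical data: sample s1 has two variant calls in GENE1, s2 has none.
features = gene_presence_features(
    samples=["s1", "s2"],
    genes=["GENE1", "GENE2"],
    calls=[("s1", "GENE1"), ("s1", "GENE1")],
)
```

Vectors of this kind are the sort of input a sparse classifier such as LASSO logistic regression could then select from; the selection and scoring steps are not sketched here because the abstract does not specify them.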

Article activity feed

  1. SciScore for 10.1101/2021.09.03.21262611:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Organisms/Strains
    Sentences | Resources
    For instance, in the case of a gene belonging to an autosome with 2 common variants (named A and B), 3 combinations are possible (A, B, and AB), and (consequently) 3 Boolean features were defined both for the AD and AR model.
    AB
    suggested: RRID:BDSC_203
    Software and Algorithms
    Sentences | Resources
    Library enrichment was tested by qPCR, and the size distribution and concentration were determined using Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA).
    Agilent Bioanalyzer
    suggested: None
    Variant calling was performed according to the GATK4[24] best practice guidelines, using BWA [25] for mapping and ANNOVAR [26] for annotating.
    BWA
    suggested: (BWA, RRID:SCR_010910)
    ANNOVAR
    suggested: (ANNOVAR, RRID:SCR_012821)
    Finally, annotation was performed using Variant Effect Predictor (VEP, version 101).
    Variant Effect Predictor
    suggested: None
    Variants were genotyped with the GATK GenotypeGVCFs tool v4.1.8.1.
    GATK
    suggested: (GATK, RRID:SCR_001876)
    Therefore, the genetic ancestry of the patients was estimated using a random forest classifier trained on samples from the 1000 genomes project and using as input features the first 20 principal components computed from the common variants by PLINK [27].
    PLINK
    suggested: (PLINK, RRID:SCR_001757)
    ABS(meanβ) ∗ count ∗ F. Pathway enrichment analysis was performed using the GSEA-preranked module (v. 7.2.4) of the GenePattern platform [26], on several pathway categories (BIOCARTA,
    Genepattern
    suggested: (GenePattern, RRID:SCR_003201)
    KEGG, REACTOME, GOBP, HALLMARKS, C7 and C8), limiting the size of genesets to the 10-300 range and performing 10,000 permutations.
    KEGG
    suggested: (KEGG, RRID:SCR_012773)
    The networks showing similarity of significant pathways were built using the EnrichmentMap algorithm [30] in the Cytoscape suite (v. 3.8.2) [31–32].
    EnrichmentMap
    suggested: (EnrichmentMap, RRID:SCR_016052)
    Cytoscape
    suggested: (Cytoscape, RRID:SCR_003032)
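
The counting rule in the sentence quoted at the top of Table 2 (a gene with two common variants A and B gives the combinations A, B, and AB, each defined for both the AD and the AR model) can be checked with a short sketch. This only enumerates combinations and labels them; it makes no claim about how the manuscript assigns True or False to each feature. The function names and the `gene_combination_model` naming scheme are hypothetical.

```python
from itertools import combinations


def variant_combinations(variants):
    """Return every non-empty subset of a gene's common variants,
    ordered from single variants up to the full set."""
    return [
        combo
        for size in range(1, len(variants) + 1)
        for combo in combinations(variants, size)
    ]


def feature_names(gene, variants, models=("AD", "AR")):
    """Label one Boolean feature per (model, combination) pair.
    The naming scheme is an illustrative assumption."""
    return [
        f"{gene}_{''.join(combo)}_{model}"
        for model in models
        for combo in variant_combinations(variants)
    ]


combos = variant_combinations(["A", "B"])  # A, B, and AB
names = feature_names("GENE1", ["A", "B"])  # 3 features for each of AD and AR
```

With n common variants in a gene this enumeration yields 2^n - 1 combinations, so the number of features grows quickly as variants are added.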

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.