Common, low-frequency, rare, and ultra-rare coding variants contribute to COVID-19 severity
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
The combined impact of common and rare exonic variants in COVID-19 host genetics remains insufficiently understood. Here, common and rare variants from whole-exome sequencing data of about 4,000 SARS-CoV-2-positive individuals were used to define an interpretable machine-learning model for predicting COVID-19 severity. First, variants were converted into separate sets of Boolean features, depending on the absence or presence of variants in each gene. An ensemble of LASSO logistic regression models was then used to identify the Boolean features most informative about the genetic bases of severity. The Boolean features selected by these logistic models were combined into an Integrated PolyGenic Score, which offers a synthetic and interpretable index of the contribution of host genetics to COVID-19 severity, as demonstrated through testing in several independent cohorts. The selected features include ultra-rare, rare, low-frequency, and common variants, among them variants in linkage disequilibrium with known GWAS loci. Notably, around one quarter of the selected genes are sex-specific. Pathway analysis of the selected genes associated with COVID-19 severity reflected the multi-organ nature of the disease. The proposed model might provide useful information for developing diagnostics and therapeutics, and could also help guide bedside disease management.
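The first step described in the abstract, turning variant calls into per-gene presence/absence flags, can be illustrated with a minimal sketch. This is not the authors' code: the function name `boolean_gene_features`, the input layout, and the sample, gene, and variant identifiers are all hypothetical, and the sketch builds only one feature set, whereas the abstract describes separate sets for the different variant classes.

```python
def boolean_gene_features(variant_calls, genes):
    """Map each sample to a 0/1 flag per gene (1 = at least one variant observed).

    variant_calls: dict of sample ID -> iterable of (gene, variant_id) pairs.
    genes: ordered list of gene symbols used as feature columns.
    Both the layout and the names are assumptions made for illustration.
    """
    features = {}
    for sample, calls in variant_calls.items():
        # Collect the genes in which this sample carries any variant.
        genes_hit = {gene for gene, _variant in calls}
        features[sample] = [int(gene in genes_hit) for gene in genes]
    return features


# Toy, made-up input: all identifiers below are placeholders, not study data.
calls = {
    "sample_1": [("GENE1", "var_a"), ("GENE2", "var_b")],
    "sample_2": [("GENE2", "var_c")],
    "sample_3": [],
}
matrix = boolean_gene_features(calls, ["GENE1", "GENE2", "GENE3"])
```

Each row of `matrix` could then serve as the input vector for a sparse classifier such as the LASSO logistic regression mentioned in the abstract; that fitting step is not shown here.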
Article activity feed
SciScore for 10.1101/2021.09.03.21262611:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Experimental Models: Organisms/Strains

| Sentences | Resources |
| --- | --- |
| For instance, in the case of a gene belonging to an autosome with 2 common variants (named A and B), 3 combinations are possible (A, B, and AB), and (consequently) 3 Boolean features were defined both for the AD and AR model. | AB, suggested: RRID:BDSC_203) |

Software and Algorithms

| Sentences | Resources |
| --- | --- |
| Library enrichment was tested by qPCR, and the size distribution and concentration were determined using Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). | Agilent Bioanalyzer, suggested: None |
| Variant calling was performed according to the GATK4 [24] best practice guidelines, using BWA [25] for mapping and ANNOVAR [26] for annotating. | BWA, suggested: (BWA, RRID:SCR_010910); ANNOVAR, suggested: (ANNOVAR, RRID:SCR_012821) |
| Finally, annotation was performed using Variant Effect Predictor (VEP, version 101). | Variant Effect Predictor, suggested: None |
| Variants were genotyped with the GATK GenotypeGVCFs tool v4.1.8.1. | GATK, suggested: (GATK, RRID:SCR_001876) |
| Therefore, the genetic ancestry of the patients was estimated using a random forest classifier trained on samples from the 1000 genomes project and using as input features the first 20 principal components computed from the common variants by PLINK [27]. | PLINK, suggested: (PLINK, RRID:SCR_001757) |
| ABS(meanβ) ∗ count ∗ F Pathway enrichment analysis was made using the GSEA-preranked module (v. 7.2.4) of the Genepattern platform [26], on several pathway categories (BIOCARTA, KEGG, REACTOME, GOBP, HALLMARKS, C7 and C8), limiting the size of genesets to the 10-300 range and performing 10,000 permutations. | Genepattern, suggested: (GenePattern, RRID:SCR_003201); KEGG, suggested: (KEGG, RRID:SCR_012773) |
| The networks showing similarity of significant pathways were built using the EnrichmentMap algorithm [30] in the Cytoscape suite (v. 3.8.2) [31–32]. | EnrichmentMap, suggested: (EnrichmentMap, RRID:SCR_016052); Cytoscape, suggested: (Cytoscape, RRID:SCR_003032) |

Results from OddPub: Thank you for sharing your code.
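The sentence quoted under Experimental Models (a gene with two common variants A and B yields the three combinations A, B, and AB) can be made concrete with a short sketch. The helper names `combination_features` and `encode` are hypothetical, and the rule that a sample activates only the feature matching exactly the set of variants it carries is an assumption; the quoted text does not spell out the encoding, nor how it differs between the AD and AR models.

```python
from itertools import combinations


def combination_features(variants):
    """List every non-empty combination of a gene's common variants.

    Labels are concatenated variant names (e.g. "AB"), which is only
    unambiguous for single-character names as in the quoted example.
    """
    names = sorted(variants)
    return [
        "".join(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def encode(carried, variants):
    """Return a 0/1 flag per combination feature for one sample.

    ASSUMPTION (not from the source): a feature is 1 only when the sample
    carries exactly that set of variants, so at most one flag is set.
    """
    carried_label = "".join(sorted(set(carried) & set(variants)))
    return {
        label: int(label == carried_label)
        for label in combination_features(variants)
    }


labels = combination_features(["A", "B"])
flags = encode(["A", "B"], ["A", "B"])
```

For n common variants this enumeration produces 2^n - 1 features, which matches the three features stated for the two-variant case.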
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.