Using machine learning to predict COVID-19 infection and severity risk among 4510 aged adults: a UK Biobank cohort study

Abstract

Many risk factors have emerged for novel 2019 coronavirus disease (COVID-19). It is relatively unknown how these factors collectively predict COVID-19 infection risk, as well as risk for a severe infection (i.e., hospitalization). Among aged adults (69.3 ± 8.6 years) in UK Biobank, COVID-19 data was downloaded for 4510 participants with 7539 test cases. We downloaded baseline data from 10 to 14 years ago, including demographics, biochemistry, body mass, and other factors, as well as antibody titers for 20 common to rare infectious diseases in a subset of 80 participants with 124 test cases. Permutation-based linear discriminant analysis was used to predict COVID-19 risk and hospitalization risk. Probability and threshold metrics included receiver operating characteristic curves to derive area under the curve (AUC), specificity, sensitivity, and quadratic mean. Model predictions using the full cohort were marginal. The “best-fit” model for predicting COVID-19 risk was found in the subset of participants with antibody titers, which achieved excellent discrimination (AUC 0.969, 95% CI 0.934–1.000). Factors included age, immune markers, lipids, and serology titers to common pathogens like human cytomegalovirus. The hospitalization “best-fit” model was more modest (AUC 0.803, 95% CI 0.663–0.943) and included only serology titers, again in the subset group. Accurate risk profiles can be created using standard self-report and biomedical data collected in public health and medical settings. It is also worthwhile to further investigate if prior host immunity predicts current host immunity to COVID-19.
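The probability and threshold metrics named in the abstract can be illustrated with a minimal sketch. The data below are invented toy values, not UK Biobank results, and the function names are hypothetical. The sketch also assumes that "quadratic mean" refers to the root mean square of sensitivity and specificity, which the abstract does not spell out. The authors ran their analyses in SPSS, so this is an illustration of the metrics rather than their procedure.

```python
from math import sqrt


def auc_rank(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def threshold_metrics(labels, scores, threshold):
    """Sensitivity, specificity, and their quadratic mean at one cutoff.
    Assumption: quadratic mean = sqrt((sens^2 + spec^2) / 2)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    quadratic_mean = sqrt((sensitivity ** 2 + specificity ** 2) / 2)
    return sensitivity, specificity, quadratic_mean


# Toy example: 1 = positive test case, 0 = negative test case.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]

toy_auc = auc_rank(toy_labels, toy_scores)
toy_sens, toy_spec, toy_qm = threshold_metrics(toy_labels, toy_scores, 0.5)
print(f"AUC={toy_auc:.4f} sens={toy_sens:.2f} spec={toy_spec:.2f} qm={toy_qm:.2f}")
```

AUC summarizes discrimination across all cutoffs, while sensitivity, specificity, and their quadratic mean describe performance at a single chosen cutoff, which is why the abstract reports both kinds of metric.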

Article activity feed

  1. SciScore for 10.1101/2020.06.09.20127092:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Institutional Review Board Statement
    Consent: All UK Biobank participants gave written, informed consent.
    IRB: Ethics approval for the UK Biobank study was obtained from the National Health Service Health Research Authority North West - Haydock Research Ethics Committee (16/NW/0274).
    Randomization
    Serology Measures for Non COVID-19 Infectious Diseases: As described (http://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/infdisease.pdf), among 9,695 randomized UK Biobank participants selected from the full 500,000 participant cohort, baseline serum was thawed and pathogen-specific assays run in parallel using flow cytometry on a Luminex bead platform [19].
    Blinding: not detected.
    Power Analysis: not detected.
    Sex as a biological variable: Sex was coded as 0 for female and 1 for male.

    Table 2: Resources

    Software and Algorithms
    Sentence: SPSS 27 was used for all analyses and Alpha set at .05.
    Resource: SPSS
    Suggested: (SPSS, RRID:SCR_002865)

    Results from OddPub: We did not detect open data. We also did not detect open code. Researchers are encouraged to share open data when possible (see Nature blog).


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Several major limitations should be noted. The number of UK Biobank participants with COVID-19 and serology data is low, particularly for positive test cases. This could consequently lead to model overfitting or misestimation. Several steps were taken to guard against this problem, including feature reduction through LDA, bootstrapped parameter estimation to guard against parametric assumption violations, and several cross-validation steps to maximize robustness. We also rigorously tested each predictor or set of predictors in the main sample and serology sub-group, where we found that model fit was not overly biased in general despite sample size differences. Nonetheless, we recognize future work must use much larger sample sizes to verify the usefulness of serology data. Another limitation was that using test case data nested within a participant violates the assumption of independence, which can lead to gross misestimation. While we ameliorated this issue using permutation testing, other latent concerns with the data like type II error may be present. We also chose to use LDA over other machine learning algorithms, where LDA tends to provide more conservative estimates. This was intentional, because it is still largely unknown how risk factors alone or additively reflect overall risk for COVID-19 infection and disease severity. Finally, we only looked at the so-called main effects of all predictors instead of complex interactions, such as darker skin, vitamin D levels, and ...
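The independence concern raised above (several test cases nested within one participant) can be made concrete with a small permutation sketch. This is not the authors' SPSS procedure. It is a hypothetical illustration, on invented toy values, of one common way to respect nesting: collapse each participant's test cases to a single value and shuffle outcome labels across participants rather than across individual test cases. The function name, parameters, and data are all assumptions.

```python
import random
from statistics import mean


def participant_permutation_test(cases_by_participant, labels,
                                 n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in a marker between
    positive (1) and negative (0) participants.

    cases_by_participant: one list of marker values per participant
        (one value per test case; toy stand-in for nested data).
    labels: one 0/1 outcome per participant.
    Labels are shuffled at the participant level, so test cases from the
    same person are never split between groups.
    """
    rng = random.Random(seed)
    # Collapse nested test cases to one summary value per participant.
    participant_means = [mean(values) for values in cases_by_participant]

    def mean_difference(current_labels):
        pos = [m for m, y in zip(participant_means, current_labels) if y == 1]
        neg = [m for m, y in zip(participant_means, current_labels) if y == 0]
        return mean(pos) - mean(neg)

    observed = mean_difference(labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(shuffled)) >= abs(observed):
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data: six hypothetical participants, each with one to three test cases.
toy_cases = [[1.0, 1.2], [0.9], [1.1, 1.0, 1.2], [2.0, 2.2], [2.1], [1.9, 2.3]]
toy_outcomes = [0, 0, 0, 1, 1, 1]

toy_observed, toy_p = participant_permutation_test(toy_cases, toy_outcomes)
print(f"observed difference={toy_observed:.3f}, permutation p={toy_p:.3f}")
```

Shuffling at the participant level keeps the null distribution from being built on pseudo-replicated test cases, at the cost of discarding within-participant variation. With only six participants there are just 20 distinct label assignments, so the smallest attainable p-value is coarse, which mirrors the small-sample caution in the limitations text.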

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.