Exploring selection bias in COVID-19 research: Simulations and prospective analyses of two UK cohort studies

Abstract

Background

Non-random selection into analytic subsamples could introduce selection bias in observational studies of SARS-CoV-2 infection and COVID-19 severity (e.g. including only those who have had a COVID-19 PCR test). We explored the potential presence and impact of selection in such studies using data from self-report questionnaires and national registries.

Methods

Using pre-pandemic data from the Avon Longitudinal Study of Parents and Children (ALSPAC) (mean age=27.6 (standard deviation [SD]=0.5); 49% female) and UK Biobank (UKB) (mean age=56 (SD=8.1); 55% female) with data on SARS-CoV-2 infection and death-with-COVID-19 (UKB only), we investigated predictors of selection into COVID-19 analytic subsamples. We then conducted empirical analyses and simulations to explore the potential presence, direction, and magnitude of bias due to selection when estimating the association of body mass index (BMI) with SARS-CoV-2 infection and death-with-COVID-19.

Results

In both ALSPAC and UKB, a broad range of characteristics were related to selection, sometimes in opposite directions. For example, more educated participants were more likely to have data on SARS-CoV-2 infection in ALSPAC, but less likely in UKB. We found bias in many simulated scenarios. For example, in one scenario based on UKB, we observed an expected odds ratio of 2.56 compared to a simulated true odds ratio of 3, per standard deviation higher BMI.

Conclusion

Analyses using self-reported or national registry COVID-19 data may be biased due to selection. The magnitude and direction of this bias depend on the outcome definition, the true effect of the risk factor, and the assumed selection mechanism.

Key messages

  • Observational studies assessing the association of risk factors with SARS-CoV-2 infection and COVID-19 severity may be biased due to non-random selection into the analytic sample.

  • Researchers should carefully consider the extent to which their results may be biased due to selection, and conduct sensitivity analyses and simulations to explore the robustness of their results. We provide code for these analyses that is applicable beyond COVID-19 research.
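The kind of simulation described in the Results can be illustrated with a minimal sketch. This is not the authors' published code, and it is not expected to reproduce the reported odds ratio of 2.56. It assumes a standardised BMI, a true odds ratio of 3 per SD for the outcome, and a hypothetical selection mechanism (for example, being tested) that depends on both BMI and the outcome. The intercepts, the selection coefficients, and the function names `simulate` and `fit_logistic_slope` are illustrative assumptions only.

```python
import math
import random


def sigmoid(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_slope(xs, ys, n_iter=12):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson and return b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1


def simulate(n=200_000, true_or=3.0, seed=1):
    """Return (odds ratio in the full sample, odds ratio in the selected subsample)."""
    rng = random.Random(seed)
    beta = math.log(true_or)
    bmi_all, y_all, bmi_sel, y_sel = [], [], [], []
    for _ in range(n):
        bmi = rng.gauss(0.0, 1.0)  # BMI in SD units
        # Outcome model: true odds ratio of `true_or` per SD higher BMI.
        y = 1.0 if rng.random() < sigmoid(-3.0 + beta * bmi) else 0.0
        # Hypothetical selection model: selection depends on BMI and the outcome.
        selected = rng.random() < sigmoid(-2.0 + 0.5 * bmi + 1.5 * y)
        bmi_all.append(bmi)
        y_all.append(y)
        if selected:
            bmi_sel.append(bmi)
            y_sel.append(y)
    full_or = math.exp(fit_logistic_slope(bmi_all, y_all))
    selected_or = math.exp(fit_logistic_slope(bmi_sel, y_sel))
    return full_or, selected_or


if __name__ == "__main__":
    full_or, selected_or = simulate()
    print(f"Full sample OR per SD BMI:     {full_or:.2f}")
    print(f"Selected sample OR per SD BMI: {selected_or:.2f}")
```

Under these assumed parameters, the estimate from the selected subsample is attenuated relative to the full-sample estimate. Changing the selection coefficients changes the size of the gap, which is the sense in which the bias depends on the assumed selection mechanism.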

Article activity feed

  1. SciScore for 10.1101/2021.12.10.21267363:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Ethics: not detected.
    Sex as a biological variable: not detected.
    Randomization: not detected.
    Blinding: not detected.
    Power Analysis: not detected.

    Table 2: Resources

    No key resources detected.


    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    Strengths and limitations: We used both empirical analyses and simulations to comprehensively investigate the potential presence and impact of selection bias in COVID-19 studies. We used two cohorts with pre-pandemic data allowing us to identify potential determinants of selection. We were able to compare across these cohorts that have contrasting sources of COVID-19 data (from questionnaires in ALSPAC and national registries in UKB). In addition, a strength of our simulations is that we based most of the parameters on either cohort data or other secondary sources to try to reflect realistic scenarios. In the analyses presented here we make several assumptions about or simplifications of the data. Both ALSPAC and UKB are subject to pre-pandemic selection bias due to non-random recruitment into these studies and loss to follow-up, which we do not account for here. Overall, we considered misclassification of the comparison groups (e.g. infected as non-infected) but not of the case groups (e.g. non-infected as infected). This may be particularly problematic for self-reported COVID-19 data and cause of death attributed to COVID-19 early in the pandemic [23]. We have focussed analyses here on the first wave of the COVID-19 pandemic in the UK. Selection bias may change over time as the pandemic progresses, which may explain some of the differences between ALSPAC and UKB. In ALSPAC, the comparison of SARS-CoV-2 (+) with everyone else, including participants who did not reply to the ...

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding.