A data driven approach to handling missing data in the UK Millennium Cohort Study
Abstract
Missing data arising from sweep non-response is a major challenge in longitudinal cohort studies, threatening both statistical power and the validity of inferences. In the UK Millennium Cohort Study (MCS), non-response has increased substantially from sweep 1 (age 9 months) to sweep 7 (age 17 years), underscoring the need for robust strategies to handle it. We applied a systematic, data-driven approach to identify predictors of non-response at each sweep of the MCS, drawing on all survey data available at the time of analysis. The strongest and most consistent predictor of non-response was non-response at a prior sweep. Additional robust predictors included lower parental occupational social class, parental non-participation in the most recent general election, a parent not being in paid work, older cohort member age, and lower cognitive test scores. We then evaluated whether including the identified predictors as auxiliary variables in multiple imputation (MI) or as covariates in inverse probability weighting (IPW) improved sample representativeness. Validation analyses, using both an external benchmark (the 2021 Census) and internal comparisons with known early-life distributions, showed that MI and IPW models including the identified predictors substantially reduced or eliminated bias in key variables such as housing tenure and parental social class. Our findings demonstrate that systematically identified auxiliary variables can improve the validity of inferences drawn from the MCS. The resulting predictor set offers a practical resource for applied researchers using MCS data and provides a replicable framework for addressing sweep non-response in other longitudinal studies.
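
The IPW idea described above can be illustrated with a minimal sketch. This is not the authors' model: the data are a small synthetic toy set, the variable names (`missed_prior_sweep`, `owns_home`, `responded`) are hypothetical, and response propensities are taken as simple response rates within strata of a single predictor rather than fitted from the full set of predictors the study identifies. It shows only how weighting respondents by the inverse of their estimated response probability can pull a biased respondent-only estimate back towards a known benchmark.

```python
from collections import defaultdict

# Synthetic toy cohort (assumption, not MCS data). Each tuple is
# (missed_prior_sweep, owns_home, responded, count).
CELLS = [
    (0, 1, 1, 32), (0, 0, 1, 8), (0, 1, 0, 8), (0, 0, 0, 2),
    (1, 1, 1, 8), (1, 0, 1, 12), (1, 1, 0, 12), (1, 0, 0, 18),
]
records = [
    {"missed_prior_sweep": m, "owns_home": o, "responded": r}
    for (m, o, r, n) in CELLS
    for _ in range(n)
]


def response_propensities(rows, predictor):
    """Estimate P(respond | predictor) as the response rate in each stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [responders, total]
    for row in rows:
        counts[row[predictor]][0] += row["responded"]
        counts[row[predictor]][1] += 1
    return {k: resp / total for k, (resp, total) in counts.items()}


def ipw_mean(rows, predictor, outcome):
    """Weighted mean of the outcome among respondents, weights = 1 / propensity."""
    props = response_propensities(rows, predictor)
    num = den = 0.0
    for row in rows:
        if row["responded"]:
            w = 1.0 / props[row[predictor]]
            num += w * row[outcome]
            den += w
    return num / den


# Benchmark: the full-cohort proportion, known here only because the data are synthetic.
benchmark = sum(r["owns_home"] for r in records) / len(records)

# Respondent-only (complete-case) estimate, which ignores non-response.
respondents = [r for r in records if r["responded"]]
naive = sum(r["owns_home"] for r in respondents) / len(respondents)

# IPW estimate using prior-sweep non-response as the only predictor.
weighted = ipw_mean(records, "missed_prior_sweep", "owns_home")

print(f"benchmark={benchmark:.3f} naive={naive:.3f} ipw={weighted:.3f}")
```

In this toy set, response depends only on the stratifying predictor, so the weighted estimate recovers the benchmark while the respondent-only estimate overstates home ownership. With real data, the propensities would normally come from a regression on many predictors, and recovery would be only as good as that model.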