Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

Abstract

Missing data are a pervasive challenge in large-scale population-based surveys such as the Demographic and Health Surveys (DHS), and inadequate handling of missingness can lead to biased estimates and reduced statistical power. This study examined patterns of missing data and compared the performance of several imputation approaches using data from the IDHS 2017–18. The prevalence and structure of missingness across key socio-demographic and health-related variables were assessed, and Little’s test was applied to evaluate whether data were missing completely at random. Five analytical strategies were compared: complete case analysis, multiple imputation by chained equations (MICE), machine learning–based Decision Tree imputation, k-nearest neighbors (KNN) imputation, and latent class–based imputation. Survey-weighted logistic regression models accounting for stratification, clustering, and sampling weights were fitted to each dataset, with antibiotic use for childhood fever as the outcome for model comparison. Several variables exhibited substantial missingness, particularly partner/husband’s occupation, respondent’s occupation, and vaccination status. Little’s test indicated that the data were not missing completely at random. Compared with complete case analysis, all imputation approaches improved the precision of estimates and revealed meaningful associations. Among the evaluated methods, the Decision Tree approach produced the most stable and consistent results, identifying significant predictors that were not detected using traditional methods. Although MICE and latent class–based imputation yielded improved estimates, their performance declined under extreme levels of missingness. These findings highlight the importance of appropriate missing data handling in complex survey analyses and demonstrate that machine learning–based Decision Tree imputation offers a flexible and robust alternative for addressing extensive missingness in large public health datasets.
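To make the contrast between complete case analysis and KNN imputation concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy rows, the helper names `complete_cases` and `knn_impute`, and the choice of k = 2 are all assumptions made for illustration, and the sketch ignores feature scaling, survey weights, stratification, and clustering.

```python
import math

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r)]

def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest donor rows.

    Distance is the root mean squared difference over the columns that
    both rows observe. Illustrative only: no scaling, no survey weights.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(rows):
                # A donor must be a different row that observes column j.
                if other_idx == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical toy data: three numeric variables, two missing entries.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, None, 11.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.1, None],
    [0.9, 1.8, 9.0],
    [4.8, 5.9, 48.0],
]

kept = complete_cases(rows)      # drops the two incomplete rows
imputed = knn_impute(rows, k=2)  # keeps all six rows

print(len(kept), len(imputed))
print(imputed[1][1], imputed[3][2])
```

Complete case analysis discards every row with a gap, whereas imputation retains the full sample, which is one reason imputed analyses can yield more precise estimates. In a real survey analysis the imputed dataset would then be passed to a survey-weighted model that accounts for the sampling design.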
