Application of machine learning techniques to similar case imputation for missing values in Demographic and Health Survey (DHS) data
Abstract
Background: This study assessed the performance of machine learning algorithms for handling missing data in Demographic and Health Survey (DHS) datasets using the similar case imputation method. In quantitative data analysis, imputation has the potential to greatly improve the knowledge available for mining high-quality compounds by supplying accurate predictions to fill in the gaps.
Methods: This study used data from three rounds of the Rwanda Demographic and Health Survey (RDHS) conducted between 2015 and 2020. The three datasets contained 9,002, 7,856, and 8,092 rows, respectively, each with 51 columns, giving between 400,656 and 459,102 observations (rows × columns) per dataset. After concatenation, the merged dataset comprised twenty numerical variables and thirty-one categorical variables. Multiple imputation regression with 'm' iterations, Decision Tree, Support Vector Machine (SVM) as a classification algorithm, K-nearest neighbors (KNN), and Random Forest (RF) were applied and compared using performance metrics to identify the best algorithm for imputing missing values.
Results: For imputing categorical variables, Support Vector Machine (SVM) ranked first, with an accuracy of 100% and a precision of 100%. Decision Tree followed with an accuracy of 79.9% and a precision of 100%, Random Forest came third with an accuracy of 78.0% and a precision of 99.2%, and KNN ranked last with an accuracy of 67.1% and a precision of 69.4%. For imputing numerical data, Random Forest performed best, with both MSE and MAE equal to 0. It was followed by Multiple Imputation by Chained Equations (MICE), which had an MSE of 1.53 × 10^9 and an R² of 0.81. Support Vector Machine (SVM) regression ranked last among the machine learning models used for numerical data imputation.
Conclusions: Performance measurements revealed that Random Forest is the optimal model for numerical variables, while Support Vector Machine classification, Decision Tree, and Random Forest are the best for categorical variables. It was concluded that Random Forest is the best machine learning algorithm for managing missing values in both categorical and numerical variables.
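To make the evaluation idea concrete, the following is a minimal sketch of similar case imputation for numerical data, scored with MSE and MAE. It is not the authors' pipeline: the function names (`distance`, `knn_impute`, `evaluate`), the choice of a plain nearest-neighbour mean, the 20% masking rate, and the synthetic data are all assumptions made for illustration, and no RDHS data is used. The approach hides values that are actually known, imputes them from the most similar rows, and compares the imputed values with the hidden ones.

```python
import math
import random


def distance(a, b):
    """Root-mean-square difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common columns: treat as maximally dissimilar
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=3):
    """Fill each None with the column mean over the k most similar rows.

    Assumption: columns are on comparable scales; real survey data would
    normally be standardised first.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def evaluate(complete, missing_rate=0.2, k=3, seed=0):
    """Hide known values at random, impute them, and return (MSE, MAE)."""
    rng = random.Random(seed)
    masked = [list(r) for r in complete]
    hidden = []
    for i, row in enumerate(complete):
        for j in range(len(row)):
            if rng.random() < missing_rate:
                masked[i][j] = None
                hidden.append((i, j))
    imputed = knn_impute(masked, k)
    errors = [imputed[i][j] - complete[i][j]
              for i, j in hidden if imputed[i][j] is not None]
    if not errors:
        raise ValueError("no values were masked and imputed")
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae


# Synthetic, correlated columns standing in for numerical survey variables.
data_rng = random.Random(42)
synthetic = []
for _ in range(60):
    x = data_rng.uniform(0, 10)
    synthetic.append([x, 2 * x + data_rng.gauss(0, 0.5), 10 - x])

mse, mae = evaluate(synthetic)
print(f"MSE={mse:.3f}  MAE={mae:.3f}")
```

The same masking scheme can be reused to compare several imputers on equal terms: each candidate replaces `knn_impute`, and for categorical variables the error measures would be swapped for accuracy and precision.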