Application of machine learning technics to similar case imputation for missing values in Demographic and Health Survey (DHS) data

Cyprien HABINSHUTI
Prof. François NIRAGIRE

Abstract

Background: This study assessed the performance of machine learning algorithms for handling missing data in Demographic and Health Survey (DHS) datasets using the similar case imputation method. In quantitative data analysis, imputation can greatly improve the information available for analysis by supplying accurate predictions to fill in the gaps.
Methods: This study used data from three rounds of the Rwanda Demographic and Health Survey (RDHS) conducted between 2015 and 2020. The three datasets contained 9,002, 7,856, and 8,092 rows, each with 51 columns, giving between 400,656 and 459,102 observations (rows × columns) per dataset. After concatenation, the merged dataset comprised twenty numerical variables and thirty-one categorical variables. Five approaches were applied and compared using performance metrics to identify the best algorithm for imputing missing values: multiple imputation by regression over m iterations, Decision Tree, Support Vector Machine (SVM), K-nearest neighbors (KNN), and Random Forest (RF).
Results: For imputing categorical variables, Support Vector Machine (SVM) ranked first, with an accuracy of 100% and a precision of 100%. Decision Tree followed with an accuracy of 79.9% and a precision of 100%, Random Forest came third with an accuracy of 78.0% and a precision of 99.2%, and KNN ranked last with an accuracy of 67.1% and a precision of 69.4%. For imputing numerical data, Random Forest performed best, with both MSE and MAE values of 0. It was followed by Multiple Imputation by Chained Equations (MICE), which had an MSE of 1.53 × 10⁹ and an R² of 0.81. Support Vector Machine (SVM) regression ranked last among the machine learning models used for numerical data imputation.
Conclusions: Performance measurements revealed that Random Forest is the optimal model for numerical variables, while Support Vector Machine classification, Decision Tree, and Random Forest are the best for categorical variables. It was concluded that Random Forest is the best machine learning algorithm for managing missing values in both categorical and numerical variables.
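
The evaluation idea described in the abstract (hide known values, impute them, then score the imputations against the hidden truth) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses only the Python standard library, synthetic data in place of the RDHS files, a simple nearest-neighbor rule as the "similar case" imputer, and a column-mean baseline for comparison. All function and variable names (`similar_case_impute`, `mean_impute`, `TARGET`) are hypothetical.

```python
import math
import random


def similar_case_impute(rows, target, k=5):
    """Fill None in column `target` with the mean of the k most similar complete rows.

    Assumption: only the target column has missing values; similarity is
    Euclidean distance over all other (numeric) columns.
    """
    donors = [r for r in rows if r[target] is not None]
    features = [j for j in range(len(rows[0])) if j != target]
    completed = []
    for r in rows:
        new_row = list(r)
        if r[target] is None:
            point = [r[j] for j in features]
            nearest = sorted(
                donors, key=lambda d: math.dist(point, [d[j] for j in features])
            )[:k]
            new_row[target] = sum(d[target] for d in nearest) / len(nearest)
        completed.append(new_row)
    return completed


def mean_impute(rows, target):
    """Baseline: fill None in column `target` with the mean of the observed values."""
    observed = [r[target] for r in rows if r[target] is not None]
    mean = sum(observed) / len(observed)
    return [
        [mean if (j == target and v is None) else v for j, v in enumerate(r)]
        for r in rows
    ]


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic stand-in for survey data: two predictors and one numeric target.
random.seed(0)
TARGET = 2
truth = []
for _ in range(200):
    x1, x2 = random.uniform(0, 10), random.uniform(0, 10)
    truth.append([x1, x2, 2 * x1 + x2 + random.gauss(0, 0.5)])

# Hide 20% of the target values so the imputations can be scored afterwards.
masked_idx = random.sample(range(len(truth)), 40)
masked = [list(r) for r in truth]
for i in masked_idx:
    masked[i][TARGET] = None

y_true = [truth[i][TARGET] for i in masked_idx]
imputers = {
    "similar-case (k=5)": lambda rows: similar_case_impute(rows, TARGET, k=5),
    "column mean": lambda rows: mean_impute(rows, TARGET),
}
results = {}
for name, imputer in imputers.items():
    completed = imputer(masked)
    y_pred = [completed[i][TARGET] for i in masked_idx]
    results[name] = (mse(y_true, y_pred), mae(y_true, y_pred))
    print(f"{name}: MSE={results[name][0]:.3f}  MAE={results[name][1]:.3f}")
```

For a categorical target, the same masking procedure would apply, with the neighbor average replaced by a majority vote and the scores replaced by accuracy and precision, the metrics reported in the Results.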

Version published to 10.21203/rs.3.rs-9380937/v1 on Research Square
Apr 14, 2026