Simulation comparison of the effects of missing data imputation methods on classification performance in high dimensional data

Abstract

This study examines how accurately different missing data imputation methods estimate missing values in high dimensional datasets, and how they affect classification with extreme learning machines (ELM). Random datasets were generated with n = 150 observations, p = 500 independent variables, and different missing data rates. The imputation methods compared were mean, median, random, k-nearest neighbors (KNN), missing value imputation with random forests (I-RF), multivariate imputation by chained equations with classification and regression trees (MICE-CART), and the direct and indirect use of regularized regression (DURR and IURR), two methods developed specifically for high dimensional data. Each method was evaluated by how close its classification scores came to the reference classification scores obtained using ELM. At low missing rates, I-RF, MICE-CART, DURR, and IURR performed best, followed by KNN; at high missing rates, DURR and IURR stood out.
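The simulation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes standard normal predictors, a missing-completely-at-random mechanism, and two hypothetical missing rates (5% and 30%), none of which are stated in the abstract. It covers only the mean and median imputers and scores them by RMSE on the masked cells, a simpler stand-in for the ELM classification scores used in the study. All function names are illustrative.

```python
import numpy as np


def make_data(n=150, p=500, seed=0):
    """Generate an n x p matrix of predictors (standard normal is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, p))


def add_mcar(X, rate, seed=0):
    """Blank out roughly `rate` of the cells completely at random (assumed mechanism)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate  # True where a value is removed
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def impute_columnwise(X_miss, stat="mean"):
    """Fill each missing cell with the mean or median of its column's observed values."""
    fn = np.nanmean if stat == "mean" else np.nanmedian
    fill = fn(X_miss, axis=0)  # one fill value per column, ignoring NaN
    X_imp = X_miss.copy()
    rows, cols = np.where(np.isnan(X_imp))
    X_imp[rows, cols] = fill[cols]
    return X_imp


def rmse_on_missing(X_true, X_imp, mask):
    """Root mean squared error computed over the originally masked cells only."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imp[mask]) ** 2)))


X = make_data()
results = {}
for rate in (0.05, 0.30):  # hypothetical rates, not taken from the article
    X_miss, mask = add_mcar(X, rate, seed=1)
    for stat in ("mean", "median"):
        X_imp = impute_columnwise(X_miss, stat)
        results[(rate, stat)] = rmse_on_missing(X, X_imp, mask)

for key, value in results.items():
    print(key, round(value, 3))
```

To move closer to the study's design, the RMSE step would be replaced by training a classifier on each imputed dataset and comparing its scores with those obtained on the complete data.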
