Assessing Imputation Techniques for Missing Data in Small and Multicollinear Datasets: Insights from Craniofacial Morphometry

Norli Anida Abdullah
Firdaus Hariri
Mohamad Norikmal Fazli Hisam
Siti Fatimah Binti Hassan

Abstract

Background: Analyses of craniofacial morphology are crucial for various medical and research applications, including the study of craniofacial development and dysmorphologies and the planning of surgical interventions. Missing data in midfacial measurements can arise from patient movement during imaging or from scanner errors, and may lead to biased conclusions and reduced statistical power.

Objective: This study evaluates several imputation techniques to determine the most effective approach for replacing missing values in a small, highly correlated, high-dimensional midfacial morphometric dataset.

Methods: Forty-two midface variables were measured on 32 observations. The missing-data mechanism was set to be at random, with 268 (20%) values missing. Five common imputation techniques were considered: Mean/Median imputation, k-Nearest Neighbors (kNN), Multiple Imputation by Chained Equations (MICE), Random Forest (RF), and Decision Tree. The performance of each technique was quantified using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Variance Preservation.

Results: RF imputation demonstrated the best overall performance, with the lowest RMSE (1.3987) and MAE (0.4902), indicating high accuracy in imputing missing values. Its variance preservation was also reasonably close to 1 (0.8961), suggesting that it retains most of the original variability in the dataset. MICE was less accurate, with higher RMSE (3.0869) and MAE (1.1246), but its variance preservation was the closest to 1 (1.0580).

Conclusion: The findings emphasize the importance of selecting appropriate imputation techniques for small, high-dimensional, and correlated datasets such as those used in midfacial morphometry analysis. RF can provide a balance between accuracy and variance retention, while MICE may be preferable for preserving the data distribution.
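The evaluation described in the abstract (mask known values, impute them, then score the imputations) can be illustrated with a minimal sketch. It is not the authors' code. It runs on synthetic data rather than the study's measurements, uses column-mean imputation only because it is the simplest of the five techniques, and assumes that "Variance Preservation" means the ratio of a column's variance after imputation to its variance in the complete data, averaged over columns. The abstract does not give that definition. All function names are hypothetical.

```python
import math
import random
import statistics


def mask_at_random(data, fraction, rng):
    """Copy `data` and set a random `fraction` of its cells to None."""
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    masked = rng.sample(cells, int(fraction * len(cells)))
    out = [row[:] for row in data]
    for i, j in masked:
        out[i][j] = None
    return out, masked


def mean_impute(data):
    """Fill each None with the mean of the observed values in its column.

    Assumes every column keeps at least one observed value.
    """
    col_means = [
        statistics.fmean(row[j] for row in data if row[j] is not None)
        for j in range(len(data[0]))
    ]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def rmse(truth, imputed, masked):
    """Root mean squared error over the masked cells only."""
    return math.sqrt(
        sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in masked) / len(masked)
    )


def mae(truth, imputed, masked):
    """Mean absolute error over the masked cells only."""
    return sum(abs(truth[i][j] - imputed[i][j]) for i, j in masked) / len(masked)


def variance_preservation(truth, imputed):
    """Assumed definition: mean over columns of var(imputed) / var(complete)."""
    ratios = []
    for j in range(len(truth[0])):
        v_true = statistics.variance([row[j] for row in truth])
        v_imp = statistics.variance([row[j] for row in imputed])
        ratios.append(v_imp / v_true)
    return statistics.fmean(ratios)


# Synthetic stand-in for the dataset: 32 observations x 42 variables.
# A shared latent factor per row makes the columns highly correlated.
rng = random.Random(0)
truth = []
for _ in range(32):
    size = rng.gauss(0, 1)
    truth.append([50 + 5 * size + rng.gauss(0, 1) for _ in range(42)])

masked_data, masked = mask_at_random(truth, 0.20, rng)
imputed = mean_impute(masked_data)

print("masked cells:", len(masked))
print("RMSE:", rmse(truth, imputed, masked))
print("MAE:", mae(truth, imputed, masked))
print("variance preservation:", variance_preservation(truth, imputed))
```

With 32 x 42 = 1344 cells, masking 20% and rounding down gives the 268 missing values reported in the Methods. Under the assumed definition, mean imputation always yields a variance ratio below 1, because every imputed cell sits exactly at the column mean. That shrinkage is what a variance-preservation score is meant to expose.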

Version published to 10.21203/rs.3.rs-6947829/v1 on Research Square
Aug 4, 2025

Related articles

Comparison of Imputation Strategies for Incomplete Electronic Health Data

This article has 5 authors:
1. Shuo Zhang
2. Zhilong Zhang
3. Yuxi Zhou
4. Shenda Hong
5. Huixin Liu
This article has no evaluations
Latest version Aug 5, 2025

Enhancing Propensity Score Analysis with data Missing Not at Random: Introducing Dual-Forest Proximity Imputation

This article has 2 authors:
1. Yongseok Lee
2. Walter Leite
This article has no evaluations
Latest version Jul 18, 2025

Missing Values Are Valuable: Shifting Focus from Amount to Form of Missing Data

This article has 3 authors:
1. Ehsan Zangene
2. Veit Schwammle
3. Mohieddin JAFARI
This article has no evaluations
Latest version Aug 27, 2025
