Assessing Imputation Techniques for Missing Data in Small and Multicollinear Datasets: Insights from Craniofacial Morphometry
Abstract
Background: Analyses of craniofacial morphology are crucial for medical and research applications, including the study of craniofacial development and dysmorphologies and the planning of surgical interventions. Missing data in midfacial measurements can arise from patient movement during imaging and from scanner errors, and may lead to biased conclusions and reduced statistical power.

Objective: This study evaluates several imputation techniques to determine the most effective approach for replacing missing values in a small, highly correlated, high-dimensional midfacial morphometric dataset.

Methods: Forty-two midface variables were measured on 32 observations. The missing-data mechanism was set to be at random, with 268 (20%) missing values. Five common imputation techniques were considered: mean/median imputation, k-Nearest Neighbors (kNN), Multiple Imputation by Chained Equations (MICE), Random Forest (RF), and Decision Tree. Performance was quantified using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and variance preservation.

Results: RF imputation showed the best overall performance, with the lowest RMSE (1.3987) and MAE (0.4902), indicating high accuracy in imputing missing values. Its variance preservation was also relatively close to 1 (0.8961), suggesting that it retains most of the original variability in the dataset. MICE was less accurate, with higher RMSE (3.0869) and MAE (1.1246), but had the variance preservation closest to 1 (1.0580).

Conclusion: The findings emphasize the importance of selecting appropriate imputation techniques for small, high-dimensional, correlated datasets such as those used in midfacial morphometric analysis. RF can provide a balance between accuracy and variance retention, while MICE may be preferable for preserving the data distribution.
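The evaluation design described in the Methods can be illustrated with a minimal sketch. It is not the authors' code, and every name in it is hypothetical. It assumes a complete 32 x 42 matrix (here synthetic and driven by one shared latent factor to mimic correlated measurements), hides 20% of the cells at random, fills them with the simplest of the five listed techniques (column-mean imputation), and scores the result. RMSE and MAE are computed over the hidden cells only. The abstract does not define variance preservation, so the sketch assumes it is the ratio of imputed to original column variance, averaged over variables.

```python
import math
import random
import statistics


def make_mask(n_rows, n_cols, n_missing, rng):
    """Choose n_missing distinct (row, col) cells completely at random."""
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, n_missing))


def mean_impute(data, mask):
    """Fill each masked cell with the mean of its column's observed values.

    Assumes every column keeps at least one observed value.
    """
    imputed = [row[:] for row in data]
    for j in range(len(data[0])):
        observed = [row[j] for i, row in enumerate(data) if (i, j) not in mask]
        col_mean = statistics.fmean(observed)
        for i in range(len(data)):
            if (i, j) in mask:
                imputed[i][j] = col_mean
    return imputed


def evaluate(original, imputed, mask):
    """Return (RMSE, MAE, variance ratio); errors use masked cells only."""
    errors = [imputed[i][j] - original[i][j] for (i, j) in mask]
    rmse = math.sqrt(statistics.fmean([e * e for e in errors]))
    mae = statistics.fmean([abs(e) for e in errors])
    # Assumed definition: mean over columns of var(imputed) / var(original).
    ratios = []
    for j in range(len(original[0])):
        var_orig = statistics.variance([row[j] for row in original])
        var_imp = statistics.variance([row[j] for row in imputed])
        ratios.append(var_imp / var_orig)
    return rmse, mae, statistics.fmean(ratios)


# Synthetic stand-in for the morphometric data (values are illustrative only).
rng = random.Random(0)
n_rows, n_cols = 32, 42
loadings = [rng.uniform(0.5, 1.5) for _ in range(n_cols)]
data = []
for _ in range(n_rows):
    size = rng.gauss(0, 1)  # shared latent factor makes columns correlated
    data.append([50 + 5 * size * w + rng.gauss(0, 1) for w in loadings])

n_missing = int(0.20 * n_rows * n_cols)  # 20% of 1344 cells, truncated
mask = make_mask(n_rows, n_cols, n_missing, rng)
imputed = mean_impute(data, mask)
rmse, mae, var_ratio = evaluate(data, imputed, mask)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  variance ratio={var_ratio:.4f}")
```

Under this assumed definition, column-mean imputation can never give a ratio above 1, because the filled cells add no spread around the observed mean. A ratio slightly above 1, such as the one reported for MICE, is possible only for methods that add variability to the imputed values. The other four techniques could be evaluated by swapping `mean_impute` for another function with the same inputs and output.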