Interpretable Machine Learning for Life Expectancy Prediction: A Comparative Study of Linear Regression, Decision Tree, and Random Forest
Abstract
Life expectancy is a fundamental indicator of population health and socio-economic well-being, yet accurately forecasting it remains challenging due to the interplay of demographic, environmental, and healthcare factors. This study evaluates three machine learning models, Linear Regression (LR), Regression Decision Tree (RDT), and Random Forest (RF), using a real-world dataset drawn from World Health Organization (WHO) and United Nations (UN) sources. After extensive preprocessing to address missing values and inconsistencies, each model's performance was assessed with R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Results show that RF achieves the highest predictive accuracy (R² = 0.9423), significantly outperforming LR and RDT. Interpretability was prioritized through p-values for LR and feature-importance metrics for the tree-based models, revealing immunization rates (diphtheria, measles) and demographic attributes (HIV/AIDS, adult mortality) as critical drivers of life-expectancy predictions. These insights underscore the synergy between ensemble methods and transparency in addressing public-health challenges. Future research should explore advanced imputation strategies, alternative algorithms (e.g., neural networks), and updated data to further refine predictive accuracy and support evidence-based policymaking in global health contexts.
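For readers unfamiliar with the three evaluation metrics named above, the following is a minimal sketch of their standard definitions in plain Python. It is not the study's code: the function names and the four observed and predicted life-expectancy values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical life-expectancy values in years (illustration only,
# not taken from the WHO/UN dataset used in the study).
observed = [70.0, 75.0, 80.0, 65.0]
predicted = [71.0, 74.0, 78.0, 67.0]

print(f"MAE  = {mae(observed, predicted):.4f}")
print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"R2   = {r2(observed, predicted):.4f}")
```

MAE and RMSE are expressed in the units of the target (here, years), with RMSE penalizing large errors more heavily, while R² is unitless and reports the share of variance in the observed values that the predictions account for.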