Correcting Algorithmic Bias in Machine Learning Prediction of Healthcare Utilization in India
Abstract
Objective
This study investigates how historical disparities in healthcare access introduce algorithmic bias into machine learning (ML) predictions of healthcare utilization among older adults in India. We examine the extent to which standard ML models underestimate utilization in disadvantaged populations and quantify the resulting distortion in national-level cost projections.
Methods
Using data from 55,698 respondents in the Longitudinal Ageing Study in India (LASI), we trained two sets of ML models to predict outpatient and inpatient utilization: one on the full population (Model 1) and another on a subsample whose healthcare needs were met (Model 2). Gradient Boosting was selected as the best-performing algorithm. To interpret model predictions and identify key drivers of healthcare utilization, we applied SHapley Additive exPlanations (SHAP). We compared model outputs across socioeconomic subgroups and extrapolated predicted utilization to national population estimates using WHO-CHOICE unit costs.
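A minimal sketch of this two-model design, assuming a pandas DataFrame `lasi` with a binary `need_met` flag, numerically encoded feature columns, and an `outpatient_visits` target (all names and the file path are hypothetical, not the study's actual variable names):

```python
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical LASI extract: features (assumed numerically encoded),
# a utilization target, and a flag for whether reported needs were met.
lasi = pd.read_csv("lasi_wave1.csv")  # assumed file name
X_cols = ["age", "mpce_quintile", "srh", "chronic_count", "caste_code", "sex"]

def fit_gbm(df):
    """Fit a Gradient Boosting model predicting outpatient visits."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[X_cols], df["outpatient_visits"], test_size=0.2, random_state=42
    )
    model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    return model, X_test

# Model 1: full population (carries historical access disparities).
model1, X_eval = fit_gbm(lasi)
# Model 2: only respondents whose healthcare needs were met.
model2, _ = fit_gbm(lasi[lasi["need_met"] == 1])

# SHAP values to identify key drivers of predicted utilization.
explainer = shap.TreeExplainer(model2)
shap_values = explainer.shap_values(X_eval)
shap.summary_plot(shap_values, X_eval)
```

An analogous pair of models would be fit for inpatient utilization; the contrast between the two training populations is what isolates the effect of unmet need.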
Findings
Model 1 consistently underestimated healthcare utilization relative to Model 2, particularly among lower-income and marginalized caste groups. Overall, outpatient and inpatient predictions from Model 2 were 8.92% (95% CI: 8.87–8.99) and 9.59% (95% CI: 9.28–9.85) higher, respectively. Nationally, this translated to an underestimation of I$390.7 (391.2–391.5) million in outpatient care and I$88.4 (86.2–90.1) million in inpatient care. The largest gaps were concentrated in the poorest and most marginalized subgroups. The SHAP analysis suggests that self-rated health (SRH), economic status (monthly per capita consumption expenditure, MPCE), and chronic conditions are consistently influential in predicting outpatient and inpatient visits, with some shifts in feature importance between models.
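Continuing the sketch above, the reported gap can be quantified by comparing the two models' predictions on the same evaluation sample and scaling the per-capita difference by population size and unit cost. The population count and I$ unit cost below are placeholders, not the study's WHO-CHOICE inputs:

```python
# Per-capita predicted outpatient visits from the two models.
pred_full = model1.predict(X_eval)   # Model 1: full population
pred_met = model2.predict(X_eval)    # Model 2: met-need counterfactual

# Relative underestimation of Model 1 vs. Model 2 (percent).
gap_pct = 100 * (pred_met.sum() - pred_full.sum()) / pred_full.sum()

# National extrapolation with hypothetical inputs: older-adult
# population and an assumed I$ outpatient unit cost.
N_OLDER_ADULTS = 138_000_000   # assumed population aged 45+
UNIT_COST_ID = 15.0            # assumed I$ per outpatient visit

per_capita_gap = pred_met.mean() - pred_full.mean()
national_underestimate = per_capita_gap * N_OLDER_ADULTS * UNIT_COST_ID
print(f"Gap: {gap_pct:.2f}%; national underestimate: "
      f"I${national_underestimate / 1e6:.1f} million")
```

In the study itself, confidence intervals around these gaps would come from resampling (e.g., bootstrapping the evaluation sample), and the same calculation would be repeated within socioeconomic subgroups to locate where the underestimation concentrates.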
Conclusion
Machine learning models trained on unadjusted population data encode algorithmic bias and risk perpetuating structural inequities by underrepresenting unmet need. Models trained on fulfilled-care scenarios yield more equitable and accurate projections.