Predicting Low Birth Weight in India Using Machine Learning Techniques: Insights from NFHS-5


Abstract

Background

Low birth weight (LBW) remains a critical public health challenge in India, affecting approximately 17% of newborns and contributing significantly to neonatal mortality and long-term adverse health outcomes. Despite its importance, the application of advanced machine learning techniques to predict LBW using large-scale demographic health survey data in low- and middle-income countries remains limited. This study aims to identify key predictors of LBW and develop robust predictive models using machine learning algorithms applied to nationally representative data from India.

Methods

We analysed data from 23,247 births recorded in the National Family Health Survey-5 (NFHS-5), conducted between 2019 and 2021 across India. LBW was defined as birth weight below 2,500 grams. Eighteen predictor variables were included, encompassing maternal demographics, reproductive health, healthcare utilisation, socioeconomic factors, and child characteristics. To address the class imbalance in the dataset (16.24% LBW cases), we employed the Synthetic Minority Over-sampling Technique (SMOTE) for data rebalancing. Five machine learning algorithms were developed and compared: Logistic Regression, Decision Tree, Random Forest, XGBoost, and Neural Networks. Model performance was evaluated using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC).

Results

The prevalence of LBW was 17.48% in our sample. LBW rates were highest among teenage mothers (19.74%), mothers with no formal education (19.95%), those from the poorest wealth quintile (20.63%), and mothers with severe anaemia (22.52%). After applying SMOTE rebalancing, the Random Forest model demonstrated superior performance with 90% accuracy, 97% precision, 83% recall, 89% F1-score, and an AUC of 0.95. Feature importance analysis revealed that maternal weight, height, wealth index, age, and anaemia status were the most significant predictors of LBW. Performance metrics for LBW prediction improved substantially across all models following rebalancing, with Random Forest and XGBoost showing the most robust discriminative ability.

Conclusions

Machine learning approaches, particularly Random Forest models applied to rebalanced data, can effectively predict LBW risk using routinely collected demographic and health data. The identified risk factors, primarily maternal nutritional status, socioeconomic position, and healthcare access, underscore the need for targeted interventions addressing social determinants of health. These predictive models offer actionable intelligence for developing early identification systems in maternal and child healthcare settings, enabling timely interventions to reduce LBW prevalence and improve health outcomes in India.
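
The rebalancing step named in the Methods can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is a simplified pure-Python illustration, not the authors' implementation; the function name `smote_sketch`, the two features (maternal weight and height), the toy values, and `k=3` are all assumptions made for the example.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea; names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the base itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up toy data: (maternal weight in kg, maternal height in cm) for LBW cases.
minority = [(42.0, 148.0), (45.0, 150.0), (40.0, 147.0), (44.0, 152.0)]
majority_count = 12  # hypothetical number of non-LBW cases

# Generate enough synthetic LBW cases to match the majority class size.
synthetic = smote_sketch(minority, n_new=majority_count - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority), "minority samples after rebalancing")
```

In practice, the imbalanced-learn package offers this technique as `imblearn.over_sampling.SMOTE` with a `fit_resample` method. The abstract does not state which implementation or settings were used.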
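
The threshold-based metrics listed in the Methods follow directly from confusion-matrix counts, and the reported F1-score can be checked against the reported precision and recall. The sketch below shows the arithmetic. The function names and the confusion-matrix counts are hypothetical, chosen only to make the calculation easy to follow; they are not taken from the study. AUC-ROC is not covered here because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check using the precision and recall reported for Random Forest.
reported_f1 = f1_from(0.97, 0.83)
print(f"F1 implied by reported precision and recall: {reported_f1:.2f}")

# Hypothetical counts (NOT from the study) to show the full calculation.
example = classification_metrics(tp=83, fp=3, fn=17, tn=97)
print({name: round(value, 2) for name, value in example.items()})
```

The implied F1 rounds to 0.89, which agrees with the 89% F1-score reported in the Results.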
