NSCH-Flourishing-ML: A Curated Dataset and Reproducible Pipeline for Machine Learning Analysis of Child Flourishing
Abstract
Large-scale population surveys provide valuable information for studying child well-being, yet their structure often limits direct application of machine learning methods. The National Survey of Children’s Health (NSCH) is one of the most comprehensive datasets for monitoring children’s health and development in the United States, but the raw survey files contain skip patterns, categorical variables, and complex survey design elements that require substantial preprocessing before predictive analysis can be performed. This study presents a curated machine-learning-ready dataset derived from the 2023 NSCH survey together with a fully reproducible computational pipeline for studying child flourishing. The pipeline constructs a binary flourishing indicator based on four survey items capturing curiosity, persistence, emotional regulation, and engagement in learning. After removing skip codes and missing responses, 1,978 valid observations were retained from the original dataset of more than 55,000 records. Feature selection using mutual information was applied to produce a reduced set of interpretable predictors suitable for benchmarking and educational use. Baseline experiments using logistic regression and random forest models show moderate predictive performance, suggesting that child flourishing cannot be accurately predicted using demographic and household variables alone. A methodological comparison between weighted and unweighted models further shows that incorporating survey weights consistently reduces predictive performance. By releasing both the curated dataset and the reproducible pipeline, this study provides a reusable resource for machine learning research on child well-being.
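As an illustration of the preprocessing steps described above, the following is a minimal sketch, not the released pipeline. It drops records with skip or missing codes on the four flourishing items, builds a binary indicator, and scores one candidate predictor by mutual information. The item names, the skip codes, the response coding, the rule that all four items must be answered positively, and the `income_band` predictor are all assumptions made for illustration; the actual variable names and codes should be taken from the NSCH codebook and the released pipeline.

```python
from collections import Counter
from math import log

# Hypothetical names and codes: the real NSCH variable names, response
# scales, and skip codes must be taken from the survey codebook.
ITEMS = ["curiosity", "persistence", "emotion_regulation", "learning_engagement"]
SKIP_CODES = {90, 95, 99}   # assumed placeholders for skip/missing codes
POSITIVE = {1, 2}           # assumed coding: 1 = "always", 2 = "usually"


def is_valid(record):
    """Keep a record only if every flourishing item has a substantive answer."""
    return all(
        record.get(k) is not None and record[k] not in SKIP_CODES for k in ITEMS
    )


def flourishing(record):
    """Assumed rule: label 1 when all four items are answered positively."""
    return int(all(record[k] in POSITIVE for k in ITEMS))


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def make_record(values, income_band):
    """Build a toy record from four item responses and one predictor."""
    record = dict(zip(ITEMS, values))
    record["income_band"] = income_band
    return record


# Toy records, invented purely to exercise the functions.
records = [
    make_record([1, 2, 1, 1], 3),
    make_record([1, 1, 2, 2], 3),
    make_record([3, 1, 1, 1], 1),
    make_record([4, 3, 2, 1], 1),
    make_record([95, 1, 1, 1], 2),   # skip code on one item: dropped
    make_record([1, 99, 1, 1], 2),   # missing code on one item: dropped
]

clean = [r for r in records if is_valid(r)]
labels = [flourishing(r) for r in clean]
mi_income = mutual_information([r["income_band"] for r in clean], labels)
print(len(clean), labels, round(mi_income, 3))
```

In practice, the mutual information score would be computed for every candidate predictor and the top-ranked features retained. A library estimator such as scikit-learn's `mutual_info_classif` could replace the hand-written function here.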