Behavioral and sociodemographic determinants of poor self-rated health among U.S. adults: an interpretable machine learning analysis
Abstract
Background: Self-rated health (SRH) is a validated, single-item measure that captures morbidity, functional status, and social vulnerability in population health. Understanding the determinants of poor SRH can support targeted public health interventions and policy planning.

Methods: Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), this study examined behavioral, sociodemographic, and clinical determinants of poor SRH among 302,125 U.S. adults. We trained Light Gradient-Boosting Machine (LGBM), Extreme Gradient Boosting (XGBoost), Random Forest, and Logistic Regression models. To address class imbalance, we compared data-level oversampling (SMOTE-NC) with algorithm-level class weighting, and all models were calibrated using isotonic regression. Variable importance was interpreted using Shapley Additive Explanations (SHAP) and validated with weighted multivariable logistic regression. Subgroup analyses examined variation in model performance across demographic and socioeconomic groups.

Results: Class-weighted LGBM provided the best balance of performance, achieving a ROC-AUC of 0.83, sensitivity of 0.75, and specificity of 0.76, and outperformed the data-level oversampling strategies. Multivariable regression identified frequent poor mental-health days (≥15 days/month) as the strongest predictor (adjusted odds ratio [aOR] = 4.23), followed by diabetes (aOR = 2.43), annual household income <$25,000 (aOR = 2.02), physical inactivity (aOR = 1.99), and obesity (aOR = 1.70). Subgroup analyses revealed significant variation in model sensitivity across age and socioeconomic strata.

Conclusions: The findings underscore the intertwined effects of mental health challenges, socioeconomic disadvantage, and chronic conditions on perceived health. This study demonstrates a transparent, equity-oriented machine learning approach to guide data-driven public health strategies.
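
As a rough illustration of two quantities named in the abstract, the sketch below computes class weights for an imbalanced binary outcome and then sensitivity and specificity from predicted labels. The abstract does not say which weighting scheme was used; the "balanced" heuristic here (n_samples / (n_classes × class_count)) is an assumption, the function names are hypothetical, and the labels are toy values rather than BRFSS data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this common "balanced" heuristic stands in for whatever
    class-weighting scheme the study actually used.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = poor SRH."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (illustrative only): 4 respondents with poor SRH, 16 without.
y_true = [1] * 4 + [0] * 16
# Hypothetical predictions: 3 of 4 positives caught, 4 of 16 negatives flagged.
y_pred = [1] * 3 + [0] * 1 + [0] * 12 + [1] * 4

weights = balanced_class_weights(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)

print(weights)
print(sens, spec)
```

Under this heuristic the minority class (poor SRH) receives the larger weight, so misclassifying a minority case costs more during training without synthesizing new rows, which is the contrast with oversampling that the abstract draws.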