Forecasting State Employment Relativity with Health, Labor Flow, and Small Establishment Indicators: A Leakage Safe Temporal Machine Learning Pipeline
Abstract
Standard employment forecasting often absorbs common national drift, which can obscure the reasons states persistently differ in employment intensity. This study develops and evaluates a forecasting framework for state employment relativity, defined as the log ratio of a state's employment per capita measure to its contemporaneous national counterpart. We construct a state-year panel for 2014 to 2024 by integrating administrative labor market aggregates, population-weighted health and socioeconomic indicators, and small-establishment structure measures that proxy local service capacity. To prevent temporal leakage, all tuning and validation procedures preserve chronological ordering, and the main model comparison is based on a fixed forward holdout that trains on 2014 to 2021 and evaluates on 2022 to 2024. Among the prespecified learners, LightGBM delivers the strongest forward performance, with holdout R^2 = 0.831, RMSE = 0.0536, and MAE = 0.0430 (errors in log points). The main interpretability analysis therefore focuses on SHAP values from the selected LightGBM model, while a stacked ensemble is used only as a supplementary robustness diagnostic for the temporal stability of predictor families. Across these analyses, the most important signals center on education composition, health burden and access, local service ecosystem capacity, and hiring intensity. A state-level case study of Texas and North Carolina shows how similar relative employment positions can arise from different feature combinations and different model error profiles. The results support a forecasting interpretation in which state deviations from national employment intensity reflect nonlinear interactions among human capital composition, population health conditions, and local business service structure. They also show why a clear separation between model selection, model interpretation, and robustness analysis is essential for credible empirical claims.
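The target definition and the fixed forward holdout described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `employment_relativity` and `forward_holdout`, the row layout, and every number in the toy panel are hypothetical and chosen only for illustration. The sketch assumes the "per capita measure" is employment divided by population, which the abstract does not spell out.

```python
import math

def employment_relativity(state_emp, state_pop, nat_emp, nat_pop):
    """Log ratio of state employment per capita to the national value.

    Assumption: per capita means employment / population.
    """
    return math.log((state_emp / state_pop) / (nat_emp / nat_pop))

def forward_holdout(rows, train_end=2021, test_end=2024):
    """Chronological split: train on years <= train_end, test on later years."""
    train = [r for r in rows if r["year"] <= train_end]
    test = [r for r in rows if train_end < r["year"] <= test_end]
    return train, test

# Hypothetical national aggregates by year (made-up numbers).
national = {y: {"emp": 4700 + 40 * (y - 2014), "pop": 10000}
            for y in range(2014, 2025)}

# Hypothetical single-state panel for 2014 to 2024 (made-up numbers).
rows = [{"state": "A", "year": y, "emp": 480 + 5 * (y - 2014), "pop": 1000}
        for y in range(2014, 2025)]

# Attach the target using the national value from the same year.
for r in rows:
    nat = national[r["year"]]
    r["relativity"] = employment_relativity(
        r["emp"], r["pop"], nat["emp"], nat["pop"]
    )

train, test = forward_holdout(rows)
```

A positive value means the state sits above the national employment intensity in that year, and a value of zero means the two match. Because the split is defined by year alone, no observation from 2022 to 2024 can enter training, which is the property the abstract refers to as preventing temporal leakage.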