Machine Learning-Based Forecasting of Tuberculosis Incidence in Taiwan: A Comprehensive Comparison of Traditional and Deep Learning Approaches with Projections to 2035
Abstract
Background: Tuberculosis (TB) remains a significant public health challenge globally, with 10.8 million incident cases in 2023. Accurate forecasting is crucial for resource allocation and for evaluating progress toward WHO End TB Strategy targets. This study developed and validated multiple machine learning models to forecast TB cases in Taiwan through 2035.

Methods: We analyzed 17 years of monthly TB surveillance data (January 2008–July 2025, n = 206 observations) from Taiwan's national electronic TB register. Five modeling approaches were systematically evaluated: Random Forest, XGBoost, LightGBM, ensemble methods (including a novel 70% XGBoost + 30% LightGBM hybrid), and hybrid LSTM-CNN deep learning architectures. Models incorporated temporal features, autoregressive lags (1, 3, 6 months), rolling averages, and stratified demographic data (age, gender, migration status). Two age stratification schemes were compared: 7 groups (0–14, 15–24, 25–34, 35–44, 45–54, 55–64, ≥ 65 years) versus 4 groups (0–24, 25–44, 45–64, ≥ 65 years). Performance was assessed using expanding-window time-series cross-validation over 36 months (August 2022–July 2025) with metrics including R², RMSE, MAPE, and directional accuracy (Hit Rate). Comprehensive sensitivity analyses evaluated forecast robustness. Scenario analyses explored intervention impacts on projected incidence.

Results: XGBoost with 7 age groups demonstrated superior performance (R² = 0.705, RMSE = 60.2, MAPE = 21.7%, Hit Rate = 97.2%), followed by LightGBM (R² = 0.698, RMSE = 61.1, MAPE = 22.0%, Hit Rate = 97.2%) and ensemble methods (R² = 0.690, RMSE = 61.8, MAPE = 22.2%, Hit Rate = 97.2%). The LSTM-CNN model achieved competitive results with 7 age groups (R² = 0.682, RMSE = 63.4, MAPE = 22.8%, Hit Rate = 94.4%), but its performance degraded with the simplified 4-group stratification.
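The evaluation protocol described above (expanding-window validation over the final 36 months, scored with R², RMSE, MAPE, and Hit Rate) can be illustrated with a minimal sketch. The abstract does not define Hit Rate precisely; the version below assumes it is the share of months in which the forecast and the observed count move in the same direction relative to the last observed month. Under that reading, 97.2% would correspond to 35 of 36 months and 94.4% to 34 of 36. The function names, the toy series, and the placeholder forecaster are all hypothetical and are not taken from the study.

```python
import math


def expanding_window_splits(n_obs, n_test):
    """Yield (train_indices, test_index) for one-step-ahead evaluation.

    The training window always starts at the first observation and grows
    by one month per step; the test point is the next unseen month.
    """
    for t in range(n_obs - n_test, n_obs):
        yield range(t), t


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    # Mean absolute percentage error, in percent; assumes no zero actuals.
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def _sign(x):
    return (x > 0) - (x < 0)


def hit_rate(previous, actual, predicted):
    """Percent of steps where forecast and outcome move the same way.

    ASSUMPTION: direction is measured against the last observed value;
    the study may define directional accuracy differently.
    """
    hits = sum(_sign(a - q) == _sign(p - q) for q, a, p in zip(previous, actual, predicted))
    return 100.0 * hits / len(actual)


if __name__ == "__main__":
    # Toy monthly counts for illustration only, not the Taiwan data.
    series = [820, 790, 805, 760, 775, 740, 750, 720, 735, 700, 710, 690]
    prev, actual, pred = [], [], []
    for train_idx, test_idx in expanding_window_splits(len(series), n_test=4):
        history = [series[i] for i in train_idx]
        # Placeholder forecaster: mean of the last three months. A real
        # pipeline would refit a model (e.g. XGBoost) on `history` here.
        pred.append(sum(history[-3:]) / 3)
        prev.append(history[-1])
        actual.append(series[test_idx])
    print("RMSE:", rmse(actual, pred))
    print("MAPE (%):", mape(actual, pred))
    print("R^2:", r_squared(actual, pred))
    print("Hit rate (%):", hit_rate(prev, actual, pred))
```

With 206 observations and a 36-month test period, this split scheme would train on the first 170 months for the first forecast and on 205 months for the last, so no forecast ever sees data from its own month or later.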
The hybrid ensemble (70% XGBoost + 30% LightGBM) forecasts Taiwan's TB incidence at 14.2 per 100,000 population in 2030 (95% CI: 12.4–16.0) and 14.6 per 100,000 in 2035 (95% CI: 12.7–16.5), representing approximately 3,247 annual cases. This reflects a roughly 50% decline from the 2023 baseline (28 per 100,000) but falls short of the WHO End TB Strategy targets (< 9 per 100,000 by 2030, < 4.5 per 100,000 by 2035). Scenario analyses indicate that a 30% case reduction through enhanced interventions could achieve 9.9 per 100,000 by 2035. Sensitivity analyses confirmed forecast robustness, with < 4% variation across model configurations.

Conclusions: Machine learning approaches, particularly gradient boosting methods (XGBoost, LightGBM) and their hybrids, provide accurate and robust TB forecasting for Taiwan. The projected trajectory suggests successful maintenance of a low TB burden but insufficient progress toward elimination goals under current conditions. Achieving the WHO 2030 and 2035 targets requires intensified interventions, including expanded preventive therapy, enhanced active case finding, and systematic screening of high-risk populations. This validated forecasting pipeline can be institutionalized for routine surveillance, policy planning, and intervention evaluation.
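To make the gap between the projections and the WHO targets concrete, the sketch below shows a fixed-weight blend of two model forecasts and the arithmetic for the decline from baseline and the further reduction needed to reach each target. It uses only the rates quoted in the abstract. The function names are hypothetical, and the blend is assumed to be a simple weighted average of the two models' predictions, which the abstract implies but does not spell out.

```python
def blend(xgb_pred, lgbm_pred, w_xgb=0.7):
    """Weighted average of two forecasts.

    ASSUMPTION: the 70/30 hybrid is a plain convex combination of the
    XGBoost and LightGBM predictions, applied point by point.
    """
    return [w_xgb * x + (1.0 - w_xgb) * l for x, l in zip(xgb_pred, lgbm_pred)]


def percent_decline(baseline, value):
    """Percent decline of `value` relative to `baseline`."""
    return 100.0 * (1.0 - value / baseline)


def required_reduction(forecast, target):
    """Percent cut from `forecast` needed to reach `target` (0 if already met)."""
    return max(0.0, 100.0 * (1.0 - target / forecast))


# Rates per 100,000 population, as quoted in the abstract.
BASELINE_2023 = 28.0
FORECAST = {2030: 14.2, 2035: 14.6}
# The targets are stated as strict upper bounds (< 9 and < 4.5); the bound
# itself is used here as the threshold.
WHO_TARGET = {2030: 9.0, 2035: 4.5}

if __name__ == "__main__":
    for year, rate in FORECAST.items():
        print(
            year,
            "decline from 2023 baseline (%):", round(percent_decline(BASELINE_2023, rate), 1),
            "further reduction needed (%):", round(required_reduction(rate, WHO_TARGET[year]), 1),
        )
```

On these figures, reaching the target would take a further reduction of roughly 37% from the 2030 forecast and roughly 69% from the 2035 forecast, which is why the 30% reduction scenario alone still falls short of the 2035 goal.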