Machine Learning Classification of Favorable vs Unfavorable Tuberculosis Treatment Outcomes Using Clinical and Sociodemographic Data from Brazil’s SINAN-TB (2001–2023)
Abstract
Tuberculosis (TB) remains a significant public health concern, particularly in low- and middle-income countries such as Brazil. Predicting treatment outcomes is essential to guide clinical decisions and strengthen public health strategies. This study evaluated the application of machine learning (ML) models to predict TB treatment outcomes using data from the Brazilian national SINAN-TB database (2001–2023). Five methodological scenarios were designed, differing in temporal scope, completeness of records, and inclusion of attributes derived from expert recommendations and social determinants of health. Tree-based ML models (Decision Tree, Random Forest, Gradient Boosting, and XGBoost) were trained on preprocessed datasets with balancing techniques (undersampling, oversampling, and SMOTE) and evaluated using metrics such as F1-Macro, AUC-ROC, and the Matthews Correlation Coefficient (MCC). The best results were obtained in the scenario that combined a broader temporal range with additional derived attributes, in which the Random Forest model achieved an MCC of 0.715. Scenarios that incorporated domain-informed variables, particularly those related to treatment duration and contact tracing, showed notable performance gains, although some evidence of overfitting emerged. The TabPFN (Tabular Prior-Data Few-Shot Network) model, despite restrictions on data volume, also delivered competitive results when enriched with contact-tracing information. These findings demonstrate the value of integrating ML approaches with clinical, social, and demographic data, reinforcing their potential to inform targeted interventions, reduce unfavorable treatment outcomes such as treatment abandonment, and mitigate TB-related morbidity and mortality.
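To make the evaluation metrics named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' code. It computes MCC and F1-Macro from a binary confusion matrix and shows a simple random undersampling step. The function names, the label encoding (1 = unfavorable outcome, 0 = favorable) and the toy labels are assumptions made for illustration only; they do not come from SINAN-TB.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_macro(tp, tn, fp, fn):
    """Unweighted mean of the per-class F1 scores for the two classes."""
    def f1(hit, false_pos, false_neg):
        d = 2 * hit + false_pos + false_neg
        return 0.0 if d == 0 else 2 * hit / d
    # For class 0 the roles of fp and fn are swapped.
    return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n = min(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n):
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


# Toy, made-up labels: 8 favorable (0) and 2 unfavorable (1) cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)             # (1, 7, 1, 1)
print(mcc(*counts))       # 0.375
print(f1_macro(*counts))  # 0.6875
```

In this toy case accuracy is 0.8 while MCC is only 0.375, which illustrates why MCC and F1-Macro are more informative than accuracy when classes are imbalanced. In practice one would more likely use library implementations (for example `matthews_corrcoef` in scikit-learn and `SMOTE` in imbalanced-learn) rather than hand-written versions.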