Machine Learning Classification of Favorable vs Unfavorable Tuberculosis Treatment Outcomes Using Clinical and Sociodemographic Data from Brazil’s SINAN-TB (2001–2023)

Maicon Herverton Lino Ferreira da Silva Barros
José Mário Nunes da Silva
Virginia Vilhena
José Roberto Ferreira Melo
Larissa Souza França
Lucia Rolim Santana de Freitas
Lívia Teixeira de Souza Maia
Patricia Takako Endo
Walter Massa Ramalho

Abstract

Tuberculosis (TB) remains a significant public health concern, particularly in low- and middle-income countries such as Brazil. Predicting treatment outcomes is essential to guide clinical decisions and strengthen public health strategies. Thus, this study evaluated the application of machine learning (ML) models to predict TB treatment outcomes using data from the Brazilian national SINAN-TB database (2001–2023). Five methodological scenarios were designed, differing in temporal scope, completeness of records, and inclusion of attributes derived from expert recommendations and social determinants of health. Tree-based ML models (Decision Tree, Random Forest, Gradient Boosting, and XGBoost) were trained on preprocessed datasets with balancing techniques (undersampling, oversampling, and SMOTE) and evaluated using metrics such as F1-Macro, AUC-ROC, and the Matthews Correlation Coefficient (MCC). The best results were obtained in the scenario that combined a broader temporal range with additional derived attributes, in which the Random Forest model achieved an MCC of 0.715. Scenarios that incorporated domain-informed variables, particularly those related to treatment duration and contact tracing, demonstrated notable performance gains, although some evidence of overfitting emerged. The TabPFN (Tabular Prior-Data Few-Shot Network) model, despite restrictions on data volume, also delivered competitive results when enriched with contact-tracing information. These findings demonstrate the value of integrating ML approaches with clinical, social, and demographic data, reinforcing their potential to inform targeted interventions, reduce unfavorable treatment outcomes, such as abandonment, and mitigate TB-related morbidity and mortality.
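
The abstract reports F1-Macro and MCC rather than plain accuracy. The sketch below is an illustration only, not the authors' code. It computes both metrics from scratch for a binary outcome to show why they suit imbalanced outcome data. The label encoding (1 = favorable, 0 = unfavorable), the function names, and the toy labels are assumptions made for this example and do not come from SINAN-TB.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_for_class(y_true, y_pred, cls):
    """F1 score with `cls` treated as the positive class."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred, positive=cls)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy, made-up labels: 8 favorable (1) and 2 unfavorable (0) outcomes.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A degenerate model that always predicts "favorable".
always_favorable = [1] * len(y_true)

# A model that finds one unfavorable case, with one miss in each class.
mixed = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

print("always favorable: MCC =", mcc(y_true, always_favorable),
      "F1-Macro =", round(f1_macro(y_true, always_favorable), 3))
print("mixed:            MCC =", mcc(y_true, mixed),
      "F1-Macro =", round(f1_macro(y_true, mixed), 3))
```

On these toy labels the always-favorable predictor reaches 80% accuracy but scores an MCC of 0, which is the kind of failure on the minority (unfavorable) class that MCC and F1-Macro are meant to expose.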

Version published to 10.21203/rs.3.rs-7502054/v1 on Research Square
Sep 3, 2025

Related articles

Machine Learning-Based Forecasting of Tuberculosis Incidence in Taiwan: A Comprehensive Comparison of Traditional and Deep Learning Approaches with Projections to 2035

This article has 1 author:
1. Mei-Mei Kuan
This article has no evaluations. Latest version: Apr 1, 2026

A Multicenter, Interpretable Machine Learning-Based Survival Model for Predicting 28-Day Mortality Risk in Sepsis Patients with Heart Failure: A Retrospective Cohort Study and Performance Comparison with the SOFA Score

This article has 9 authors:
1. Yucan Zhou
2. Jian Wang
3. Yunchong Li
4. Jiahao Zou
5. Zhaoxin Huang
6. Yue Li
7. Mengyuan Hou
8. Zhibin Ma
9. Chunlong Liu
This article has no evaluations. Latest version: Mar 9, 2026

Predicting Mortality and Risk Factors in Cystic Fibrosis Using a Boruta-Enhanced Machine Learning Pipeline: Comparative Evaluation of Ensemble and Penalized Regression Models

This article has 4 authors:
1. Farzaneh Hamidi
2. Anoshirvan Kazemnejad
3. Maryam Hassanzad
4. Mina Jahangiri
This article has no evaluations. Latest version: Mar 27, 2026
