Household Clustering of High-Risk Contacts in Smear-Positive TB Patient Families: Evidence for Hotspot Households and Risk Stratification in Rural Eastern Cape
Abstract
Background: Household contacts of smear-positive tuberculosis (TB) patients face an elevated risk of infection and disease progression, particularly young children and individuals living in overcrowded households. Despite WHO recommendations for systematic contact screening and provision of TB preventive therapy (TPT), implementation remains suboptimal in high-burden rural areas. This study aimed to develop a practical framework for identifying and prioritizing high-risk families by examining demographic predictors, household clustering, and machine learning-based risk models.
Methods: A total of 437 household contacts linked to smear-positive index cases were assessed and classified as high or low risk. Statistical analyses included descriptive measures, χ² tests, Z-tests for age-group differences, and multivariable logistic regression. Household-level vulnerability patterns were explored using network visualizations, clustered heatmaps, and risk-ranking charts. Three machine learning models (logistic regression, random forest, and gradient boosting) were trained on demographic and household variables using 5-fold cross-validation and an 80/20 hold-out test split. Model performance was evaluated using AUROC, AUPRC, accuracy, F1-score, calibration curves, and decision curve analysis.
Results: Of the 437 contacts, 290 (66.4%) were classified as high risk. Younger age was strongly associated with high-risk status (χ² = 16.61, p = 0.005), and children aged 0–4 years were significantly more likely to fall in the high-risk category (Z = 2.706). Gender showed no significant association (p = 0.523). Logistic regression identified younger age (aOR = 2.41, 95% CI: 1.48–3.94) and larger household size (aOR = 1.12 per additional member, 95% CI: 1.01–1.25) as independent predictors of high-risk status. Visual analytics revealed apparent clustering of high-risk individuals within “hotspot families,” enabling prioritization through composite risk scores. Of the three models, gradient boosting performed best (AUROC = 0.65; AUPRC = 0.76), with acceptable calibration (Brier score = 0.21) and a positive net clinical benefit in the decision curve analysis.
Conclusions: TB risk is highly clustered at the household level, with large families and young children carrying disproportionate vulnerability. Combining demographic risk assessment, household-level visualization, and predictive modeling provides a practical, data-driven approach to prioritizing households during contact investigation. These findings support the WHO’s family-centered strategy and underscore the need to strengthen clinical governance and community-engaged education to optimize TB prevention in resource-limited rural settings.
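As an illustration of the calibration and decision-curve metrics named in the abstract, the following Python sketch shows how a Brier score and a net benefit are conventionally computed. It is not the study's code: the function names and the toy labels and probabilities are hypothetical, and only the counts 290 and 437 are taken from the abstract. Under those counts, a model that assigned every contact the observed prevalence would have a Brier score of about 0.223, which offers one reference point for reading the reported 0.21.

```python
# Illustrative sketch only; function names and toy inputs are assumptions,
# not taken from the preprint.

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on contacts whose predicted risk >= threshold.

    Standard decision-curve definition:
        NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the 'treat everyone' strategy at a given threshold."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Counts reported in the abstract: 290 of 437 contacts were high risk.
prevalence = 290 / 437

# Brier score of a constant predictor that always outputs the prevalence.
reference_brier = prevalence * (1.0 - prevalence)

# Hypothetical toy example: four contacts with made-up predicted risks.
toy_labels = [1, 1, 0, 0]
toy_probs = [0.9, 0.6, 0.7, 0.2]
toy_nb = net_benefit(toy_labels, toy_probs, threshold=0.5)
```

A model shows positive net clinical benefit at a threshold when `net_benefit` exceeds both zero (the treat-none strategy) and `net_benefit_treat_all` at that same threshold. With a prevalence near two thirds, the treat-all comparison is the more demanding of the two at low thresholds.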