Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study
Abstract
Background: Transcriptomic biomarkers for tuberculosis (TB) diagnosis have shown promise in high-income settings, but concerns persist about their generalizability to high-burden endemic regions because of population-specific immune responses, genetic backgrounds, and environmental factors. We performed cross-geographic validation to test whether TB diagnostic signatures are universal or population-specific.

Methods: We obtained RNA-sequencing data from two independent cohorts: GSE107991 (London, UK; n=42; 21 active TB, 21 latent TB infection [LTBI]) and GSE101705 (South India; n=44; 28 active TB, 16 LTBI). Raw count matrices were downloaded from NCBI GEO, normalized to log2 counts per million (CPM), and aligned on 39,376 common genes. A Random Forest classifier was trained on the London cohort using 5-fold cross-validation and validated on the India cohort. Performance was assessed using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Hyperparameters: n_estimators=100, max_depth=10, min_samples_split=2, class_weight='balanced', random_state=42. Hyperparameters were not tuned on the validation cohort, to avoid overfitting to it. Batch Effect Assessment: Principal component analysis (PCA) showed disease status (active TB vs. LTBI), not cohort origin, as the primary source of variation, suggesting minimal batch effects. Participant Characteristics: All participants were HIV-negative as per the original studies' inclusion criteria. Active TB patients were treatment-naive at sample collection. BCG vaccination status and M. tuberculosis lineage information were not available.

Results: The Random Forest model achieved an AUC of 0.873 (95% CI: 0.76-0.98, SD ±0.090) in London cross-validation. Unexpectedly, validation on the India cohort yielded a higher AUC of 0.932 (95% CI: 0.85-1.00), with accuracy 90.9% (95% CI: 78.8%-96.4%), sensitivity 89.3% (95% CI: 72.8%-96.3%), and specificity 93.8% (95% CI: 71.7%-98.9%).
The negative generalization gap (-0.059; London cross-validation AUC 0.873 minus India validation AUC 0.932) indicates that the model performed numerically better on the validation cohort than in training cross-validation, challenging the hypothesis of population-specific signatures. The difference was not statistically significant (z-test, p=0.304), which is consistent with comparable performance across the two cohorts.

Conclusions: TB transcriptomic signatures for distinguishing active disease from latent infection appear biologically universal rather than population-specific. This finding supports the development of global diagnostic biomarker panels and reduces the need for region-specific validation studies. The higher performance on an independent endemic cohort strengthens the case for implementing transcriptional signature-based diagnostics worldwide.
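
The confidence intervals reported for accuracy, sensitivity, and specificity in the India cohort can be checked with a short calculation. The sketch below is illustrative only: the abstract does not name the interval method, so the Wilson score interval is an assumption, and the counts (25 of 28 active TB, 15 of 16 LTBI, 40 of 44 overall) are inferred from the reported percentages rather than stated in the source. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Counts inferred from the reported percentages (assumption, not given in the abstract).
metrics = {
    "sensitivity": (25, 28),  # active TB correctly classified
    "specificity": (15, 16),  # LTBI correctly classified
    "accuracy": (40, 44),     # all India samples correctly classified
}

for name, (correct, n) in metrics.items():
    low, high = wilson_interval(correct, n)
    print(f"{name}: {correct / n:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

Under these assumptions the computed intervals agree with the reported ones to one decimal place, which suggests, but does not confirm, that a Wilson-type interval was used. The width of the specificity interval also shows how little precision 16 LTBI samples provide.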