Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study

Abstract

Background: Transcriptomic biomarkers for tuberculosis (TB) diagnosis have shown promise in high-income settings, but concerns persist about their generalizability to high-burden endemic regions because of population-specific immune responses, genetic backgrounds, and environmental factors. We performed cross-geographic validation to test whether TB diagnostic signatures are universal or population-specific.

Methods: We obtained RNA-sequencing data from two independent cohorts: GSE107991 (London, UK; n=42; 21 active TB, 21 latent TB infection [LTBI]) and GSE101705 (South India; n=44; 28 active TB, 16 LTBI). Raw count matrices were downloaded from NCBI GEO, normalized to log2 counts per million (CPM), and aligned on 39,376 common genes. A Random Forest classifier was trained on the London cohort using 5-fold cross-validation and validated on the India cohort. Performance was assessed using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Hyperparameters: n_estimators=100, max_depth=10, min_samples_split=2, class_weight='balanced', random_state=42. Hyperparameters were not tuned on the validation set, to avoid overfitting.

Batch Effect Assessment: PCA showed disease status (active TB vs. LTBI), not cohort origin, as the primary source of variation, indicating minimal batch effects.

Participant Characteristics: All participants were HIV-negative, as per the original study inclusion criteria. Active TB patients were treatment-naive at sample collection. BCG vaccination status and M. tuberculosis lineage information were not available.

Results: The Random Forest model achieved an AUC of 0.873 (95% CI: 0.76-0.98, SD ±0.090) in London cross-validation. Unexpectedly, validation on the India cohort yielded superior performance (AUC 0.932, 95% CI: 0.85-1.00), with accuracy 90.9% (95% CI: 78.8%-96.4%), sensitivity 89.3% (95% CI: 72.8%-96.3%), and specificity 93.8% (95% CI: 71.7%-98.9%).
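
The confidence intervals reported for accuracy, sensitivity, and specificity can be sanity-checked with a Wilson score interval. The sketch below is a minimal illustration, not the authors' code: the abstract does not name the interval method, and the counts (25 of 28 active TB, 15 of 16 LTBI, 40 of 44 overall correctly classified) are assumptions inferred from the reported percentages and the India cohort sizes. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Assumed counts (correct, total), inferred from the reported percentages
# and cohort sizes; they are not stated in the abstract.
assumed_counts = {
    "accuracy": (40, 44),
    "sensitivity": (25, 28),
    "specificity": (15, 16),
}

intervals = {}
for metric, (correct, total) in assumed_counts.items():
    intervals[metric] = wilson_interval(correct, total)
    low, high = intervals[metric]
    print(f"{metric}: {correct / total:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

Under these assumed counts, the Wilson intervals agree with the reported intervals to within rounding, which suggests (but does not confirm) that a Wilson-type interval was used.
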
The negative generalization gap (-0.059) indicates that the model performed better on the validation cohort than in training cross-validation, challenging the hypothesis of population-specific signatures. The difference was not statistically significant (z-test, p=0.304), indicating consistent performance across cohorts.

Conclusions: TB transcriptomic signatures for distinguishing active disease from latent infection appear biologically universal rather than population-specific. This finding supports the development of global diagnostic biomarker panels and reduces the need for region-specific validation studies. The superior performance on an independent endemic cohort strengthens the case for implementing transcriptional signature-based diagnostics worldwide.
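
The generalization gap and the z-test on the two AUCs can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' procedure: the gap is recomputed from the two reported AUCs (0.873 and 0.932), the London standard error is assumed to be the cross-validation SD divided by the square root of the number of folds, the India standard error is backed out from the lower 95% CI bound under a normal approximation, and the two estimates are treated as independent. The function name `auc_gap_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def auc_gap_z_test(auc_train, se_train, auc_valid, se_valid):
    """Two-sided z-test for the difference between two independent AUC estimates."""
    gap = auc_train - auc_valid
    z = gap / sqrt(se_train**2 + se_valid**2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return gap, z, p_value


# Values reported in the abstract.
auc_london, sd_london, n_folds = 0.873, 0.090, 5
auc_india, ci_low_india = 0.932, 0.85

# Assumptions (not from the abstract): SE of the cross-validated AUC is
# SD / sqrt(folds); the India SE is derived from the lower 95% CI bound.
se_london = sd_london / sqrt(n_folds)
se_india = (auc_india - ci_low_india) / NormalDist().inv_cdf(0.975)

gap, z, p_value = auc_gap_z_test(auc_london, se_london, auc_india, se_india)
print(f"gap = {gap:+.3f}, z = {z:.2f}, p = {p_value:.3f}")
```

With these assumed standard errors the p-value comes out near 0.3, in line with the reported non-significant difference. It need not match the reported value exactly, because the standard errors the authors used are not given.
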
