Predicting Clinical Outcomes in Helicobacter pylori-positive Patients using Supervised Learning through the Integration of Demographic and Genomic Features
Abstract
Background
Helicobacter pylori (H. pylori) infection is widespread globally and is linked to outcomes ranging from chronic gastritis to gastric cancer. However, only a minority of infected individuals progress to malignancy, and progression is influenced by a mix of bacterial, host, and environmental factors. Current predictive approaches are limited because they rely mainly on clinical and lifestyle data. Genomic features have rarely been used, and incorporating them into machine learning models could support earlier and more personalized detection. This study aimed to evaluate the impact of integrating host metadata with genomic features from H. pylori to predict gastric cancer outcomes and identify associated variables.
Methods
A total of 1,363 publicly available H. pylori genomes with associated host information, spanning 1991 to 2024, were collected from NCBI and EnteroBase. Demographic features, virulence genes, and sequence-derived and variant-based features were extracted. Machine learning models were then developed to classify infection outcomes as gastric cancer or non-gastric cancer. Logistic regression, used as an interpretable baseline, was compared against higher-performance ensemble models (XGBoost, Random Forest). Model performance was assessed using recall, precision, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC).
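The four evaluation metrics named above can be illustrated with a minimal, dependency-free sketch. The labels, scores, and the 0.5 decision threshold below are toy values invented for illustration and are not taken from the study; the function names are hypothetical, and the study's own implementation is not described in this abstract.

```python
# Toy illustration of recall, precision, AUROC, and AUPRC.
# All data and names here are assumptions, not the study's data or code.

def recall_precision(y_true, y_pred):
    """Recall and precision for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise estimate of AUPRC: mean precision at the rank of each
    positive example. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# 1 = gastric cancer, 0 = non-gastric cancer (toy labels)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

recall, precision = recall_precision(y_true, y_pred)
roc_auc = auroc(y_true, scores)
pr_auc = average_precision(y_true, scores)
print(recall, precision, roc_auc, pr_auc)
```

Recall and precision depend on the chosen threshold, whereas AUROC and AUPRC summarize ranking quality across all thresholds, which is one reason to report both kinds of metric on an imbalanced outcome such as gastric cancer.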
Results
The logistic regression model achieved a recall of 0.736 (95% CI: 0.644-0.831) for gastric cancer and an AUROC of 0.888 (95% CI: 0.843-0.929). Both XGBoost and Random Forest outperformed the baseline, with AUROC values ranging from 0.950 to 0.954 (95% CI: 0.904-0.976). Relative to the baseline, recall of the black-box models for gastric cancer detection improved by 8.3% for XGBoost (0.797, 95% CI: 0.711-0.877) and by 11.4% for Random Forest (0.820, 95% CI: 0.734-0.896). Across models, patient age consistently emerged as the strongest predictor of gastric cancer, while several sequence-derived genomic features beyond established virulence genes also contributed to differences in infection outcome.
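The recall improvements quoted above are consistent with relative gains over the baseline rather than absolute differences. This reading is inferred from the reported point estimates, not stated in the abstract. The short check below reproduces the arithmetic; the function name is hypothetical.

```python
# Check that the reported recall gains match relative (not absolute) change.
# Point estimates are copied from the Results; everything else is illustrative.

def relative_improvement(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

baseline_recall = 0.736  # logistic regression
xgb_recall = 0.797       # XGBoost
rf_recall = 0.820        # Random Forest

xgb_gain = round(relative_improvement(baseline_recall, xgb_recall), 1)
rf_gain = round(relative_improvement(baseline_recall, rf_recall), 1)

# Absolute differences in percentage points, for comparison
xgb_abs = round((xgb_recall - baseline_recall) * 100, 1)
rf_abs = round((rf_recall - baseline_recall) * 100, 1)

print(xgb_gain, rf_gain)  # relative gains in percent
print(xgb_abs, rf_abs)    # absolute gains in percentage points
```

The relative gains come out at 8.3% and 11.4%, matching the reported figures, while the absolute gains are 6.1 and 8.4 percentage points.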
Conclusion
This study demonstrates that combining pathogen genomics with host demographics can uncover novel risk factors and support early detection with high predictive power. Explainability methods such as SHAP make the models more interpretable for clinical professionals and can improve informed decision-making. Validation and translation into clinical practice will require broader, more diverse datasets and the inclusion of additional host and lifestyle variables.