Predicting clinical outcome of Escherichia coli O157:H7 infections using explainable Machine Learning

Julian A. Paganini
Suniya Khatun
Sean McAteer
Lauren Cowley
David R. Greig
David L. Gally
Claire Jenkins
Timothy J. Dallman

Abstract

Background

Shiga toxin-producing Escherichia coli (STEC) O157:H7 is a globally dispersed zoonotic pathogen capable of causing severe disease outcomes, including bloody diarrhoea and haemolytic uraemic syndrome. While variations in Shiga toxin subtype are well-recognised drivers of disease severity, many unexplained differences remain among strains carrying the same toxin profile.

Results

We applied explainable machine learning approaches—Random Forest and Extreme Gradient Boosting—to whole-genome sequencing data from 1,030 STEC O157:H7 isolates to predict patient clinical outcomes, using data collected over two years of routine surveillance in England. A phylogeny-informed cross-validation strategy was implemented to account for population structure and avoid data leakage, ensuring robust model generalizability. Extreme Gradient Boosting outperformed Random Forest in predicting minority classes and correctly predicted high-risk isolates in traditionally low-risk lineages, illustrating its utility for capturing complex genomic signatures beyond known virulence genes. Feature importance analyses highlighted phage-encoded elements, including potentially novel intergenic regulators, alongside established virulence factors. Moreover, key genomic regions linked to small RNAs and stress-response pathways were enriched in isolates causing severe disease.
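The idea behind a phylogeny-informed cross-validation split can be illustrated with a minimal sketch: closely related isolates are kept together in the same fold, so a model is never tested on near-clones of its training data. The abstract does not say how isolates were grouped or how many folds were used, so the cluster labels, the function name `phylo_group_folds`, and the greedy balancing rule below are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def phylo_group_folds(cluster_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no cluster spans both sets.

    cluster_ids is a hypothetical list with one phylogenetic cluster label
    per isolate; how such clusters are derived is outside this sketch.
    """
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    if len(members) < n_folds:
        raise ValueError("need at least as many clusters as folds")

    # Greedy balancing (an assumption): place the largest clusters first,
    # each into whichever fold currently holds the fewest isolates.
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cluster])

    for f in range(n_folds):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical toy data: ten isolates assigned to four clusters.
clusters = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
splits = list(phylo_group_folds(clusters, n_folds=2))
```

Each `(train_idx, test_idx)` pair can then be used to fit and evaluate a classifier; because whole clusters are held out together, the test score reflects performance on lineages the model has not seen.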

Conclusions

These findings underscore the capacity of explainable ML to refine risk assessments, offering a valuable tool for early detection of high-risk STEC O157:H7 and guiding targeted public health interventions.

Version published to 10.1101/2025.06.05.25329036 on medRxiv
Jun 6, 2025

