Machine learning approaches for the identification and analysis of enterotoxin genes in Staphylococcus aureus genomes

Abstract

Staphylococcus aureus produces a broad range of enterotoxins that act as superantigens, disrupting host immune responses and causing a wide range of clinical symptoms. However, large-scale analyses that integrate enterotoxin gene diversity, lineage structure and isolate metadata remain scarce. We analysed 15,887 S. aureus RefSeq genomes using a machine learning pipeline that combines profile Hidden Markov Model-based enterotoxin gene identification, lineage typing, gene profile-based strain clustering and association rule mining across a broad range of gene and metadata features. This approach identified 35 distinct enterotoxin genes and five variant forms, including two putative novel enterotoxin genes, sel34 and sel35. HDBSCAN clustering distinguished 45 enterotoxin gene profile groups, revealing strong associations between the two major egc enterotoxin gene cluster variants (OMIWNG and OMIUNG) and Clonal Complex membership: CC5, CC22 and CC45 with OMIWNG; CC30 and CC121 with OMIUNG. Integration of isolate metadata exposed distinct geographic and temporal trends, including a recent rise in non-egc lineages derived from Asia and animal sources. These findings show that S. aureus enterotoxin diversity is structured by lineage, mobile genetic element composition and Clonal Complex association. The discovery of sel34 and sel35, together with the comprehensive overview of lineage-specific enterotoxin profiles, expands current understanding of S. aureus virulence evolution and provides a scalable analytical framework for monitoring toxin gene dynamics in clinical and environmental populations.
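To illustrate the kind of statistic that association rule mining reports for a link such as "CC30 with OMIUNG", the sketch below computes support, confidence and lift for a single rule. It is a minimal illustration, not the pipeline used in the study: the isolate records are hypothetical toy data, and the function name `rule_metrics` is an assumption introduced here.

```python
# Hypothetical toy records of (Clonal Complex, egc variant).
# These are NOT data from the study; they only illustrate the calculation.
isolates = [
    ("CC5", "OMIWNG"),
    ("CC5", "OMIWNG"),
    ("CC22", "OMIWNG"),
    ("CC45", "OMIWNG"),
    ("CC30", "OMIUNG"),
    ("CC30", "OMIUNG"),
    ("CC121", "OMIUNG"),
    ("CC8", "none"),
]


def rule_metrics(records, clonal_complex, egc_variant):
    """Return (support, confidence, lift) for the rule
    clonal_complex -> egc_variant over (CC, egc) records."""
    n = len(records)
    n_cc = sum(1 for cc, _ in records if cc == clonal_complex)
    n_egc = sum(1 for _, egc in records if egc == egc_variant)
    n_both = sum(
        1 for cc, egc in records if cc == clonal_complex and egc == egc_variant
    )
    # Guard against empty input or items that never occur.
    if n == 0 or n_cc == 0 or n_egc == 0:
        return 0.0, 0.0, 0.0
    support = n_both / n              # fraction of isolates matching both sides
    confidence = n_both / n_cc        # P(egc variant | Clonal Complex)
    lift = confidence / (n_egc / n)   # confidence relative to the baseline rate
    return support, confidence, lift


support, confidence, lift = rule_metrics(isolates, "CC30", "OMIUNG")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 indicates that the egc variant occurs with the Clonal Complex more often than its overall frequency would predict. In practice, rules would be mined over many gene and metadata features at once rather than over a single pair.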

Impact Statement

Understanding how virulence genes evolve and spread in Staphylococcus aureus is vital for predicting its pathogenic potential and managing infection risk. By analysing more than 15,000 publicly available S. aureus genomes, this study provides the most comprehensive overview to date of enterotoxin gene diversity and lineage structure. Using machine learning and large-scale genomic mining, we reveal clear evolutionary and epidemiological patterns linking enterotoxin gene clusters to specific Clonal Complexes and identify two previously unknown enterotoxin genes. These findings highlight how recombination and horizontal gene transfer shape S. aureus toxins across hosts, continents and time. The resulting analytical framework offers a scalable foundation for future genomic surveillance of virulence evolution in both clinical and environmental settings.

Data Summary

Genome assemblies and associated metadata for the 15,887 Staphylococcus aureus strains analysed in this study were generated by others before this work and were downloaded from the NCBI RefSeq database. Secondary datasets, and descriptions of the algorithmic approaches used to develop them, are presented in the main text and in Supplementary Files 1 and 2.