Rapid geographical source attribution of Salmonella enterica serovar Enteritidis genomes using hierarchical machine learning

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This important study presents a machine learning-based classifier that can accurately determine the geographic origin of a Salmonella enterica sample from its whole-genome sequencing data in under five minutes, leading to actionable public health insights. Applying the method to 2,313 whole-genome sequences collected in the United Kingdom and several external validation datasets, the authors provide convincing evidence that Salmonella genomic data can be used to identify the likely geographic source of a food-borne outbreak and, in most cases, correctly identify the country of origin of an infection acquired overseas. The work presents an excellent case for the potential utility of routine genomics coupled with machine learning for public health microbiology, and the methods are likely to be applicable to other pathogens besides Salmonella enterica.

Abstract

Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans by the consumption of contaminated foodstuffs. In the UK and many other countries in the Global North, a significant proportion of cases are caused by the consumption of imported food products or contracted during foreign travel, making rapid identification of the geographical source of new infections a requirement for robust public health outbreak investigations. Herein, we detail the development and application of a hierarchical machine learning model to rapidly identify and trace the geographical source of S. Enteritidis infections from whole genome sequencing data. A total of 2313 S. Enteritidis genomes, collected by the UKHSA between 2014 and 2019, were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to four continents, 11 sub-regions, and 38 countries (53 classes). The highest classification accuracy was achieved at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718, and 0.661, respectively). A number of countries commonly visited by UK travelers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provided granular geographical source prediction directly from sequencing reads in <4 min per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology. The results suggest that additional application to a broader range of pathogens and other geographically structured problems, such as antimicrobial resistance prediction, is warranted.
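
The two metrics quoted above can be illustrated with a minimal sketch. The abstract does not define hF1, so the code assumes the commonly used set-based definition (overlap between the ancestor sets of the true and predicted labels); the toy hierarchy, label names, and function names are hypothetical and are not taken from the authors' code.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Europe": None, "Asia": None,
    "Southern Europe": "Europe", "South-East Asia": "Asia",
    "Spain": "Southern Europe", "Italy": "Southern Europe",
    "Thailand": "South-East Asia",
}

def path_to_root(label):
    """Return the label together with all of its ancestors."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def hierarchical_f1(y_true, y_pred):
    """Assumed definition: F1 computed on ancestor-set overlap, pooled over samples."""
    overlap = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        t_nodes, p_nodes = path_to_root(t), path_to_root(p)
        overlap += len(t_nodes & p_nodes)
        pred_total += len(p_nodes)
        true_total += len(t_nodes)
    h_precision = overlap / pred_total
    h_recall = overlap / true_total
    total = h_precision + h_recall
    return 2 * h_precision * h_recall / total if total else 0.0

y_true = ["Spain", "Italy", "Thailand"]
y_pred = ["Spain", "Spain", "Thailand"]
print(round(macro_f1(y_true, y_pred), 3))
print(round(hierarchical_f1(y_true, y_pred), 3))
```

Under this assumed definition, confusing Italy with Spain is penalised less by hF1 than by country-level macro F1, because the continent and sub-region are still correct.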

Article activity feed

  2. Reviewer #1 (Public Review):

    In this manuscript, the authors describe the development and application of a hierarchical machine learning model to identify the likely source of S. Enteritidis using whole genome sequence data. The application makes use of a collection of 2,313 genomes from 4 continents, 11 sub-regions and 38 countries. The approach is, to the best of my knowledge, novel and represents a substantial advance over previous approaches. The model is demonstrated to have good performance at the continental level and - where sufficient training data were available - also at the country level.

    Strengths of the work include the clear exposition of the methods, application to a large and detailed genomic database of clinical S. Enteritidis isolates, and the use of five independent validation data-sets.

    Limitations include lack of validation using post-pandemic data (as the authors state, the model may need retraining in light of changes to the global food network). Also, claimed novelties of the work include greater geographic granularity and faster turnaround time compared to alternative methods, but no explicit comparison to other methods is made.

    Overall, the authors achieve their aims in describing a hierarchical machine learning model for source attribution using pathogen whole genome sequences. The approach is likely to be of broad relevance and considerable public health utility.

  3. Reviewer #2 (Public Review):

    In this study, Bayliss et al. built a machine learning algorithm that predicts which country an isolate of Salmonella Enteritidis has come from based on its genome sequence. The study used S. Enteritidis isolates taken from clinical infections in the UK with recently reported travel, with the recent travel location being assumed as the source of infection.

    The reason for developing this type of algorithm is to use it for source attribution in the case of gastroenteritis cases caused by imported food or cases of gastroenteritis picked up during travel overseas. S. Enteritidis is a major cause of gastroenteritis worldwide. Its transmission is tied in with the food chain, and understanding where it travels and how is key to reducing the burden of these infections. While a country's efforts to reduce the burden of these bacteria within its own borders can have tremendous benefits, imported food can still introduce contaminated meat and produce, and these have indeed become larger proportional risks following control efforts in the UK.

    S. Enteritidis shows strong geographical substructuring across its phylogenetic tree. Traditional phylogenetic analysis is time-consuming (particularly to perform repeatedly on a routine basis) and requires highly skilled staff to perform. Machine learning should be able to identify genetic markers linked to clades typically found in a single location, without the need to build and interpret a phylogenetic tree.

    There is some nice methods development work in this paper, with the employment of a hierarchical structure to the ML modelling pipeline and the use of an array of classifier, resampler, feature selection and parameter optimisation techniques to increase accuracy.

    However, the main strength of this paper is how well tailored the model is to a real-world use case. Many groups are applying machine learning to genomic data, but often not with a clearly defined use case or realistic training and testing conditions. The results begin by giving the reader an understanding of the current state of this work in a UK context, where all clinically reported cases of Salmonella are sequenced and, when appropriate, travel history is recorded. The algorithm is designed to fit into this existing practice, and thought has been put into how this would be operationalised. For example, the authors have shown that this work can truly be done in real time, by developing an algorithm that works directly on raw reads and takes <4 mins to run. A great touch in this work was determining the time horizon over which the model should be retrained to keep up with contemporary geographic distributions of this pathogen. The time horizon itself may not be highly generalizable in genomic epidemiology, but the methods provided make it easier for others to make the same assessment for their pathogen and use case.

    A weakness of the work is the areas where predictions are not as accurate, but this relates to the extent of pathogen sequencing today rather than the method itself. Countries with less accurate predictions are ones from which few people return with an infection and, if they do, it tends to be a different strain each time, making it impossible to build an accurate algorithm for these cases without denser sampling outside of clinical infections or more sequencing of infections occurring in other countries. Without proofs of concept like this, there is a weaker economic argument to justify these investments. Therefore, this work represents an important step in demonstrating the feasibility of the method itself and the value in gathering more data. In contrast, a major strength of this work is that it uses data collected routinely from existing practice in the UK, rather than a bespoke sampling strategy that may not be realistic for routine public health. A comparison of the collection to NCBI also found this sampling to be less biased by specific outbreaks of interest, which is encouraging.

    The training dataset appears to be only based on infections acquired overseas, while I suspect the model would be more useful in investigating infections due to imported contaminated food. An unresolved question from this work is therefore whether the source of travel-acquired infections and infections caused by food imported from the same places is the same, or whether exported vs domestically consumed food around the world is treated differently in important ways that would affect the relative prevalence and success of strains in causing infections. Looking at clinical infections also may bias Salmonella to those that cause more severe forms of infection, as many people don't report to a doctor when they have food poisoning. The large egg-related outbreak that did not feature much at all in the UKHSA dataset is potentially a nice example of this.

    The low accuracy on countries with low infection numbers and high genetic diversity indicates that these algorithms would likely become less accurate over time if food safety is improved, and that individual countries could avoid being confidently attributed as a source of infection by eliminating or controlling major circulating foodborne clones. More clearly communicating when a prediction is uncertain could be helpful in dealing with isolates from countries where it is hard to make a determination.
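
    One way such uncertainty could be communicated is sketched below. This is only an illustration of the reviewer's suggestion, not something the manuscript is stated to implement: the function name, the per-level probabilities, and the threshold value are all hypothetical.

```python
def deepest_confident_level(path_probs, threshold=0.6):
    """Report the predicted path only down to the last level that is confident.

    path_probs: list of (label, probability) pairs ordered from continent to
    country. Both the input format and the default threshold are assumptions
    made for illustration.
    """
    reported = []
    for label, prob in path_probs:
        if prob < threshold:
            break  # stop descending once a level is too uncertain
        reported.append(label)
    return reported or ["unresolved"]

# Hypothetical per-level probabilities for one isolate.
prediction = [("Europe", 0.97), ("Southern Europe", 0.81), ("Spain", 0.42)]
print(deepest_confident_level(prediction))
```

    In this sketch, an isolate whose country-level call is weak would be reported at the sub-region level rather than being assigned to a single country.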

    One final limitation I see is the exclusion of UK Salmonella isolates - in cases where it is uncertain whether a Salmonella infection is due to import or not, it does not seem possible to make this assessment using the ML tool. This also limits the utility of the tool for other countries that might also benefit.

    The authors have done an excellent job of demonstrating the feasibility of this approach and honing their machine learning workflow to the specific demands of the task. The work presents a clear and well-thought-out use case, with the overall performance of the algorithm broken down into test cases where the algorithm is successful and unsuccessful, which provides useful insight into what we can expect from the performance of these approaches.

    Finding a way to better communicate when the source of an outbreak is unclear due to poor representation of a clade or a clade that is found in many countries would be a valuable extension of this work in the future, but as it is the results represent a promising starting point for initiating investigations into the source of Salmonella infections.

    Diarrheal disease is a huge health burden worldwide. Previous work to lower the burden of these infections has shown that targeted interventions can make a substantial difference to the burden of disease and success of clonal outbreaks. The availability of a tool that can be used routinely to assess the most likely overseas origin of an infection could potentially highlight previously unrecognised outbreaks or areas of suddenly increased importation rate. In turn, this could lead to better investigations and targeted improvement of food security.

    This paper provides an excellent case for the value of collecting recent travel history and including it in metadata for pathogen genomic data. If this were done in more countries with different patterns of travel and the data could be shared, this would provide a valuable global resource and start to capture the flow of strains internationally.

    I am curious about the implications of being better able to attribute clinical gastroenteritis cases in the UK (and elsewhere) to food imported or travel to specific countries with respect to trade and regulation. This is well outside the scope of the paper; however, the ability to capture isolates commonly picked up from food around the world without the cooperation of these countries raises interesting issues, particularly when factoring in the authors' scenarios of the true country of origin being obscured by uneven travel patterns and complex food supply networks.

  4. Reviewer #3 (Public Review):

    The authors describe a machine learning method for classifying the geographic origin of a Salmonella enterica isolate based on its whole-genome sequencing data. This is done at a continent, region, and country level, and the method is shown to be robust to phylogenetic diversity, temporal trends, and possibly some amount of mislabelling (but please see the first concern below). The authors demonstrate that their pipeline produces results in 5 minutes or less, which makes it applicable to many public health microbiology settings.

    Some clear strengths of the paper include:
    - the use of a hierarchical classification method, which ensures that only those samples that can be unambiguously classified as belonging to a specific region can get assigned to a sub-region within that region (e.g. continent to country)
    - leveraging the UKHSA dataset going back nearly a decade, and containing a comprehensive record of all clinically detected Salmonella enterica infections, which mitigates potential biases and ensures a maximal geographic coverage
    - making all the data (microreact) and the source code (GitHub) public, which facilitates replication as well as enables other researchers and public health microbiologists to use the trained models directly on their own data
    - the use of unitigs as the basis for prediction, which are more informative than k-mers yet more straightforward to identify than SNPs or gene alleles.
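
    The hierarchical strength listed above can be sketched as a top-down 'local classifier per node' scheme over unitig presence/absence features. This is a minimal sketch under stated assumptions: the tiny naive Bayes scorer is a stand-in chosen for brevity (not the authors' classifier), training each node against its siblings is an assumed policy, and the data, two-level hierarchy, and all names are hypothetical.

```python
import math
from collections import defaultdict

class BinaryNB:
    """Stand-in binary scorer: Bernoulli naive Bayes with Laplace smoothing."""

    def fit(self, rows, labels):
        pos = [r for r, y in zip(rows, labels) if y]
        neg = [r for r, y in zip(rows, labels) if not y]
        n_features = len(rows[0])
        self.bias = math.log((len(pos) + 1) / (len(neg) + 1))
        self.p_pos = self._rates(pos, n_features)
        self.p_neg = self._rates(neg, n_features)
        return self

    @staticmethod
    def _rates(rows, n_features):
        return [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)]

    def score(self, row):
        """Log-odds that the sample belongs under this node."""
        s = self.bias
        for x, pp, pn in zip(row, self.p_pos, self.p_neg):
            s += math.log(pp if x else 1 - pp) - math.log(pn if x else 1 - pn)
        return s

def train(samples):
    """samples: list of (features, path), e.g. ([1, 0, 1, 0], ("Europe", "Italy"))."""
    children = defaultdict(set)
    for _, path in samples:
        parent = "ROOT"
        for node in path:
            children[parent].add(node)
            parent = node
    models = {}
    for parent, kids in children.items():
        # Assumed policy: each node is trained only against its siblings.
        local = [(x, p) for x, p in samples if parent == "ROOT" or parent in p]
        for kid in kids:
            models[kid] = BinaryNB().fit([x for x, _ in local],
                                         [kid in p for _, p in local])
    return children, models

def predict(row, children, models):
    """Descend from the root, following the highest-scoring child at each level."""
    node, path = "ROOT", []
    while node in children:
        node = max(sorted(children[node]), key=lambda k: models[k].score(row))
        path.append(node)
    return path

# Hypothetical unitig presence/absence profiles.
samples = [
    ([1, 1, 0, 0], ("Europe", "Spain")), ([1, 1, 0, 0], ("Europe", "Spain")),
    ([1, 0, 1, 0], ("Europe", "Italy")), ([1, 0, 1, 0], ("Europe", "Italy")),
    ([0, 0, 0, 1], ("Asia", "Thailand")), ([0, 0, 0, 1], ("Asia", "Thailand")),
]
children, models = train(samples)
print(predict([1, 0, 1, 0], children, models))
```

    Because the country-level models are only consulted after a continent has been chosen, a sample can never be assigned to a country outside its predicted continent, which is the property the reviewer highlights.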

    There are several methodological concerns that should ideally be addressed:
    - in addition to the more complex situation of a tourist visiting country A and consuming food from country B, it would be good to rule out a simpler one of the tourist visiting both countries on the same trip (including via a stopover at an airport); the authors should elaborate on the plausibility of missing data on such multi-country trips and their frequency based on the available travel data
    - similarly, there appears to be an underlying assumption that the UK is never the origin of a Salmonella enterica infection in the dataset selected; the authors should explain why that is a reasonable assumption for this dataset
    - the increase of infection incidence during the summer months might be at least partly attributable to a greater number of trips abroad during that period - if the authors have corrected their data for this, they should explicitly say so
    - lastly, in discussing the outbreak due to Polish eggs, it should be possible to check explicitly what fraction of the training data may have originated from this outbreak to see if this is sufficient to explain the observed poor prediction

    Overall, this is a paper representing a substantial body of work and combining algorithmic advances with practical utility given the rapid turnaround time. It is likely to be generalisable to other pathogens of public health importance and to become integrated into standard protocols for outbreak origin tracing.