Whole-Proteome ESM-2 Embeddings Recover Taxonomy and Enable Geometry-Aware Triage of Foodborne Bacterial Genomes

Abstract

Whole-genome sequencing (WGS) has transformed foodborne pathogen surveillance, yet time-sensitive decision-making remains constrained by computationally expensive alignment-centric workflows that scale poorly to outbreak volumes and lack built-in confidence signals. Using 21,657 GenomeTrakr-derived assemblies spanning nine food safety–relevant taxa, we represent each genome by mean-pooling per-protein embeddings from ESM-2 (480 dimensions). The resulting embedding space is dominated by taxonomic structure, exhibiting near-perfect neighborhood consistency for both species and a coarse species/pathotype-derived pathogenicity prior (mean homophily >0.99). Density-based clustering recovered species-coherent structure with high purity and bootstrap stability, while external agreement with the binary pathogenicity prior was only moderate, which is consistent with phylogenetic entanglement by design rather than embedding failure. As a within-genus stress test, kNN separates E. coli O157:H7 from non-pathogenic E. coli with ∼98% accuracy (5-fold CV), demonstrating that known pathotype annotations are preserved in the embedding geometry even among closely related genomes. We position this mean-pooling baseline relative to contextual genome language models that retain protein order or operon-scale context, and outline how embedding geometry (homophily, purity, outliers) can serve as a principled confidence layer in bio-surveillance-oriented triage pipelines.
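
As an illustration of the two core operations named in the abstract, the following minimal sketch mean-pools per-protein vectors into one genome vector and computes a kNN homophily score (the mean fraction of each genome's nearest neighbors that share its label). It is not the authors' implementation: the data are synthetic stand-ins rather than ESM-2 outputs, and the function names, the Euclidean metric, and k=5 are assumptions chosen for illustration.

```python
import numpy as np

def mean_pool_genome(protein_embeddings):
    """Average per-protein vectors (n_proteins, dim) into one genome vector (dim,)."""
    return np.asarray(protein_embeddings, dtype=float).mean(axis=0)

def knn_homophily(X, labels, k=5):
    """Mean fraction of each point's k nearest neighbors that share its label.

    Euclidean distance and k=5 are illustrative assumptions, not values
    taken from the abstract.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2ab.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a genome is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    return (labels[nn] == labels[:, None]).mean()

# Synthetic stand-in for real per-protein embeddings: two hypothetical taxa,
# 20 genomes each, 50 "proteins" per genome, 480 dimensions as in the abstract.
rng = np.random.default_rng(0)
dim = 480
centers = {"taxon_a": rng.normal(0, 1, dim), "taxon_b": rng.normal(0, 1, dim)}

genomes, labels = [], []
for name, center in centers.items():
    for _ in range(20):
        proteins = center + rng.normal(0, 0.5, size=(50, dim))
        genomes.append(mean_pool_genome(proteins))
        labels.append(name)

X = np.vstack(genomes)
homophily = knn_homophily(X, labels, k=5)
print(X.shape, round(float(homophily), 3))
```

With real data, each `proteins` array would hold the ESM-2 vectors for one assembly's predicted proteins. A low per-genome neighbor agreement could then flag an assembly for closer review, in the spirit of the confidence layer the abstract describes.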
