Benchmarking strain-level profiling of Escherichia coli in short-read gut metagenomes

Matthew Galbraith
David Williams
Liam P. Shaw
Samuel Lipworth
Nicole Stoesser


Abstract

Metagenomes offer the potential to characterise Escherichia coli strain-level diversity within the human gut microbiome, informing our understanding of colonisation diversity and the genetic features distinguishing infection from carriage. Among numerous reference-based tools for short-read metagenomic strain-level profiling, the best approach remains unclear. Here, we benchmarked six published tools—PanTax, PathoScope, StrainGE, Strainify, StrainR2 and StrainScan—for their ability to detect co-existing strains of E. coli and estimate their relative abundance across real and simulated metagenomes of increasing complexity with varying reference database composition. In the ZymoBIOMICS® D6331 dataset, only PanTax achieved zero error when predicting the equal abundance of five E. coli strains. In a differentially abundant four-strain mock community dataset (SRR13355226), StrainScan had the lowest mean absolute proportional error (0.89), driven by reduced sensitivity (0.5), followed by PathoScope (4.08). Across simulated metagenomes reflecting the healthy adult gut microbiome, all tools demonstrated high sensitivity (≥0.833), but specificity, precision and F1 score were selectively improved in some tools through detection thresholds to remove low abundance false positives. Outright, StrainGE achieved the highest F1 score (0.978). Predicted relative abundances of the E. coli K12-MG1655 (phylogroup A) and O157:H7 Sakai (phylogroup E) strains spiked into simulated metagenomes across varying abundance ratios were generally accurate, with PanTax and StrainR2 showing the lowest mean absolute proportional error (0.06). When truly present strains were removed from the reference database, out-of-phylogroup assignments were observed for some tools. Collectively, our results demonstrate that published metagenomic strain-level profiling tools vary in their ability to profile E. coli strains, indicating that method selection should be guided by intended application. 
These findings will facilitate characterisation of E. coli strain-level diversity within short-read gut metagenomes with greater accuracy than previously possible.
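The detection and abundance metrics reported above can be illustrated with a minimal Python sketch. The abstract does not define them, so the sketch makes two assumptions: sensitivity, specificity, precision and F1 score follow the standard confusion-matrix definitions over the strains in a reference database, and mean absolute proportional error is the mean of |predicted - true| / true across truly present strains. The function names, strain labels and abundances are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, Set


def call_strains(predicted: Dict[str, float], threshold: float = 0.0) -> Set[str]:
    """Return strains whose predicted relative abundance is > 0 and >= threshold."""
    return {s for s, a in predicted.items() if a > 0 and a >= threshold}


def _ratio(num: float, den: float) -> float:
    """Safe division: return 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def detection_metrics(truth: Set[str], called: Set[str], database: Set[str]) -> Dict[str, float]:
    """Confusion-matrix metrics over all strains in the reference database (assumed definitions)."""
    tp = len(truth & called)             # truly present and called
    fp = len(called - truth)             # called but absent
    fn = len(truth - called)             # present but missed
    tn = len(database - truth - called)  # absent and not called
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def mean_abs_proportional_error(true_abund: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Assumed definition: mean of |predicted - true| / true over truly present strains.

    A missed strain counts as predicted abundance 0, i.e. a proportional error of 1.
    """
    errors = [abs(predicted.get(s, 0.0) - t) / t for s, t in true_abund.items()]
    return sum(errors) / len(errors)


# Hypothetical example: two strains spiked at equal abundance, one low-abundance false positive.
database = {"K12", "Sakai", "strain_X", "strain_Y"}
true_abund = {"K12": 0.5, "Sakai": 0.5}
predicted = {"K12": 0.48, "Sakai": 0.47, "strain_X": 0.05}

unfiltered = detection_metrics(set(true_abund), call_strains(predicted), database)
filtered = detection_metrics(set(true_abund), call_strains(predicted, threshold=0.1), database)
print(unfiltered)
print(filtered)
print(mean_abs_proportional_error(true_abund, predicted))
```

In this toy case a detection threshold of 0.1 removes the low-abundance false positive, so precision and F1 score rise while sensitivity is unchanged, which mirrors the thresholding effect described in the abstract.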

Impact statement

Strain-level diversity within the human gut microbiome can be important for human health, with species such as Escherichia coli existing as both commensal and pathogenic strains. Most existing gut microbiome datasets are from short-read (i.e., Illumina) sequencing, and numerous bioinformatic tools have been developed to profile strain-level variation from these data. However, the existing literature is often difficult to navigate given that the available tools have been benchmarked in various ways and are subject to author bias. This is, to our knowledge, the first independent benchmarking of six published tools for profiling E. coli at strain-level resolution from short-read metagenomes. Using both real and simulated datasets of increasing complexity, we demonstrate substantial variation in tool performance in terms of strain detection and relative abundance estimation, highlighting that tool choice should be guided by the specific research question, as no single method performs optimally across all scenarios. This work provides an unbiased framework for tool selection and will support more accurate and reproducible E. coli strain-level analyses in gut microbiome research from short-read metagenomic data.

Data summary

The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. Supplementary methods, six supplementary tables and four supplementary figures are available in the online Supplementary Material. Code for simulating metagenomes using InSilicoSeq, SLURM job scripts for the simulated metagenomes dataset and R visualization and statistical analysis scripts are available within a dedicated public GitHub repository (https://github.com/mattgal11/benchmarking_short_read_strain_profilers). The following supplementary data are available on FigShare (https://doi.org/10.6084/m9.figshare.32125474):

Normalised per-contig relative abundances for 98 species assemblies used to construct the baseline gut microbiome profile for InSilicoSeq metagenome simulation (Normalised_relative_abundance_for_InSilicoSeq_simulated_metagenomes_gut_microbiome_profile.csv)
ZymoBIOMICS® D6331 gut microbiome standard dataset predicted relative abundance data (Zymobiomics_D6331_raw_predicted_abundance.csv)
SRR13355226 mock community (99% human reads; 1% E. coli reads) paired-end reads with human reads depleted (SRR13355226_depleted_R1.fastq.gz & SRR13355226_depleted_R2.fastq.gz)
SRR13355226 mock community dataset raw predicted abundance data, with and without human read removal (SRR13355226_raw_predicted_abundance_with_and_without_human_read_removal.csv)
Simulated metagenomes dataset raw call types and detection metric values with increasing detection thresholds (Simulated_metagenomes_raw_call_type_assingments_and_detection_thresholds.csv)
Simulated metagenomes dataset (all references) predicted relative abundance data (Simulated_metagenomes_all_references_raw_predicted_abundances.csv)
Simulated metagenomes dataset (all references) mapped reads for PathoScope and Strainify (all_refs_pathoscope_reads_mapped.csv & all_refs_strainify_reads_mapped.csv)
Simulated metagenomes dataset (reduced reference database) predicted relative abundance data (Simulated_metagenomes_K12_and_Sakai_removed_from_reference_database_raw_predicted_abundance.csv)

Version published to 10.64898/2026.05.19.726160 on bioRxiv
May 19, 2026
