Comparison of gene-by-gene and genome-wide short nucleotide sequence-based approaches to define the global population structure of Streptococcus pneumoniae

Alannah C. King
Narender Kumar
Kate C. Mellor
Paulina A. Hawkins
Lesley McGee
Nicholas J. Croucher
Stephen D. Bentley
John A. Lees
Stephanie W. Lo

Abstract

Defining the population structure of a pathogen is a key part of epidemiology, as genomically related isolates are likely to share key clinical features such as antimicrobial resistance profiles and invasiveness. Multiple different methods are currently used to cluster together closely related genomes, potentially leading to inconsistency between studies. Here, we use a global dataset of 26 306 Streptococcus pneumoniae genomes to compare four clustering methods: gene-by-gene seven-locus MLST, core genome MLST (cgMLST)-based hierarchical clustering (HierCC) assignments, life identification number (LIN) barcoding and k-mer-based PopPUNK clustering (known as GPSCs in this species). We compare the clustering results with phylogenetic and pan-genome analyses to assess their relationship with genome diversity and evolution, as we would expect a good clustering method to form a single monophyletic cluster that has high within-cluster similarity of genomic content. We show that the four methods are generally able to accurately reflect the population structure based on these metrics and that the methods were broadly consistent with each other. We investigated further to study the discrepancies in clusters. The greatest concordance was seen between LIN barcoding and HierCC (adjusted mutual information score=0.950), which was expected given that both methods utilize cgMLST, but have different methods for defining an individual cluster and different core genome schema. However, the existence of differences between the two methods shows that the selection of a core genome schema can introduce inconsistencies between studies. GPSC and HierCC assignments were also highly concordant (AMI=0.946), showing that k-mer-based methods which use the whole genome and do not require the careful selection of a core genome schema are just as effective at representing the population structure. 
Additionally, where there were differences in clustering between these methods, this could be explained by differences in the accessory genome that were not identified in cgMLST. We conclude that for S. pneumoniae, standardized and stable nomenclature is important as the number of genomes available expands. Furthermore, the research community should transition away from seven-locus MLST, whilst cgMLST, GPSC and LIN assignments should be used more widely. However, to allow for easy comparison between studies and to make previous literature relevant, the reporting of multiple clustering names should be standardized within the research.
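
The concordance figures quoted in the abstract are adjusted mutual information (AMI) scores between two sets of cluster assignments. The following is a minimal sketch of how such a score could be computed. It is not the authors' pipeline: the abstract does not state which implementation or normalisation was used, so the arithmetic-mean normaliser here is an assumption, and the isolate labels in the usage example are hypothetical toy values.

```python
from collections import Counter
from math import exp, lgamma, log


def _entropy(counts, n):
    """Shannon entropy (in nats) of a partition given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in counts)


def _log_fact(k):
    """log(k!) via the log-gamma function, to avoid huge factorials."""
    return lgamma(k + 1)


def adjusted_mutual_info(labels_a, labels_b):
    """AMI between two clusterings of the same isolates.

    Assumption: normalised by the arithmetic mean of the two entropies.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")

    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Observed mutual information from the contingency table.
    mi = sum(
        (nij / n) * log(n * nij / (sizes_a[i] * sizes_b[j]))
        for (i, j), nij in joint.items()
    )

    # Expected mutual information under random labelling with fixed
    # cluster sizes (hypergeometric model).
    emi = 0.0
    for ai in sizes_a.values():
        for bj in sizes_b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (
                    _log_fact(ai) + _log_fact(bj)
                    + _log_fact(n - ai) + _log_fact(n - bj)
                    - _log_fact(n) - _log_fact(nij)
                    - _log_fact(ai - nij) - _log_fact(bj - nij)
                    - _log_fact(n - ai - bj + nij)
                )
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)

    mean_h = (_entropy(sizes_a.values(), n) + _entropy(sizes_b.values(), n)) / 2
    denom = mean_h - emi
    if abs(denom) < 1e-12:
        # Both partitions are trivial (e.g. a single cluster each).
        return 1.0
    return (mi - emi) / denom


if __name__ == "__main__":
    # Hypothetical assignments for six isolates under two methods.
    gpsc = ["GPSC1", "GPSC1", "GPSC1", "GPSC2", "GPSC2", "GPSC3"]
    hiercc = ["HC_a", "HC_a", "HC_b", "HC_c", "HC_c", "HC_d"]
    print(f"AMI = {adjusted_mutual_info(gpsc, hiercc):.3f}")
```

A score of 1 means the two methods partition the isolates identically (cluster names need not match), while a score near 0 means they agree no more than would be expected by chance. The triple loop is adequate for a sketch but would be slow for tens of thousands of genomes and many clusters; an optimised library routine would be preferable at that scale.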

Version published to 10.1099/mgen.0.001278 on Access Microbiology: Aug 28, 2024
Version published to 10.1101/2024.05.29.596230 on bioRxiv: Jun 2, 2024

Related articles

Genome-Scale Analysis Reveals Strain Kdesi as a Distinct Evolutionary Lineage and Extensive Cryptic Diversity in the Genus Bdellovibrio

This article has 9 authors:
1. Temidayo Oluyomi Elufisan
2. Isabel Cristina Rodríguez-Luna
3. Yewande Olajumoke Ajao
4. Ibukun John Abulude
5. Alejandro Sánchez-Varela
6. Omotayo Opemipo Oyedara
7. Ronald Ferrera-Cerrato
8. Miguel Angel Villaloboz_Lopez
9. Xianwu Guo
This article has no evaluations. Latest version: Jan 9, 2026

Comprehensive genomic and metagenomic profiling of antibiotic resistance genes in Klebsiella pneumoniae isolates from whole-genome sequencing

This article has 4 authors:
1. Mehdi ashkan Shaddel
2. Rohollah Kamyabi
3. Saba Almasi Chegeni
4. Majid Vahed
Reviewed by Access Microbiology

This article has 1 evaluation. Latest version: Dec 22, 2025. Latest activity: Jan 6, 2026

Isolation and Whole-Genome Sequencing of a Less Prevalent Indian A. baumannii Strain Reveals Unique Uncharacterized Hypothetical Proteins and AMR-Linked ncRNAs.

This article has 14 authors:
1. Sovon Acharya
2. Parmanand Kushwaha
3. Shailesh Desai
4. Langamba Longjam Angom
5. Arup Ghosh
6. Surajit Gandhi
7. Ankur Verma
8. Biswaroop Chatterjee
9. Munindra Ruwali
10. R Shyam Prasada Rao
11. Sachinandan Sachinandan De
12. Lakshminarasimhan Krishnaswamy
13. Prashanth Suravajhala
14. Gyaneshwer Chaubey
This article has no evaluations. Latest version: Dec 11, 2025
