Development and Implementation of a Core Genome Multilocus Sequence Typing (cgMLST) scheme for Haemophilus influenzae

Made Ananda Krisna
Keith A. Jolley
William Monteith
Alexandra Boubour
Raph L. Hamers
Angela B. Brueggemann
Odile B. Harrison
Martin C. J. Maiden


Abstract

Haemophilus influenzae is part of the human nasopharyngeal microbiota and a pathogen causing invasive disease. The extensive genetic diversity observed in H. influenzae necessitates discriminatory analytical approaches to evaluate its population structure. This study developed a core genome MLST (cgMLST) scheme for H. influenzae using pangenome analysis tools and validated the scheme using datasets consisting of complete reference genomes (N=14) and high-quality draft H. influenzae genomes (N=2,297). The draft genome dataset was divided into a development dataset (N=921) and a validation dataset (N=1,376). The development dataset was used to identify candidate core genes, and the validation dataset was used to refine the final core gene list to ensure the reliability of the proposed cgMLST scheme. Functional classifications were made for all resulting core genes. Phylogenetic analyses were performed using both allelic profiles and nucleotide sequence alignments of the core genome to test congruence, assessed by Spearman’s correlation and ordinary least squares (OLS) linear regression. Preliminary analyses using the development dataset identified 1,067 core genes, which were refined to 1,037 with the validation dataset. More than 70% of core genes were predicted to encode proteins essential for metabolism or genetic information processing. Phylogenetic and statistical analyses indicated that the core genome allelic profile accurately represented phylogenetic relatedness among the isolates (R² = 0.945). We used this cgMLST scheme to define a high-resolution population structure for H. influenzae, which enhances the genomic analysis of this clinically relevant human pathogen.
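The congruence test described in the abstract compares, for every pair of isolates, a distance computed from allelic profiles with a distance computed from the nucleotide alignment. The sketch below illustrates that idea in plain Python under stated assumptions: the function names, the treatment of missing loci as `None`, and the use of the proportion of differing sites as the nucleotide distance are all illustrative choices, not the authors' pipeline (which is published separately and likely uses model-based phylogenetic distances).

```python
from itertools import combinations


def allelic_distance(p, q):
    """Number of loci with different allele numbers.

    Loci missing (None) in either profile are skipped (an assumption).
    """
    return sum(1 for a, b in zip(p, q)
               if a is not None and b is not None and a != b)


def nucleotide_distance(s, t):
    """Proportion of differing sites between two aligned sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(s, t) if a != b) / len(s)


def pairwise(items, dist):
    """Distances for all isolate pairs, in a fixed (sorted-name) order."""
    return [dist(items[a], items[b])
            for a, b in combinations(sorted(items), 2)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ols_r_squared(x, y):
    """R² of a simple OLS fit y = a + b*x.

    With one predictor and an intercept, R² equals Pearson's r squared.
    """
    return pearson(x, y) ** 2
```

With dictionaries mapping isolate names to allelic profiles and to aligned sequences, `pairwise(profiles, allelic_distance)` and `pairwise(alignment, nucleotide_distance)` yield two lists in the same pair order that can be passed to `spearman` and `ols_r_squared`.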

Impact statement

Discriminating H. influenzae variants and evaluating population structure has been challenging and largely unstandardised. To address this, we have developed a cgMLST scheme for H. influenzae. Since an accurate typing approach relies on precise reflection of the underlying population structure, we explored various methods to define the scheme. The core genes included in this scheme were predicted to encode functions in essential biological pathways, such as metabolism and genetic information processing, and could be reliably assembled from short-read sequence data. Single-linkage clustering based on core genome allelic profiles was highly congruent with the genealogy reconstructed by maximum-likelihood (ML) methods from the core genome nucleotide alignment. The cgMLST scheme v1 enables rapid and accurate depiction of high-resolution H. influenzae population structure, and its availability through the PubMLST database ensures that microbiology reference laboratories and public health authorities worldwide can use it for genomic surveillance.
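To make the single-linkage clustering mentioned above concrete, here is a minimal sketch: two isolates fall into the same cluster when they are connected by a chain of pairs that each differ at no more than a chosen number of loci. The function names and the thresholds are hypothetical, chosen only for illustration; the source does not state which thresholds the scheme uses, and this is not the PubMLST implementation.

```python
from itertools import combinations


def allelic_mismatches(p, q):
    """Count loci where two allelic profiles carry different alleles."""
    return sum(1 for a, b in zip(p, q) if a != b)


def single_linkage_clusters(profiles, threshold):
    """Single-linkage clusters of isolates at a locus-mismatch threshold.

    profiles:  dict mapping isolate name -> list of allele numbers
    threshold: maximum mismatches for two isolates to be linked
               (hypothetical value supplied by the caller)
    """
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair within the threshold; chains of links merge clusters.
    for a, b in combinations(names, 2):
        if allelic_mismatches(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy_profiles = {
        "iso1": [1, 1, 1, 1, 1],
        "iso2": [1, 1, 1, 1, 2],
        "iso3": [1, 1, 1, 2, 2],
        "iso4": [5, 5, 5, 5, 5],
    }
    print(single_linkage_clusters(toy_profiles, threshold=1))
```

In the toy data, iso1 and iso3 differ at two loci yet still share a cluster at a threshold of 1, because each is within one mismatch of iso2. This chaining behaviour is the defining property of single linkage.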

Data summary

The H. influenzae cgMLST scheme is accessible via https://pubmlst.org/organisms/haemophilus-influenzae. The list of isolate IDs available publicly from pubmlst.org is provided in Supplementary File 1. The pipeline for cgMLST scheme development and validation is published at https://www.protocols.io/private/EF6DB7FE429311EEB8630A58A9FEAC02. All in-house R and Python scripts for data processing and analysis are available from https://gitfront.io/r/user-4399403/ZHt8DArALHcY/cgmlst-hinf/.

Version published to 10.1101/2024.04.15.589521 on bioRxiv
Apr 16, 2024

