Machine learning–assisted selection of informative loci for strain-level phylogenetics of Neisseria gonorrhoeae

Abstract

Epidemiological surveillance of Neisseria gonorrhoeae is hindered by the limitations of existing molecular typing methods such as NG-MAST and MLST, which suffer from either excessive variability or insufficient resolution. In this study, we propose and evaluate a machine learning (ML) algorithm that automatically selects a minimal set of informative genetic loci for accurate strain classification. Using a collection of 29 reference genomes of N. gonorrhoeae, we developed a pipeline that integrates Random Forest models and DNABERT embeddings to generate optimized gene panels. The results show that ML-selected panels substantially outperform traditional schemes, yielding markedly improved phylogenetic accuracy and branch support consistently above 90%. The proposed approach significantly reduces computational cost compared with whole-genome analysis and is a promising, resource-efficient tool for routine epidemiological monitoring, tracking transmission pathways, and identifying antibiotic-resistant strains.
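
To make the idea of a "minimal set of informative loci" concrete, the sketch below picks loci greedily until every pair of strains differs at one or more of the chosen loci. This is a deliberately simple stand-in and not the Random Forest and DNABERT pipeline described in the abstract. The function name `select_loci`, the locus names, the strain names, and the allele numbers are all hypothetical and used only for illustration.

```python
from itertools import combinations


def select_loci(profiles, locus_names):
    """Greedily choose loci until every strain pair differs at a chosen locus.

    profiles maps a strain name to a tuple of allele identifiers, one per
    locus, in the same order as locus_names. Returns the chosen locus names
    and the set of strain pairs that no locus can separate.
    """
    strains = sorted(profiles)
    unresolved = set(combinations(strains, 2))
    chosen = []

    def gain(i):
        # Number of still-unresolved pairs that locus i would separate.
        return sum(profiles[a][i] != profiles[b][i] for a, b in unresolved)

    while unresolved:
        best = max(range(len(locus_names)), key=gain)
        if gain(best) == 0:
            break  # remaining pairs are identical at every locus
        chosen.append(locus_names[best])
        # Keep only the pairs that the newly chosen locus fails to separate.
        unresolved = {
            (a, b) for a, b in unresolved if profiles[a][best] == profiles[b][best]
        }
    return chosen, unresolved


# Hypothetical toy data: four strains typed at three loci.
locus_names = ["locus_A", "locus_B", "locus_C"]
profiles = {
    "strain_1": (1, 1, 1),
    "strain_2": (1, 2, 1),
    "strain_3": (2, 1, 1),
    "strain_4": (2, 2, 1),
}

panel, leftover = select_loci(profiles, locus_names)
print(panel, leftover)
```

In this toy input, `locus_C` carries the same allele in every strain and is never selected, which mirrors the general goal of discarding uninformative loci. A model-based ranking such as the one the abstract describes would replace the pair-counting `gain` function with a learned importance score.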
