Biased sampling confounds machine learning prediction of antimicrobial resistance
Abstract
Antimicrobial resistance (AMR) poses a growing threat to human health. Genome sequencing is increasingly applied to the surveillance of bacterial pathogens, producing a wealth of data for training machine learning (ML) applications that predict AMR and identify resistance determinants. However, bacterial populations are highly structured and sampling is biased towards human disease isolates, so samples and the features derived from them are not independent. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by analyzing over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens, using pathological training data in which resistance is confounded with phylogeny. We show that the resulting ML models perform poorly and that increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. We provide concrete recommendations for evaluating future ML approaches to AMR.
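The confounding described in the abstract can be illustrated with a deliberately small toy example. The sketch below is not the authors' pipeline or data: the two clades, the number of lineage markers, the single causal gene, and the simple feature-voting classifier are all hypothetical choices made for illustration. In the training set, every resistant isolate comes from clade A and every susceptible isolate from clade B, so lineage markers are indistinguishable from the causal gene. The test set breaks that link.

```python
from itertools import product

N_LINEAGE_MARKERS = 5  # hypothetical number of clade-specific variants


def make_genome(clade_a, resistant):
    """Binary features: lineage markers followed by one causal resistance gene."""
    return [int(clade_a)] * N_LINEAGE_MARKERS + [int(resistant)]


def confounded_training_set(n_per_class):
    """Resistant isolates all come from clade A, susceptible ones from clade B."""
    X = [make_genome(True, True)] * n_per_class
    X += [make_genome(False, False)] * n_per_class
    y = [1] * n_per_class + [0] * n_per_class
    return X, y


def balanced_test_set(n_per_cell):
    """Every clade/phenotype combination is equally represented."""
    X, y = [], []
    for clade_a, resistant in product([True, False], repeat=2):
        X += [make_genome(clade_a, resistant)] * n_per_cell
        y += [int(resistant)] * n_per_cell
    return X, y


def fit_feature_votes(X, y):
    """Weight each feature by its frequency in resistant minus susceptible isolates."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(len(X[0]))
    ]


def predict(weights, x):
    # Centre features at 0.5 so that an absent feature votes against resistance.
    score = sum(w * (v - 0.5) for w, v in zip(weights, x))
    return int(score > 0)


def accuracy(weights, X, y):
    return sum(predict(weights, x) == label for x, label in zip(X, y)) / len(y)


def run(n_per_class):
    """Return (training accuracy, test accuracy) for a given training size."""
    X_train, y_train = confounded_training_set(n_per_class)
    X_test, y_test = balanced_test_set(25)
    weights = fit_feature_votes(X_train, y_train)
    return accuracy(weights, X_train, y_train), accuracy(weights, X_test, y_test)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        train_acc, test_acc = run(n)
        print(f"n_per_class={n}: train={train_acc:.2f}, test={test_acc:.2f}")
```

In this toy setting the classifier fits the training data perfectly but is no better than chance on the balanced test set, and enlarging the confounded training set does not change that, because more samples from the same biased scheme carry no information that separates lineage from the causal gene.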