Excluding Geographic Variables Does Not Fix Regional Bias in Machine Learning Antimicrobial Resistance Prediction: Analysis of 77,548 Isolates Across 132 Countries

Abstract

Background: Machine learning models for antimicrobial resistance (AMR) prediction exhibit geographic performance disparities, with models trained on high-income country data underperforming in low- and middle-income settings. A seemingly straightforward solution is to exclude geographic variables from models. We tested whether this approach eliminates regional bias.

Methods: We analysed 77,548 bacterial isolates from the BV-BRC database across two cohorts. The Primary Cohort (n = 39,859 Escherichia coli from 132 countries) quantified regional ciprofloxacin resistance prevalence. The Genomic Cohort (n = 37,689 E. coli with fluoroquinolone resistance gene annotations) tested whether a model trained exclusively on genomic features, with geographic variables explicitly excluded, would produce equitable sensitivity across regions. We evaluated sensitivity disparities at multiple classification thresholds.

Results: Despite excluding all geographic variables, the genomic model produced significantly different prediction scores by region (ANOVA F = 4.99, p = 1.45×10⁻⁴). At a threshold of 0.30, sensitivity ranged from 61.4% (Oceania) to 81.0% (Africa), a 19.6 percentage point disparity. No threshold achieved sensitivity variation below 10 percentage points across all regions. The underlying cause is that resistance gene prevalence itself varies geographically (qnr genes: 1.0% North America vs 6.8% Asia, p < 0.001), meaning any model using these biologically relevant features will inherit geographic structure.

Conclusions: Excluding geographic variables does not fix regional bias in AMR prediction because the bias is encoded in the underlying biology, not the model's feature set. Resistance genes vary geographically due to antimicrobial pressure, horizontal gene transfer, and clonal expansion. These findings demonstrate that geographic fairness requires region-specific models or thresholds, not simply removing location data. We recommend mandatory geographic stratification in model evaluation and recognition of geography as a protected attribute in medical AI fairness frameworks.
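
The geographically stratified evaluation recommended above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the record layout `(region, score, is_resistant)`, and the rule that a score at or above the threshold counts as a predicted-resistant call are all assumptions made for illustration, and the toy records are invented and do not come from the study.

```python
from collections import defaultdict


def sensitivity_by_region(records, threshold):
    """Return per-region sensitivity (true positive rate) at a threshold.

    records: iterable of (region, score, is_resistant) tuples (assumed layout).
    A score >= threshold is treated as a "resistant" prediction (assumption).
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for region, score, is_resistant in records:
        if is_resistant:
            positives[region] += 1
            if score >= threshold:
                true_pos[region] += 1
    # Only regions with at least one resistant isolate appear in the result.
    return {region: true_pos[region] / positives[region] for region in positives}


def sensitivity_gap(records, threshold):
    """Return the max-minus-min sensitivity across regions at a threshold."""
    sens = sensitivity_by_region(records, threshold)
    return max(sens.values()) - min(sens.values())


# Invented toy records for illustration only; NOT data from the study.
records = [
    ("Africa", 0.90, True), ("Africa", 0.40, True),
    ("Africa", 0.20, True), ("Africa", 0.35, True),
    ("Africa", 0.10, False),
    ("Oceania", 0.80, True), ("Oceania", 0.25, True),
    ("Oceania", 0.10, True), ("Oceania", 0.50, True),
    ("Oceania", 0.60, False),
]

if __name__ == "__main__":
    # Sweep several thresholds and report the regional sensitivity gap at each.
    for t in (0.20, 0.30, 0.50):
        print(t, sensitivity_by_region(records, t), sensitivity_gap(records, t))
```

Sweeping thresholds this way shows whether any single cut-off brings the regional gap under a chosen tolerance, such as the 10 percentage points mentioned in the Results.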
