Excluding Geographic Variables Does Not Fix Regional Bias in Machine Learning Antimicrobial Resistance Prediction: Analysis of 77,548 Isolates Across 132 Countries

Hayden Farquhar

Abstract

Background: Machine learning models for antimicrobial resistance (AMR) prediction exhibit geographic performance disparities, with models trained on high-income country data underperforming in low- and middle-income settings. A seemingly straightforward solution is to exclude geographic variables from models. We tested whether this approach eliminates regional bias.

Methods: We analysed 77,548 bacterial isolates from the BV-BRC database across two cohorts. The Primary Cohort (n = 39,859 Escherichia coli from 132 countries) quantified regional ciprofloxacin resistance prevalence. The Genomic Cohort (n = 37,689 E. coli with fluoroquinolone resistance gene annotations) tested whether a model trained exclusively on genomic features, with geographic variables explicitly excluded, would produce equitable sensitivity across regions. We evaluated sensitivity disparities at multiple classification thresholds.

Results: Despite excluding all geographic variables, the genomic model produced significantly different prediction scores by region (ANOVA F = 4.99, p = 1.45×10⁻⁴). At a threshold of 0.30, sensitivity ranged from 61.4% (Oceania) to 81.0% (Africa), a 19.6 percentage point disparity. No threshold achieved sensitivity variation below 10 percentage points across all regions. The underlying cause is that resistance gene prevalence itself varies geographically (qnr genes: 1.0% North America vs 6.8% Asia, p < 0.001), so any model using these biologically relevant features will inherit geographic structure.

Conclusions: Excluding geographic variables does not fix regional bias in AMR prediction because the bias is encoded in the underlying biology, not the model's feature set. Resistance genes vary geographically due to antimicrobial pressure, horizontal gene transfer, and clonal expansion. These findings demonstrate that geographic fairness requires region-specific models or thresholds, not simply removing location data. We recommend mandatory geographic stratification in model evaluation and recognition of geography as a protected attribute in medical AI fairness frameworks.
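
The region-stratified evaluation described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names `sensitivity_by_region` and `disparity_points`, the record layout, and the toy scores are all hypothetical, and the numbers are invented for demonstration rather than taken from the study.

```python
from collections import defaultdict


def sensitivity_by_region(records, threshold):
    """Per-region sensitivity (true positive rate) at a score threshold.

    records: iterable of (region, true_label, score), where true_label 1
    means the isolate is phenotypically resistant. (Hypothetical layout.)
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for region, label, score in records:
        if label != 1:
            continue  # sensitivity only considers truly resistant isolates
        if score >= threshold:
            true_pos[region] += 1
        else:
            false_neg[region] += 1
    regions = set(true_pos) | set(false_neg)
    return {r: true_pos[r] / (true_pos[r] + false_neg[r]) for r in regions}


def disparity_points(sensitivities):
    """Gap between best and worst region, in percentage points."""
    values = sensitivities.values()
    return (max(values) - min(values)) * 100


# Toy, made-up data: (region, true_label, model_score).
toy = [
    ("Africa", 1, 0.90), ("Africa", 1, 0.40), ("Africa", 1, 0.35),
    ("Africa", 1, 0.20), ("Africa", 0, 0.10),
    ("Oceania", 1, 0.80), ("Oceania", 1, 0.25), ("Oceania", 1, 0.10),
    ("Oceania", 1, 0.05), ("Oceania", 0, 0.60),
]

sens = sensitivity_by_region(toy, threshold=0.30)
gap = disparity_points(sens)

# Sweep several thresholds, as the abstract describes, and report the gap.
for t in (0.10, 0.30, 0.50):
    s = sensitivity_by_region(toy, threshold=t)
    print(f"threshold={t:.2f} gap={disparity_points(s):.1f} points")
```

Applied to the reported figures, the same subtraction gives 81.0 − 61.4 = 19.6 percentage points at a threshold of 0.30.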

Version published to 10.21203/rs.3.rs-8772201/v1 on Research Square
Feb 5, 2026

Related articles

Predicting Methicillin Resistance in Staphylococcus aureus from Antibiotic Co-Resistance Profiles: A Machine Learning Approach Using XGBoost

This article has 1 author:
1. Ryan Yi Sheng Neo
This article has no evaluations
Latest version Mar 12, 2026

Genomic and Machine Learning Approaches for Predicting Antimicrobial Resistance: A One Health Scoping Review in Low- and Middle-Income Countries

This article has 7 authors:
1. Zuhura Kimera
2. Majigo Mtebe
3. Upendo Kibwana
4. Salim Masoud
5. Doreen Kamori
6. Erasto Mbugi
7. Mecky Isaac Matee
This article has no evaluations
Latest version Mar 18, 2026

Integrating Phenotypic and Genomic Data with Machine Learning to Predict Antimicrobial Resistance and Identify Genetic Biomarkers in E. coli

This article has 2 authors:
1. Sarah Halleluyah Adeyemi
2. Roshan Paudel
This article has no evaluations
Latest version Mar 19, 2026
