PhenoGenX: A Dual-Engine, Data-Driven Platform for HIV-1 Drug Resistance Interpretation Integrating Ensemble Machine Learning and Rule-Based Algorithms

Abstract

Background: HIV-1 drug resistance (HIVDR) interpretation relies on expert rule-based algorithms that translate mutation–drug relationships into clinical categories. These algorithms do not directly model phenotypic susceptibility and may have limited sensitivity to complex mutational patterns. We developed PhenoGenX (PGX), a dual-engine platform that combines a phenotype-trained machine learning (ML) model with an extended rule-based system, integrating data-driven inference with expert knowledge for resistance interpretation in low- and middle-income countries (LMICs).

Methods: From 45,039 HIV-1 clinical isolates, we curated 42,587 genotype–phenotype pairs with phenotypic fold-change (FC) measurements across 22 antiretroviral drugs. PGX integrates two independent engines: an ensemble ML model trained on mutation-level features and a rule-based interpreter derived from curated mutation knowledge bases. Model selection was guided by a Composite Resistance Performance Score (CRPS) incorporating predictive fit, error magnitude, rank correlation, categorical accuracy, and cross-validation stability. Ensemble predictions were calibrated to the PhenoSense assay scale and mapped to clinical resistance categories using safety-oriented cutoffs that prioritize minimizing very major errors. The ML engine was evaluated on an independent phenotypic dataset of 11,769 clinical isolates. The rule-based engine was benchmarked against Stanford HIVDB using 1,945 HIV-1 pol sequences (23,329 drug–sequence pairs) for NRTIs, NNRTIs, and PIs, with an additional 2,539 integrase sequences for INSTI validation.

Findings: Ensemble ML models showed consistent predictive performance across drugs (R² range 0.50–0.95). Calibration improved agreement with measured phenotypes (mean log-scale correlation r=0.78), and optimized cutoffs achieved high diagnostic accuracy with low very major error rates. Most drugs achieved AUC values ≥0.80. The rule-based engine demonstrated high concordance with Stanford HIVDB (overall agreement 85.6%, weighted κ=0.72), with exact agreement exceeding 92% for integrase inhibitors.

Interpretation: By integrating phenotype-calibrated ensemble ML with an extended rule-based interpreter, PhenoGenX provides a standardized framework for HIVDR interpretation that preserves biological plausibility and concordance with expert systems while maintaining a safety-weighted error profile. This approach may support HIV drug resistance surveillance and treatment decision-making where interpretation relies primarily on genotypic data in the next-generation sequencing era.
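The concordance figures reported above (overall agreement and weighted κ) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the choice of linear weights, the five-level ordinal coding, and the toy label lists are all assumptions made for illustration, since the abstract does not state which weighting scheme or category coding was used.

```python
from collections import Counter


def exact_agreement(calls_a, calls_b):
    """Fraction of paired calls that receive the identical category."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(calls_a, calls_b)) / len(calls_a)


def weighted_kappa(calls_a, calls_b, n_levels, power=1):
    """Cohen's weighted kappa for ordinal labels coded 0..n_levels-1.

    power=1 gives linear disagreement weights, power=2 quadratic ones.
    Which scheme the paper used is not stated; linear is an assumption.
    """
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(calls_a)
    observed = Counter(zip(calls_a, calls_b))  # joint counts
    row = Counter(calls_a)                     # marginal counts, system A
    col = Counter(calls_b)                     # marginal counts, system B
    max_dist = (n_levels - 1) ** power

    obs_dis = 0.0  # weighted observed disagreement
    exp_dis = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            w = abs(i - j) ** power / max_dist
            obs_dis += w * observed[(i, j)] / n
            exp_dis += w * (row[i] / n) * (col[j] / n)
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both systems use one category")
    return 1.0 - obs_dis / exp_dis


if __name__ == "__main__":
    # Toy data only: hypothetical calls on a 5-level ordinal scale
    # (0 = susceptible ... 4 = high-level resistance).
    pgx_calls = [0, 0, 1, 2, 4, 4, 3, 0]
    ref_calls = [0, 1, 1, 2, 4, 3, 3, 0]
    print(exact_agreement(pgx_calls, ref_calls))
    print(weighted_kappa(pgx_calls, ref_calls, n_levels=5))
```

Weighting matters here because the categories are ordered: a one-level discrepancy is penalized less than a susceptible-versus-high-level discrepancy, which is why a weighted κ can remain informative when exact agreement is below 100%.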
