PhenoGenX: A Dual-Engine, Data-Driven Platform for HIV-1 Drug Resistance Interpretation Integrating Ensemble Machine Learning and Rule-Based Algorithms

Yimam Getaneh
Belete Woldesemayat
Kidist Zealiyas
Ghion Mengistu
Minilik Demissie
Zelalem Messele
Gemechu Leta
Yenew Kebede
Getachew Tolera
Lingjie Liao
Yiming Shao

Abstract

Background: HIV-1 drug resistance (HIVDR) interpretation relies on expert rule-based algorithms that translate mutation–drug relationships into clinical categories. These algorithms do not directly model phenotypic susceptibility and may have limited sensitivity to complex mutational patterns. We developed PhenoGenX (PGX), a dual-engine platform that combines a phenotype-trained machine learning (ML) model with an extended rule-based system, integrating data-driven inference with expert knowledge for resistance interpretation in low- and middle-income countries (LMICs).

Methods: From 45,039 HIV-1 clinical isolates, we curated 42,587 genotype–phenotype pairs with phenotypic fold-change (FC) measurements across 22 antiretroviral drugs. PGX integrates two independent engines: an ensemble ML model trained on mutation-level features and a rule-based interpreter derived from curated mutation knowledge bases. Model selection was guided by a Composite Resistance Performance Score (CRPS) incorporating predictive fit, error magnitude, rank correlation, categorical accuracy, and cross-validation stability. Ensemble predictions were calibrated to the PhenoSense assay scale and mapped to clinical resistance categories using safety-oriented cutoffs that prioritize minimizing very major errors. The ML engine was evaluated on an independent phenotypic dataset of 11,769 clinical isolates. The rule-based engine was benchmarked against Stanford HIVDB using 1,945 HIV-1 pol sequences (23,329 drug–sequence pairs) for NRTIs, NNRTIs, and PIs, with an additional 2,539 integrase sequences for INSTI validation.

Findings: Ensemble ML models showed consistent predictive performance across drugs (R² range 0.50–0.95). Calibration improved agreement with measured phenotypes (mean log-scale correlation r=0.78), and optimized cutoffs achieved high diagnostic accuracy with low very major error rates. Most drugs achieved AUC values ≥0.80. The rule-based engine demonstrated high concordance with Stanford HIVDB (overall agreement 85.6%, weighted κ=0.72), with exact agreement exceeding 92% for integrase inhibitors.

Interpretation: By integrating phenotype-calibrated ensemble ML with an extended rule-based interpreter, PhenoGenX provides a standardized framework for HIVDR interpretation that preserves biological plausibility and concordance with expert systems while maintaining a safety-weighted error profile. This approach may support HIV drug resistance surveillance and treatment decision-making in settings where interpretation relies primarily on genotypic data in the next-generation sequencing era.
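The idea of a safety-oriented cutoff can be illustrated with a minimal Python sketch. It is not the authors' implementation. The cutoff values, the toy fold-change data, and the function names are hypothetical. The error definitions are also an assumption: a very major error is taken here as a truly resistant isolate called susceptible, and a major error as a truly susceptible isolate called resistant.

```python
from typing import Sequence


def very_major_error_rate(
    measured_fc: Sequence[float],
    predicted_fc: Sequence[float],
    truth_cutoff: float,
    call_cutoff: float,
) -> float:
    """Fraction of truly resistant isolates that are called susceptible.

    An isolate is truly resistant when its measured fold-change is at or
    above truth_cutoff; it is called susceptible when its predicted
    fold-change is below call_cutoff. (Assumed definitions.)
    """
    calls = [p for m, p in zip(measured_fc, predicted_fc) if m >= truth_cutoff]
    if not calls:
        return 0.0
    return sum(p < call_cutoff for p in calls) / len(calls)


def major_error_rate(
    measured_fc: Sequence[float],
    predicted_fc: Sequence[float],
    truth_cutoff: float,
    call_cutoff: float,
) -> float:
    """Fraction of truly susceptible isolates that are called resistant."""
    calls = [p for m, p in zip(measured_fc, predicted_fc) if m < truth_cutoff]
    if not calls:
        return 0.0
    return sum(p >= call_cutoff for p in calls) / len(calls)


# Toy fold-change values, invented for illustration only.
measured = [1.2, 15.0, 40.0, 2.0, 12.0, 0.8]
predicted = [1.0, 8.0, 35.0, 2.5, 4.0, 1.1]

TRUTH_CUTOFF = 10.0  # hypothetical clinical cutoff on the measured scale

# Lowering the cutoff applied to predictions trades major errors
# for fewer very major errors.
for call_cutoff in (10.0, 5.0, 3.0):
    vme = very_major_error_rate(measured, predicted, TRUTH_CUTOFF, call_cutoff)
    me = major_error_rate(measured, predicted, TRUTH_CUTOFF, call_cutoff)
    print(f"call cutoff {call_cutoff:>4}: VME rate {vme:.2f}, ME rate {me:.2f}")
```

In this toy data, lowering the cutoff applied to predictions removes the very major errors without introducing major errors. On real data the two error types generally trade off, which is why a safety-weighted choice has to be made explicitly.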

Version published to 10.21203/rs.3.rs-9056343/v1 on Research Square
Mar 10, 2026

Related articles

Methods for Continuous-Valued Training Data Generation from Genome-Scale Metabolic Models: Partial-Inhibition FBA with Mixed Essentiality Sampling, Applied to ESKAPE Drug Target Curation

This article has 1 author:
1. Byeongsoo Kang
This article has no evaluations. Latest version: Apr 13, 2026

Pathway-based machine learning for breast cancer risk stratification: an interpretable framework validated in two independent cohorts

This article has 2 authors:
1. Suhaan Thayyil
2. Eshaan Nidee
This article has no evaluations. Latest version: Apr 8, 2026

A Python-Based Interactive Web Tool for Dual-Task Prediction of Treatment Response and Adverse Events in MINIC3 Immunotherapy: A Proof-of-Concept Study

This article has 4 authors:
1. Puyao Sun
2. Yipu Sai
3. Ruihua Zhao
4. Qinghong Hu
This article has no evaluations. Latest version: Apr 10, 2026
