Machine Learning Approaches to Identify Communities with High HIV Prevalence in Resource-Limited Settings using Social, Economic and Behavioral Data

Masabho P. Milali
Frey B. Assefa
Sulani Nyimbili
Duncan K. Gathungu
Samuel Mwalili
Suilanji Sivile
Lloyd Mulenga
R. Scott Braithwaithe
Diego F Cuadros
Anna Bershteyn

Abstract

Background

Identifying communities with high HIV prevalence is crucial for public health officials, researchers, and policymakers to effectively monitor the epidemic and evaluate interventions. Population-based HIV biomarker surveys face logistical challenges such as cost, need for personnel trained in specimen collection, specimen transport and processing, and participant reluctance to test due to factors such as stigma, history of recent testing, and the perception of being at low risk for HIV infection. This study explores the potential of identifying communities with high HIV prevalence using socio-economic, behavioral, and other community-level data in the absence of direct HIV biomarkers.

Method

Using Partial Least Squares (PLS) and Random Forests (RF), we developed machine learning models to predict HIV prevalence from socio-economic and behavioral variables in Population-based HIV Impact Assessment (PHIA) surveys. Community HIV prevalence, derived from the PHIA biomarker dataset, served as the dependent variable. Models were first trained to classify communities into <10% or ≥10% HIV prevalence categories. The procedure was then repeated for prevalence thresholds of 5%, 7%, 15%, and 20%.
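The labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the function names `community_prevalence` and `label_communities`, and the toy data are all assumptions, and the sketch uses unweighted proportions, whereas an analysis of PHIA data may apply survey weights.

```python
from collections import defaultdict

# Prevalence cut-offs named in the abstract, expressed as proportions.
THRESHOLDS = (0.05, 0.07, 0.10, 0.15, 0.20)


def community_prevalence(records):
    """Return the unweighted HIV prevalence per community.

    `records` is an iterable of (community_id, hiv_positive) pairs,
    a hypothetical layout chosen for this example.
    """
    tested = defaultdict(int)
    positive = defaultdict(int)
    for community_id, hiv_positive in records:
        tested[community_id] += 1
        positive[community_id] += int(hiv_positive)
    return {c: positive[c] / tested[c] for c in tested}


def label_communities(prevalence, threshold):
    """Label each community 1 if prevalence >= threshold, else 0."""
    return {c: int(p >= threshold) for c, p in prevalence.items()}


# Toy data (invented): A = 1/4, B = 1/10, C = 0/5, D = 3/50 positive.
records = (
    [("A", True)] + [("A", False)] * 3
    + [("B", True)] + [("B", False)] * 9
    + [("C", False)] * 5
    + [("D", True)] * 3 + [("D", False)] * 47
)

prevalence = community_prevalence(records)

# One binary target per threshold; a classifier would be fit to each.
labels_by_threshold = {t: label_communities(prevalence, t) for t in THRESHOLDS}
```

Each entry of `labels_by_threshold` is a separate binary target, so the same community can fall in the higher-prevalence class at one threshold and the lower-prevalence class at another.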

Results

PLS and RF achieved 79% and 80.5% accuracy, respectively, in classifying communities as having higher or lower HIV prevalence at the 10% threshold. At the 5%, 7%, and 15% thresholds, the models achieved similar accuracies, indicating consistent performance across thresholds, with RF slightly outperforming PLS. In both models, the variables that contributed most to classification were first sexual intercourse before age 15, being uncircumcised, a history of not using condoms, being in the lowest wealth quintile, experiencing physical or sexual violence, and having extramarital partners.
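For readers unfamiliar with the accuracy metric reported above, the sketch below shows how it is computed for a binary classification and compares it with a majority-class baseline. The labels and predictions are invented for illustration, and the baseline comparison is an addition for context; the abstract does not say whether the authors reported one.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of communities whose predicted class matches the true class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Invented example: 1 = prevalence at or above the threshold, 0 = below.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = majority_baseline(y_true)
```

When one class is much more common than the other, as can happen at very low or very high thresholds, the baseline alone can be high, so accuracy is easier to interpret next to it.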

Conclusions

The study demonstrates that socioeconomic and behavioral variables can predict community-level HIV prevalence using machine learning models. These insights have the potential to guide the distribution of HIV resources, particularly where direct community testing is infeasible, and to enhance understanding of the HIV epidemic.

Version published to 10.1101/2025.11.10.25339949 on medRxiv, Nov 14, 2025

Related articles

Machine Learning-Based Classification of HIV Viral Load Suppression in Low-Resource Settings

This article has 4 authors:
1. Abraham Keffale Mengistu
2. Aynadis Worku Shime
3. Muluken Belachew Mengistie
4. Andualem Enyew Gedefaw
This article has no evaluations. Latest version: Jan 6, 2026

Development and Deployment of a Machine Learning–Based Predictive Model for COVID-19 Infection Using Patient Demographic and Symptom Data in Nigeria

This article has 10 authors:
1. Olanrewaju Eniade
2. Ezekiel Ukwenga
3. Uchenna Akuka
4. Opeyemi Adeniyi
5. Elonna Obak
6. Omolola Adeagbo
7. Peter Babatunde Olaitan
8. Rita Ayanbolade Olowe
9. Tolulope Opakunle
10. Olugbenga Adekunle Olowe
This article has no evaluations. Latest version: Jan 25, 2026

Pilot Study of Hypertension Screening and Machine-Learning Prediction Using Community Outreach Data from Nkpokiti, Enugu, Nigeria Short Title: Machine Learning Prediction of Hypertension using Community Blood Pressure Data in Nigeria

This article has 3 authors:
1. Godswill Uzoechina
2. Winnifred Njideka Adiri
3. Osajiuba Treasure
This article has no evaluations. Latest version: Dec 18, 2025