Machine Learning Approaches to Identify Communities with High HIV Prevalence in Resource-Limited Settings using Social, Economic and Behavioral Data

Abstract

Background

Identifying communities with high HIV prevalence is crucial for public health officials, researchers, and policymakers to effectively monitor the epidemic and evaluate interventions. Population-based HIV biomarker surveys face logistical challenges such as cost, need for personnel trained in specimen collection, specimen transport and processing, and participant reluctance to test due to factors such as stigma, history of recent testing, and the perception of being at low risk for HIV infection. This study explores the potential of identifying communities with high HIV prevalence using socio-economic, behavioral, and other community-level data in the absence of direct HIV biomarkers.

Method

Using Partial Least Squares (PLS) and Random Forests (RF), we developed machine learning models to predict HIV prevalence from socio-economic and behavioral variables in Population-based HIV Impact Assessment (PHIA) surveys. Community HIV prevalence, derived from the PHIA biomarker dataset, served as the dependent variable. Models were first trained to classify communities into <10% or ≥10% HIV prevalence categories. This procedure was then repeated for prevalence thresholds of 5%, 7%, 15%, and 20%.
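To make the construction of the dependent variable concrete, the following is a minimal sketch of how a binary community label could be derived from individual biomarker results at each threshold. It assumes an unweighted prevalence (fraction of tested individuals who are positive); the records, community names, and helper functions are hypothetical and are not taken from the PHIA data or from the authors' code.

```python
from collections import defaultdict

# Hypothetical individual-level records: (community_id, hiv_positive).
# These values are invented for illustration; they are not PHIA data.
records = (
    [("A", i < 1) for i in range(10)]    # 1 of 10 positive
    + [("B", False) for _ in range(5)]   # 0 of 5 positive
    + [("C", i < 2) for i in range(5)]   # 2 of 5 positive
)

def community_prevalence(rows):
    """Return the unweighted fraction of positive results per community."""
    tested = defaultdict(int)
    positive = defaultdict(int)
    for community, is_positive in rows:
        tested[community] += 1
        positive[community] += int(is_positive)
    return {c: positive[c] / tested[c] for c in tested}

def label_by_threshold(prevalence, threshold):
    """Label a community 1 if prevalence >= threshold, otherwise 0."""
    return {c: int(p >= threshold) for c, p in prevalence.items()}

# Thresholds named in the abstract, expressed as fractions.
THRESHOLDS = [0.05, 0.07, 0.10, 0.15, 0.20]

prevalence = community_prevalence(records)
labels = {t: label_by_threshold(prevalence, t) for t in THRESHOLDS}
```

In this sketch the same community can change class as the threshold moves (community "A" is labeled high at 10% but low at 15%), which is why a separate model is trained per threshold.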

Results

At the 10% threshold, PLS and RF achieved 79% and 80.5% accuracy, respectively, in classifying communities as having higher or lower HIV prevalence. At the 5%, 7%, and 15% thresholds, the models achieved similar accuracies, demonstrating consistent performance across thresholds, with RF slightly outperforming PLS. In both models, the variables that contributed most to classification were first sexual intercourse before age 15, being uncircumcised, a history of not using condoms, being in the lowest wealth quintile, experiencing physical or sexual violence, and having extramarital partners.
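For reference, classification accuracy as reported above is the share of communities whose predicted class matches the class derived from measured prevalence. A minimal sketch follows; the label vectors are hypothetical and do not reproduce the study's results.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of communities whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels for ten communities (1 = at or above the threshold).
true_labels = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

example_accuracy = accuracy(true_labels, predicted_labels)
```

Here the prediction differs from the true class for two of the ten communities, so eight are classified correctly.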

Conclusions

The study demonstrates that socioeconomic and behavioral variables can effectively predict community-level HIV prevalence using machine learning models. These insights have the potential to guide the distribution of HIV resources, particularly where direct community testing is infeasible, and to enhance understanding of the HIV epidemic.
