Machine Learning Approaches to Identify Communities with High HIV Prevalence in Resource-Limited Settings using Social, Economic and Behavioral Data
Abstract
Background
Identifying communities with high HIV prevalence is crucial for public health officials, researchers, and policymakers to effectively monitor the epidemic and evaluate interventions. Population-based HIV biomarker surveys face logistical challenges, including cost, the need for personnel trained in specimen collection, specimen transport and processing, and participant reluctance to test because of stigma, a history of recent testing, or the perception of being at low risk for HIV infection. This study explores the potential of identifying communities with high HIV prevalence using socio-economic, behavioral, and other community-level data in the absence of direct HIV biomarkers.
Method
Using Partial Least Squares (PLS) and Random Forests (RF), we developed machine learning models to predict HIV prevalence from socio-economic and behavioral variables collected in Population-based HIV Impact Assessment (PHIA) surveys. Community HIV prevalence, derived from the PHIA biomarker dataset, served as the dependent variable. Models were first trained to classify communities into <10% or ≥10% HIV prevalence categories. This procedure was then repeated for prevalence thresholds of 5%, 7%, 15%, and 20%.
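The labeling step described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the toy records, and the use of unweighted proportions are all assumptions made here (the abstract does not say how survey weights were handled), and only the thresholds come from the text.

```python
from collections import defaultdict

# Prevalence cut-offs named in the Method section, as proportions.
THRESHOLDS = (0.05, 0.07, 0.10, 0.15, 0.20)


def community_prevalence(records):
    """Return {community_id: share of tested participants who are HIV positive}.

    `records` is an iterable of (community_id, hiv_positive) pairs.
    Assumption: simple unweighted proportions, no survey weights.
    """
    tested = defaultdict(int)
    positive = defaultdict(int)
    for community_id, hiv_positive in records:
        tested[community_id] += 1
        positive[community_id] += int(hiv_positive)
    return {c: positive[c] / tested[c] for c in tested}


def label_communities(prevalence, threshold):
    """Label a community 1 if its prevalence is >= threshold, else 0."""
    return {c: int(p >= threshold) for c, p in prevalence.items()}


# Hypothetical toy data: three communities with 20 tested participants each.
records = (
    [("A", True)] * 3 + [("A", False)] * 17    # 3/20 positive
    + [("B", True)] * 1 + [("B", False)] * 19  # 1/20 positive
    + [("C", True)] * 2 + [("C", False)] * 18  # 2/20 positive
)

prevalence = community_prevalence(records)
labels_by_threshold = {t: label_communities(prevalence, t) for t in THRESHOLDS}

for t in THRESHOLDS:
    print(f"threshold {t:.2f}: {labels_by_threshold[t]}")
```

Each entry of `labels_by_threshold` would then serve as the binary target for one classifier, so the same predictor matrix is reused across thresholds and only the labels change.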
Results
PLS and RF achieved 79% and 80.5% accuracy, respectively, in classifying communities as having higher or lower HIV prevalence at a 10% threshold. At the 5%, 7%, and 15% thresholds, the models achieved similar accuracies, demonstrating consistent performance across thresholds, with RF slightly outperforming PLS. In both models, the variables that contributed most to classification were first sexual intercourse before age 15, being uncircumcised, a history of not using condoms, being in the lowest wealth quintile, experiencing physical or sexual violence, and having extramarital partners.
Conclusions
The study demonstrates that socio-economic and behavioral variables can effectively predict community-level HIV prevalence using machine learning models. These insights have the potential to guide the distribution of HIV resources, particularly where direct community testing is infeasible, and to enhance understanding of the HIV epidemic.