Development and Validation of a Machine Learning Model for Hepatitis C Virus Exposure: A Demographic Screening Approach for the US Population

Abstract

Background: Hepatitis C virus (HCV) remains underdiagnosed in the United States despite recommendations for universal screening. A simple approach based on readily available demographic information may help target testing in settings where implementation of universal screening remains incomplete.

Methods: We analyzed 10 NHANES cycles (1999–2014 and 2017–2023) and defined HCV exposure as a positive HCV antibody or RNA result. Using sex, birth year, race/ethnicity, birthplace, and income-to-poverty ratio as predictors, we trained logistic regression (LR) and machine learning models in a training cohort (48,434 participants) and compared them in a validation cohort (20,762 participants). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the area under the receiver operating characteristic curve (AUROC). A web-based calculator was developed to facilitate bedside HCV screening.

Results: A total of 69,196 participants were included, of whom 967 showed evidence of HCV exposure. Weighted HCV prevalence remained relatively stable across cycles, ranging from 1.22% to 1.93%, and did not change significantly after the pandemic. Earlier birth year, male sex, non-Hispanic Black race, US birth, and lower income-to-poverty ratio were independently associated with HCV exposure. XGBoost outperformed LR in the validation cohort (AUROC 0.860 vs 0.762, p < 0.001). Predicted risk separated the population clearly: observed HCV prevalence increased from 0.05% in the lowest-risk decile to 7.95% in the highest, with the top decile containing 58.3% of participants with HCV exposure and the top three deciles containing 85.5%.

Conclusions: Five demographic variables were sufficient to build a useful HCV risk model in a nationally representative US sample. Most HCV-exposed individuals were concentrated in the highest predicted-risk groups, suggesting that this approach could help prioritize testing where uptake of universal screening remains incomplete. Because no laboratory data are required, the model may also be practical in data-limited settings and adaptable to other health systems.
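
To make the reported metrics concrete, the following is a minimal Python sketch of how sensitivity, specificity, PPV, and NPV follow from confusion-matrix counts, and of how the share of exposed participants captured by the highest-risk groups could be tallied. It is an illustration under stated assumptions, not the authors' code: the function names `screening_metrics` and `cumulative_capture` are hypothetical, the inputs are toy values, and the calculation is unweighted, whereas an NHANES analysis would normally apply survey weights.

```python
from typing import Dict, List, Sequence


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Threshold-based metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one case in each margin).
    """
    return {
        "sensitivity": tp / (tp + fn),  # exposed participants flagged by the model
        "specificity": tn / (tn + fp),  # unexposed participants not flagged
        "ppv": tp / (tp + fp),          # flagged participants who are exposed
        "npv": tn / (tn + fn),          # unflagged participants who are unexposed
    }


def cumulative_capture(
    risk: Sequence[float], exposed: Sequence[int], n_groups: int = 10
) -> List[float]:
    """Share of all exposed participants found in the top 1..n_groups risk groups.

    Participants are ranked from highest to lowest predicted risk and split
    into n_groups bins of (nearly) equal size; n_groups=10 gives deciles.
    Unweighted, illustrative only; assumes at least one exposed participant.
    """
    order = sorted(range(len(risk)), key=lambda i: risk[i], reverse=True)
    total_exposed = sum(exposed)
    captured: List[float] = []
    running = 0
    for g in range(n_groups):
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        running += sum(exposed[i] for i in order[start:stop])
        captured.append(running / total_exposed)
    return captured


# Toy inputs (not study data): 10 participants, 3 of them exposed.
toy_risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_exposed = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_metrics = screening_metrics(tp=8, fp=2, fn=2, tn=88)
toy_capture = cumulative_capture(toy_risk, toy_exposed, n_groups=5)
```

With deciles (`n_groups=10`), the first and third entries of the returned list correspond to the kind of "top decile" and "top three deciles" capture figures quoted in the Results.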
