Machine-Learning Screening of Early Cervical Lesions Using HPV Genotyping and Exploratory Fusion-Gene Analysis

Abstract

Background: Cervical cancer is the fourth most common malignancy in women worldwide, primarily driven by persistent high-risk human papillomavirus (HPV) infection. However, conventional screening methods such as cytology and HPV DNA testing remain limited in accuracy and scalability, particularly for early or precancerous lesions.

Methods: We analyzed HPV infection patterns, genotype distribution, and lesion grades in 5,452 women from Shenzhen, China. Among them, 76 HPV16- or HPV52-positive cases underwent exploratory PCR-based fusion-gene detection. Six feature-selection strategies and thirteen machine-learning classifiers were trained using stratified five-fold cross-validation, with SMOTE for class balancing and grid search for hyperparameter tuning. Model interpretability was evaluated using SHapley Additive exPlanations (SHAP).

Results: The overall HPV infection rate was 30.3%. HPV52 was the most prevalent genotype (6.1%) in the general population, whereas HPV16 predominated in high-grade lesions and cancer. The number of fusion loci increased with lesion severity, but fusion data alone showed limited predictive value (ROC AUC < 0.60). Integrating HPV genotyping with epidemiological features markedly improved performance: Random Forest achieved ROC AUC and PR AUC of 0.95 in cross-validation and 0.86 in the independent test set. SHAP analysis identified infection burden and high-risk HPV status as dominant predictors, jointly explaining over half of the model variance.

Conclusions: This study establishes a region-specific epidemiological profile of HPV and introduces an explainable, low-cost machine-learning framework based solely on HPV genotyping. The model demonstrates high accuracy and clinical scalability, providing a practical approach for early screening of cervical lesions.

Trial registration: ChiCTR, ChiCTR2400089277. Registered 5 September 2024, https://www.chictr.org.cn/showproj.html?proj=240825
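
The resampling scheme named in the Methods (stratified five-fold cross-validation with SMOTE for class balancing) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the helper names `stratified_folds` and `smote_like`, the toy features and labels, and the neighbour count are all assumptions made for illustration, and `smote_like` is a simplified stand-in for SMOTE rather than a library implementation. The sketch oversamples only the training folds, which is the usual way to keep synthetic samples out of the held-out fold; the abstract does not state whether the study did this.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal round-robin within each class
    return folds


def smote_like(X, y, minority, n_neighbors=3, seed=0):
    """Balance a two-class set by interpolating between minority neighbours.

    Simplified SMOTE-style oversampling (hypothetical helper): each synthetic
    point lies on the segment between a minority sample and one of its
    nearest minority neighbours. Needs at least two minority samples.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_pts)
        neighbours = sorted(
            (p for p in minority_pts if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:n_neighbors]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        X_new.append([a + t * (b - a) for a, b in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new


# Toy data (assumed): 15 negatives, 5 positives, two made-up features.
labels = [0] * 15 + [1] * 5
features = [[float(i), float(i % 3)] for i in range(20)]

folds = stratified_folds(labels, k=5)
balanced_counts = []
for held_out in folds:
    held = set(held_out)
    train_idx = [i for i in range(len(labels)) if i not in held]
    X_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    # Oversample the training portion only; the held-out fold stays untouched.
    X_bal, y_bal = smote_like(X_train, y_train, minority=1)
    balanced_counts.append((y_bal.count(0), y_bal.count(1)))
```

In a full pipeline, a classifier such as Random Forest would be fitted on `X_bal` and `y_bal` inside the loop and scored on the untouched held-out fold, with hyperparameter search wrapped around the same folds.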
