Comparative Evaluation of Synthetic and Real-World Data in Predicting Oral Premalignant Lesions: A Machine Learning Approach from Rural India

P M.Sc. Sundar M MD
Siva M

Abstract

Background: Oral premalignant lesions (OPLs) represent a major public health burden in rural India, where tobacco and areca nut use is widespread and screening infrastructure is limited. Machine learning (ML) models hold potential for early OPL detection, but real-world clinical data from these settings are severely class-imbalanced. Synthetic data generation with the Synthetic Minority Oversampling Technique (SMOTE), Generative Adversarial Networks (GANs), and Conditional Variational Autoencoders (CVAEs) has been proposed as an augmentation strategy, but the comparative clinical utility of these methods remains unvalidated in this context.

Methods: A prospective cross-sectional cohort of 3,700 participants was recruited from a rural population in Karnataka. The data comprised 24 clinical, sociodemographic, and habit-related features. Synthetic data were generated using SMOTE, GAN-style, and CVAE-style augmentation, each applied only within the training folds. Six ML classifiers were evaluated: Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine, Multilayer Perceptron, and Decision Tree. Five-fold stratified cross-validation was performed. Performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC), sensitivity, specificity, F1-score, Matthews Correlation Coefficient (MCC), and Expected Calibration Error (ECE). Random Forest feature importances were computed for model interpretability.

Results: OPL-positive cases comprised 568 of 3,700 participants (15.35%), yielding a class imbalance ratio of 1:5.51. Real-only Gradient Boosting achieved the highest AUC-ROC (0.691) but very low sensitivity (7.2%) owing to class imbalance. SMOTE augmentation produced the best clinically balanced performance: Gradient Boosting with SMOTE attained an AUC-ROC of 0.669, sensitivity of 22.7%, specificity of 94.3%, and the highest overall MCC (0.222). GAN-hybrid augmentation demonstrated architecture-dependent effects, improving performance for the Multilayer Perceptron (AUC: 0.514 to 0.628) but inducing threshold collapse in tree-based classifiers. Models trained exclusively on synthetic data showed systematic probability miscalibration (mean ECE 0.326 vs. 0.088 for hybrid models). Feature importance analysis identified difficulty in opening the mouth (trismus), recurrent gingival bleeding, family history of oral cancer, presence of visible lesions, and red or white mucosal patches as strong OPL predictors.

Conclusions: SMOTE-augmented hybrid ML training provided the most clinically practical and reproducible improvement for OPL detection in this severely imbalanced rural Indian dataset. Models trained exclusively on synthetic data demonstrated systematic calibration failure, disqualifying them from standalone clinical deployment. Feature importance rankings were stable across training paradigms (Spearman ρ = 0.92), confirming that synthetic augmentation preserved the real epidemiological signal. These findings provide actionable evidence for the implementation of ML-assisted oral cancer screening programs in resource-constrained settings across South Asia.
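As a rough illustration of three quantities reported in the abstract, the sketch below recomputes the class imbalance ratio from the stated counts (568 of 3,700) and shows one common way to compute MCC and ECE. The abstract does not describe the authors' implementation, so the function names, the equal-width 10-bin scheme, and the positive-class formulation of ECE are all assumptions made here for illustration.

```python
import math


def imbalance_ratio(n_pos, n_total):
    """Return (prevalence, negatives per positive case)."""
    n_neg = n_total - n_pos
    return n_pos / n_total, n_neg / n_pos


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE over equal-width probability bins (assumed scheme, not the paper's).

    Compares the mean predicted positive-class probability in each bin
    with the observed positive rate, weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(y for y, _ in b) / len(b)
        predicted = sum(p for _, p in b) / len(b)
        ece += (len(b) / n) * abs(observed - predicted)
    return ece


# Counts taken from the abstract: 568 OPL-positive of 3,700 participants.
prevalence, neg_per_pos = imbalance_ratio(568, 3700)
print(f"prevalence = {prevalence:.2%}, ratio = 1:{neg_per_pos:.2f}")
```

Under this formulation, a model that outputs well-separated but overconfident probabilities can still rank cases well (good AUC-ROC) while showing a high ECE, which is consistent with the abstract reporting discrimination and calibration as separate metrics.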

Version published to 10.21203/rs.3.rs-9055319/v1 on Research Square
Apr 9, 2026

