Comparative Evaluation of Synthetic and Real-World Data in Predicting Oral Premalignant Lesions: A Machine Learning Approach from Rural India

Abstract

Background: Oral premalignant lesions (OPLs) represent a major public health burden in rural India, where tobacco and areca nut use is widespread and screening infrastructure is limited. Machine learning (ML) models hold potential for early OPL detection, but real-world clinical data from these settings are severely class imbalanced. The Synthetic Minority Oversampling Technique (SMOTE), Generative Adversarial Networks (GANs), and Conditional Variational Autoencoders (CVAEs) have been proposed as synthetic data augmentation strategies, but their comparative clinical utility remains unvalidated in this context.

Methods: A prospective cross-sectional cohort of 3,700 participants was recruited from a rural population area in Karnataka. The data comprised 24 clinical, sociodemographic, and habit-related features. Synthetic data were generated using SMOTE, GAN-style, and CVAE-style augmentation, each applied only within the training folds. Six ML classifiers were evaluated: Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine, Multilayer Perceptron, and Decision Tree. Five-fold stratified cross-validation was performed. Performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC), sensitivity, specificity, F1-score, Matthews Correlation Coefficient (MCC), and Expected Calibration Error (ECE). Random Forest feature importances were computed for model interpretability.

Results: OPL-positive cases comprised 568 of 3,700 participants (15.35%), yielding a class imbalance ratio of 1:5.51. Real-only Gradient Boosting achieved the highest AUC-ROC (0.691) but very low sensitivity (7.2%) owing to class imbalance. SMOTE augmentation produced the best clinically balanced performance: Gradient Boosting with SMOTE attained an AUC-ROC of 0.669, sensitivity of 22.7%, specificity of 94.3%, and the highest overall MCC (0.222). GAN-hybrid augmentation demonstrated architecture-dependent effects, improving performance for the Multilayer Perceptron (AUC-ROC: 0.514 to 0.628) but inducing threshold collapse in tree-based classifiers. Models trained exclusively on synthetic data showed systematic probability miscalibration (mean ECE 0.326 vs. 0.088 for hybrid models). Feature importance analysis identified difficulty in opening the mouth (trismus), recurrent gingival bleeding, family history of oral cancer, presence of visible lesions, and red or white mucosal patches as strong OPL predictors.

Conclusions: SMOTE-augmented hybrid ML training provided the most clinically practical and reproducible improvement for OPL detection in this severely imbalanced rural Indian dataset. Models trained exclusively on synthetic data demonstrated systematic calibration failure, disqualifying them from standalone clinical deployment. Feature importance rankings were stable across training paradigms (Spearman ρ = 0.92), indicating that synthetic augmentation preserved the real epidemiological signal. These findings provide actionable evidence for the implementation of ML-assisted oral cancer screening programs in resource-constrained settings across South Asia.
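For readers less familiar with the metrics reported in the abstract, the following is a minimal sketch of how the class imbalance ratio, MCC, and ECE can be computed. It is not the authors' implementation: the function names are hypothetical, and the ECE uses ten equal-width probability bins, which is an assumption because the abstract does not state the binning scheme.

```python
from math import sqrt


def imbalance_ratio(n_positive, n_total):
    """Return (prevalence, number of negatives per positive case)."""
    n_negative = n_total - n_positive
    return n_positive / n_total, n_negative / n_positive


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for a binary classifier using equal-width probability bins.

    Assumption: each bin compares the mean predicted probability of the
    positive class with the observed positive rate, weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


if __name__ == "__main__":
    # Counts taken from the abstract: 568 OPL-positive of 3,700 participants.
    prevalence, ratio = imbalance_ratio(568, 3700)
    print(f"prevalence = {prevalence:.2%}, imbalance = 1:{ratio:.2f}")
```

With the counts from the abstract, the first helper reproduces the reported 15.35% prevalence and the 1:5.51 imbalance ratio. An ECE near 0 means predicted probabilities track observed event rates, so the gap between 0.326 and 0.088 reflects a substantial difference in how far predicted risks deviate from observed frequencies.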
