Developing and Using Data Mining to Detect Healthcare Fraud and Abuse for Health Insurance Companies
Abstract
Healthcare fraud and abuse remain a major global challenge, costing billions annually and threatening the sustainability of insurance systems. Traditional manual and rule-based approaches are increasingly ineffective given the scale, complexity, and adaptability of fraudulent schemes. This study develops and validates a data-driven framework leveraging machine learning (ML) to detect fraudulent health insurance claims. Using a large, publicly available dataset, we applied rigorous preprocessing, including feature engineering to create domain-specific predictors and SMOTE resampling applied strictly to the training set to prevent data leakage. Five supervised algorithms (Logistic Regression, Decision Tree, SVM, Random Forest, and XGBoost) were compared against a stacking ensemble that combined Random Forest, XGBoost, and Logistic Regression. Performance was evaluated through stratified 10-fold cross-validation using Accuracy, F1-score, and AUC-ROC. Results show that the ensemble model achieved the best and most balanced performance (F1 = 0.81, AUC = 0.95), significantly outperforming individual classifiers. Feature importance analysis further revealed that the model identified clinically meaningful fraud indicators, such as diagnostic diversity and provider claim frequency. These findings highlight the potential of scalable, open-source ML frameworks to strengthen fraud detection, reduce financial risks, and complement commercial detection tools with cost-effective, interpretable solutions.
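The leakage safeguard mentioned in the abstract (oversampling only the training portion of each stratified fold) can be illustrated with a minimal sketch. This is not the study's code. It uses only the Python standard library, and `smote_like` is a simplified SMOTE-style interpolation written for illustration rather than the imbalanced-learn implementation. The function names, the toy data, and the convention that label 1 marks the minority (fraud) class are all assumptions.

```python
import math
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def smote_like(minority, n_new, k_neighbors=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imbalanced-learn).

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2 or n_new <= 0:
        return []
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k_neighbors]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


def resampled_training_sets(X, y, k=10, seed=0):
    """Yield (X_train, y_train, test_idx) for each fold.

    Only the training portion is oversampled. The held-out indices refer
    to original, untouched samples, so no synthetic point reaches the test fold.
    Assumption: label 1 is the minority (fraud) class.
    """
    folds = stratified_folds(y, k, seed)
    for fold_no, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(X_tr, y_tr) if lab == 1]
        n_majority = len(y_tr) - len(minority)
        new_points = smote_like(minority, n_majority - len(minority),
                                seed=seed + fold_no)
        yield X_tr + new_points, y_tr + [1] * len(new_points), test_idx


if __name__ == "__main__":
    # Hypothetical toy data: 80 legitimate claims (0) and 20 fraudulent (1).
    rng = random.Random(42)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(80)]
    X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(20)]
    y = [0] * 80 + [1] * 20
    for X_tr, y_tr, test_idx in resampled_training_sets(X, y, k=10):
        print(len(y_tr), sum(y_tr), len(test_idx))
```

In a full pipeline, a classifier would be fitted on each resampled training set and scored on the corresponding original test indices. Resampling before splitting would instead let synthetic points derived from test samples influence training, which tends to inflate F1 and AUC estimates.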