Multi-Domain Validation of Bayesian Optimized Stacking Ensembles for Next-Generation Credit Risk Modeling with Granular Explainability and Robust Statistical Inference

Abstract

Credit scoring helps financial organizations provide credit services, and advances in computing have made Machine Learning (ML) techniques increasingly useful for it. Although complex models often predict better, they tend to lack interpretability, which is a concern in credit scoring, where fairness in decision making is emphasized. This study addresses credit scoring, a vital aspect of financial risk management, by applying advanced machine learning techniques to three distinct datasets: Credit Risk Dataset (CRD), Econometric Analysis (EA), and Default of Credit Card Clients (DCC). A broad spectrum of individual classifiers, including Decision Trees, Logistic Regression, Random Forests, XGBoost, LightGBM, and CatBoost, is systematically trained and evaluated using accuracy, F1-score, sensitivity, specificity, Matthews Correlation Coefficient (MCC), Cohen's Kappa, and ROC AUC. A key contribution is the Bayesian Optimized Stacking Ensemble (BO-StaEnsemble), which uses Optuna to tune the hyperparameters of its base learners and meta-learner. Beyond predictive performance, we integrate statistical validation (paired t-tests, McNemar's tests) and Explainable AI methods (LIME, SHAP, Morris Sensitivity). Furthermore, t-SNE is used to visualize model probability spaces. The BO-StaEnsemble consistently outperforms the individual models on all three datasets, with AUC of 0.998 for CRD, 0.999 for EA, and 0.974 for DCC. It also shows high agreement and classification reliability, achieving MCC and Cohen's Kappa values of 0.9159 and 0.9147 on the CRD dataset and 0.9903 and 0.9902 on the EA dataset, respectively. These results demonstrate the power of ensemble learning, advanced optimization, and comprehensive interpretability for robust credit risk modeling.
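
The agreement metrics and the McNemar test named in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal standard-library version, assuming a binary confusion matrix given as counts. The function names (`mcc`, `cohen_kappa`, `mcnemar_exact`) and the example counts are hypothetical and chosen only for illustration. The abstract does not say which variant of McNemar's test was used; the exact binomial form is assumed here.

```python
from math import comb, sqrt


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column marginals of each class.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for two classifiers on the same test set.

    b: samples only the first model classified correctly
    c: samples only the second model classified correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the study.
example_mcc = mcc(tp=90, fp=10, fn=5, tn=95)
example_kappa = cohen_kappa(tp=90, fp=10, fn=5, tn=95)
example_p = mcnemar_exact(b=0, c=10)
```

With these hypothetical counts, MCC and Kappa both come out near 0.85. The two metrics often sit close together on reasonably balanced data, which is consistent with the near-identical pairs reported for CRD and EA.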
