Detecting financial misstatements in emerging markets: a machine learning approach
Abstract
This study develops a machine learning–based framework for detecting material misstatements in the financial statements of Vietnamese listed companies. Using 10,286 firm-year observations from 2016–2023, the research applies two ensemble algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to a binary classification task based on audit-adjusted profit discrepancies. To address class imbalance and improve prediction reliability, the Synthetic Minority Over-sampling Technique (SMOTE) is applied within a stratified cross-validation procedure, and Bayesian optimization is used to tune hyperparameters for better generalization. Both RF and XGBoost achieve high predictive accuracy (approximately 0.839) and strong discriminative power (AUC-ROC of approximately 0.91), outperforming logistic regression. The Least Absolute Shrinkage and Selection Operator (LASSO) selects key financial and non-financial predictors from more than 50 variables, which improves model interpretability. RF’s feature importance analysis further highlights the influence of listing exchange characteristics, prior misstatement history, and forward-looking performance indicators. The proposed framework offers auditors and regulators a scalable, data-driven tool for risk-based audit planning and regulatory oversight, which is particularly valuable in emerging markets with limited confirmed fraud data.
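The abstract states that SMOTE is applied within a stratified cross-validation procedure but gives no implementation details. The sketch below illustrates the general idea under stated assumptions: oversampling is performed on each training fold only, so that no synthetic observation leaks into the held-out fold used for evaluation. It is written in plain Python for illustration and is not the authors' code. The function names (`stratified_folds`, `smote`, `resampled_training_sets`), the toy data, and the coding of misstatement as label 1 are all hypothetical. In practice one would more likely rely on library implementations, for example scikit-learn's `StratifiedKFold` together with imbalanced-learn's `SMOTE`.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def smote(minority, n_new, k_neighbors=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples are available."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k_neighbors]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def resampled_training_sets(X, y, k=5, seed=0):
    """Yield (X_train, y_train, test_idx) per fold, with SMOTE applied to
    the training part only. Assumes label 1 marks the minority class."""
    folds = stratified_folds(y, k, seed)
    for f, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(X_tr, y_tr) if lab == 1]
        # Number of synthetic points needed to balance the two classes.
        n_new = max(len(y_tr) - 2 * len(minority), 0)
        new_pts = smote(minority, n_new, seed=seed + f)
        yield X_tr + new_pts, y_tr + [1] * len(new_pts), test_idx


# Toy data (hypothetical): 40 majority and 10 minority observations.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

for X_tr, y_tr, test_idx in resampled_training_sets(X, y, k=5):
    # A classifier (e.g. RF or XGBoost) would be fitted on X_tr, y_tr
    # and scored on the untouched observations in test_idx.
    print(len(y_tr), sum(y_tr), len(test_idx))
```

The held-out fold keeps its original class ratio, so metrics such as accuracy and AUC-ROC computed on it reflect the imbalance a model would face on unseen data. Resampling the full dataset before splitting would instead place interpolated copies of test observations in the training set and tend to inflate those metrics.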