Detecting financial misstatements in emerging markets: a machine learning approach

Abstract

This study develops a machine learning–based framework for detecting material misstatements in the financial statements of Vietnamese listed companies. Using 10,286 firm-year observations from 2016–2023, it applies two ensemble algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to a binary classification task based on audit-adjusted profit discrepancies. To address class imbalance and improve prediction reliability, the Synthetic Minority Over-sampling Technique (SMOTE) is applied within a stratified cross-validation procedure, and Bayesian optimization is used to tune hyperparameters for better generalization. Both RF and XGBoost achieve high predictive accuracy (approximately 0.839) and strong discriminative power (AUC-ROC of approximately 0.91), outperforming logistic regression. To improve model interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) is used to select key financial and non-financial predictors from more than 50 candidate variables. RF feature importance analysis further highlights the influence of listing exchange characteristics, prior misstatement history, and forward-looking performance indicators. The proposed framework offers auditors and regulators a scalable, data-driven tool for risk-based audit planning and regulatory oversight, which is particularly valuable in emerging markets where confirmed fraud data are limited.
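
The phrase "SMOTE is applied within a stratified cross-validation procedure" can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a toy two-feature dataset, and hypothetical helper names (`stratified_folds`, `smote`, `oversampled_training_sets`). It also assumes that class 1 is the minority class and that oversampling targets a 1:1 class ratio, neither of which is stated in the abstract. The point it shows is that synthetic samples are generated from the training folds only, so each held-out fold contains real observations exclusively.

```python
import random


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds that preserve the class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def smote(minority, n_new, k_neighbors, rng):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def oversampled_training_sets(X, y, k=5, k_neighbors=3, seed=0):
    """Yield (X_train, y_train, test_idx) per fold, oversampling the training
    part only. Assumption: label 1 is the minority class, balanced to 1:1."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, rng)
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(X_tr, y_tr) if lab == 1]
        n_new = (len(y_tr) - len(minority)) - len(minority)
        new_points = smote(minority, n_new, k_neighbors, rng)
        yield X_tr + new_points, y_tr + [1] * len(new_points), test_idx


# Toy data (hypothetical): 80 majority and 20 minority observations.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(80)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(20)]
y = [0] * 80 + [1] * 20

for X_tr, y_tr, test_idx in oversampled_training_sets(X, y):
    print(len(y_tr), sum(y_tr), len(test_idx))
```

A classifier would be fitted on each balanced training set and scored on the corresponding untouched test indices. In practice, a maintained implementation such as the SMOTE class in the imbalanced-learn package would normally replace the hand-written `smote` function shown here.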
