Machine Learning for Sentiment-Based Corporate Disclosure Analytics: A Systematic Review of Data, Sentiment Representations, and Predictive Models

Abstract

Machine learning methods have been widely used to predict stock prices from technical indicators and sentiment features, with the sentiment features mostly extracted from social media and news. Less attention has been given to how sentiment-based textual features obtained from corporate reports are integrated into machine learning pipelines to predict firms' financial outcomes. To address this gap, we conducted a systematic review of 42 studies published between 2014 and 2025. The review examines how datasets are constructed, how sentiment representations are defined, and how predictive models combine textual features with financial variables. Most studies focus on the U.S. stock market and rely on feature-engineered sentiment indices derived from lexicons or sentence-level classification. Regression-based and other supervised learning approaches remain dominant, while embedding-based representations and end-to-end deep learning architectures appear only sporadically. The literature also reveals several constraints, including challenges in processing long financial documents, limited availability of labeled datasets, and strong geographic and linguistic concentration. In addition, the review finds that modeling approaches are highly heterogeneous, with limited convergence toward shared benchmark tasks. These findings highlight research opportunities for machine learning applications in finance and for the development of sentiment-based corporate disclosure analytics.
