Predicting Daily Stock Price Movements Using Data Mining Techniques: A Comparative Analysis of Logistic Regression, Decision Tree, Random Forest, and XGBoost on Yahoo Finance Time-Series Data

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The study assesses the capability of various supervised machine learning approaches to predict the short-term movements in the stock market through historical financial time-series data. The Yahoo Finance dataset comprising 2018-2023, containing more than 1,200 daily trades, is the foundation for the research work, which seeks to determine if the closing price of the stock for the next day will be either higher or lower. To secure the quality of the data and to avoid temporal leakage, a thorough pre-processing procedure—missing value check, outlier smoothing, feature extraction with technical indicators like moving averages, normalization, and chronological splitting—was carried out. Four data mining models—Logistic Regression, Decision Tree, Random Forest, and XGBoost—were built, and their performance assessed in terms of accuracy, precision, recall, and F1-score, with a time-aware validation method through Time Series Split supporting this. Logistic Regression results indicated the highest recall (1.0) and F1-score (0.67) where it identified all price movements up, while the Random Forest and XGBoost have better precision (0.5248) and overall accuracy (0.5163) which means that a more balanced trade-off between false positives and false negatives has been indicated. The Decision Tree model was easy to interpret but was nonetheless the least effective in a highly fluctuating financial market setting because it was not able to generalize as much. In conclusion, the findings have shown the difficulties of predicting stock markets that are inherently volatile; however, it is still possible through the use of well-designed technical features and supervised learning to uncover patterns that have economic significance. The study finally recommends that the model should be retrained, the market regime should be adapted, multi-stock trading should be expanded, and testing frameworks that integrate back testing should be set up for the real-world applicability.

Article activity feed