The Illusion of Alpha: Quantifying Hidden Data Leakage in Financial Machine Learning

Kavya Bhand
Aadi Joshi

Abstract

Data leakage remains one of the most consequential but least transparently documented threats to empirical credibility in financial machine learning, because it can inflate out-of-sample performance while preserving the appearance of statistical rigor. This paper develops a controlled experimental framework to quantify how temporal leakage, cross-sectional leakage, and validation leakage distort portfolio-level inference across logistic regression, random forest, and XGBoost classifiers in long-short equity prediction. Using a synthetic panel of 30 stocks over 10 years with mean reversion, jumps, delistings, and point-in-time universe variation, alongside external validation on the Kenneth French 49 Industry Portfolios, we show that a 16-day forward contamination in rolling feature normalization inflates annualized Sharpe ratios from between 0.15 and 0.57 under clean estimation to between 1.15 and 2.84 under leaked estimation, with the largest inflation occurring in XGBoost. Validation leakage is also economically material: random K-fold cross-validation raises the XGBoost Sharpe ratio from 0.17 to 1.75, while leaky retraining raises it from 0.57 to 1.76. Full-sample z-score scaling and PCA generate only minor distortions in this design. Cross-sectional leakage generates smaller absolute gains but still creates misleading performance, especially for survivorship-biased and future-rank features in nonlinear models. Bootstrap confidence intervals separate the leaked and corrected Sharpe distributions, sub-period analysis shows that the inflation is stable over time, and the disappearance of alpha t-statistics after correction indicates that much of the apparent predictive power is spurious. The paper contributes a leakage taxonomy, an experimentally identified ranking of the most dangerous leakage channels, and a practical audit checklist for reviewers, authors, and institutions conducting financial ML research. JEL Classification: G11; G14; G17; C45; C53.
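
As an illustration of the temporal leakage channel described in the abstract, the following minimal sketch contrasts a clean rolling z-score with one whose window statistics are shifted 16 observations into the future. It is not the authors' code: the function name `rolling_zscore`, the 60-observation window, the `lead` parameter, and the simulated series are all assumptions made for this example; only the 16-day forward contamination comes from the abstract.

```python
import numpy as np
import pandas as pd


def rolling_zscore(series: pd.Series, window: int = 60, lead: int = 0) -> pd.Series:
    """Rolling z-score of `series`.

    With lead=0 the statistics at time t use observations up to and including t.
    With lead>0 the statistics at time t come from a window ending at t+lead,
    so `lead` future observations contaminate the feature (hypothetical leak).
    """
    roll = series.rolling(window, min_periods=window)
    mean = roll.mean().shift(-lead)  # shift(-lead) pulls future statistics back to t
    std = roll.std().shift(-lead)
    return (series - mean) / std


# Simulated feature series (an assumption; any numeric series works).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=300))

clean = rolling_zscore(x, window=60, lead=0)
leaked = rolling_zscore(x, window=60, lead=16)

# Audit check: perturb only observations after t and see which feature moves.
t = 199
x_alt = x.copy()
x_alt.iloc[t + 1:] += 5.0
clean_alt = rolling_zscore(x_alt, window=60, lead=0)
leaked_alt = rolling_zscore(x_alt, window=60, lead=16)

clean_unchanged = bool(np.isclose(clean.iloc[t], clean_alt.iloc[t]))
leaked_unchanged = bool(np.isclose(leaked.iloc[t], leaked_alt.iloc[t]))
print("clean feature unaffected by future data:", clean_unchanged)
print("leaked feature unaffected by future data:", leaked_unchanged)
```

The perturbation check at the end is one simple way to audit a feature pipeline: a feature value at time t that changes when only later observations are altered is, by construction, using future information.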

Version published to 10.21203/rs.3.rs-9180656/v1 on Research Square
Mar 23, 2026

Related articles

Detecting Illicit Investment in Real Estate: A Machine‑Learning Approach to Rare‑Event AML Risk

This article has 1 author:
1. Mark Lokanan
This article has no evaluations
Latest version Mar 5, 2026

Explainable AI for Financial Distress: Evidence from Market Volatility and Regime Dynamics

This article has 2 authors:
1. Seyed Jalal Tabatabei
2. Mohammad Mahdi Mousavi
This article has no evaluations
Latest version Apr 1, 2026

Flexible Probabilistic Models for Capturing Extreme Risk in Cryptocurrency Returns

This article has 1 author:
1. Sayed Mohammed Zeeshan
This article has no evaluations
Latest version Mar 12, 2026
