The Illusion of Alpha: Quantifying Hidden Data Leakage in Financial Machine Learning

Abstract

Data leakage remains one of the most consequential but least transparently documented threats to empirical credibility in financial machine learning, because it can inflate out-of-sample performance while preserving the appearance of statistical rigor. This paper develops a controlled experimental framework to quantify how temporal leakage, cross-sectional leakage, and validation leakage distort portfolio-level inference across logistic regression, random forest, and XGBoost classifiers in long-short equity prediction. Using a synthetic panel of 30 stocks over 10 years with mean reversion, jumps, delistings, and point-in-time universe variation, alongside external validation on the Kenneth French 49 Industry Portfolios, we show that a 16-day forward contamination in rolling feature normalization inflates annualized Sharpe ratios from between 0.15 and 0.57 under clean estimation to between 1.15 and 2.84 under leaked estimation, with the largest inflation occurring in XGBoost. Validation leakage is also economically material: random K-fold cross-validation raises the XGBoost Sharpe ratio from 0.17 to 1.75, while leaky retraining raises it from 0.57 to 1.76. Full-sample z-score scaling and PCA generate only minor distortions in this design. Cross-sectional leakage generates smaller absolute gains but still creates misleading performance, especially for survivorship-biased and future-rank features in nonlinear models. Bootstrap confidence intervals separate the leaked and corrected Sharpe distributions, sub-period analysis shows that the inflation is stable over time, and the disappearance of alpha t-statistics after correction indicates that much of the apparent predictive power is spurious. The paper contributes a leakage taxonomy, an experimentally identified ranking of the most dangerous leakage channels, and a practical audit checklist for reviewers, authors, and institutions conducting financial ML research.

JEL Classification: G11; G14; G17; C45; C53.
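To make the temporal-leakage channel concrete, the following is a minimal sketch of forward contamination in rolling feature normalization. It is not the paper's implementation: the function `rolling_zscore`, the `lead` parameter, the 60-day window, and the simulated price path are all hypothetical, and the sketch assumes that "16-day forward contamination" means the normalization window extends 16 observations past the current date.

```python
import random
import statistics


def rolling_zscore(values, window, lead=0):
    """Z-score values[t] against `window` observations ending at t + lead.

    lead=0 uses only data available at time t (point-in-time, clean).
    lead>0 lets the window reach `lead` steps into the future (leaked).
    Positions without a full window are returned as None.
    """
    out = [None] * len(values)
    for t in range(len(values)):
        end = t + lead + 1        # exclusive end of the window
        start = end - window
        if start < 0 or end > len(values):
            continue              # not enough history (or future) available
        win = values[start:end]
        sd = statistics.pstdev(win)
        if sd > 0:
            out[t] = (values[t] - statistics.fmean(win)) / sd
    return out


# Hypothetical price path: a seeded Gaussian random walk (illustration only).
rng = random.Random(0)
prices = [100.0]
for _ in range(199):
    prices.append(prices[-1] + rng.gauss(0, 1))

clean = rolling_zscore(prices, window=60, lead=0)
leaked = rolling_zscore(prices, window=60, lead=16)  # assumed 16-day look-ahead

# Audit check: shock one FUTURE observation and recompute the feature at t.
# A point-in-time feature must not change; a leaked one will.
t = 100
shocked = list(prices)
shocked[t + 10] += 25.0
clean_shocked = rolling_zscore(shocked, window=60, lead=0)
leaked_shocked = rolling_zscore(shocked, window=60, lead=16)

print("clean feature unchanged: ", clean[t] == clean_shocked[t])
print("leaked feature unchanged:", leaked[t] == leaked_shocked[t])
```

The perturbation test at the end is one way an audit could detect this kind of leakage without knowing how a feature was built: if altering data dated after t changes the feature value at t, the feature is not point-in-time.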
