Hierarchical Demand Forecasting at Walmart: Evaluating Reconciliation Methods with Empirical Risk Minimisation
Abstract
Hierarchical time series forecasting requires predictions that are coherent across all aggregation levels. This study applies and evaluates four reconciliation methods—Bottom-Up (BU), Top-Down (TD), Store-Level Proportional Disaggregation (SLPD), and Empirical Risk Minimisation (ERM)—on the Walmart Store Sales dataset comprising 421,570 weekly observations across 45 stores and 81 departments (February 2010 to October 2012). A Gradient Boosting Machine (GBM) with strict walk-forward cross-validation serves as the base forecaster. All reconciliation weights are computed exclusively from training-period data, ensuring no test-period leakage. Statistical significance is assessed using a Diebold–Mariano test with Newey–West HAC-corrected variance to account for serial correlation in weekly forecast errors. ERM achieves hierarchical coherence with a negligible 1.9% increase in department-level RMSE over the incoherent base model ($3,044 vs. $2,987), while other reconciliation methods degrade department-level accuracy by over 50%. ERM is statistically significantly better than Bottom-Up at department level (DM = −3.228, p = 0.0012) and achieves the best store-level RMSE ($55,613; 2.7% improvement) and chain-level RMSE ($932,247; 6.8% improvement). Top-Down and SLPD both substantially degrade department-level accuracy, confirming that coarse historical proportions are insufficient for heterogeneous retail demand series. An inventory cost simulation further illustrates the practical downstream implications of reconciliation method choice. The complete pipeline is open-source and fully reproducible.
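The significance test named in the abstract can be illustrated with a minimal sketch. It assumes a squared-error loss differential, a Bartlett-kernel Newey–West long-run variance, and a rule-of-thumb lag choice. None of these details are stated in the abstract, and the function names `diebold_mariano` and `two_sided_p` are hypothetical, so this should not be read as the authors' implementation.

```python
import math
import numpy as np


def two_sided_p(stat):
    """Two-sided p-value of a test statistic under a standard normal null."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


def diebold_mariano(e1, e2, lag=None):
    """Diebold-Mariano statistic for two aligned forecast-error series.

    Assumptions (not taken from the paper): squared-error loss,
    Bartlett-kernel Newey-West variance, rule-of-thumb lag truncation.
    A negative statistic means the first forecaster has the lower loss.
    """
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    d = e1 ** 2 - e2 ** 2              # loss differential per period
    n = d.size
    if lag is None:
        # Common rule of thumb for the truncation lag (an assumption here).
        lag = int(math.floor(4.0 * (n / 100.0) ** (2.0 / 9.0)))
    lag = min(lag, n - 1)

    d_centered = d - d.mean()
    # Lag-0 autocovariance, then Bartlett-weighted higher-order terms.
    long_run_var = np.dot(d_centered, d_centered) / n
    for k in range(1, lag + 1):
        weight = 1.0 - k / (lag + 1.0)
        gamma_k = np.dot(d_centered[k:], d_centered[:-k]) / n
        long_run_var += 2.0 * weight * gamma_k

    stat = d.mean() / math.sqrt(long_run_var / n)
    return stat, two_sided_p(stat)
```

Under the normal approximation, `two_sided_p(-3.228)` gives roughly 0.0012, which agrees with the p-value reported for ERM against Bottom-Up at department level. The Bartlett weights keep the variance estimate non-negative, which is the usual reason for preferring them when weekly forecast errors are serially correlated.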