Beyond Attention: Hierarchical Mamba Models for Scalable Spatiotemporal Traffic Forecasting
Abstract
Traffic forecasting in cellular networks is a challenging spatiotemporal prediction problem due to strong temporal dependencies, spatial heterogeneity across cells, and the need for scalability to large network deployments. Traditional cell-specific models incur prohibitive training and maintenance costs, while global models often fail to capture heterogeneous spatial dynamics. Recent spatiotemporal architectures based on attention or graph neural networks improve accuracy but introduce high computational overhead, limiting their applicability in large-scale or real-time settings. We propose HiSTM (Hierarchical SpatioTemporal Mamba), a spatiotemporal forecasting architecture built on state-space modeling. HiSTM combines spatial convolutional encoding for local neighborhood interactions with Mamba-based temporal modeling to capture long-range dependencies, followed by attention-based temporal aggregation for prediction. The hierarchical design enables representation learning with linear computational complexity in sequence length and supports both grid-based and correlation-defined spatial structures. Cluster-aware extensions incorporate spatial regime information to handle heterogeneous traffic patterns. Experimental evaluation on large-scale real-world cellular datasets demonstrates that HiSTM outperforms strong baselines in accuracy. On the Milan dataset, HiSTM reduces MAE by 29.4% compared to STN, while achieving the lowest RMSE and highest R² score among all evaluated models. In multi-step autoregressive forecasting, HiSTM maintains 36.8% lower MAE than STN and 11.3% lower MAE than STTRE at the 6-step horizon, with a 58% slower error accumulation rate than STN. On the unseen Trentino dataset, HiSTM achieves a 47.3% MAE reduction over STN and demonstrates stronger cross-dataset generalization. A single HiSTM model outperforms 10,000 independently trained cell-specific LSTMs, demonstrating the advantage of joint spatiotemporal learning.
HiSTM maintains best-in-class performance with up to 30% missing data, outperforming all baselines under various missing-data scenarios. The model achieves these results while being 45× smaller than PredRNN++ and 18× smaller than xLSTM, and it maintains a competitive inference latency of 1.19 ms, showcasing its effectiveness for scalable 5G/6G traffic prediction in resource-constrained environments.
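The claim of linear complexity in sequence length comes from the state-space recurrence at the core of Mamba-style temporal modeling: each step updates a fixed-size hidden state, so the full scan costs O(T). The toy sketch below illustrates that recurrence with a plain (non-selective, time-invariant) linear SSM in NumPy; the function name, shapes, and parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state-space recurrence:
        h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t
    x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    Each step is O(1) in T, so the whole sequence scan is O(T)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # fixed-size state carries long-range context
        ys.append(C @ h)
    return np.stack(ys)

# Toy usage: 6-step sequence, scalar input/output, 4-dim hidden state.
rng = np.random.default_rng(0)
T, d_state = 6, 4
A = 0.9 * np.eye(d_state)              # stable diagonal transition
B = rng.standard_normal((d_state, 1))
C = rng.standard_normal((1, d_state))
y = ssm_scan(rng.standard_normal((T, 1)), A, B, C)
print(y.shape)  # (6, 1)
```

In Mamba proper, A, B, and C are input-dependent (the "selective" mechanism) and the scan is computed with a parallel hardware-aware kernel, but the per-step fixed-size state, and hence the linear scaling that the abstract contrasts with quadratic attention, is the same.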