Does More Data Always Help? Input Configuration Impacts on LSTM-based Water Level Prediction
Abstract
Floods are among the most devastating natural disasters, necessitating efficient early-warning systems. Machine-learning surrogates are now widely adopted for this task, and increasing data volume is often assumed to improve their performance. Using 17 flood events (2012–2024) from the data-scarce Pajiang River detention basin, we quantitatively test whether "more data" necessarily translates into better multi-step water-level forecasts. Three Long Short-Term Memory (LSTM) network input scenarios were designed: using only historical data from the target station (S1), using only upstream station data (S2), and combining both data sources (S3). Surprisingly, expanding the input matrix from S1 to S3 yielded no accuracy gain and even degraded skill beyond the 4-h lead time (NSE decreased from 0.97 to 0.44 and peak bias increased from 0.25 to 1.88 m). The highest accuracy at 1–2 h prediction horizons was achieved with the smallest input set (S1), whereas the most robust longer-lead forecasts (3–4 h) were produced with the moderate set (S2). Parsimonious inputs reduced over-fitting risk and kept uncertainty within operational thresholds. Our findings caution against unchecked input inflation in data-limited basins and highlight the need for input-selection protocols prior to model deployment.
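To make the three input scenarios concrete, the sketch below shows one possible way to assemble S1–S3 as sliding-window LSTM inputs. This is not the authors' code: the window length (12 h), forecast horizon (4 h), network size, and the synthetic stand-in series are all assumptions for illustration only.

```python
# Illustrative sketch (not the study's implementation): constructing the
# S1/S2/S3 input configurations for a multi-step LSTM water-level forecaster.
# Assumed: hourly series `target` (forecast station) and `upstream`
# (upstream station) as equal-length 1-D NumPy arrays.
import numpy as np
import torch
import torch.nn as nn

LOOKBACK = 12   # assumed input window length (hours)
HORIZON = 4     # assumed lead times: 1-4 h ahead

def make_windows(features, target, lookback=LOOKBACK, horizon=HORIZON):
    """Slice a (n_steps, n_features) series into LSTM training samples.

    X: (n_samples, lookback, n_features)
    y: (n_samples, horizon) -- water levels at 1..horizon h lead time.
    """
    X, y = [], []
    for t in range(lookback, len(target) - horizon + 1):
        X.append(features[t - lookback:t])
        y.append(target[t:t + horizon])
    return np.stack(X), np.stack(y)

# Synthetic random-walk data standing in for the observed water levels.
rng = np.random.default_rng(0)
target = rng.normal(size=2000).cumsum()     # target-station water level
upstream = rng.normal(size=2000).cumsum()   # upstream-station water level

# S1: target-station history only; S2: upstream history only; S3: both.
scenarios = {
    "S1": np.column_stack([target]),
    "S2": np.column_stack([upstream]),
    "S3": np.column_stack([target, upstream]),
}

class LSTMForecaster(nn.Module):
    """Plain LSTM regressor mapping a lookback window to multi-step levels."""
    def __init__(self, n_features, hidden=32, horizon=HORIZON):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, lookback, hidden)
        return self.head(out[:, -1])   # forecast from the last hidden state

for name, feats in scenarios.items():
    X, y = make_windows(feats, target)
    model = LSTMForecaster(n_features=X.shape[-1])
    pred = model(torch.tensor(X, dtype=torch.float32))
    print(name, "X:", X.shape, "y:", y.shape, "pred:", tuple(pred.shape))
```

Under this reading, the scenarios differ only in the feature columns fed to an otherwise identical network, so any skill difference reflects the input configuration rather than the architecture.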