A Machine Learning-Based Quality Control Algorithm for Heavy Rainfall Using Multi-Source Data
Abstract
In this study, a machine learning-based quality control algorithm for heavy rainfall was developed by integrating automatic weather station observations with remote sensing data, minute-level data, and metadata. On heavy rainfall samples from 1 June 2022 to 31 December 2024, four gradient boosting models (XGBoost, LightGBM, CatBoost, and GBRT) significantly outperformed the conventional method, with XGBoost in particular improving precision by 0.110, recall by 0.162, and F1-score by 0.140. This performance gain is attributed to the models’ ability to learn nonlinear features from complex multi-source data, thereby reducing both false alarms and missed detections of anomalous rainfall events. Radar composite reflectivity, satellite cloud-top temperature, and minute-level precipitation were identified as the dominant contributors to model predictions. The integration of multi-sensor observations addressed limitations inherent in conventional threshold-based approaches. SHAP-based interpretability analysis showed that the model’s decision logic aligns with meteorological physical principles. Characteristic patterns, such as combinations of low radar reflectivity and elevated cloud-top temperatures, were flagged as anomalous rainfall events, typically corresponding to manual operational errors. Moreover, the model identified anomalous minute-level precipitation extremes as critical signals for detecting instrument malfunctions and data encoding and transmission errors. The physical consistency of the model’s reasoning enhances its trustworthiness and supports its potential for operational implementation in heavy rainfall quality control.
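The abstract reports its gains in precision, recall, and F1-score. For readers unfamiliar with these metrics, the following is a minimal sketch of how they could be computed for a binary quality control flag, assuming label 1 marks an anomalous rainfall record and 0 a valid one. The function name and the toy labels are hypothetical and are not taken from the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 marks an anomalous record."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed detections
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels invented purely for illustration (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this framing, fewer false alarms raise precision and fewer missed detections raise recall, which is why the abstract links the reported gains to reductions in both.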