A Machine Learning-Based Quality Control Algorithm for Heavy Rainfall Using Multi-Source Data
Abstract
In this study, a machine learning-based quality control algorithm for heavy rainfall was developed by integrating automatic weather station observations with remote sensing data, minute-level data, and metadata. On heavy rainfall samples from 1 June 2022 to 31 December 2024, four gradient boosting models, namely eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Categorical Boosting (CatBoost), and Gradient Boosted Regression Trees (GBRT), significantly outperformed conventional precipitation-threshold-based methods, including regional extreme value checks and temporal consistency checks, among others. XGBoost in particular increased precision by 0.110 and recall by 0.162. This corresponds to a substantial reduction in both false alarms (higher precision) and missed detections (higher recall) of anomalous heavy rainfall events, which enhances the reliability of the quality-controlled data. Radar composite reflectivity, satellite cloud-top temperature, and minute-level precipitation were identified as the dominant contributors to model predictions. The integration of multi-sensor observations effectively addressed limitations inherent in conventional threshold-based approaches. SHapley Additive exPlanations (SHAP)-based interpretability analysis showed that the model’s decision logic aligns with meteorological physical principles. Characteristic patterns, such as low radar reflectivity combined with elevated cloud-top temperatures, were flagged as anomalous rainfall events, typically corresponding to manual operational errors. Moreover, the model identified anomalous minute-level precipitation extremes as critical signals for detecting instrument malfunctions and data encoding and transmission errors.
The physical consistency of the model’s reasoning enhances its trustworthiness and supports its potential for operational implementation in heavy rainfall quality control.
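The link drawn above between precision and false alarms, and between recall and missed detections, can be illustrated with a minimal sketch. The function name `precision_recall` and the toy flag lists below are hypothetical, chosen only for illustration; they are not data or code from the study, and the numbers do not reproduce its reported gains.

```python
def precision_recall(flags, truth):
    """Return (precision, recall) for boolean QC flags against boolean labels.

    flags[i] is True when the QC method marks record i as anomalous;
    truth[i] is True when record i is actually erroneous.
    """
    if len(flags) != len(truth):
        raise ValueError("flags and truth must have the same length")
    tp = sum(1 for f, t in zip(flags, truth) if f and t)      # correctly flagged errors
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)  # false alarms
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)  # missed detections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels: 10 records, 4 of which are truly erroneous.
truth = [True, True, True, True, False, False, False, False, False, False]
# Hypothetical threshold-based flags: 2 hits, 2 false alarms, 2 misses.
threshold_flags = [True, True, False, False, True, True, False, False, False, False]
# Hypothetical model flags: 3 hits, 1 false alarm, 1 miss.
model_flags = [True, True, True, False, True, False, False, False, False, False]

p_thr, r_thr = precision_recall(threshold_flags, truth)
p_ml, r_ml = precision_recall(model_flags, truth)
print(f"threshold: precision={p_thr:.2f}, recall={r_thr:.2f}")
print(f"model:     precision={p_ml:.2f}, recall={r_ml:.2f}")
```

In this toy case, one fewer false alarm and one more correctly flagged error raise precision, and one fewer missed detection raises recall, which is the sense in which gains in the two metrics map onto the two error types.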