Leveraging Sentinel-2 Data and Machine Learning for Drought Detection in India: A Case Study
Abstract
Droughts significantly impact agriculture, water resources, and ecosystems, and their timely detection is essential for implementing effective mitigation strategies. This study explores the use of multispectral Sentinel-2 remote sensing indices and machine learning techniques to detect drought conditions in three distinct regions of India (Jodhpur, Amravati, and Thanjavur) during the Rabi season (October–April). Twelve remote sensing indices were studied to assess different aspects of vegetation health, soil moisture, and water stress, as well as their possible joint use and influence as indicators of regional drought events. Reference data defining drought conditions in each region were sourced primarily from official government drought declarations and from regional and national news publications, which provide seasonal maps of drought conditions across the country. Based on this information, a District vs. Year (3×6) ground truth table was created, indicating the presence or absence of drought (Yes/No) for each region across the six-year period. Using this ground truth table, we extended the remote sensing dataset by adding a binary drought label to each observation: 1 for “Drought” and 0 for “No Drought”. The dataset is organized by year (2016–2021) in a two-dimensional format, with indices as columns and observations as rows; each observation represents a single measurement of the remote sensing indices. This enriched dataset serves as the foundation for training and evaluating machine learning models that classify drought conditions from spectral information. The resulting dataset was used to predict drought events with several machine learning models, including Random Forest, XGBoost, Bagging Classifier, and Gradient Boosting. Among these models, the Bagging Classifier achieved the highest accuracy (84.15%), followed closely by Random Forest (83.39%) and XGBoost (82.30%).
In terms of precision, Random Forest and the Bagging Classifier performed comparably (83.49% and 83.44%, respectively), while XGBoost achieved 79.82%. We then applied a seasonal majority-voting strategy, assigning a final drought label to each region and Rabi season based on the majority of predicted monthly labels. With this method, XGBoost, Random Forest, and the Bagging Classifier each achieved 94% accuracy, precision, and recall, while Gradient Boosting reached 83% across all metrics. SHapley Additive exPlanations (SHAP) analysis revealed that the Normalized Multi-band Drought Index (NMDI) and Day of the Season (DOS) consistently emerged as the most influential features in determining model predictions. This finding is supported by Borda count and weighted-sum analyses, which ranked NMDI and DOS as the top features across all models. Additionally, the Red-edge Chlorophyll Index (RECI), Enhanced Vegetation Index (EVI), Normalized Difference Moisture Index (NDMI), and Ratio Drought Index (RDI) were identified as important features contributing to model performance. These features offer insight into the underlying patterns and relationships within the data. To evaluate the impact of feature selection, we further conducted a feature ablation study, training each model on the top 1, top 2, top 3, top 4, and top 5 ranked features and assessing accuracy, precision, and recall in each case. XGBoost demonstrated the best overall performance, especially when using the top 5 features.
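The labeling step described in the abstract, where each observation receives a binary drought label from the District vs. Year ground truth table, might look roughly like the sketch below. The field names (`district`, `year`, `drought`), the function `add_drought_label`, and all values shown are illustrative assumptions, not the study's actual schema or drought declarations.

```python
# Hypothetical ground-truth table: (district, year) -> "Yes"/"No".
# The entries are placeholders for illustration, NOT the study's declarations.
GROUND_TRUTH = {
    ("Jodhpur", 2016): "Yes",
    ("Jodhpur", 2017): "No",
    ("Thanjavur", 2016): "Yes",
}


def add_drought_label(observations, ground_truth):
    """Return copies of the observations with a binary 'drought' field added.

    1 means "Drought", 0 means "No Drought". Observations whose
    (district, year) pair is missing from the table raise a KeyError
    rather than being silently labeled.
    """
    labeled = []
    for obs in observations:
        key = (obs["district"], obs["year"])
        if key not in ground_truth:
            raise KeyError(f"no ground-truth entry for {key}")
        label = 1 if ground_truth[key] == "Yes" else 0
        labeled.append({**obs, "drought": label})
    return labeled


# Made-up index values; each row stands for one measurement of the indices.
observations = [
    {"district": "Jodhpur", "year": 2016, "NMDI": 0.41, "EVI": 0.18},
    {"district": "Jodhpur", "year": 2017, "NMDI": 0.55, "EVI": 0.27},
]
labeled = add_drought_label(observations, GROUND_TRUTH)
```

Because the label is assigned per district and year, every observation from the same district and year shares one label; the input rows are copied rather than modified in place.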
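The seasonal majority-voting strategy can be sketched as follows, assuming predictions arrive as (region, season, label) triples. The function name `seasonal_majority_vote`, the season identifiers, and the tie-breaking rule (a tie counts as drought) are assumptions made for this example; the abstract does not state how ties are handled.

```python
from collections import Counter, defaultdict


def seasonal_majority_vote(predictions):
    """Collapse per-observation labels into one label per (region, season).

    predictions: iterable of (region, season, label), with label in {0, 1}.
    Ties are resolved as drought (1); this tie rule is an assumption.
    """
    votes = defaultdict(Counter)
    for region, season, label in predictions:
        votes[(region, season)][label] += 1
    return {
        key: 1 if counts[1] >= counts[0] else 0
        for key, counts in votes.items()
    }


# Made-up monthly predictions for two regions in one Rabi season.
preds = [
    ("Jodhpur", "2018-19", 1),
    ("Jodhpur", "2018-19", 1),
    ("Jodhpur", "2018-19", 0),
    ("Amravati", "2018-19", 0),
    ("Amravati", "2018-19", 0),
    ("Amravati", "2018-19", 1),
]
seasonal = seasonal_majority_vote(preds)
```

Aggregating this way lets isolated monthly misclassifications be outvoted, which is consistent with the higher season-level scores reported above.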
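For readers unfamiliar with the Borda count used to combine feature rankings across models, a minimal sketch is given below. It uses the standard scoring in which a feature ranked at position p (0-based) among n features earns n - 1 - p points. The per-model rankings and the names `model_a` to `model_c` are invented for illustration and are not the study's SHAP results; the weighted-sum analysis is not shown.

```python
def borda_count(rankings):
    """Combine per-model feature rankings into one consensus ranking.

    rankings: dict mapping a model name to a list of features ordered
    from most to least important. Returns the features sorted by total
    Borda score (highest first); ties are broken alphabetically.
    """
    scores = {}
    for ranked in rankings.values():
        n = len(ranked)
        for position, feature in enumerate(ranked):
            scores[feature] = scores.get(feature, 0) + (n - 1 - position)
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings, e.g. derived from mean absolute SHAP values.
rankings = {
    "model_a": ["NMDI", "DOS", "RECI", "EVI"],
    "model_b": ["DOS", "NMDI", "EVI", "RECI"],
    "model_c": ["NMDI", "DOS", "EVI", "RECI"],
}
consensus = borda_count(rankings)
```

A feature that ranks near the top in every model accumulates the highest total, even if it is not first in each individual ranking.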