Scalable Automated Video Labeling for Early Wildfire Smoke Detection with Fast-Then-Precise Two-Stage Inference
Abstract
Early wildfire response depends on detecting the first faint appearance of smoke while maintaining low false-alarm rates across diverse cameras, lighting conditions, and environments. A central barrier to progress is the lack of scalable, reliable supervision for subtle early-stage smoke plumes, which makes models brittle under real-world domain shift. We address this challenge by introducing a scalable automated video labeling pipeline based on SAM2 mask propagation, including reverse-frame processing, that enables consistent annotation of early smoke emergence from long time-series camera data. Segmentation masks are converted into tight bounding boxes with targeted human validation to remove cloud artifacts, producing a large and diverse training set spanning fixed-view and zoom-capable wildfire camera networks. Building on this dataset, we design a fast-then-precise two-stage smoke detection system that mirrors operational alerting logic. A high-recall early-warning stage based on RT-DETR prioritizes rapid detection, while a high-precision confirmation stage using YOLOv11 stabilizes alerts and suppresses false positives. The system is evaluated on a strict temporally held-out benchmark consisting of all available FIgLib ignition sequences from 2023 to 2025, which were excluded from training. On this real-world dataset, the early-warning stage achieves high recall (0.94) and detects smoke in 7.0 ± 6.3 minutes on average, while the confirmation stage reaches high precision (0.95) with no false positives observed across the full 2023–2025 evaluation set. These results demonstrate that scalable video labeling combined with complementary two-stage inference enables reliable early wildfire smoke detection under realistic operating conditions.
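The mask-to-box step described above (converting propagated SAM2 segmentation masks into tight bounding boxes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the optional padding parameter, and the empty-mask handling are assumptions for demonstration.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray, pad: int = 0):
    """Convert a binary segmentation mask (H, W) into a tight
    (x_min, y_min, x_max, y_max) bounding box, optionally padded.
    A hypothetical helper mirroring the pipeline's mask-to-box step."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: no smoke labeled in this frame
    h, w = mask.shape
    # Clip the (optionally padded) box to the image bounds.
    x_min = max(int(xs.min()) - pad, 0)
    y_min = max(int(ys.min()) - pad, 0)
    x_max = min(int(xs.max()) + pad, w - 1)
    y_max = min(int(ys.max()) + pad, h - 1)
    return (x_min, y_min, x_max, y_max)

# Toy example: a small blob of "smoke" pixels.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True
print(mask_to_bbox(mask))  # (3, 2, 5, 4)
```

Boxes produced this way would then pass through the targeted human validation step to filter cloud artifacts before entering the training set.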
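The fast-then-precise alerting logic can be illustrated with a small state machine: the high-recall stage raises a tentative alert as soon as it fires, and the high-precision stage must agree over several consecutive frames before the alert is confirmed. The detectors are abstracted here as per-frame confidence scores, and the thresholds and frame count are hypothetical values for illustration, not the paper's operating points.

```python
from dataclasses import dataclass

# Hypothetical operating points (assumed for this sketch; the paper
# reports only the resulting recall/precision, not these thresholds).
EARLY_CONF = 0.3      # sensitive early-warning threshold (RT-DETR stage)
CONFIRM_CONF = 0.6    # strict confirmation threshold (YOLOv11 stage)
CONFIRM_FRAMES = 3    # consecutive agreements needed to confirm

@dataclass
class TwoStageAlerter:
    """Fast-then-precise gating: tentative alerts fire immediately,
    confirmed alerts require a streak of high-precision agreements."""
    streak: int = 0

    def step(self, early_score: float, confirm_score: float) -> str:
        if early_score < EARLY_CONF:
            self.streak = 0
            return "clear"
        # Early-warning stage fired: alert is at least tentative.
        if confirm_score >= CONFIRM_CONF:
            self.streak += 1
        else:
            self.streak = 0
        return "confirmed" if self.streak >= CONFIRM_FRAMES else "tentative"

alerter = TwoStageAlerter()
states = [alerter.step(e, c) for e, c in
          [(0.4, 0.7), (0.5, 0.8), (0.6, 0.9), (0.2, 0.1)]]
print(states)  # ['tentative', 'tentative', 'confirmed', 'clear']
```

Keeping the two stages complementary rather than cascaded by confidence alone is what lets the system report smoke within minutes while suppressing false positives, matching the operational alerting logic the abstract describes.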