Robust Small Object Detection on Water Surfaces via Multi-Scale Contextual Attention and Channel-Normalized Feature Aggregation
Abstract
Reliable perception of small floating objects is fundamental to the autonomous navigation of Unmanned Surface Vehicles (USVs) and to Search and Rescue (SAR) operations. However, detection in dynamic water-surface environments remains challenging: high-frequency wave clutter frequently obscures targets, and illumination-induced covariate shifts destabilize feature distributions. To address these limitations, this study proposes YOLO11-MCN, a real-time detection framework that introduces two novel architectural components tailored for water-surface monitoring. First, the Multi-scale Contextual Attention (MSCA) module disentangles target signatures from repetitive background noise. Unlike conventional attention mechanisms, MSCA explicitly aggregates contextual information across heterogeneous receptive fields to suppress wave-generated false positives. Second, the Channel Normalization Attention Mechanism (CNAM) targets illumination instability. By using Group Normalization to calibrate feature statistics, CNAM mitigates covariate shifts caused by extreme lighting conditions. These components are complemented by a high-resolution P2 detection head, which recovers the geometric details of small-scale targets (<32 × 32 pixels) that are typically lost during deep downsampling. Extensive experiments on a dataset of 5,812 images show that YOLO11-MCN achieves a state-of-the-art mAP@0.5 of 92.7%, surpassing the YOLO11n baseline by 5.9 percentage points. Robustness evaluations confirm that MSCA and CNAM significantly reduce missed detections under severe wave clutter and backlighting. With a recall of 90.5% and an inference speed of 94 FPS, the proposed method provides a robust and efficient solution for USV perception, and its model complexity (3.9M parameters) is compatible with further edge-device optimization.
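The abstract does not specify how CNAM is built, so the following is only a minimal sketch of the Group Normalization step it names, not the authors' module. It illustrates why group-wise statistics calibration can help with lighting changes: a global gain and offset applied to the features (a crude stand-in for an illumination shift) is almost entirely removed by the normalization. The function name `group_norm`, the toy feature values, and the gain and offset are all assumptions made for illustration.

```python
import math


def group_norm(features, num_groups, eps=1e-5):
    """Normalize a list of channels (each a flat list of floats) group-wise.

    Channels are split into `num_groups` contiguous groups; each group is
    shifted to zero mean and scaled to (approximately) unit variance using
    its own statistics. No learned scale or bias is applied in this sketch.
    """
    num_channels = len(features)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    group_size = num_channels // num_groups
    normalized = []
    for g in range(num_groups):
        group = features[g * group_size:(g + 1) * group_size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for channel in group:
            normalized.append([(v - mean) * inv_std for v in channel])
    return normalized


# Four toy channels with four spatial positions each (made-up numbers).
features = [
    [0.1, 0.4, 0.2, 0.9],
    [0.3, 0.8, 0.5, 0.7],
    [1.2, 0.6, 0.9, 1.1],
    [0.2, 0.1, 0.4, 0.3],
]

# Hypothetical illumination change: higher gain plus a constant offset.
brightened = [[3.0 * v + 2.0 for v in channel] for channel in features]

base = group_norm(features, num_groups=2)
shifted = group_norm(brightened, num_groups=2)

# Largest element-wise gap between the two normalized feature sets.
max_diff = max(
    abs(a - b)
    for ch_a, ch_b in zip(base, shifted)
    for a, b in zip(ch_a, ch_b)
)
print(f"largest difference after normalization: {max_diff:.2e}")
```

The two normalized outputs differ only by a small residual introduced by `eps`. A real attention module would presumably follow this step with learned channel weights, which the abstract does not describe and which are therefore left out here.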