ARF-YOLO: Attention-Guided Adaptive Resolution-Aware Feature Learning for UAV Remote Sensing Object Detection
Abstract
Unmanned aerial vehicle (UAV)-based remote sensing object detection faces three fundamental bottlenecks: (1) insufficient resolution diversity in single-scale detection heads, causing irreversible spatial detail loss for small targets; (2) semantic gap accumulation in multi-scale feature fusion due to content-agnostic bilinear interpolation; and (3) inefficient feature resource allocation that treats all channels, spatial patches, and scale levels as equally important regardless of relevance. To address these challenges, we propose ARF-YOLO, a novel UAV detection framework built upon YOLOv11 with three synergistic innovations. The Attention-Guided Resolution Head (AGRH) incorporates the Multi-Perspective Feature Attention (MPFA) module, which simultaneously processes dual-resolution feature streams through multi-directional pooling-based attention to fuse semantic context with fine-grained spatial cues. The Adaptive Multi-Level Feature Fusion Module (AMFF) replaces bilinear upsampling with content-adaptive dynamic kernel generation (FAUS), structure-guided feature refinement (FRS), and learning-based cross-level weighting (AFFS). The Fast Scale Resource Assigner (FSRA), adapted from the global dynamic query framework for small target detection, is incorporated into our pipeline to dynamically allocate representation capacity along channels, spatial patches, and scale levels via three lightweight parallel assigners. We further propose the ARF-Scale-Aware Loss, which amplifies the supervisory signal for small objects through inverse-scale weighting. Extensive experiments on VisDrone2019 and UAVDT demonstrate that ARF-YOLO achieves 48.5% and 63.7% mAP@0.5, respectively, surpassing the YOLOv11 baseline by 5.1 and 5.4 percentage points with only +2.3M additional parameters (an 11.5% relative increase) while maintaining real-time inference at 101 fps.
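To illustrate the inverse-scale weighting idea behind the ARF-Scale-Aware Loss, the sketch below shows one plausible form: a per-box loss weight inversely proportional to the normalized box scale, so small objects contribute more to the gradient. The function name, the clipping bound, and the exact normalization are illustrative assumptions, not the paper's definition.

```python
import math

def scale_aware_weight(w, h, img_size=640.0, alpha=1.0, eps=1e-6, w_max=10.0):
    """Hypothetical inverse-scale weight: smaller boxes receive larger weights.

    w, h: box width/height in pixels; img_size: input resolution.
    alpha: global scaling factor; w_max caps the weight for tiny boxes
    to avoid exploding gradients (an assumed safeguard, not from the paper).
    """
    scale = math.sqrt(w * h) / img_size  # normalized box scale in (0, 1]
    return min(alpha / (scale + eps), w_max)

# An 8x8 box is weighted far more heavily than a 128x128 box:
small = scale_aware_weight(8.0, 8.0)      # raw weight 640/8 = 80, clipped to 10.0
large = scale_aware_weight(128.0, 128.0)  # roughly 640/128 = 5.0
```

Multiplying the per-box localization (and optionally classification) loss by such a weight amplifies supervision for small targets without changing the loss for medium and large objects much, which matches the stated goal of the ARF-Scale-Aware Loss.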