MA-YOLO: Multi-Scale Attention-Enhanced YOLO for Object Detection in Remote Sensing Images
Abstract
Object detection plays a crucial role in remote sensing by enabling automated information extraction and supporting downstream decision-making. However, remote sensing images often exhibit complex backgrounds and large scale variations, which degrade detection performance. To address these challenges, we propose MA-YOLO, an enhanced variant of YOLOv7 designed for robust object detection in remote sensing imagery. First, we introduce the dilated convolution layer aggregation network (DELAN), which integrates MobileViTv3 and dilated convolution to balance global and local feature extraction while reducing computational overhead, improving semantic representation in complex backgrounds and across diverse object scales. Second, we propose the cross-layer feature fusion module (CFFM), which improves information flow between the backbone and neck networks by fusing shallow positional information with deep semantic features, mitigating the loss of contextual information. Finally, we incorporate the multi-angle pooling attention module (MAPA) into the neck network, combining multi-angle pooling with Transformer-based attention to capture target features from multiple directions and improve the robustness of feature extraction and multi-scale detection. Extensive experiments on the NWPU VHR-10, VisDrone2019, and RSOD datasets demonstrate the effectiveness and robustness of MA-YOLO. On NWPU VHR-10, MA-YOLO improves $mAP_{0.5}$ by 1.1\% and $mAP_{0.5:0.95}$ by 0.9\% over YOLOv7, highlighting its superior capability in handling complex backgrounds and multi-scale object detection.
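The abstract does not spell out DELAN's internals, but its core idea, enlarging the receptive field with dilated convolutions inside a layer-aggregation block, can be shown with a minimal PyTorch sketch. Everything below (the class name, dilation rates, and the residual fusion) is an assumption for illustration, and the MobileViTv3 global branch is omitted.

```python
import torch
import torch.nn as nn

class DilatedAggregationBlock(nn.Module):
    """Hypothetical DELAN-style block: parallel dilated-convolution branches
    give one block several receptive-field sizes at full spatial resolution."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding=d with dilation=d keeps the spatial size unchanged
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in dilations
        ])
        # 1x1 conv fuses the concatenated branch outputs back to `channels`
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate all branches, then add a residual connection
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1)) + x
```

With dilation rates 1, 2, and 3, the three 3×3 branches cover effective 3×3, 5×5, and 7×7 receptive fields without extra depth or downsampling, which is how dilated convolution trades parameters for context.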
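Likewise, the MAPA description (multi-angle pooling followed by Transformer-based attention) suggests a design along the following lines. This is a sketch under stated assumptions, not the paper's implementation: 90° rotations stand in for the paper's pooling directions, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultiAnglePoolingAttention(nn.Module):
    """Hypothetical MAPA-style module: pool the feature map along several
    orientations, then fuse the directional descriptors with attention."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # "Multi-angle" pooling: average-pool the map at 0°, 90°, 180°, 270°
        # (rotations are a stand-in for the paper's directional pooling)
        views = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
        tokens = torch.stack(
            [v.mean(dim=(2, 3)) for v in views], dim=1
        )  # (B, 4, C): one descriptor per pooling direction
        fused, _ = self.attn(tokens, tokens, tokens)   # cross-direction attention
        fused = self.norm(fused + tokens).mean(dim=1)  # residual + average
        # Re-weight the original map channel-wise with the fused descriptor
        return x * fused.sigmoid().view(b, c, 1, 1)

if __name__ == "__main__":
    feat = torch.randn(2, 256, 40, 40)  # a typical neck feature map
    out = MultiAnglePoolingAttention(256)(feat)
    print(out.shape)  # torch.Size([2, 256, 40, 40])
```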