RAVL: A Region Attention Yolo with Two-Stage Training for Enhanced Object Detection

Abstract

Improving the accuracy of object detection has been a key focus of recent research. However, many existing approaches fail to fully exploit location labels to suppress irrelevant background features, which limits detection performance, particularly for small objects. In this paper, we propose a novel region attention mechanism to address this limitation, consisting of a region attention module (RAM) and a two-stage training strategy (TSTS). The RAM comprises a Squeeze-and-Excitation (SE) block, which dynamically assigns weights to multi-channel feature maps to generate a saliency map, and a fusion block that integrates the features with the saliency map to enhance object features while suppressing background features. The RAM can be embedded into the shallow layers of any version of YOLO, yielding an object detector we call Region Attention YOLO (RAVL). RAVL is trained with the TSTS. In the first stage, "no background" images are generated from the location labels, and a vanilla YOLOv8 detector is trained on them to produce ground-truth "no background" features. In the second stage, RAVL is trained from scratch on the original images by minimizing a detection loss together with a region attention loss. The region attention loss encourages the low-level features extracted from the "no background" and original images to be similar, thereby improving overall detection accuracy. Extensive experiments with YOLOv5, YOLOv8, YOLOv9, and YOLOv10 on the FLIR infrared image dataset and the VisDrone2019 visible-light dataset demonstrate that our method significantly improves detection performance. YOLOv8 achieves mAP0.5 scores of 81.7% on FLIR and 42.1% on VisDrone2019, which are 3.1 and 5.0 percentage points higher than the same model trained without our method. For small objects in particular, mAP0.5 improves by 5.7 points for bicycles in FLIR and by 7.9 points for pedestrians in VisDrone2019.
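The abstract only outlines the design, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' implementation: the class names, the 1x1 convolutions, the residual form of the fusion block, and the MSE form of the region attention loss are all assumptions made for concreteness. It shows the pieces the abstract names: an SE block that reweights channels, a saliency map derived from the reweighted features, a fusion step that modulates the input by that saliency, and an auxiliary loss that pulls RAVL's shallow features toward the stage-one "no background" features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global-average-pool each channel, then learn
    # per-channel weights that rescale the feature map.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze to (B, C), excite to weights
        return x * w.view(b, c, 1, 1)      # channel-reweighted feature map

class RegionAttentionModule(nn.Module):
    # Hypothetical RAM sketch: the SE block's reweighted features are collapsed
    # into a single-channel saliency map, which then gates the input features
    # (enhancing object regions, suppressing background) before fusion.
    def __init__(self, channels: int):
        super().__init__()
        self.se = SEBlock(channels)
        self.to_saliency = nn.Conv2d(channels, 1, kernel_size=1)  # assumed head
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # assumed fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        saliency = torch.sigmoid(self.to_saliency(self.se(x)))  # (B, 1, H, W)
        return self.fuse(x * saliency) + x  # residual fusion keeps original signal

def region_attention_loss(feat_orig: torch.Tensor,
                          feat_no_bg: torch.Tensor) -> torch.Tensor:
    # Stage-two auxiliary loss: make RAVL's low-level features on the original
    # image match the frozen stage-one detector's "no background" features.
    # The abstract does not specify the distance; MSE is assumed here.
    return F.mse_loss(feat_orig, feat_no_bg)

Under these assumptions, the stage-two objective would be the standard YOLO detection loss plus a weighted region_attention_loss term, with the stage-one YOLOv8 (trained on "no background" images) kept frozen as the feature target; the weighting coefficient is not given in the abstract.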
