RAVL: A Region Attention Yolo with Two-Stage Training for Enhanced Object Detection
Abstract
Improving the accuracy of object detection has been a key focus of recent research. However, many existing approaches fail to fully exploit location labels to suppress irrelevant background features, which limits detection performance, particularly for small objects. In this paper, we propose a novel region attention mechanism to address this limitation, which combines a region attention module (RAM) and a two-stage training strategy (TSTS). The RAM comprises a Squeeze-and-Excitation (SE) block, which dynamically assigns weights to multi-channel feature maps to generate a saliency map, and a fusion block that integrates the features with the saliency map to enhance object features while suppressing background features. We embed the RAM into the shallow layers of any version of YOLO, creating an object detector named Region Attention YOLO (RAVL). RAVL is trained with the TSTS. In the first stage, "no background" images are generated from the location labels, and a vanilla YOLOv8 detector is trained on them to produce ground-truth "no background" features. In the second stage, RAVL is trained from scratch on the original infrared images by minimizing a detection loss and a region attention loss. The region attention loss encourages the low-level features extracted from "no background" and original images to be similar, thereby improving overall detection accuracy. Extensive experiments with YOLOv5, YOLOv8, YOLOv9, and YOLOv10 on the FLIR infrared dataset and the VisDrone2019 visible-light dataset demonstrate that our method significantly improves detection performance. YOLOv8 achieves an mAP0.5 of 81.7% on the FLIR dataset and 42.1% on the VisDrone2019 dataset, 3.1% and 5.0% higher, respectively, than the same model trained without our method. For small objects in particular, mAP0.5 improves by 5.7% for bicycles in FLIR and by 7.9% for pedestrians in VisDrone2019.
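The abstract describes the RAM (an SE block that produces a saliency map, followed by a fusion block) and a region attention loss that pulls shallow features toward "no background" reference features. The sketch below is a minimal, framework-free illustration of these ideas in NumPy; the exact layer shapes, the way channel weights are collapsed into a single saliency map, the residual fusion, and the MSE form of the loss are all assumptions for illustration, not the paper's verified implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_saliency(feat, w1, w2):
    """Sketch of the RAM's SE block (assumed shapes).

    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r)
    are the two FC layers of a standard SE block with reduction r.
    Returns an (H, W) saliency map in [0, 1].
    """
    squeezed = feat.mean(axis=(1, 2))                      # global average pool -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))  # channel weights -> (C,)
    weighted = feat * excite[:, None, None]                # re-weight channels
    return sigmoid(weighted.sum(axis=0))                   # collapse to spatial saliency

def fuse(feat, saliency):
    """Sketch of the fusion block: amplify salient (object) regions while a
    residual path preserves information where saliency is low (assumed design)."""
    return feat * saliency[None, :, :] + feat

def region_attention_loss(feat_orig, feat_nobg):
    """Stage-two auxiliary loss: mean-squared distance between shallow features
    of the original image and the 'no background' reference features."""
    return float(np.mean((feat_orig - feat_nobg) ** 2))
```

In the second training stage, the total objective would then be the detection loss plus a weighted `region_attention_loss` term; the weighting coefficient is not specified in the abstract.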