Hybrid CNN-Transformer Ensemble for Enhanced Tank Detection in Aerial Imagery

Abstract

Object detection in unmanned aerial vehicle (UAV) imagery poses significant challenges due to motion blur, occlusion, and unstable viewpoints. This study introduces a hybrid ensemble approach combining transformers' global context modeling with CNNs' local feature extraction capabilities. Validated on the DroneVision benchmark dataset, our method employs Weighted Boxes Fusion (WBF) to integrate predictions from four advanced YOLO variants and a transformer-based detector (RF-DETR). The ensemble achieves superior localization accuracy, outperforming all single-model baselines. Here, we demonstrate that the calibrated fusion of diverse architectural models significantly reduces detection errors in real-world scenarios. All code and trained models are openly available (GitHub: https://github.com/yunusserhat/drone) to facilitate reproducibility, and the UAV tank dataset is accessible through the DroneVision challenge on Kaggle.
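To make the fusion step concrete: Weighted Boxes Fusion clusters overlapping boxes from several detectors and replaces each cluster with a single box whose coordinates are the confidence-weighted average of its members. The sketch below is a simplified, illustrative implementation in plain Python, not the authors' code (which is in their repository and likely builds on an existing WBF library such as `ensemble-boxes`); the function names are our own, and the full algorithm's rescaling of scores by the number of contributing models is omitted for brevity.

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    # (normalized coordinates, as WBF conventionally expects).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def _fuse(cluster):
    # Confidence-weighted average of the boxes in one cluster;
    # the fused score is the mean confidence of the cluster.
    total = sum(s for _, s in cluster)
    box = [sum(b[i] * s for b, s in cluster) / total for i in range(4)]
    return box, total / len(cluster)

def weighted_boxes_fusion(predictions, iou_thr=0.55):
    """Simplified WBF over predictions pooled from all models.

    predictions: list of (box, score) pairs, box = [x1, y1, x2, y2].
    Boxes whose IoU with a cluster's current fused box exceeds
    iou_thr join that cluster; each cluster yields one fused box.
    """
    clusters = []
    for box, score in sorted(predictions, key=lambda p: -p[1]):
        for cluster in clusters:
            if iou(_fuse(cluster)[0], box) > iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    return [_fuse(c) for c in clusters]
```

In the ensemble described above, the pooled `predictions` would come from the four YOLO variants and RF-DETR; two near-identical boxes from different detectors are merged into one tighter, higher-confidence box, which is the mechanism behind the improved localization accuracy claimed in the abstract.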
