Hybrid CNN-Transformer Ensemble for Enhanced Tank Detection in Aerial Imagery
Abstract
Object detection in unmanned aerial vehicle (UAV) imagery poses significant challenges due to motion blur, occlusion, and unstable viewpoints. This study introduces a hybrid ensemble approach that combines the global context modeling of transformers with the local feature extraction of CNNs. Validated on the DroneVision benchmark dataset, the method uses Weighted Boxes Fusion (WBF) to integrate predictions from four advanced YOLO variants and a transformer-based detector (RF-DETR). The ensemble achieves higher localization accuracy than every single-model baseline. These results demonstrate that calibrated fusion of architecturally diverse models significantly reduces detection errors in real-world scenarios. All code and trained models are openly available on GitHub (https://github.com/yunusserhat/drone) to facilitate reproducibility, and the UAV tank dataset is accessible through the DroneVision challenge on Kaggle.
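To make the fusion step concrete, the following is a minimal, single-class sketch of the idea behind Weighted Boxes Fusion, written in plain Python. It is not the authors' implementation: the function names (`iou`, `fuse_cluster`, `simple_wbf`), the IoU threshold of 0.55, the equal model weights, and the toy detections are all assumptions chosen for illustration. Overlapping boxes from different detectors are grouped, their coordinates are averaged with confidence weights, and the fused score is scaled down when fewer models agree.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Detection]) -> Box:
    """Confidence-weighted average of the box coordinates in one cluster."""
    total = sum(score for _, score in cluster)
    return tuple(sum(box[i] * score for box, score in cluster) / total
                 for i in range(4))


def simple_wbf(per_model: List[List[Detection]],
               model_weights: List[float],
               iou_thr: float = 0.55) -> List[Detection]:
    """Simplified single-class WBF over the detections of several models."""
    # Flatten all detections, scaling each score by its model's weight.
    dets = [(box, score * w)
            for model_dets, w in zip(per_model, model_weights)
            for box, score in model_dets
            if score * w > 0]
    dets.sort(key=lambda d: d[1], reverse=True)

    clusters: List[List[Detection]] = []
    fused: List[Box] = []
    for box, score in dets:
        # Find the existing fused box that overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for k, fused_box in enumerate(fused):
            overlap = iou(box, fused_box)
            if overlap > best_iou:
                best, best_iou = k, overlap
        if best < 0:
            clusters.append([(box, score)])
            fused.append(box)
        else:
            clusters[best].append((box, score))
            fused[best] = fuse_cluster(clusters[best])

    # Average the scores, then penalize clusters that few models support.
    n_models, weight_sum = len(per_model), sum(model_weights)
    results = []
    for fused_box, cluster in zip(fused, clusters):
        mean_score = sum(s for _, s in cluster) / len(cluster)
        scale = min(len(cluster), n_models) / weight_sum
        results.append((fused_box, mean_score * scale))
    return results


# Toy example: two hypothetical detectors agree on one tank;
# the first also reports an isolated low-confidence box.
detections = [
    [((10.0, 10.0, 50.0, 50.0), 0.9), ((200.0, 200.0, 240.0, 240.0), 0.4)],
    [((12.0, 12.0, 52.0, 52.0), 0.7)],
]
fused_detections = simple_wbf(detections, model_weights=[1.0, 1.0])
for fused_box, fused_score in fused_detections:
    print([round(c, 3) for c in fused_box], round(fused_score, 3))
```

In this sketch the box confirmed by both detectors keeps a high score, while the box seen by only one detector is down-weighted rather than discarded, which is the property that distinguishes this style of fusion from non-maximum suppression. A real pipeline would also handle multiple classes and normalized coordinates.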