Hybrid CNN-Transformer Ensemble for Enhanced Tank Detection in Aerial Imagery

Yunus Serhat Bıçakçı

Abstract

Object detection in unmanned aerial vehicle (UAV) imagery is challenging because of motion blur, occlusion, and unstable viewpoints. This study introduces a hybrid ensemble approach that combines the global context modeling of transformers with the local feature extraction of CNNs. Validated on the DroneVision benchmark dataset, our method employs Weighted Boxes Fusion (WBF) to integrate predictions from four advanced YOLO variants and a transformer-based detector (RF-DETR). The ensemble achieves higher localization accuracy than every single-model baseline. We show that calibrated fusion of architecturally diverse models significantly reduces detection errors in real-world scenarios. All code and trained models are openly available on GitHub (https://github.com/yunusserhat/drone) to facilitate reproducibility, and the UAV tank dataset is accessible through the DroneVision challenge on Kaggle.
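
To make the fusion step concrete, the following is a minimal, single-class sketch of the general Weighted Boxes Fusion idea in plain Python. It is not taken from the authors' repository, and their implementation may differ. The function names (`iou`, `fuse_boxes`), the `[x1, y1, x2, y2]` box layout, the default IoU threshold of 0.55, and the way the fused confidence is scaled are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(per_model_boxes, per_model_scores, weights=None, iou_thr=0.55):
    """Simplified single-class WBF (illustrative, hypothetical helper).

    per_model_boxes:  one list of [x1, y1, x2, y2] boxes per model
    per_model_scores: one list of confidences per model
    weights:          optional per-model weights (assumed 1.0 each if omitted)
    Returns a list of (fused_box, fused_confidence), best first.
    """
    n_models = len(per_model_boxes)
    weights = weights or [1.0] * n_models

    # Pool every prediction, scaling its confidence by the model weight.
    candidates = []
    for boxes, scores, w in zip(per_model_boxes, per_model_scores, weights):
        for box, score in zip(boxes, scores):
            candidates.append((score * w, list(box)))
    candidates.sort(key=lambda c: c[0], reverse=True)

    clusters = []  # each cluster keeps its members and the current fused box
    for conf, box in candidates:
        best, best_iou = None, iou_thr
        for cluster in clusters:
            overlap = iou(cluster["box"], box)
            if overlap > best_iou:
                best, best_iou = cluster, overlap
        if best is None:
            clusters.append({"members": [(conf, box)], "box": box})
        else:
            best["members"].append((conf, box))
            total = sum(c for c, _ in best["members"])
            # Fused coordinates: confidence-weighted mean over cluster members.
            best["box"] = [
                sum(c * b[i] for c, b in best["members"]) / total
                for i in range(4)
            ]

    weight_sum = sum(weights)
    fused = []
    for cluster in clusters:
        members = cluster["members"]
        mean_conf = sum(c for c, _ in members) / len(members)
        # Down-weight boxes that only a few of the models agreed on.
        support = min(len(members), weight_sum) / weight_sum
        fused.append((cluster["box"], mean_conf * support))
    fused.sort(key=lambda f: f[1], reverse=True)
    return fused


# Hypothetical predictions from two detectors on one image.
model_a_boxes, model_a_scores = [[0.10, 0.10, 0.50, 0.50]], [0.9]
model_b_boxes = [[0.12, 0.12, 0.52, 0.52], [0.70, 0.70, 0.90, 0.90]]
model_b_scores = [0.6, 0.4]
detections = fuse_boxes([model_a_boxes, model_b_boxes],
                        [model_a_scores, model_b_scores])
```

Unlike non-maximum suppression, which discards all but the highest-scoring box in an overlapping group, this scheme averages the overlapping boxes, so a box that several detectors agree on keeps a high confidence while a box proposed by only one detector is penalized.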

Version published to 10.21203/rs.3.rs-8771811/v1 on Research Square, Feb 5, 2026

Related articles

A Deep Hybrid CNN–ViT Architecture Incorporating Advanced 3D Features for the Estimation of Visibility and Runway Visual Range

This article has 2 authors:
1. Anand Shankar
2. Bikash Chandra Sahana
This article has no evaluations. Latest version Feb 5, 2026

A Deep Learning Based Aggregative Framework for Object Detection in Road Environments

This article has 1 author:
1. Thayyaba Khatoon Mohammed
This article has no evaluations. Latest version Mar 6, 2026

A Hybrid YOLOv5s-Faster R-CNN Architecture for Object Detection in Complex Road Scenes

This article has 3 authors:
1. Lenard Nkalubo Byenkya
2. Rose Nakibuule
3. Danison Taremwa
This article has no evaluations. Latest version Jan 21, 2026
