Agriculture surrounding monitoring and object identification based on optimized You Only Look Once and Single Shot Multibox Detector setups using combined vision and thermal images
Abstract
This paper presents a method for monitoring and object identification in agricultural environments using both vision and thermal images. We evaluate two distinct approaches: a dual-network architecture, in which a separate model is trained for each image type, and a unified network that integrates both data types into a single processing stream. Multiple prototypes based on the You Only Look Once version 8 (YOLOv8) and Single Shot Multibox Detector (SSD) architectures were developed. YOLOv8 abandons the use of Cross Stage Partial (CSP) layers in favor of a simplified architecture based on C2f modules. In this work, we show that this modification reduces architectural complexity and improves both computational efficiency and inference speed during object class identification. The SSD design removes the conv5_x, avgpool, fc, and softmax layers from the original model and sets all strides in conv4_x to 1×1. The backbone is followed by five additional convolutional layers, to which five detection heads are attached; a sixth head is attached to the conv4_x layer. Experimental results show a difference between the dual and unified networks, with the mean Average Precision (mAP@0.5) increasing from 0.88 for the dual network to 0.90 for the unified network. The unified model improves overall performance because information from the vision and thermal data streams is fused during object identification. The largest variation was observed between the YOLOv8 and SSD architectures: YOLOv8 achieved mAP@0.5 scores of 0.98 for the Harvester class and 0.94 for the Tractor class, compared with 0.91 and 0.88, respectively, for SSD.
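Because the comparison above rests on mAP@0.5, a minimal sketch of how such a score can be computed may help the reader interpret the numbers. The abstract does not state the exact evaluation protocol, so everything below is an assumption: the function names (`iou`, `average_precision`, `mean_average_precision`), the greedy matching rule, the all-point interpolation of the precision-recall curve, and the toy boxes are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[float, Box]            # (confidence score, predicted box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: List[Detection],
                      ground_truth: List[Box],
                      iou_threshold: float = 0.5) -> float:
    """AP for one class (assumed protocol: greedy matching, all-point interpolation)."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits: List[int] = []  # 1 = true positive, 0 = false positive
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth):
            overlap = iou(box, gt_box)
            if not matched[idx] and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched[best_idx] = True
            hits.append(1)
        else:
            hits.append(0)

    # Cumulative precision and recall after each detection.
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))

    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], List[Box]]],
                           iou_threshold: float = 0.5) -> float:
    """Mean of the per-class AP values."""
    scores = [average_precision(d, g, iou_threshold) for d, g in per_class.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented for illustration only (not results from the paper).
toy = {
    "Harvester": ([(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))],
                  [(0, 0, 10, 10), (20, 20, 30, 30)]),
    "Tractor": ([(0.95, (5, 5, 15, 15))], [(5, 5, 15, 15)]),
}
print(f"toy mAP@0.5 = {mean_average_precision(toy):.3f}")
```

In this sketch a detection counts as correct only if it overlaps a not-yet-matched ground-truth box with IoU of at least 0.5, which is what the "@0.5" suffix denotes; the class names are borrowed from the abstract, while the boxes and scores are made up.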