VCAFPN: Feature fusion for object detection in any direction based on biological visual cross attention
Abstract
Visual Transformers have long-range modeling capabilities and can predict every instance in arbitrary-oriented object detection from a global perspective. However, using a Transformer in the feature fusion stage brings high computational cost. In this paper, we propose a plug-and-play lightweight feature fusion model (VCAFPN) inspired by biological vision. First, the image is fed into the backbone network to extract features at each layer. Second, in the visual cross-attention mechanism, the core fusion device (CFD) extracts sliding-window features and pooling features from the features of the other layers, and the dual-path information aggregation device (DIAD) fuses the current layer's features with the sliding-window and pooling features. Third, the detection heads perform localization and classification to obtain the final result. VCAFPN reduces computation by 45.9 GFLOPs and parameters by 4.6 MB. Our models R3Det-kld-VCAFPN and RoI-Transformer-VCAFPN achieve 78.9% and 76.8% recognition accuracy on the DOTA-v1.0 and DOTA-v1.5 datasets, respectively.
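The two-stage fusion described above (CFD extracting sliding-window and pooling features, DIAD aggregating them with the current layer) can be sketched as follows. This is a minimal illustrative sketch on 1-D feature vectors, not the paper's implementation: the function names `core_fusion_device` and `dual_path_aggregate`, the window/pool sizes, and the averaging-based aggregation are all assumptions for clarity.

```python
import numpy as np

def core_fusion_device(other_layer, window=3, pool=2):
    """Hypothetical CFD: extract sliding-window and pooled features
    from another layer's (1-D) feature vector."""
    # Sliding-window (local) features: mean over each window of size `window`.
    sw = np.array([other_layer[i:i + window].mean()
                   for i in range(len(other_layer) - window + 1)])
    # Pooling (coarse) features: non-overlapping average pooling of stride `pool`.
    pl = other_layer[:len(other_layer) // pool * pool].reshape(-1, pool).mean(axis=1)
    return sw, pl

def dual_path_aggregate(this_layer, sw, pl):
    """Hypothetical DIAD: fuse this layer's features with the
    sliding-window and pooled features from the CFD."""
    # Resample the auxiliary features to this layer's length by
    # nearest-neighbour indexing, then average the three paths.
    idx_sw = np.linspace(0, len(sw) - 1, len(this_layer)).round().astype(int)
    idx_pl = np.linspace(0, len(pl) - 1, len(this_layer)).round().astype(int)
    return (this_layer + sw[idx_sw] + pl[idx_pl]) / 3.0

# Example: fuse a 4-element layer with features derived from an 8-element layer.
other = np.arange(8.0)
this_layer = np.ones(4)
sw, pl = core_fusion_device(other)
fused = dual_path_aggregate(this_layer, sw, pl)
```

In a real FPN-style network these operations would act on 2-D feature maps with learned weights; the sketch only mirrors the flow of information between layers.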