DCFNet: Dual-Branch Collaborative Fusion Network for Fine-Grained Visual Classification
Abstract
Fine-grained visual classification aims to distinguish subcategories with subtle visual differences under high inter-class similarity. While auxiliary textual semantics provide supplementary information, existing multimodal methods still struggle to balance global semantic consistency against local discriminative details. To address this limitation, we propose a Dual-Branch Collaborative Fusion Network (DCFNet), comprising two synergistic branches that decouple feature learning across granularities. Specifically, we design a cross-modal consistency alignment branch to calibrate the global semantic space. Complementarily, we construct a cross-modal transformer fusion branch to achieve fine-grained local feature interaction. This dual-branch collaboration maintains high-level semantic consistency while accurately capturing fine-grained discriminative cues. Extensive experiments and ablation studies on the CUB-200-2011, Con-Text, and Drink Bottle datasets demonstrate that DCFNet achieves competitive performance, providing an innovative solution for fine-grained visual classification tasks.
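To make the dual-branch idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation; all dimensions, weight matrices, and the single-head attention form are illustrative assumptions). The first branch scores global image-text agreement via cosine similarity of pooled embeddings, as a contrastive alignment objective typically would; the second branch lets image patch tokens attend to text tokens through cross-attention for local interaction.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(v):
    return v / np.linalg.norm(v)

# Hypothetical sizes: shared embedding dim, image patch tokens, text tokens
d, n_patches, n_words = 16, 8, 5
img_tokens = rng.standard_normal((n_patches, d))  # stand-in visual features
txt_tokens = rng.standard_normal((n_words, d))    # stand-in textual features

# Branch 1 (global consistency alignment): cosine similarity of pooled
# embeddings; in training this would be driven up by a contrastive loss.
alignment = float(l2norm(img_tokens.mean(axis=0)) @ l2norm(txt_tokens.mean(axis=0)))

# Branch 2 (local transformer fusion): single-head cross-attention where
# image queries attend over text keys/values.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = img_tokens @ Wq, txt_tokens @ Wk, txt_tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n_patches, n_words)
fused = attn @ V                               # text-enriched patch features

print(fused.shape)  # (8, 16)
```

In a full model, the alignment score would feed a contrastive objective while the fused patch features would feed the classifier head, so the two branches supervise global and local granularities separately.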