DCFNet: Dual-Branch Collaborative Fusion Network for Fine-Grained Visual Classification


Abstract

Fine-grained visual classification aims to distinguish subcategories with subtle visual differences under high inter-class similarity. While auxiliary textual semantics provide supplementary information, existing multimodal methods still struggle to balance global semantic consistency with local discriminative detail. To address this limitation, we propose a Dual-Branch Collaborative Fusion Network (DCFNet), comprising two synergistic branches that decouple feature learning across granularities. Specifically, we design a cross-modal consistency alignment branch to calibrate the global semantic space. Complementarily, we construct a cross-modal transformer fusion branch to achieve fine-grained local feature interaction. This dual-branch collaboration maintains high-level semantic consistency while accurately capturing fine-grained discriminative cues. Extensive experiments and ablation studies on the CUB-200-2011, Con-Text, and Drink Bottle datasets demonstrate that DCFNet achieves competitive performance, providing an innovative solution for fine-grained visual classification tasks.
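The two-branch design described above can be sketched in miniature. This is a minimal illustration, not the authors' implementation: the shapes, token counts, and the specific operators (cosine similarity for the global alignment branch, single-head scaled dot-product cross-attention for the local fusion branch) are assumptions chosen to show the general pattern of pairing a global consistency objective with local cross-modal interaction.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_alignment(img_global, txt_global):
    """Branch 1 (sketch): cosine similarity between pooled
    image and text embeddings, calibrating the global semantic space."""
    a = img_global / np.linalg.norm(img_global)
    b = txt_global / np.linalg.norm(txt_global)
    return float(a @ b)

def local_cross_fusion(img_tokens, txt_tokens):
    """Branch 2 (sketch): scaled dot-product cross-attention from
    image patch tokens (queries) to text tokens (keys/values),
    yielding text-conditioned local image features."""
    d = img_tokens.shape[-1]
    attn = softmax(img_tokens @ txt_tokens.T / np.sqrt(d), axis=-1)
    return attn @ txt_tokens

# Hypothetical token layouts: a 7x7 patch grid and 12 word tokens, dim 64.
rng = np.random.default_rng(0)
img_tokens = rng.standard_normal((49, 64))
txt_tokens = rng.standard_normal((12, 64))

score = global_alignment(img_tokens.mean(axis=0), txt_tokens.mean(axis=0))
fused = local_cross_fusion(img_tokens, txt_tokens)
```

In a full model, the alignment score would feed a contrastive-style consistency loss while the fused local features would drive classification, so the two branches train the shared encoders jointly rather than in isolation.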
