Interactive Text-Guided Image Segmentation via Vision Mamba and Large Language Models
Abstract
We propose an H-type Bidirectional Alignment Network for text-guided image segmentation that achieves efficient and accurate cross-modal feature fusion. The visual branch employs a 12-layer, four-stage Vision Mamba that combines a Selective State Space Model (SSM) with local convolutional residual structures, capturing long-range dependencies while preserving fine boundary details at reduced computational cost. The text branch, adapted from the Qwen model, freezes its lower layers and fine-tunes the upper ones to extract robust referential semantics. A Q-Former-based alignment module introduces learnable queries to enforce bidirectional supervision: the forward path performs text-to-image segmentation, while the backward path reconstructs image-to-text attention to ensure cross-modal consistency. A multi-scale decoder then integrates the aligned features through an iterative interaction mechanism. Experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our approach outperforms existing methods in both segmentation accuracy and efficiency.
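To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the H-type flow: a visual branch, a text branch, a Q-Former-style alignment module with learnable queries and a forward (text-to-image) plus backward (image-to-text) path, and a simple decoder. This is an illustration under stated assumptions, not the authors' implementation: the Vision Mamba and Qwen branches are replaced by stand-in encoders, and all module names, dimensions, and the query count are hypothetical.

```python
# Minimal sketch of the H-type bidirectional alignment flow (illustrative only).
# The Vision Mamba and Qwen branches are replaced by stand-in encoders;
# dimensions, query count, and module names are assumptions, not the paper's.
import torch
import torch.nn as nn


class QFormerAlign(nn.Module):
    """Learnable queries attend to both modalities for bidirectional alignment."""

    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Forward path: queries absorb the text, then ground themselves in the image.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Backward path: image features attend back to the text as a consistency signal.
        self.back_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_img, dim), text_feats: (B, N_txt, dim)
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)                  # (B, Q, dim)
        q, _ = self.text_attn(q, text_feats, text_feats)                 # inject referential semantics
        aligned, _ = self.image_attn(q, image_feats, image_feats)        # text-to-image grounding
        recon, _ = self.back_attn(image_feats, text_feats, text_feats)   # image-to-text reconstruction
        return aligned, recon


class HTypeSegmenter(nn.Module):
    """Two encoder branches joined by the alignment module, then a simple decoder."""

    def __init__(self, dim=256, num_classes=1, vocab_size=30000):
        super().__init__()
        # Stand-in for the 12-layer Vision Mamba branch: a patch-embedding convolution.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU()
        )
        # Stand-in for the partially frozen Qwen text branch.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.align = QFormerAlign(dim=dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(), nn.Conv2d(dim, num_classes, 1)
        )

    def forward(self, image, token_ids):
        v = self.visual_encoder(image)                 # (B, dim, H/16, W/16)
        b, c, h, w = v.shape
        v_seq = v.flatten(2).transpose(1, 2)           # (B, N_img, dim)
        t_seq = self.text_encoder(token_ids)           # (B, N_txt, dim)
        aligned, recon = self.align(v_seq, t_seq)
        # Fuse the aligned queries back into the visual grid via a mean-pooled bias.
        fused = v_seq + aligned.mean(dim=1, keepdim=True)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        mask_logits = self.decoder(fused)              # coarse segmentation logits
        return mask_logits, recon


if __name__ == "__main__":
    model = HTypeSegmenter()
    img = torch.randn(2, 3, 224, 224)
    txt = torch.randint(0, 30000, (2, 12))
    logits, recon = model(img, txt)
    print(logits.shape)  # torch.Size([2, 1, 14, 14])
```

The sketch's point is the shared query set: the same learnable queries are first conditioned on the text and then grounded in the image (forward path), while the backward attention from image to text yields a reconstruction term that can supervise cross-modal consistency.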