Interactive Text-Guided Image Segmentation via Vision Mamba and Large Language Models

Abstract

We propose an H-type Bidirectional Alignment Network for text-guided image segmentation that achieves efficient and accurate cross-modal feature fusion. The visual branch employs a 12-layer, four-stage Vision Mamba that combines a selective state space model (SSM) with local convolutional residual structures, capturing long-range dependencies while preserving fine boundaries at reduced computational cost. The text branch, adapted from the Qwen model, freezes the lower layers and fine-tunes the upper ones to extract robust referential semantics. A Q-Former-based alignment module introduces learnable queries to enforce bidirectional supervision: the forward path performs text-to-image segmentation, while the backward path reconstructs attention from image to text, ensuring cross-modal consistency. A multi-scale decoder integrates the aligned features through an iterative interaction mechanism. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that our approach outperforms existing methods in both segmentation accuracy and efficiency.
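To make the Q-Former-based bidirectional alignment described above more concrete, the following is a minimal PyTorch-style sketch of an alignment block with learnable queries: a forward path in which the queries attend first to text tokens and then to visual tokens, and a backward path that recovers attention from the aligned queries back onto the text for consistency supervision. The class name, dimensions, and module choices here are illustrative assumptions and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn

class BidirectionalQFormerAlign(nn.Module):
    """Sketch of a Q-Former-style alignment block with learnable queries.

    Forward path: queries absorb referential semantics from text, then attend
    to visual tokens. Backward path: attention from the aligned queries back
    onto text tokens, whose weights can be supervised for cross-modal
    consistency. All names/dimensions are assumptions, not the paper's code.
    """

    def __init__(self, dim: int = 256, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.back_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # image_tokens: (B, N_img, dim), e.g. flattened Vision Mamba features
        # text_tokens:  (B, N_txt, dim), e.g. projected Qwen token features
        b = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, Q, dim)

        # Forward (text-to-image) path: condition queries on the text,
        # then gather the referred visual evidence.
        q, _ = self.text_attn(self.norm_q(q), text_tokens, text_tokens)
        v = self.norm_v(image_tokens)
        aligned, _ = self.image_attn(q, v, v)  # (B, Q, dim)

        # Backward (image-to-text) path: reconstruct attention over the text
        # from the aligned queries; the weights can be supervised for
        # bidirectional consistency.
        _, back_weights = self.back_attn(aligned, text_tokens, text_tokens,
                                         need_weights=True)  # (B, Q, N_txt)
        return aligned, back_weights

# Illustrative usage with random tensors (shapes are assumptions):
align = BidirectionalQFormerAlign(dim=256)
img = torch.randn(2, 196, 256)
txt = torch.randn(2, 20, 256)
aligned, back_w = align(img, txt)  # aligned: (2, 32, 256), back_w: (2, 32, 20)
```

In this sketch the aligned queries would feed the multi-scale decoder, while the backward attention weights provide the signal for the consistency supervision mentioned in the abstract.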
