Interactive Text-Guided Image Segmentation via Vision Mamba and Large Language Models
Abstract
We propose an H-type Bidirectional Alignment Network for text-guided image segmentation that achieves efficient and accurate cross-modal feature fusion. The visual branch employs a 12-layer, four-stage Vision Mamba that combines a Selective State Space Model (SSM) with local convolutional residual structures, capturing long-range dependencies while preserving fine boundary details at reduced computational cost. The text branch, adapted from the Qwen model, freezes its lower layers and fine-tunes the upper ones to extract robust referential semantics. A Q-Former-based alignment module introduces learnable queries to enforce bidirectional supervision: the forward path performs text-to-image segmentation, while the backward path reconstructs image-to-text attention to ensure cross-modal consistency. A multi-scale decoder then integrates the aligned features through an iterative interaction mechanism. Experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our approach outperforms existing methods in both segmentation accuracy and efficiency.
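To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the H-type flow: a visual branch, a text branch, a Q-Former-style alignment module with learnable queries and a forward (text-to-image) plus backward (image-to-text) path, and a simple decoder. This is an illustration under stated assumptions, not the authors' implementation: the Vision Mamba and Qwen branches are replaced by stand-in encoders, and all module names, dimensions, and the query count are hypothetical.

```python
# Minimal sketch of the H-type bidirectional alignment flow (illustrative only).
# The Vision Mamba and Qwen branches are replaced by stand-in encoders;
# dimensions, query count, and module names are assumptions, not the paper's.
import torch
import torch.nn as nn


class QFormerAlign(nn.Module):
    """Learnable queries attend to both modalities for bidirectional alignment."""

    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Forward path: queries absorb the text, then ground themselves in the image.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Backward path: image features attend back to the text as a consistency signal.
        self.back_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_img, dim), text_feats: (B, N_txt, dim)
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)                  # (B, Q, dim)
        q, _ = self.text_attn(q, text_feats, text_feats)                 # inject referential semantics
        aligned, _ = self.image_attn(q, image_feats, image_feats)        # text-to-image grounding
        recon, _ = self.back_attn(image_feats, text_feats, text_feats)   # image-to-text reconstruction
        return aligned, recon


class HTypeSegmenter(nn.Module):
    """Two encoder branches joined by the alignment module, then a simple decoder."""

    def __init__(self, dim=256, num_classes=1, vocab_size=30000):
        super().__init__()
        # Stand-in for the 12-layer Vision Mamba branch: a patch-embedding convolution.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU()
        )
        # Stand-in for the partially frozen Qwen text branch.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.align = QFormerAlign(dim=dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(), nn.Conv2d(dim, num_classes, 1)
        )

    def forward(self, image, token_ids):
        v = self.visual_encoder(image)                 # (B, dim, H/16, W/16)
        b, c, h, w = v.shape
        v_seq = v.flatten(2).transpose(1, 2)           # (B, N_img, dim)
        t_seq = self.text_encoder(token_ids)           # (B, N_txt, dim)
        aligned, recon = self.align(v_seq, t_seq)
        # Fuse the aligned queries back into the visual grid via a mean-pooled bias.
        fused = v_seq + aligned.mean(dim=1, keepdim=True)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        mask_logits = self.decoder(fused)              # coarse segmentation logits
        return mask_logits, recon


if __name__ == "__main__":
    model = HTypeSegmenter()
    img = torch.randn(2, 3, 224, 224)
    txt = torch.randint(0, 30000, (2, 12))
    logits, recon = model(img, txt)
    print(logits.shape)  # torch.Size([2, 1, 14, 14])
```

The sketch's point is the shared query set: the same learnable queries are first conditioned on the text and then grounded in the image (forward path), while the backward attention from image to text yields a reconstruction term that can supervise cross-modal consistency.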