RobustOVS: Open-Vocabulary Segmentation with Robustly Semantic-Assisted Calibration

Ruihan Wang
Guodong Wang
Mingtao Liu

Abstract

Open-vocabulary semantic segmentation has emerged as a transformative approach in the field of image segmentation. Open-vocabulary segmentation (OVS) models leverage pre-trained vision-language models, such as CLIP, to classify mask regions. However, these models face performance limitations when aligning visual content with the unbounded semantics of text. To address this challenge, we propose the Robust Open-Vocabulary Segmentation Model (RobustOVS), which preserves CLIP’s generalization capabilities while improving computational efficiency. Training such models typically demands computational resources beyond the reach of most research labs. RobustOVS tackles this limitation with a streamlined, efficient network architecture that significantly reduces training requirements. The additional parameters of RobustOVS can be trained and fine-tuned on a single GPU within 50 hours, demonstrating its practicality for standard research environments.

In RobustOVS, we introduce a high-performance multi-scale feature pyramid network that extracts semantically rich features through a combination of deformable convolutions and context-based self-modulation. This enables robust matching between masked image regions and nouns in image captions. Experiments show that mask prompt fine-tuning yields substantial improvements without modifying any weights of the CLIP model, and that it further boosts the performance of fully fine-tuned models. We benchmarked the RobustOVS architecture on several popular open-vocabulary semantic segmentation datasets. RobustOVS consistently delivered strong performance on all tasks and datasets, surpassing task-specific architectures while requiring fewer computational resources.
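To make the idea of classifying mask regions against text concrete, here is a minimal sketch of the generic approach used by CLIP-based OVS pipelines: pool per-pixel features inside a mask, then pick the label whose text embedding has the highest cosine similarity. This is not the RobustOVS implementation. The function names (`mask_pool`, `classify_region`), the toy feature map, and the two-dimensional "text embeddings" are all illustrative assumptions; in practice the features and embeddings would come from a vision-language model such as CLIP.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors at positions where mask is 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, selected in zip(feat_row, mask_row):
            if selected:
                for i, value in enumerate(vec):
                    total[i] += value
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(features, mask, text_embeddings):
    """Return the best-matching label and all similarity scores for one mask."""
    region = mask_pool(features, mask)
    scores = {label: cosine(region, emb) for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 2x2 feature map with 2-dimensional features (made-up numbers).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# Hypothetical text embeddings for two candidate class names.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

mask_top = [[1, 1], [0, 0]]
mask_bottom = [[0, 0], [1, 1]]

label_top, scores_top = classify_region(features, mask_top, text_embeddings)
label_bottom, scores_bottom = classify_region(features, mask_bottom, text_embeddings)
print(label_top, label_bottom)
```

Because the label set is just a dictionary of text embeddings, new class names can be added at inference time without retraining, which is the property that makes this family of methods open-vocabulary.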

Version published to 10.21203/rs.3.rs-6850046/v1 on Research Square
Aug 11, 2025

Related articles

SCM: Semantic Segmentation with Dual-Stream Semantic Synergy under Adverse Weather Conditions

This article has 5 authors:
1. Shuochen Tian
2. Jian Pang
3. Jin Wang
4. Bingfeng Zhang
5. Weifeng Liu
This article has no evaluations. Latest version: Sep 3, 2025

CFD-CLIP: Contrastive Feature Distillation with CLIP for Image Classification

This article has 4 authors:
1. Maohai Pang
2. Weiwei Zhang
3. Xiao bin Li
4. Jianqing Zhu
This article has no evaluations. Latest version: Sep 22, 2025

Zero-Shot Image Super-Resolution Using Prompt-Driven Vision-Language Foundation Models Without Task-Specific Fine-Tuning

This article has 1 author:
1. K. AKILA
This article has no evaluations. Latest version: Sep 1, 2025
