RobustOVS: Open-Vocabulary Segmentation with Robustly Semantic-Assisted Calibration

Abstract

Open-vocabulary semantic segmentation has emerged as a transformative approach in the field of image segmentation. Open-vocabulary segmentation (OVS) models leverage pre-trained vision-language models, such as CLIP, to classify mask regions. However, these models face performance limitations when aligning visual content with the open-ended semantics of text. To address this challenge, we propose the Robust Open-Vocabulary Segmentation Model (RobustOVS), which not only preserves CLIP’s generalization capabilities but also enhances computational efficiency. Training such models typically demands computational resources that are beyond the reach of most research labs. RobustOVS tackles this limitation by employing a streamlined and efficient network architecture, significantly reducing training requirements. The additional parameters of RobustOVS can be trained and fine-tuned on a single GPU within 50 hours, demonstrating its feasibility and practicality for standard research environments. In RobustOVS, we introduce a high-performance multi-scale feature pyramid network that extracts semantically rich features through a combination of deformable convolutions and context-based self-modulation. This enables robust matching between masked image regions and nouns in image captions. Experiments reveal that mask prompt fine-tuning yields substantial improvements without modifying any weights of the CLIP model, and that it further boosts the performance of fully fine-tuned models. We benchmarked the RobustOVS architecture across several popular open-vocabulary semantic segmentation datasets. RobustOVS consistently delivered outstanding performance on all tasks and datasets, surpassing task-specific architectures while requiring fewer computational resources.
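The mask-region classification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the RobustOVS architecture: it only shows the general idea of pooling features under a mask and scoring the pooled vector against text embeddings by cosine similarity. The function names, the toy two-dimensional vectors, and the temperature of 0.07 are assumptions made for illustration; in a real pipeline the embeddings would come from a vision-language model such as CLIP.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities to each class text.

    The temperature value is an assumed hyperparameter, not taken from the paper.
    """
    names = list(text_embeddings)
    logits = [cosine(region_embedding, text_embeddings[n]) / temperature
              for n in names]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy data (hypothetical): four "pixels" with 2-D features, one mask,
# and two class-name embeddings standing in for encoded text prompts.
pixel_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
mask = [1, 1, 0, 0]
text_embeddings = {"cat": [1.0, 0.1], "grass": [0.0, 1.0]}

region = mask_pool(pixel_features, mask)
probs = classify_region(region, text_embeddings)
best = max(probs, key=probs.get)
print(best, probs)
```

Because the class list is just a dictionary of text embeddings, new category names can be added at inference time without retraining, which is the property that makes this style of classification open-vocabulary.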
