Improving the Generalization of Segmentation Foundation Models via Weakly-Supervised and Unsupervised Adaptation


Abstract

The success of large language models has inspired the computer vision community to explore image segmentation foundation models that can zero-/few-shot generalize through prompt engineering. Among these, Segment-Anything (SAM) is the state-of-the-art image segmentation foundation model, demonstrating strong zero-/few-shot generalization. Despite this success, recent studies reveal SAM's weakness under strong distribution shift: it performs poorly on corrupted natural images, camouflaged images, medical images, etc. Motivated by these observations, we develop an adaptation strategy for SAM that supports both weakly-supervised and unsupervised settings. In the weakly-supervised setting, we leverage weak labels, e.g. point-wise or box annotations, together with an anchor model and low-rank finetuning to regularize self-training and improve generalization. In the unsupervised setting, we propose a data pipeline that automatically generates weak labels for target-domain training images, enabling adaptation without manual annotation. To further alleviate error accumulation in self-training, we introduce patch-level contrastive regularization to reduce reliance on noisy pseudo labels, and employ a novel masked image modeling approach that uses teacher-derived features and semantic alignment to improve feature consistency and robustness during adaptation. We conduct extensive validation on five segmentation tasks across diverse domains, including natural, corrupted, medical, camouflaged, and robotic images. Our task-agnostic method, compatible with both SAM and SAM2, consistently surpasses pre-trained SAM and state-of-the-art domain adaptation methods across four segmentation settings using identical prompt inputs.
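The low-rank finetuning mentioned in the abstract can be illustrated with a minimal LoRA-style adapter: the pre-trained weight stays frozen while a small, trainable low-rank update is added to it. This is a generic sketch of the technique, not the paper's implementation; all names, dimensions, and the scaling convention below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): output dim, input dim,
# adapter rank, and LoRA scaling factor.
d_out, d_in, r, alpha = 8, 8, 2, 4.0

W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # zero-initialized so training starts from W

def adapted_forward(x):
    # Effective weight is the frozen W plus the scaled low-rank update B @ A;
    # only A and B would receive gradients during adaptation.
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(adapted_forward(x), W @ x)
```

Because only the rank-r factors are updated, the adapter adds r*(d_in + d_out) trainable parameters per layer instead of d_in*d_out, which is what makes this style of finetuning a lightweight regularizer for self-training.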
