Open-Vocabulary 3D Understanding with Identity-Enhanced Segmentation
Abstract
Open-vocabulary 3D segmentation assigns semantic labels to 3D data by matching language embeddings against vision-language embeddings rendered from the 3D representation. To this end, existing methods fuse 2D semantic embeddings, extracted from multi-view images by 2D foundation models, into a unified 3D representation. However, these methods often struggle when the multi-view images exhibit significant viewpoint changes, object deformations, or occlusions, leading to inconsistent object identities and reduced segmentation accuracy. To overcome these challenges, we propose a novel framework, 3D Identity-Enhanced Segmentation (3D-IES). 3D-IES leverages multi-view geometry and a fully trained 3D Gaussian Splatting model to reproject 2D segmentation masks into 3D space, enabling consistent object identity assignment across diverse viewpoints. By anchoring segmentation masks in 3D space, our method ensures spatial consistency, robust object tracking, and accurate segmentation even under challenging conditions such as large viewpoint changes or overlapping regions. Experimental results demonstrate that 3D-IES significantly outperforms state-of-the-art methods in open-vocabulary 3D semantic segmentation, achieving superior robustness and accuracy across a variety of complex scenes.
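The core idea of anchoring 2D masks in 3D can be illustrated with a minimal sketch. The code below is not from the paper; it assumes a simplified setup in which each trained Gaussian is reduced to its center point, each view provides an intrinsic matrix `K`, a world-to-camera matrix `w2c`, and an integer instance-ID mask, and identity assignment is a majority vote over the mask IDs sampled at each projected center. The function name `assign_3d_identities` and this voting scheme are illustrative stand-ins for the method's actual reprojection machinery.

```python
import numpy as np

def assign_3d_identities(gaussian_centers, views):
    """Vote a view-consistent object identity for each 3D Gaussian.

    gaussian_centers: (N, 3) array of Gaussian center positions in world space.
    views: list of (K, w2c, mask) tuples, where K is a (3, 3) intrinsic
           matrix, w2c a (4, 4) world-to-camera matrix, and mask an
           (H, W) integer array of per-pixel instance IDs (-1 = unlabeled).
    Returns an (N,) array of instance IDs (-1 if a Gaussian got no votes).
    """
    n = gaussian_centers.shape[0]
    votes = [dict() for _ in range(n)]
    for K, w2c, mask in views:
        # Transform world points into camera coordinates (homogeneous).
        pts_h = np.hstack([gaussian_centers, np.ones((n, 1))])
        cam = (w2c @ pts_h.T).T[:, :3]
        in_front = cam[:, 2] > 1e-6  # keep only points in front of the camera
        # Pinhole projection to pixel coordinates.
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]
        h, w = mask.shape
        for i in np.where(in_front)[0]:
            u, v = int(round(uv[i, 0])), int(round(uv[i, 1]))
            if 0 <= u < w and 0 <= v < h:
                mid = int(mask[v, u])
                if mid >= 0:
                    votes[i][mid] = votes[i].get(mid, 0) + 1
    # Majority vote across all views yields a single consistent identity,
    # which is what makes the labeling robust to per-view mask disagreements.
    return np.array([max(v, key=v.get) if v else -1 for v in votes])
```

Because every Gaussian accumulates votes from all views in which it is visible, a mask that mislabels an object in one occluded or oblique view is outvoted by the remaining views, which is the intuition behind the identity consistency the abstract describes.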