Enhancing Open-Vocabulary Scene Understanding via Push-Pull Alignment in Gaussian Splatting

Abstract

Open-vocabulary scene understanding based on 3D Gaussian Splatting (3DGS) has shown promising potential for applications such as embodied agents and object localization. By integrating open-vocabulary embeddings into spatial 3D Gaussians, these models enable a more comprehensive understanding of scenes. However, existing methods often suffer from misalignment due to the gap between the RGB and language modalities, leading to incorrect interpretations of similar-looking objects. To address this issue, we propose a cross-modal integration approach that aligns multiple representations through spatial Gaussian positioning. We introduce PPGS, a novel bimodal framework that bridges the RGB and language modalities through cohesive representation fields. Leveraging the illumination-invariant properties of language embeddings, we design the Bridge module, which employs surface reconstruction to provide refined geometric positions that act as a link between modalities. This module significantly enhances cross-modal alignment, improves high-fidelity rendering, and ensures accurate language feature embeddings for better modality fusion. Furthermore, our framework dynamically adjusts gradients based on the distinct optimization requirements of the RGB and language branches during joint learning, ensuring stable and efficient convergence. Comprehensive experiments demonstrate that PPGS achieves superior language-query accuracy and visual quality compared to existing language-embedded representations, improving mean Intersection over Union (mIoU) by 6% and yielding Peak Signal-to-Noise Ratio (PSNR) gains over mainstream methods, all within only 50% of the training time. Code repository: https://github.com/flybiubiu/PPGS.
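The abstract's idea of dynamically rebalancing gradients from the RGB and language objectives during joint optimization can be illustrated with a minimal sketch. This is a hypothetical toy example, not PPGS's actual implementation: the two loss functions, the normalization scheme, and all names are illustrative assumptions standing in for the real photometric and language-alignment losses.

```python
# Toy sketch of modality-aware gradient balancing during joint learning.
# Assumption: each modality's gradient is rescaled by its own magnitude
# so that neither objective dominates a shared parameter update.

def grad_rgb(x):
    # illustrative RGB photometric loss (x - 2)^2 -> gradient 2(x - 2)
    return 2.0 * (x - 2.0)

def grad_lang(x):
    # illustrative language-alignment loss 0.5*(x - 4)^2 -> gradient (x - 4)
    return x - 4.0

def joint_step(x, lr=0.1, eps=1e-8):
    g_rgb = grad_rgb(x)
    g_lang = grad_lang(x)
    # Normalize each modality's gradient by its magnitude, then average,
    # so the shared parameter follows a balanced descent direction.
    g = 0.5 * (g_rgb / (abs(g_rgb) + eps) + g_lang / (abs(g_lang) + eps))
    return x - lr * g

x = 0.0
for _ in range(200):
    x = joint_step(x)
# x settles between the two per-modality optima (2.0 and 4.0)
```

With raw (unnormalized) gradients, the steeper RGB loss would pull the shared parameter toward its own optimum; the per-modality rescaling keeps the update balanced, which is the intuition behind the abstract's claim of stable joint convergence.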