3D Feature Distillation with Object-Centric Priors

Abstract

Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room-scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuses features at object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and we demonstrate its utility for language-guided robotic grasping in clutter. Released assets and supplementary material are made available at the website https://gtziafas.github.io/DROP_project/.
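
To make the object-centric fusion idea concrete, the following is a minimal sketch of fusing per-view 2D CLIP features into one descriptor per object instance, weighted by a per-view semantic informativeness score. All names, shapes, and the threshold (`fuse_object_features`, `feature_maps`, `view_scores`, `score_thresh`) are illustrative assumptions for this sketch, not the released implementation; it also assumes instance IDs are consistent across views.

```python
# Sketch: object-level multi-view feature fusion with view filtering.
# Assumes precomputed dense CLIP feature maps, cross-view-consistent
# instance masks, and per-view/per-object informativeness scores.
import numpy as np

def fuse_object_features(feature_maps, instance_masks, view_scores, score_thresh=0.5):
    """Fuse 2D features into one descriptor per object instance.

    feature_maps:   list of V arrays, each (H, W, C) - dense features per view
    instance_masks: list of V arrays, each (H, W) int - object IDs (0 = background)
    view_scores:    (V, K) array - informativeness of each view for each of K objects
    Returns: dict {object_id: (C,) fused feature}
    """
    num_views = len(feature_maps)
    object_ids = sorted(set(np.unique(np.stack(instance_masks)).tolist()) - {0})
    fused = {}
    for k, obj_id in enumerate(object_ids):
        feats, weights = [], []
        for v in range(num_views):
            score = view_scores[v, k]
            mask = instance_masks[v] == obj_id
            # Drop views that are uninformative for this object or do not see it.
            if score < score_thresh or not mask.any():
                continue
            # Object-level pooling: average features inside the instance mask.
            feats.append(feature_maps[v][mask].mean(axis=0))
            weights.append(score)
        if feats:
            w = np.asarray(weights) / np.sum(weights)
            fused[obj_id] = (np.stack(feats) * w[:, None]).sum(axis=0)
    return fused
```

In contrast to pixel-level fusion, pooling inside instance masks and discarding low-scoring views is what the abstract refers to as the object-centric prior; the fused per-object descriptors would then serve as distillation targets for the 3D feature predictor.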
