3D Feature Distillation with Object-Centric Priors

Abstract

Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room-scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuses features at object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and we demonstrate its utility for language-guided robotic grasping in clutter. Released assets and supplementary material are made available at the website https://gtziafas.github.io/DROP_project/.
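
To make the object-centric fusion idea concrete, the following is a minimal sketch of fusing per-view 2D CLIP features into one descriptor per object instance, weighted by a per-view semantic informativeness score. All names, shapes, and the threshold (`fuse_object_features`, `feature_maps`, `view_scores`, `score_thresh`) are illustrative assumptions for this sketch, not the released implementation; it also assumes instance IDs are consistent across views.

```python
# Sketch: object-level multi-view feature fusion with view filtering.
# Assumes precomputed dense CLIP feature maps, cross-view-consistent
# instance masks, and per-view/per-object informativeness scores.
import numpy as np

def fuse_object_features(feature_maps, instance_masks, view_scores, score_thresh=0.5):
    """Fuse 2D features into one descriptor per object instance.

    feature_maps:   list of V arrays, each (H, W, C) - dense features per view
    instance_masks: list of V arrays, each (H, W) int - object IDs (0 = background)
    view_scores:    (V, K) array - informativeness of each view for each of K objects
    Returns: dict {object_id: (C,) fused feature}
    """
    num_views = len(feature_maps)
    object_ids = sorted(set(np.unique(np.stack(instance_masks)).tolist()) - {0})
    fused = {}
    for k, obj_id in enumerate(object_ids):
        feats, weights = [], []
        for v in range(num_views):
            score = view_scores[v, k]
            mask = instance_masks[v] == obj_id
            # Drop views that are uninformative for this object or do not see it.
            if score < score_thresh or not mask.any():
                continue
            # Object-level pooling: average features inside the instance mask.
            feats.append(feature_maps[v][mask].mean(axis=0))
            weights.append(score)
        if feats:
            w = np.asarray(weights) / np.sum(weights)
            fused[obj_id] = (np.stack(feats) * w[:, None]).sum(axis=0)
    return fused
```

In contrast to pixel-level fusion, pooling inside instance masks and discarding low-scoring views is what the abstract refers to as the object-centric prior; the fused per-object descriptors would then serve as distillation targets for the 3D feature predictor.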
