CLIP-Mono3D: End-to-End Open-Vocabulary Monocular 3D Object Detection via Semantic–Geometric Similarity

Zichong Gu
Shiyi Mu
Hanqi Lyu
Shugong Xu

Abstract

Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision–language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.
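The abstract does not spell out how semantic similarity is turned into category predictions, so the following is only a minimal sketch of the general idea behind CLIP-style zero-shot labeling, not the authors' implementation. It assumes each object query has already been projected into the same embedding space as the category-name text embeddings; the function names, the toy 3-dimensional vectors, and the temperature value of 0.07 are all hypothetical and chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_query(query_embedding, text_embeddings, temperature=0.07):
    """Assign an open-vocabulary label to one object query.

    text_embeddings maps a category name to its (hypothetical) text embedding.
    New categories can be added at inference time by adding entries to this
    mapping; nothing about the query side has to change.
    """
    names = list(text_embeddings)
    sims = [cosine(query_embedding, text_embeddings[name]) for name in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))


# Toy embeddings (assumed values; real CLIP embeddings have hundreds of dimensions).
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "traffic cone": [0.0, 1.0, 0.0],
    "stroller": [0.0, 0.0, 1.0],
}
query = [0.1, 0.9, 0.2]

label, probs = classify_query(query, text_embeddings)
print(label, probs)
```

Because the category list is just a dictionary of text embeddings, a category unseen during training can be scored by adding its embedding, which is the property that makes this kind of similarity-based head suitable for open-vocabulary settings.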

Version published to 10.3390/s26082380 (Apr 13, 2026)
Version published to 10.20944/preprints202603.1117.v1 (Mar 16, 2026)

Related articles

Akhat-DETR: End-to-End Object Detection Model on Hazy Scenarios in Autonomous Driving

This article has 2 authors:
1. Zhao Liu
2. Zhiwei Liu
This article has no evaluations. Latest version: Mar 17, 2026

Viewpoint-Aware Pose Estimation Framework for Cooperative UAVs

This article has 4 authors:
1. Youngrun Kim
2. Heokjune You
3. Seunghyun Choi
4. Dongwon Jung
This article has no evaluations. Latest version: Apr 1, 2026

BPC-SLAM: Part-Level Dynamic Suppression and Structure-Constrained RGB-D SLAM for Human-Centric Dynamic Environments

This article has 5 authors:
1. Wang Yang
2. Jiupeng Chen
3. Hongjun San
4. Fan Zhang
5. Wunyu Xu
This article has no evaluations. Latest version: Apr 2, 2026
