GCL-BEV: Enhancing Pure Vision 3D Detection with Motion Priors and View-Consistency Learning

Abstract

Temporal fusion has become a de facto standard in vision-centric Bird's-Eye-View (BEV) perception, enabling velocity estimation and occlusion mitigation. However, existing paradigms typically rely on rigid geometric alignment (e.g., ego-pose warping) to aggregate historical features. We show that this assumption is fragile: under aggressive ego-motion, such as rapid turning, non-linear distortion of visual features leads to significant spatial misalignment, causing prediction jitter and feature smearing. To bridge this gap, we propose GCL-BEV, a robust detection framework that enforces geometric consistency through both architectural design and optimization constraints. First, we introduce a Geometric-Aware Feature Enhancement (GAFE) module. Unlike standard deformable convolutions that infer offsets from visual appearance, GAFE explicitly uses kinematic priors (ego-motion) to guide the dynamic deformation of the receptive field, ensuring feature alignment before temporal fusion. Second, we propose a View-Consistency Learning (VCL) objective. Formulated as a Siamese equivariance constraint, VCL compels the backbone to learn rotation-invariant representations during training, improving robustness to viewpoint perturbations with strictly zero inference overhead. Extensive experiments on the nuScenes dataset demonstrate that GCL-BEV achieves state-of-the-art performance among ResNet-101 based methods (57.8% NDS, 46.2% mAP). Crucially, our method reduces Orientation Error (mAOE) by 5.4% compared to the baseline, validating its superiority in maintaining geometric stability under complex driving maneuvers.
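To make the Siamese equivariance idea behind VCL concrete, the constraint can be read as penalizing the gap between encoding a rotated BEV map and rotating the encoded map. The sketch below is not the authors' implementation: the 90° rotation, the MSE penalty, and the placeholder `encoder` are all assumptions for illustration. The placeholder encoder is pointwise, so it commutes with rotation and the penalty is exactly zero; a real BEV backbone would not, and the VCL loss would push it toward equivariance.

```python
import numpy as np

def encoder(bev):
    # Placeholder "backbone": a pointwise nonlinearity standing in for the
    # real BEV encoder. Pointwise maps commute with spatial rotation.
    return np.tanh(bev) + 0.1 * bev

def vcl_loss(bev, encoder, k=1):
    """Siamese equivariance penalty (sketch): compare encode(rotate(x))
    against rotate(encode(x)) over the spatial axes of a (C, H, W) map."""
    f_of_rot = encoder(np.rot90(bev, k=k, axes=(-2, -1)))   # encode(rotate(x))
    rot_of_f = np.rot90(encoder(bev), k=k, axes=(-2, -1))   # rotate(encode(x))
    return float(np.mean((f_of_rot - rot_of_f) ** 2))

rng = np.random.default_rng(0)
bev = rng.standard_normal((4, 32, 32))  # (channels, H, W) BEV feature map
loss = vcl_loss(bev, encoder)           # zero here, since encoder is pointwise
```

In training, the two branches would share backbone weights (the Siamese part), and the penalty would be added to the detection loss; at inference the rotated branch is dropped, which is why the constraint carries no runtime cost.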
