GCL-BEV: Enhancing Pure Vision 3D Detection with Motion Priors and View-Consistency Learning
Abstract
Temporal fusion has become a de facto standard in vision-centric Bird’s-Eye-View (BEV) perception, enabling velocity estimation and occlusion mitigation. However, existing paradigms typically rely on rigid geometric alignment (e.g., ego-pose warping) to aggregate historical features. We show that this assumption is fragile: under aggressive ego-motion, such as rapid turning, non-linear distortion of visual features leads to significant spatial misalignment, causing prediction jitter and feature smearing. To bridge this gap, we propose GCL-BEV, a robust detection framework that enforces geometric consistency through both architectural design and optimization constraints. First, we introduce a Geometric-Aware Feature Enhancement (GAFE) module. Unlike standard deformable convolutions that infer offsets from visual appearance, GAFE explicitly utilizes kinematic priors (ego-motion) to guide the dynamic deformation of the receptive field, ensuring feature alignment before temporal fusion. Second, we propose a View-Consistency Learning (VCL) objective. Formulated as a Siamese equivariance constraint, VCL compels the backbone to learn rotation-invariant representations during training, enhancing robustness against viewpoint perturbations with strictly zero inference overhead. Extensive experiments on the nuScenes dataset demonstrate that GCL-BEV achieves state-of-the-art performance among ResNet-101 based methods (57.8% NDS, 46.2% mAP). Crucially, our method reduces Orientation Error (mAOE) by 5.4% relative to the baseline, validating its superiority in maintaining geometric stability under complex driving maneuvers.
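To make the role of kinematic priors concrete, the following is a minimal numpy sketch of how SE(2) ego-motion can be converted into per-cell BEV sampling offsets, which a module like GAFE could feed into a deformable convolution instead of predicting offsets from appearance alone. This is an illustration under our own assumptions (grid layout, cell resolution, and the function name `ego_warp_offsets` are hypothetical, not taken from the paper):

```python
import numpy as np

def ego_warp_offsets(h, w, yaw, tx, ty, cell=0.5):
    """Per-cell sampling offsets (in cell units) mapping the current BEV
    grid back into the previous frame under planar ego-motion.

    yaw: heading change [rad]; tx, ty: translation [m]; cell: metres/cell.
    Returns (dx, dy), each of shape (h, w).
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # grid-cell centres in metres, with the ego vehicle at the BEV centre
    px = (xs - w / 2 + 0.5) * cell
    py = (ys - h / 2 + 0.5) * cell
    c, s = np.cos(yaw), np.sin(yaw)
    # inverse rigid transform: where each current cell lay in the previous frame
    qx = c * (px - tx) + s * (py - ty)
    qy = -s * (px - tx) + c * (py - ty)
    return (qx - px) / cell, (qy - py) / cell
```

Under pure translation the offsets are constant across the grid, whereas under rotation they grow with distance from the ego vehicle; this distance-dependent displacement is exactly the misalignment that rigid feature warping leaves behind during rapid turning.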
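The Siamese equivariance constraint behind VCL can likewise be sketched in a few lines: one branch encodes a rotated input, the other rotates the encoded features, and the loss penalises their disagreement. The sketch below is a toy numpy illustration under our own assumptions (the 90-degree rotation group, the function name `vcl_loss`, and the toy backbones are illustrative, not the paper's implementation):

```python
import numpy as np

def vcl_loss(backbone, x, k=1):
    """Siamese view-consistency loss: mean squared gap between
    encode-after-rotate and rotate-after-encode (equivariance residual)."""
    f_of_rot = backbone(np.rot90(x, k, axes=(-2, -1)))  # branch 1
    rot_of_f = np.rot90(backbone(x), k, axes=(-2, -1))  # branch 2
    return float(np.mean((f_of_rot - rot_of_f) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))  # toy feature map: (channels, H, W)

# an elementwise backbone commutes with rotation, so the loss vanishes
relu = lambda z: np.maximum(z, 0.0)
print(vcl_loss(relu, x))  # → 0.0

# a spatially biased backbone breaks equivariance and is penalised
mask = np.arange(64, dtype=float).reshape(8, 8)
biased = lambda z: z * mask
print(vcl_loss(biased, x) > 0.0)  # → True
```

Because the constraint is applied only as a training loss on the backbone's features, it adds no parameters or computation at inference time, consistent with the zero-overhead claim in the abstract.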