Research on Cooperative Vehicle–Infrastructure Perception Integrating Enhanced Point-Cloud Features and Spatial Attention
Abstract
Vehicle–infrastructure cooperative perception (VICP) extends the sensing capability of single-vehicle systems by integrating multi-source information from onboard and roadside sensors, thereby alleviating limitations in sensing range and field-of-view coverage. However, in complex urban environments, the robustness of such systems—particularly in terms of blind-spot coverage and feature representation—is severely affected by static and dynamic occlusions, as well as by distance-induced sparsity in point-cloud data. To address these challenges, a 3D object detection framework incorporating point-cloud feature enhancement and spatially adaptive fusion is proposed. First, to mitigate feature degradation under sparse and occluded conditions, a Redefined Squeeze-and-Excitation Network (R-SENet) attention module is integrated into the feature encoding stage. This module employs a dual-dimensional squeeze-and-excitation mechanism operating across pillars and intra-pillar points, enabling adaptive recalibration of critical geometric features. In addition, a Feature Pyramid Backbone Network (FPB-Net) is designed to improve target representation across varying distances through multi-scale feature extraction and cross-layer aggregation. Second, to address feature heterogeneity and spatial misalignment between heterogeneous sensing agents, a Spatial Adaptive Feature Fusion (SAFF) module is introduced. By explicitly encoding the origin of features and leveraging spatial attention mechanisms, the SAFF module enables dynamic weighting and complementary fusion between fine-grained vehicle-side features and globally informative roadside semantics. Extensive experiments conducted on the DAIR-V2X benchmark and a custom dataset demonstrate that the proposed approach outperforms several state-of-the-art methods.
Specifically, Average Precision (AP) scores of 0.762 and 0.694 are achieved at an IoU threshold of 0.5, while AP scores of 0.617 and 0.563 are obtained at an IoU threshold of 0.7 on the two datasets, respectively. Furthermore, the proposed framework maintains real-time inference performance, highlighting its effectiveness and practical potential for real-world deployment.
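To make the dual-dimensional squeeze-and-excitation idea concrete, the sketch below applies channel-gating along the two axes the abstract names: across pillars and across intra-pillar points. It is a minimal NumPy illustration under assumed shapes (a `(P, N, C)` pillar tensor and single-layer excitation weights `w_pillar`, `w_point`), not the paper's exact R-SENet architecture, which may use learned reduction layers and other details not given in the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_se(pillar_feats, w_pillar, w_point):
    """Dual-dimensional squeeze-and-excitation over a pillar tensor.

    pillar_feats: (P, N, C) -- P pillars, N points per pillar, C channels.
    w_pillar, w_point: (C, C) excitation weights (hypothetical single-layer
    gates; the paper's exact layer sizes are not specified in the abstract).
    """
    # Pillar-axis SE: squeeze each pillar over its points, then gate
    # that pillar's channels with values in (0, 1).
    pillar_desc = pillar_feats.mean(axis=1)          # (P, C)
    pillar_gate = sigmoid(pillar_desc @ w_pillar)    # (P, C)
    out = pillar_feats * pillar_gate[:, None, :]

    # Point-axis SE: squeeze each point over its channels, then gate
    # individual points inside each pillar.
    point_desc = pillar_feats @ w_point              # (P, N, C)
    point_gate = sigmoid(point_desc.mean(axis=-1))   # (P, N)
    return out * point_gate[:, :, None]
```

Because both gates lie in (0, 1), the recalibration can only attenuate features, which is the standard squeeze-and-excitation behavior: informative pillars and points are preserved while weakly informative ones are suppressed.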
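The SAFF module's two stated ingredients, explicit origin encoding and per-location spatial attention, can likewise be sketched for two spatially aligned bird's-eye-view feature maps. The shapes, the extra origin channel (0 = vehicle, 1 = roadside), and the single linear scoring vector `w_att` are illustrative assumptions; the actual module's attention network is not described in the abstract.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saff_fuse(veh_bev, inf_bev, w_att):
    """Spatially adaptive fusion of two aligned BEV maps of shape (C, H, W).

    w_att: (C + 1,) linear scoring weights (hypothetical); the +1 accounts
    for the appended origin-encoding channel.
    """
    C, H, W = veh_bev.shape
    # Tag each map with its origin so the attention can tell sources apart.
    veh_tag = np.concatenate([veh_bev, np.zeros((1, H, W))], axis=0)
    inf_tag = np.concatenate([inf_bev, np.ones((1, H, W))], axis=0)
    # Per-cell attention score for each source (projection over channels).
    scores = np.stack([
        np.tensordot(w_att, veh_tag, axes=([0], [0])),   # (H, W)
        np.tensordot(w_att, inf_tag, axes=([0], [0])),   # (H, W)
    ])                                                    # (2, H, W)
    weights = softmax(scores, axis=0)                     # sums to 1 per cell
    # Convex per-cell combination: fine-grained vehicle features and
    # globally informative roadside features are weighted dynamically.
    return weights[0] * veh_bev + weights[1] * inf_bev    # (C, H, W)
```

Since the softmax weights sum to one at every cell, each fused value is a convex combination of the vehicle-side and roadside features at that location, so neither source can be amplified beyond its own magnitude, only reweighted.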