MSST-VO: Monocular Visual Odometry for Ground Vehicles in Urban Environments Based on Multiscale Spatial-Temporal Feature Aggregation Network
Abstract
Visual odometry (VO) plays an important role in visual simultaneous localization and mapping (v-SLAM) systems. In recent years, end-to-end deep learning methods have been applied to VO systems for their strong robustness and feature extraction capability. However, existing deep learning-based VO models are built on an architecture that connects the outputs of spatial backbones to a temporal learning model and ignore multiscale spatial-temporal correlations between images, leading to relatively high estimation errors, especially when the vehicle is turning. To address this challenge, we propose a novel VO architecture, named Multiscale Spatial-Temporal Feature Aggregation Network (MSST-VO), which enhances the spatial-temporal feature representation by introducing multiscale feature aggregation (FA) layers between the spatial backbones. In addition, three submodules, namely Spatial-temporal Deformable Feature Fusion (STDFF), Gated Attention ConvLSTM (GA-ConvLSTM), and Cross-attention-based Spatial-temporal Interactive Fusion (CASTIF), are proposed to improve the feature representation capability of the FA layers in different respects. MSST-VO was compared with traditional and state-of-the-art deep learning-based VO models on two public datasets, KITTI and VoD. The experimental results demonstrate that our model outperforms the other tested models in the majority of cases. The source code for the core parts of MSST-VO is publicly available at: https://github.com/zifengyuan1997/MSST-VO-Code.