MSST-VO: Monocular Visual Odometry for Ground Vehicles in Urban Environments Based on Multiscale Spatial-Temporal Feature Aggregation Network

Zifeng Yuan
Yuyao Shen
Jun Wang
Yongqing Wang

Abstract

Visual odometry (VO) plays an important role in visual simultaneous localization and mapping (v-SLAM) systems. In recent years, end-to-end deep learning methods have been applied to VO systems for their strong robustness and feature extraction capability. However, existing deep learning-based VO models are built on an architecture that connects the outputs of spatial backbones to a temporal learning model and ignore multiscale spatial-temporal correlations in the images, which leads to relatively high estimation errors, especially when the vehicle is turning. To address this challenge, we propose a novel VO architecture, named Multiscale Spatial-Temporal Feature Aggregation Network (MSST-VO), which enhances the spatial-temporal feature representation by introducing multiscale feature aggregation (FA) layers between the spatial backbones. In addition, three submodules, i.e., Spatial-temporal Deformable Feature Fusion (STDFF), Gated Attention ConvLSTM (GA-ConvLSTM), and Cross-attention-based Spatial-temporal Interactive Fusion (CASTIF), are proposed to improve the feature representation capability of the FA layers in different respects. MSST-VO was compared with traditional and state-of-the-art deep learning-based VO models on two public datasets, KITTI and VoD. The experimental results demonstrate that our model outperforms the other tested models in the majority of cases. The source code for the core parts of the MSST-VO model is publicly available at: https://github.com/zifengyuan1997/MSST-VO-Code.
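
To illustrate why estimation errors during turns are costly for VO, the following minimal sketch integrates frame-to-frame relative motions into a trajectory and compares an exact run with one that carries a small yaw bias while turning. It is not taken from the MSST-VO code. The planar (x, y, heading) simplification, the function names `compose_se2`, `integrate`, and `make_route`, the route shape, and the 0.5 degree bias are all assumptions chosen for illustration.

```python
import math

def compose_se2(pose, delta):
    """Apply a relative motion (dx, dy, dtheta), given in the vehicle
    frame, to a global planar pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    gx = x + math.cos(theta) * dx - math.sin(theta) * dy
    gy = y + math.sin(theta) * dx + math.cos(theta) * dy
    return (gx, gy, theta + dtheta)

def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into a list of global poses."""
    poses = [start]
    for delta in deltas:
        poses.append(compose_se2(poses[-1], delta))
    return poses

def make_route(yaw_bias=0.0):
    """Hypothetical route: 10 straight steps, a 90 degree left turn
    spread over 10 steps, then 10 more straight steps.
    yaw_bias (radians) is added to every step of the turn only."""
    straight = (1.0, 0.0, 0.0)
    turn = (1.0, 0.0, math.radians(9.0) + yaw_bias)
    return [straight] * 10 + [turn] * 10 + [straight] * 10

truth = integrate(make_route())
estimate = integrate(make_route(yaw_bias=math.radians(0.5)))

# Position drift at the end of the route caused only by the turn bias.
drift = math.hypot(estimate[-1][0] - truth[-1][0],
                   estimate[-1][1] - truth[-1][1])
print(f"final position drift: {drift:.3f} (same units as dx)")
```

Because each pose is built on the previous one, the heading error picked up in the turn is never corrected and keeps displacing every later position, including those on the final straight segment.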

Version published to 10.21203/rs.3.rs-9365602/v1 on Research Square
Apr 10, 2026
