A Multimodal Semantic Alignment Framework for Pedestrian Intention Recognition and Trajectory Prediction in Autonomous Driving
Abstract
Understanding complex pedestrian environments remains a critical challenge for autonomous driving systems. This study introduces a multimodal framework for pedestrian intention recognition that integrates visual and language features. A CLIP encoder jointly embeds in-vehicle camera images and textual labels of traffic scenes, and Bayesian uncertainty modeling is applied to assess the reliability of the recognition results. In addition, an improved Social-GRU network jointly predicts the trajectories of multiple pedestrians. Experiments on the Waymo dataset and publicly available pedestrian re-identification datasets show that the proposed framework improves intention classification accuracy by 6.8% and reduces the average displacement error (ADE) of trajectory prediction by 9.1%.
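To make the joint image-text embedding step concrete, the sketch below shows how a CLIP encoder can score a camera frame against textual intention labels in a shared embedding space. This is a minimal illustration, not the authors' code: it assumes the Hugging Face `transformers` CLIP implementation, the public `openai/clip-vit-base-patch32` checkpoint, hypothetical label texts, and a placeholder image path `frame.jpg`.

```python
# Minimal sketch: CLIP-based intention scoring via image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical textual labels describing pedestrian intentions in the scene.
intention_labels = [
    "a pedestrian about to cross the street",
    "a pedestrian waiting at the curb",
    "a pedestrian walking along the sidewalk",
]

image = Image.open("frame.jpg")  # one in-vehicle camera frame (placeholder path)
inputs = processor(text=intention_labels, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: scaled cosine similarities between the image embedding
# and each text embedding in the shared CLIP space.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(intention_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```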
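The abstract does not specify how the Bayesian uncertainty modeling is formulated. One common approximation is Monte Carlo dropout, sketched below under that assumption: dropout is kept active at inference time, and the spread of repeated predictions from a small classification head (here sized for a 512-dimensional CLIP feature and three hypothetical intention classes) serves as an uncertainty estimate.

```python
# Minimal sketch: Monte Carlo dropout as an approximate Bayesian
# uncertainty estimate (an assumption; the paper's exact method is unstated).
import torch
import torch.nn as nn

head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3),
                     nn.Linear(128, 3))   # 3 hypothetical intention classes

def mc_dropout_predict(features, n_samples=30):
    head.train()                            # keep dropout stochastic at inference
    with torch.no_grad():
        samples = torch.stack([head(features).softmax(-1)
                               for _ in range(n_samples)])
    # Predictive mean and per-class standard deviation across samples.
    return samples.mean(0), samples.std(0)

mean_probs, uncertainty = mc_dropout_predict(torch.randn(1, 512))
print(mean_probs, uncertainty)
```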
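Finally, a minimal sketch of a Social-GRU-style predictor, assuming the usual encoder-pooling-decoder layout (the paper's improved variant is not public, so the architecture details here are placeholders). Each pedestrian's observed track is encoded by a shared GRU, all pedestrians' hidden states are mean-pooled as a simple stand-in for social pooling, and a decoder GRU rolls out future positions for all pedestrians jointly.

```python
# Minimal sketch of a Social-GRU-style multi-pedestrian trajectory predictor.
import torch
import torch.nn as nn

class SocialGRU(nn.Module):
    def __init__(self, hidden=64, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.embed = nn.Linear(2, hidden)            # (x, y) coordinates -> features
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.social = nn.Linear(2 * hidden, hidden)  # fuse self + pooled states
        self.decoder = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, 2)              # predicted (dx, dy) per step

    def forward(self, tracks):
        # tracks: (num_peds, obs_len, 2) observed positions in a shared frame.
        feats = self.embed(tracks)
        _, h = self.encoder(feats)                   # h: (1, num_peds, hidden)
        h = h.squeeze(0)
        # Social interaction: mean-pool hidden states across pedestrians
        # (a simple stand-in for the usual grid-based social pooling).
        pooled = h.mean(dim=0, keepdim=True).expand_as(h)
        h = torch.tanh(self.social(torch.cat([h, pooled], dim=-1)))
        # Autoregressive decoding of future positions.
        pos = tracks[:, -1, :]
        step = self.embed(pos)
        preds = []
        for _ in range(self.horizon):
            h = self.decoder(step, h)
            pos = pos + self.out(h)                  # integrate displacement
            preds.append(pos)
            step = self.embed(pos)
        return torch.stack(preds, dim=1)             # (num_peds, horizon, 2)

model = SocialGRU()
future = model(torch.randn(5, 8, 2))                 # 5 pedestrians, 8 observed steps
print(future.shape)                                  # torch.Size([5, 12, 2])
```

Training such a model against ground-truth futures with a mean squared error on positions directly optimizes the average displacement error reported in the abstract.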