A Multimodal Semantic Alignment Framework for Pedestrian Intention Recognition and Trajectory Prediction in Autonomous Driving

Abstract

Understanding pedestrians in complex traffic environments remains a critical challenge for autonomous driving systems. This study introduces a multimodal framework for pedestrian intention recognition and trajectory prediction that integrates visual and language features. A CLIP encoder is employed to jointly embed in-vehicle camera images and textual labels of traffic scenes, and Bayesian uncertainty modeling is applied to assess the reliability of the recognition results. In addition, an improved Social-GRU network is designed to jointly predict the trajectories of multiple pedestrians. Experiments on the Waymo dataset and publicly available pedestrian re-identification datasets show that the proposed framework improves intention classification accuracy by 6.8% and reduces the average displacement error of trajectory prediction by 9.1%.
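The abstract does not give implementation details of the image-text alignment step. As a rough illustration only, the sketch below shows zero-shot intention scoring with a CLIP encoder via the Hugging Face transformers API; the checkpoint name, the frame path, and the prompt wordings are assumptions for illustration, not the paper's actual label set.

```python
# A minimal sketch of CLIP-based zero-shot intention scoring (assumed setup,
# not the paper's configuration), using the Hugging Face transformers API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical textual labels describing pedestrian intentions.
prompts = [
    "a pedestrian about to cross the road",
    "a pedestrian waiting at the curb",
    "a pedestrian walking along the sidewalk",
]

image = Image.open("frame.jpg")  # a single in-vehicle camera frame (placeholder path)
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds scaled cosine similarities between the image
    # and each text label; softmax turns them into per-label scores.
    probs = outputs.logits_per_image.softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze().tolist())))
```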
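The exact form of the Bayesian uncertainty modeling is likewise unspecified in the abstract. Monte Carlo dropout is one common approximation to Bayesian inference in neural classifiers; the sketch below applies it to a hypothetical intention-classifier head (the layer sizes and dropout rate are illustrative assumptions).

```python
# One plausible realization of uncertainty estimation via Monte Carlo dropout
# (an assumption; the paper's exact Bayesian formulation is not given here).
import torch
import torch.nn as nn

class IntentionHead(nn.Module):
    """Small classifier over fused image-text features (illustrative)."""
    def __init__(self, dim=512, num_classes=3, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(p),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(head, features, n_samples=20):
    """Keep dropout active at inference, average the sampled softmax outputs,
    and use predictive entropy as a reliability score for the prediction."""
    head.train()  # train mode keeps dropout stochastic at inference time
    with torch.no_grad():
        probs = torch.stack(
            [head(features).softmax(dim=-1) for _ in range(n_samples)]
        )
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(dim=-1)
    return mean, entropy
```

Higher predictive entropy flags frames where the recognized intention should be treated as unreliable, which matches the abstract's stated purpose for the uncertainty model.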
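Finally, the "improved Social-GRU" is described only at a high level. The sketch below shows one plausible Social-GRU-style design, with per-pedestrian GRU encoders, a simple mean-pooling social layer that mixes neighbor hidden states, and an autoregressive GRU decoder over future position offsets; the architecture and dimensions are assumptions, not the paper's actual improvement.

```python
# A minimal Social-GRU-style joint trajectory predictor (assumed design).
import torch
import torch.nn as nn

class SocialGRU(nn.Module):
    def __init__(self, hidden=64, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.embed = nn.Linear(2, hidden)            # (x, y) offsets -> features
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.social = nn.Linear(2 * hidden, hidden)  # fuse self + pooled neighbors
        self.decoder = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, 2)

    def forward(self, tracks):
        # tracks: (N, T, 2) observed position offsets for N pedestrians in a scene
        _, h = self.encoder(self.embed(tracks))
        h = h.squeeze(0)                              # (N, hidden)
        # Mean-pool hidden states as a simple form of social interaction.
        pooled = h.mean(dim=0, keepdim=True).expand_as(h)
        h = torch.tanh(self.social(torch.cat([h, pooled], dim=-1)))
        step = self.embed(tracks[:, -1])              # last observed offset
        preds = []
        for _ in range(self.horizon):
            h = self.decoder(step, h)
            offset = self.out(h)
            preds.append(offset)
            step = self.embed(offset)                 # feed prediction back in
        return torch.stack(preds, dim=1)              # (N, horizon, 2)

# Usage: predict 12 future steps for 5 pedestrians from 8 observed steps.
# model = SocialGRU(); future = model(torch.randn(5, 8, 2))
```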
