V-PTP-IC: End-to-End Joint Modeling of Dynamic Scenes and Social Interactions for Pedestrian Trajectory Prediction from Vehicle-Mounted Cameras
Abstract
Pedestrian trajectory prediction from vehicle-mounted cameras is a safety-critical capability in intelligent transportation systems and autonomous driving, particularly in highly dynamic and visually complex urban traffic. In such scenarios, ego-motion-induced jitter, frequent occlusions, and diverse background motions jointly complicate the modeling of dynamic scene context and social interactions, both of which are critical for forecasting future trajectories. Existing approaches, often developed for fixed-camera or surveillance setups, lack robustness under these dynamic driving conditions. We present V-PTP-IC (Vehicle-view Pedestrian Trajectory Prediction with Interaction Considerations), an end-to-end framework that jointly models dynamic scene context and social interactions. The framework employs SORT-based multi-object tracking to initialize pedestrian trajectories and SIFT-based static-keypoint matching for ego-motion compensation and trajectory stabilization. A VGG19-based dynamic scene encoder captures evolving environmental layouts, while a Social-LSTM module models spatiotemporal dependencies among pedestrians. A unified feature-fusion strategy balances the two modalities to generate accurate, diverse, and socially compliant trajectory forecasts. Extensive experiments on the in-vehicle JAAD dataset show that V-PTP-IC reduces the average displacement error (ADE) by 22.2 and the final displacement error (FDE) by 25.8 compared with state-of-the-art baselines. These results confirm the framework's ability to balance prediction accuracy, diversity, and robustness, offering a scalable solution for autonomous driving in dynamically changing real-world environments.
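The abstract gives no implementation details, but the ego-motion compensation step it describes (SIFT static-keypoint matching used to stabilize pedestrian trajectories against camera jitter) maps naturally onto a standard OpenCV pipeline. The sketch below illustrates that step under this assumption; the function name stabilize_points, the Lowe ratio threshold, and the RANSAC reprojection tolerance are illustrative choices, not taken from the paper.

import cv2
import numpy as np

def stabilize_points(prev_frame, curr_frame, curr_points):
    """Re-project pixel points from curr_frame into prev_frame coordinates,
    cancelling the apparent background motion induced by the ego-vehicle."""
    gray1 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Detect SIFT keypoints and descriptors in both frames.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(gray1, None)
    kp2, des2 = sift.detectAndCompute(gray2, None)

    # Lowe's ratio test keeps only distinctive matches, which tend to lie
    # on static structures (buildings, road markings) rather than on
    # moving pedestrians or vehicles.
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    if len(good) < 4:  # a homography needs at least 4 correspondences
        return np.asarray(curr_points, dtype=np.float32)

    src = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC rejects residual matches on moving objects, so H approximates
    # the background (camera-motion) transform between the two frames.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return np.asarray(curr_points, dtype=np.float32)

    pts = np.float32(curr_points).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

In the full framework, the stabilized coordinates of SORT tracklets would then feed the VGG19 scene branch and the Social-LSTM interaction branch before fusion; those components are not reproduced here.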