Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models

Abstract

Object tracking remains a central problem in computer vision, with broad applications in surveillance, autonomous driving, augmented reality, and human–computer interaction. This paper presents a comprehensive survey that unifies the progression of tracking methodologies, from handcrafted and probabilistic models to deep learning paradigms and recent advances with large vision–language and foundation models. We categorize tracking into Single Object Tracking (SOT), Multi-Object Tracking (MOT), and Long-Term Tracking (LTT), systematically reviewing CNN-, Siamese-, transformer-, and hybrid-based approaches alongside detection-guided, detection-integrated, and re-identification–aware pipelines. Special emphasis is placed on emerging trends, including open-vocabulary tracking, promptable models, and multimodal fusion across RGB, depth, thermal, LiDAR, and event-based inputs. We highlight benchmark datasets, evaluation protocols, and taxonomy refinements that reveal convergence toward unified and generalizable tracking systems. Finally, we discuss open challenges, such as occlusion, scalability, identity consistency, and cross-dataset transferability, and outline future directions in self-supervised learning, adapter tuning, and efficient foundation model adaptation. This survey aims to serve as a reference for both academic researchers and practitioners, bridging classical paradigms with the rapidly evolving landscape of foundation and vision–language driven tracking.
