UREPTrack: Unified RGB-Event Visual Tracking via PoolFormer Backbone
Abstract
Visual Object Tracking (VOT) faces significant challenges under conditions such as fast motion, motion blur, and extreme illumination. RGB-only trackers often degrade in these scenarios, while event cameras provide microsecond latency and high dynamic range but lack rich spatial semantics. We introduce UREPTrack, a unified, single-stage, attention-free RGB-event tracker built on a lightweight PoolFormer backbone. Raw event data are voxelized into compact spatiotemporal tensors and, together with RGB template and search patches, embedded and concatenated into a single token stream processed by a shared backbone. A fully convolutional head jointly predicts classification confidence, center offsets, and box size, eliminating the need for multi-branch Siamese pipelines and costly self-attention. UREPTrack achieves state-of-the-art performance, setting new benchmarks on COESOT (S 64.4, P 77.5, NP 76.2, BOC 23.7) at 170 FPS, VisEvent (S 55.46, SR0.5 67.01, SR0.75 46.96, P 71.58, NP 75.22), and FE108 (P 94.3, S 65.9). Ablation studies confirm (i) the complementarity of RGB and event modalities, (ii) the superiority of event voxelization over image-like alternatives, and (iii) favorable accuracy and efficiency scaling across PoolFormer sizes. UREPTrack provides a practical, high-speed solution for real-time, multi-modal tracking on resource-constrained hardware. Our code will be publicly released at https://github.com/HamadYA/UREPTrack.
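The abstract describes voxelizing raw event streams into compact spatiotemporal tensors before embedding. A minimal sketch of one common voxelization scheme is shown below; the function name, bin count, and polarity handling are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def voxelize_events(events, num_bins, height, width):
    """Accumulate raw events into a (num_bins, height, width) voxel grid.

    `events` is an (N, 4) array with columns x, y, timestamp, polarity,
    where polarity is in {-1, +1}. Timestamps are normalized so each
    event contributes its polarity to one temporal bin. This is a
    generic scheme; the paper's voxelization may differ in detail.
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = events[:, 3].astype(np.float32)

    # Map timestamps into [0, num_bins) and clamp the final event,
    # which would otherwise land in bin num_bins.
    span = max(t.max() - t.min(), 1e-9)
    bins = np.clip(((t - t.min()) / span * num_bins).astype(np.int64),
                   0, num_bins - 1)

    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # Scatter-add each event's polarity into its (bin, y, x) cell;
    # np.add.at handles repeated indices correctly.
    np.add.at(grid, (bins, y, x), p)
    return grid
```

The resulting tensor can then be patch-embedded alongside the RGB template and search crops and concatenated into the single token stream the abstract describes.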