MASA-RTNet: A Multimodal Adaptive-Stream-Attention Network for Real-Time Video Suspicious-Behaviour Detection
Abstract
Real-time video anomaly detection (VAD) must reconcile three competing goals: (i) state-of-the-art accuracy in one-class settings where true anomalies are unseen during training; (ii) sub-frame latency on resource-constrained edge devices; and (iii) robustness across the appearance, motion, and physical-interaction cues that jointly characterise abnormal behaviour. We introduce MASA-RTNet, a Multimodal Adaptive Stream-Attention network that fuses appearance, optical flow, and a lightweight physics-informed graph branch inside a parameter-free attention gate. Two adaptive early-exit classifiers decide on the fly whether intermediate features already suffice for a confident verdict, yielding up to a 5.6× average FLOP reduction. Trained exclusively on normal data with a curriculum of synthetic outliers, MASA-RTNet attains new state-of-the-art frame-level AUCs of 97.3 % on UCSD-Ped2, 87.9 % on CUHK Avenue, and 74.5 % on ShanghaiTech, while sustaining 37 fps (26.9 ms per frame) on an NVIDIA Jetson Xavier NX with only 5.8 M trainable parameters. Extensive ablations confirm that every modality and the MASA gate contribute meaningfully, and that INT8 quantisation plus structured pruning incur negligible (< 0.1 pp) accuracy loss. The full code, trained checkpoints, and a reproducible Docker environment are released to the community.

Our main contributions are:

1. We propose the EE-OneClass paradigm, which marries early-exit inference with one-class energy modelling, enabling accurate VAD on edge GPUs (27 ms per 256 × 256 px frame on Jetson Xavier).
2. We design a tri-modal encoder that fuses appearance, motion, and physics in a parameter-efficient manner (5.8 M trainable parameters), and show that each modality makes complementary contributions.
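To make the two mechanisms named above concrete, here is a minimal, illustrative PyTorch sketch of (a) a parameter-free attention gate that fuses the three feature streams using weights derived from the streams themselves, and (b) a confidence-based early exit that returns an anomaly score from an intermediate head when that score is already decisive. All module names, tensor shapes, channel counts, and thresholds are assumptions made for this example; they are not taken from the released MASA-RTNet implementation.

```python
# Minimal sketch (assumptions, not the authors' code): parameter-free tri-stream
# fusion plus confidence-based early exit, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParameterFreeGate(nn.Module):
    """Fuse feature streams with weights computed from the streams themselves,
    so the gate itself introduces no learnable parameters."""

    def forward(self, streams):  # streams: list of (B, C, H, W) tensors
        # One scalar "energy" per stream and sample: mean squared activation.
        energies = torch.stack([s.pow(2).mean(dim=(1, 2, 3)) for s in streams], dim=1)  # (B, S)
        weights = F.softmax(energies, dim=1)                                             # (B, S)
        return sum(w.view(-1, 1, 1, 1) * s
                   for w, s in zip(weights.unbind(dim=1), streams))


class EarlyExitVAD(nn.Module):
    """Tiny backbone with two anomaly-score heads; inference stops at the first
    head whose score is already confidently normal or confidently anomalous."""

    def __init__(self, in_channels=3, channels=32):
        super().__init__()
        self.gate = ParameterFreeGate()
        self.stage1 = nn.Conv2d(in_channels, channels, 3, padding=1)  # placeholder stages
        self.stage2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.head1 = nn.Linear(channels, 1)
        self.head2 = nn.Linear(channels, 1)

    @staticmethod
    def _score(head, feat):
        # Global-average-pool the feature map and map it to a [0, 1] anomaly score.
        return torch.sigmoid(head(feat.mean(dim=(2, 3)))).squeeze(1)  # (B,)

    def forward(self, appearance, flow, physics, lo=0.1, hi=0.9):
        fused = self.gate([appearance, flow, physics])
        f1 = F.relu(self.stage1(fused))
        s1 = self._score(self.head1, f1)
        # Early exit: skip the deeper stage if every score in the batch is decisive.
        if ((s1 < lo) | (s1 > hi)).all():
            return s1
        f2 = F.relu(self.stage2(f1))
        return self._score(self.head2, f2)


# Example usage on dummy inputs (all three streams rasterised to 3-channel maps).
model = EarlyExitVAD().eval()
frame = torch.randn(1, 3, 256, 256)   # appearance
flow = torch.randn(1, 3, 256, 256)    # optical flow (assumed 3-channel encoding)
phys = torch.randn(1, 3, 256, 256)    # physics-branch features (assumed rasterised)
with torch.no_grad():
    score = model(frame, flow, phys)  # (1,) anomaly score in [0, 1]
```

In the actual system the physics branch is a graph module and the exit thresholds would be calibrated on held-out normal data; the sketch only conveys the control flow by which early exits reduce average per-frame computation.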