MASA-RTNet: A Multimodal Adaptive-Stream-Attention Network for Real-Time Video Suspicious-Behaviour Detection
Abstract
Real-time video anomaly detection (VAD) must reconcile three competing goals: (i) state-of-the-art accuracy in one-class settings where true anomalies are unseen during training; (ii) sub-frame latency on resource-constrained edge devices; and (iii) robustness across the appearance, motion, and physical-interaction cues that jointly characterise abnormal behaviour. We introduce MASA-RTNet, a Multimodal Adaptive Stream-Attention network that fuses appearance, optical flow, and a lightweight physics-informed graph branch inside a parameter-free attention gate. Two adaptive early-exit classifiers decide on the fly whether intermediate features already suffice for a confident verdict, yielding up to a 5.6× average FLOP reduction. Trained exclusively on normal data with a curriculum of synthetic outliers, MASA-RTNet attains new state-of-the-art frame-level AUCs of 97.3 % on UCSD-Ped2, 87.9 % on CUHK Avenue, and 74.5 % on ShanghaiTech, while sustaining 37 fps (26.9 ms per frame) on an NVIDIA Jetson Xavier NX with only 5.8 M trainable parameters. Extensive ablations confirm that every modality and the MASA gate contribute meaningfully, and that INT8 quantisation plus structured pruning incur negligible (< 0.1 pp) accuracy loss. The full code, trained checkpoints, and a reproducible Docker environment are released to the community.

Our main contributions are:

1. We propose the EE-OneClass paradigm, which marries early-exit inference with one-class energy modelling, enabling accurate VAD on edge GPUs (27 ms per 256 × 256 px frame on Jetson Xavier).
2. We design a tri-modal encoder that fuses appearance, motion, and physics in a parameter-efficient manner (5.8 M trainable parameters), and show that each modality makes complementary contributions.
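To make the two mechanisms named above concrete, here is a minimal, illustrative PyTorch sketch of (a) a parameter-free attention gate that fuses the three feature streams using weights derived from the streams themselves, and (b) a confidence-based early exit that returns an anomaly score from an intermediate head when that score is already decisive. All module names, tensor shapes, channel counts, and thresholds are assumptions made for this example; they are not taken from the released MASA-RTNet implementation.

```python
# Minimal sketch (assumptions, not the authors' code): parameter-free tri-stream
# fusion plus confidence-based early exit, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParameterFreeGate(nn.Module):
    """Fuse feature streams with weights computed from the streams themselves,
    so the gate itself introduces no learnable parameters."""

    def forward(self, streams):  # streams: list of (B, C, H, W) tensors
        # One scalar "energy" per stream and sample: mean squared activation.
        energies = torch.stack([s.pow(2).mean(dim=(1, 2, 3)) for s in streams], dim=1)  # (B, S)
        weights = F.softmax(energies, dim=1)                                             # (B, S)
        return sum(w.view(-1, 1, 1, 1) * s
                   for w, s in zip(weights.unbind(dim=1), streams))


class EarlyExitVAD(nn.Module):
    """Tiny backbone with two anomaly-score heads; inference stops at the first
    head whose score is already confidently normal or confidently anomalous."""

    def __init__(self, in_channels=3, channels=32):
        super().__init__()
        self.gate = ParameterFreeGate()
        self.stage1 = nn.Conv2d(in_channels, channels, 3, padding=1)  # placeholder stages
        self.stage2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.head1 = nn.Linear(channels, 1)
        self.head2 = nn.Linear(channels, 1)

    @staticmethod
    def _score(head, feat):
        # Global-average-pool the feature map and map it to a [0, 1] anomaly score.
        return torch.sigmoid(head(feat.mean(dim=(2, 3)))).squeeze(1)  # (B,)

    def forward(self, appearance, flow, physics, lo=0.1, hi=0.9):
        fused = self.gate([appearance, flow, physics])
        f1 = F.relu(self.stage1(fused))
        s1 = self._score(self.head1, f1)
        # Early exit: skip the deeper stage if every score in the batch is decisive.
        if ((s1 < lo) | (s1 > hi)).all():
            return s1
        f2 = F.relu(self.stage2(f1))
        return self._score(self.head2, f2)


# Example usage on dummy inputs (all three streams rasterised to 3-channel maps).
model = EarlyExitVAD().eval()
frame = torch.randn(1, 3, 256, 256)   # appearance
flow = torch.randn(1, 3, 256, 256)    # optical flow (assumed 3-channel encoding)
phys = torch.randn(1, 3, 256, 256)    # physics-branch features (assumed rasterised)
with torch.no_grad():
    score = model(frame, flow, phys)  # (1,) anomaly score in [0, 1]
```

In the actual system the physics branch is a graph module and the exit thresholds would be calibrated on held-out normal data; the sketch only conveys the control flow by which early exits reduce average per-frame computation.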