A Dual-Modality Spatio-Temporal and Frequency Framework for Robust Deepfake Detection
Abstract
The rapid proliferation of high-fidelity facial synthesis has made automated deepfake detection a core requirement in digital forensics. Existing detectors that rely solely on spatial artifacts or temporal inconsistencies lack robustness under strong compression and heterogeneous manipulation pipelines. This paper introduces DSTF-Net, a dual-modality spatio-temporal and temporal-frequency framework for deepfake video detection that explicitly addresses these limitations. DSTF-Net consists of two coordinated branches. The Waveformer branch applies a three-level 2D Haar wavelet decomposition to face-aligned frames, generating multi-scale frequency representations that expose seams, resampling traces, and texture inconsistencies, and then aggregates these frame-level descriptors using a transformer-based temporal combiner with hybrid statistical pooling. In parallel, the SFormer branch employs a Swin-Transformer backbone coupled with a temporal encoder to jointly model spatial structure and motion dynamics over sequences of 32 frames, capturing both local appearance distortions and long-range temporal irregularities. The 512-dimensional embeddings from the two branches are fused using an adapted MLP-Mixer that performs token- and channel-wise mixing to learn a compact, discriminative video-level representation. DSTF-Net attains 97.50–99.16% accuracy with AUC values between 0.9786 and 0.9998 across four manipulation types in the FaceForensics++ benchmark and reaches 97.77% accuracy with an AUC of 0.9881 on Celeb-DF, establishing consistent high performance in intra-dataset settings. These results confirm that the explicit integration of spatio-temporal and frequency-domain cues in a unified architecture yields a robust and practically deployable solution for deepfake detection in real-world conditions.
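To make the frequency branch's first step concrete, the following is a minimal sketch of a three-level 2D Haar wavelet decomposition on a single grayscale frame. It is not the authors' implementation: the function names `haar_step` and `haar_decompose`, the orthonormal scaling (dividing each 2×2 block combination by 2), the list-of-rows image format, and the requirement that both image sides be divisible by 2^levels are all assumptions made for illustration. The abstract does not specify normalization, color handling, or crop size.

```python
def haar_step(img):
    """Apply one level of the 2D Haar transform to a grayscale image.

    `img` is a list of equal-length rows of floats with even height and width.
    Returns four half-size bands: the approximation and three detail bands
    (differences across columns, across rows, and along the diagonal).
    Orthonormal scaling is an assumption, not taken from the paper.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, g_row = [], [], [], []
        for c in range(0, cols, 2):
            # The four pixels of one 2x2 block.
            p00, p01 = img[r][c], img[r][c + 1]
            p10, p11 = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((p00 + p01 + p10 + p11) / 2.0)  # low-pass in both directions
            c_row.append((p00 - p01 + p10 - p11) / 2.0)  # difference across columns
            r_row.append((p00 + p01 - p10 - p11) / 2.0)  # difference across rows
            g_row.append((p00 - p01 - p10 + p11) / 2.0)  # diagonal difference
        approx.append(a_row)
        d_cols.append(c_row)
        d_rows.append(r_row)
        d_diag.append(g_row)
    return approx, d_cols, d_rows, d_diag


def haar_decompose(img, levels=3):
    """Recursively decompose the approximation band `levels` times.

    Returns the coarsest approximation and a list of detail-band triples,
    ordered from the finest level to the coarsest.
    """
    details = []
    approx = img
    for _ in range(levels):
        approx, d_cols, d_rows, d_diag = haar_step(approx)
        details.append((d_cols, d_rows, d_diag))
    return approx, details
```

Under these assumptions, a hypothetical 224×224 face crop would yield detail bands of size 112×112, 56×56, and 28×28, plus a 28×28 approximation. The detail bands are where blending seams and resampling traces would be expected to show up as unusual high-frequency energy.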