A Dual-Modality Spatio-Temporal and Frequency Framework for Robust Deepfake Detection

Abstract

The rapid proliferation of high-fidelity facial synthesis has made automated deepfake detection a core requirement in digital forensics. Existing detectors that rely solely on spatial artifacts or temporal inconsistencies lack robustness under strong compression and heterogeneous manipulation pipelines. This paper introduces DSTF-Net, a dual-modality spatio-temporal and temporal-frequency framework for deepfake video detection that explicitly addresses these limitations. DSTF-Net consists of two coordinated branches. The Waveformer branch applies a three-level 2D Haar wavelet decomposition to face-aligned frames, generating multi-scale frequency representations that expose seams, resampling traces, and texture inconsistencies; it then aggregates these frame-level descriptors using a transformer-based temporal combiner with hybrid statistical pooling. In parallel, the SFormer branch employs a Swin-Transformer backbone coupled with a temporal encoder to jointly model spatial structure and motion dynamics over sequences of 32 frames, capturing both local appearance distortions and long-range temporal irregularities. The 512-dimensional embeddings from the two branches are fused using an adapted MLP-Mixer that performs token- and channel-wise mixing to learn a compact, discriminative video-level representation. DSTF-Net attains 97.50–99.16% accuracy with AUC values between 0.9786 and 0.9998 across four manipulation types in the FaceForensics++ benchmark and reaches 97.77% accuracy with an AUC of 0.9881 on Celeb-DF, establishing consistently high performance in intra-dataset settings. These results confirm that the explicit integration of spatio-temporal and frequency-domain cues in a unified architecture yields a robust and practically deployable solution for deepfake detection in real-world conditions.
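
To make the Waveformer front end concrete, the following is a minimal sketch of a three-level 2D Haar wavelet decomposition applied to a single grayscale frame. It is an illustration only, not the authors' implementation: the function names `haar_dwt2` and `haar_pyramid`, the 224x224 frame size, the orthonormal scaling, and the sub-band labels are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform.

    Assumes a 2D array with even height and width. Returns the
    approximation band and a tuple of three detail bands. The labels
    (lh, hl, hh) follow one common convention; others swap lh and hl.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low-pass in both directions)
    lh = (a + b - c - d) / 2.0  # top-minus-bottom differences
    hl = (a - b + c - d) / 2.0  # left-minus-right differences
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, (lh, hl, hh)


def haar_pyramid(frame, levels=3):
    """Recursively decompose the approximation band `levels` times.

    Hypothetical helper: returns the coarsest approximation and a list of
    per-level detail tuples, finest level first.
    """
    ll = np.asarray(frame, dtype=np.float64)
    if ll.ndim != 2 or ll.shape[0] % 2**levels or ll.shape[1] % 2**levels:
        raise ValueError("frame must be 2D with sides divisible by 2**levels")
    details = []
    for _ in range(levels):
        ll, bands = haar_dwt2(ll)
        details.append(bands)
    return ll, details


# Example: a random stand-in for a face-aligned 224x224 grayscale frame.
frame = np.random.default_rng(0).random((224, 224))
approx, details = haar_pyramid(frame, levels=3)
print(approx.shape, [bands[0].shape for bands in details])
```

Each level halves the spatial resolution, so a 224x224 frame yields detail bands at 112x112, 56x56, and 28x28 plus a 28x28 approximation. In a pipeline like the one described, these multi-scale bands would then be turned into per-frame descriptors before temporal aggregation; how DSTF-Net does that is not stated in the abstract.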