A Dual-Modality Spatio-Temporal and Frequency Framework for Robust Deepfake Detection

Arman Sajjadi
Sayna Sarvar
Mobin Nekou
Mahdi Fallah
Delaram Mehralizadeh
Mohammad Hossein Jabbarzadeh
Pedram Salehpour

Abstract

The rapid proliferation of high-fidelity facial synthesis has made automated deepfake detection a core requirement in digital forensics. Existing detectors that rely solely on spatial artifacts or temporal inconsistencies lack robustness under strong compression and heterogeneous manipulation pipelines. This paper introduces DSTF-Net, a dual-modality spatio-temporal and temporal-frequency framework for deepfake video detection that explicitly addresses these limitations. DSTF-Net consists of two coordinated branches. The Waveformer branch applies a three-level 2D Haar wavelet decomposition to face-aligned frames, generating multi-scale frequency representations that expose seams, resampling traces, and texture inconsistencies; it then aggregates these frame-level descriptors using a transformer-based temporal combiner with hybrid statistical pooling. In parallel, the SFormer branch employs a Swin-Transformer backbone coupled with a temporal encoder to jointly model spatial structure and motion dynamics over sequences of 32 frames, capturing both local appearance distortions and long-range temporal irregularities. The 512-dimensional embeddings from the two branches are fused using an adapted MLP-Mixer that performs token- and channel-wise mixing to learn a compact, discriminative video-level representation. DSTF-Net attains 97.50–99.16% accuracy with AUC values between 0.9786 and 0.9998 across four manipulation types in the FaceForensics++ benchmark and reaches 97.77% accuracy with an AUC of 0.9881 on Celeb-DF, establishing consistently high performance in intra-dataset settings. These results confirm that the explicit integration of spatio-temporal and frequency-domain cues in a unified architecture yields a robust and practically deployable solution for deepfake detection in real-world conditions.
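
To make the frequency branch's first step concrete, the following is a minimal NumPy sketch of a three-level 2D Haar wavelet decomposition applied to a single frame. It is an illustration under stated assumptions, not the authors' implementation: the function names `haar_dwt2` and `haar_pyramid`, the single-channel (grayscale) input, the 224 x 224 crop size, and the orthonormal scaling are all hypothetical choices, since the abstract does not specify them. Sub-band naming conventions (LH versus HL) also vary between libraries.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform (assumes even H and W)."""
    # Split the image into the four pixels of each non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass approximation
    lh = (a + b - c - d) / 2.0  # row differences (responds to horizontal edges)
    hl = (a - b + c - d) / 2.0  # column differences (responds to vertical edges)
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)


def haar_pyramid(frame, levels=3):
    """Apply haar_dwt2 recursively to the approximation band.

    Returns the coarsest approximation and a list of (lh, hl, hh) tuples,
    ordered from the finest level to the coarsest.
    """
    ll = np.asarray(frame, dtype=np.float64)
    details = []
    for _ in range(levels):
        if ll.shape[0] % 2 or ll.shape[1] % 2:
            raise ValueError("height and width must be divisible by 2**levels")
        ll, bands = haar_dwt2(ll)
        details.append(bands)
    return ll, details


# Hypothetical input: one 224x224 grayscale face crop with random content.
rng = np.random.default_rng(0)
frame = rng.random((224, 224))
approx, details = haar_pyramid(frame, levels=3)
detail_shapes = [bands[0].shape for bands in details]
```

Each level halves the spatial resolution, so a 224 x 224 input yields detail bands of size 112, 56, and 28 per side plus a 28 x 28 approximation. In a detector of the kind the abstract describes, per-frame descriptors would be computed from these sub-bands and then passed to the temporal aggregation stage; how that is done in DSTF-Net is not stated here.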

Version published to 10.21203/rs.3.rs-8686357/v1 on Research Square
Feb 27, 2026

Related articles

Lip-Sync Authenticity Detection Using Spatial, Spectral, and Deep Learning-Based Feature Fusion

This article has 4 authors:
1. Pranav Mahesh Rayban
2. Ajitesh Sharma
3. Aarya Ashish Nagvekar
4. Jaishree Jaikrishnan
This article has no evaluations
Latest version Mar 18, 2026

StycoGAN for Feature Level Temporal Regularization in Perceptually Stable Sequential Image Synthesis

This article has 4 authors:
1. Mars Caroline Wibowo
2. Danny HF Manongga
3. Hendry Hendry
4. Budhi Kristianto
This article has no evaluations
Latest version Mar 4, 2026

MFFP-Net: Multi-directional Feature Fusion and Position-Aware Network

This article has 4 authors:
1. Yazhong Si
2. Jingyu Chen
3. Hongxu Li
4. Chen Li
This article has no evaluations
Latest version Mar 9, 2026
