Multi-Stage Spatial-Temporal Ensemble Model with Integrated Learning Methods for Robust Deepfake Detection
Abstract
In the era of synthetic media, robust and scalable deepfake detection has become critical to preserving digital content integrity. Existing detection methods often focus narrowly on spatial or temporal features, limiting their generalizability and robustness. This paper proposes an Integrated Learning Methods (ILM) model, a novel multi-stage hybrid architecture combining YOLOv5 for precise face detection, Haar Cascade for face validation, ResNet-50 for hierarchical spatial feature extraction, LightGBM for frame-level classification, LSTM for temporal modeling, and Random Forest for final ensemble fusion. Evaluated on the FaceForensics++ and Celeb-DF (v2) datasets, the proposed ILM achieved 98% accuracy, precision, recall, and F1-score, outperforming state-of-the-art CNN-, RNN-, and transformer-based models. Ablation studies validated the incremental contribution of each module, confirming the synergistic design of ILM in addressing spatial misalignment, temporal inconsistencies, and generalization limitations. The modular and scalable design supports deployment in digital forensics, media authentication, and AI governance, while future work will integrate transformer-based global context encoders and explainable AI for enhanced robustness and interpretability.
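The staged design described above (frame-level scoring, temporal aggregation, then ensemble fusion) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it uses scikit-learn stand-ins (GradientBoostingClassifier in place of LightGBM, a sliding-window mean/std summary in place of the LSTM) and synthetic data in place of ResNet-50 features; all variable names and window sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for ResNet-50 frame features: 200 frames x 16 dims.
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic fake/real frame labels

# Stage 1: frame-level classifier (GradientBoosting as a LightGBM stand-in).
frame_clf = GradientBoostingClassifier(random_state=0).fit(X, y)
frame_scores = frame_clf.predict_proba(X)[:, 1]

# Stage 2: temporal summary per clip of 10 frames (mean/std over a window,
# a crude stand-in for the LSTM's sequence modeling of frame-score dynamics).
window = 10
temporal = np.array([
    [frame_scores[i:i + window].mean(), frame_scores[i:i + window].std()]
    for i in range(0, len(frame_scores) - window + 1, window)
])
clip_labels = np.array([
    int(y[i:i + window].mean() > 0.5)
    for i in range(0, len(y) - window + 1, window)
])

# Stage 3: Random Forest fuses the temporal summaries into a clip-level verdict.
fusion = RandomForestClassifier(random_state=0).fit(temporal, clip_labels)
clip_preds = fusion.predict(temporal)  # one verdict per 10-frame clip
```

The point of the sketch is the data flow, not the specific models: each stage consumes the previous stage's outputs, so any module (face detector, spatial backbone, temporal model) can be swapped independently, which is what makes the architecture modular.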