Hybrid Framework for Interpretable Deepfake Video Detection Using CapsNet and Transformer Encoders
Abstract
In an era increasingly shaped by synthetic media, deepfake videos pose significant threats to public trust, media integrity, and digital security. To address the limitations of current detection models, particularly their lack of interpretability and poor cross-dataset generalizability, we propose a novel hybrid framework that integrates EfficientNetB7, Capsule Networks (CapsNet), and Transformer encoders. The architecture uses EfficientNetB7 for high-resolution spatial feature extraction, CapsNet for part-whole relational modeling via dynamic routing, and Transformer encoders for capturing long-range temporal inconsistencies across video frames. A distinguishing feature of our approach is its emphasis on explainability, realized by integrating Grad-CAM to produce visual attributions for classification decisions. The model was trained and validated on the benchmark datasets Google DeepFakeDetection (DFD) and FaceForensics++ (FF++), including both original and manipulated subsets, and was evaluated under intra-dataset and cross-dataset conditions. It achieved 93.00% accuracy and an F1-score of 0.9619 on DFD, and 87.64% accuracy on FF++, outperforming several state-of-the-art baselines. In cross-dataset testing, the model demonstrated superior generalization, achieving 89.36% accuracy when trained on FF++ and tested on DFD, surpassing recent methods. By combining explainable AI with a computationally efficient and temporally aware hybrid model, this work offers a powerful, interpretable, and deployable solution for deepfake detection in real-world environments.
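To make the described pipeline concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation. It assumes per-frame features have already been extracted by an EfficientNetB7 backbone (whose final feature dimension is 2560) with its classifier head removed, routes them through a small capsule layer with dynamic routing, and aggregates the resulting per-frame embeddings with a Transformer encoder to produce a real/fake logit. All layer sizes, capsule counts, and routing iterations are placeholder assumptions.

```python
# Illustrative sketch only (not the authors' code): EfficientNetB7-style frame
# features -> capsule layer with dynamic routing -> Transformer encoder over time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CapsuleLayer(nn.Module):
    """Maps input capsules to output capsules via dynamic routing-by-agreement."""

    def __init__(self, in_caps, in_dim, out_caps, out_dim, routing_iters=3):
        super().__init__()
        self.routing_iters = routing_iters
        # One transformation matrix per (output capsule, input capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(out_caps, in_caps, out_dim, in_dim))

    @staticmethod
    def squash(s, eps=1e-8):
        norm = s.norm(dim=-1, keepdim=True)
        return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

    def forward(self, u):                                     # u: (B, in_caps, in_dim)
        # Prediction vectors u_hat[b, o, i, :] = W[o, i] @ u[b, i]
        u_hat = torch.einsum('oijk,bik->boij', self.W, u)     # (B, out_caps, in_caps, out_dim)
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.routing_iters):
            c = F.softmax(b, dim=1)                           # coupling coefficients
            s = (c.unsqueeze(-1) * u_hat).sum(dim=2)          # weighted sum over input capsules
            v = self.squash(s)                                # (B, out_caps, out_dim)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)      # agreement update
        return v


class HybridDeepfakeDetector(nn.Module):
    """Frame features -> capsules -> temporal Transformer -> real/fake logit."""

    def __init__(self, feat_dim=2560, num_caps=8, caps_dim=16, d_model=128):
        super().__init__()
        self.num_caps, self.caps_dim = num_caps, caps_dim
        # Projects EfficientNetB7 frame embeddings into primary capsules.
        self.primary = nn.Linear(feat_dim, num_caps * caps_dim)
        self.capsules = CapsuleLayer(num_caps, caps_dim, num_caps, caps_dim)
        self.proj = nn.Linear(num_caps * caps_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, frame_feats):                           # (B, T, feat_dim)
        B, T, _ = frame_feats.shape
        caps_in = self.primary(frame_feats).view(B * T, self.num_caps, self.caps_dim)
        caps_out = self.capsules(caps_in).flatten(1)          # (B*T, num_caps * caps_dim)
        tokens = self.proj(caps_out).view(B, T, -1)           # one token per frame
        pooled = self.temporal(tokens).mean(dim=1)            # temporal aggregation
        return self.head(pooled).squeeze(-1)                  # logit per video clip


# Example: a batch of 2 clips, 16 sampled frames each, with 2560-d frame features.
model = HybridDeepfakeDetector()
logits = model(torch.randn(2, 16, 2560))
probs = torch.sigmoid(logits)                                 # probability of "fake"
```

In the full framework described in the abstract, Grad-CAM would additionally be applied to the convolutional backbone to produce visual attribution maps for each classification decision; that component is omitted from this sketch for brevity.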