Hybrid Framework for Interpretable Deepfake Video Detection Using CapsNet and Transformer Encoders
Abstract
In an era increasingly shaped by synthetic media, deepfake videos pose significant threats to public trust, media integrity, and digital security. To address the limitations of current detection models, particularly their lack of interpretability and poor cross-dataset generalizability, we propose a novel hybrid framework that integrates EfficientNetB7, Capsule Networks (CapsNet), and Transformer encoders. The architecture uses EfficientNetB7 for high-resolution spatial feature extraction, CapsNet for part-whole relational modeling via dynamic routing, and Transformer encoders for capturing long-range temporal inconsistencies across video frames. A distinguishing feature of our approach is its emphasis on explainability, realized by integrating Grad-CAM to produce visual attributions for classification decisions. The model was trained and validated on the benchmark datasets Google DeepFakeDetection (DFD) and FaceForensics++ (FF++), including both original and manipulated subsets, and was evaluated under intra-dataset and cross-dataset conditions. It achieved 93.00% accuracy and an F1-score of 0.9619 on DFD, and 87.64% accuracy on FF++, outperforming several state-of-the-art baselines. In cross-dataset testing, the model demonstrated superior generalization, achieving 89.36% accuracy when trained on FF++ and tested on DFD, surpassing recent methods. By combining explainable AI with a computationally efficient and temporally aware hybrid model, this work offers a powerful, interpretable, and deployable solution for deepfake detection in real-world environments.
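To make the described pipeline concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation. It assumes per-frame features have already been extracted by an EfficientNetB7 backbone (whose final feature dimension is 2560) with its classifier head removed, routes them through a small capsule layer with dynamic routing, and aggregates the resulting per-frame embeddings with a Transformer encoder to produce a real/fake logit. All layer sizes, capsule counts, and routing iterations are placeholder assumptions.

```python
# Illustrative sketch only (not the authors' code): EfficientNetB7-style frame
# features -> capsule layer with dynamic routing -> Transformer encoder over time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CapsuleLayer(nn.Module):
    """Maps input capsules to output capsules via dynamic routing-by-agreement."""

    def __init__(self, in_caps, in_dim, out_caps, out_dim, routing_iters=3):
        super().__init__()
        self.routing_iters = routing_iters
        # One transformation matrix per (output capsule, input capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(out_caps, in_caps, out_dim, in_dim))

    @staticmethod
    def squash(s, eps=1e-8):
        norm = s.norm(dim=-1, keepdim=True)
        return (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + eps)

    def forward(self, u):                                     # u: (B, in_caps, in_dim)
        # Prediction vectors u_hat[b, o, i, :] = W[o, i] @ u[b, i]
        u_hat = torch.einsum('oijk,bik->boij', self.W, u)     # (B, out_caps, in_caps, out_dim)
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.routing_iters):
            c = F.softmax(b, dim=1)                           # coupling coefficients
            s = (c.unsqueeze(-1) * u_hat).sum(dim=2)          # weighted sum over input capsules
            v = self.squash(s)                                # (B, out_caps, out_dim)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)      # agreement update
        return v


class HybridDeepfakeDetector(nn.Module):
    """Frame features -> capsules -> temporal Transformer -> real/fake logit."""

    def __init__(self, feat_dim=2560, num_caps=8, caps_dim=16, d_model=128):
        super().__init__()
        self.num_caps, self.caps_dim = num_caps, caps_dim
        # Projects EfficientNetB7 frame embeddings into primary capsules.
        self.primary = nn.Linear(feat_dim, num_caps * caps_dim)
        self.capsules = CapsuleLayer(num_caps, caps_dim, num_caps, caps_dim)
        self.proj = nn.Linear(num_caps * caps_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, frame_feats):                           # (B, T, feat_dim)
        B, T, _ = frame_feats.shape
        caps_in = self.primary(frame_feats).view(B * T, self.num_caps, self.caps_dim)
        caps_out = self.capsules(caps_in).flatten(1)          # (B*T, num_caps * caps_dim)
        tokens = self.proj(caps_out).view(B, T, -1)           # one token per frame
        pooled = self.temporal(tokens).mean(dim=1)            # temporal aggregation
        return self.head(pooled).squeeze(-1)                  # logit per video clip


# Example: a batch of 2 clips, 16 sampled frames each, with 2560-d frame features.
model = HybridDeepfakeDetector()
logits = model(torch.randn(2, 16, 2560))
probs = torch.sigmoid(logits)                                 # probability of "fake"
```

In the full framework described in the abstract, Grad-CAM would additionally be applied to the convolutional backbone to produce visual attribution maps for each classification decision; that component is omitted from this sketch for brevity.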