Explainability-Driven Adversarial Robustness Assessment for Generalized Deepfake Detectors
Abstract
The capability of generative models to produce high-quality fake images requires deepfake detectors to be accurate and to generalize well. Moreover, the explainability and adversarial robustness of deepfake detectors are critical for deploying such models in real-world scenarios. In this paper, we propose a framework that leverages explainability to assess the adversarial robustness of deepfake detectors. Specifically, we apply feature attribution methods to identify the image regions the model focuses on to make its prediction. We then use the generated heatmaps to perform an explainability-driven attack, perturbing the most relevant and the most irrelevant regions with gradient-based adversarial techniques. We feed the model with the resulting adversarial images and measure the accuracy drop and the attack success rate. We test our methodology on state-of-the-art models with strong generalization abilities, providing a comprehensive, explainability-driven evaluation of their robustness. Experimental results show that explainability analysis serves as a tool to reveal vulnerabilities of generalized deepfake detectors to adversarial attacks.
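To make the pipeline concrete, the following is a minimal sketch of an explainability-driven attack, not the authors' implementation: it assumes a PyTorch image classifier, uses plain gradient saliency as the feature attribution method and FGSM as the gradient-based attack, and restricts the perturbation to the most (or least) relevant pixels before measuring the accuracy drop and attack success rate. The function names (`attribution_map`, `region_mask`, `masked_fgsm`) and the toy model in the demo are illustrative placeholders.

```python
import torch
import torch.nn.functional as F


def attribution_map(model, x, y):
    # Simple gradient-based feature attribution (saliency): absolute input
    # gradient of the target-class score, summed over color channels.
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.view(-1, 1)).sum()
    score.backward()
    return x.grad.abs().sum(dim=1, keepdim=True)  # shape (B, 1, H, W)


def region_mask(heatmap, top_fraction=0.1, most_relevant=True):
    # Binary mask selecting the top (or bottom) fraction of attributed pixels.
    b = heatmap.size(0)
    flat = heatmap.view(b, -1)
    k = max(1, int(top_fraction * flat.size(1)))
    thresh = flat.topk(k, dim=1, largest=most_relevant).values[:, -1:]
    mask = flat >= thresh if most_relevant else flat <= thresh
    return mask.view_as(heatmap).float()


def masked_fgsm(model, x, y, mask, eps=8 / 255):
    # One-step FGSM perturbation applied only inside the selected regions.
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x + eps * x_adv.grad.sign() * mask).clamp(0, 1).detach()


@torch.no_grad()
def evaluate(model, x, y):
    # Per-sample correctness of the detector's predictions.
    return (model(x).argmax(1) == y).float()


if __name__ == "__main__":
    # Toy demo with an untrained CNN and random data, standing in for a real
    # deepfake detector and dataset.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        torch.nn.Linear(8, 2),
    )
    x = torch.rand(4, 3, 64, 64)
    y = torch.randint(0, 2, (4,))

    clean_correct = evaluate(model, x, y)
    heat = attribution_map(model, x, y)
    adv = masked_fgsm(model, x, y, region_mask(heat, most_relevant=True))
    adv_correct = evaluate(model, adv, y)

    print("accuracy drop:", (clean_correct.mean() - adv_correct.mean()).item())
    print("attack success rate:",
          ((clean_correct == 1) & (adv_correct == 0)).float().mean().item())
```

Swapping `most_relevant=True` for `False` in the `region_mask` call perturbs the least relevant regions instead, which allows comparing how much the detector's decision depends on the areas highlighted by the attribution method.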