Explainability-Driven Adversarial Robustness Assessment for Generalized Deepfake Detectors

Abstract

The ability of generative models to produce high-quality fake images requires deepfake detectors that are accurate and generalize well. Moreover, the explainability and adversarial robustness of deepfake detectors are critical for deploying such models in real-world scenarios. In this paper, we propose a framework that leverages explainability to assess the adversarial robustness of deepfake detectors. Specifically, we apply feature attribution methods to identify the image regions on which the model focuses when making its prediction. We then use the generated heatmaps to perform an explainability-driven attack, perturbing the most relevant and the most irrelevant regions with gradient-based adversarial techniques. We feed the resulting adversarial images to the model and measure the accuracy drop and the attack success rate. We test our methodology on state-of-the-art models with strong generalization abilities, providing a comprehensive, explainability-driven evaluation of their robustness. Experimental results show that the explainability analysis serves as a tool to reveal the vulnerabilities of generalized deepfake detectors to adversarial attacks.
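As a rough illustration of the pipeline described above, the sketch below chains a gradient-based attribution (plain input gradients as a stand-in for the paper's attribution methods), a binary mask over the top-k most salient pixels, an FGSM-style perturbation restricted to that mask, and the two reported metrics (accuracy drop and attack success rate). The PyTorch detector interface, the `keep_ratio` and `eps` values, and the specific attribution and attack choices are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of an explainability-driven masked attack, assuming a
# PyTorch binary deepfake detector `model` that returns one logit per image
# and inputs normalized to [0, 1]. Attribution and attack are illustrative.
import torch
import torch.nn.functional as F

def saliency_mask(model, images, labels, keep_ratio=0.2):
    """Binary mask marking the top `keep_ratio` most salient pixels per image."""
    images = images.clone().requires_grad_(True)
    logits = model(images).squeeze(1)
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    grads = torch.autograd.grad(loss, images)[0]
    # Per-pixel relevance: max absolute gradient across colour channels.
    relevance = grads.abs().amax(dim=1, keepdim=True)           # (B, 1, H, W)
    k = int(keep_ratio * relevance[0].numel())
    thresh = relevance.flatten(1).topk(k, dim=1).values[:, -1]  # per-image cutoff
    return (relevance >= thresh.view(-1, 1, 1, 1)).float()

def masked_fgsm(model, images, labels, mask, eps=4 / 255):
    """FGSM perturbation restricted to the masked (relevant or irrelevant) region."""
    images = images.clone().requires_grad_(True)
    logits = model(images).squeeze(1)
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    grads = torch.autograd.grad(loss, images)[0]
    adv = images + eps * grads.sign() * mask   # perturb only inside the mask
    return adv.clamp(0, 1).detach()

@torch.no_grad()
def attack_metrics(model, clean, adv, labels):
    """Accuracy drop and attack success rate of the perturbed images."""
    pred_clean = (model(clean).squeeze(1) > 0).long()
    pred_adv = (model(adv).squeeze(1) > 0).long()
    acc_clean = (pred_clean == labels).float().mean().item()
    acc_adv = (pred_adv == labels).float().mean().item()
    # Success rate: fraction of correctly classified images that get flipped.
    flipped = ((pred_clean == labels) & (pred_adv != labels)).float().sum()
    asr = (flipped / (pred_clean == labels).float().sum().clamp(min=1)).item()
    return acc_clean - acc_adv, asr
```

Inverting the mask (`1 - mask`) would target the least relevant regions instead, which is how the relevant-versus-irrelevant comparison in the abstract could be carried out under these assumptions.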