Combating Deepfakes Using an Integrated Framework for Audio and Video Deepfake Detection

Abstract

The creation and accessibility of synthetic audio-visual media, commonly known as deepfakes, has reached an alarming level with the rapid advancement of generative AI techniques. Such fabricated or manipulated media poses threats to individuals’ privacy, security, and the integrity of information, making it crucial to develop robust and reliable deepfake detection methods to mitigate these risks. We present a multimodal approach to deepfake detection that leverages both audio and visual signals. Our proposed deep learning framework extracts discriminative audio and visual features, which are then fused to classify synthetic media effectively. For audio analysis, we employ mel-spectrogram representations and convolutional neural networks (CNNs) to capture spectral patterns indicative of deepfakes. For video, we use facial landmark detection, alignment, and deep CNNs to model facial cues and inconsistencies associated with deepfakes. The fusion of these audio and video modalities enables our model to capitalize on complementary information, enhancing its ability to detect deepfakes accurately. Extensive experiments on benchmark datasets, including DeepFakeTIMIT and DFDC, demonstrate the efficacy of our approach, achieving a precision of 0.78, an accuracy of 0.93, and an F1 score of 0.82, outperforming several state-of-the-art monomodal methods. Our findings underscore the importance of multimodal analysis for robust deepfake detection and pave the way for developing more sophisticated techniques to safeguard media authenticity in an increasingly synthetic world.
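
The architecture described in the abstract — a mel-spectrogram CNN branch for audio, a face-crop CNN branch for video, and a fusion head over the concatenated embeddings — could be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the class names (AudioBranch, VideoBranch, FusionDetector), layer sizes, sample rate, and use of late fusion are all assumptions.

```python
# Hypothetical sketch of an audio-visual deepfake detector with late fusion (PyTorch).
# All hyperparameters and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio


class AudioBranch(nn.Module):
    """CNN over log mel-spectrograms of the audio track."""
    def __init__(self, n_mels=64, emb_dim=128):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_mels=n_mels)  # assumed 16 kHz audio
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim))

    def forward(self, waveform):                     # waveform: (B, samples)
        spec = self.melspec(waveform).unsqueeze(1)   # (B, 1, n_mels, frames)
        return self.cnn(torch.log1p(spec))           # (B, emb_dim)


class VideoBranch(nn.Module):
    """CNN over aligned face crops (landmark detection/alignment done upstream)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim))

    def forward(self, face_crops):                   # face_crops: (B, 3, H, W)
        return self.cnn(face_crops)                  # (B, emb_dim)


class FusionDetector(nn.Module):
    """Concatenates audio and visual embeddings and predicts P(fake)."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.audio = AudioBranch(emb_dim=emb_dim)
        self.video = VideoBranch(emb_dim=emb_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, waveform, face_crops):
        fused = torch.cat([self.audio(waveform), self.video(face_crops)], dim=1)
        return torch.sigmoid(self.head(fused))       # probability of "fake"


# Example usage with dummy inputs (1-second audio clip, 112x112 face crop).
if __name__ == "__main__":
    model = FusionDetector()
    score = model(torch.randn(2, 16_000), torch.randn(2, 3, 112, 112))
    print(score.shape)  # torch.Size([2, 1])
```

A late-fusion design like this keeps the two branches independent, so either modality can be trained or ablated separately; the paper's stronger results over monomodal baselines are consistent with the fused head exploiting complementary audio and visual cues.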