Lip-Sync Authenticity Detection Using Spatial, Spectral, and Deep Learning-Based Feature Fusion

Abstract

Lip-sync deepfakes, which re-animate a speaker’s mouth region to match arbitrary audio, present a distinct and underexamined forensic challenge: unlike full face-swap manipulations, they preserve speaker identity and confine modification to the perioral region, causing general-purpose deepfake detectors to fail. Detecting such manipulations requires cues that are simultaneously sensitive to subtle spatial inconsistencies, frequency-domain artifacts, and inter-frame temporal discontinuities. This paper proposes LipSyncAuthenticityNet, a multi-modal lip-sync authenticity detection framework that fuses spatial, spectral, and attention-augmented deep learning features into a unified classification pipeline. Spatial features capture inter-frame consistency via Pearson correlation, pixel intensity difference, and Canny edge density analysis. Frequency-domain analysis detects spectral artifacts characteristic of manipulated content via the two-dimensional fast Fourier transform (2D FFT). A convolutional neural network (CNN) with a channel-wise spatial attention mechanism focuses representational capacity on discriminative lip-region discrepancies in paired video frames. The three modalities are individually insufficient but jointly discriminative, a property confirmed by systematic ablation studies across all component combinations. Evaluated on the GRID Lipreading Database, LipSyncAuthenticityNet achieves a receiver operating characteristic area under the curve (ROC-AUC) of 0.9918 and outperforms ResNet-18 and MobileNetV2 baselines by up to 4.9 percentage points, while maintaining a compact 2.55M-parameter architecture with sub-4 ms graphics processing unit (GPU) inference latency suitable for real-time forensic deployment. An integrated explainability module provides interpretable feature-level evidence for every classification decision, supporting transparent and auditable forensic application.
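The spatial branch's inter-frame consistency cues (Pearson correlation, pixel intensity difference, edge density) can be sketched roughly as follows. This is an illustrative NumPy-only sketch, not the paper's implementation: the function name and the `edge_thresh` parameter are assumptions, and a Sobel-style gradient edge density stands in for the Canny detector the paper uses.

```python
import numpy as np

def spatial_consistency_features(frame_a, frame_b, edge_thresh=0.2):
    """Inter-frame consistency cues for a pair of grayscale lip-region
    crops (2-D float arrays in [0, 1]): Pearson correlation, mean
    absolute pixel-intensity difference, and edge-density change."""
    a, b = frame_a.ravel(), frame_b.ravel()
    # Pearson correlation between corresponding pixels of the two frames.
    pearson = np.corrcoef(a, b)[0, 1]
    # Mean absolute intensity difference.
    pix_diff = np.abs(a - b).mean()

    def edge_density(img):
        # Gradient-magnitude edge density; a stand-in for Canny
        # (keeps the sketch NumPy-only).
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        return (mag > edge_thresh).mean()

    edge_delta = abs(edge_density(frame_a) - edge_density(frame_b))
    return pearson, pix_diff, edge_delta
```

Authentic footage tends to show high correlation and small intensity/edge deltas between adjacent frames, while lip-sync re-animation introduces localized discontinuities in all three cues.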
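The frequency-domain idea can be illustrated with a simple scalar: the fraction of 2D-FFT energy outside a low-frequency disc. Blending and upsampling in manipulated lip regions often leave characteristic spectral artifacts, and a high-frequency energy ratio is one minimal summary of them. The function and the `cutoff_frac` parameter are illustrative assumptions, not the paper's exact descriptor.

```python
import numpy as np

def highfreq_energy_ratio(frame, cutoff_frac=0.25):
    """Share of 2-D FFT power-spectrum energy outside a centered
    low-frequency disc of radius cutoff_frac * min(H, W)."""
    spec = np.fft.fftshift(np.fft.fft2(frame))  # center the DC bin
    power = np.abs(spec) ** 2
    h, w = frame.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - cy, xx - cx)
    low = radius <= cutoff_frac * min(h, w)
    total = power.sum()
    return float(power[~low].sum() / total) if total > 0 else 0.0
```

A smooth, natural frame concentrates energy near the center of the shifted spectrum (ratio near 0), whereas sharp periodic artifacts push energy outward (ratio near 1).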
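One common way to realize channel-wise plus spatial attention is the CBAM-style channel-then-spatial gating sketched below. This is a toy NumPy forward pass under stated assumptions, not the paper's architecture: `w_ch` and `w_sp` stand in for learned layers (CBAM uses small MLPs and a convolution), and all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_spatial_attention(feat, w_ch, w_sp):
    """Channel-then-spatial attention over a feature map of shape
    (C, H, W). w_ch: (C, C) channel-gate weights; w_sp: (2,) weights
    over the mean/max spatial pooling maps."""
    # Channel attention: squeeze spatial dims, gate each channel.
    squeeze = feat.mean(axis=(1, 2))                 # (C,)
    ch_gate = sigmoid(w_ch @ squeeze)                # (C,) in (0, 1)
    feat = feat * ch_gate[:, None, None]
    # Spatial attention: pool across channels, gate each location.
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    sp_gate = sigmoid(np.tensordot(w_sp, pooled, axes=1))     # (H, W)
    return feat * sp_gate[None]
```

The effect relevant here is that the gates re-weight the feature map so that locations with lip-region discrepancies between paired frames dominate the downstream classification.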