Lip-Sync Authenticity Detection Using Spatial, Spectral, and Deep Learning-Based Feature Fusion
Abstract
Lip-sync deepfakes, which re-animate a speaker's mouth region to match arbitrary audio, present a distinct and underexamined forensic challenge: unlike full face-swap manipulations, they preserve speaker identity and confine modification to the perioral region, causing general-purpose deepfake detectors to fail. Detecting such manipulations requires cues that are simultaneously sensitive to subtle spatial inconsistencies, frequency-domain artifacts, and inter-frame temporal discontinuities. This paper proposes LipSyncAuthenticityNet, a multi-modal lip-sync authenticity detection framework that fuses spatial, spectral, and attention-augmented deep learning features into a unified classification pipeline. Spatial features capture inter-frame consistency via Pearson correlation, pixel intensity difference, and Canny edge density analysis. Frequency-domain analysis detects spectral artifacts characteristic of manipulated content via a two-dimensional fast Fourier transform (2D FFT). A convolutional neural network (CNN) with a channel-wise spatial attention mechanism focuses representational capacity on discriminative lip-region discrepancies in paired video frames. The three modalities are individually insufficient but jointly discriminative, a property confirmed by systematic ablation studies across all component combinations. Evaluated on the GRID Lipreading Database, LipSyncAuthenticityNet achieves a receiver operating characteristic area under the curve (ROC-AUC) of 0.9918 and outperforms ResNet-18 and MobileNetV2 baselines by up to 4.9 percentage points, while maintaining a compact 2.55M-parameter architecture with sub-4 ms graphics processing unit (GPU) inference latency suitable for real-time forensic deployment. An integrated explainability module provides interpretable feature-level evidence for every classification decision, supporting transparent and auditable forensic application.
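The spatial cues named in the abstract (Pearson correlation, pixel intensity difference, and edge density between consecutive lip-region frames) can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the function name `spatial_features`, the gradient-magnitude threshold standing in for the Canny detector, and the threshold value 25.0 are all assumptions for the sketch.

```python
import numpy as np

def spatial_features(frame_a, frame_b):
    """Inter-frame consistency cues for a pair of grayscale lip-region crops.

    frame_a, frame_b: 2-D uint8 arrays of equal shape (consecutive frames).
    Returns (pearson_corr, mean_abs_diff, edge_density).
    """
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    # Pearson correlation of pixel intensities across the frame pair
    corr = float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
    # Mean absolute inter-frame intensity difference
    diff = float(np.mean(np.abs(a - b)))
    # Edge density: fraction of pixels whose gradient magnitude exceeds a
    # threshold (a simple stand-in for the Canny detector named in the paper)
    gy, gx = np.gradient(b)
    density = float(np.mean(np.hypot(gx, gy) > 25.0))
    return corr, diff, density
```

Authentic consecutive frames should score high correlation and low intensity difference, while re-animated mouth regions tend to break that pattern.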
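The frequency-domain branch described above can be illustrated with a radially averaged 2D FFT magnitude profile, a common way to summarize spectral artifacts. The function name `spectral_signature`, the band count, and the log-magnitude pooling are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def spectral_signature(frame, n_bands=8):
    """Radially averaged log-magnitude spectrum of a grayscale frame.

    Manipulated lip regions often leave anomalies in particular frequency
    bands; the returned vector summarizes energy from low (index 0) to
    high (index n_bands - 1) spatial frequencies.
    """
    # Centered 2D FFT and log-compressed magnitude
    spec = np.fft.fftshift(np.fft.fft2(frame.astype(np.float64)))
    logmag = np.log1p(np.abs(spec))
    # Distance of every frequency bin from the spectrum center
    h, w = frame.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)
    r_max = r.max()
    # Average log-magnitude inside each annular frequency band
    bands = [
        logmag[(r >= i * r_max / n_bands) & (r < (i + 1) * r_max / n_bands)].mean()
        for i in range(n_bands)
    ]
    return np.array(bands)
```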
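The channel-wise spatial attention idea mentioned in the abstract can be sketched as a gate that pools a feature map across channels and re-weights every spatial location. The fixed 0.5/0.5 mixing weights below stand in for the learned convolution a real CNN would use; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Re-weight a (C, H, W) feature map by a spatial attention mask.

    Channel-wise mean and max pooling produce two (H, W) maps; a fixed
    equal-weight mix (a stand-in for a learned conv layer) is squashed
    through a sigmoid to give a per-location gate in (0, 1).
    """
    avg = feat.mean(axis=0)          # (H, W) channel-average map
    mx = feat.max(axis=0)            # (H, W) channel-max map
    attn = sigmoid(0.5 * avg + 0.5 * mx)
    # Broadcast the gate over all channels, emphasizing discriminative
    # lip-region locations and suppressing the rest
    return feat * attn[None, :, :]
```

In the full network, the gate weights are learned end-to-end so that the mask concentrates on lip-region discrepancies between the paired frames.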