Lip-Sync Authenticity Detection Using Spatial, Spectral, and Deep Learning-Based Feature Fusion
Abstract
Lip-sync deepfakes, which re-animate a speaker's mouth region to match arbitrary audio, present a distinct and underexamined forensic challenge: unlike full face-swap manipulations, they preserve speaker identity and confine modification to the perioral region, causing general-purpose deepfake detectors to fail. Detecting such manipulations requires cues that are simultaneously sensitive to subtle spatial inconsistencies, frequency-domain artifacts, and inter-frame temporal discontinuities. This paper proposes LipSyncAuthenticityNet, a multi-modal lip-sync authenticity detection framework that fuses spatial, spectral, and attention-augmented deep learning features into a unified classification pipeline. Spatial features capture inter-frame consistency via Pearson correlation, pixel intensity difference, and Canny edge density analysis. Frequency-domain analysis detects spectral artifacts characteristic of manipulated content via a two-dimensional fast Fourier transform (2D FFT). A convolutional neural network (CNN) with a channel-wise spatial attention mechanism focuses representational capacity on discriminative lip-region discrepancies in paired video frames. The three modalities are individually insufficient but jointly discriminative, a property confirmed by systematic ablation studies across all component combinations. Evaluated on the GRID Lipreading Database, LipSyncAuthenticityNet achieves a receiver operating characteristic area under the curve (ROC-AUC) of 0.9918 and outperforms ResNet-18 and MobileNetV2 baselines by up to 4.9 percentage points, while maintaining a compact 2.55M-parameter architecture with sub-4 ms graphics processing unit (GPU) inference latency suitable for real-time forensic deployment. An integrated explainability module provides interpretable feature-level evidence for every classification decision, supporting transparent and auditable forensic application.
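The spatial cues named in the abstract (Pearson correlation, pixel intensity difference, and edge density between consecutive lip-region frames) can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the function name `spatial_features`, the gradient-magnitude threshold standing in for the Canny detector, and the threshold value 25.0 are all assumptions for the sketch.

```python
import numpy as np

def spatial_features(frame_a, frame_b):
    """Inter-frame consistency cues for a pair of grayscale lip-region crops.

    frame_a, frame_b: 2-D uint8 arrays of equal shape (consecutive frames).
    Returns (pearson_corr, mean_abs_diff, edge_density).
    """
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    # Pearson correlation of pixel intensities across the frame pair
    corr = float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
    # Mean absolute inter-frame intensity difference
    diff = float(np.mean(np.abs(a - b)))
    # Edge density: fraction of pixels whose gradient magnitude exceeds a
    # threshold (a simple stand-in for the Canny detector named in the paper)
    gy, gx = np.gradient(b)
    density = float(np.mean(np.hypot(gx, gy) > 25.0))
    return corr, diff, density
```

Authentic consecutive frames should score high correlation and low intensity difference, while re-animated mouth regions tend to break that pattern.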
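The frequency-domain branch described above can be illustrated with a radially averaged 2D FFT magnitude profile, a common way to summarize spectral artifacts. The function name `spectral_signature`, the band count, and the log-magnitude pooling are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def spectral_signature(frame, n_bands=8):
    """Radially averaged log-magnitude spectrum of a grayscale frame.

    Manipulated lip regions often leave anomalies in particular frequency
    bands; the returned vector summarizes energy from low (index 0) to
    high (index n_bands - 1) spatial frequencies.
    """
    # Centered 2D FFT and log-compressed magnitude
    spec = np.fft.fftshift(np.fft.fft2(frame.astype(np.float64)))
    logmag = np.log1p(np.abs(spec))
    # Distance of every frequency bin from the spectrum center
    h, w = frame.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)
    r_max = r.max()
    # Average log-magnitude inside each annular frequency band
    bands = [
        logmag[(r >= i * r_max / n_bands) & (r < (i + 1) * r_max / n_bands)].mean()
        for i in range(n_bands)
    ]
    return np.array(bands)
```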
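The channel-wise spatial attention idea mentioned in the abstract can be sketched as a gate that pools a feature map across channels and re-weights every spatial location. The fixed 0.5/0.5 mixing weights below stand in for the learned convolution a real CNN would use; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Re-weight a (C, H, W) feature map by a spatial attention mask.

    Channel-wise mean and max pooling produce two (H, W) maps; a fixed
    equal-weight mix (a stand-in for a learned conv layer) is squashed
    through a sigmoid to give a per-location gate in (0, 1).
    """
    avg = feat.mean(axis=0)          # (H, W) channel-average map
    mx = feat.max(axis=0)            # (H, W) channel-max map
    attn = sigmoid(0.5 * avg + 0.5 * mx)
    # Broadcast the gate over all channels, emphasizing discriminative
    # lip-region locations and suppressing the rest
    return feat * attn[None, :, :]
```

In the full network, the gate weights are learned end-to-end so that the mask concentrates on lip-region discrepancies between the paired frames.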