ViT-ConvGDNet: A Vision Transformer–MobileNet Guided Decoder Network for Robust Copy-Move Forgery Detection and Localization

Abstract

Copy-move forgery is a common form of digital image manipulation in which a portion of an image is copied and pasted elsewhere within the same image. It is especially difficult to detect when the copied regions have undergone post-processing operations such as rotation, scaling, or blurring. We propose a new encoder-decoder framework, ViT-ConvGDNet, which integrates the global contextual modeling of Vision Transformers with the efficient local feature extraction of MobileNet's convolutional operations. Sobel edge detection is incorporated into the encoder to sharpen boundary awareness, and Atrous Spatial Pyramid Pooling (ASPP) captures the multi-scale contextual information needed for accurate localization. A layer-wise weighted loss mechanism guides the decoding process, applying a custom mixture of loss functions to every decoder layer to improve prediction accuracy. By exploiting patch-based self-attention, ViT-ConvGDNet learns long-range dependencies effectively and adapts to images of varying scale and complexity. Extensive evaluations on several benchmark datasets, including MICC-F600, MICC-F2000, IMD, Coverage, CoMoFoD, Ardizzone, GRIP, and CASIA, demonstrate the model's superior performance. Experiments show that ViT-ConvGDNet outperforms several current deep learning methods and provides a robust and scalable solution to challenging copy-move forgery detection and localization.
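
The abstract does not give implementation details, but a minimal PyTorch sketch can illustrate two of the named components: a fixed-kernel Sobel branch that injects edge magnitude into the encoder, and a layer-wise weighted (deep-supervision) loss that applies a loss mixture to every decoder stage. The class names, the stage weights, and the BCE + Dice mixture below are assumptions for illustration, not the paper's specification.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SobelEdge(nn.Module):
        """Fixed Sobel filters producing an edge-magnitude map; one plausible
        way to wire the abstract's 'Sobel edge detection' into the encoder."""
        def __init__(self):
            super().__init__()
            gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
            gy = gx.t()  # Sobel y-kernel is the transpose of the x-kernel
            self.register_buffer("kernel", torch.stack([gx, gy]).unsqueeze(1))

        def forward(self, x):
            gray = x.mean(dim=1, keepdim=True)           # (B, 1, H, W)
            g = F.conv2d(gray, self.kernel, padding=1)   # (B, 2, H, W)
            return g.pow(2).sum(dim=1, keepdim=True).sqrt()  # edge magnitude

    class LayerwiseWeightedLoss(nn.Module):
        """Deep-supervision loss: each decoder stage predicts a mask, and its
        BCE + Dice loss is weighted by depth (hypothetical weights; the exact
        mixture used in the paper is not stated in the abstract)."""
        def __init__(self, weights=(0.1, 0.2, 0.3, 0.4)):
            super().__init__()
            self.weights = weights  # shallowest -> deepest decoder stage

        @staticmethod
        def dice_loss(logits, target, eps=1e-6):
            pred = torch.sigmoid(logits)
            inter = (pred * target).sum(dim=(1, 2, 3))
            union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
            return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

        def forward(self, stage_logits, target):
            """stage_logits: list of per-stage mask logits at varying
            resolutions; target: binary forgery mask of shape (B, 1, H, W)."""
            total = 0.0
            for w, logits in zip(self.weights, stage_logits):
                # Resize the ground-truth mask to each stage's resolution.
                t = F.interpolate(target, size=logits.shape[-2:], mode="nearest")
                bce = F.binary_cross_entropy_with_logits(logits, t)
                total = total + w * (bce + self.dice_loss(logits, t))
            return total

Weighting deeper stages more heavily (as in the hypothetical weights above) reflects the usual deep-supervision design: coarse stages receive weaker gradients as regularization, while the final stage dominates the optimization.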
