MLIF-Net: Multimodal Fusion of Vision Transformers and Large Language Models for AI Image Detection

Abstract

This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), a new architecture for distinguishing AI-generated images from real ones. MLIF-Net combines a Vision Transformer (ViT) with Large Language Models (LLMs) to build a multimodal feature-fusion network that improves the accuracy of AI-generated content detection. The model uses a Cross-Attention Mechanism to fuse visual and semantic features and a Multiscale Contextual Reasoning Layer to capture both global and local image features, while an Adaptive Loss Function improves the consistency and robustness of feature extraction. Experimental results show that MLIF-Net outperforms existing models in accuracy, recall, and Average Precision (AP). The approach enables more accurate detection of AI-generated content and may extend to other generative-content tasks.
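To make the fusion step concrete, the following PyTorch sketch shows one plausible form of the cross-attention mechanism described in the abstract: ViT patch tokens act as queries over LLM-derived semantic embeddings. The class name, feature dimensions, projection layer, and residual/normalization choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of MLIF-Net's cross-attention fusion step.
# Assumed, not taken from the paper: all names, dimensions, and the
# residual + LayerNorm arrangement.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=4096, num_heads=8):
        super().__init__()
        # Project LLM hidden states into the visual feature space.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_patches, vis_dim) from the ViT encoder
        # txt_tokens: (B, N_tokens, txt_dim) from the LLM
        txt = self.txt_proj(txt_tokens)
        # Visual tokens query the semantic tokens.
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        # Residual connection preserves the original visual signal.
        return self.norm(vis_tokens + fused)

# Example usage with a binary real-vs-generated classification head.
fusion = CrossAttentionFusion()
head = nn.Linear(768, 2)
vis = torch.randn(4, 197, 768)    # e.g., ViT-B/16 CLS + patch tokens
txt = torch.randn(4, 32, 4096)    # e.g., LLM caption embeddings
logits = head(fusion(vis, txt).mean(dim=1))  # shape: (4, 2)
```

Under these assumptions, mean-pooling the fused tokens before the classifier is one simple readout; the paper's Multiscale Contextual Reasoning Layer and Adaptive Loss Function would sit downstream of this fusion step.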
