Light-MLLMAD: A Lightweight Multimodal Large Language Model for One-Shot Industrial Visual Anomaly Detection


Abstract

Industrial visual anomaly detection plays a pivotal role in ensuring product quality and operational safety across manufacturing, energy, and precision engineering sectors. However, most deep learning approaches rely on extensive defect datasets, making them unsuitable for real-world scenarios where only a single defective instance may be available. To address this challenge, this paper introduces Light-MLLMAD, a Lightweight Multimodal Large Language Model framework designed for one-shot industrial anomaly detection. The proposed model integrates a compact vision encoder with parameter-efficient adapter layers and a text-guided reasoning module, enabling efficient learning from minimal examples. By employing prompt-conditioned anomaly grounding, Light-MLLMAD leverages natural-language prompts to describe contextual attributes such as texture, color deviation, or surface irregularity, thus enhancing interpretability and localization accuracy. A contrastive embedding regularization strategy further ensures robust separation between normal and anomalous features even with limited samples. Extensive experiments conducted on benchmark datasets—covering metallic surfaces, printed circuit boards, and industrial components—demonstrate that Light-MLLMAD achieves superior detection accuracy while reducing computational cost by over 60% compared to traditional vision-language models. The system also achieves near real-time inference on edge hardware, confirming its deployability in factory settings. Overall, the proposed framework bridges the gap between multimodal reasoning and lightweight industrial implementation, offering an interpretable, resource-efficient, and scalable approach for one-shot visual anomaly detection.
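The contrastive embedding regularization described above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the margin-based formulation below (a centroid-pull term for normal features plus a hinge-style push term for anomalous features) is an assumption for illustration only; the function names, the `margin` parameter, and the centroid formulation are all hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_regularizer(normal_embs, anomalous_embs, margin=1.0):
    """Illustrative contrastive embedding regularizer (not the paper's
    exact loss): pull normal embeddings toward their centroid, and
    penalize anomalous embeddings that fall within `margin` of it."""
    dim = len(normal_embs[0])
    centroid = [sum(e[i] for e in normal_embs) / len(normal_embs)
                for i in range(dim)]
    # Attraction term: mean squared distance of normal samples to the centroid
    pull = sum(l2_distance(e, centroid) ** 2
               for e in normal_embs) / len(normal_embs)
    # Repulsion term: hinge penalty only when an anomaly sits inside the margin
    push = sum(max(0.0, margin - l2_distance(e, centroid)) ** 2
               for e in anomalous_embs) / len(anomalous_embs)
    return pull + push
```

Under this formulation, a well-separated anomaly contributes no repulsion penalty, while an anomaly embedded near the normal cluster inflates the loss, which is the separation behavior the abstract attributes to the regularizer in the one-shot setting.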
