Light-MLLMAD: A Lightweight Multimodal Large Language Model for One-Shot Industrial Visual Anomaly Detection


Abstract

Industrial visual anomaly detection plays a pivotal role in ensuring product quality and operational safety across manufacturing, energy, and precision engineering sectors. However, most deep learning approaches rely on extensive defect datasets, making them unsuitable for real-world scenarios where only a single defective instance may be available. To address this challenge, this paper introduces Light-MLLMAD, a Lightweight Multimodal Large Language Model framework designed for one-shot industrial anomaly detection. The proposed model integrates a compact vision encoder with parameter-efficient adapter layers and a text-guided reasoning module, enabling efficient learning from minimal examples. By employing prompt-conditioned anomaly grounding, Light-MLLMAD leverages natural-language prompts to describe contextual attributes such as texture, color deviation, or surface irregularity, thus enhancing interpretability and localization accuracy. A contrastive embedding regularization strategy further ensures robust separation between normal and anomalous features even with limited samples. Extensive experiments conducted on benchmark datasets—covering metallic surfaces, printed circuit boards, and industrial components—demonstrate that Light-MLLMAD achieves superior detection accuracy while reducing computational cost by over 60% compared to traditional vision-language models. The system also achieves near real-time inference on edge hardware, confirming its deployability in factory settings. Overall, the proposed framework bridges the gap between multimodal reasoning and lightweight industrial implementation, offering an interpretable, resource-efficient, and scalable approach for one-shot visual anomaly detection.
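The contrastive embedding regularization described above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the margin-based formulation below (a centroid-pull term for normal features plus a hinge-style push term for anomalous features) is an assumption for illustration only; the function names, the `margin` parameter, and the centroid formulation are all hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_regularizer(normal_embs, anomalous_embs, margin=1.0):
    """Illustrative contrastive embedding regularizer (not the paper's
    exact loss): pull normal embeddings toward their centroid, and
    penalize anomalous embeddings that fall within `margin` of it."""
    dim = len(normal_embs[0])
    centroid = [sum(e[i] for e in normal_embs) / len(normal_embs)
                for i in range(dim)]
    # Attraction term: mean squared distance of normal samples to the centroid
    pull = sum(l2_distance(e, centroid) ** 2
               for e in normal_embs) / len(normal_embs)
    # Repulsion term: hinge penalty only when an anomaly sits inside the margin
    push = sum(max(0.0, margin - l2_distance(e, centroid)) ** 2
               for e in anomalous_embs) / len(anomalous_embs)
    return pull + push
```

Under this formulation, a well-separated anomaly contributes no repulsion penalty, while an anomaly embedded near the normal cluster inflates the loss, which is the separation behavior the abstract attributes to the regularizer in the one-shot setting.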
