Exploring the Collaboration Between Vision Models and LLMs for Enhanced Image Classification
Abstract
This paper proposes a framework that combines vision and language models to improve image classification, evaluated on the CIFAR-10 and CIFAR-100 datasets. The task is decomposed into two stages: image classification followed by visual description generation. The framework employs BEiT and Swin models as state-of-the-art, task-specific components. We selected the best publicly available image classification checkpoints, which achieve 99.00% accuracy on CIFAR-10 and 92.01% on CIFAR-100. For dense, contextually rich textual descriptions we used BLIP. The expert models performed well on their respective tasks with minimal noisy data. When applied as a text classifier over the synthesized descriptions, BART achieved new state-of-the-art accuracies of 99.73% on CIFAR-10 and 98.38% on CIFAR-100. These results demonstrate that our integrated, hierarchically decomposed vision-and-language pipeline surpasses existing state-of-the-art results on these common classification benchmarks. The full framework, along with the classified images and generated datasets, is available at https://github.com/bhavyarupani/LLM-Img-Classification.
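The abstract does not spell out how the caption-then-classify stage is wired together. The following is a minimal sketch of that idea, assuming Hugging Face transformers pipelines and the Salesforce/blip-image-captioning-base and facebook/bart-large-mnli checkpoints; those specific checkpoints, the prompt-free captioning call, and the CIFAR-10 label set used as zero-shot candidates are assumptions for illustration, not details confirmed by the abstract.

```python
# Minimal sketch: describe an image with BLIP, then label the description with BART.
from transformers import pipeline
from PIL import Image

# Captioning model (assumed checkpoint; the abstract only names the BLIP family).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# BART fine-tuned on MNLI, used as a zero-shot text classifier over the caption
# (assumed checkpoint; the abstract only names BART).
text_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# CIFAR-10 class names used as candidate labels for the zero-shot classifier.
CIFAR10_LABELS = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]

def classify_via_description(image_path: str) -> str:
    """Caption the image with BLIP, then assign the caption a CIFAR-10 label with BART."""
    image = Image.open(image_path).convert("RGB")
    caption = captioner(image)[0]["generated_text"]
    result = text_classifier(caption, candidate_labels=CIFAR10_LABELS)
    return result["labels"][0]  # labels are returned sorted by score, highest first

if __name__ == "__main__":
    # Hypothetical input file for demonstration.
    print(classify_via_description("example.png"))
```

A usage note: in this sketch the caption text is the only signal passed to BART, which mirrors the two-stage decomposition described above; the paper's actual framework and prompts are available in the linked repository.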