CLIP-CMDF Enhanced Vision Language Models with Novel GAN for Hematological Analysis: A Text-Guided White Blood Cell Identification Framework

Abstract

This paper presents a novel approach combining CLIP (Contrastive Language-Image Pre-training) with a CMDF (Cross-Modal Dynamic Filters) methodology, enhanced by two specialized Generative Adversarial Networks (a Saliency-Consistent Cycle GAN and a Policy-Augmented Robust GAN), to address text-guided white blood cell classification on the Raabin dataset. Our hybrid framework tackles the challenging problem of associating arbitrary descriptive sentences with specific leukocyte types, including morphologically complex cells such as basophils and eosinophils. The proposed CLIP-CMDF architecture leverages vision-language understanding while incorporating multi-scale feature extraction for semantic-visual alignment. A novel GAN architecture generates balanced text-image pairs to mitigate class imbalance in the dataset. Experimental results demonstrate 80% accuracy, competitive with state-of-the-art medical vision-language models including Med-PaLM M (78.5%) and GPT-4V Medical (77.2%). This research establishes a new benchmark for text-guided hematological analysis and provides a reproducible framework for sentence-to-cell-type association tasks. The implementation source code is accessible via the following link.
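For illustration, the following is a minimal, hypothetical PyTorch sketch of the cross-modal dynamic filtering idea the abstract describes: a text embedding is projected into per-sample depthwise convolution kernels that modulate the image feature map, and the filtered features are scored by cosine similarity against class-prompt embeddings. All names (`CrossModalDynamicFilter`, `filter_gen`), shapes, and the stand-in random tensors are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of a text-conditioned (cross-modal) dynamic filter.
# Stand-in random tensors replace real CLIP image/text encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDynamicFilter(nn.Module):
    def __init__(self, embed_dim=512, feat_channels=256, kernel_size=3):
        super().__init__()
        self.feat_channels = feat_channels
        self.kernel_size = kernel_size
        # Project the text embedding into one depthwise kernel per channel.
        self.filter_gen = nn.Linear(embed_dim, feat_channels * kernel_size**2)
        # Project pooled, filtered image features into the joint embedding space.
        self.proj = nn.Linear(feat_channels, embed_dim)

    def forward(self, image_feats, text_emb):
        # image_feats: (B, C, H, W) backbone features; text_emb: (B, D).
        b, c, h, w = image_feats.shape
        k = self.kernel_size
        weights = self.filter_gen(text_emb).view(b * c, 1, k, k)
        # Grouped conv applies each sample's text-conditioned filters
        # to that sample's own feature channels.
        feats = image_feats.reshape(1, b * c, h, w)
        feats = F.conv2d(feats, weights, padding=k // 2, groups=b * c)
        feats = feats.view(b, c, h, w).mean(dim=(2, 3))  # global average pool
        return F.normalize(self.proj(feats), dim=-1)

# Usage with stand-in tensors in place of real CLIP encoders:
cmdf = CrossModalDynamicFilter()
image_feats = torch.randn(4, 256, 14, 14)               # backbone feature map
text_emb = F.normalize(torch.randn(4, 512), dim=-1)     # sentence embedding
class_embs = F.normalize(torch.randn(5, 512), dim=-1)   # 5 WBC class prompts
logits = cmdf(image_feats, text_emb) @ class_embs.t()   # cosine-similarity logits
print(logits.shape)  # torch.Size([4, 5])
```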
