Cross-Domain Tomato Disease Classification via Flexible Contrastive Clustering in Vision-Language Models
Abstract
Plant disease detection systems face significant challenges in cross-domain generalization, particularly when transitioning from controlled laboratory settings to diverse field conditions. Traditional deep learning approaches exhibit severe performance degradation across different imaging environments, limiting practical deployment in real-world agricultural scenarios. This paper introduces a novel Flexible Contrastive Clustering (FCC) framework for zero-shot tomato disease classification that addresses fundamental generalization limitations through vision-language learning. Unlike standard CLIP’s one-to-one image-text pairing, our method leverages one-to-many relationships in which each disease image is associated with multiple diverse textual descriptions, enabling robust representation learning across linguistic variations. The FCC framework optimizes class-based clustering in the joint embedding space through a specialized loss function that treats all same-class descriptions as positives, enabling effective handling of both seen and unseen disease categories during zero-shot evaluation. We train on PlantDoc data (740 images) and test across four diverse tomato disease datasets totaling 17,313 images, spanning laboratory and field conditions. Experimental results demonstrate substantial improvements over state-of-the-art vision-language models, with an average accuracy of 30.15% and an average weighted F1-score of 28.05% across all test datasets. Our method shows particularly strong performance on field datasets, achieving 59.70% accuracy on FieldPlant and 26.52% on Tomato Village, significantly outperforming existing approaches. Attention visualization analysis reveals effective disease localization for both seen and unseen categories, validating the practical applicability of our approach for real-world agricultural monitoring systems.
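The core departure from CLIP described above, treating every same-class textual description as a positive rather than enforcing a one-to-one pairing, can be illustrated with a multi-positive contrastive loss. The sketch below is a minimal NumPy illustration of this general idea (in the style of supervised contrastive learning), not the paper's actual FCC loss; the function name, signature, and temperature value are assumptions for illustration.

```python
import numpy as np

def multi_positive_contrastive_loss(image_emb, text_emb,
                                    img_labels, txt_labels,
                                    temperature=0.07):
    """Illustrative multi-positive contrastive loss (not the paper's exact
    FCC objective): for each image, every text description sharing its
    class label counts as a positive.

    image_emb: (n_img, d) image embeddings
    text_emb:  (n_txt, d) text embeddings (several descriptions per class)
    img_labels, txt_labels: integer class labels per embedding
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity logits between every image and every text description
    logits = img @ txt.T / temperature                      # (n_img, n_txt)

    # Positive mask: 1 where the text belongs to the image's class
    pos = (img_labels[:, None] == txt_labels[None, :]).astype(float)

    # Numerically stable log-softmax over all text descriptions
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood over each image's positive set
    loss_per_image = -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)
    return loss_per_image.mean()
```

With this formulation, pulling an image toward any of its class's descriptions lowers the loss, which is what allows diverse phrasings of the same disease to cluster together in the joint embedding space.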