ConVLM: Concept-Guided Vision-Language Models for Explainable Dermatological Diagnosis
Abstract
Accurate and interpretable diagnosis of dermatological lesions is crucial but challenging due to data scarcity, morphological diversity, and the "black-box" nature of traditional deep learning models. To address these limitations, we propose ConVLM (Concept-Guided Vision-Language Model for Dermatology), a novel framework that leverages Large Vision-Language Models (LVLMs) and Large Language Models (LLMs) for concept-guided multimodal reasoning. ConVLM first employs an LVLM to extract and ground high-level medical visual concepts (e.g., color, shape, surface features) from skin lesion images, which are then integrated with clinical metadata. An LLM subsequently reasons over these multimodal concepts to produce a final diagnosis accompanied by a natural language explanation that articulates the underlying rationale. Experiments on the challenging SkinCon dataset demonstrate that ConVLM not only achieves competitive or superior diagnostic performance (87.21% BACC, 81.05% F1) but also significantly enhances model interpretability, as validated by human evaluation with dermatologists (4.6/5 clarity, 4.3/5 utility). Furthermore, ConVLM exhibits strong few-shot and zero-shot generalization (45.1% BACC zero-shot), which is crucial for rare conditions. Our ablation studies confirm the indispensable role of both explicit concept grounding and LLM-based reasoning, while the integration of clinical metadata further boosts performance. ConVLM represents a significant step towards trustworthy and clinically applicable AI systems for dermatology.
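To make the described two-stage design concrete, the sketch below illustrates one plausible way such a concept-then-reason pipeline could be wired: an LVLM call grounds visual concepts, the concepts are merged with clinical metadata into a prompt, and an LLM answers with a diagnosis and rationale. This is a minimal illustration only; the function names, concept vocabulary, and output format (extract_concepts, ConceptReport, diagnose, etc.) are assumptions, not the paper's actual interface, and the model calls are replaced by stand-in returns.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical illustration of a ConVLM-style pipeline as described in the abstract:
# (1) an LVLM grounds visual concepts in the lesion image, (2) the concepts are
# merged with clinical metadata, (3) an LLM reasons over the combined concepts to
# produce a diagnosis plus a natural-language rationale.
# All names and values below are assumptions for illustration, not the paper's API.

@dataclass
class ConceptReport:
    concepts: Dict[str, str]                       # e.g. {"color": "brown", "border": "irregular"}
    metadata: Dict[str, str] = field(default_factory=dict)

def extract_concepts(image_path: str) -> ConceptReport:
    """Stage 1 (stand-in): an LVLM would be prompted to name color, shape,
    and surface features visible in the lesion image."""
    # A real system would call the LVLM here; fixed values keep the sketch runnable.
    return ConceptReport(concepts={"color": "variegated brown",
                                   "border": "irregular",
                                   "surface": "slightly raised"})

def build_reasoning_prompt(report: ConceptReport) -> str:
    """Stage 2: serialize grounded concepts and clinical metadata into a prompt."""
    concept_lines = [f"- {k}: {v}" for k, v in report.concepts.items()]
    meta_lines = [f"- {k}: {v}" for k, v in report.metadata.items()]
    return ("Visual concepts:\n" + "\n".join(concept_lines) +
            "\nClinical metadata:\n" + "\n".join(meta_lines) +
            "\nGive the most likely diagnosis and explain your reasoning.")

def diagnose(report: ConceptReport) -> Dict[str, str]:
    """Stage 3 (stand-in): an LLM would answer the reasoning prompt; here we
    only show the expected output structure."""
    prompt = build_reasoning_prompt(report)
    return {"diagnosis": "<LLM diagnosis>",
            "rationale": f"<LLM explanation grounded in:\n{prompt}>"}

if __name__ == "__main__":
    report = extract_concepts("lesion.jpg")
    report.metadata = {"age": "54", "site": "upper back"}
    print(diagnose(report))
```

One design point this sketch tries to reflect is that the intermediate concept report is explicit and human-readable, which is what allows the final explanation to be traced back to named visual features rather than to opaque image embeddings.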