ConVLM: Concept-Guided Vision-Language Models for Explainable Dermatological Diagnosis
Abstract
Accurate and interpretable diagnosis of dermatological lesions is crucial but challenging due to data scarcity, morphological diversity, and the "black-box" nature of traditional deep learning models. To address these limitations, we propose ConVLM (Concept-Guided Vision-Language Model for Dermatology), a novel framework that leverages Large Vision-Language Models (LVLMs) and Large Language Models (LLMs) for concept-guided multimodal reasoning. ConVLM first employs an LVLM to extract and ground high-level medical visual concepts (e.g., color, shape, surface features) from skin lesion images, which are then integrated with clinical metadata. An LLM subsequently reasons over these multimodal concepts to produce a final diagnosis accompanied by a natural language explanation that articulates the underlying rationale. Experiments on the challenging SkinCon dataset demonstrate that ConVLM not only achieves competitive or superior diagnostic performance (87.21% BACC, 81.05% F1) but also significantly enhances model interpretability, as validated by human evaluation with dermatologists (4.6/5 clarity, 4.3/5 utility). Furthermore, ConVLM exhibits strong few-shot and zero-shot generalization (45.1% BACC zero-shot), which is crucial for rare conditions. Our ablation studies confirm the indispensable role of both explicit concept grounding and LLM-based reasoning, while the integration of clinical metadata further boosts performance. ConVLM represents a significant step towards trustworthy and clinically applicable AI systems for dermatology.
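To make the described two-stage design concrete, the sketch below illustrates one plausible way such a concept-then-reason pipeline could be wired: an LVLM call grounds visual concepts, the concepts are merged with clinical metadata into a prompt, and an LLM answers with a diagnosis and rationale. This is a minimal illustration only; the function names, concept vocabulary, and output format (extract_concepts, ConceptReport, diagnose, etc.) are assumptions, not the paper's actual interface, and the model calls are replaced by stand-in returns.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical illustration of a ConVLM-style pipeline as described in the abstract:
# (1) an LVLM grounds visual concepts in the lesion image, (2) the concepts are
# merged with clinical metadata, (3) an LLM reasons over the combined concepts to
# produce a diagnosis plus a natural-language rationale.
# All names and values below are assumptions for illustration, not the paper's API.

@dataclass
class ConceptReport:
    concepts: Dict[str, str]                       # e.g. {"color": "brown", "border": "irregular"}
    metadata: Dict[str, str] = field(default_factory=dict)

def extract_concepts(image_path: str) -> ConceptReport:
    """Stage 1 (stand-in): an LVLM would be prompted to name color, shape,
    and surface features visible in the lesion image."""
    # A real system would call the LVLM here; fixed values keep the sketch runnable.
    return ConceptReport(concepts={"color": "variegated brown",
                                   "border": "irregular",
                                   "surface": "slightly raised"})

def build_reasoning_prompt(report: ConceptReport) -> str:
    """Stage 2: serialize grounded concepts and clinical metadata into a prompt."""
    concept_lines = [f"- {k}: {v}" for k, v in report.concepts.items()]
    meta_lines = [f"- {k}: {v}" for k, v in report.metadata.items()]
    return ("Visual concepts:\n" + "\n".join(concept_lines) +
            "\nClinical metadata:\n" + "\n".join(meta_lines) +
            "\nGive the most likely diagnosis and explain your reasoning.")

def diagnose(report: ConceptReport) -> Dict[str, str]:
    """Stage 3 (stand-in): an LLM would answer the reasoning prompt; here we
    only show the expected output structure."""
    prompt = build_reasoning_prompt(report)
    return {"diagnosis": "<LLM diagnosis>",
            "rationale": f"<LLM explanation grounded in:\n{prompt}>"}

if __name__ == "__main__":
    report = extract_concepts("lesion.jpg")
    report.metadata = {"age": "54", "site": "upper back"}
    print(diagnose(report))
```

One design point this sketch tries to reflect is that the intermediate concept report is explicit and human-readable, which is what allows the final explanation to be traced back to named visual features rather than to opaque image embeddings.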