FusionNeXt-XtremeNet: A Deep Ensemble Model with LLM-Aided Clinical Report Generation for Dermoscopic Image Classification
Abstract
This paper presents FusionNeXt-XtremeNet, a novel deep ensemble architecture that combines ConvNeXt, Vision Transformer (ViT), and EfficientNetV2 for classifying dermoscopic images based on acquisition types. To improve clinical interpretability, a GPT-2-based Large Language Model (LLM) enhanced by the Language-augmented Multimodal Attention (LeMMA) mechanism is integrated to generate structured diagnostic reports. The model was evaluated on the ISIC 2020--2022 dataset of 1,767 images and achieves state-of-the-art performance in binary classification (94.1% accuracy, 94.1% F1-score, 0.969 ROC-AUC), three-class classification (90.6% accuracy, 90.8% F1-score), and four-class classification (87.6% accuracy, 87.8% F1-score). The LeMMA-augmented GPT-2 generates clinically relevant reports with a BLEU score of 0.85, reduces generation time by 15.2% compared to the baseline, and achieves high dermatologist evaluation scores (accuracy: 4.3/5, relevance: 4.4/5). Grad-CAM visualisations demonstrate strong alignment with clinical features (r=0.82, p<0.001), with 85% of attention regions corresponding to dermatologically significant patterns. This dual framework not only enhances prediction reliability but also bridges the gap between black-box AI models and clinical usability through explainable, text-based outputs.
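The abstract does not specify how the three backbone predictions are fused. A minimal soft-voting sketch, assuming the ensemble averages per-model softmax probabilities (the fusion rule, weights, and the toy probability values below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Fuse per-model class probabilities by (optionally weighted) averaging.

    prob_list: list of (n_samples, n_classes) softmax outputs, one per model.
    Returns the fused class predictions and the fused probability matrix.
    """
    probs = np.stack(prob_list)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    weights = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    fused = (probs * weights).sum(axis=0)
    return fused.argmax(axis=1), fused

# Hypothetical softmax outputs from the three backbones
# for 2 samples and 3 acquisition-type classes.
convnext = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
vit      = np.array([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3]])
effnetv2 = np.array([[0.8, 0.1, 0.1], [0.3, 0.4, 0.3]])

preds, fused = soft_vote([convnext, vit, effnetv2])
```

With equal weights this reduces to a plain average of the three probability vectors; per-model weights could instead be tuned on a validation split.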