SMILES Challenge 2025: Multitask Learning with Contrastive and Natural Language Generation for Enhanced Medical Image Classification
Abstract
This article proposes a novel multitask learning framework that integrates contrastive learning and natural language generation (NLG) to enhance medical image classification and report generation. The goal is to improve disease classification accuracy and interpretability in medical diagnostics. The model architecture consists of a Vision Transformer (ViT) as the visual encoder, a transformer-based text encoder, and a multimodal decoder. The visual encoder processes medical images, while the text encoder handles disease-related text prompts. These components are trained jointly with an image-text contrastive loss and a language-generation loss. Evaluations on the MIMIC-CXR and CheXpert datasets show that the model with NLG (Plain + NLG) outperforms the baseline contrastive learning model (Plain) in disease classification. For example, on MIMIC-CXR, accuracy for Atelectasis increased from 17.44% (Plain) to 41.5% (Plain + NLG), and for Cardiomegaly, it improved from 19.25% to 47.4%. On CheXpert, accuracy for Atelectasis increased from 12.5% to 58.5%, and for Pleural Effusion, from 61.10% to 64.0%. The model also demonstrated improvements in F1 scores, particularly for complex diseases such as Cardiomegaly and Consolidation. The proposed multitask framework effectively combines contrastive learning with NLG, leading to improved disease classification and medical report generation. This approach has potential clinical applications by enhancing the interpretability and accuracy of AI-assisted medical decision-making.
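The abstract describes joint training with an image-text contrastive loss and a language-generation loss. Below is a minimal PyTorch sketch of how such a combined objective might be computed; the function name, tensor shapes, temperature, loss weighting, and padding convention are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(img_feat, txt_feat, token_logits, report_ids,
                   temperature=0.07, nlg_weight=1.0, pad_id=0):
    """Combined image-text contrastive (InfoNCE) + report-generation loss.

    Assumed shapes (hypothetical, for illustration):
      img_feat, txt_feat : (B, D) pooled embeddings from the ViT visual
                           encoder and the transformer text encoder
      token_logits       : (B, T, V) next-token logits from the multimodal decoder
      report_ids         : (B, T+1) ground-truth report token ids
    """
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)

    # Symmetric InfoNCE: matching image-text pairs sit on the diagonal,
    # all other pairs in the batch serve as negatives.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets)) / 2

    # Teacher-forced next-token cross-entropy over the generated report.
    nlg = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                          report_ids[:, 1:].reshape(-1),
                          ignore_index=pad_id)

    return contrastive + nlg_weight * nlg

# Usage with random tensors standing in for encoder/decoder outputs:
B, D, T, V = 4, 512, 32, 30522
loss = multitask_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, V), torch.randint(1, V, (B, T + 1)))
print(float(loss))
```

In the setup the abstract describes, the contrastive branch aligns the ViT image embedding with the disease-prompt embedding, while the NLG branch trains the multimodal decoder to emit the report text; summing the two losses trains all components jointly.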