SMILES Challenge 2025: Multitask Learning with Contrastive and Natural Language Generation for Enhanced Medical Image Classification
Abstract
This article proposes a novel multitask learning framework that integrates contrastive learning and natural language generation (NLG) to enhance medical image classification and report generation. The goal is to improve disease classification accuracy and interpretability in medical diagnostics. The model architecture consists of a Vision Transformer (ViT) as the visual encoder, a transformer-based text encoder, and a multimodal decoder. The visual encoder processes medical images, while the text encoder handles disease-related text prompts. These components are trained jointly with an image-text contrastive loss and a language-generation loss. Evaluations on the MIMIC-CXR and CheXpert datasets show that the model with NLG (Plain + NLG) outperforms the baseline contrastive learning model (Plain) in disease classification. For example, on MIMIC-CXR, accuracy for Atelectasis increased from 17.44% (Plain) to 41.5% (Plain + NLG), and for Cardiomegaly, it improved from 19.25% to 47.4%. On CheXpert, accuracy for Atelectasis increased from 12.5% to 58.5%, and for Pleural Effusion, from 61.10% to 64.0%. The model also demonstrated improvements in F1 scores, particularly for complex diseases such as Cardiomegaly and Consolidation. The proposed multitask framework effectively combines contrastive learning with NLG, leading to improved disease classification and medical report generation. This approach has potential clinical applications by enhancing the interpretability and accuracy of AI-assisted medical decision-making.
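The abstract describes joint training with an image-text contrastive loss and a language-generation loss. Below is a minimal PyTorch sketch of how such a combined objective might be computed; the function name, tensor shapes, temperature, loss weighting, and padding convention are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(img_feat, txt_feat, token_logits, report_ids,
                   temperature=0.07, nlg_weight=1.0, pad_id=0):
    """Combined image-text contrastive (InfoNCE) + report-generation loss.

    Assumed shapes (hypothetical, for illustration):
      img_feat, txt_feat : (B, D) pooled embeddings from the ViT visual
                           encoder and the transformer text encoder
      token_logits       : (B, T, V) next-token logits from the multimodal decoder
      report_ids         : (B, T+1) ground-truth report token ids
    """
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)

    # Symmetric InfoNCE: matching image-text pairs sit on the diagonal,
    # all other pairs in the batch serve as negatives.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets)) / 2

    # Teacher-forced next-token cross-entropy over the generated report.
    nlg = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                          report_ids[:, 1:].reshape(-1),
                          ignore_index=pad_id)

    return contrastive + nlg_weight * nlg

# Usage with random tensors standing in for encoder/decoder outputs:
B, D, T, V = 4, 512, 32, 30522
loss = multitask_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, V), torch.randint(1, V, (B, T + 1)))
print(float(loss))
```

In the setup the abstract describes, the contrastive branch aligns the ViT image embedding with the disease-prompt embedding, while the NLG branch trains the multimodal decoder to emit the report text; summing the two losses trains all components jointly.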