Enhancing Multimodal Glaucoma Screening through Attention-Guided Vision-Language Fusion
Abstract
Glaucoma, a leading cause of irreversible blindness, demands early detection for effective management. Traditional diagnostic methods relying on single-modality inputs often suffer from limitations such as incomplete context and high inter-observer variability. This study introduces a novel multimodal deep learning framework that integrates retinal fundus images with corresponding clinical texts for automated glaucoma screening. Leveraging a vision transformer (ViT)-based CLIP visual encoder and a domain-specific Bio_ClinicalBERT text encoder, our approach extracts semantically rich features from both modalities. An attention-based fusion module adaptively weights visual and textual cues, while a contrastive alignment loss enhances cross-modal consistency. Parameter-efficient fine-tuning strategies are employed to update only the top transformer layers, reducing computational overhead. Experiments on the Harvard-FairVLMed dataset demonstrate superior performance over baselines, achieving an AUC of 87.25% and an accuracy of 79.20%. Ablation studies validate the contributions of adaptive fusion, contrastive learning, and domain-specific language modeling, showcasing the framework's potential for real-world medical vision-language applications. The code and data are available at https://github.com/LXY12370/Multimodal.
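To make the described pipeline concrete, the sketch below illustrates the components named in the abstract: a CLIP ViT visual encoder and a Bio_ClinicalBERT text encoder, an attention-based gate that adaptively weights the two modalities, a CLIP-style contrastive alignment term, and parameter-efficient fine-tuning that unfreezes only the top transformer layers. This is a minimal illustration, not the authors' released code (see the linked repository): the public checkpoints openai/clip-vit-base-patch32 and emilyalsentzer/Bio_ClinicalBERT, the gating form, the contrastive-loss weight, and the number of unfrozen layers are all assumptions.

```python
# Minimal sketch of the abstract's fusion pipeline (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, CLIPVisionModel


class AttentionFusionClassifier(nn.Module):
    def __init__(self, num_classes: int = 2, proj_dim: int = 512, trainable_top_layers: int = 2):
        super().__init__()
        # Assumed public checkpoints standing in for the encoders named in the abstract.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

        # Parameter-efficient fine-tuning: freeze everything, then unfreeze the
        # top `trainable_top_layers` transformer blocks of each encoder.
        for p in self.vision.parameters():
            p.requires_grad = False
        for p in self.text.parameters():
            p.requires_grad = False
        for layer in self.vision.vision_model.encoder.layers[-trainable_top_layers:]:
            for p in layer.parameters():
                p.requires_grad = True
        for layer in self.text.encoder.layer[-trainable_top_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

        # Project both modalities into a shared space.
        self.img_proj = nn.Linear(self.vision.config.hidden_size, proj_dim)
        self.txt_proj = nn.Linear(self.text.config.hidden_size, proj_dim)

        # Attention-based fusion: a small gate yields adaptive per-modality weights.
        self.gate = nn.Sequential(nn.Linear(2 * proj_dim, proj_dim), nn.Tanh(), nn.Linear(proj_dim, 2))
        self.classifier = nn.Linear(proj_dim, num_classes)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style temperature

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        img = self.img_proj(self.vision(pixel_values=pixel_values).pooler_output)
        txt = self.txt_proj(self.text(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0])

        # Adaptively weight visual and textual cues, then classify the fused feature.
        weights = torch.softmax(self.gate(torch.cat([img, txt], dim=-1)), dim=-1)  # (B, 2)
        fused = weights[:, :1] * img + weights[:, 1:] * txt
        logits = self.classifier(fused)
        if labels is None:
            return logits

        # CLIP-style contrastive alignment between paired image and text embeddings.
        img_n, txt_n = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        sim = self.logit_scale.exp() * img_n @ txt_n.t()
        targets = torch.arange(sim.size(0), device=sim.device)
        contrastive = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
        loss = F.cross_entropy(logits, labels) + 0.1 * contrastive  # 0.1 weight is an assumption
        return logits, loss


if __name__ == "__main__":
    # Toy forward pass with a random tensor standing in for a preprocessed fundus image.
    tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AttentionFusionClassifier()
    enc = tok(["cup-to-disc ratio 0.7, suspect glaucoma"], return_tensors="pt", padding=True)
    pixels = torch.randn(1, 3, 224, 224)
    logits, loss = model(pixels, enc["input_ids"], enc["attention_mask"], labels=torch.tensor([1]))
    print(logits.shape, loss.item())
```

The gate here is a simple learned softmax over the two projected modality embeddings; the paper's fusion module and loss weighting may differ, so the repository should be treated as the authoritative implementation.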