Enhancing Multimodal Glaucoma Screening through Attention-Guided Vision-Language Fusion


Abstract

Glaucoma, a leading cause of irreversible blindness, demands early detection for effective management. Traditional diagnostic methods relying on single-modality inputs often suffer from limitations such as incomplete context and high inter-observer variability. This study introduces a novel multimodal deep learning framework that integrates retinal fundus images with corresponding clinical texts for automated glaucoma screening. Leveraging a vision transformer (ViT)-based CLIP visual encoder and a domain-specific Bio_ClinicalBERT text encoder, our approach extracts semantically rich features from both modalities. An attention-based fusion module adaptively weights visual and textual cues, while a contrastive alignment loss enhances cross-modal consistency. Parameter-efficient fine-tuning strategies are employed to update only the top transformer layers, reducing computational overhead. Experiments on the Harvard-FairVLMed dataset demonstrate superior performance over baselines, achieving an AUC of 87.25% and an accuracy of 79.20%. Ablation studies validate the contributions of adaptive fusion, contrastive learning, and domain-specific language modeling, showcasing the framework's potential for real-world medical vision-language applications. The code and data are available at https://github.com/LXY12370/Multimodal.
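As a rough illustration of the fusion and alignment ideas summarized in the abstract, the sketch below shows one plausible way to combine pre-extracted image and text embeddings with an attention-based weighting module and a symmetric contrastive alignment loss. It is a minimal PyTorch sketch, not the authors' released implementation (see the linked repository for that): the module names, the 768-dimensional placeholder features standing in for CLIP-ViT and Bio_ClinicalBERT outputs, and the loss weighting are all assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class AttentionFusionClassifier(nn.Module):
        """Hypothetical sketch: adaptively weight visual and textual embeddings
        with a learned attention gate, then classify (glaucoma vs. non-glaucoma)."""

        def __init__(self, img_dim=768, txt_dim=768, shared_dim=256, num_classes=2):
            super().__init__()
            # Project both modalities into a shared space.
            self.img_proj = nn.Linear(img_dim, shared_dim)
            self.txt_proj = nn.Linear(txt_dim, shared_dim)
            # Attention gate producing per-sample weights over the two modalities.
            self.attn = nn.Sequential(
                nn.Linear(2 * shared_dim, shared_dim),
                nn.Tanh(),
                nn.Linear(shared_dim, 2),
            )
            self.classifier = nn.Linear(shared_dim, num_classes)
            # Learnable temperature for the contrastive alignment term.
            self.logit_scale = nn.Parameter(torch.tensor(2.0))

        def forward(self, img_feat, txt_feat):
            v = self.img_proj(img_feat)  # (B, shared_dim)
            t = self.txt_proj(txt_feat)  # (B, shared_dim)
            # Adaptive per-sample weighting of the two modalities.
            weights = F.softmax(self.attn(torch.cat([v, t], dim=-1)), dim=-1)  # (B, 2)
            fused = weights[:, :1] * v + weights[:, 1:] * t
            return self.classifier(fused), v, t

        def contrastive_loss(self, v, t):
            # Symmetric InfoNCE-style loss: matched image/text pairs lie on the diagonal.
            v = F.normalize(v, dim=-1)
            t = F.normalize(t, dim=-1)
            logits = self.logit_scale.exp() * v @ t.T  # (B, B)
            targets = torch.arange(v.size(0), device=v.device)
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.T, targets))


    if __name__ == "__main__":
        # Random placeholder embeddings in place of frozen encoder outputs.
        img_feat = torch.randn(8, 768)
        txt_feat = torch.randn(8, 768)
        model = AttentionFusionClassifier()
        logits, v, t = model(img_feat, txt_feat)
        labels = torch.randint(0, 2, (8,))
        loss = F.cross_entropy(logits, labels) + 0.1 * model.contrastive_loss(v, t)
        loss.backward()
        print(logits.shape, loss.item())

In a setup like this, the parameter-efficient fine-tuning described in the abstract would amount to freezing all but the top transformer blocks of the two encoders (for example, setting requires_grad to False on the lower layers) so that only the upper layers, the fusion module, and the classifier receive gradient updates; the exact layers updated in the paper's experiments are not specified here.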
