Enhancing Multimodal Glaucoma Screening through Attention-Guided Vision-Language Fusion
Abstract
Glaucoma, a leading cause of irreversible blindness, demands early detection for effective management. Traditional diagnostic methods relying on single-modality inputs often suffer from limitations such as incomplete context and high inter-observer variability. This study introduces a novel multimodal deep learning framework that integrates retinal fundus images with corresponding clinical texts for automated glaucoma screening. Leveraging a vision transformer (ViT)-based CLIP visual encoder and a domain-specific Bio_ClinicalBERT text encoder, our approach extracts semantically rich features from both modalities. An attention-based fusion module adaptively weights visual and textual cues, while a contrastive alignment loss enhances cross-modal consistency. Parameter-efficient fine-tuning strategies are employed to update only the top transformer layers, reducing computational overhead. Experiments on the Harvard-FairVLMed dataset demonstrate superior performance over baselines, achieving an AUC of 87.25% and an accuracy of 79.20%. Ablation studies validate the contributions of adaptive fusion, contrastive learning, and domain-specific language modeling, showcasing the framework's potential for real-world medical vision-language applications. The code and data are available at https://github.com/LXY12370/Multimodal.
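To make the described pipeline concrete, the sketch below illustrates the components named in the abstract: a CLIP ViT visual encoder and a Bio_ClinicalBERT text encoder, an attention-based gate that adaptively weights the two modalities, a CLIP-style contrastive alignment term, and parameter-efficient fine-tuning that unfreezes only the top transformer layers. This is a minimal illustration, not the authors' released code (see the linked repository): the public checkpoints openai/clip-vit-base-patch32 and emilyalsentzer/Bio_ClinicalBERT, the gating form, the contrastive-loss weight, and the number of unfrozen layers are all assumptions.

```python
# Minimal sketch of the abstract's fusion pipeline (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, CLIPVisionModel


class AttentionFusionClassifier(nn.Module):
    def __init__(self, num_classes: int = 2, proj_dim: int = 512, trainable_top_layers: int = 2):
        super().__init__()
        # Assumed public checkpoints standing in for the encoders named in the abstract.
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

        # Parameter-efficient fine-tuning: freeze everything, then unfreeze the
        # top `trainable_top_layers` transformer blocks of each encoder.
        for p in self.vision.parameters():
            p.requires_grad = False
        for p in self.text.parameters():
            p.requires_grad = False
        for layer in self.vision.vision_model.encoder.layers[-trainable_top_layers:]:
            for p in layer.parameters():
                p.requires_grad = True
        for layer in self.text.encoder.layer[-trainable_top_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

        # Project both modalities into a shared space.
        self.img_proj = nn.Linear(self.vision.config.hidden_size, proj_dim)
        self.txt_proj = nn.Linear(self.text.config.hidden_size, proj_dim)

        # Attention-based fusion: a small gate yields adaptive per-modality weights.
        self.gate = nn.Sequential(nn.Linear(2 * proj_dim, proj_dim), nn.Tanh(), nn.Linear(proj_dim, 2))
        self.classifier = nn.Linear(proj_dim, num_classes)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style temperature

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        img = self.img_proj(self.vision(pixel_values=pixel_values).pooler_output)
        txt = self.txt_proj(self.text(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0])

        # Adaptively weight visual and textual cues, then classify the fused feature.
        weights = torch.softmax(self.gate(torch.cat([img, txt], dim=-1)), dim=-1)  # (B, 2)
        fused = weights[:, :1] * img + weights[:, 1:] * txt
        logits = self.classifier(fused)
        if labels is None:
            return logits

        # CLIP-style contrastive alignment between paired image and text embeddings.
        img_n, txt_n = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        sim = self.logit_scale.exp() * img_n @ txt_n.t()
        targets = torch.arange(sim.size(0), device=sim.device)
        contrastive = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
        loss = F.cross_entropy(logits, labels) + 0.1 * contrastive  # 0.1 weight is an assumption
        return logits, loss


if __name__ == "__main__":
    # Toy forward pass with a random tensor standing in for a preprocessed fundus image.
    tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AttentionFusionClassifier()
    enc = tok(["cup-to-disc ratio 0.7, suspect glaucoma"], return_tensors="pt", padding=True)
    pixels = torch.randn(1, 3, 224, 224)
    logits, loss = model(pixels, enc["input_ids"], enc["attention_mask"], labels=torch.tensor([1]))
    print(logits.shape, loss.item())
```

The gate here is a simple learned softmax over the two projected modality embeddings; the paper's fusion module and loss weighting may differ, so the repository should be treated as the authoritative implementation.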