A Novel Clinically Explainable Vision Transformer for OCT-Based Retinal Disease Classification: Integrating UniMIE Enhancement and Grad-CAM Interpretability
Abstract
Accurate and explainable classification of Optical Coherence Tomography (OCT) images is essential for the early detection and follow-up of retinal conditions such as Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), and Drusen. Conventional deep learning models, including transformers and convolutional neural networks (CNNs), achieve state-of-the-art classification results but lack the interpretability needed in clinical practice and struggle to distinguish subtle differences among diseases. This paper proposes a clinically interpretable vision transformer (ViT) model that combines Universal Medical Image Enhancement (UniMIE)-based image enhancement, hierarchical ViT feature extraction, and Gradient-weighted Class Activation Mapping (Grad-CAM)-based visualization to improve both classification accuracy and interpretability. The proposed ViT model is evaluated on the UCSD and Mendeley OCT datasets, where it reaches a top accuracy of 98.84% in 5-fold cross-validation, outperforming existing CNN-based and transformer-based methods. The model also attains an AUC-ROC of 99.45%, demonstrating strong discriminative ability across the CNV, DME, Drusen, and Normal classes. Extensive hyperparameter tuning of the dropout rate, encoder depth, and learning rate improved accuracy and generalization. Grad-CAM visualizations add clinical interpretability by highlighting decision-critical retinal regions, helping confirm that predictions are consistent with the pathological features observed by ophthalmologists. Comparative analysis against current deep learning models confirms that the proposed ViT model delivers top-tier performance while addressing key shortcomings of earlier approaches: weak fine-grained classification, overfitting, and limited interpretability.
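As a rough illustration of the Grad-CAM step named in the abstract, the sketch below shows how a heatmap is generally formed: each feature map is weighted by the spatial mean of the class-score gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. This is not the authors' implementation. The function name `grad_cam` and the toy numbers are hypothetical, and for a ViT it assumes the patch-token activations have already been reshaped into an H x W grid.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one Grad-CAM heatmap.

    activations[k][i][j]: value of feature map k at grid cell (i, j).
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j]).
    """
    n_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight alpha_k: spatial mean of the gradient for map k.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_maps)
    ]

    # Weighted sum of the maps, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] for overlaying on the image; an all-zero map is left as-is.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2x2 maps with made-up numbers (not taken from the paper).
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # spatial mean 0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # spatial mean -1.0
]
heatmap = grad_cam(activations, gradients)
```

In practice the activations and gradients would be captured from a chosen layer of the trained model during a forward and backward pass, and the resulting map would be upsampled to the OCT image resolution before being overlaid.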