Graph-Based Learning and Multimodal Learning for Colon Disease Classification: An Interpretable Study using CNN-GNN Pipelines and Vision-Language Models
Abstract
Colorectal cancer (CRC) is a significant global health problem that demands improved diagnostic tools for early and accurate detection. This paper proposes an interpretable framework for classifying colon diseases from endoscopic images in the Kvasir v2 dataset. Each image underwent a systematic preprocessing pipeline before model training to ensure consistency and improve feature representation. Images were resized to 224 × 224 pixels to match deep learning input requirements, pixel intensities were normalized to stabilize convergence, and contrast enhancement was applied to improve the visibility of mucosal textures. Edge-sharpening methods, including unsharp masking and Laplacian filtering, were used to emphasize structural boundaries and highlight lesion edges and polyp margins. For data augmentation, random rotations, flips, zoom scaling, and brightness adjustments were introduced to increase data diversity, reduce overfitting, and improve robustness to real-world variation in colonoscopy imaging. The proposed hybrid pipeline combines CNNs and GNNs, to extract visual features and model relational dependencies, with Vision-Language Models (VLMs) that pair a Vision Transformer (ViT) with BERT for multimodal learning. Several graph-construction methods (cosine similarity, ε-radius, k-nearest neighbors) and GNN architectures (GCN, GAT, GraphSAGE, GIN) were evaluated; the best graph-based configuration, ViT + ε-radius + GIN, reached 91% accuracy, while the fine-tuned ViT-BERT model performed best overall with 95.17% accuracy and a 0.95 F1-score. Grad-CAM visualizations further improve interpretability by highlighting clinically relevant image regions, positioning the framework as a robust, interpretable, and transparent tool for automated CRC diagnostics across diverse clinical environments.
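The preprocessing steps described above (resizing, normalization, contrast enhancement, and unsharp masking) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact contrast-enhancement method and sharpening strength are assumptions, and simple percentile contrast stretching stands in for whatever enhancement the paper used.

```python
import numpy as np
from scipy import ndimage


def preprocess(image):
    """Sketch of the described pipeline: resize to 224x224, normalize
    intensities, stretch contrast, then apply unsharp masking.
    Parameters (percentiles, sigma, sharpening amount) are illustrative."""
    h, w = image.shape[:2]
    # Resize to the 224x224 input size expected by the deep learning models.
    img = ndimage.zoom(image.astype(np.float32), (224 / h, 224 / w, 1), order=1)
    # Normalize pixel intensities to [0, 1] to stabilize convergence.
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    # Contrast enhancement (assumed variant): clip to the 2nd-98th
    # percentiles and rescale, making mucosal textures more visible.
    lo, hi = np.percentile(img, (2, 98))
    img = np.clip((img - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    # Unsharp masking: add back the difference from a Gaussian-blurred
    # copy to emphasize structural boundaries such as polyp margins.
    blurred = ndimage.gaussian_filter(img, sigma=(2, 2, 0))
    img = np.clip(img + 1.0 * (img - blurred), 0.0, 1.0)
    return img
```

A Laplacian filter, as also mentioned in the abstract, could replace the Gaussian-difference step with `img - ndimage.laplace(img)` for a similar edge-emphasis effect.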
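The graph-construction strategies evaluated in the paper (cosine similarity, ε-radius, k-nearest neighbors) can be illustrated with a small sketch over image feature vectors (one node per image). The threshold convention for the ε-radius variant is an assumption; the paper's exact formulation may differ.

```python
import numpy as np


def _cosine_sim(features):
    # Row-normalize, then the Gram matrix gives pairwise cosine similarity.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return f @ f.T


def cosine_knn_graph(features, k=5):
    """k-nearest-neighbor graph: connect each node to its k most
    cosine-similar neighbors, then symmetrize (undirected graph)."""
    sim = _cosine_sim(features)
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops from the top-k
    n = len(features)
    adj = np.zeros((n, n), dtype=bool)
    idx = np.argsort(-sim, axis=1)[:, :k]  # indices of top-k neighbors
    adj[np.repeat(np.arange(n), k), idx.ravel()] = True
    return adj | adj.T


def epsilon_graph(features, eps=0.3):
    """ε-radius graph (assumed convention): connect node pairs whose
    cosine similarity exceeds 1 - eps."""
    sim = _cosine_sim(features)
    np.fill_diagonal(sim, 0.0)  # no self-loops
    return sim > (1.0 - eps)
```

The resulting adjacency matrix would then feed a GNN (e.g., GIN, the best performer here) via a library such as PyTorch Geometric, which expects edges in COO format (`np.nonzero(adj)`).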