A Unified Vision Transformer and Convolutional Neural Network Framework for Multi-Domain Cancer Classification
Abstract
Accurate and reliable classification of cancer from medical imaging is essential for effective computer-aided diagnosis. In this study, we conduct a comprehensive evaluation of three deep learning architectures: Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and a hybrid model (HViT-CNN) that integrates a CNN backbone with transformer-based attention mechanisms. These models are benchmarked across three diverse and clinically relevant imaging modalities: brain magnetic resonance imaging (MRI), dermoscopic images for skin cancer, and cytology slides for cervical cancer. While CNNs excel at capturing local texture features and ViTs at modeling global spatial relationships, both architectures exhibit modality-specific limitations. The proposed HViT-CNN addresses these limitations by combining localized feature extraction with global contextual reasoning. Across all datasets, the hybrid model consistently achieved the highest classification accuracy, 98.4% for brain tumors, 98.0% for skin cancer, and 99.0% for cervical cancer, outperforming its individual components. These results underscore the effectiveness of hybrid architectures in handling both coarse- and fine-grained image features, and highlight their potential for advancing generalizable, high-precision diagnostic tools in medical image analysis.
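The hybrid design the abstract describes, local convolutional feature extraction followed by transformer-style global attention over patch tokens, can be sketched in miniature. The sketch below is illustrative only: the single 2-D convolution, single-head attention, patch size, and all weight shapes are assumptions for demonstration, not the paper's HViT-CNN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D convolution (single channel) -- stands in for the CNN backbone."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def patchify(fmap, p):
    """Split the feature map into flattened p x p patch tokens (ViT-style)."""
    H, W = fmap.shape
    toks = [fmap[i:i + p, j:j + p].ravel()
            for i in range(0, H - p + 1, p)
            for j in range(0, W - p + 1, p)]
    return np.stack(toks)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention -- models global context."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def hybrid_forward(image, kernel, Wq, Wk, Wv, Wcls, p=4):
    """Hypothetical hybrid forward pass: CNN stage -> tokens -> attention -> head."""
    fmap = conv2d(image, kernel)              # local texture features (CNN stage)
    toks = patchify(fmap, p)                  # tokenize the feature map
    ctx = self_attention(toks, Wq, Wk, Wv)    # global reasoning (transformer stage)
    logits = ctx.mean(axis=0) @ Wcls          # mean-pool tokens, linear classifier
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # class probabilities

# Toy usage: a 17x17 "image", 2x2 kernel -> 16x16 feature map -> 16 tokens of dim 16,
# classified into 3 hypothetical classes (e.g., tumor types).
image = rng.normal(size=(17, 17))
kernel = rng.normal(size=(2, 2))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
Wcls = rng.normal(size=(16, 3))
probs = hybrid_forward(image, kernel, Wq, Wk, Wv, Wcls)
```

In a trained model each stage would of course be learned, stacked, and multi-headed; the point of the sketch is only the data flow, convolutional locality feeding token-level global attention, that the abstract credits for the hybrid's accuracy.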