A Transformer-Driven Hybrid Feature Fusion Framework for Multi-Modal Medical Image Analysis
Abstract
Early disease diagnosis depends heavily on strong medical image classification models. This paper proposes a hybrid method that combines handcrafted descriptors (HOG, BoVW) with deep features (VGG19) to form an integrative fused feature representation. The combined features are then fed into an optimized Vision Transformer (FFXViT), which enables stronger global context modelling while preserving key local information. Experiments were conducted on two reference modalities: histopathology images with three classes (adenocarcinoma, squamous cell carcinoma, benign) and chest X-ray images with four classes (COVID-19, lung opacity, normal, viral pneumonia). The proposed FFXViT attained 99.50% accuracy on histopathology and 97.41% on chest X-rays, a marked improvement over state-of-the-art CNN, transformer, and hybrid baselines. The experiments showcase the scalability, robustness, and interpretability of the framework and empirically verify FFXViT as a viable solution for robust cross-modality medical image analysis and clinical decision support.
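To make the fusion pipeline concrete, the sketch below pairs a HOG descriptor with pooled VGG19 features and routes the concatenated vector through a small Transformer encoder. This is a minimal illustration assuming a PyTorch/scikit-image stack; the `FusionViT` class, layer sizes, and projection head are illustrative stand-ins rather than the authors' exact FFXViT architecture, and the BoVW branch is omitted for brevity.

```python
# Minimal sketch of handcrafted + deep feature fusion feeding a Transformer
# classifier. Assumption: the exact FFXViT design is not reproduced here.
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog
from torchvision.models import vgg19


def handcrafted_features(gray_img: np.ndarray) -> torch.Tensor:
    """HOG descriptor for one grayscale image (BoVW branch omitted)."""
    h = hog(gray_img, orientations=9, pixels_per_cell=(16, 16),
            cells_per_block=(2, 2), feature_vector=True)
    return torch.from_numpy(h).float()


class FusionViT(nn.Module):
    """Concatenate handcrafted and VGG19 features, then classify with a
    small Transformer encoder standing in for the optimized ViT."""

    def __init__(self, hog_dim: int, num_classes: int, d_model: int = 256):
        super().__init__()
        # weights=None keeps the sketch offline; pass VGG19_Weights.DEFAULT
        # for the pretrained backbone used in practice.
        backbone = vgg19(weights=None)
        self.cnn = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten())          # -> 512-d deep feature
        self.proj = nn.Linear(512 + hog_dim, d_model)   # fuse by concatenation
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, rgb: torch.Tensor, hog_feat: torch.Tensor) -> torch.Tensor:
        deep = self.cnn(rgb)                            # (B, 512)
        fused = torch.cat([deep, hog_feat], dim=1)      # (B, 512 + hog_dim)
        tokens = self.proj(fused).unsqueeze(1)          # (B, 1, d_model)
        return self.head(self.encoder(tokens)[:, 0])    # class logits


# Toy usage: one random 224x224 image through the pipeline.
img = np.random.rand(224, 224).astype(np.float32)
hog_feat = handcrafted_features(img).unsqueeze(0)
model = FusionViT(hog_dim=hog_feat.shape[1], num_classes=4)
logits = model(torch.rand(1, 3, 224, 224), hog_feat)
print(logits.shape)  # torch.Size([1, 4])
```

In this sketch, fusion is a simple concatenation of the 512-dimensional pooled VGG19 vector with the HOG descriptor before projection into the Transformer's embedding space; the paper's fusion and ViT optimization details may differ.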