A Resource-Efficient Hybrid ViT–CNN Framework with CutMix Regularization for Cardiac MRI Image Classification

Amirreza Khayyat assadi
Babak Nouri-Moghaddam
Abbas Mirzaei

Abstract

Hybrid deep learning architectures that combine Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have emerged as an effective paradigm in medical image computing by enabling simultaneous modeling of local textures and global contextual dependencies. Nevertheless, achieving high classification accuracy while maintaining computational efficiency remains a significant challenge, particularly in resource-constrained computing environments and data-limited medical imaging scenarios. In this study, we propose a resource-efficient Hybrid ViT–CNN framework for cardiac magnetic resonance imaging (MRI) classification, explicitly designed to optimize architectural inductive bias through structured feature fusion and CutMix regularization. The proposed model employs a shallow convolutional stem to encode localized texture information and inject domain-specific inductive bias, followed by a lightweight Transformer encoder to capture long-range global dependencies. To enhance generalization and training stability on limited datasets, CutMix stochastic augmentation is incorporated, while a dynamic resource-adaptive batching strategy is utilized to optimize memory usage and computational throughput during training on CPU-only hardware. The framework is evaluated on the CAD Cardiac MRI dataset using stratified five-fold cross-validation. Experimental results demonstrate an average classification accuracy of 96.71% ± 1.2%, an F1-score of 0.9671, and an Area Under the ROC Curve (AUC) of 0.9960, consistently outperforming standalone CNN- and ViT-based baselines. Importantly, the proposed model converges within 231 seconds on CPU-only hardware and achieves real-time inference performance of approximately 12 ms per image, highlighting its practical feasibility for deployment in constrained computing environments. 
Ablation studies further confirm that the hybrid architectural design yields an intrinsic performance gain of approximately 4.5%, with CutMix providing additional robustness. These findings demonstrate that high-accuracy cardiac MRI classification can be achieved without reliance on high-end GPU resources, underscoring the potential of hybrid, resource-aware deep learning architectures for scalable and efficient medical image computing applications.
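
To illustrate the CutMix regularization mentioned in the abstract, here is a minimal sketch in plain Python. It is not the authors' implementation. It follows the commonly described CutMix recipe: sample a mixing ratio from a Beta distribution, paste a rectangular patch from a second image into the first, and mix the one-hot labels in proportion to the pasted area. The function names `cutmix_box` and `cutmix`, the default `alpha=1.0`, and the use of nested lists for a single-channel image are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random


def cutmix_box(height, width, lam, rng):
    """Return (top, bottom, left, right) of a patch covering roughly (1 - lam) of the image."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h = int(height * cut_ratio)
    cut_w = int(width * cut_ratio)
    # Pick a random patch centre, then clip the patch to the image bounds.
    cy = rng.randrange(height)
    cx = rng.randrange(width)
    top = max(cy - cut_h // 2, 0)
    bottom = min(cy + cut_h // 2, height)
    left = max(cx - cut_w // 2, 0)
    right = min(cx + cut_w // 2, width)
    return top, bottom, left, right


def cutmix(img_a, label_a, img_b, label_b, alpha=1.0, rng=None):
    """Mix two single-channel images (lists of rows) and their one-hot labels.

    alpha=1.0 is an assumed default, not a value taken from the abstract.
    """
    rng = rng or random.Random()
    height, width = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)
    top, bottom, left, right = cutmix_box(height, width, lam, rng)

    # Copy img_a so the input is not modified, then paste the patch from img_b.
    mixed = [row[:] for row in img_a]
    for y in range(top, bottom):
        mixed[y][left:right] = img_b[y][left:right]

    # Clipping can shrink the patch, so recompute lam from the actual pasted area.
    pasted_area = (bottom - top) * (right - left)
    lam = 1.0 - pasted_area / (height * width)
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label, lam


if __name__ == "__main__":
    zeros = [[0.0] * 8 for _ in range(8)]
    ones = [[1.0] * 8 for _ in range(8)]
    mixed, label, lam = cutmix(zeros, [1.0, 0.0], ones, [0.0, 1.0], rng=random.Random(0))
    print(lam, label)
```

In a training loop, a step like this would typically be applied to pairs of images within a batch, with the loss computed against the mixed label. Recomputing the mixing ratio after clipping keeps the label weights consistent with the pixels actually replaced.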

Version published to 10.21203/rs.3.rs-8735303/v1 on Research Square
Feb 25, 2026

Related articles

ML-ConvNet: A Lightweight and Interpretable Unified Architecture for Medical Image Classification Across Modalities

This article has 10 authors:
1. Williams Ayivi
2. Xiaoling Zhang
3. Yeong Hyeon Gu
4. Amil Aligayev
5. Ali Alqahtani
6. Wisdom Xornam Ativi
7. Francis Sam
8. Muhammed Amin Abdullah
9. Emmanuel Sarpong Addai Gyarteng
10. Mugahed A. Al-antari
This article has no evaluations. Latest version: Mar 17, 2026

Bridging Scale, Semantics, and Boundaries: A Hybrid CNN-Transformer Architecture with Bidirectional Spatial-Channel Fusion for Medical Image Segmentation

This article has 4 authors:
1. Lanxiang Ma
2. Zongjian Yang
3. Jinghua Zhu
4. Jiquan Ma
This article has no evaluations. Latest version: Apr 1, 2026

Do Hybrid CNN–Transformer Architectures Really Generalize? A Systematic Review for Medical Imaging

This article has 3 authors:
1. Roaa Ehab
2. Shimaa El-Bana
3. Ahmad Al-Kabbany
This article has no evaluations. Latest version: Mar 26, 2026
