A Resource-Efficient Hybrid ViT–CNN Framework with CutMix Regularization for Cardiac MRI Image Classification


Abstract

Hybrid deep learning architectures that combine Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have emerged as an effective paradigm in medical image computing by enabling simultaneous modeling of local textures and global contextual dependencies. Nevertheless, achieving high classification accuracy while maintaining computational efficiency remains a significant challenge, particularly in resource-constrained computing environments and data-limited medical imaging scenarios. In this study, we propose a resource-efficient Hybrid ViT–CNN framework for cardiac magnetic resonance imaging (MRI) classification, explicitly designed to optimize architectural inductive bias through structured feature fusion and CutMix regularization. The proposed model employs a shallow convolutional stem to encode localized texture information and inject domain-specific inductive bias, followed by a lightweight Transformer encoder to capture long-range global dependencies. To enhance generalization and training stability on limited datasets, CutMix stochastic augmentation is incorporated, while a dynamic resource-adaptive batching strategy is utilized to optimize memory usage and computational throughput during training on CPU-only hardware. The framework is evaluated on the CAD Cardiac MRI dataset using stratified five-fold cross-validation. Experimental results demonstrate an average classification accuracy of 96.71% ± 1.2%, an F1-score of 0.9671, and an Area Under the ROC Curve (AUC) of 0.9960, consistently outperforming standalone CNN- and ViT-based baselines. Importantly, the proposed model converges within 231 seconds on CPU-only hardware and achieves real-time inference performance of approximately 12 ms per image, highlighting its practical feasibility for deployment in constrained computing environments. 
Ablation studies further confirm that the hybrid architectural design yields an intrinsic performance gain of approximately 4.5%, with CutMix providing additional robustness. These findings demonstrate that high-accuracy cardiac MRI classification can be achieved without reliance on high-end GPU resources, underscoring the potential of hybrid, resource-aware deep learning architectures for scalable and efficient medical image computing applications.
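
The CutMix augmentation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cutmix`, the default `alpha=1.0`, and the choice of one shared box per batch are assumptions made for illustration, since the abstract does not state these details. The sketch pastes a random rectangle from a shuffled copy of the batch into each image and mixes the one-hot labels in proportion to the area actually replaced.

```python
import numpy as np

def cutmix(images, labels, alpha=1.0, rng=None):
    """Apply CutMix to a batch (illustrative sketch, hypothetical helper).

    images: float array of shape (N, H, W) or (N, H, W, C)
    labels: one-hot (or soft) label array of shape (N, K)
    alpha:  Beta(alpha, alpha) parameter; 1.0 is an assumed default
    """
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = images.shape[:3]

    # Sample the nominal mixing ratio and a random pairing of samples.
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)

    # Box whose area is roughly (1 - lam) of the image, centred at random.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    # Paste the patch from the paired image into each image.
    mixed = images.copy()
    mixed[:, y1:y2, x1:x2] = images[perm, y1:y2, x1:x2]

    # Recompute the ratio from the clipped box so labels match the pixels.
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam_adj * labels + (1.0 - lam_adj) * labels[perm]
    return mixed, mixed_labels, lam_adj
```

In a training loop, a call such as `cutmix(batch_images, batch_onehot)` would typically be applied per mini-batch before the forward pass, with the loss computed against the returned soft labels.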
