Do Hybrid CNN–Transformer Architectures Really Generalize? A Systematic Review for Medical Imaging

Roaa Ehab
Shimaa El-Bana
Ahmad Al-Kabbany

Abstract

This article presents a systematic review of hybrid CNN–Transformer architectures, examining whether and how their structural design supports generalization across diverse medical imaging scenarios. While convolutional neural networks offer strong spatial inductive biases and data efficiency, and Vision Transformers provide superior global context modeling, neither paradigm alone adequately addresses the full complexity of clinical imaging tasks. Hybrid architectures, by integrating both components, offer a compelling middle ground that is particularly valuable in medical imaging, where models must remain reliable across heterogeneous acquisition protocols, scanner variability, and diverse patient populations. At the same time, medical imaging poses unique generalization challenges — including cross-organ transfer, multi-modal fusion, and cross-dataset robustness — that expose the limitations of architectures optimized narrowly for benchmark performance. Following PRISMA guidelines, we systematically queried major academic databases, screened the resulting literature, and synthesized a representative body of peer-reviewed studies spanning a range of imaging modalities, anatomical targets, and learning paradigms. Our analysis covers the architectural taxonomy of hybrid designs, their learning and optimization strategies, and the evaluation practices adopted across the reviewed literature. The findings reveal that while hybrid models consistently demonstrate competitive performance, critical limitations persist: high computational overhead, insufficient external validation, and a heavy reliance on fully supervised learning constrain their real-world applicability. We conclude with a set of forward-looking recommendations emphasizing efficiency-aware design, standardized cross-domain evaluation, and the broader adoption of self-supervised learning strategies to advance the clinical translation of hybrid architectures.

Version published to 10.21203/rs.3.rs-9216007/v1 on Research Square, Mar 26, 2026

Related articles

Bridging Scale, Semantics, and Boundaries: A Hybrid CNN-Transformer Architecture with Bidirectional Spatial-Channel Fusion for Medical Image Segmentation

This article has 4 authors:
1. Lanxiang Ma
2. Zongjian Yang
3. Jinghua Zhu
4. Jiquan Ma
This article has no evaluations. Latest version: Apr 1, 2026

Synthetic MRI Pretraining for Medical Imaging Tasks

This article has 2 authors:
1. Rosanna Turrisi
2. Giuseppe Patanè
This article has no evaluations. Latest version: Apr 6, 2026

CerebroNet: A systematically derived explainable brain tumor classifier for resource-constrained MRI diagnostics

This article has 3 authors:
1. Umar Hasan
2. Muhammad Ali Nayeem
3. Riasat Khan
This article has no evaluations. Latest version: Mar 26, 2026
