Do Hybrid CNN–Transformer Architectures Really Generalize? A Systematic Review for Medical Imaging

Abstract

This article presents a systematic review of hybrid CNN–Transformer architectures, examining whether and how their structural design supports generalization across diverse medical imaging scenarios. Convolutional neural networks offer strong spatial inductive biases and data efficiency, and Vision Transformers provide superior global context modeling, but neither paradigm alone adequately addresses the full complexity of clinical imaging tasks. Hybrid architectures integrate both components and offer a compelling middle ground. This is particularly valuable in medical imaging, where models must remain reliable across heterogeneous acquisition protocols, scanner variability, and diverse patient populations. At the same time, medical imaging poses unique generalization challenges, including cross-organ transfer, multi-modal fusion, and cross-dataset robustness, that expose the limitations of architectures optimized narrowly for benchmark performance. Following PRISMA guidelines, we systematically queried major academic databases, screened the resulting literature, and synthesized a representative body of peer-reviewed studies spanning a range of imaging modalities, anatomical targets, and learning paradigms. Our analysis covers the architectural taxonomy of hybrid designs, their learning and optimization strategies, and the evaluation practices adopted across the reviewed literature. The findings reveal that, while hybrid models consistently demonstrate competitive performance, critical limitations persist: high computational overhead, insufficient external validation, and heavy reliance on fully supervised learning all constrain their real-world applicability. We conclude with forward-looking recommendations emphasizing efficiency-aware design, standardized cross-domain evaluation, and broader adoption of self-supervised learning strategies to advance the clinical translation of hybrid architectures.
