A Hybrid CNN-Transformer Deep Learning Model for Differentiating Benign and Malignant Breast Tumors Using Multi-View Ultrasound Images
Abstract
Breast cancer is a leading malignancy threatening women’s health globally, making early and accurate diagnosis crucial. Ultrasound is a key screening and diagnostic tool due to its non-invasive, real-time, and cost-effective nature. However, its diagnostic accuracy is highly dependent on operator experience, and conventional single-image analysis often fails to capture the comprehensive features of a lesion. This study introduces a computer-aided diagnosis (CAD) system that emulates a clinician’s multi-view diagnostic process. We developed a novel hybrid deep learning model that integrates a Convolutional Neural Network (CNN) with a Transformer architecture. The model uses a pretrained EfficientNetV2 to extract spatial features from multiple unordered ultrasound images of a single lesion. These features are then processed by a Transformer encoder, whose self-attention mechanism globally models and fuses their intrinsic correlations. To prevent data leakage and ensure a rigorous evaluation, a strict lesion-level data partitioning strategy was applied. On an internal test set, our hybrid CNN-Transformer achieved an accuracy of 0.960, a sensitivity of 0.967, a specificity of 0.954, and an Area Under the Curve (AUC) of 0.9788. On an external dataset, it demonstrated an accuracy of 0.940, a sensitivity of 0.952, a specificity of 0.929, and an AUC of 0.9730. Furthermore, in a prospective validation on a newly collected independent dataset, the model maintained robust performance with an accuracy of 0.952 and an AUC of 0.9801. These results significantly outperform those of a baseline single-image model, which achieved accuracies of 0.88 and 0.89 and AUCs of 0.95 and 0.94 on the internal and external datasets, respectively. This study demonstrates that combining a CNN with a Transformer yields a highly accurate and robust diagnostic system for breast ultrasound. By effectively fusing multi-view information, our model aligns with clinical logic and shows potential for improving diagnostic reliability.
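The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the EfficientNetV2 backbone is stood in for by random per-view feature vectors, the feature dimension and projection matrices (`Wq`, `Wk`, `Wv`) are hypothetical, and a single attention layer with mean pooling replaces the full Transformer encoder and classification head. The sketch shows the key property the abstract relies on: with no positional encoding, self-attention over the set of views followed by pooling is invariant to the order of the input images, matching the "multiple, unordered ultrasound images" setting.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fuse(views, Wq, Wk, Wv):
    """Single-head self-attention over an unordered set of per-view
    feature vectors (rows of `views`), then mean-pool to one lesion vector."""
    Q, K, V = views @ Wq, views @ Wk, views @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n_views, n_views) weights
    fused = attn @ V                               # contextualized per-view features
    return fused.mean(axis=0)                      # permutation-invariant pooling

rng = np.random.default_rng(0)
d = 8                                 # stand-in for the CNN feature dimension
views = rng.normal(size=(3, d))       # 3 hypothetical views of one lesion
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

z = self_attention_fuse(views, Wq, Wk, Wv)
# Permuting the views leaves the fused vector unchanged (no positional encoding):
z_perm = self_attention_fuse(views[[2, 0, 1]], Wq, Wk, Wv)
print(np.allclose(z, z_perm))  # True
```

In the actual model a stack of Transformer encoder layers (with feed-forward sublayers and normalization) would play the role of `self_attention_fuse`, and the pooled vector would feed a benign/malignant classifier.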