Hybrid Vision Transformers for Accurate Recognition of Lung Lesions and Anatomical Structures in Bronchoscopic Imaging
Abstract
Bronchoscopy is a vital technique for diagnosing central lung cancers, yet its effectiveness is limited by the high reliance on clinician expertise and the variability of visual interpretation. To address this challenge, we develop and adapt advanced hybrid CNN-Transformer architectures designed specifically for the automatic recognition of lung cancer lesions and anatomical landmarks in bronchoscopic images. Our approach integrates the complementary strengths of convolutional networks for local feature extraction and Transformers for global context modeling. We apply our models to BM-BronchoLC, a richly annotated public dataset comprising 2,921 bronchoscopic images, and introduce a dual-framework pipeline: MedViT for multi-label classification and FCB-SwinV2 for binary segmentation. These models are tailored to handle the unique complexity of bronchoscopic data and demonstrate strong performance across both tasks. To ensure clinical relevance and robust generalization, we compare conventional random image-level splitting with a strict patient-level partitioning strategy that better reflects real-world deployment conditions. Our results confirm that the models maintain meaningful predictive power under patient-level separation, revealing their capacity to generalize beyond patient-specific features. MedViT achieves a mean accuracy of 94.74% and an AUC-ROC of 0.95 under random splitting, with performance remaining competitive under stricter evaluation. Similarly, FCB-SwinV2 demonstrates reliable segmentation with Dice scores of 0.42 for anatomical landmarks and 0.33 for lesions. This study establishes robust hybrid vision baselines for BM-BronchoLC and highlights the importance of rigorous validation protocols to ensure trustworthy AI systems in clinical bronchoscopy. All code and models are publicly released to support reproducibility and foster future advancements in this critical application area.
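The difference between random image-level splitting and patient-level partitioning can be illustrated with a minimal sketch. It is not the released pipeline: the function name `patient_level_split`, the `patient_ids` list (one identifier per image), and the `test_fraction` and `seed` parameters are all assumptions introduced here for illustration, and the way BM-BronchoLC encodes patient identity is not specified in this abstract. The idea is only that patients, rather than individual images, are shuffled and held out, so that no patient contributes images to both the training and the test set.

```python
import random
from collections import defaultdict


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no patient appears in both train and test.

    patient_ids: one (hypothetical) patient identifier per image.
    Returns (train_indices, test_indices), each a sorted list of image indices.
    """
    # Group image indices by the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not images), then hold out a fraction of the patients.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    # Every image of a held-out patient goes to the test set.
    train, test = [], []
    for pid, indices in by_patient.items():
        (test if pid in test_patients else train).extend(indices)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Toy example: eight images from five hypothetical patients.
    ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
    train_idx, test_idx = patient_level_split(ids, test_fraction=0.4, seed=0)
    print("train images:", train_idx)
    print("test images:", test_idx)
```

A random image-level split would instead shuffle the image indices directly, which allows near-duplicate frames from the same patient to land on both sides of the split and can inflate reported scores.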