ContextualCLIP: A Context-Aware and Multi-Grained Fusion Framework for Few-Shot Ultrasound Anomaly Analysis
Abstract
Ultrasound (US) imaging is crucial for breast anomaly detection, but its interpretation is subjective and hampered by data scarcity and poor domain generalization. Existing deep learning models struggle to achieve precise pixel-level localization and fine-grained image-level classification simultaneously, especially in few-shot and cross-domain settings. To address these challenges, we propose ContextualCLIP, a novel few-shot adaptation framework built upon CLIP. ContextualCLIP introduces three core enhancements: (1) a Contextualized Adaptive Prompting (CAP) generator that dynamically creates clinically relevant text prompts by integrating high-order semantic context; (2) a Multi-Grained Feature Fusion Adapter (MGFA) that extracts features from different CLIP visual encoder layers and adaptively fuses them with gated attention for multi-scale lesion analysis; and (3) a Domain-Enhanced Memory Bank (DEMB) that improves cross-domain generalization by learning domain-invariant embeddings through a lightweight domain-aware module and contrastive learning. Jointly optimized for localization and classification, ContextualCLIP is evaluated on BUS-UCLM for few-shot adaptation and on BUSI/BUSZS without any further adaptation. ContextualCLIP consistently outperforms state-of-the-art baselines across few-shot settings, yielding substantially higher classification and localization metrics. Ablation studies validate the contribution of each module, and a human evaluation indicates that the system significantly improves radiologists' diagnostic accuracy and confidence. ContextualCLIP thus offers a robust and efficient solution for comprehensive ultrasound anomaly analysis in data-scarce and heterogeneous clinical environments.
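To make the MGFA idea concrete, the sketch below shows one plausible form of gated attention fusion over features tapped from several CLIP visual encoder layers. It is a minimal illustration, not the paper's implementation: the class name `GatedMultiGrainFusion`, the per-layer linear adapters, the per-token gating head, and the feature dimensions are all assumptions.

```python
import torch
import torch.nn as nn


class GatedMultiGrainFusion(nn.Module):
    """Illustrative sketch of gated attention fusion of multi-layer features.

    Assumption: the actual MGFA may use different adapters or gating; this
    only demonstrates the general mechanism described in the abstract.
    """

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # One lightweight adapter per tapped encoder layer.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU())
            for _ in range(num_layers)
        )
        # Gating head scores each layer's contribution per token.
        self.gate = nn.Linear(dim, 1)

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: list of (batch, tokens, dim) features from different
        # CLIP visual encoder layers, ordered shallow -> deep.
        adapted = torch.stack(
            [a(f) for a, f in zip(self.adapters, layer_feats)], dim=0
        )  # (layers, batch, tokens, dim)
        # Softmax over the layer axis yields attention-style gate weights,
        # so each token mixes coarse and fine-grained evidence adaptively.
        weights = torch.softmax(self.gate(adapted), dim=0)
        return (weights * adapted).sum(dim=0)  # (batch, tokens, dim)


# Usage: fuse hypothetical features tapped from 3 encoder layers.
feats = [torch.randn(2, 50, 768) for _ in range(3)]
fused = GatedMultiGrainFusion(dim=768, num_layers=3)(feats)
print(fused.shape)  # torch.Size([2, 50, 768])
```

The per-token softmax over layers lets shallow (texture) and deep (semantic) features dominate in different image regions, which is the multi-scale behavior the abstract attributes to the MGFA.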