Independent Benchmarking of Prompt-Based Medical Segmentation Models
Abstract
Medical image segmentation is rapidly shifting toward vision(-language) foundation models that unify diverse modalities and tasks within a single framework. In this work, we systematically benchmark high-impact vision-language and segment-anything-based architectures across multiple clinically relevant CT and MRI tasks. We show that while these models achieve strong performance, each comes with specific advantages and disadvantages. Non-3D models are highly flexible but require substantial user guidance and are prone to over- or under-detection. 3D architectures offer more reliable volumetric consistency overall, but can still suffer from detection problems. Vision-language models appear sensitive to the coverage of their training data, whereas click-prompted SAM-based models are more universal, with a limited ability to address zero-shot targets. When tested with more complex text prompts, most vision-language models exhibit a lack of semantic language understanding. Overall, these models hold considerable promise but still exhibit limitations. Our work highlights key areas where future research is needed to advance vision(-language) foundation models.