Patch-CLIP - Contrastive Health Record-Image Joint Training with Patch Embedding Loss
Abstract
Vision-Language (VL) models such as Contrastive Language-Image Pretraining (CLIP) use multimodal self-supervised learning (SSL) to extract maximal information from large-scale datasets. This enables the trained model to learn key image encodings and correlate them with the corresponding textual information through a contrastive loss function that maximizes the similarity of VL pairs. Because of the weak supervision provided by the text, these VL models demonstrate strong zero-shot classification performance; however, their performance on downstream tasks such as object detection and localization remains suboptimal [1]. In this work, we introduce a novel contrastive loss function that aligns image patch embeddings with text embeddings. These patch embeddings, an output of the image encoder that is usually discarded, naturally incorporate location information during unsupervised training. The proposed approach improves both localization and classification performance, allowing key findings to be localized without the need for a complex downstream object detection framework. We evaluated the proposed method on two chest X-ray (CXR) datasets for abnormality detection and localization tasks. The experiments achieved state-of-the-art (SOTA) results on 8 abnormality detection tasks. Moreover, the patch prediction maps introduced in this work considerably reduce False Positive (FP) rates at a given sensitivity compared to saliency maps.
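To make the core idea concrete, the sketch below shows one plausible way to contrast image patch embeddings against report-level text embeddings in PyTorch. The exact loss is defined in the paper; here the function name, the similarity-weighted patch pooling, and the temperature value are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_emb, text_emb, temperature=0.07):
    """Illustrative patch-to-text contrastive loss (not the paper's exact formulation).

    patch_emb: (B, P, D) patch embeddings from the image encoder
    text_emb:  (B, D)    pooled text embeddings from the text encoder
    Patches of each image are pooled with similarity-based weights before
    a standard CLIP-style symmetric InfoNCE loss over the batch.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Per-patch similarity to the matched report: (B, P)
    sim = torch.einsum('bpd,bd->bp', patch_emb, text_emb)
    # Softmax weights emphasize patches most relevant to the text
    weights = sim.softmax(dim=-1)
    # Weighted pooling to an image-level embedding: (B, D)
    pooled = F.normalize(torch.einsum('bp,bpd->bd', weights, patch_emb), dim=-1)

    # Symmetric InfoNCE: matched image-text pairs are positives
    logits = pooled @ text_emb.t() / temperature       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In a scheme like this, the per-patch similarities (`sim` or `weights`) could be reshaped back to the spatial patch grid to form localization maps, in the spirit of the patch prediction maps described in the abstract.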