ChatGPT-4 Prompt: A Tool to Enhance Novice Radiologists' Diagnostic Capabilities in Cystic Renal Masses to Expert-Level Accuracy

Abstract

Background: The effect of prompt engineering on large language model (LLM) performance for text-based questions has been variable, whereas its influence on image-based diagnostic tasks remains largely unexplored.

Purpose: To evaluate the diagnostic performance of various GPT-4 prompts for assessing cystic renal masses (CRMs) using the contrast-enhanced ultrasound (CEUS) Bosniak classification, and to test whether GPT-4 prompts can assist radiologists with different levels of experience.

Materials and Methods: This retrospective study included 103 images of CRMs from patients who underwent CEUS and CT. GPT-4 (OpenAI) and six radiologists (three experts and three novices) independently assigned a Bosniak classification (BC) based solely on the original CEUS images. The radiologists then reassessed the images with knowledge of the BCs generated by the GPT-4 prompt and decided whether to modify their initial assessments. The diagnostic performance of the radiologists and the GPT-4 prompts was quantified using the area under the receiver operating characteristic curve (AUC).

Results: The AUCs achieved by the GPT-4 prompts ranged from 0.549 to 0.778, while the radiologists' AUCs ranged from 0.820 to 0.901. Among all prompting strategies, ROT prompting achieved the highest AUC, with performance comparable to that of novices (0.778 vs. 0.820, P = 0.39). Although its AUC was lower than that of experts (0.778 vs. 0.901, P = 0.01), ROT prompting improved the AUCs of all three novices: from 0.714 to 0.834 for novice 1, from 0.685 to 0.782 for novice 2, and from 0.704 to 0.783 for novice 3, with all three approaching expert-level performance.

Conclusion: GPT-4 showed variable performance in interpreting images depending on the prompt. ROT prompting, the best-performing strategy, achieved diagnostic accuracy comparable to that of novices and could help novices improve their diagnostic performance to expert level.
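As an illustration of the evaluation metric named above, the following is a minimal sketch of how an AUC could be computed from ordinal Bosniak categories against a binary reference standard. The abstract does not describe the reference standard or the statistical software used, so the binary malignant/benign labels, the ordinal mapping `BOSNIAK_RANK`, the helper `auc_from_scores`, and the eight example cases are all hypothetical and are not data from the study.

```python
# Hypothetical ordinal ranking of Bosniak categories (assumption, not from the study).
BOSNIAK_RANK = {"I": 1, "II": 2, "IIF": 3, "III": 4, "IV": 5}


def auc_from_scores(labels, scores):
    """Return the AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case, counting
    ties as one half (the Mann-Whitney formulation of the ROC area)."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = malignant, 0 = benign (illustrative labels only).
labels = [0, 0, 0, 1, 1, 0, 1, 1]
# Made-up Bosniak categories assigned by one hypothetical reader.
reader_categories = ["I", "II", "IIF", "III", "IV", "III", "IIF", "IV"]

scores = [BOSNIAK_RANK[c] for c in reader_categories]
auc = auc_from_scores(labels, scores)
print(f"AUC = {auc:.3f}")
```

Under these assumptions, the same function could be applied to a reader's categories before and after seeing the model's output to compare the two AUCs. Testing whether such a difference is statistically significant would need a separate method that the abstract does not specify.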
