Comparative Evaluation of the Knowledge of Large Language Models about Response Evaluation Criteria in Solid Tumors

Abstract

PURPOSE: To evaluate the diagnostic performance of eight state-of-the-art large language models (LLMs) in applying the RECIST 1.1 guidelines for oncologic imaging and to compare their performance with that of board-certified radiologists. This study explores the potential of LLMs as adjuncts in cancer follow-up imaging.

MATERIALS AND METHODS: In this experimental cross-sectional study, 50 text-based and 30 case-based multiple-choice questions (MCQs) derived from RECIST 1.1 were administered to eight LLMs (ChatGPT variants, Claude 3 Opus and 3.5 Sonnet, Google Gemini 1.5 Pro, Meta Llama 3.1 405B, Mistral Large 2, and Perplexity Pro) and to two junior radiologists with seven years of experience. Responses were independently scored as correct or incorrect, and non-parametric statistical analyses were performed to compare performance across groups.

RESULTS: All LLMs demonstrated competence comparable to that of the radiologists, with only minor performance variations. Claude 3.5 Sonnet achieved the highest accuracy, with 83.3% on case-based and 90% on text-based questions. The other models also performed well, and no significant differences were found between LLMs and radiologists on the case-based assessments.

CONCLUSION: These findings suggest that LLMs could substantially change the reporting of follow-up imaging in cancer patients, a task with an important place in clinical practice. The strong performance of the LLMs, particularly Claude 3.5 Sonnet, underscores their promise as tools in oncologic imaging. These models not only support radiologists but may also reshape clinical workflows and set a new benchmark for diagnostic performance in radiology.
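For readers curious how such a comparison might be carried out, the sketch below illustrates one plausible analysis: scoring MCQ responses against an answer key and comparing accuracy between a model and a radiologist with Fisher's exact test. The abstract reports only that "non-parametric statistical analyses" were performed, so the choice of test, the helper function, and the counts here are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (not the study's code): score MCQ responses and compare the
# accuracy of a hypothetical LLM against a radiologist on a 30-item set using
# Fisher's exact test, one reasonable non-parametric choice for 2x2 counts.
from scipy.stats import fisher_exact


def score(responses, answer_key):
    """Count correct answers given parallel lists of responses and keys."""
    return sum(r == k for r, k in zip(responses, answer_key))


# Hypothetical counts for the 30 case-based questions (assumed values).
llm_correct, llm_total = 25, 30                  # ~83.3% accuracy
radiologist_correct, radiologist_total = 24, 30  # ~80% accuracy

# 2x2 contingency table: rows = rater, columns = correct / incorrect.
table = [
    [llm_correct, llm_total - llm_correct],
    [radiologist_correct, radiologist_total - radiologist_correct],
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

With samples this small, a non-significant p-value is expected even for visibly different accuracies, which is consistent with the study's finding of no significant differences on the case-based questions.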
