Is it time for the neurologist to use Large Language Models in everyday practice?
Abstract
Large Language Models (LLMs) such as ChatGPT and Gemini are gaining momentum in healthcare for their diagnostic potential. However, their real-world applicability in specialized medical fields like neurology remains inadequately explored. The possibility of using these tools in everyday diagnostic practice relies on evaluating their ability to support the clinician in assessing the patient, identifying the possible diagnosis, and designing the diagnostic pathway. To this end, in this study we (1) examined the available literature on the evaluation of LLMs in neurological diagnosis, in order to understand whether the methodologies applied were adequate to translate the use of LLMs into everyday practice, and (2) designed and performed an experiment to evaluate the diagnostic accuracy and clinical recommendations of ChatGPT-3.5 and Gemini compared with neurologists, using real-world clinical cases presented following everyday diagnostic practice. In the vast literature on LLM applications in neurology, only 24 studies reported experiences using LLMs in clinical neurology. The experiments reported showed a heterogeneous scenario of prompt engineering and input formats. At present, while responses to structured prompts are well documented, there is a lack of studies using real-world clinical scenarios and everyday workflows and practice. We therefore conducted a real-world experiment using a cohort of 28 anonymized patient records from the neurology department of the ASST Santi Paolo e Carlo Hospital (Milan, Italy). Cases were presented to ChatGPT-3.5 and Gemini replicating typical clinical workflows. Diagnostic accuracy and the appropriateness of recommended diagnostic tests were assessed against discharge diagnoses and neurologists’ performance.
Neurologists achieved a diagnostic accuracy of 75%, outperforming ChatGPT-3.5 (54%) and Gemini (46%). Both LLMs exhibited difficulties in nuanced clinical reasoning and over-prescribed diagnostic tests in 17–25% of cases. Despite their ability to generate structured recommendations, they struggled with complex or ambiguous presentations, requiring additional prompts in some cases. We therefore conclude that LLMs have potential as supportive tools in neurology, but they currently lack the depth required for nuanced clinical decision-making. These findings emphasize the need for further refinement of LLMs and for the development of evaluation methodologies that reflect the complexities of real-world neurology practice.