Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson’s Disease
Abstract
Background
The clinical applicability of large language models (LLMs) in Parkinson’s disease (PD) management remains insufficiently characterized, particularly in generative responses to clinical vignette scenarios.
Objective
To evaluate the quality of clinical assessments and management plans generated by a general-purpose LLM (Gemini 1.5 Pro) and a medically specialized LLM (OpenEvidence), and to compare their performance.
Methods
Each model generated free-text responses to 45 open-ended clinical queries, covering an assessment of the clinical situation and a recommended management plan. Two movement disorders fellows rated the outputs on 5-point Likert scales, and ratings were dichotomized into clinically appropriate (≥4) versus inappropriate (≤3). Discrepancies were adjudicated by a senior movement disorders specialist. Paired comparisons used McNemar’s test; a qualitative analysis examined severe errors.
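The dichotomization and paired comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the abstract does not say whether an exact or chi-square form of McNemar's test was used, so the exact binomial form is assumed here, and the function names and the example ratings are hypothetical placeholders rather than study data.

```python
from math import comb


def dichotomize(score: int) -> bool:
    """Map a 5-point Likert score to clinically appropriate (>=4) or not (<=3)."""
    if not 1 <= score <= 5:
        raise ValueError(f"Likert score out of range: {score}")
    return score >= 4


def discordant_counts(ratings_a, ratings_b):
    """Count discordant pairs across the same set of queries.

    Returns (b, c): b = only model A appropriate, c = only model B appropriate.
    """
    b = c = 0
    for score_a, score_b in zip(ratings_a, ratings_b):
        ok_a, ok_b = dichotomize(score_a), dichotomize(score_b)
        if ok_a and not ok_b:
            b += 1
        elif ok_b and not ok_a:
            c += 1
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (assumed variant) from discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical adjudicated ratings for eight queries (NOT data from the study).
model_a = [5, 4, 3, 2, 4, 3, 5, 4]
model_b = [4, 4, 4, 4, 3, 4, 5, 2]

b, c = discordant_counts(model_a, model_b)
print(b, c, mcnemar_exact(b, c))
```

Only the discordant pairs enter the test, which is why two models with different marginal rates can still fail to differ significantly when few queries separate them.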
Results
Gemini 1.5 Pro and OpenEvidence showed high rates of clinically appropriate assessments (80.0% vs. 86.7%) but lower rates of clinically appropriate management plans (48.9% vs. 57.8%). The assessment and the plan were both clinically appropriate in 46.7% and 55.6% of cases, respectively. None of these differences reached statistical significance. Severe errors were uncommon in assessments (6.7% vs. 8.9%) but more frequent in plans (26.7% for both models), predominantly reflecting treatment strategy errors.
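The reported percentages can be translated back into counts. The sketch below assumes that every percentage uses all 45 queries as its denominator, which the abstract implies but does not state. The dictionary keys and the helper name are illustrative, and the counts are inferred from the percentages rather than reported directly.

```python
N_QUERIES = 45  # number of clinical queries stated in the Methods

# Percentages as reported: (Gemini 1.5 Pro, OpenEvidence)
reported = {
    "appropriate assessment": (80.0, 86.7),
    "appropriate plan": (48.9, 57.8),
    "both appropriate": (46.7, 55.6),
    "severe error in assessment": (6.7, 8.9),
    "severe error in plan": (26.7, 26.7),
}


def implied_count(pct: float, n: int = N_QUERIES) -> int:
    """Nearest whole number of queries matching a reported percentage."""
    return round(pct / 100 * n)


implied = {
    outcome: tuple(implied_count(p) for p in pcts)
    for outcome, pcts in reported.items()
}

for outcome, (gemini, openevidence) in implied.items():
    print(f"{outcome}: {gemini}/{N_QUERIES} vs. {openevidence}/{N_QUERIES}")
```

Each implied count rounds back to the reported percentage at one decimal place, which is consistent with a denominator of 45 throughout.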
Conclusions
In generative clinical reasoning tasks involving Parkinson’s disease management vignettes, LLMs performed reasonably well in assessment but showed consistent limitations in plan generation. The medically specialized LLM demonstrated several qualitative advantages but no statistically significant performance benefit over the general-purpose model. These tools should therefore be used with appropriate caution in Parkinson’s disease management, particularly for treatment recommendations.