Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson’s Disease

Shechter Yosef
Klevor Raymond
Kouchache Trycia
Bouhadoun Sarah
Ronald B Postuma

Abstract

Background

The clinical applicability of large language models (LLMs) in Parkinson’s disease (PD) management remains insufficiently characterized, particularly in generative responses to clinical vignette scenarios.

Objective

To evaluate the quality of clinical assessments and management plans generated by a general-purpose LLM (Gemini 1.5 Pro) and a medically specialized LLM (OpenEvidence), and to compare their performance.

Methods

Both models generated free-text responses to 45 open-ended clinical queries, each asking for an assessment of the situation and a recommended management plan. Two movement disorders fellows rated the outputs on 5-point Likert scales, which were dichotomized into clinically appropriate (≥4) versus inappropriate (≤3). Discrepancies were adjudicated by a senior movement disorders specialist. Paired comparisons used McNemar’s test; a qualitative analysis examined severe errors.
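
The dichotomization and paired comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the abstract does not say which variant of McNemar's test was used, so the exact (binomial) form is assumed here, and the function names and the example ratings are hypothetical and are not study data.

```python
from math import comb

def dichotomize(score, threshold=4):
    """Map a 5-point Likert score to appropriate (True) or inappropriate (False)."""
    return score >= threshold

def mcnemar_exact(a_ok, b_ok):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p), where b counts pairs in which only model A was
    appropriate and c counts pairs in which only model B was appropriate.
    Concordant pairs do not contribute to the test.
    """
    b = sum(1 for x, y in zip(a_ok, b_ok) if x and not y)
    c = sum(1 for x, y in zip(a_ok, b_ok) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # One-sided binomial tail under p = 0.5, doubled and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical adjudicated ratings for eight vignettes (illustrative only)
scores_a = [5, 4, 3, 4, 2, 5, 3, 4]
scores_b = [5, 4, 4, 4, 4, 5, 3, 3]

a_ok = [dichotomize(s) for s in scores_a]
b_ok = [dichotomize(s) for s in scores_b]
b, c, p = mcnemar_exact(a_ok, b_ok)
print(f"discordant pairs: b={b}, c={c}, exact p={p:.3f}")
```

Because the test uses only discordant pairs, two models that agree on most vignettes can differ by several percentage points without the difference reaching significance.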

Results

Gemini 1.5 Pro and OpenEvidence showed high rates of clinically appropriate assessments (80.0% vs. 86.7%) but lower performance in management plans (48.9% vs. 57.8%). Both the assessment and the plan were clinically appropriate in 46.7% and 55.6% of cases, respectively. None of these differences reached statistical significance. Severe errors were uncommon in assessments (6.7% vs. 8.9%) but more frequent in plans (26.7% for both models), predominantly reflecting treatment strategy errors.
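
As a reading aid, the reported percentages can be converted back to case counts. The sketch below assumes every percentage uses all 45 vignettes as its denominator, which the abstract implies but does not state outright; the labels are shorthand introduced here, and the counts are inferred rather than reported.

```python
N_CASES = 45  # number of clinical queries stated in the Methods

# Percentages as reported in the Results (labels are shorthand, not the authors')
reported = {
    "Gemini assessment appropriate": 80.0,
    "OpenEvidence assessment appropriate": 86.7,
    "Gemini plan appropriate": 48.9,
    "OpenEvidence plan appropriate": 57.8,
    "Gemini both appropriate": 46.7,
    "OpenEvidence both appropriate": 55.6,
    "Gemini assessment severe error": 6.7,
    "OpenEvidence assessment severe error": 8.9,
    "Plan severe error (each model)": 26.7,
}

implied_counts = {}
for label, pct in reported.items():
    k = round(pct * N_CASES / 100)  # nearest whole number of cases
    # Check that the count reproduces the reported figure to one decimal place
    assert round(100 * k / N_CASES, 1) == pct, label
    implied_counts[label] = k
    print(f"{label}: {k}/{N_CASES} = {pct}%")
```

Under that assumption, the gap in appropriate assessments is three cases (36 vs. 39 of 45) and the gap in appropriate plans is four (22 vs. 26 of 45).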

Conclusions

In generative clinical reasoning tasks involving Parkinson’s disease management vignettes, LLMs demonstrated reasonable performance in assessment but consistent limitations in plan generation. The medically specialized LLM showed several qualitative advantages but no statistically significant performance benefit over the general-purpose model. These tools should therefore be used with appropriate caution in Parkinson’s disease management, particularly for treatment recommendations.

Version published to 10.64898/2026.05.13.26353021 on medRxiv
May 20, 2026

