Large Language Model Performance in UK Advice & Guidance: A Pilot Study in Neurology

J Healy
A Marvasti
D Wallace
A Baheerathan
A Ghosh
J Kossoff
S Thio
MS Balaratnam
S Haider
S Ellershaw
R Dobson


Abstract

Background

Large language models (LLMs) perform strongly in controlled medical settings such as multiple-choice exams, but their utility in real-world clinical workflows remains unproven. The NHS Advice & Guidance (A&G) service, in which primary care clinicians submit text-based queries to specialists, provides an environment for evaluating the clinical performance of an LLM acting as a specialist.

Methods

We compared responses from MedGemma 4B-IT, an open-weight model deployed locally on hospital infrastructure, against specialist neurologist responses across 50 adult neurology A&G cases from University College London Hospital. Two neurologists and two GPs rated 80 blinded and 20 unblinded responses for outcome, safety, efficacy, and feasibility using standardised criteria; outcome was rated as binary (correct/incorrect), while the other domains were scored from 1 to 5. Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs).
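As an illustration of the inter-rater reliability step, the sketch below computes one common form of the intraclass correlation coefficient. The abstract does not state which ICC form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, and the function name `icc_2_1` and the toy scores are hypothetical.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated response and one
    column per rater. The ICC form is an assumption for illustration;
    the study's abstract does not specify which form was used.
    """
    n = len(ratings)          # number of rated responses
    k = len(ratings[0])       # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least 2 responses and 2 raters")
    if any(len(row) != k for row in ratings):
        raise ValueError("every response needs a score from every rater")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        return float("nan")   # no variance at all: ICC is undefined
    return (ms_rows - ms_err) / denom


# Hypothetical 1-5 safety scores: 5 responses rated by 4 raters.
toy_scores = [
    [4, 5, 4, 4],
    [3, 4, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 1],
    [4, 4, 5, 3],
]
print(round(icc_2_1(toy_scores), 2))
```

Values around 0.5, as reported in this study, are conventionally read as moderate agreement: a substantial share of the score variance comes from raters disagreeing rather than from real differences between responses.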

Results

There were no statistically significant differences between blinded specialist neurologist and LLM responses in any domain (outcome: 84% vs 82%, p=0.67; safety: 3.98 vs 4.02, p=0.85; efficacy: 4.06 vs 3.98, p=0.61; feasibility: 4.39 vs 4.20, p=0.45). However, 10% of LLM responses received concerning scores (average score ≤2) compared with 0% of human responses, indicating potentially clinically important tail risk. Furthermore, unblinded ratings favoured human responses across all domains. Only 51% of binary outcomes had unanimous agreement, and inter-rater agreement was moderate across the other domains (ICC 0.50-0.52).
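The tail-risk and unanimity figures can be read as simple proportions. The sketch below shows one way to compute them, assuming that the "average score" is the mean across raters for a single response and that unanimity means every rater gave the same binary outcome; both readings, the function names, and the toy data are assumptions rather than the study's stated procedure.

```python
from statistics import mean


def concerning_fraction(scores, threshold=2.0):
    """Share of responses whose mean rater score is at or below threshold.

    `scores` holds one list of 1-5 rater scores per response.
    Averaging across raters is an assumed reading of "average score".
    """
    means = [mean(row) for row in scores]
    return sum(m <= threshold for m in means) / len(means)


def unanimous_fraction(outcomes):
    """Share of responses where all raters gave the same binary outcome."""
    return sum(len(set(row)) == 1 for row in outcomes) / len(outcomes)


# Hypothetical data: 4 responses, each rated by 4 raters.
toy_safety = [
    [4, 5, 4, 4],
    [1, 2, 2, 3],   # mean 2.0, counted as concerning
    [5, 4, 4, 5],
    [3, 4, 3, 3],
]
toy_outcomes = [
    [1, 1, 1, 1],   # unanimous "correct"
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
]

print(concerning_fraction(toy_safety))
print(unanimous_fraction(toy_outcomes))
```

Reporting a tail proportion alongside the mean matters here because two groups can have nearly identical average scores while only one of them contains low-scoring outliers.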

Conclusions

In this pilot study, aggregate scores for blinded human and LLM responses were similar, and no statistically significant differences were detected in this exploratory sample. However, aggregate metrics masked clinically important edge-case failures in LLM responses. Pronounced inter-rater variability and the potential impact of LLM/human syntax on blinded rater judgements highlight the challenges in establishing robust evaluation frameworks for clinical LLM deployment.

Version published to 10.64898/2026.05.13.26353081 on medRxiv
May 18, 2026

Related articles

Research through Evaluation for Large Language Model in Patient-Clinician Communications

This article has 16 authors:
1. Yuexing Hao
2. Jason Holmes
3. Jared Hobson
4. Alexandra Bennett
5. Elizabeth L. McKone
6. Daniel K. Ebner
7. David M. Routman
8. Satomi Shiraishi
9. Samir H. Patel
10. Nathan Y. Yu
11. Chris L. Hallemeier
12. Brooke E. Ball
13. Saleh Kalantari
14. Marzyeh Ghassemi
15. Mark Waddle
16. Wei Liu
This article has no evaluations. Latest version Jun 18, 2026

Uncertainty-aware extraction of clinical findings from Finnish EHRs using open large language models

This article has 5 authors:
1. Jussi Leinonen
2. Juha Knuuttila
3. Siina Pamilo
4. Samu Kurki
5. Miika Koskinen
This article has no evaluations. Latest version Jul 9, 2026

Use of large language models by academic hospitalists: results of a multicenter survey

This article has 5 authors:
1. Eric Bressman
2. Andrew Auerbach
3. Angela Keniston
4. Caroline Jens
5. Sumant Ranji
This article has no evaluations. Latest version May 29, 2026
