Evaluating Large Language Model-Generated Brain MRI Protocols: Performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B

Su Hwan Kim
Severin Schramm
Lena Schmitzer
Kerem Serguen
Sebastian Ziegelmayer
Felix Busch
Alexander Komenda
Marcus R. Makowski
Lisa C. Adams
Keno K. Bressem
Claus Zimmer
Jan Kirschke
Benedikt Wiestler
Dennis Hedderich
Tom Finck
Jannis Bodden

Abstract

Purpose

To evaluate the potential of LLMs to generate sequence-level brain MRI protocols.

Methods

A dataset of 150 brain MRI cases was derived from imaging request forms obtained from the local institution. For each case, a reference MRI protocol was established by two board-certified neuroradiologists, with discrepancies resolved through consensus. GPT-4o, o3-mini, DeepSeek-R1 and Qwen-2.5-72B were employed to generate brain MRI protocols based on the case descriptions. For each model, protocol generation was conducted under two conditions: 1) with additional in-context learning involving local standard protocols and sequence explanations (enhanced) and 2) without additional external information (base). Additionally, two radiology residents independently defined MRI protocols for a subsample of 50 cases. The frequencies of redundant sequences, total missing sequences, and missing critical sequences are reported. The sum of redundant and missing sequences (accuracy index) was defined as a comprehensive metric to evaluate LLM protocoling performance. Accuracy indices were compared between groups using paired t-tests, with false discovery rate correction applied to control for multiple testing.
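The per-case metrics described above can be illustrated with a minimal sketch. It assumes each protocol is a plain collection of sequence names and that redundant and missing sequences are simple set differences against the reference protocol; the function name, the sequence labels, and the choice of critical sequences are hypothetical and are not taken from the study.

```python
def protocol_errors(reference, generated, critical=frozenset()):
    """Count redundant and missing sequences for a single case.

    Assumption: sequences are compared by exact name. The study's own
    matching rules are not described in the abstract.
    """
    ref = set(reference)
    gen = set(generated)
    redundant = gen - ref                      # generated but not in the reference
    missing = ref - gen                        # in the reference but not generated
    missing_critical = missing & set(critical)
    return {
        "redundant": len(redundant),
        "missing": len(missing),
        "missing_critical": len(missing_critical),
        # Accuracy index: sum of redundant and missing sequences (lower is better).
        "accuracy_index": len(redundant) + len(missing),
    }


# Hypothetical example case (sequence names are illustrative only).
reference = ["T1", "T2", "FLAIR", "DWI", "T1 post-contrast"]
generated = ["T1", "T2", "FLAIR", "SWI", "MRA"]
result = protocol_errors(reference, generated, critical={"DWI"})
print(result)
```

Averaging `accuracy_index` over all cases would give a per-model figure of the kind reported in the Results, under the assumptions stated above.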

Results

The two neuroradiologists achieved substantial inter-rater agreement (Cohen’s κ = 0.74). The lowest accuracy index, and therefore the best performance, was observed with o3-mini (base: 2.65; enhanced: 1.94), followed by GPT-4o (base: 3.11; enhanced: 2.23), DeepSeek-R1 (base: 3.42; enhanced: 2.37) and Qwen-2.5-72B (base: 5.95; enhanced: 2.75). o3-mini consistently outperformed the other models by a significant margin. All four models showed highly significant performance improvements under the enhanced condition (adj. p < 0.001 for all models), primarily driven by a substantial reduction of redundant MRI sequences. In the subsample, the highest-performing LLM (o3-mini [enhanced]) yielded an accuracy index comparable to that of the residents (o3-mini [enhanced]: 1.92, resident 1: 1.80, resident 2: 1.44).
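The adjusted p-values above result from a false discovery rate correction. The abstract does not name the procedure, so the following is only a sketch that assumes the widely used Benjamini-Hochberg step-up method; the function name and the input p-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: this is one common FDR procedure, not necessarily the one
    the authors applied.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

A comparison would then be called significant when its adjusted value falls below the chosen threshold.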

Conclusion

Our findings demonstrate promising potential of LLMs in automating brain MRI protocoling, especially when augmented through in-context learning. o3-mini exhibited superior performance, followed by GPT-4o.

Version published to 10.1101/2025.04.08.25325433v1 on medRxiv
Apr 9, 2025

Related articles

Evaluating Large Language Model Performance and Reliability in Scoring Picture Description Tasks for Neuropsychological Assessment

This article has 1 author:
1. Michael J Kleiman
This article has no evaluations. Latest version May 8, 2025

Grounding Large Language Model in Clinical Diagnostics

This article has 14 authors:
1. Jian Li
2. Xi Chen
3. Hanyu Zhou
4. Huahui Yi
5. Mingke You
6. Weizhi Liu
7. Li Wang
8. Hairui Li
9. Xue Zhang
10. Yingman Guo
11. Lei Fan
12. Qicheng Lao
13. Weili Fu
14. Kang Li
This article has no evaluations. Latest version Apr 15, 2025

Evaluation of the Precision of a Surgery Subspecialty-Specific Large Language Model, AtlasGPT, in Relation to Standard Models Using Board-Like Questions

This article has 4 authors:
1. Brandon L. Staple
2. Elijah M. Staple
3. Cynthia Wallace
4. Bevan D. Staple
This article has no evaluations. Latest version May 9, 2025
