Evaluating Large Language Model-Generated Brain MRI Protocols: Performance of GPT-4o, o3-mini, DeepSeek-R1, and Qwen2.5-72B
Abstract
Purpose
To evaluate the potential of LLMs to generate sequence-level brain MRI protocols.
Methods
A dataset of 150 brain MRI cases was derived from imaging request forms obtained from the local institution. For each case, a reference MRI protocol was established by two board-certified neuroradiologists, with discrepancies resolved through consensus. GPT-4o, o3-mini, DeepSeek-R1, and Qwen2.5-72B were used to generate brain MRI protocols from the case descriptions. For each model, protocols were generated under two conditions: (1) with additional in-context learning comprising local standard protocols and sequence explanations (enhanced) and (2) without additional external information (base). Additionally, two radiology residents independently defined MRI protocols for a subsample of 50 cases. The frequencies of redundant sequences, total missing sequences, and missing critical sequences were recorded. The sum of redundant and missing sequences (accuracy index) served as a comprehensive metric of protocoling performance. Accuracy indices were compared between groups using paired t-tests, with false discovery rate correction applied to control for multiple testing.
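The evaluation pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' code: the per-case counts are toy values, the function names are hypothetical, and the Benjamini-Hochberg procedure is one common way to apply false discovery rate correction (the paper does not specify which FDR method was used).

```python
# Hedged sketch of the abstract's evaluation metric and statistics.
# All data below are illustrative toy values, not study results.
from scipy import stats


def accuracy_index(redundant: int, missing: int) -> int:
    """Accuracy index = redundant + missing sequences (lower is better)."""
    return redundant + missing


def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (one common FDR correction)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank, i in zip(range(n, 0, -1), reversed(order)):
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy per-case (redundant, missing) counts for one model, both conditions.
base = [accuracy_index(r, m) for r, m in [(2, 1), (3, 0), (1, 2), (4, 1)]]
enhanced = [accuracy_index(r, m) for r, m in [(1, 1), (1, 0), (0, 2), (2, 1)]]

# Paired t-test on the same cases under the two conditions; p-values
# from all model comparisons would then be FDR-corrected together.
t_stat, p_value = stats.ttest_rel(base, enhanced)
adjusted = fdr_bh([p_value])
```

A paired test is appropriate here because each case yields one accuracy index per condition, so base and enhanced values form matched pairs.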
Results
The two neuroradiologists achieved substantial inter-rater agreement (Cohen’s κ = 0.74). The lowest accuracy index, and thus the best performance, was observed with o3-mini (base: 2.65; enhanced: 1.94), followed by GPT-4o (base: 3.11; enhanced: 2.23), DeepSeek-R1 (base: 3.42; enhanced: 2.37), and Qwen2.5-72B (base: 5.95; enhanced: 2.75). o3-mini consistently outperformed the other models by a significant margin. All four models showed highly significant performance improvements under the enhanced condition (adj. p < 0.001 for all models), driven primarily by a substantial reduction in redundant MRI sequences. In the subsample, the best-performing LLM (o3-mini [enhanced]) achieved an accuracy index comparable to that of the residents (o3-mini [enhanced]: 1.92; resident 1: 1.80; resident 2: 1.44).
Conclusion
Our findings demonstrate the promising potential of LLMs for automating brain MRI protocoling, particularly when augmented through in-context learning. o3-mini exhibited the best performance, followed by GPT-4o.