Evaluating Large Language Model-Generated Brain MRI Protocols: Performance of GPT-4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B

Abstract

Purpose

To evaluate the potential of large language models (LLMs) to generate sequence-level brain MRI protocols.

Methods

A dataset of 150 brain MRI cases was derived from imaging request forms obtained at the local institution. For each case, a reference MRI protocol was established by two board-certified neuroradiologists, with discrepancies resolved through consensus. GPT-4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B were used to generate brain MRI protocols from the case descriptions. For each model, protocol generation was conducted under two conditions: 1) with additional in-context learning comprising local standard protocols and sequence explanations (enhanced) and 2) without additional external information (base). Additionally, two radiology residents independently defined MRI protocols for a subsample of 50 cases. The frequencies of redundant sequences, total missing sequences, and missing critical sequences were recorded. The sum of redundant and missing sequences (accuracy index) served as a comprehensive metric of LLM protocoling performance. Accuracy indices were compared between groups using paired t-tests, with false discovery rate correction applied to control for multiple testing.
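The accuracy index described above can be illustrated with a minimal sketch: per case, redundant sequences are those generated but absent from the reference, missing sequences are those in the reference but omitted, and the index is their sum (lower is better). This is not the study's code, and the sequence names below are hypothetical examples.

```python
# Illustrative sketch (not the authors' implementation): per-case
# accuracy index as the count of redundant plus missing sequences.

def accuracy_index(reference: set, generated: set) -> dict:
    """Compare a generated MRI protocol against the reference protocol."""
    redundant = generated - reference   # sequences not in the reference
    missing = reference - generated     # reference sequences omitted
    return {
        "redundant": len(redundant),
        "missing": len(missing),
        "accuracy_index": len(redundant) + len(missing),  # lower is better
    }

# Hypothetical example case
reference = {"T1 MPRAGE", "T2 FLAIR", "DWI", "SWI", "T1 post-contrast"}
generated = {"T1 MPRAGE", "T2 FLAIR", "DWI", "T2 TSE"}

print(accuracy_index(reference, generated))
# → {'redundant': 1, 'missing': 2, 'accuracy_index': 3}
```

Per-case indices from two conditions (e.g., base vs. enhanced) would then form the paired samples entering the t-tests described above.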

Results

The two neuroradiologists achieved substantial inter-rater agreement (Cohen’s κ = 0.74). The lowest accuracy index, and therefore the best performance, was observed with o3-mini (base: 2.65; enhanced: 1.94), followed by GPT-4o (base: 3.11; enhanced: 2.23), DeepSeek-R1 (base: 3.42; enhanced: 2.37) and Qwen2.5-72B (base: 5.95; enhanced: 2.75). o3-mini consistently outperformed the other models by a significant margin. All four models showed highly significant performance improvements under the enhanced condition (adj. p < 0.001 for all models), driven primarily by a substantial reduction in redundant MRI sequences. In the subsample, the best-performing LLM (o3-mini [enhanced]) yielded an accuracy index comparable to that of the residents (o3-mini [enhanced]: 1.92; resident 1: 1.80; resident 2: 1.44).
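The inter-rater agreement reported above can be sketched as follows: Cohen’s κ corrects observed agreement for the agreement expected by chance, given each rater’s label frequencies. This is an illustrative sketch, not the study’s code, and the labels are hypothetical.

```python
from collections import Counter

# Illustrative sketch (not the study's code): Cohen's kappa for two
# raters' categorical choices. Labels below are hypothetical.

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2          # chance agreement
    return (p_o - p_e) / (1 - p_e)

rater1 = ["tumor protocol", "tumor protocol", "stroke protocol", "stroke protocol"]
rater2 = ["tumor protocol", "tumor protocol", "stroke protocol", "tumor protocol"]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.5
```

Values of κ between 0.61 and 0.80 are conventionally read as "substantial" agreement, which is how the reported κ = 0.74 is characterized.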

Conclusion

Our findings demonstrate the promising potential of LLMs for automating brain MRI protocoling, especially when augmented through in-context learning. o3-mini exhibited the best performance, followed by GPT-4o.