The Performance of ChatGPT-4o and DeepSeek-R1 in Interpreting Thyroid Nodule Ultrasound Text Report: A Multicenter Study

Yujie Xie
Bing Zhan
Kangfan Zhang
Yuchen Li
Jiarui Liu
Chunping Ning

Abstract

Objective To assess two large language models (LLMs), DeepSeek-R1 and ChatGPT-4o, in interpreting thyroid nodule ultrasound text reports, focusing on accuracy in benign-malignant differentiation, agreement in Chinese Thyroid Imaging Reporting and Data System (C-TIRADS) classification and management recommendation, and the stability of each task.

Methods We analyzed 1,063 ultrasound text reports from three medical centers, with 306 nodules confirmed by histopathology. Each nodule's report was processed by both LLMs using standardized prompts. Each query was repeated five times, and the final result was determined by mode voting.

Results DeepSeek-R1 outperformed ChatGPT-4o in differentiating benign from malignant nodules, with higher sensitivity (0.879 vs. 0.692), accuracy (0.729 vs. 0.644), and area under the curve (AUC) (0.694 vs. 0.632). However, senior radiologists achieved notably higher accuracy (0.804) and AUC (0.865) than either LLM. In C-TIRADS classification, DeepSeek-R1 also outperformed ChatGPT-4o (κ = 0.770 vs. κ = 0.688, Δκ = 0.083 [95% CI: 0.048, 0.122]). Both models showed substantial agreement with clinicians on management recommendation (κ = 0.606 vs. κ = 0.608, Δκ = -0.002 [95% CI: -0.044, 0.041]). In terms of stability, both LLMs exhibited almost perfect agreement in C-TIRADS classification (α = 0.864 vs. α = 0.866, Δα = -0.003 [95% CI: -0.023, 0.017]) and management recommendation (κ = 0.853 vs. κ = 0.849, Δκ = 0.004 [95% CI: -0.026, 0.033]). In benign-malignant discrimination, however, DeepSeek-R1 demonstrated significantly greater stability than ChatGPT-4o (κ = 0.849 vs. κ = 0.550, Δκ = 0.260 [95% CI: 0.191, 0.321]).

Conclusion Our study highlights the potential of LLMs for interpreting thyroid nodule ultrasound text reports. DeepSeek-R1 outperformed ChatGPT-4o in benign-malignant differentiation accuracy and classification consistency, whereas the two models performed similarly in management recommendation.
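
The mode-voting step from the Methods and the kappa-style agreement statistics from the Results can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `mode_vote` and `cohen_kappa` are hypothetical, the tie-breaking rule (the first-seen label among tied labels wins) is an assumption because the abstract does not describe one, and unweighted two-rater Cohen's kappa is assumed because the abstract does not state which kappa variant was used.

```python
from collections import Counter


def mode_vote(runs):
    """Return the most frequent label among repeated model runs.

    Assumption: when several labels tie, the one that appears first
    in `runs` is returned. The source does not specify tie handling.
    """
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(runs)
    best = max(counts.values())
    for label in runs:
        if counts[label] == best:
            return label


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Five hypothetical runs for one nodule, reduced to a single final answer.
final_label = mode_vote(["malignant", "benign", "malignant", "malignant", "benign"])
```

With five runs and a binary benign/malignant output a tie cannot occur, but it can for multi-category outputs such as C-TIRADS classes, which is why the sketch makes its tie rule explicit.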

Version published to 10.21203/rs.3.rs-7574125/v1 on Research Square
Oct 23, 2025
