Detection without calibration: benchmarking domestic and international large language models for quality control of Mandarin ¹⁸F-FDG PET/CT reports

Jingbo Wang
Weiqing Tang
Xingdi Ma
Huimin Yan
Ying Yuan

Abstract

Large language models (LLMs) are increasingly used for automated quality control (QC) of radiology reports. However, their reliability on Mandarin-language reports, and the relative performance of domestic versus international flagship models, remain unknown. We benchmarked 14 LLM configurations, seven Chinese-developed (“domestic”) and seven international, on 1,000 whole-body ¹⁸F-FDG PET/CT reports split into an error-injected “junior-doctor” arm and a low-residual “finalised” arm (500 each), using a controlled error-injection gold standard. Under blinded zero-shot prompting, each model flagged six error types and assigned a 1–5 overall score. Two distinct abilities, error-detection macro-F1 (0.356–0.667) and overall-score calibration (ICC[2,1] 0.099–0.627), were weakly and not significantly correlated across models (Spearman ρ = 0.38, P = 0.18). The dissociation was instead evident in sharp rank reversals: the strongest detector (Claude-Opus-4.8, macro-F1 0.667) calibrated poorly (ICC 0.491), while the three best-calibrated models were all domestic (MiMo 0.627, GLM-5 0.612, DeepSeek 0.609). Once the access channel was controlled, domestic and international error detection were statistically indistinguishable (Δmacro-F1 = −0.011, P = 0.84); domestic models showed consistent but not significant advantages in calibration (ΔICC = +0.142) and Chinese-character-error detection (ΔF1 = +0.109), accompanied by substantially lower cost (US$0.09–2.71 vs US$0.26–14.5 per 1,000 reports) and on-premise deployability. Re-running two flagships through both agent channels and clean APIs showed that the agent channel inflated both detection and calibration (GPT-5.5 ΔICC = +0.098, 95% CI 0.070–0.128), confirming that uncontrolled benchmarks over-credit agent-channel models. Missed-diagnosis detection was the universal weakness (best 0.467) and the one category in which human physicians outperformed every model.
Raw detection ability does not guarantee a trustworthy score, and domestic and international models differ by deployment-relevant profile rather than by overall performance rank; both distinctions are essential for clinical nuclear-medicine QC.
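
The abstract reports two metrics without spelling out how they are computed. The following is a minimal sketch of both, not the authors' code. It assumes that macro-F1 is the unweighted mean of per-type F1 over multi-label flags, that a type with no true or predicted flags scores 0, and that ICC[2,1] follows the standard two-way random-effects, absolute-agreement, single-rater formula (Shrout and Fleiss). The function names `macro_f1` and `icc_2_1`, the error-type labels, and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean


def macro_f1(gold, pred, error_types):
    """Unweighted mean of per-type F1 over multi-label error flags.

    gold, pred: one set of flagged error types per report.
    Assumption: a type with no true or predicted flags contributes F1 = 0.
    """
    scores = []
    for t in error_types:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if t in g and t in p)
        fp = sum(1 for g, p in pairs if t not in g and t in p)
        fn = sum(1 for g, p in pairs if t in g and t not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: n rows (reports) by k columns (raters), e.g. model score
    and reference score, with no missing values.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-report mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Toy data for illustration only; labels are hypothetical.
    types = ["typo", "missed_dx"]
    gold = [{"typo"}, {"missed_dx"}, set()]
    pred = [{"typo"}, set(), {"missed_dx"}]
    print("macro-F1:", macro_f1(gold, pred, types))

    # Columns: model overall score, reference overall score (1-5).
    scores = [[1, 2], [3, 3], [5, 4], [2, 2], [4, 5]]
    print("ICC(2,1):", icc_2_1(scores))
```

Because ICC(2,1) penalises systematic offsets between raters as well as noise, a model can rank reports correctly yet still calibrate poorly, which is consistent with the dissociation between detection and calibration described above.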

Version published to 10.64898/2026.06.24.26356406 on medRxiv
Jun 26, 2026

Related articles

Board-Level Performance of Leading Open-Weight Vision-Language Models on the Japanese Diagnostic Radiology Board Examination: Reasoning, Image-Input, and Language Effects

This article has 14 authors:
1. Yuki Sonoda
2. Yosuke Yamagishi
3. Yuichiro Hirano
4. Soichiro Miki
5. Takahiro Nakao
6. Shouhei Hanaoka
7. Yukihiro Nomura
8. Akiyoshi Hamada
9. Noriko Kanemaru
10. Rintaro Miyo
11. Masumi Mizuki Takahashi
12. Reina Hosoi
13. Takeharu Yoshikawa
14. Osamu Abe
This article has no evaluations. Latest version: Jul 13, 2026

Uncertainty-aware extraction of clinical findings from Finnish EHRs using open large language models

This article has 5 authors:
1. Jussi Leinonen
2. Juha Knuuttila
3. Siina Pamilo
4. Samu Kurki
5. Miika Koskinen
This article has no evaluations. Latest version: Jul 9, 2026

Assessment of Zero-Shot Large Language Model (LLM) Assisted Clinical Trial Matching Processes: A Metastatic Cancer Use Case

This article has 10 authors:
1. Yingjie Weng
2. Himani Yalamaddi
3. Danning Fu
4. Ankita Mishra
5. Bryan J. Bunning
6. Andrew B. Martin
7. Jessica Hope
8. Vivek Charu
9. Allison Kurian
10. Manisha Desai
This article has no evaluations. Latest version: Jul 10, 2026
