Detection without calibration: benchmarking domestic and international large language models for quality control of Mandarin 18F-FDG PET/CT reports

Abstract

Large language models (LLMs) are increasingly used for automated quality control (QC) of radiology reports. However, the reliability of LLMs on reports in Mandarin, and the relative performance of domestic versus international flagship models, remain unknown. We benchmarked 14 LLM configurations, seven Chinese-developed ("domestic") and seven international models, on 1,000 whole-body 18F-FDG PET/CT reports split into an error-injected "junior-doctor" arm and a low-residual "finalised" arm (500 each), using a controlled error-injection gold standard. Under a blinded zero-shot prompt, each model flagged errors of six types and assigned a 1-5 overall score. Two distinct abilities, error-detection macro-F1 (0.356-0.667) and overall-score calibration (ICC[2,1] 0.099-0.627), were weakly and not significantly correlated across models (Spearman ρ = 0.38, P = 0.18). The dissociation was instead evident in sharp rank reversals: the strongest detector (Claude-Opus-4.8, macro-F1 0.667) calibrated poorly (ICC 0.491), whereas the three best-calibrated models were all domestic (MiMo 0.627, GLM-5 0.612, DeepSeek 0.609). Once the access channel was controlled, domestic and international error detection were statistically indistinguishable (Δmacro-F1 = -0.011, P = 0.84). Domestic models showed consistent but not significant advantages in calibration (ΔICC = +0.142) and in Chinese-character-error detection (ΔF1 = +0.109), accompanied by substantially lower cost (US$0.09-2.71 vs US$0.26-14.5 per 1,000 reports) and the option of on-premise deployment. Re-running two flagships through both agent channels and clean APIs showed that the agent channel inflated both detection and calibration (GPT-5.5 ΔICC = +0.098, 95% CI 0.070-0.128), confirming that uncontrolled benchmarks over-credit agent-channel models. Missed-diagnosis detection was the universal weakness (best F1 0.467) and the one category in which human physicians outperformed every model.
Raw detection ability does not guarantee a trustworthy score, and domestic and international models differ in deployment-relevant profile rather than in overall performance rank. Both distinctions are essential for clinical nuclear-medicine QC.
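The two headline metrics can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the error-type keys, and the toy data are hypothetical, and it assumes that macro-F1 is the unweighted mean of per-error-type binary F1 and that ICC[2,1] is the standard two-way random-effects, absolute-agreement, single-rater coefficient computed on a reports-by-raters score matrix (for example, reference score versus model score).

```python
from typing import Dict, Sequence


def f1_binary(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for one error type; each entry is 1 if the error is present/flagged."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Dict[str, Sequence[int]], pred: Dict[str, Sequence[int]]) -> float:
    """Unweighted mean of per-type F1 (keys are hypothetical error-type names)."""
    scores = [f1_binary(gold[t], pred[t]) for t in gold]
    return sum(scores) / len(scores)


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): rows are reports, columns are raters (e.g. reference, model)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-report mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data only; not taken from the study.
toy_gold = {"typo": [1, 1, 0, 0], "missed_diagnosis": [1, 0, 0, 0]}
toy_pred = {"typo": [1, 0, 1, 0], "missed_diagnosis": [1, 0, 0, 0]}
toy_macro_f1 = macro_f1(toy_gold, toy_pred)

# A model that ranks reports perfectly but scores every report one point high.
toy_icc = icc_2_1([[1, 2], [3, 4], [5, 6]])
```

In the second toy case the model orders the reports exactly as the reference does, yet the constant one-point offset pulls ICC(2,1) below 1 because the coefficient measures absolute agreement. This is one way a capable detector can still return a poorly calibrated overall score.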
