Clinical performance tradeoffs of ChatGPT-5.2 Thinking (OpenAI) compared with radiologist interpretation in biopsy-referred mammography: cancer detection, false positives, and laterality
Abstract
Purpose: To compare ChatGPT-5.2 Thinking (OpenAI) with practicing radiologists on the clinically relevant, examination-level task of classifying biopsy-proven malignancy in a biopsy-referred mammography test set, and to assess performance by radiologic feature type and laterality.

Methods: In this multicenter retrospective study across several cities in Saudi Arabia, screening mammograms from an initial cohort of 1,225 women were linked to pathology to create a biopsy-anchored reference standard. Board-certified breast radiologists, blinded to pathology and model outputs, provided study-level BI-RADS® assessments. ChatGPT-5.2 Thinking received de-identified bilateral CC and MLO views with a fixed BI-RADS–based prompt and produced an ordinal BI-RADS category (0–5) and a suspected laterality. The analytic test set comprised 100 examinations that proceeded to biopsy (61 biopsy-confirmed cancers and 39 biopsy-negative controls). Primary outcomes were case-level sensitivity, specificity, and accuracy; secondary outcomes included laterality performance and feature-level patterns. All analyses were executed in Python and R on secured institutional workstations.

Results: ChatGPT-5.2 Thinking demonstrated higher sensitivity than radiologists (95.1% vs 82.0%) but lower specificity (10.3% vs 56.4%), yielding lower overall accuracy (62.0% vs 72.0%). By feature, the model showed the highest sensitivity for dense parenchymal patterns and the highest specificity for architectural distortion, tended to overcall mass-like findings, and performed weakest for microcalcifications. Laterality accuracy was 60.7%.

Conclusion: In this biopsy-referred, pathology-anchored evaluation, ChatGPT-5.2 Thinking showed higher sensitivity but substantially lower specificity than radiologists, supporting a potential role as a concurrent aid or triage signal rather than a stand-alone reader, pending prospective validation.
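As a consistency check, the reported rates can be back-calculated from the 100-case test set (61 cancers, 39 controls). The confusion-matrix counts below are inferred from the rounded percentages in the abstract and are illustrative assumptions, not source data:

```python
# Sketch: back-calculate the confusion matrices implied by the abstract's
# 100-case test set (61 biopsy-confirmed cancers, 39 biopsy-negative controls).
# The TP/FN/TN/FP counts are inferred from the rounded rates, not source data.

def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages, 1 decimal."""
    sens = 100 * tp / (tp + fn)
    spec = 100 * tn / (tn + fp)
    acc = 100 * (tp + tn) / (tp + fn + tn + fp)
    return round(sens, 1), round(spec, 1), round(acc, 1)

# ChatGPT-5.2 Thinking: 95.1% of 61 cancers -> 58 TP; 10.3% of 39 controls -> 4 TN
print(metrics(tp=58, fn=3, tn=4, fp=35))    # -> (95.1, 10.3, 62.0)

# Radiologists: 82.0% of 61 cancers -> 50 TP; 56.4% of 39 controls -> 22 TN
print(metrics(tp=50, fn=11, tn=22, fp=17))  # -> (82.0, 56.4, 72.0)
```

The two recovered confusion matrices reproduce every headline figure in the Results, including the accuracy gap (62.0% vs 72.0%) that follows from the model's 35 false positives against the radiologists' 17.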