Clinical performance tradeoffs of ChatGPT-5.2 Thinking (OpenAI) compared with radiologist interpretation in biopsy-referred mammography: cancer detection, false positives, and laterality
Abstract
Purpose: To compare ChatGPT-5.2 Thinking (OpenAI) with practicing radiologists on the clinically relevant, examination-level task of classifying biopsy-proven malignancy in a biopsy-referred mammography test set, and to assess performance by radiologic feature type and laterality.

Methods: In this multicenter retrospective study across several cities in Saudi Arabia, screening mammograms from an initial cohort of 1,225 women were linked to pathology to create a biopsy-anchored reference standard. Board-certified breast radiologists, blinded to pathology and model outputs, provided study-level BI-RADS® assessments. ChatGPT-5.2 Thinking received de-identified bilateral CC and MLO views with a fixed BI-RADS–based prompt and produced an ordinal BI-RADS category (0–5) and a suspected laterality. The analytic test set comprised 100 examinations that proceeded to biopsy (61 biopsy-confirmed cancers and 39 biopsy-negative controls). Primary outcomes were case-level sensitivity, specificity, and accuracy; secondary outcomes included laterality performance and feature-level patterns. All analyses were executed in Python and R on secured institutional workstations.

Results: ChatGPT-5.2 Thinking demonstrated higher sensitivity than radiologists (95.1% vs 82.0%) but lower specificity (10.3% vs 56.4%), yielding lower overall accuracy (62.0% vs 72.0%). By feature, the model showed the highest sensitivity for dense parenchymal patterns and the highest specificity for architectural distortion, tended to overcall mass-like findings, and performed weakest for microcalcifications. Laterality accuracy was 60.7%.

Conclusion: In this biopsy-referred, pathology-anchored evaluation, ChatGPT-5.2 Thinking showed higher sensitivity but substantially lower specificity than radiologists, supporting a potential role as a concurrent aid or triage signal rather than a stand-alone reader, pending prospective validation.
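As a consistency check, the reported rates can be back-calculated from the 100-case test set (61 cancers, 39 controls). The confusion-matrix counts below are inferred from the rounded percentages in the abstract and are illustrative assumptions, not source data:

```python
# Sketch: back-calculate the confusion matrices implied by the abstract's
# 100-case test set (61 biopsy-confirmed cancers, 39 biopsy-negative controls).
# The TP/FN/TN/FP counts are inferred from the rounded rates, not source data.

def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages, 1 decimal."""
    sens = 100 * tp / (tp + fn)
    spec = 100 * tn / (tn + fp)
    acc = 100 * (tp + tn) / (tp + fn + tn + fp)
    return round(sens, 1), round(spec, 1), round(acc, 1)

# ChatGPT-5.2 Thinking: 95.1% of 61 cancers -> 58 TP; 10.3% of 39 controls -> 4 TN
print(metrics(tp=58, fn=3, tn=4, fp=35))    # -> (95.1, 10.3, 62.0)

# Radiologists: 82.0% of 61 cancers -> 50 TP; 56.4% of 39 controls -> 22 TN
print(metrics(tp=50, fn=11, tn=22, fp=17))  # -> (82.0, 56.4, 72.0)
```

The two recovered confusion matrices reproduce every headline figure in the Results, including the accuracy gap (62.0% vs 72.0%) that follows from the model's 35 false positives against the radiologists' 17.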