High Concordance Between GPT-4o and Multidisciplinary Tumor Board Decisions in Breast Cancer: A Retrospective Decision Support Analysis
Abstract
Background: Large language models (LLMs) such as ChatGPT have gained attention for their potential to assist clinical decision-making in oncology. However, real-world validation of these models against multidisciplinary tumor board (MTB) recommendations, particularly in breast cancer treatment, remains limited.

Methods: This retrospective study assessed the concordance between GPT-4o and the decisions of a breast cancer MTB over a six-month period. Thirty-three patients were included. Structured clinical data were entered into GPT-4o using standardized prompts, and treatment plans were generated in two independent sessions per case. Seven therapeutic domains were evaluated: surgery, radiotherapy, hormonal therapy, neoadjuvant therapy, adjuvant therapy, genetic counseling/testing, and dual HER2-targeted therapy. Two blinded reviewers scored concordance on a 5-point Likert scale. Inter-rater reliability and classification metrics were calculated.

Results: GPT-4o generated consistent recommendations across both sessions for all patients. Full concordance (5/5) with MTB decisions was observed in 31 of 33 cases (93.9%), and partial concordance (4/5) in 2 cases (6.1%), both attributable to differences regarding genetic counseling. Inter-rater agreement was perfect (Cohen's kappa = 1.00), and the mean concordance score was 4.94 out of 5. The model achieved an overall accuracy of 93.9%, precision of 93.9%, recall of 100%, and an F1 score of 96.8%.

Conclusion: GPT-4o demonstrated a high level of agreement with expert multidisciplinary decisions in breast cancer care when provided with structured clinical input. These findings support its potential as a reproducible, guideline-consistent decision-support tool in oncology workflows.
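The reported classification metrics follow directly from the case counts. Below is a minimal sketch of that arithmetic, assuming (as the numbers suggest, though the abstract does not state the confusion-matrix framing) that fully concordant cases were counted as true positives, partially concordant cases as false positives, and that there were no false negatives or true negatives; under this assumption the F1 score comes out to roughly 96.9%, versus the reported 96.8%, presumably a rounding difference.

```python
# Sketch: reproducing the reported metrics from the case counts.
# Assumption (not stated in the abstract): full concordance = true positive,
# partial concordance = false positive; no false negatives or true negatives.

true_positives = 31    # full concordance (5/5)
false_positives = 2    # partial concordance (4/5)
false_negatives = 0
total_cases = true_positives + false_positives + false_negatives

accuracy = true_positives / total_cases
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.1%}")   # 93.9%
print(f"precision: {precision:.1%}")  # 93.9%
print(f"recall:    {recall:.1%}")     # 100.0%
print(f"F1 score:  {f1:.1%}")         # ~96.9% (reported as 96.8%)
```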