Three multimodal large language models fail at clinically actionable breast pathology in three different directions

Sun-Young Jun
Seonhui Kim
Young-Joon Kang

Abstract

Background

Breast cancer treatment depends on histopathological features such as grade and receptor-defined subtype, but access to specialist pathologists is constrained where the workforce is limited. Commercial multimodal large language models (MLLMs) accept hematoxylin and eosin (H&E) image tiles through paid interfaces without local hardware or fine-tuning. However, prior pathology evaluations addressed only coarse tasks. Whether these models reach treatment-determining accuracy, and whether vendors agree, remains unclear.

Methods

We evaluated three vendor-designated flagship MLLMs (Claude Sonnet 4.6, Gemini 2.5 Pro, GPT-5.5) in 427 invasive breast cancer cases. Each case was submitted to all three models with identical H&E tiles and prompts, and the subtype was inferred in a second call. The reference standard was the institutional sign-out report, with the subtype derived from immunohistochemistry. We calculated concordance, sensitivity, specificity, and Cohen’s kappa, and ran pairwise McNemar and Bowker tests.
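The pairwise McNemar comparison named above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each model's per-case result has been reduced to correct or incorrect against the reference, and it uses the exact (binomial) two-sided form of the test, which the source does not specify. The function names and the toy data are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count paired cases where exactly one of the two models is correct."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")
    # b: model A correct, model B wrong; c: model B correct, model A wrong
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair falls in either cell
    with probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical toy data (NOT study results): 17 paired cases.
model_a = [True] * 10 + [False] * 2 + [True] * 5
model_b = [False] * 10 + [True] * 2 + [True] * 5

b, c = discordant_counts(model_a, model_b)
p_value = mcnemar_exact_p(b, c)
print(f"discordant pairs: b={b}, c={c}, exact p={p_value:.4f}")
```

Only the discordant pairs enter the test, which is why a paired design (every case sent to every model) is needed: cases that both models get right or both get wrong carry no information about which model is better.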

Findings

Claude ranked highest by raw histologic-type concordance but lowest by kappa, classifying all 23 lobular and seven micropapillary carcinomas as invasive breast carcinoma of no special type. The models anchored the Nottingham grade to three modal grades. None of the models reliably identified human epidermal growth factor receptor 2 (HER2)-positive disease. The failure direction was vendor-specific: Claude and GPT-5.5 under-detected, whereas Gemini over-called. Twelve prompt variants (4,056 calls) did not recover sensitivity.

Interpretation

No current commercial MLLM reaches deployment-ready accuracy for any treatment-determining feature of breast pathology. As each vendor fails in its own fixed direction, changing vendors alters the type of error rather than removing it; therefore, the value of these models is assistive rather than autonomous. At USD 0.20–0.50 per case, they may serve as supervised draft generators that leave the diagnosis with the pathologist.

Research in context

Evidence before this study

We searched PubMed and Embase from database inception to January 10, 2026, without language restriction, using combinations of the terms “large language model,” “multimodal,” “GPT,” “Gemini,” “Claude,” “foundation model,” “breast cancer,” “pathology,” “histopathology,” “whole-slide image,” and “diagnostic accuracy.” We also screened the reference lists of retrieved articles. Task-specific deep-learning models and pathology foundation models pretrained on large slide collections achieve strong performance on individual breast pathology tasks such as Nottingham grading and receptor-status prediction, but require local GPU infrastructure, curated training data, and deployment expertise. Evaluations of general-purpose, commercially available multimodal large language models (MLLMs) in pathology were limited to coarse tasks, such as tissue-type classification and metastasis detection, and were typically confined to a single model. We found no studies directly comparing current flagship commercial MLLMs on clinically relevant, treatment-determining breast pathology tasks, and none reporting whether vendors agree or whether the choice of vendor changes the diagnosis. The available evidence was limited by single-model designs, task-narrow datasets, and reliance on raw accuracy without chance-corrected agreement.

Added value of this study

In this paired, single-center retrospective study of 427 invasive breast cancers, we evaluated three vendor-designated flagship commercial MLLMs (Claude Sonnet 4.6, Gemini 2.5 Pro, GPT-5.5) on identical H&E tiles and prompts against an expert pathologist reference, across treatment-determining features. No model reached deployment-ready accuracy for any feature. Raw concordance and chance-corrected agreement ranked the models in opposite order, so a decision based on raw accuracy alone would have selected the least discriminating model. Each model showed a consistent error pattern, tending toward either under-detection or over-calling. Consequently, changing vendors altered the type of diagnostic error rather than removing it. Twelve prompt variants across 4,056 calls, a reasoning-effort escalation, and an intra-vendor version upgrade did not change the failure direction, indicating a vendor-specific prior rather than a prompt-engineering artifact. To our knowledge, this is the first head-to-head, chance-corrected comparison of commercial MLLMs at the level of treatment-determining breast pathology.
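Why raw concordance and Cohen's kappa can rank models in opposite order is easiest to see with numbers. The sketch below is illustrative only: the class counts and both models' predictions are hypothetical toy data, not the study's results, and the helper names are assumptions. It computes kappa as (observed agreement minus chance agreement) divided by (1 minus chance agreement), with chance agreement taken from the marginal label frequencies.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Raw concordance: fraction of cases where the labels match."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohens_kappa(reference, predicted):
    """Cohen's kappa for two label sequences of equal, non-zero length."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(ref_counts[k] * pred_counts[k] for k in ref_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one label used by both raters
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (NOT study results): 90 common-type, 10 rare-type cases.
reference = ["NST"] * 90 + ["lobular"] * 10

# Model A always predicts the majority class.
model_a = ["NST"] * 100

# Model B finds 8 of 10 rare cases but mislabels 10 common cases.
model_b = ["NST"] * 80 + ["lobular"] * 10 + ["lobular"] * 8 + ["NST"] * 2

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, round(accuracy(reference, pred), 3),
          round(cohens_kappa(reference, pred), 3))
```

In this toy setup the majority-class model has the higher raw concordance but a kappa of zero, because all of its agreement is what chance alone would produce given the class imbalance. That is the mechanism by which selecting on raw accuracy can favor the least discriminating model.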

Implications of all the available evidence

Taken together with previous evidence, our findings indicate that current commercial MLLMs are not ready for autonomous interpretation of breast pathology and should not be used as primary readers without pathologist oversight. Their value, if any, is assistive. Given their relatively low per-case cost, these systems may be useful for generating supervised draft reports where pathologist workload is the primary constraint, provided that digital pathology infrastructure and immunohistochemical testing are available and that final diagnostic responsibility remains with a qualified pathologist. Because the failure direction is vendor-specific, deployment requires vendor-aware caveats and item-level, chance-corrected evaluation rather than raw accuracy. Improving performance will likely require incorporation of domain-specific knowledge through approaches such as in-context reference examples, retrieval against a curated atlas, or handoff to a domain-trained model. Future development should be supported by immunohistochemistry-grounded datasets and prospective validation in the target setting.

Version published to 10.64898/2026.06.18.26355928 on medRxiv
Jun 22, 2026

