Three multimodal large language models fail at clinically actionable breast pathology in three different directions


Abstract

Background

Breast cancer treatment depends on histopathological features such as grade and receptor-defined subtype, but access to specialist pathologists is constrained where the workforce is limited. Commercial multimodal large language models (MLLMs) accept hematoxylin and eosin (H&E) image tiles through paid interfaces without local hardware or fine-tuning. Prior pathology evaluations, however, addressed only coarse tasks. Whether these models reach treatment-determining accuracy, and whether vendors agree with one another, remains unclear.

Methods

We evaluated three vendor-designated flagship MLLMs (Claude Sonnet 4.6, Gemini 2.5 Pro, GPT-5.5) on 427 invasive breast cancer cases. Each case was submitted to all three models with identical H&E tiles and prompts, and the subtype was inferred in a second call. The reference standard was the institutional sign-out report, with subtype derived from immunohistochemistry. We calculated concordance, sensitivity, specificity, and Cohen’s kappa, and ran pairwise McNemar and Bowker tests.
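Because every case is read by all three models, pairwise comparisons are paired, which is the setting McNemar's test is designed for. The following is a minimal sketch of an exact (binomial) McNemar test in plain Python. The function names and the toy labels are illustrative assumptions, not the authors' code or data, and the abstract does not say whether an exact or chi-square variant was used.

```python
from math import comb


def discordant_counts(reference, pred_a, pred_b):
    """Count cases where exactly one of two models matches the reference."""
    only_a = only_b = 0
    for ref, a, b in zip(reference, pred_a, pred_b):
        a_ok, b_ok = (a == ref), (b == ref)
        if a_ok and not b_ok:
            only_a += 1
        elif b_ok and not a_ok:
            only_b += 1
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant case is equally likely
    to favor either model, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy labels (NOT study data): model A is right on every case,
# model B misses all ten positives.
reference = ["pos"] * 10 + ["neg"] * 10
model_a = ["pos"] * 10 + ["neg"] * 10
model_b = ["neg"] * 10 + ["neg"] * 10

only_a, only_b = discordant_counts(reference, model_a, model_b)
p_value = mcnemar_exact(only_a, only_b)
```

Only the discordant pairs enter the test; cases that both models get right or both get wrong carry no information about which model is better.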

Findings

Claude ranked highest by raw histologic-type concordance but lowest by kappa, classifying all 23 lobular and seven micropapillary carcinomas as invasive breast carcinoma of no special type. The models anchored the Nottingham grade to three modal grades. None of the models reliably identified human epidermal growth factor receptor 2 (HER2)-positive disease. The failure direction was vendor-specific: Claude and GPT-5.5 under-detected it, whereas Gemini over-called it. Twelve prompt variants (4,056 calls) did not recover sensitivity.

Interpretation

None of the three commercial MLLMs reached deployment-ready accuracy for any treatment-determining feature of breast pathology. Because each vendor fails in its own fixed direction, changing vendors alters the type of error rather than removing it; the value of these models is therefore assistive rather than autonomous. At USD 0.20–0.50 per case, they may serve as supervised draft generators that leave the diagnosis with the pathologist.

Research in context

Evidence before this study

We searched PubMed and Embase from database inception to January 10, 2026, without language restriction, using combinations of the terms “large language model,” “multimodal,” “GPT,” “Gemini,” “Claude,” “foundation model,” “breast cancer,” “pathology,” “histopathology,” “whole-slide image,” and “diagnostic accuracy.” We also screened the reference lists of retrieved articles. Task-specific deep-learning models and pathology foundation models pretrained on large slide collections achieve strong performance on individual breast pathology tasks such as Nottingham grading and receptor-status prediction, but they require local GPU infrastructure, curated training data, and deployment expertise. Evaluations of general-purpose, commercially available multimodal large language models (MLLMs) in pathology were limited to coarse tasks, such as tissue-type classification and metastasis detection, and were typically confined to a single model. We found no studies directly comparing current flagship commercial MLLMs on clinically relevant, treatment-determining breast pathology tasks, and none reporting whether vendors agree or whether the choice of vendor changes the diagnosis. The available evidence was limited by single-model designs, task-narrow datasets, and reliance on raw accuracy without chance-corrected agreement.

Added value of this study

In this paired, single-center retrospective study of 427 invasive breast cancers, we evaluated three vendor-designated flagship commercial MLLMs (Claude Sonnet 4.6, Gemini 2.5 Pro, GPT-5.5) on identical H&E tiles and prompts against an expert pathologist reference, across treatment-determining features. No model reached deployment-ready accuracy for any feature. Raw concordance and chance-corrected agreement ranked the models in opposite order, so a decision based on raw accuracy alone would have selected the least discriminating model. Each model demonstrated a consistent error pattern, tending either toward under-detection or over-call. Consequently, changing vendors altered the type of diagnostic error rather than removing it. Twelve prompt variants across 4,056 calls, a reasoning-effort escalation, and an intra-vendor version upgrade did not change the failure direction, indicating a vendor-specific prior rather than a prompt-engineering artifact. To our knowledge, this is the first head-to-head, chance-corrected comparison of commercial MLLMs at the level of treatment-determining breast pathology.
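The reversal between raw concordance and chance-corrected agreement can be reproduced with a small worked example. The sketch below computes Cohen's kappa, (p_o - p_e) / (1 - p_e), in plain Python on hypothetical toy labels (100 cases, two classes). These are not the study's data, and the function names are illustrative assumptions.

```python
from collections import Counter


def raw_concordance(reference, predicted):
    """Fraction of cases where the prediction equals the reference."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    p_o = raw_concordance(reference, predicted)
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    # Chance agreement: product of marginal frequencies, summed over classes.
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one class occurs
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy reference (NOT study data): 90 common-type, 10 rare-type.
reference = ["NST"] * 90 + ["lobular"] * 10

# Model A labels every case as the majority class.
model_a = ["NST"] * 100

# Model B finds 8 of 10 rare cases but miscalls 15 common cases as rare.
model_b = ["NST"] * 75 + ["lobular"] * 15 + ["lobular"] * 8 + ["NST"] * 2

acc_a, kappa_a = raw_concordance(reference, model_a), cohen_kappa(reference, model_a)
acc_b, kappa_b = raw_concordance(reference, model_b), cohen_kappa(reference, model_b)
```

In this toy setup the majority-class predictor has the higher raw concordance (0.90 against 0.83) but a kappa of zero, while the model that actually discriminates the rare class has a kappa of about 0.40. This is the same mechanism by which a model that calls every tumor no special type can top a raw-accuracy ranking.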

Implications of all the available evidence

Taken together with previous evidence, our findings indicate that current commercial MLLMs are not ready for autonomous interpretation of breast pathology and should not be used as primary readers without pathologist oversight. Their value, if any, is assistive. Given their relatively low per-case cost, these systems may be useful for generating supervised draft reports where pathologist workload is the primary constraint, provided that digital pathology infrastructure and immunohistochemical testing are available and that final diagnostic responsibility remains with a qualified pathologist. Because the failure direction is vendor-specific, deployment requires vendor-aware caveats and item-level, chance-corrected evaluation rather than raw accuracy. Improving performance will likely require incorporation of domain-specific knowledge through approaches such as in-context reference examples, retrieval against a curated atlas, or handoff to a domain-trained model. Future development should be supported by immunohistochemistry-grounded datasets and prospective validation in the target setting.
