Performance assessment of large language models in cancer staging: Comparative analysis of Mistral models
Abstract
Cancer staging plays a critical role in treatment planning and prognosis but is often embedded in unstructured clinical narratives. Large language models (LLMs) have emerged as a promising approach for automating the extraction and structuring of staging data, yet their performance in real-world oncology settings has not been systematically evaluated. Herein, we analysed 1000 oncological summaries from patients treated for breast cancer between 2019 and 2020 at the François Baclesse Comprehensive Cancer Centre, France. Five Mistral AI LLMs (Small, Medium, Large, Magistral and Mistral:latest) were evaluated for their ability to derive the cancer stage and identify staging elements. Larger models outperformed their smaller counterparts in staging accuracy and reproducibility (kappa > 0.95 for Mistral Large and Medium). Mistral Large achieved the highest accuracy in deriving the cancer stage (93.0%), surpassing the original clinical documentation in several cases. The LLMs consistently derived the cancer stage more accurately when reasoning through the tumour size, nodal status and metastatic components than when asked for the stage directly. The top-performing models had a test–retest reliability exceeding 97%, whereas smaller models and locally deployed versions lacked sufficient robustness, particularly in handling unit conversions and complex staging rules. A structured, stepwise use of LLMs that emulates clinician reasoning offers a more efficient, transparent and reproducible approach to cancer staging, and the study findings support LLM integration into digital oncology workflows.
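To make the stepwise approach concrete, the sketch below derives a stage group from separately extracted T, N and M categories instead of asking a model for the stage directly. The mapping is a simplified, illustrative subset of the AJCC anatomic stage groups for breast cancer, not the full staging table used in the study, and the function name is a hypothetical example rather than the authors' code.

```python
def derive_stage(t: str, n: str, m: str) -> str:
    """Derive an anatomic stage group from T, N, M categories.

    Illustrative only: covers a simplified subset of AJCC anatomic
    stage groups for breast cancer; real staging rules are richer
    (e.g. N1mi, prognostic stage groups, biomarkers).
    """
    if m == "M1":
        return "IV"  # any distant metastasis is stage IV
    if n == "N3":
        return "IIIC"
    if t == "T4":
        return "IIIB"
    if n == "N2" or (t == "T3" and n == "N1"):
        return "IIIA"
    if (t == "T2" and n == "N1") or (t == "T3" and n == "N0"):
        return "IIB"
    if (t == "T2" and n == "N0") or (t in ("T0", "T1") and n == "N1"):
        return "IIA"
    if t == "T1" and n == "N0":
        return "IA"
    return "unresolved"  # combination outside this simplified subset
```

Decomposing the task this way mirrors the clinician's reasoning the abstract describes: the model only has to extract three well-defined elements from the narrative, and the deterministic mapping handles the staging logic, which also makes errors easier to audit.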