Concordance Between the DeepSeek-V3 Language Model and Multidisciplinary Team Recommendations in Lung Cancer: A Retrospective Study

Abstract

Background The complexity of lung cancer multidisciplinary team (MDT) decision-making necessitates tools that can efficiently synthesize clinical data. Evaluating the concordance between large language model (LLM)-generated recommendations and MDT decisions is critical for clinical integration.
Objective This study aimed to evaluate the overall and subgroup concordance between treatment recommendations generated by DeepSeek-V3 and the consensus decisions of the institutional MDT, and to assess the clinical quality and utility of the model's outputs via expert appraisal.
Methods In this retrospective cohort study, 100 consecutive patients with lung cancer were included. Identical anonymized clinical data were processed through DeepSeek-V3 (the predominant LLM version in clinical deployment as of June 2025), configured as a clinical decision support system, and reviewed by the institutional MDT. The primary outcome was the overall concordance of treatment recommendations, measured by Cohen's kappa. Secondary analyses included subgroup concordance by molecular markers and quality assessment on 5-point Likert scales by two independent oncologists.
Results DeepSeek-V3 demonstrated substantial concordance with MDT recommendations (κ = 0.789, 95% CI: 0.723–0.855). Discordance arose mainly between local treatment modalities: 12 of 16 discordant cases involved definitive chemoradiotherapy versus surgery ± adjuvant therapy, all in patients with locally advanced non-small cell lung cancer (NSCLC) and high surgical risk factors. Subgroup kappa values ranged from 0.55 to 0.83 across molecular phenotypes. Independent experts rated the model's outputs highly for guideline adherence (mean score 4.5 ± 0.6) and clinical utility (4.3 ± 0.7), with excellent inter-rater reliability (Spearman's ρ > 0.76, p < 0.001).
Conclusion DeepSeek-V3 showed substantial concordance with MDT treatment recommendations in lung cancer, with outputs considered clinically relevant by domain experts. This supports its potential role as an assistive tool in MDT settings.
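
For readers unfamiliar with the primary outcome, the following is a minimal sketch of how Cohen's kappa and an approximate 95% confidence interval could be computed from paired treatment labels. It is not the authors' analysis code. The function name, the toy labels, and the simple large-sample standard error are assumptions made for illustration. The abstract does not state how its interval was derived, although the reported interval is symmetric about the point estimate, which is consistent with a normal approximation.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b, z=1.96):
    """Return (kappa, ci_low, ci_high) for two equal-length label sequences.

    Hypothetical helper for illustration. The interval uses a common
    simplified large-sample standard error, which is an assumption and
    not necessarily the method used in the study.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(rater_a)

    # Observed agreement: fraction of cases where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of marginal proportions.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")

    kappa = (p_o - p_e) / (1.0 - p_e)

    # Simplified standard error (assumption): sqrt(p_o(1 - p_o) / (n (1 - p_e)^2)).
    se = math.sqrt(p_o * (1.0 - p_o) / (n * (1.0 - p_e) ** 2))
    return kappa, kappa - z * se, kappa + z * se


# Toy labels only; these are NOT data from the study.
model_labels = ["surgery", "surgery", "chemoradiotherapy", "chemoradiotherapy"]
mdt_labels = ["surgery", "chemoradiotherapy", "chemoradiotherapy", "chemoradiotherapy"]

kappa, ci_low, ci_high = cohens_kappa(model_labels, mdt_labels)
print(f"kappa = {kappa:.3f}, approx. 95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

With only four toy cases the interval is very wide. With a cohort of 100 patients, as in the study, the same calculation yields a much narrower interval.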
