Concordance Between the DeepSeek-V3 Language Model and Multidisciplinary Team Recommendations in Lung Cancer: A Retrospective Study

Yihan Zhao
Fangqi Yuan
Lingli Wang
Meifang Wang
Long Zhang
Tao Ren
Hansheng Wang

Abstract

Background The complexity of lung cancer multidisciplinary team (MDT) decision-making calls for tools that can efficiently synthesize clinical data. Evaluating the concordance between large language model (LLM)-generated recommendations and MDT decisions is critical for clinical integration.
Objective This study aimed to evaluate the overall and subgroup concordance between treatment recommendations generated by DeepSeek-V3 (the predominant clinical LLM in mid-2025) and the consensus decisions of the institutional MDT, and to assess the clinical quality and utility of the model's outputs through expert appraisal.
Methods This retrospective cohort study included 100 consecutive lung cancer patients. Identical anonymized clinical data were processed through DeepSeek-V3 (the predominant LLM version in clinical deployment as of June 2025), configured as a clinical decision support system, and reviewed by the institutional MDT. The primary outcome was the overall concordance of treatment recommendations, measured by Cohen's Kappa. Secondary analyses covered subgroup concordance by molecular marker and a quality assessment in which two independent oncologists rated the outputs on 5-point Likert scales.
Results DeepSeek-V3 demonstrated substantial concordance with MDT recommendations (κ = 0.789, 95% CI: 0.723–0.855). Discordances arose mainly between local treatment modalities: 12/16 discordant cases involved a choice between definitive chemoradiotherapy and surgery ± adjuvant therapy, all in locally advanced NSCLC with high surgical risk factors. Subgroup Kappa values ranged from 0.55 to 0.83 across molecular phenotypes. Independent experts rated the model's outputs highly for guideline adherence (mean score 4.5 ± 0.6) and clinical utility (4.3 ± 0.7), with excellent inter-rater reliability (Spearman's ρ > 0.76, p < 0.001).
Conclusion DeepSeek-V3 showed substantial concordance with MDT treatment recommendations in lung cancer, with outputs considered clinically relevant by domain experts. This supports its potential role as an assistive tool in MDT settings.
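
For readers unfamiliar with the primary outcome measure, the following is a minimal sketch of how Cohen's Kappa can be computed for two sets of categorical recommendations. It is not the authors' analysis code: the function name `cohens_kappa`, the treatment labels, and the toy data are all hypothetical and chosen only for illustration, and the abstract does not say how the confidence interval was derived, so none is computed here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of the two raters'
    # marginal proportions for that label.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, NOT data from the study.
mdt = ["surgery", "surgery", "crt", "crt", "systemic", "systemic", "surgery", "crt"]
model = ["surgery", "crt", "crt", "crt", "systemic", "systemic", "surgery", "surgery"]

kappa = cohens_kappa(mdt, model)
print(round(kappa, 3))  # 0.619
```

In this toy example the raters agree on 6 of 8 cases (observed agreement 0.75), but chance agreement is 22/64, so Kappa is lower than the raw agreement rate. That gap is the reason a chance-corrected statistic is reported instead of simple percent agreement.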

Version published to 10.21203/rs.3.rs-9109873/v1 on Research Square
Apr 10, 2026

Related articles

Evaluating 11 Large Language Models in Answering Key Questions on Ovarian Cancer

This article has 7 authors:
1. Michela Quaranta
2. Yong Sheng Tan
3. Areti Karamanou
4. Evangelos Kalampokis
5. Nicolas M Orsi
6. Diederick DeJong
7. Alexandros Laios
This article has no evaluations. Latest version: Apr 11, 2026

Development and Preliminary Validation of RP-WX: A WeChat Mini-Program-Based Prediction Model for Radiation Pneumonitis in Patients Undergoing Concurrent Chemoradiotherapy for Locally Advanced Squamous Cell Lung Cancer

This article has 5 authors:
1. Jianqiang Fang
2. Xi’an Xiong
3. Wei Tian
4. Qianxi Ni
5. Xiadong Li
This article has no evaluations. Latest version: Apr 13, 2026

Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study

This article has 3 authors:
1. Diah Putri
2. Ferry Achmad Firdaus
3. Akhmad Yogi Pramatirta
This article has no evaluations. Latest version: Apr 9, 2026
