Hybrid Intelligence in Oncology: Superior Accuracy and Convergence of Large Language Models Over Human Experts in Interpreting
Abstract
Background: Interpreting nuanced recommendations within complex clinical oncology guidelines, such as those for brain metastases, presents persistent challenges for medical experts and may affect treatment consistency. While Large Language Models (LLMs) offer potential decision support, their comparative efficacy in this domain remains underexplored.
Objective: This study evaluated the accuracy and convergence of medical experts versus leading LLMs in interpreting Strength of Recommendation (SoR) and Quality of Evidence (QoE) from the ASTRO and ASCO-SNO-ASTRO brain metastases guidelines.
Methods: Neurosurgeons, radiation oncologists, and four LLMs (ChatGPT-4o, Gemini 2.0, Microsoft Copilot Pro, Deepseek R1) assessed SoR and QoE for guideline recommendations. Accuracy, near-answer rates, and Cohen's weighted kappa (κ) were calculated.
Results: LLMs, notably Gemini and Deepseek, demonstrated significantly higher accuracy (up to 100% for ASTRO SoR vs. a maximum of 58.82% for experts) and near-perfect convergence (κ up to 1.000 vs. κ ≤ 0.504 for experts) in interpreting ASTRO guideline specifics. All groups found QoE and the more complex ASCO guideline more challenging, but LLMs generally maintained an advantage in convergence: Deepseek achieved 61.53% accuracy and κ = 0.428 for ASCO SoR, versus a maximum of 53.84% accuracy and highly variable convergence for experts.
Conclusion: LLMs show considerable promise in accurately and consistently interpreting complex oncology guidelines, in some aspects surpassing human expert performance. These findings highlight the potential of hybrid intelligence systems in which LLMs assist clinicians with guideline interpretation, enhancing practice standardization while preserving expert judgment for patient-specific applications. This approach may also inform future guideline development to optimize both human and AI comprehension, ultimately improving patient care through more consistent implementation of evidence-based recommendations.
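The three metrics named in the Methods (accuracy, near-answer rate, and Cohen's weighted kappa) can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not state them: ratings are coded as ordinal integers starting at 0, a "near answer" is taken to mean a rating within one category of the guideline's label without matching it, and linear weights are used for kappa. The function names and the toy ratings are hypothetical and are not the study's data.

```python
def accuracy(rater, reference):
    """Fraction of items where the rater matches the guideline label exactly."""
    return sum(a == b for a, b in zip(rater, reference)) / len(reference)


def near_answer_rate(rater, reference, tolerance=1):
    """Fraction of items that miss the label by at most `tolerance` categories.

    Assumption: exact matches are excluded, so this counts only near misses.
    """
    return sum(0 < abs(a - b) <= tolerance
               for a, b in zip(rater, reference)) / len(reference)


def weighted_kappa(rater, reference, n_categories, weights="linear"):
    """Cohen's weighted kappa for two sets of ordinal integer ratings.

    Assumption: linear disagreement weights by default ("quadratic" is also
    supported); the abstract does not say which scheme the authors used.
    Requires n_categories >= 2.
    """
    k = n_categories
    n = len(reference)

    # Observed joint proportions: rows = rater, columns = reference.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater, reference):
        observed[a][b] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed_dis = sum(disagreement(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(disagreement(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))

    if expected_dis == 0:
        # Both raters used one identical category: kappa is undefined.
        return float("nan")
    return 1.0 - observed_dis / expected_dis


# Hypothetical toy data: four recommendations, three ordinal categories (0-2).
guideline = [0, 1, 2, 2]   # labels taken from the guideline (illustrative)
reader = [0, 1, 1, 2]      # one rater's interpretation (illustrative)

print(accuracy(reader, guideline))
print(near_answer_rate(reader, guideline))
print(weighted_kappa(reader, guideline, n_categories=3))
```

Weighted kappa penalizes a rating that is two categories away from the guideline label more than one that is a single category away, which suits ordinal scales such as SoR and QoE better than unweighted agreement.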