Grounding Large Language Model in Clinical Diagnostics
Abstract
Large language models (LLMs) possess extensive medical knowledge and demonstrate impressive performance in answering diagnostic questions. However, responding to such questions differs significantly from actual clinical diagnostic procedures. Real-world diagnostics involve a dynamic, iterative process that includes hypothesis refinement and targeted data collection. This complex task is both challenging and time-consuming, and it accounts for a significant portion of clinical workload and healthcare resources. Therefore, evaluating and enhancing LLM performance in real-world diagnostic procedures is crucial for clinical deployment. In this study, a framework was developed to assess LLMs' capability in complete clinical encounters, including medical history, physical examination, diagnostic tests, and diagnosis. A benchmark dataset of 4,421 real-world cases was curated, covering both rare and common diseases across 32 specialties. Clinical evaluation methods were used to comprehensively assess the performance of GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-34b, and Qwen2.5-14b in diagnostic procedures. Although these models performed well in answering diagnostic questions, they consistently underperformed in clinical diagnostic procedures and exhibited a significant number of clinical errors. To address these challenges, ClinDiag-GPT was trained on over 8,000 real-world cases. It emulates physicians' diagnostic reasoning, collects information in line with clinical practice, and recommends key diagnostic tests for definitive diagnoses. It significantly outperformed other LLMs in both diagnostic accuracy and procedural performance. We further compared the diagnostic performance of ClinDiag-GPT alone, physicians alone, and physicians working in collaboration with ClinDiag-GPT. Collaboration between ClinDiag-GPT and physicians enhanced both diagnostic accuracy and efficiency, demonstrating ClinDiag-GPT's potential as a valuable clinical assistant.