Large Language Models in Clinical Neurology: A Systematic Review
Abstract
Background: Large language models (LLMs) are increasingly explored for clinical applications in neurology, yet their real-world utility, safety, and optimal implementation remain uncertain. We systematically reviewed the literature to characterize current applications, evaluate evidence quality, and identify knowledge gaps regarding LLM use in clinical neurology.

Methods: Following PRISMA guidelines, we searched PubMed, Embase, Scopus, Web of Science, and CENTRAL from January 1, 2022 through February 1, 2026 for peer-reviewed studies evaluating LLM applications in clinical neurology. We included studies using large language models for clinically relevant neurology tasks from text or multimodal inputs. Two independent reviewers screened records, extracted data, and assessed risk of bias using QUADAS-AI. We synthesized evidence narratively across application domains, validation approaches, and model performance.

Results: Thirty-six studies (published 2023–2026) spanning 8 neurology subspecialties met inclusion criteria; 13 were simulation or feasibility studies, 17 analyzed retrospective clinical data, and 6 reported prospective clinical validation. Proprietary models predominated; 7 studies used retrieval-augmented generation (RAG) and 3 used agentic frameworks. Performance was highest for constrained tasks, including binary diagnostic classification (area under the curve [AUC], 0.75–0.94), information extraction (F1 score, 0.89–0.90), patient education question answering (accuracy, 68%–97%), and ischemic stroke thrombectomy decision support (AUC, 0.92). Open-ended case-based classification showed lower accuracy (42%–54%). Safety signals included hallucinations and fabricated citations, overconfident recommendations, and poor calibration; risk of bias was rated high in all included studies.

Conclusion: LLMs show promise for selected neurology workflows, but current evidence is early, heterogeneous, and limited by high risk of bias and scarce prospective validation.
Clinical translation will likely require RAG and agentic architectures that can plan multi-step tasks, retrieve guidelines and local protocols, verify and calibrate outputs, and produce structured, auditable recommendations with source attribution, under clinician oversight and with prospective evaluation.

Primary Funding Source: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463.

Registration: PROSPERO CRD420251082465
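The retrieval-with-source-attribution pattern the conclusion calls for can be illustrated schematically. The sketch below is a conceptual toy, not an implementation from any reviewed study: the guideline snippets, identifiers, and keyword-overlap retriever are invented placeholders (a real system would use embedding search and an LLM conditioned on the retrieved passages), but it shows the structural point that every recommendation carries auditable source identifiers.

```python
# Toy sketch of RAG-style decision support with source attribution.
# All guideline text and IDs below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class GuidelinePassage:
    source_id: str   # e.g., a guideline section or local-protocol identifier
    text: str

# Hypothetical local guideline store (placeholder content only).
CORPUS = [
    GuidelinePassage("GL-STROKE-01",
                     "thrombectomy recommended for large vessel occlusion within 6 hours"),
    GuidelinePassage("GL-STROKE-02",
                     "extended window thrombectomy may be considered with perfusion imaging"),
    GuidelinePassage("GL-EEG-01",
                     "obtain EEG for unexplained altered mental status"),
]

def retrieve(query: str, corpus: list[GuidelinePassage], k: int = 2) -> list[GuidelinePassage]:
    """Rank passages by naive keyword overlap (stand-in for embedding search)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.text.lower().split())))
    return scored[:k]

def build_auditable_answer(query: str) -> dict:
    """Return a structured recommendation with source attribution.

    In a real system the 'answer' field would come from an LLM conditioned
    on the retrieved passages; here it simply echoes the top passage so the
    attribution mechanism stays visible.
    """
    passages = retrieve(query, CORPUS)
    return {
        "query": query,
        "answer": passages[0].text,
        "sources": [p.source_id for p in passages],  # auditable citation trail
    }

result = build_auditable_answer("is thrombectomy indicated for large vessel occlusion")
print(result["sources"])
```

The structured output (answer plus source IDs) is what makes recommendations auditable: a clinician can trace each suggestion back to the retrieved guideline passage rather than trusting free-text model output.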