Neuro-symbolic LLM Integration in Clinical Medicine: A Systematic Review and Taxonomy
Abstract
Background: LLMs are promising for clinical workflows but hallucinations limit deployment. Neuro-symbolic systems pair LLMs with explicit rules, ontologies, or knowledge graphs to constrain outputs, yet integration patterns are described inconsistently and their deployment trade-offs remain unclear.
Methods: We conducted a PRISMA 2020 systematic review (PROSPERO CRD420261296004) of peer-reviewed studies (PubMed/MEDLINE, CENTRAL, Web of Science, Scopus; English; January 1, 2022 to January 30, 2026) and assessed risk of bias using PROBAST adapted to neuro-symbolic evaluations.
Results: We identified 3,166 records; after screening, 21 studies were included. Studies spanned 18 clinical settings (N = 20 to 197,761; median 2,398) with external validation in 9/21 (42.9%). Four integration patterns were identified, ordered by increasing symbolic authority: structured output, rule-guided generation, knowledge retrieval, and iterative validation (symbolic veto or regeneration). Among studies reporting quantitative comparisons (n = 14), performance improvements ranged from +3.1% to +125.6% (median +21.9%) and increased with symbolic authority (median gains: 9%, 16%, 26%, 40%). All studies reported explicit hallucination mitigation, and 17/21 (81.0%) reported guideline alignment. All studies were rated high risk of bias, driven mainly by analysis limitations (high in 19/21). Iterative validation approaches reported 2–88 s latency and up to 100x cost increases.
Conclusion: Across four neuro-symbolic LLM integration approaches, gains increase with symbolic authority: the strongest results come from iterative validation where the symbolic layer can veto, not just add context. That improves safety and auditability but raises latency, cost, and symbolic-stack failure risk.
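As an illustration of the highest-authority pattern named in the abstract (iterative validation with symbolic veto or regeneration), the following is a minimal sketch and not code from any reviewed study. All names (`iterative_validation`, `max_daily_dose_rule`, `stub_generator`), the drug label, and the dose limit are hypothetical placeholders; the generator is a stub standing in for an LLM call, and the numeric limit is not clinical guidance.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical symbolic rule: returns a violation message, or None if the draft passes.
Rule = Callable[[dict], Optional[str]]


@dataclass
class Result:
    output: Optional[dict]  # accepted draft, or None when the symbolic layer vetoes
    attempts: int           # number of generation attempts made
    violations: List[str]   # violations found on the final attempt


def max_daily_dose_rule(draft: dict) -> Optional[str]:
    # Placeholder limit for illustration only; not a real clinical threshold.
    limit_mg = {"drug_a": 4000}
    drug = draft.get("drug")
    dose = draft.get("daily_dose_mg", 0)
    if drug in limit_mg and dose > limit_mg[drug]:
        return f"{drug}: {dose} mg exceeds limit of {limit_mg[drug]} mg"
    return None


def iterative_validation(
    generate: Callable[[List[str]], dict],
    rules: List[Rule],
    max_attempts: int = 3,
) -> Result:
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # Regeneration: the generator sees the violations from the previous attempt.
        draft = generate(feedback)
        feedback = []
        for rule in rules:
            message = rule(draft)
            if message is not None:
                feedback.append(message)
        if not feedback:
            return Result(draft, attempt, [])
    # Veto: no draft satisfied the rules, so nothing is released.
    return Result(None, max_attempts, feedback)


def stub_generator(feedback: List[str]) -> dict:
    # Stand-in for an LLM call: proposes a lower dose once it receives feedback.
    dose = 3000 if feedback else 6000
    return {"drug": "drug_a", "daily_dose_mg": dose}


result = iterative_validation(stub_generator, [max_daily_dose_rule])
```

In this sketch the first draft violates the rule and the second passes, so the loop returns after two attempts. Each extra attempt is another generation call, which is consistent with the latency and cost increases the abstract reports for iterative validation.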
Primary Funding Source: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Registration: PROSPERO CRD420251120318