Neurosymbolic Multi-Agent Artificial Intelligence versus General-Purpose Large Language Models for Clinical Decision Support in Ileus and Volvulus

Mete Ucdal
Sefa Keskin
Karya Yurtsever
Leyla Eybatova

Abstract

Background: General-purpose large language models (LLMs) demonstrate variable diagnostic accuracy and residual hallucination when applied to complex surgical emergencies. Whether a neurosymbolic multi-agent architecture (integrating domain-specific vision-language models, medically fine-tuned reasoning engines, and compositional verification agents) can outperform monolithic LLMs in ileus and volvulus case assessment remains unexplored.

Methods: We conducted a retrospective diagnostic accuracy study using 133 adult case vignettes (median age 62 years; 57.9% male) reconstructed from PubMed-indexed case reports published between January 2022 and December 2025. Three AI systems were evaluated: ChatGPT (GPT-4 Turbo), Gemini 2.0 Pro, and a sequential neurosymbolic multi-agent hybrid system comprising a radiology vision-language agent (Hulu-Med 32B), a clinical reasoning agent (Med-PaLM 2), and a compositional validation agent (Gyan LLM). Standardized prompts were submitted in zero-shot configuration. Two blinded expert assessors independently evaluated five predefined criteria: diagnostic accuracy, treatment appropriateness, hallucination presence, explanation adequacy, and critical safety errors. Inter-rater reliability was assessed using Cohen’s kappa. McNemar’s test with Bonferroni correction was used for pairwise comparisons.

Results: The neurosymbolic multi-agent system achieved significantly higher diagnostic accuracy (75.2%; 95% CI: 66.9–82.2%) than ChatGPT (60.2%; 95% CI: 51.4–68.5%; p < 0.001) and Gemini (58.6%; 95% CI: 49.8–67.0%; p < 0.001). The multi-agent system also demonstrated superior treatment appropriateness (74.4% vs. 63.9% and 61.7%; both p < 0.017), markedly lower hallucination rates (1.5% vs. 15.0% and 9.8%; both p < 0.001), and zero critical safety errors (0% vs. 3.8% and 2.3%). Subgroup analysis revealed perfect diagnostic accuracy (100%) for volvulus cases in the multi-agent system versus 78.6% and 75.0% for the single-model systems. Performance convergence was observed in diagnostically ambiguous entities including Ogilvie syndrome (67.6% vs. 48.6% and 51.4%) and toxic megacolon (50.0% vs. 41.7%).

Conclusions: A neurosymbolic multi-agent pipeline that decomposes the clinical reasoning workflow into specialized perception, synthesis, and verification stages significantly outperforms general-purpose LLMs in diagnosing and managing ileus-spectrum and volvulus-spectrum emergencies. The architectural separation of neural pattern recognition from symbolic rule-based verification substantially reduces hallucination and eliminates critical safety errors. These findings support the integration of neurosymbolic design principles in clinical AI systems for acute abdominal pathology, while underscoring persistent limitations in diagnostically ambiguous conditions.
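
The abstract names McNemar’s test with Bonferroni correction for the pairwise comparisons but gives no computational detail. The following is a minimal sketch of how such a paired comparison could be computed, not the authors’ analysis code. It assumes the exact (binomial) form of McNemar’s test, which the abstract does not specify, and the function names and discordant-pair counts are hypothetical, chosen only for illustration. The threshold of 0.017 quoted in the Results is consistent with 0.05 divided across three pairwise comparisons, although the abstract does not state this explicitly.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: vignettes where system A was correct and system B was wrong
    c: vignettes where system A was wrong and system B was correct
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


if __name__ == "__main__":
    # HYPOTHETICAL discordant counts, not taken from the study
    comparisons = {
        "system A vs. system B": (20, 4),
        "system A vs. system C": (18, 6),
        "system B vs. system C": (9, 8),
    }
    threshold = bonferroni_alpha(0.05, len(comparisons))
    for label, (b, c) in comparisons.items():
        p = mcnemar_exact(b, c)
        verdict = "significant" if p < threshold else "not significant"
        print(f"{label}: p = {p:.4f} ({verdict} at alpha = {threshold:.4f})")
```

Because every system was scored on the same 133 vignettes, a paired test of this kind uses only the cases on which two systems disagree, which is why the sketch takes the two discordant counts rather than the overall accuracy figures.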

Version published to 10.21203/rs.3.rs-9045948/v1 on Research Square
Apr 10, 2026

Related articles

The Inefficacy of Artificial Intelligence Large Language Models in Healthcare: A Clinical and Statistical Perspective

This article has 4 authors:
1. Michael Williams
2. Raeed Kabir
3. Cody Taylor
4. Tariq Nakhooda
This article has no evaluations. Latest version: Apr 27, 2026

NEURA: A proof-carrying framework for hallucination-resistant neuroimaging automation

This article has 10 authors:
1. Jun Xie
2. Jing Wang
3. Xiumei Wu
4. Xinyuan Liu
5. Yiqi Mi
6. Qinjin Liu
7. Tong Xu
8. Chen Liu
9. Huafu Chen
10. Jing Guo
This article has no evaluations. Latest version: Apr 30, 2026

Benchmarking General-Purpose and Medical AI Large Language Models for Clinical Assessment and Management in Parkinson’s Disease

This article has 5 authors:
1. Shechter Yosef
2. Klevor Raymond
3. Kouchache Trycia
4. Bouhadoun Sarah
5. Ronald B Postuma
This article has no evaluations. Latest version: May 20, 2026
