Neurosymbolic Multi-Agent Artificial Intelligence versus General-Purpose Large Language Models for Clinical Decision Support in Ileus and Volvulus


Abstract

Background: General-purpose large language models (LLMs) demonstrate variable diagnostic accuracy and residual hallucination when applied to complex surgical emergencies. Whether a neurosymbolic multi-agent architecture (integrating domain-specific vision-language models, medically fine-tuned reasoning engines, and compositional verification agents) can outperform monolithic LLMs in ileus and volvulus case assessment remains unexplored.

Methods: We conducted a retrospective diagnostic accuracy study using 133 adult case vignettes (median age 62 years; 57.9% male) reconstructed from PubMed-indexed case reports published between January 2022 and December 2025. Three AI systems were evaluated: ChatGPT (GPT-4 Turbo), Gemini 2.0 Pro, and a sequential neurosymbolic multi-agent hybrid system comprising a radiology vision-language agent (Hulu-Med 32B), a clinical reasoning agent (Med-PaLM 2), and a compositional validation agent (Gyan LLM). Standardized prompts were submitted in zero-shot configuration. Two blinded expert assessors independently evaluated five predefined criteria: diagnostic accuracy, treatment appropriateness, hallucination presence, explanation adequacy, and critical safety errors. Inter-rater reliability was assessed using Cohen’s kappa. McNemar’s test with Bonferroni correction was used for pairwise comparisons.

Results: The neurosymbolic multi-agent system achieved significantly higher diagnostic accuracy (75.2%; 95% CI: 66.9–82.2%) compared with ChatGPT (60.2%; 95% CI: 51.4–68.5%; p < 0.001) and Gemini (58.6%; 95% CI: 49.8–67.0%; p < 0.001). The multi-agent system also demonstrated superior treatment appropriateness (74.4% vs. 63.9% and 61.7%; both p < 0.017), markedly lower hallucination rates (1.5% vs. 15.0% and 9.8%; both p < 0.001), and zero critical safety errors (0% vs. 3.8% and 2.3%). Subgroup analysis revealed perfect diagnostic accuracy (100%) for volvulus cases in the multi-agent system versus 78.6% and 75.0% for the single-model systems. Performance convergence was observed in diagnostically ambiguous entities including Ogilvie syndrome (67.6% vs. 48.6% and 51.4%) and toxic megacolon (50.0% vs. 41.7%).

Conclusions: A neurosymbolic multi-agent pipeline that decomposes the clinical reasoning workflow into specialized perception, synthesis, and verification stages significantly outperforms general-purpose LLMs in diagnosing and managing ileus-spectrum and volvulus-spectrum emergencies. The architectural separation of neural pattern recognition from symbolic rule-based verification substantially reduces hallucination and eliminates critical safety errors. These findings support the integration of neurosymbolic design principles in clinical AI systems for acute abdominal pathology, while underscoring persistent limitations in diagnostically ambiguous conditions.
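The statistical procedures named in the Methods (Cohen's kappa for inter-rater agreement, McNemar's test with Bonferroni correction for paired comparisons) can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names are hypothetical, the exact binomial form of McNemar's test is an assumption (the abstract does not say whether an exact or chi-square version was used), and the counts in the usage note are toy values rather than study data.

```python
from math import comb


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labelled identically.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently.
    p_expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases system A got right and system B got wrong
    c: cases system A got wrong and system B got right
    Only these discordant pairs carry information about the difference.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons
```

With three systems there are three possible pairwise comparisons, so `bonferroni_alpha(0.05, 3)` gives roughly 0.0167, which would be consistent with the "p < 0.017" threshold quoted in the Results, assuming all three pairs were tested. As a toy example, `mcnemar_exact(1, 9)` yields about 0.021, which would not pass that corrected threshold even though it is below 0.05.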
