Arkangel AI, OpenEvidence, ChatGPT, Medisearch: are they objectively up to medical standards? A real-life assessment of LLMs in healthcare


Abstract

Background

Large language models (LLMs) are increasingly used in healthcare, but standardized benchmarks fail to capture their validity and safety in real-world scenarios. Evaluating their quality and reliability is critical for safe integration into practice.

Methods

Four fictitious clinical vignettes (orthopedics, pediatrics, gynecology, psychiatry) were developed by independent specialists and tested with four conversational agents: ArkangelAI, OpenEvidence, ChatGPT, and Medisearch. Each vignette included four questions (diagnosis, management, research, and general knowledge). Responses were evaluated by four external clinicians on eight criteria using a Likert scale: 1–2 = dissatisfaction, 3 = neutral, 4–5 = satisfaction, 6 = not applicable. The criteria covered correctness, agreement with consensus, absence of bias, standard of care, updated information, patient safety, real sources in references, and context-awareness. Response times were measured and summarized as medians with interquartile ranges (IQR). Results were reported as frequencies, and hypothesis tests were applied (α = 0.05).
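As a rough illustration of the scoring and summary statistics described above, the following Python sketch tabulates hypothetical Likert ratings into satisfaction categories and summarizes response times as a median with IQR; the data values and variable names are placeholders, not the study's actual evaluation pipeline.

    from statistics import median, quantiles

    # Hypothetical evaluator ratings for one agent on the 1-6 Likert scale
    # (1-2 = dissatisfaction, 3 = neutral, 4-5 = satisfaction, 6 = not applicable).
    ratings = [5, 4, 2, 3, 5, 5, 6, 4, 1, 5, 4, 5]

    applicable = [r for r in ratings if r != 6]           # exclude "not applicable"
    satisfaction = sum(r >= 4 for r in applicable) / len(applicable)
    dissatisfaction = sum(r <= 2 for r in applicable) / len(applicable)
    print(f"Satisfaction: {satisfaction:.1%}, Dissatisfaction: {dissatisfaction:.1%}")

    # Response times (seconds) summarized as median and interquartile range (IQR).
    times_s = [18, 25, 31, 44, 440, 780]                  # hypothetical values
    q1, _, q3 = quantiles(times_s, n=4)
    print(f"Median: {median(times_s):.0f} s, IQR: {q1:.0f}-{q3:.0f} s")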

Results

We assessed 128 question–answer (Q&A) pairs (1024 evaluations). ArkangelAI-Deep achieved the highest satisfaction (92.9%), followed by OpenEvidence (83.6%), ChatGPT-Deep (80.5%), and Medisearch (71.1%). Dissatisfaction was most frequent for the real-sources-in-references criterion: 75% for GPT-Personalized and 97% for GPT-Regular. Conversely, ArkangelAI-Deep, ChatGPT-Deep, and OpenEvidence obtained perfect satisfaction marks (100%) on this criterion. All agents performed well in correctness and agreement with the consensus. ChatGPT scored lowest on non-biased answers. GPT-Personalized was rated safest for patients, followed by ArkangelAI-Deep. By specialty, gynecology scored highest, whereas pediatrics scored lowest. Response times varied widely: Medisearch was fastest (18 s), while ChatGPT-Deep (13 min) and ArkangelAI-Deep (7.4 min) were slowest, revealing a trade-off between depth and usability.

Conclusions

Conversational agents showed marked differences in performance, safety, and stability. ArkangelAI-Deep and OpenEvidence consistently outperformed the others, while Medisearch and GPT-Regular had significant limitations. These results underscore the need for standardized frameworks to ensure the safe use of LLMs in healthcare.
