Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Abstract

Purpose: Large Language Models (LLMs) offer potential for medical applications but often lack the specialized knowledge needed for clinical tasks. Retrieval Augmented Generation (RAG) is a promising approach that allows LLMs to be customized with domain-specific knowledge, making it well suited for healthcare. We assessed the accuracy, consistency, and safety of RAG models in determining a patient's fitness for surgery and in providing crucial preoperative instructions.

Methods: We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses, with a total of 3,682 responses evaluated. Clinical documents were processed, stored, and retrieved using LlamaIndex. Ten LLMs (GPT-3.5, GPT-4, GPT-4o, Llama2-7B, Llama2-13B, Llama2-70B, Llama3-8B, Llama3-70B, Gemini-1.5-Pro, and Claude-3-Opus) were evaluated 1) as native models, 2) with local preoperative guidelines, and 3) with international preoperative guidelines. Fourteen clinical scenarios were assessed, focusing on seven aspects of preoperative instructions. Correct responses were determined by established guidelines and expert physician judgment. Human-generated answers from senior attending anesthesiologists and junior doctors served as a comparison. Comparative analysis was conducted using Fisher's exact test, and inter-rater agreement was assessed within human and LLM responses.

Results: The LLM-RAG models were efficient, generating answers within 20 seconds, with guideline retrieval taking less than 5 seconds; this is faster than the 10 minutes typically estimated by clinicians. Notably, the LLM-RAG model using GPT-4 achieved the highest accuracy in assessing fitness for surgery, surpassing human-generated responses (96.4% vs. 86.6%, p=0.016). The RAG models demonstrated generalizable performance, with similarly favorable outcomes using both international and local guidelines. Additionally, the GPT-4 LLM-RAG model produced no hallucinations and generated correct preoperative instructions comparable to those of clinicians.

Conclusions: This study successfully implements LLM-RAG models for preoperative healthcare tasks, emphasizing the benefits of grounded knowledge, upgradability, and scalability for effective deployment in healthcare settings.
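The Methods describe a retrieve-then-generate pipeline: guideline documents are indexed, the snippets most relevant to a clinical query are retrieved, and the LLM's prompt is grounded in that retrieved text. The study itself used LlamaIndex with the ten listed LLMs; the following is only a minimal standard-library sketch of the pattern, with hypothetical guideline snippets and a naive word-overlap retriever standing in for a real vector index.

```python
# Minimal retrieve-then-generate sketch. The guideline snippets and the
# word-overlap retriever are illustrative stand-ins, not the study's setup.

def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank guideline snippets by word overlap with the query (toy retriever)."""
    scored = sorted(documents,
                    key=lambda d: len(tokenize(query) & tokenize(d)),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Ground the generation prompt in the retrieved guideline text."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical preoperative guideline snippets for illustration only.
guidelines = [
    "Patients should fast from solid food for 6 hours before surgery.",
    "Clear fluids are permitted up to 2 hours before anesthesia.",
    "Continue beta-blockers on the morning of surgery.",
]

prompt = build_prompt("How long should the patient fast before surgery?",
                      guidelines)
# `prompt` would then be sent to the chosen LLM; because the model answers
# from the retrieved context rather than parametric memory alone, updating
# the guideline store updates the model's behavior without retraining.
```

In a production pipeline the toy retriever would be replaced by embedding-based similarity search over chunked documents (e.g., LlamaIndex's `VectorStoreIndex`), but the grounding step, prepending retrieved guideline text to the query, is the core of the approach.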
