Evaluating a customised large language model (DELSTAR) and its ability to address medication-related questions associated with delirium: a quantitative exploratory study
Abstract
Background
A customised large language model (LLM) could serve as a next-generation clinical pharmacy research assistant to help prevent medication-associated delirium. Comprehensive strategies for evaluating such models are still lacking.
Aim
This quantitative exploratory study aimed to develop an approach to comprehensively assess the ability, quality, and performance of DELSTAR, a domain-specific customised delirium LLM, in accurately addressing complex clinical and practice research questions on delirium that typically require extensive literature searches and meta-analyses.
Method
DELSTAR, focused on delirium-associated medications, was implemented as a ‘Custom GPT’ for quality assessment and as a Python-based software pipeline for performance testing on closed and leading open models. Quality metrics included statement accuracy and data credibility; performance metrics covered F1-Score, sensitivity/specificity, precision, AUC, and ROC curves.
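The performance metrics named above can be computed as follows. This is an illustrative sketch with toy labels and scores, not the study's actual pipeline; scikit-learn is assumed as the metrics library.

```python
# Illustrative example: computing F1-Score, precision, sensitivity,
# specificity, and AUC for a binary classification task with toy data.
from sklearn.metrics import (
    f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix,
)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold-standard labels
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]  # model confidence scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded predictions

f1 = f1_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
sensitivity = recall_score(y_true, y_pred)           # recall == sensitivity

# Specificity is not a built-in scorer; derive it from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

# AUC is threshold-independent, so it uses the raw scores, not predictions.
auc = roc_auc_score(y_true, y_score)

print(f"F1={f1:.3f} precision={precision:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} AUC={auc:.3f}")
```

A ROC curve would additionally plot the true-positive rate against the false-positive rate across all thresholds (e.g. via `sklearn.metrics.roc_curve`), with AUC as the area beneath it.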
Results
DELSTAR provided more accurate and comprehensive information than traditional systematic literature reviews (SLRs) (p < 0.05) and accessed Application Programming Interfaces (APIs), private databases, and high-quality sources, despite mainly relying on less reliable internet sources. GPT-3.5 and GPT-4o emerged as the most reliable foundation models. In Dataset 2, GPT-4o (F1-Score: 0.687) and Llama3-70b (F1-Score: 0.655) performed best, while in Dataset 3, GPT-3.5 (F1-Score: 0.708) and GPT-4o (F1-Score: 0.665) led. None consistently met the desired threshold values across all metrics.
Conclusion
DELSTAR demonstrated potential as a clinical pharmacy research assistant, surpassing traditional SLRs in quality. Improvements are needed in the use of high-quality data, citation practice, and performance optimisation. GPT-4o, GPT-3.5, and Llama3-70b were the most suitable foundation models, but fine-tuning DELSTAR is essential to enhance sensitivity, which is especially critical in pharmaceutical contexts.