Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM)
Abstract
Large language models (LLMs) are AI-powered systems that have demonstrated significant potential in various fields, including medicine. Despite their promise, the methods for evaluating their performance in medical contexts remain inconsistent. This paper introduces the Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM) to streamline and standardize the evaluation of LLMs in healthcare. SAFE-LLM assesses five domains: accuracy, comprehensiveness, supplementation, consistency, and fluency. Accuracy refers to the correctness of the model's response, comprehensiveness to the detail and reasoning provided, supplementation to additional relevant information, consistency to uniformity across repeated answers, and fluency to the coherence of responses. Each prompt is submitted to the model three times, and each response is evaluated by two independent experts. Discrepancies between the two evaluations trigger a third assessment to ensure reliability. Grading is performed on a domain-specific scale, with a maximum possible score of seven points. The SAFE-LLM score can be applied to individual answers or averaged across responses for a holistic assessment. This framework aims to unify evaluation standards, facilitating the comparison and improvement of LLMs in medical applications. Developing standardized evaluation tools like SAFE-LLM is critical for integrating AI into healthcare effectively. This framework is a preliminary step towards more rigorous and comparable assessments of LLMs, enhancing their applicability and trustworthiness in medical settings.
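To make the scoring workflow concrete, the Python sketch below aggregates expert ratings into a SAFE-LLM score under stated assumptions: the per-domain point caps, the tie-break rule for a third rating, and all function and variable names are hypothetical illustrations, since the abstract specifies only that each domain uses its own scale, that two independent experts rate each response with a third assessment on disagreement, and that the maximum possible score is seven points. It is a minimal sketch of the procedure, not the authors' implementation.

```python
from statistics import mean, median

# Hypothetical per-domain caps chosen so the five domains sum to the
# seven-point maximum; the paper states only that scales are domain-specific.
DOMAIN_MAX = {
    "accuracy": 2,
    "comprehensiveness": 2,
    "supplementation": 1,
    "consistency": 1,
    "fluency": 1,
}

def resolve_domain(rater1, rater2, rater3=None):
    """Two independent expert ratings; a discrepancy triggers a third assessment."""
    if rater1 == rater2:
        return rater1
    if rater3 is None:
        raise ValueError("Raters disagree: a third assessment is required.")
    # Illustrative tie-break: majority value if one exists, otherwise the median.
    return int(median([rater1, rater2, rater3]))

def safe_llm_score(domain_ratings):
    """Sum the resolved domain scores for one response (maximum of 7 points)."""
    total = 0
    for domain, ratings in domain_ratings.items():
        score = resolve_domain(*ratings)
        total += min(score, DOMAIN_MAX[domain])
    return total

# Illustrative ratings only: each prompt is submitted three times, and the
# SAFE-LLM score can be reported per response or averaged across responses.
responses = [
    {"accuracy": (2, 2), "comprehensiveness": (1, 2, 2), "supplementation": (1, 1),
     "consistency": (1, 1), "fluency": (1, 1)},
    {"accuracy": (2, 2), "comprehensiveness": (2, 2), "supplementation": (0, 0),
     "consistency": (1, 1), "fluency": (1, 1)},
    {"accuracy": (1, 1), "comprehensiveness": (2, 2), "supplementation": (1, 1),
     "consistency": (1, 1), "fluency": (1, 1)},
]
per_response = [safe_llm_score(r) for r in responses]
print(per_response, round(mean(per_response), 2))  # e.g. [7, 6, 6] 6.33
```

Under these assumptions, the per-response totals correspond to individual-answer SAFE-LLM scores, and the mean over the three submissions of a prompt gives the holistic assessment described in the abstract.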