Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM)
Abstract
Large language models (LLMs) are AI-powered systems that have demonstrated significant potential in various fields, including medicine. Despite their promise, the methods for evaluating their performance in medical contexts remain inconsistent. This paper introduces the Standardized Assessment Framework for Evaluations of Large Language Models in Medicine (SAFE-LLM) to streamline and standardize the evaluation of LLMs in healthcare. SAFE-LLM assesses five domains: accuracy, comprehensiveness, supplementation, consistency, and fluency. Accuracy refers to the correctness of the model's response, comprehensiveness to the detail and reasoning provided, supplementation to additional relevant information, consistency to uniformity across repeated answers, and fluency to the coherence of responses. Each prompt is submitted to the model three times, and each response is evaluated by two independent experts. Discrepancies between the two evaluations trigger a third assessment to ensure reliability. Grading is performed on a domain-specific scale, with a maximum possible score of seven points. The SAFE-LLM score can be applied to individual answers or averaged across responses for a holistic assessment. This framework aims to unify evaluation standards, facilitating the comparison and improvement of LLMs in medical applications. Developing standardized evaluation tools like SAFE-LLM is critical for integrating AI into healthcare effectively. This framework is a preliminary step towards more rigorous and comparable assessments of LLMs, enhancing their applicability and trustworthiness in medical settings.
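To make the scoring workflow concrete, the Python sketch below aggregates expert ratings into a SAFE-LLM score under stated assumptions: the per-domain point caps, the tie-break rule for a third rating, and all function and variable names are hypothetical illustrations, since the abstract specifies only that each domain uses its own scale, that two independent experts rate each response with a third assessment on disagreement, and that the maximum possible score is seven points. It is a minimal sketch of the procedure, not the authors' implementation.

```python
from statistics import mean, median

# Hypothetical per-domain caps chosen so the five domains sum to the
# seven-point maximum; the paper states only that scales are domain-specific.
DOMAIN_MAX = {
    "accuracy": 2,
    "comprehensiveness": 2,
    "supplementation": 1,
    "consistency": 1,
    "fluency": 1,
}

def resolve_domain(rater1, rater2, rater3=None):
    """Two independent expert ratings; a discrepancy triggers a third assessment."""
    if rater1 == rater2:
        return rater1
    if rater3 is None:
        raise ValueError("Raters disagree: a third assessment is required.")
    # Illustrative tie-break: majority value if one exists, otherwise the median.
    return int(median([rater1, rater2, rater3]))

def safe_llm_score(domain_ratings):
    """Sum the resolved domain scores for one response (maximum of 7 points)."""
    total = 0
    for domain, ratings in domain_ratings.items():
        score = resolve_domain(*ratings)
        total += min(score, DOMAIN_MAX[domain])
    return total

# Illustrative ratings only: each prompt is submitted three times, and the
# SAFE-LLM score can be reported per response or averaged across responses.
responses = [
    {"accuracy": (2, 2), "comprehensiveness": (1, 2, 2), "supplementation": (1, 1),
     "consistency": (1, 1), "fluency": (1, 1)},
    {"accuracy": (2, 2), "comprehensiveness": (2, 2), "supplementation": (0, 0),
     "consistency": (1, 1), "fluency": (1, 1)},
    {"accuracy": (1, 1), "comprehensiveness": (2, 2), "supplementation": (1, 1),
     "consistency": (1, 1), "fluency": (1, 1)},
]
per_response = [safe_llm_score(r) for r in responses]
print(per_response, round(mean(per_response), 2))  # e.g. [7, 6, 6] 6.33
```

Under these assumptions, the per-response totals correspond to individual-answer SAFE-LLM scores, and the mean over the three submissions of a prompt gives the holistic assessment described in the abstract.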