Comparative Evaluation of State-of-the-Art Large Language Models for Patient Education Prior to Interventional Radiology Procedures

Abstract

Purpose: This study evaluates the ability of four large language models (LLMs) to answer common patient questions before transarterial periarticular embolization (TAPE), computed tomography (CT)-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST). The goal is to assess their potential to enhance clinical workflows and patient comprehension, along with the associated risks.

Materials and Methods: A total of 35 TAPE-related, 34 CT-HDR brachytherapy-related, and 36 BEST-related questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b. Two board-certified radiologists independently assessed the LLM-generated responses, rating accuracy on a 5-point Likert scale. Statistical analysis compared LLM performance across question categories to gauge suitability for patient education.

Results: DeepSeek-V3 attained the highest mean scores for BEST (4.49 ± 0.77) and CT-HDR (4.24 ± 0.81) and performed comparably to ChatGPT-4o on TAPE-related questions (DeepSeek-V3 4.20 ± 0.77 vs. ChatGPT-4o 4.17 ± 0.64; p = 1.000). In contrast, OpenBioLLM-8b (BEST 3.51 ± 1.15, CT-HDR 3.32 ± 1.13, TAPE 3.34 ± 1.16) and BioMistral-7b (BEST 2.92 ± 1.35, CT-HDR 3.03 ± 1.06, TAPE 3.33 ± 1.28) performed significantly worse than DeepSeek-V3 and ChatGPT-4o across all procedures. Preparation/Planning was the only question category without statistically significant differences across all three procedures.

Conclusion: DeepSeek-V3 and ChatGPT-4o excelled on TAPE, BEST, and CT-HDR brachytherapy questions, indicating potential to enhance patient education in interventional radiology, where complex but minimally invasive procedures are often explained in brief consultations. However, OpenBioLLM-8b and BioMistral-7b produced inaccuracies more frequently, suggesting that LLMs cannot yet replace comprehensive clinical consultations. These findings should be validated through patient feedback and implementation in clinical workflows.
