Comparative Evaluation of State‑of‑the‑Art Large Language Models for Patient Education Prior to Interventional Radiology procedures

Bogdan Levita
Semil Eminovic
Willie Magnus Lüdemann
Dirk Schnapauff
Robin Schmidt
Anna-Maria Haack
Andrea Dell’Orco
Jawed Nawabi
Tobias Penzkofer

Abstract

Purpose: This study evaluates the ability of four large language models (LLMs) to answer common patient questions preceding transarterial periarticular embolization (TAPE), computed tomography (CT)-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST). The goal is to evaluate their potential to enhance clinical workflows and patient comprehension, while also assessing associated risks.

Materials and Methods: A total of 35 TAPE-related, 34 CT-HDR brachytherapy-related, and 36 BEST-related questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b. The LLM-generated responses were independently assessed by two board-certified radiologists. Accuracy was rated on a 5-point Likert scale. Statistical analyses compared LLM performance across question categories to gauge suitability for patient education.

Results: DeepSeek-V3 attained the highest mean scores for BEST [4.49 (± 0.77)] and CT-HDR [4.24 (± 0.81)] and demonstrated comparable performance to ChatGPT-4o for TAPE-related questions (DeepSeek-V3 [4.20 (± 0.77)] vs. ChatGPT-4o [4.17 (± 0.64)]; p = 1.000). In contrast, OpenBioLLM-8b (BEST 3.51 (± 1.15), CT-HDR 3.32 (± 1.13), TAPE 3.34 (± 1.16)) and BioMistral-7b (BEST 2.92 (± 1.35), CT-HDR 3.03 (± 1.06), TAPE 3.33 (± 1.28)) performed significantly worse than DeepSeek-V3 and ChatGPT-4o across all procedures. Preparation/Planning was the only category without statistically significant differences across all three procedures.

Conclusion: DeepSeek-V3 and ChatGPT-4o excelled on TAPE, BEST, and CT-HDR brachytherapy questions, indicating potential to enhance patient education in interventional radiology, where complex but minimally invasive procedures are often explained in brief consultations. However, OpenBioLLM-8b and BioMistral-7b exhibited more frequent inaccuracies, suggesting that LLMs cannot yet replace comprehensive clinical consultations. These findings should be validated through patient feedback and implementation in clinical workflows.

Version published to 10.21203/rs.3.rs-7329930/v1 on Research Square
Aug 22, 2025

Related articles

Comparison of Multimodal Large Language Models and Physicians for Medical Diagnosis Using NEJM Image Challenge Cases: Cross-sectional Study

This article has 6 authors:
1. Chiyu Sheng
2. Shumin Shen
3. Lin Wang
4. Wei Chen
5. Shanghu Wang
6. Nianfei Wang
This article has no evaluations. Latest version: Sep 1, 2025

A Comparative Performance Analysis of AI-Assisted Language Models in Preoperative Patient Education for Mitral Valve Surgery

This article has 10 authors:
1. Banu Bahriye Akdağ
2. Mehmet Şenel Bademci
3. İhsan Peker
4. Okay Güven Karaca
5. Çağrı Kandemir
6. Barçın Özcem
7. Hüseyin Durmaz
8. Meryem Çakır
9. İrem Özçetin
10. Hidayet Onur Selçuk
This article has no evaluations. Latest version: Sep 9, 2025

High-resolution semiconductor 18F-FDG PET/CT with prone positioning for assessing the extent of mucinous breast carcinoma: initial experience

This article has 11 authors:
1. Hiroyuki Kuroda
2. Takeshi Yoshizako
3. Anna Murata
4. Nobuhiro Yada
5. Rika Yoshida
6. Mitsunari Maruyama
7. Nobuko Yamamoto
8. Takayuki Kadoya
9. Manabu Yoshida
10. Daisuke Niino
11. Yasushi Kaji
This article has no evaluations. Latest version: Sep 9, 2025
