Evaluating the Impact of Authoritative and Subjective Cues on Large Language Model Reliability for Clinical Inquiries: An Experimental Study

Abstract

Background

Large Language Models (LLMs) show significant promise in medicine but are typically evaluated using neutral, standardized questions. In real-world scenarios, inquiries from patients, students, or clinicians are often framed with subjective beliefs or cues from perceived authorities. The impact of these non-neutral, noisy prompts on LLM reliability is a critical but understudied area. This study aimed to experimentally evaluate how subjective impressions (e.g., a flawed self-recalled memory) and authoritative cues (e.g., a statement attributed to a teacher) embedded in user prompts influence the accuracy and reliability of LLM responses to a clinical question with a definitive answer.

Method

Five state-of-the-art LLMs were tested on a clinical question regarding the treatment line of aripiprazole, for which established guidelines (CANMAT) provide a gold standard answer. LLM performance was assessed under three prompt conditions: a neutral baseline, a prompt containing an incorrect “self-recalled” memory, and a prompt containing an incorrect “authoritative” cue. Response accuracy, self-rated confidence, efficacy, and tolerability scores were collected across 250 test runs (5 models × 5 scenarios × 10 repetitions). Accuracy differences were tested with χ² and Cramér’s V, and score shifts were analyzed with van Elteren tests.
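
For readers who want to see the shape of the accuracy analysis described above, the Python sketch below (not taken from the study) runs a chi-square test of independence on a prompt-condition × accuracy contingency table and derives Cramér’s V. The cell counts, including the assumed 50/100/100 split of the 250 runs across conditions, are illustrative placeholders chosen only to match the reported percentages; they are not the authors’ data or code.

```python
# Illustrative sketch (not the study's code): chi-square test of
# accuracy by prompt condition, with Cramér's V as the effect size.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: prompt conditions; columns: [correct, incorrect] run counts.
# The 50/100/100 split of the 250 runs is an assumption made only to
# match the reported accuracies (100%, 45%, 1%).
counts = np.array([
    [50,  0],   # neutral baseline
    [45, 55],   # incorrect "self-recalled" memory
    [ 1, 99],   # incorrect "authoritative" cue
])

chi2, p, dof, _ = chi2_contingency(counts)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = counts.sum()
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))

print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}, Cramér's V = {cramers_v:.2f}")
```

The van Elteren test applied to the efficacy and tolerability scores is a stratified extension of the Wilcoxon rank-sum test; to my knowledge it has no single built-in SciPy routine, so it is omitted from this sketch.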

Results

In the baseline condition, all models achieved 100% accuracy. However, accuracy significantly decreased in conditions with misleading cues, dropping to 45% with self-recall prompts and 1% with authoritative prompts. A strong association was found between the prompt condition and accuracy (Cramér’s V = 0.75, P < .001). Similarly, both efficacy and tolerability scores decreased in response to misleading cues. Notably, while accuracy collapsed in the authoritative condition, the models’ self-rated confidence remained high, showing no statistical difference from the baseline condition.

Conclusions

The results suggest that LLMs can be highly vulnerable to biased inquiries, especially those invoking authority, often responding with overconfidence. This highlights potential limitations in current LLMs’ reliability and underscores the need for new standards in validation, user education, and system design for their safe and effective deployment across the healthcare ecosystem.