Evaluating the Impact of Authoritative and Subjective Cues on Large Language Model Reliability for Clinical Inquiries: An Experimental Study

Abstract

Background

Large Language Models (LLMs) show significant promise in medicine but are typically evaluated using neutral, standardized questions. In real-world scenarios, inquiries from patients, students, or clinicians are often framed with subjective beliefs or cues from perceived authorities. The impact of these non-neutral, noisy prompts on LLM reliability is a critical but understudied area. This study aimed to experimentally evaluate how subjective impressions (e.g., a flawed self-recalled memory) and authoritative cues (e.g., a statement attributed to a teacher) embedded in user prompts influence the accuracy and reliability of LLM responses to a clinical question with a definitive answer.

Method

Five state-of-the-art LLMs were tested on a clinical question regarding the treatment line of aripiprazole, for which established guidelines (CANMAT) provide a gold standard answer. LLM performance was assessed under three prompt conditions: a neutral baseline, a prompt containing an incorrect “self-recalled” memory, and a prompt containing an incorrect “authoritative” cue. Response accuracy, self-rated confidence, efficacy, and tolerability scores were collected across 250 test runs (5 models × 5 scenarios × 10 repetitions). Accuracy differences were tested with χ² and Cramér’s V, and score shifts were analyzed with van Elteren tests.
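
For readers who want to see the shape of the accuracy analysis described above, the Python sketch below (not taken from the study) runs a chi-square test of independence on a prompt-condition × accuracy contingency table and derives Cramér’s V. The cell counts, including the assumed 50/100/100 split of the 250 runs across conditions, are illustrative placeholders chosen only to match the reported percentages; they are not the authors’ data or code.

```python
# Illustrative sketch (not the study's code): chi-square test of
# accuracy by prompt condition, with Cramér's V as the effect size.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: prompt conditions; columns: [correct, incorrect] run counts.
# The 50/100/100 split of the 250 runs is an assumption made only to
# match the reported accuracies (100%, 45%, 1%).
counts = np.array([
    [50,  0],   # neutral baseline
    [45, 55],   # incorrect "self-recalled" memory
    [ 1, 99],   # incorrect "authoritative" cue
])

chi2, p, dof, _ = chi2_contingency(counts)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = counts.sum()
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))

print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}, Cramér's V = {cramers_v:.2f}")
```

The van Elteren test applied to the efficacy and tolerability scores is a stratified extension of the Wilcoxon rank-sum test; to my knowledge it has no single built-in SciPy routine, so it is omitted from this sketch.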

Results

In the baseline condition, all models achieved 100% accuracy. However, accuracy significantly decreased in conditions with misleading cues, dropping to 45% with self-recall prompts and 1% with authoritative prompts. A strong association was found between the prompt condition and accuracy (Cramér’s V = 0.75, P < .001). Similarly, both efficacy and tolerability scores decreased in response to misleading cues. Notably, while accuracy collapsed in the authoritative condition, the models’ self-rated confidence remained high, showing no statistical difference from the baseline condition.

Conclusions

The results suggest that LLMs can be highly vulnerable to biased inquiries, especially those invoking authority, often responding with overconfidence. This highlights potential limitations in current LLMs’ reliability and underscores the need for new standards in validation, user education, and system design for their safe and effective deployment across the healthcare ecosystem.