Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments

Trent N. Cash
Daniel M. Oppenheimer
Sara Christie

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions, including those for which the answer is uncertain, such as predictions about future events. For example, a user may ask an LLM to predict whether a stock will go up in the next week or if they will earn an A on their final exam. When humans make these predictions, they often accompany their responses with metacognitive confidence judgments indicating their belief in the accuracy of their prediction. LLMs are certainly capable of – and willing to – provide confidence judgments, but it is currently unclear how meaningful or accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We evaluate the absolute and relative accuracy of confidence judgments from two LLMs (ChatGPT and Gemini) compared to human participants across three prediction domains: NFL game winners (Study 1a; n = 502), Oscar award winners (Study 1b; n = 109), and future Pictionary performance (Study 2; n = 164). Our findings reveal that LLMs’ confidence judgments closely align with those of humans in terms of accuracy, biases, and errors. However, unlike humans, LLMs struggle to adjust their confidence judgments based on past performance, highlighting a key area for improvement in their design.

Version published to 10.31234/osf.io/47df5 on OSF Preprints
Jul 24, 2024

Assessing the Response Strategies of Large Language Models Under Uncertainty: A Comparative Study Using Prompt Engineering

This article has 2 authors:
1. Nehoda Lainwright
2. Moyat Pemberton
This article has no evaluationsLatest version Aug 1, 2024
Assessing Reasoning Capabilities of Commercial LLMs: A Comparative Study of Inductive and Deductive Tasks

This article has 3 authors:
1. Rowena Witali
2. Quentin Latrese
3. Giles Ravenscroft
This article has no evaluationsLatest version Aug 6, 2024
Bland and faraway bugs: Does the effect of perceptual richness on transfer depend on the semantic distance of the test items?

This article has 1 author:
1. David Menendez
This article has no evaluationsLatest version Aug 6, 2024

Listed in

Abstract

Article activity feed

Related articles

Assessing the Response Strategies of Large Language Models Under Uncertainty: A Comparative Study Using Prompt Engineering

Assessing Reasoning Capabilities of Commercial LLMs: A Comparative Study of Inductive and Deductive Tasks

Bland and faraway bugs: Does the effect of perceptual richness on transfer depend on the semantic distance of the test items?