Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments

Abstract

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions, including those for which the answer is uncertain, such as predictions about future events. For example, a user may ask an LLM to predict whether a stock will go up in the next week or whether they will earn an A on their final exam. When humans make these predictions, they often accompany their responses with metacognitive confidence judgments indicating their belief in the accuracy of their prediction. LLMs are certainly capable of providing confidence judgments, and willing to do so, but it is currently unclear how meaningful or accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We evaluate the absolute and relative accuracy of confidence judgments from two LLMs (ChatGPT and Gemini) compared to human participants across three prediction domains: NFL game winners (Study 1a; n = 502), Oscar award winners (Study 1b; n = 109), and future Pictionary performance (Study 2; n = 164). Our findings reveal that LLMs' confidence judgments closely align with those of humans in terms of accuracy, biases, and errors. However, unlike humans, LLMs struggle to adjust their confidence judgments based on past performance, highlighting a key area for improvement in their design.
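To make the distinction between absolute and relative accuracy concrete, here is a minimal sketch, assuming confidence is reported on a 0-100% scale and each prediction is scored as correct or incorrect. The data and the specific metrics shown (a calibration bias and a confidence-outcome correlation) are illustrative stand-ins, not necessarily the measures the authors used.

```python
import numpy as np

# Hypothetical example data: confidence judgments (0-100%) for a set of
# predictions, and whether each prediction turned out to be correct.
confidence = np.array([90, 75, 60, 85, 55, 70])   # illustrative values
correct = np.array([1, 1, 0, 1, 0, 0])            # 1 = correct, 0 = incorrect

# Absolute accuracy (calibration): how far mean confidence departs from the
# actual hit rate. Positive values indicate overconfidence, negative values
# indicate underconfidence.
calibration_bias = confidence.mean() / 100 - correct.mean()

# Relative accuracy (resolution): whether higher confidence tends to accompany
# correct predictions, here as a simple confidence-outcome correlation.
resolution = np.corrcoef(confidence, correct)[0, 1]

print(f"Calibration bias: {calibration_bias:+.2f}")   # e.g., +0.23 = overconfident
print(f"Confidence-outcome correlation: {resolution:.2f}")
```

On this toy data, mean confidence (72.5%) exceeds the hit rate (50%), so the judge would be flagged as overconfident even though confidence still tracks which individual predictions were right.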
