Evaluating Log-Likelihood for Confidence Estimation in LLM-Based Multiple-Choice Question Answering
Abstract
Reliable deployment of large language models (LLMs) in question-answering tasks requires well-calibrated confidence estimates. This work investigates whether token-level log-likelihoods, i.e., sums of log-probabilities over the answer tokens, can serve as effective confidence signals in multiple-choice question answering (MCQA). We compare three scoring methods: (1) raw log-likelihood, (2) length-normalized log-likelihood, and (3) conventional softmax-based choice probability. Across four diverse MCQA benchmarks, we find that no single scoring method is universally best. Length normalization can significantly improve calibration but may reduce accuracy, while softmax and raw log-likelihood yield identical predictions. These results highlight important trade-offs between calibration and accuracy and offer guidance on selecting or adapting confidence measures for different tasks. Our findings inform the design of more trustworthy LLM-based QA systems and lay the groundwork for broader uncertainty quantification efforts.
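To make the three scores concrete, the following Python sketch (not taken from the paper; the function name and the example log-probability values are illustrative assumptions) computes a raw log-likelihood, a length-normalized log-likelihood, and a softmax-based choice probability from per-token log-probabilities of each answer option. It also makes visible why softmax and raw log-likelihood produce identical predictions: softmax is a monotone transform of the raw scores, so the argmax is unchanged.

```python
import math

def confidence_scores(option_token_logprobs):
    """Compute three MCQA confidence scores for one question.

    `option_token_logprobs`: dict mapping option label -> list of
    per-token log-probabilities for that option's answer string
    (in practice obtained from an LLM's output logits).
    """
    # (1) Raw log-likelihood: sum of token log-probabilities per option.
    raw_ll = {k: sum(lps) for k, lps in option_token_logprobs.items()}

    # (2) Length-normalized log-likelihood: mean log-probability per token,
    #     which reduces the bias toward shorter answer strings.
    norm_ll = {k: sum(lps) / len(lps) for k, lps in option_token_logprobs.items()}

    # (3) Softmax over the raw log-likelihoods: a probability distribution
    #     over the answer options. Its argmax equals the raw-score argmax,
    #     so only the confidence values differ, not the predicted answer.
    max_ll = max(raw_ll.values())
    exp_ll = {k: math.exp(v - max_ll) for k, v in raw_ll.items()}  # numerically stable
    z = sum(exp_ll.values())
    softmax_prob = {k: v / z for k, v in exp_ll.items()}

    return raw_ll, norm_ll, softmax_prob


if __name__ == "__main__":
    # Made-up token log-probs for a three-option question, for illustration only.
    logprobs = {
        "A": [-0.2, -0.5],          # short option
        "B": [-0.1, -0.3, -0.4],    # longer option
        "C": [-1.0, -1.2],
    }
    raw, norm, prob = confidence_scores(logprobs)
    print("raw log-likelihood: ", raw)
    print("length-normalized:  ", norm)
    print("softmax probability:", prob)
```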