A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Research

Rodrigo de Oliveira
Matthew Garber
James M Gwinnutt
Emaan Rashidi
Jwu-Hsuan (Shantina) Hwang
William Gilmour
Jay Nanavati
Khaldoun Zine El Abidine
Christina DeFilippo Mack

Abstract

Objectives

To assess the calibration of 9 large language models (LLMs) on biomedical natural language processing (BioNLP) tasks, in order to further the understanding of their trustworthiness and reliability in real-world settings.

Materials and Methods

For each LLM, we collected responses and corresponding confidence scores for all 13 datasets (grouped into 6 tasks) of the Biomedical Language Understanding & Reasoning Benchmark (BLURB). Confidence scores were assigned using 3 strategies: Verbal, Self-consistency, and Hybrid. For evaluation, we introduced Flex-ECE (Flexible Expected Calibration Error), a novel adaptation of ECE that accounts for partial correctness in model responses and so allows a more realistic assessment of calibration in language-based settings. Two post-hoc calibration techniques, isotonic regression and histogram binning, were also evaluated.
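The abstract does not give the formula for Flex-ECE, so the following is only a minimal sketch of one way an ECE variant could accommodate partial correctness. It assumes that each response carries a confidence in [0, 1] and a correctness score in [0, 1] (for example, a token-overlap score), and that the binary accuracy term of standard ECE is replaced by the mean correctness score per bin. The function name `flex_ece`, the equal-width binning, and the default of 10 bins are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def flex_ece(confidences: Sequence[float],
             scores: Sequence[float],
             n_bins: int = 10) -> float:
    """ECE-style error using graded correctness scores in [0, 1].

    Assumption: partial correctness replaces the 0/1 accuracy of
    standard ECE. This is a sketch, not the paper's definition.
    """
    if len(confidences) != len(scores):
        raise ValueError("confidences and scores must have equal length")
    if not confidences:
        raise ValueError("at least one response is required")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group (confidence, score) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for c, s in zip(confidences, scores):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, s))

    # Weighted average of |mean confidence - mean correctness| over bins.
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_score = sum(s for _, s in members) / len(members)
        error += (len(members) / total) * abs(mean_conf - mean_score)
    return error
```

With binary scores (0 or 1) this reduces to the usual equal-width ECE, which is the property one would expect from any such adaptation.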

Results

Across tasks, mean calibration error (Flex-ECE; lower is better) ranged from 23.9% (Population-Intervention-Comparison-Outcome extraction) to 46.6% (Relation Extraction). Across LLMs, Medicine-Llama3-8B had the best mean overall calibration (29.8%), while Flan-T5-XXL ranked highest on 5 of the 13 datasets. Across strategies, Self-consistency (mean: 27.3%) was better calibrated than Verbal (mean: 42.0%) and Hybrid (mean: 44.2%). Post-hoc methods substantially improved calibration, with the best mean calibrated Flex-ECEs ranging from 0.1% to 4.1%.
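For readers unfamiliar with self-consistency as a confidence strategy, the sketch below shows the general idea as it is commonly described: the same prompt is sampled several times and the confidence is the share of samples that agree with the most frequent answer. The abstract does not say how the authors sampled or compared responses, so the function name `self_consistency_confidence` and the whitespace and case normalisation are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def self_consistency_confidence(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that match it.

    Assumption: answers are compared after stripping whitespace and
    lower-casing. Real tasks may need task-specific matching.
    """
    if not samples:
        raise ValueError("at least one sampled response is required")
    normalised = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalised).most_common(1)[0]
    return answer, votes / len(normalised)
```

For example, four sampled answers of which three agree would yield a confidence of 0.75 for the majority answer.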

Discussion

The poor out-of-the-box calibration of LLMs poses a risk to the trustworthy deployment of such models in real-world BioNLP applications. Calibration can be improved post-hoc, and doing so is a recommended practice. Non-binary metrics for LLM evaluation, such as Flex-ECE, provide a more realistic assessment of the trustworthiness of LLMs, and indeed of any model that can be partially right or wrong.
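As an illustration of post-hoc calibration, the sketch below implements a basic form of histogram binning, one of the two techniques the study evaluated. It assumes a held-out calibration set of confidences and correctness scores in [0, 1]; each equal-width confidence bin is mapped to the mean correctness observed in that bin. The function names, the 10-bin default, and the fallback to the bin midpoint for empty bins are assumptions of this sketch, not details taken from the paper.

```python
from typing import List, Sequence


def fit_histogram_binning(confidences: Sequence[float],
                          scores: Sequence[float],
                          n_bins: int = 10) -> List[float]:
    """Learn one calibrated value per equal-width confidence bin."""
    if len(confidences) != len(scores):
        raise ValueError("confidences and scores must have equal length")
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for c, s in zip(confidences, scores):
        idx = min(int(c * n_bins), n_bins - 1)
        sums[idx] += s
        counts[idx] += 1
    # Empty bins fall back to their midpoint (an assumption of this sketch).
    return [
        sums[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    ]


def apply_histogram_binning(bin_values: Sequence[float],
                            confidence: float) -> float:
    """Replace a raw confidence with the calibrated value of its bin."""
    n_bins = len(bin_values)
    idx = min(int(confidence * n_bins), n_bins - 1)
    return bin_values[idx]
```

The table should be fitted on data that is separate from the evaluation set; otherwise the calibrated error is measured on the same examples used to fit the bins and will look better than it is.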

Conclusion

This study shows that the out-of-the-box calibration of LLMs is very poor, but that traditional post-hoc calibration techniques can substantially improve it.

Version published to 10.1101/2025.02.11.637373v1 on bioRxiv
Feb 15, 2025

Related articles

CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research

This article has 24 authors:
1. Owen Bianchi
2. Maya Willey
3. Chelsea X. Alvarado
4. Benjamin Danek
5. Marzieh Khani
6. Nicole Kuznetsov
7. Anant Dadu
8. Syed Shah
9. Mathew J. Koretsky
10. Mary B. Makarious
11. Cory Weller
12. Kristin S. Levine
13. Sungwon Kim
14. Paige Jarreau
15. Dan Vitale
16. Elise Marsan
17. Hirotaka Iwaki
18. Hampton Leonard
19. Sara Bandres-Ciga
20. Andrew B Singleton
21. Mike A Nalls
22. Shekoufeh Mokhtari
23. Daniel Khashabi
24. Faraz Faghri
This article has no evaluations
Latest version Jan 21, 2025

An Overview of Medical Knowledge Evaluation of Large Language Models: An Endeavor Toward a Standardized Evaluation and Reporting Guideline

This article has 2 authors:
1. Omid Kohandel Gargari
2. Gholamreza Habibi
This article has no evaluations
Latest version Jan 9, 2025

Large language models improve transferability of electronic health record-based predictions across countries and coding systems

This article has 6 authors:
1. Matthias Kirchler
2. Matteo Ferro
3. Veronica Lorenzini
4. FinnGen
5. Christoph Lippert
6. Andrea Ganna
This article has no evaluations
Latest version Feb 4, 2025