Open-source solution for evaluation and benchmarking of large language models for public health
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including text classification, information extraction, and sentiment analysis. Despite their growing use and development, there is no standardized benchmarking framework for domain-specific applications. The objective of this study is to develop and test an open-source solution that enables intuitive, easy, and rapid benchmarking of open-source and commercial LLMs for public health use cases.
An LLM prediction prototype was developed, supporting multiple open-source and commercial LLMs. It enables users to upload datasets, apply prompts, and generate structured JSON outputs for LLM tasks. The application prototype, built with R and the Shiny library, facilitates automated LLM evaluation and benchmarking by computing key performance metrics for categorical and numerical variables. We tested the prototype on four public health use cases: stance detection towards vaccination in tweets and in Facebook posts, detection of vaccine adverse reactions in BabyCenter forum posts, and extraction of quantitative epidemiological features from the World Health Organization Disease Outbreak News.
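For illustration, the sketch below shows what such an evaluation step might compute for a categorical task, written in R (the prototype's language). The function and data are hypothetical, not the prototype's actual API; numerical extraction tasks would instead use error metrics such as mean absolute error.

    # Hypothetical sketch of evaluating a categorical LLM task in base R:
    # computes accuracy and macro-averaged F1 from gold labels and predictions.
    eval_categorical <- function(truth, pred) {
      classes <- union(unique(truth), unique(pred))
      acc <- mean(truth == pred)
      f1 <- vapply(classes, function(k) {
        tp <- sum(truth == k & pred == k)
        fp <- sum(truth != k & pred == k)
        fn <- sum(truth == k & pred != k)
        if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
      }, numeric(1))
      list(accuracy = acc, macro_f1 = mean(f1))
    }

    # Example with made-up stance labels:
    gold <- c("favor", "against", "neutral", "favor")
    pred <- c("favor", "against", "favor", "favor")
    eval_categorical(gold, pred)
    # accuracy = 0.75; macro_f1 = 0.6 (per-class F1: favor 0.8, against 1, neutral 0)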
Results revealed high variability in LLM performance depending on the task, dataset, and model, with no single LLM consistently outperforming others across all tasks. While larger models generally excelled, smaller models performed competitively in specific scenarios, highlighting the importance of task-specific model selection.
This study contributes to the effective integration of LLMs in public health by providing a structured, user-friendly, and scalable solution for LLM prediction, evaluation, and benchmarking. Our results underscore the importance of standardized, task-specific evaluation methods and model selection, and of clear, structured prompts for improving LLM performance in domain-specific use cases.
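As an illustration of the kind of clear, structured prompt the results point to, the following is a hypothetical stance-detection prompt requesting schema-constrained JSON output (the study's actual prompts are not reproduced here), again sketched in R:

    # Hypothetical structured prompt enforcing a JSON output schema;
    # {post_text} marks where the document to classify would be inserted.
    prompt_template <- paste(
      "Classify the stance of the following post towards vaccination.",
      "Respond ONLY with a JSON object matching this schema:",
      '{"stance": "favor" | "against" | "neutral"}',
      "Post: {post_text}",
      sep = "\n"
    )

    # The model's reply can then be parsed into a structured record, e.g.:
    # jsonlite::fromJSON('{"stance": "favor"}')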