Open-source solution for evaluation and benchmarking of large language models for public health
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including text classification, information extraction, and sentiment analysis. Despite their growing use and development, there is no standardized benchmarking framework for domain-specific applications. The objective of this study is to develop and test an open-source solution that enables intuitive, easy, and rapid benchmarking of open-source and commercial LLMs for public health use cases.
An LLM prediction prototype was developed, supporting multiple open-source and commercial LLMs. It enables users to upload datasets, apply prompts, and generate structured JSON outputs for LLM tasks. The application prototype, built with R and the Shiny library, facilitates automated LLM evaluation and benchmarking by computing key performance metrics for categorical and numerical variables. We tested the prototype on four public health use cases: stance detection towards vaccination in tweets and in Facebook posts, detection of vaccine adverse reactions in BabyCenter forum posts, and extraction of quantitative epidemiological features from the World Health Organization Disease Outbreak News.
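For illustration, the sketch below shows what such an evaluation step might compute for a categorical task, written in R (the prototype's language). The function and data are hypothetical, not the prototype's actual API; numerical extraction tasks would instead use error metrics such as mean absolute error.

    # Hypothetical sketch of evaluating a categorical LLM task in base R:
    # computes accuracy and macro-averaged F1 from gold labels and predictions.
    eval_categorical <- function(truth, pred) {
      classes <- union(unique(truth), unique(pred))
      acc <- mean(truth == pred)
      f1 <- vapply(classes, function(k) {
        tp <- sum(truth == k & pred == k)
        fp <- sum(truth != k & pred == k)
        fn <- sum(truth == k & pred != k)
        if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
      }, numeric(1))
      list(accuracy = acc, macro_f1 = mean(f1))
    }

    # Example with made-up stance labels:
    gold <- c("favor", "against", "neutral", "favor")
    pred <- c("favor", "against", "favor", "favor")
    eval_categorical(gold, pred)
    # accuracy = 0.75; macro_f1 = 0.6 (per-class F1: favor 0.8, against 1, neutral 0)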
Results revealed high variability in LLM performance depending on the task, dataset, and model, with no single LLM consistently outperforming others across all tasks. While larger models generally excelled, smaller models performed competitively in specific scenarios, highlighting the importance of task-specific model selection.
This study contributes to the effective integration of LLMs in public health by providing a structured, user-friendly, and scalable solution for LLM prediction, evaluation, and benchmarking. Our results underscore the importance of standardized, task-specific evaluation methods and model selection, and of clear, structured prompts for improving LLM performance in domain-specific use cases.
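As an illustration of the kind of clear, structured prompt the results point to, the following is a hypothetical stance-detection prompt requesting schema-constrained JSON output (the study's actual prompts are not reproduced here), again sketched in R:

    # Hypothetical structured prompt enforcing a JSON output schema;
    # {post_text} marks where the document to classify would be inserted.
    prompt_template <- paste(
      "Classify the stance of the following post towards vaccination.",
      "Respond ONLY with a JSON object matching this schema:",
      '{"stance": "favor" | "against" | "neutral"}',
      "Post: {post_text}",
      sep = "\n"
    )

    # The model's reply can then be parsed into a structured record, e.g.:
    # jsonlite::fromJSON('{"stance": "favor"}')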