Open-source solution for evaluation and benchmarking of large language models for public health

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including text classification, information extraction, and sentiment analysis. However, most existing benchmarks are general-purpose and lack relevance for domain-specific applications such as public health. To address this gap, this study develops and tests an open-source solution enabling intuitive, easy, and rapid benchmarking of popular LLMs in public health contexts.

An LLM prediction prototype supporting multiple popular LLMs was developed. It enables users to upload datasets, apply prompts, and generate structured JSON outputs for LLM tasks. The application prototype, built with R and the Shiny library, facilitates automated LLM evaluation and benchmarking by computing key performance metrics for categorical and numeric variables. We tested these prototypes on four public health use cases: stance detection towards vaccination in tweets and in Facebook posts, detection of vaccine adverse reactions in BabyCenter forum posts, and extraction of quantitative epidemiological features from the World Health Organization Disease Outbreak News.
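To illustrate the kind of workflow described above, here is a minimal R sketch of prompting an LLM for stance classification with a structured JSON output and then scoring the predictions against a labelled dataset. This is not the authors' actual code: the package choices (httr2, jsonlite, yardstick), the API endpoint, the model name, the input file, and the helper function are assumptions made for illustration.

```r
# Hedged sketch, not the published prototype: structured-JSON stance
# classification followed by metric computation on a labelled dataset.
library(httr2)      # HTTP requests to the LLM API (assumed choice)
library(jsonlite)   # parse the JSON answer returned by the model
library(yardstick)  # accuracy and macro F1 on categorical predictions
library(dplyr)

# Hypothetical helper: ask the model for a JSON-formatted stance label.
classify_stance <- function(text, model = "gpt-4o-mini") {
  prompt <- paste0(
    "Classify the stance towards vaccination in the following post as ",
    "'positive', 'negative', or 'neutral'. ",
    "Answer only with JSON: {\"stance\": \"<label>\"}.\n\nPost: ", text
  )
  resp <- request("https://api.openai.com/v1/chat/completions") |>
    req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
    req_body_json(list(
      model = model,
      messages = list(list(role = "user", content = prompt)),
      response_format = list(type = "json_object")
    )) |>
    req_perform()
  fromJSON(resp_body_json(resp)$choices[[1]]$message$content)$stance
}

# Evaluate against a labelled dataset (hypothetical file with columns text, stance).
data <- read.csv("vaccine_tweets.csv")
data$pred <- vapply(data$text, classify_stance, character(1))
data <- data |> mutate(across(c(stance, pred), factor))

accuracy(data, truth = stance, estimate = pred)
f_meas(data, truth = stance, estimate = pred, estimator = "macro")
```

In the described prototype this loop is wrapped in a Shiny interface, so users upload a dataset and a prompt instead of writing the request code themselves; the sketch only shows the underlying prediction-and-evaluation pattern.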

Results revealed high variability in LLM performance depending on the task, dataset, and model, with no single LLM consistently outperforming others across all tasks. While larger models generally excelled, smaller models performed competitively in specific scenarios, highlighting the importance of task-specific model selection.

This study contributes to the effective integration of LLMs in public health by providing a structured, user-friendly and scalable solution for LLM prediction, evaluation and benchmarking. Our findings underline the relevance of standardised, task-specific evaluation methods and model selection, and of clear, structured prompts to improve LLM performance in domain-specific use cases.