Assessing the feasibility and acceptability of a bespoke large language model pipeline to extract data from different study designs for public health evidence reviews


Abstract

Introduction

Data extraction is a critical but resource-intensive step of the evidence review process. Whilst there is evidence that artificial intelligence (AI) and large language models (LLMs) can improve the efficiency of data extraction from randomised controlled trials, their potential for other study designs is unclear. In this context, this study aimed to evaluate the performance of a bespoke LLM pipeline (a Retrieval-Augmented Generation pipeline utilising LLaMa 3-70B) in automating data extraction from a range of study designs, by assessing the accuracy, reliability and acceptability of the extractions.
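To make the pipeline concrete, the sketch below shows one plausible shape for a Retrieval-Augmented Generation extraction step: chunk an article, retrieve passages relevant to a data field, and prompt the model with that context. It is a minimal illustration under stated assumptions; the chunking, the keyword-overlap retriever, and the `call_llama3_70b` stub are our placeholders, not the authors' implementation.

```python
# Illustrative RAG-style extraction step (assumed design, not the study's code).
from typing import List

def chunk_text(text: str, size: int = 500) -> List[str]:
    """Split an article into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(chunks: List[str], query: str, k: int = 3) -> List[str]:
    """Rank chunks by naive keyword overlap with the query (stand-in for a vector store)."""
    q_terms = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)
    return ranked[:k]

def call_llama3_70b(prompt: str) -> str:
    """Placeholder for a real Llama 3 70B call; returns a canned answer here."""
    return "[model answer would appear here]"

def extract_field(article_text: str, field: str) -> str:
    """Build a grounded prompt for one data field and query the model."""
    context = "\n---\n".join(retrieve(chunk_text(article_text), field))
    prompt = (
        f"Using only the excerpts below, extract the study's {field}.\n"
        f"Excerpts:\n{context}\n{field}:"
    )
    return call_llama3_70b(prompt)

if __name__ == "__main__":
    print(extract_field("A randomised controlled trial set in primary care...", "study design"))
```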

Methods

Accuracy was assessed by comparing the LLM outputs for 173 data fields with data extracted from a sample of 24 articles (including experimental, observational, qualitative, and modelling studies) from a previously conducted review, 3 of which were used for prompt engineering. Reliability (consistency) was assessed by calculating the mean maximum agreement rate (the highest proportion of identical returns from 10 consecutive extractions) for 116 data fields from 16 of the 24 studies. Acceptability of the accuracy and reliability outputs for each data field was assessed according to whether they would be usable in real-world settings if the model acted as one reviewer and a human acted as the second reviewer (see the sketch of the reliability metric below).
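The reliability metric, as we read its definition above, can be sketched as follows: for each data field, take the 10 repeated extractions, compute the share of the most frequent identical answer, then average those shares across fields. The field names and repeated outputs below are invented purely for illustration.

```python
# Worked sketch of the mean maximum agreement rate (illustrative data only).
from collections import Counter
from statistics import mean, stdev

def max_agreement_rate(extractions: list) -> float:
    """Highest proportion of identical returns among repeated extractions."""
    counts = Counter(extractions)
    return counts.most_common(1)[0][1] / len(extractions)

# Hypothetical outputs from 10 consecutive extractions for two data fields.
runs_by_field = {
    "setting": ["primary care"] * 8 + ["community", "general practice"],  # 0.8
    "outcome": ["HbA1c"] * 4 + ["HbA1c at 12 months"] * 6,                # 0.6
}

rates = [max_agreement_rate(runs) for runs in runs_by_field.values()]
print(f"mean maximum agreement rate = {mean(rates):.2f} (SD: {stdev(rates):.2f})")
```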

Results

Of the 173 data fields evaluated for accuracy, 68% were rated by human reviewers as acceptable (consistent with what would be deemed acceptable data extraction by a human reviewer). However, acceptability ratings varied by data field (from 33% to 100%), with at least 90% acceptability for ‘objective’, ‘setting’, and ‘study design’, but 54% or less for data fields such as ‘outcome’ and ‘time period’. For reliability, the mean maximum agreement rate was 0.71 (SD: 0.28), with variation across different data fields.

Conclusion

This evaluation demonstrates the potential for LLMs, when paired with human quality assurance, to support data extraction in evidence reviews that include a range of study designs; however, further improvements in performance and validation are required before the model can be introduced into review workflows.
