AI-Assisted Data Extraction with a Large Language Model: A Study Within Reviews


Abstract

Background

Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.

Objective

To compare an AI-assisted data extraction process with a traditional, human-only process.

Design

Study within reviews (SWAR) utilizing a prospective, parallel group comparison with blinded data adjudicators.

Setting

Workflow validation within six ongoing systematic reviews of interventions under real-world conditions.

Intervention

Initial data extraction using an LLM (Claude versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer.

Measurements

Concordance, time on task, accuracy, recall, precision, and error analysis.

Results

The six systematic reviews in the SWAR contributed 9,341 data elements extracted from 63 studies. Concordance between the two methods was 77.2%. Compared with enhanced human data extraction, the accuracy of the AI-assisted approach was 91.0%, with a recall of 89.4% and a precision of 98.9%. The AI-assisted approach had fewer incorrect extractions (9.0% vs. 11.0%) and a similar risk of major errors (2.5% vs. 2.7%) compared with the traditional human-only method, with a median time saving of 41 minutes per study. Missed data items were the most frequent errors in both approaches.
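To make the reported metrics concrete, here is a minimal sketch of how recall, precision, and accuracy are conventionally computed from extraction counts. The counts below are hypothetical illustrations, not figures from the study; the function names are our own.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of reference data elements that were actually extracted."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos: int, false_pos: int) -> float:
    """Share of extracted data elements that were correct."""
    return true_pos / (true_pos + false_pos)

def accuracy(correct: int, total: int) -> float:
    """Share of all data elements handled correctly."""
    return correct / total

# Hypothetical counts for illustration only:
# 850 correct extractions, 100 missed items, 10 spurious extractions.
print(round(recall(850, 100), 3))     # recall under these assumed counts
print(round(precision(850, 10), 3))   # precision under these assumed counts
```

Under definitions like these, a high precision with lower recall (as reported here, 98.9% vs. 89.4%) indicates that missed items, rather than spurious extractions, dominate the errors, which is consistent with the study's error analysis.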

Limitations

Assessing the concordance of data extractions and classifying errors required subjective judgment. Tracking time on task consistently was challenging.

Conclusion

The use of an LLM can improve accuracy of data extraction and save time in evidence synthesis. Results reinforce previous findings that human-only data extraction is prone to errors.

Primary Funding Source

US Agency for Healthcare Research and Quality, RTI International

Registration

SWAR28 Gerald Gartlehner (2023 FEB 11 2102).pdf
