Leveraging Large Language Models for Data Extraction in Metaresearch
Abstract
Manual data extraction in metaresearch is often a tedious, time-consuming, and error-prone process. In this paper, we investigate whether the current generation of Large Language Models (LLMs) can be used to extract accurate information from scientific papers. Across the metaresearch literature, extraction tasks usually range from retrieving verbatim information (e.g., the number of participants in a study, effect sizes, or whether the study is preregistered) to making subjective inferences. Using a publicly available dataset (Blanchard et al., 2022) containing a wide range of meta-scientific variables from 34 network psychometrics papers, we tested six LLMs (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku, GPT-4o, GPT-4o mini, o1-preview). We used the models' APIs to extract the variables from the documents automatically. This automated pipeline allows batch processing of research papers and thus represents a more efficient and scalable way to extract metascientific data than the default chat interface. Our results point to high accuracy and high potential of LLMs for metascientific data extraction. The accuracy of the individual models ranged from 75% to 90%, and most models were able to convey uncertainty in the more contentious cases. We compare the accuracy and cost-effectiveness of the individual models and discuss the characteristics of variables that are (un)suitable for automatic coding. Furthermore, we describe common pitfalls and best practices of automated LLM data extraction. The proposed procedure can decrease the time and costs associated with conducting metaresearch by orders of magnitude.
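To illustrate the kind of API-based extraction the abstract describes, the following is a minimal sketch, not the authors' actual pipeline: it assumes the Anthropic Python SDK, an API key in the environment, and an illustrative prompt and variable list (`sample_size`, `preregistered`) that are placeholders rather than the coding scheme used in the paper. Looping such a call over a folder of papers is what enables batch processing.

```python
# Minimal sketch of API-based variable extraction from one paper's text.
# Assumes the Anthropic Python SDK (`pip install anthropic`) and an
# ANTHROPIC_API_KEY environment variable; the prompt and variables below
# are illustrative, not the paper's actual coding scheme.
import anthropic

client = anthropic.Anthropic()

# Full text of a single research paper (hypothetical file name).
paper_text = open("paper.txt", encoding="utf-8").read()

prompt = (
    "From the paper below, extract the following variables as JSON: "
    "sample_size (integer), preregistered (yes/no/unclear). "
    "If a value cannot be determined from the text, answer 'unclear'.\n\n"
    + paper_text
)

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)

# The model's JSON-formatted extraction for this paper.
print(response.content[0].text)
```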