Accelerating the pace and accuracy of systematic reviews using AI: a validation study

Jiada Zhan
Kara Suvada
Muwu Xu
Wenya Tian
Kelly C. Cara
Taylor C. Wallace
Mohammed K. Ali

Abstract

Background

Artificial intelligence (AI) can greatly enhance efficiency in systematic literature reviews and meta-analyses, but its accuracy in screening titles/abstracts and full-text articles is uncertain.

Objectives

This study evaluated the performance (sensitivity and specificity) of Review Copilot, a GPT-4-based AI program, against human decisions (the gold standard) in screening titles/abstracts and full-text articles from four published systematic reviews/meta-analyses.

Research Design

This validation study used data from four previously published systematic reviews/meta-analyses, which included observational studies and randomized controlled trials. Review Copilot runs on the OpenAI GPT-4 server. Its decisions to include or exclude titles/abstracts and full-text articles were compared with the human decisions (gold standard) made in the four reviews. Sensitivity, specificity, and balanced accuracy were calculated for title/abstract screening and for full-text screening.
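The three metrics named above follow from a two-by-two table of AI decisions against human decisions. The sketch below uses the standard textbook definitions and assumes "include" is the positive class; the class name `ScreeningCounts` and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Two-by-two counts, with human decisions as the reference standard."""
    tp: int  # AI included, humans included
    fn: int  # AI excluded, humans included
    tn: int  # AI excluded, humans excluded
    fp: int  # AI included, humans excluded


def sensitivity(c: ScreeningCounts) -> float:
    """Share of human-included records that the AI also included."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ScreeningCounts) -> float:
    """Share of human-excluded records that the AI also excluded."""
    return c.tn / (c.tn + c.fp)


def balanced_accuracy(c: ScreeningCounts) -> float:
    """Unweighted mean of sensitivity and specificity."""
    return (sensitivity(c) + specificity(c)) / 2


# Hypothetical counts for illustration only; not data from the study.
example = ScreeningCounts(tp=45, fn=5, tn=160, fp=40)

if __name__ == "__main__":
    print(f"sensitivity:       {sensitivity(example):.3f}")
    print(f"specificity:       {specificity(example):.3f}")
    print(f"balanced accuracy: {balanced_accuracy(example):.3f}")
```

Balanced accuracy is useful in screening because excluded records usually far outnumber included ones, so plain accuracy would be dominated by the exclusions.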

Results

Review Copilot’s sensitivity and specificity were 99.2% and 83.6%, respectively, for title/abstract screening and 97.6% and 47.4%, respectively, for full-text screening. The average agreement between two runs was 95.4%, with a kappa statistic of 0.83. Review Copilot completed screening in one-quarter of the time that humans required.
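The agreement figures compare two runs of the same tool on the same records. The abstract does not state which kappa variant was used; the sketch below assumes Cohen's kappa on include/exclude labels, and the function names and example labels are hypothetical.

```python
from collections import Counter


def percent_agreement(run_a, run_b):
    """Fraction of records that received the same decision in both runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def cohens_kappa(run_a, run_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(run_a)
    p_observed = percent_agreement(run_a, run_b)
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    labels = set(freq_a) | set(freq_b)
    # Chance agreement: probability that both runs pick the same label
    # if each run assigned labels independently at its own marginal rates.
    p_expected = sum(freq_a[label] * freq_b[label] for label in labels) / n**2
    if p_expected == 1:
        raise ValueError("kappa is undefined when both runs use a single label")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical decisions for ten records; not data from the study.
run_1 = ["include"] * 4 + ["exclude"] * 6
run_2 = ["include"] * 3 + ["exclude"] * 7

if __name__ == "__main__":
    print(f"agreement: {percent_agreement(run_1, run_2):.3f}")
    print(f"kappa:     {cohens_kappa(run_1, run_2):.3f}")
```

Kappa is lower than raw agreement because it discounts the matches two runs would produce by chance alone, which is why the abstract reports both figures.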

Conclusions

AI use in systematic reviews and meta-analyses is inevitable. Health researchers must understand these technologies’ strengths and limitations to ethically leverage them for research efficiency and evidence-based decision-making in health.

Version published to 10.1101/2024.12.10.24318803 on medRxiv
Dec 11, 2024

Related articles

Risk prediction for lung cancer screening: a systematic review and meta-regression

This article has 9 authors:
1. Ramin Rezaeianzadeh
2. Crystal Leung
3. Soo Jeong Kim
4. Kayly Choy
5. Kate Johnson
6. Miranda Kirby
7. Stephen Lam
8. Benjamin Smith
9. Mohsen Sadatsafavi
This article has no evaluations. Latest version: Sep 12, 2025

Textbook-Level Medical Knowledge in Large Language Models: A Comparative Evaluation Using the Japanese National Medical Examination

This article has 8 authors:
1. Mingxin Liu
2. Tsuyoshi Okuhara
3. Zhehao Dai
4. Minghong Zhao
5. Wenqiang Yin
6. Hiroko Okada
7. Emi Furukawa
8. Takahiro Kiuchi
This article has no evaluations. Latest version: Sep 12, 2025

Clinical evaluation of a natural language processing system for assisting structured diagnosis recording at the point of care: MiADE (Medical Information AI Data Extractor)

This article has 13 authors:
1. Mairead McErlean
2. Jack Ross
3. Jonathan Kossoff
4. Maisarah Amran
5. James Brandreth
6. Leilei Zhu
7. Gary Philippo
8. Wai Keong Wong
9. Folkert W. Asselbergs
10. Richard J.B. Dobson
11. Yogini H Jani
12. Enrico Costanza
13. Anoop D. Shah
This article has no evaluations. Latest version: Sep 12, 2025
