Accelerating the pace and accuracy of systematic reviews using AI: a validation study
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Background
Artificial intelligence (AI) can greatly enhance efficiency in systematic literature reviews and meta-analyses, but its accuracy in screening titles/abstracts and full-text articles is uncertain.
Objectives
This study evaluated the performance metrics (sensitivity, specificity) of a GPT-4 AI program, Review Copilot, against human decisions (gold standard) in screening titles/abstracts and full-text articles from four published systematic reviews/meta-analyses.
Research Design
Participant data from four already-published systematic literature reviews were used for this validation study. This was a study comparing Review Copilot to human decision-making (gold standard) in screening titles/abstracts and full-text articles for systematic reviews/meta-analyses. The four studies that were used in this study included observational studies and randomized control trials. Review Copilot operates on the OpenAI, GPT-4 server. We examined the performance metrics of Review Copilot to include and exclude titles/abstracts and full-text articles as compared to human decisions in four systematic reviews/meta-analyses. Sensitivity, specificity, and balanced accuracy of title/abstract and full-text screening were compared between Review Copilot and human decisions.
Results
Review Copilot’s sensitivity and specificity for title/abstract screening were 99.2% and 83.6%, respectively, and 97.6% and 47.4% for full-text screening. The average agreement between two runs was 95.4%, with a kappa statistic of 0.83. Review Copilot screened in one-quarter of the time compared to humans.
Conclusions
AI use in systematic reviews and meta-analyses is inevitable. Health researchers must understand these technologies’ strengths and limitations to ethically leverage them for research efficiency and evidence-based decision-making in health.