Evaluating the use of Artificial Intelligence (AI) in Systematic Review Abstract Screening: A Comparative Study of AI-aided Tools
Abstract
Background: Manual abstract screening in systematic reviews is a time-consuming and labour-intensive task. With the rise of artificial intelligence (AI), the number of published articles has grown substantially, adding to the workload of review studies that rely on robust and timely evidence synthesis. At the same time, AI-aided screening tools have been developed to accelerate this process. While previous studies have demonstrated the efficiency of such tools, ongoing technological advances necessitate updated evaluations, particularly for tools that are freely accessible. In review types such as umbrella reviews, where both the topic area and study design are central to eligibility decisions, the performance of AI-aided tools remains underexplored.

Methods: We conducted a comparative evaluation of six freely available AI-aided abstract screening tools (Rayyan, RobotAnalyst, PICO Portal, Abstrackr, ASReview, and Colandr) using a previously completed umbrella review of interdisciplinary urban planning and public health studies. We assessed (1) early recall performance (i.e., identification of included studies within the first 25% of screening), (2) feature availability and depth, and (3) user experience. This Study Within A Review (SWAR) was registered in the SWAR repository as SWAR 25.

Results: All evaluated tools supported the review process by facilitating screening and offering features such as prioritisation and keyword highlighting. However, none identified more than 50% of the previously included studies within the first 25% of screening. Feature analysis and user feedback suggested that Rayyan and PICO Portal provided the most useful functionality for our interdisciplinary umbrella review context, although limitations were noted in duplicate removal and in recognising the importance of study design in eligibility decisions.

Conclusions: Although a growing number of AI-assisted abstract screening tools are publicly and freely available, their accuracy, usability, and adaptability to different review designs remain limited. Enhanced support for duplicate detection and integration of study design considerations could improve their utility in umbrella reviews and other complex evidence syntheses. Continued evaluation and user training may support broader adoption across diverse research contexts.
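To make the early recall criterion concrete, the short sketch below illustrates one way such a measure could be computed: the proportion of previously included studies that a tool surfaces within the first 25% of its prioritised screening order. The function name, record identifiers, and ranking are hypothetical and are not taken from the study's own analysis code.

```python
# Illustrative sketch only: early recall = share of known-included studies
# found in the first `screened_fraction` of a tool's prioritised record order.

def early_recall(ranked_ids, included_ids, screened_fraction=0.25):
    """Fraction of known-included studies appearing in the top portion
    of the screening order defined by `screened_fraction`."""
    cutoff = max(1, int(len(ranked_ids) * screened_fraction))
    top = set(ranked_ids[:cutoff])
    included = set(included_ids)
    if not included:
        return 0.0
    return len(top & included) / len(included)

# Made-up example: 8 records, 4 of which were included in the original review.
ranking = ["r3", "r7", "r1", "r5", "r2", "r8", "r4", "r6"]  # tool's priority order
gold_includes = {"r3", "r2", "r4", "r6"}

print(early_recall(ranking, gold_includes))  # 0.25: only r3 falls in the first 25%
```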