A machine learning model to support the screening for methods guidance articles in MEDLINE: A performance evaluation of ASReview simulation mode


Abstract

Background

Advances in clinical research methods are frequently published in biomedical journals, but identifying these articles remains challenging because of the rapid growth of the literature and insufficient indexing in biomedical databases. These challenges hinder the curation of methodologically focused resources such as the Library of Guidance for Health Scientists (LIGHTS). Traditional screening approaches, such as Boolean search strategies and manual abstract screening, are inefficient and resource-intensive, limiting the feasibility of regularly updating LIGHTS. Machine learning (ML), particularly active learning models, presents a promising solution to improve the efficiency of article screening.

Objectives

This study evaluates the performance of ASReview’s active learning feature in identifying relevant methods guidance articles using pre-labeled data in simulation mode.

Methods

Using a pre-labeled dataset composed of 1,500 methods guidance articles and 20,000 clinical studies, categorized as relevant or irrelevant, we trained and compared multiple simulation models in ASReview using various classifiers and feature extraction methods. These included combinations of Support Vector Machine (SVM), Naïve Bayes (NB), and Neural Network classifiers with Sentence BERT (sBERT), Doc2Vec, and TF-IDF feature extraction. Model performance was evaluated based on screening burden, recall, Work Saved over Sampling (WSS), and precision. All model combinations used maximum query and dynamic double sampling settings.
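The simulation workflow described here follows the standard active learning loop: train on the labeled records, score the unlabeled pool, query the record the model deems most likely relevant ("maximum query" / certainty-based sampling), reveal its label, and repeat. The following is a toy, pure-Python sketch of that loop, not ASReview's implementation: it substitutes a simple TF-IDF nearest-centroid scorer for the SVM classifier, and the example documents, seed indices, and helper names are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each document as a sparse TF-IDF dict (toy stand-in for a real vectorizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return c

def simulate(docs, labels, seeds):
    """Certainty-based active learning simulation on pre-labeled data.

    Returns (screening_burden, screened_indices). Labels are revealed one
    at a time, mimicking ASReview's simulation mode; screening stops once
    every relevant record has been found.
    """
    vecs = tfidf_vectors(docs)
    screened = list(seeds)
    pool = [i for i in range(len(docs)) if i not in screened]
    total_relevant = sum(labels)
    found = sum(labels[i] for i in screened)
    while pool and found < total_relevant:
        c_rel = centroid([vecs[i] for i in screened if labels[i] == 1])
        c_irr = centroid([vecs[i] for i in screened if labels[i] == 0])
        # Maximum query: screen the record scored most likely relevant next.
        pool.sort(key=lambda i: cosine(vecs[i], c_rel) - cosine(vecs[i], c_irr),
                  reverse=True)
        nxt = pool.pop(0)
        screened.append(nxt)
        found += labels[nxt]
    return len(screened) / len(docs), screened

# Illustrative corpus: 4 "methods guidance" records (label 1), 6 clinical studies (label 0).
docs = [
    "guidance on sample size calculation for clinical prediction models",
    "reporting guidance for systematic reviews and meta analysis methods",
    "guidance on handling missing data in clinical research methods",
    "methods guidance for developing core outcome sets reporting",
    "randomized trial of aspirin in patients with cardiovascular disease",
    "cohort study of statin therapy outcomes in elderly patients",
    "phase two trial of chemotherapy in lung cancer patients",
    "observational study of diabetes treatment in primary care patients",
    "randomized placebo controlled trial of vaccine efficacy in adults",
    "case control study of smoking and cancer risk in patients",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
burden, screened = simulate(docs, labels, seeds=[0, 4])
```

Because the model queries likely-relevant records first, all relevant records are found well before the whole corpus is screened, which is exactly what the screening-burden and WSS metrics below quantify at scale.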

Results

At 95-99.5% recall, SVM with TF-IDF required the fewest screened records (6.87-7.66% burden), while SVM with Doc2Vec achieved the best overall performance at 100% recall with only 11.47% screening burden (WSS@100 = 88.5%) in 42 minutes. Models using sBERT for feature extraction performed comparably through 99.5% recall but exhibited severe performance degradation at 100% recall, requiring screening of over 65% of the corpus.
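The WSS figure follows directly from the reported screening burden. Under the usual definition, WSS@R = (TN + FN)/N − (1 − R); at 100% recall every unscreened record is a true negative, so WSS@100 reduces to the fraction of records never screened. A minimal check, assuming the corpus size of 21,500 records (1,500 guidance articles plus 20,000 clinical studies) from the Methods:

```python
def wss(n_total, n_screened, recall):
    # WSS@R = (TN + FN)/N - (1 - R). When screening stops at recall R,
    # the unscreened records are exactly the TN + FN records.
    return (n_total - n_screened) / n_total - (1 - recall)

n = 21500                      # 1,500 guidance articles + 20,000 clinical studies
screened = round(0.1147 * n)   # 11.47% screening burden at 100% recall
print(round(wss(n, screened, 1.0) * 100, 1))  # prints 88.5
```

This reproduces the abstract's WSS@100 = 88.5% for the SVM + Doc2Vec model.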

Conclusion

Classical feature extraction methods (TF-IDF and Doc2Vec) paired with SVM outperformed deep learning embedding methods. In this controlled setting, ASReview is a feasible tool for screening methodological literature. Future work should include prospective, human-in-the-loop experiments that evaluate the Doc2Vec-based SVM pipeline against human screening.
