Detection of patient metadata in published articles for genomic epidemiology using machine learning and large language models

Abstract

Objective

Patient metadata are reported in published articles but are often disconnected from the corresponding genome sequences in databases, which limits their utility for genomic epidemiology. The objective of this study was to develop and evaluate natural language processing methods to facilitate the large-scale detection of patient metadata associated with reports of genome sequencing in published articles, drawing on the case of SARS-CoV-2.

Methods

We applied filters to select a sample of 245 PubMed articles (50,918 sentences) in LitCovid for manual annotation of sentences that reported generating SARS-CoV-2 sequences. We trained, deployed, and validated a BERT-based classifier, and selected a sample of 150 predicted articles (22,147 sentences) for manual annotation of sentences that reported patient metadata associated with the sequences. In addition to training BERT-based classifiers, we experimented with a generative AI approach, prompting the Llama-3-70B LLM using zero-shot, role-based, few-shot, chain-of-thought, and reasoning-eliciting prompting.
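To illustrate how the prompting strategies named above can be combined for sentence-level classification, the following is a minimal sketch of prompt assembly. It is not the study's actual prompt: the function `build_prompt`, the task wording, the role text, and the example sentences are all hypothetical, and the call to the LLM itself is omitted.

```python
# Hypothetical sketch: assembling a sentence-classification prompt from
# optional role, few-shot, and chain-of-thought components.
# None of the wording below is taken from the study.

TASK = (
    "Task: decide whether the sentence reports patient metadata "
    "associated with SARS-CoV-2 genome sequences. Answer 'yes' or 'no'."
)


def build_prompt(sentence, role=None, examples=(), chain_of_thought=False):
    """Return a prompt string; with no options set this is zero-shot."""
    parts = []
    if role:
        # Role-based prompting: state the persona the model should adopt.
        parts.append(role)
    parts.append(TASK)
    # Few-shot prompting: prepend labelled example sentences.
    for text, label in examples:
        parts.append(f"Sentence: {text}\nAnswer: {label}")
    if chain_of_thought:
        # Chain-of-thought prompting: ask for intermediate reasoning.
        parts.append("Think step by step before giving the final answer.")
    parts.append(f"Sentence: {sentence}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "Samples were collected from a 45-year-old male traveller.",
        role="You are a genomic epidemiologist.",
        examples=[("The genome was 29,903 nucleotides long.", "no")],
        chain_of_thought=True,
    )
    print(prompt)
```

With all options left at their defaults the function produces a zero-shot prompt, so the same helper covers each strategy and their combinations.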

Results

BERT-based models that were pre-trained on corpora in biomedical or, more specifically, COVID-19 domains outperformed those that were pre-trained on corpora in general domains for detecting reports of patient metadata associated with SARS-CoV-2 sequences, achieving the best performance with a classifier based on a BiomedBERT-Large-Abstract model (F1-score = 0.776). While the best performance of our generative AI approach was achieved using role-based, few-shot, and chain-of-thought prompting (F1-score = 0.558), it was nonetheless outperformed by all of our machine learning-based classifiers.
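For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall for the positive class. The sketch below shows the standard calculation; the function name `f1_score` and the counts in the example are illustrative assumptions and are not results from the study.

```python
# Standard F1 calculation from confusion-matrix counts.
# The counts used in the example are made up for illustration only.

def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; 0.0 if no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical classifier output: 8 correct detections,
    # 2 false alarms, 4 missed sentences.
    print(round(f1_score(8, 2, 4), 3))
```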

Conclusion

Our methods were applied to more than 350,000 published articles and can be used to advance the utility and efficiency of genomic epidemiology for public health responses to virus outbreaks.
