How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models

Abstract

Large language models (LLMs) such as GPT-4 have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles—ranging from zero-shot to few-shot learning—and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall screening efficiency, and discuss metrics (accuracy, precision, recall, F1 score) to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage screening pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying systematic prompt engineering and rigorous evaluation, researchers can optimize LLM-based screening processes, allowing for faster and more comprehensive evidence synthesis across biomedical disciplines.
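The soft-versus-strict distinction can be made concrete with a small prompt-building sketch. The Python snippet below is a minimal illustration, not the authors' implementation: the PICO criteria are hypothetical placeholders loosely modeled on the periodontics case study, and `build_prompt` is an assumed helper name.

```python
# Minimal sketch: turning PICO criteria into "soft" or "strict" screening prompts.
# The criteria below are illustrative placeholders, not the paper's actual criteria.
PICO = {
    "Population":   "adults diagnosed with chronic periodontitis",
    "Intervention": "non-surgical periodontal therapy",
    "Comparison":   "any control or alternative treatment",
    "Outcome":      "change in clinical attachment level or probing depth",
}

def build_prompt(abstract: str, strict: bool) -> str:
    """Assemble a screening prompt for one abstract.

    strict=True  -> demand explicit evidence for every criterion (favors precision).
    strict=False -> include unless a criterion explicitly fails (favors recall).
    """
    criteria = "\n".join(f"- {name}: {value}" for name, value in PICO.items())
    if strict:
        rule = ("Answer INCLUDE only if the abstract provides explicit evidence "
                "for EVERY criterion above. Otherwise answer EXCLUDE.")
    else:
        rule = ("Answer INCLUDE unless the abstract explicitly fails at least one "
                "criterion above. When in doubt, answer INCLUDE.")
    return ("You are screening abstracts for a systematic review.\n"
            f"Inclusion criteria (PICO):\n{criteria}\n\n"
            f"{rule}\n\nAbstract:\n{abstract}\n\n"
            "Respond with a single word: INCLUDE or EXCLUDE.")
```

The soft variant deliberately errs toward inclusion, since a missed eligible study is usually costlier in evidence synthesis than an extra abstract passed on to human review; the strict variant reverses that trade-off.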
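The evaluation metrics named in the abstract are the standard confusion-matrix quantities, with INCLUDE treated as the positive class. As a quick reference, here is a self-contained sketch; the function name and calling convention are assumptions for illustration, not from the paper.

```python
def screening_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary screening decisions.

    y_true: reviewer gold-standard labels (True = include).
    y_pred: LLM decisions for the same abstracts, in the same order.
    Assumes non-empty, equal-length inputs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # correctly included
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly excluded
    fp = sum(1 for t, p in pairs if not t and p)      # wrongly included
    fn = sum(1 for t, p in pairs if t and not p)      # wrongly excluded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}
```

For screening, recall is typically the headline number: a false negative removes a study from the review entirely, whereas a false positive only costs a human reviewer a few extra minutes.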
