Prompt-driven large language model for evidence extraction in implementation science: exploring semantic similarity between manual and automated coding across studies
Abstract
Background
Despite rapid scientific advancements that generate clinical interventions to optimize disease prevention, diagnostics, and treatment, significant gaps remain between evidence and clinical adoption. Implementation science evaluates targeted strategies that are systematically designed to promote the adoption, integration, and sustainment of evidence-based interventions, programs, and policies in real-world settings. However, identifying and coding the factors associated with implementation success reported in the literature is time-consuming and labor-intensive. Large language models (LLMs) offer the potential to automate and scale this process. We aimed to evaluate the potential of LLMs to accurately extract determinants of implementation in oncology, aligned with a well-recognized implementation framework.

Methods
This validation study compared the performance of a Llama-based LLM against manual extraction of barriers and facilitators from twelve published studies, comprising oncology-focused implementation studies (containing structured implementation data) and published clinical intervention studies (containing unstructured implementation data). An ontology-based schema guided data structuring, and three prompting strategies were evaluated: zero-shot (task instructions only), few-shot (task instructions plus illustrative examples), and chain-of-thought (CoT; the model is instructed to generate step-by-step reasoning). Semantic similarity between LLM-generated and human-coded outputs was assessed by both human raters and an automated rater based on a sentence transformer model.

Results
A total of 144 extraction outputs were generated. For implementation science studies, zero-shot prompting achieved the highest average semantic similarity for both barriers and facilitators, outperforming few-shot and CoT prompting. For clinical intervention studies, few-shot prompting performed best, with average similarity scores of 0.78 (barriers) and 0.76 (facilitators) by sentence transformer and 0.68 (barriers) and 0.67 (facilitators) by human rating, surpassing zero-shot and CoT. Across all prompting approaches, human ratings were slightly lower than automated ratings, reflecting a more conservative scoring pattern rather than differences in performance quality. Overall, the LLM reduced annotation time by approximately 75–95%, decreasing per-article coding from several hours to under 20 minutes.

Conclusion
LLMs can support the extraction of implementation science determinants, with the optimal prompting strategy varying by study type. Zero-shot and few-shot prompting showed modest but inconsistent performance differences, suggesting context-dependent utility. CoT prompting, in contrast, did not yield consistent benefits, highlighting the need for broader model comparisons, improved prompt design, and larger gold-standard datasets.
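To make the three prompting strategies concrete, the sketch below shows how zero-shot, few-shot, and chain-of-thought prompts for barrier/facilitator extraction might be composed. The wording, the example passage, and the build_prompt helper are hypothetical illustrations and are not the prompts used in the study.

```python
# Illustrative prompt templates for the three strategies described in the
# abstract. The instructions, example text, and output schema below are
# hypothetical; the study's actual prompts are not reproduced here.

ZERO_SHOT = (
    "Extract all implementation barriers and facilitators reported in the "
    "study text below. Return a JSON object with keys 'barriers' and "
    "'facilitators'."
)

FEW_SHOT = (
    "Extract all implementation barriers and facilitators from the study "
    "text below. Return a JSON object with keys 'barriers' and 'facilitators'.\n\n"
    "Example input: 'Clinicians reported limited time during visits, but "
    "leadership support helped sustain the program.'\n"
    'Example output: {"barriers": ["limited clinician time"], '
    '"facilitators": ["leadership support"]}'
)

CHAIN_OF_THOUGHT = (
    "Extract all implementation barriers and facilitators from the study "
    "text below. First, reason step by step: identify each passage that "
    "describes a factor hindering or helping implementation, then decide "
    "whether it is a barrier or a facilitator. Finally, return a JSON "
    "object with keys 'barriers' and 'facilitators'."
)


def build_prompt(strategy: str, article_text: str) -> str:
    """Combine the chosen template with the article text (hypothetical helper)."""
    templates = {
        "zero_shot": ZERO_SHOT,
        "few_shot": FEW_SHOT,
        "cot": CHAIN_OF_THOUGHT,
    }
    return templates[strategy] + "\n\nStudy text:\n" + article_text
```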
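The automated semantic similarity rating can be approximated with the open-source sentence-transformers library, as in the minimal sketch below. The checkpoint name ("all-MiniLM-L6-v2"), the example barrier texts, and the best-match aggregation are assumptions for illustration, not the specific model or matching procedure reported in the paper.

```python
# Minimal sketch: score semantic similarity between LLM-extracted and
# human-coded determinants with a sentence transformer.
# Assumption: "all-MiniLM-L6-v2" stands in for whichever sentence
# transformer checkpoint the study actually used.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Invented example items, not data from the study.
llm_barriers = [
    "limited staff time for screening",
    "lack of reimbursement for navigation services",
]
human_barriers = [
    "insufficient clinician time",
    "no payment mechanism for patient navigation",
]

# Encode both sets and compute pairwise cosine similarity.
llm_emb = model.encode(llm_barriers, convert_to_tensor=True)
human_emb = model.encode(human_barriers, convert_to_tensor=True)
cosine_scores = util.cos_sim(llm_emb, human_emb)  # shape: (len(llm), len(human))

# One simple aggregation (an assumption, not necessarily the paper's method):
# for each human-coded barrier, take its best-matching LLM barrier and
# average those maxima into a single similarity score.
best_match_per_human = cosine_scores.max(dim=0).values
print(f"Average best-match similarity: {best_match_per_human.mean().item():.2f}")
```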