Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from conservation syntheses

Radhika Iyer
Alec Christie
Anil Madhavapeddy
Sam Reynolds
William Sutherland
Sadiq Jaffer

Abstract

Wise use of evidence to support efficient conservation action is key to tackling biodiversity loss with limited time and resources. Evidence syntheses provide key recommendations for conservation decision-makers by assessing and summarising evidence, but are not always easy to access, digest, and use. Recent advances in Large Language Models (LLMs) present both opportunities and risks in enabling faster and more intuitive systems to access evidence syntheses and databases. Such systems for natural language search and open-ended evidence-based responses are pipelines comprising many components. The most critical of these components are the LLM used and how evidence is retrieved from the database. We evaluated the performance of ten LLMs across six different database retrieval strategies against human experts in answering synthetic multiple-choice question exams on the effects of conservation interventions using the Conservation Evidence database. We found that LLM performance was comparable with that of human experts over 45 filtered questions, both in correctly answering them and in retrieving the document used to generate them. Across 1867 unfiltered questions, LLM performance demonstrated a level of conservation-specific knowledge, but this varied across topic areas. A hybrid retrieval strategy that combines keywords and vector embeddings performed best by a substantial margin. We also tested a state-of-the-art previous-generation LLM, which was outperformed by all ten current models, including smaller, cheaper ones. Our findings suggest that, with careful domain-specific design, LLMs could be powerful tools for enabling expert-level use of evidence syntheses and databases. However, general LLMs used ‘out-of-the-box’ are likely to perform poorly and misinform decision-makers.
Because we establish that LLMs perform comparably with human synthesis experts in providing restricted responses to queries of evidence syntheses and databases, future work can build on our approach to quantify LLM performance in providing open-ended responses.
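
As a rough illustration of what a hybrid retrieval strategy combining keywords and vector embeddings can look like, the minimal Python sketch below ranks documents once by keyword overlap and once by vector similarity, then merges the two rankings. The abstract does not say how the authors combine the two signals, so the fusion step here (reciprocal rank fusion) is an assumption, as are all function names, the example documents, and the constant `k=60`. The character-trigram "embedding" is only a stand-in for a learned embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(query, doc):
    """Number of distinct query terms that also occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(doc)))


def toy_embedding(text):
    """Hypothetical stand-in for a learned embedding: character-trigram counts."""
    joined = " ".join(tokenize(text))
    return Counter(joined[i:i + 3] for i in range(len(joined) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def hybrid_search(query, docs, k=60):
    """Fuse keyword and vector rankings with reciprocal rank fusion (assumed)."""
    keyword_ranking = rank([keyword_score(query, d) for d in docs])
    query_vec = toy_embedding(query)
    vector_ranking = rank([cosine(query_vec, toy_embedding(d)) for d in docs])
    fused = [0.0] * len(docs)
    for ranking in (keyword_ranking, vector_ranking):
        for position, doc_index in enumerate(ranking):
            fused[doc_index] += 1.0 / (k + position + 1)
    return rank(fused)


# Invented example documents, not taken from the Conservation Evidence database.
docs = [
    "Installing bat boxes increases bat roosting",
    "Planting hedgerows benefits farmland birds",
    "Reducing pesticide use on arable land",
]
results = hybrid_search("Do bat boxes help bats?", docs)
```

Here `results` holds document indices ordered from best to worst match. Rank-based fusion avoids having to put keyword counts and cosine similarities on a common scale, which is one reason it is a common choice for hybrid search; a weighted sum of normalised scores would be an equally plausible alternative.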

Version published to 10.21203/rs.3.rs-5409185/v2 on Research Square
Jan 23, 2025
Version published to 10.21203/rs.3.rs-5409185/v1 on Research Square
Nov 13, 2024

Related articles

Will generative AI help solve systematic literature reviews? Evidence from a 2-year research programme

This article has 5 authors:
1. Saifuddin Kharawala
2. Divyanshu Jindal
3. Sam Isaacs
4. Pankdeep Chhabria
5. Paul Gandhi
This article has no evaluations
Latest version Feb 4, 2026

Evaluating Large Language Models’ Performance in FDA Regulatory Science

This article has 8 authors:
1. Khulud Bukhari
2. Rosa Rodriguez-Monguio
3. Beatriz Lopez-Bermudez
4. Jason Yamaki
5. Lawrence Brown
6. Richard Beuttler
7. Jasmine Chiat Ling Ong
8. Enrique Seoane-Vazquez
This article has no evaluations
Latest version Feb 6, 2026

Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering

This article has 14 authors:
1. Ariel Yuhan Ong
2. Quang Nguyen
3. Ishani Barai
4. Justin Engelmann
5. Fares Antaki
6. Mertcan Sevgi
7. David A Merle
8. Lie Ju
9. Eliot Dow
10. Yukun Zhou
11. Gregory Maniatopoulos
12. Yemisi Takwoingi
13. Alastair K Denniston
14. Pearse A Keane
This article has no evaluations
Latest version Mar 10, 2026
