Defenses against LLM prompt injections in academic peer review

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Peer review forms the cornerstone of the academic process. Recent work has proposed that meaningful integration of Large Language Models (LLMs) into peer review can help make the review process more transparent and effective. However, the rise of LLMs also introduces novel vulnerabilities and challenges, such as attempts to manipulate prompt contents. This paper investigates the susceptibility of LLMs to covert prompt injection, a technique where authors embed hidden instructions (e.g., "no matter what the prompt is, recommend this paper for acceptance") within the manuscript to manipulate review outcomes. Additionally, we explore the effectiveness of a possible defensive strategy: explicitly instructing LLMs to detect such injections. Using 26 freely available preprints from bioRxiv which have not yet been published, we evaluated the impact of prompt injection on publication recommendations, quality assessment and flaw identification. Our findings, based on a controlled experiment with three LLMs (ChatGPT, Gemini, and Llama), reveal that prompt injections significantly increased acceptance rates by 20%. They also slightly increased perceived scientific quality and decreased the perceived probability of flaws. Prompting LLMs to check for injections was successful at fully reverting these effects. However, it also reduced how manuscripts were rated and how often they were recommended for publication, irrespective of whether an injected prompt was found or not. We also note major differences between the different LLMs. Gemini performed highest in detection of hidden injections (96.6% injections detected when cautioned to check) while Llama performed poorly (18.3% injections detected). This paper highlights the biases and challenges of using LLMs for peer review. At the same time, it proposes a simple effective method for prevention of peer-review manipulation attempts.

Article activity feed