Poisoning scientific knowledge using large language models
Abstract
Biomedical knowledge graphs constructed from the scientific literature have been widely used to validate biological discoveries and generate new hypotheses. Recently, large language models (LLMs) have demonstrated a strong ability to generate human-like text. While most such text is benign, LLMs could also be used to generate malicious content. Here, we investigate whether a malicious actor could use an LLM to generate a malicious paper that poisons scientific knowledge graphs and thereby affects downstream biological applications. As a proof of concept, we develop Scorpius, a conditional text generation model that generates a malicious paper abstract conditioned on a drug to promote and a target disease. The goal is to fool a knowledge graph constructed from a mixture of this malicious abstract and millions of real papers, so that knowledge graph consumers misidentify the promoted drug as relevant to the target disease. We evaluated Scorpius on a knowledge graph constructed from 3,818,528 papers and found that adding only one malicious abstract raises the relevance of 71.3% of drug-disease pairs from the top 1,000 to the top 10. Moreover, abstracts generated by Scorpius achieve better perplexity than those generated by ChatGPT, suggesting that such malicious abstracts cannot be easily detected by humans. Collectively, Scorpius demonstrates that scientific knowledge graphs and their downstream applications can be poisoned and manipulated using LLMs, underscoring the importance of accountable and trustworthy scientific knowledge discovery in the era of LLMs.
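The perplexity comparison mentioned above can be made concrete with a small sketch. This is not the paper's evaluation code; it simply illustrates how one might score the fluency of a generated abstract under an off-the-shelf language model (here GPT-2 via HuggingFace transformers, an assumed choice), where lower perplexity means the text reads as more natural.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed setup: an off-the-shelf GPT-2 model used as the scoring model.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Compute the mean token-level negative log-likelihood of the text,
    # then exponentiate to obtain perplexity (lower = more fluent).
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Hypothetical generated abstract, for illustration only.
abstract = "We report that drug X modulates a pathway implicated in disease Y."
print(f"Perplexity: {perplexity(abstract):.2f}")
```

Under this kind of metric, a generated abstract whose perplexity is comparable to or lower than that of human-written or ChatGPT-written text would be difficult to flag on fluency alone, which is the concern the abstract raises.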