BadInterpreter: Backdoor Attack on LLM-based Interpretable Recommendation

Bing Wang
Jing Fang
Shengsheng Qian

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Large Language Models (LLMs) has promoted miscellaneous models and downstream applications, driving the progress of LLM agents by enhancing their ability to comprehend and generate interpretable reasoning. Recently, the security of LLM agents has become an increasingly popular research topic, where backdoor attacks show potential devastation by injecting a covert backdoor to manipulate the output. Our findings show that LLM agents fine-tuned for recommendation tasks are particularly vulnerable to the embedding of imperceptible backdoors, even when recommendation explanations are required. We introduce BadInterpreter, a simple yet effective backdoor attack for LLM-based interpretable recommendation systems, enabling attackers to manipulate product recommendations and explanations without altering ground-truth labels. In interpretable recommendation, LLM agents are asked to provide explanations for product recommendations to meet user needs. We propose a novel LLM-based pipeline to construct poisoned fine-tuning data, where the agent is expected to recommend the target product with rational recommendation explanations. Attacked by our BadInterpreter, LLM agents prioritize recommending the target products whose information contains attacker-designed triggers in a dynamic interactive environment, along with convincing explanations. Our attack consistently achieves robust attack success rates exceeding 94% on two benchmark e-shopping datasets with four distinct LLMs. While backdoor attacks represent a well-explored threat in natural language processing models, their application and impact within the specific context of LLM-based interpretable recommendation systems remain largely uncharted territory. To our knowledge, this study pioneers the investigation of such vulnerabilities in this critical domain. Our work reveals that constructing LLM-based recommendation systems on untrusted LLMs poses a severe threat.

Version published to 10.21203/rs.3.rs-8024735/v1 on Research Square
Dec 12, 2025

From Generation to Detection: Leveraging Empirically Derived Linguistic Hints for LLM-Based Fake News Detection

This article has 1 author:
1. Piyush Ghasiya
This article has no evaluationsLatest version Jan 28, 2026
Pruning and Malicious Injection: A Retraining Free Backdoor Attack on Transformer Models

This article has 6 authors:
1. Taibiao Zhao
2. Mingxuan Sun
3. Hao Wang
4. Xiaobing Chen
5. Xiangwei Zhou
6. Xugui Zhou
This article has no evaluationsLatest version Jan 8, 2026
Alignment Via Interpretability: Layerwise Counterfactuals To Detect Maladaptive Llm Behaviors

This article has 2 authors:
1. Nnaemeka Kingsley Ugwumba
2. Juan Sebastian Murillejo Contreras
This article has no evaluationsLatest version Jan 29, 2026

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

From Generation to Detection: Leveraging Empirically Derived Linguistic Hints for LLM-Based Fake News Detection

Pruning and Malicious Injection: A Retraining Free Backdoor Attack on Transformer Models

Alignment Via Interpretability: Layerwise Counterfactuals To Detect Maladaptive Llm Behaviors