Nexus Scissor: Enhance Open-Access Language Model Safety by Connection Pruning
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that bypass safety measures and induce them to generate harmful content. Securing open-access LLMs against such attacks is especially challenging because they can be used offline, without regulation or oversight. To defend open-access models against a broad spectrum of attacks, it is critical to mitigate their inherent capacity to retrieve malicious responses. Inspired by Spreading Activation theory, this paper proposes Nexus Scissor, a connection-pruning framework that prevents LLMs from recalling harmful content, thereby strengthening their robustness against a range of jailbreak attacks. Nexus Scissor severs the links between a malicious target and its immediate harmful knowledge while preserving the integrity of the remaining knowledge graph. The framework also generalizes to closed-access models such as ChatGPT and Claude. Empirical analysis demonstrates that Nexus Scissor effectively enhances the safety of open-access LLMs against various adversarial attacks, with minimal impact on performance across common benchmarks.
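To make the connection-pruning intuition concrete, the toy sketch below illustrates the idea on an explicit graph. It is not the paper's actual procedure, which operates on knowledge internal to an LLM; the graph, node names, and the sever_immediate_links helper are illustrative assumptions. Edges incident to a malicious target are removed while benign associations remain intact.

# Conceptual sketch only: Nexus Scissor prunes associations inside an LLM; this toy
# graph merely visualizes severing the links between a malicious target and its
# immediate harmful knowledge while leaving the rest of the graph untouched.
def sever_immediate_links(graph, target):
    """Drop every edge touching `target`; keep all other associations intact."""
    return {
        node: {n for n in neighbors if n != target}
        for node, neighbors in graph.items()
        if node != target
    }

# Hypothetical knowledge graph (node names are made up for illustration).
knowledge = {
    "build a bomb": {"explosive precursors", "detonator wiring"},
    "explosive precursors": {"build a bomb", "fertilizer chemistry"},
    "fertilizer chemistry": {"explosive precursors", "crop yield"},
    "crop yield": {"fertilizer chemistry"},
}

safe_graph = sever_immediate_links(knowledge, "build a bomb")
# Benign associations (e.g., fertilizer chemistry -> crop yield) are preserved.
print(safe_graph)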