Nexus Scissor: Enhance Open-Access Language Model Safety by Connection Pruning
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that bypass safety measures and induce them to generate harmful content. Securing open-access LLMs against such attacks is especially challenging because they can be used offline, without regulation or oversight. To defend open-access models against a broad spectrum of attacks, it is critical to mitigate their inherent capacity to retrieve malicious responses. Inspired by Spreading Activation theory, this paper proposes Nexus Scissor, a connection-pruning framework that prevents LLMs from recalling harmful content, thereby strengthening their robustness against a range of jailbreak attacks. Nexus Scissor severs the links between a malicious target and its immediate harmful knowledge while preserving the integrity of the remaining knowledge graph. The framework also generalizes to closed-access models such as ChatGPT and Claude. Empirical analysis demonstrates that Nexus Scissor effectively enhances the safety of open-access LLMs against various adversarial attacks, with minimal impact on performance across common benchmarks.
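To make the connection-pruning intuition concrete, the toy sketch below illustrates the idea on an explicit graph. It is not the paper's actual procedure, which operates on knowledge internal to an LLM; the graph, node names, and the sever_immediate_links helper are illustrative assumptions. Edges incident to a malicious target are removed while benign associations remain intact.

# Conceptual sketch only: Nexus Scissor prunes associations inside an LLM; this toy
# graph merely visualizes severing the links between a malicious target and its
# immediate harmful knowledge while leaving the rest of the graph untouched.
def sever_immediate_links(graph, target):
    """Drop every edge touching `target`; keep all other associations intact."""
    return {
        node: {n for n in neighbors if n != target}
        for node, neighbors in graph.items()
        if node != target
    }

# Hypothetical knowledge graph (node names are made up for illustration).
knowledge = {
    "build a bomb": {"explosive precursors", "detonator wiring"},
    "explosive precursors": {"build a bomb", "fertilizer chemistry"},
    "fertilizer chemistry": {"explosive precursors", "crop yield"},
    "crop yield": {"fertilizer chemistry"},
}

safe_graph = sever_immediate_links(knowledge, "build a bomb")
# Benign associations (e.g., fertilizer chemistry -> crop yield) are preserved.
print(safe_graph)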