Automated Enhancements for Cross-Modal Safety Alignment in Open-Source Large Language Models
Abstract
Handling safety across multiple input modalities such as text, images, and audio has become a critical challenge in machine learning, particularly when models are deployed in environments that demand high reliability and security. Cross-modal safety alignment addresses the growing complexity of multi-modal systems through novel modifications that enhance a model’s ability to consistently detect and filter unsafe content. Architectural enhancements to LLaMA, including cross-modal embedding regularization, filtering mechanisms, and attention adjustments, significantly improved the model’s performance on safety metrics across various benchmarks. Empirical evaluations demonstrated substantial gains in recall, precision, and adversarial robustness, with marked reductions in false positive rates for unsafe content detection. The modifications also enabled the model to withstand adversarial attacks more effectively, increasing its resilience across diverse input types. These results emphasize the importance of refining cross-modal alignment in language models to ensure their safe deployment in real-world, safety-critical applications. Comprehensive evaluations, including ablation studies, underscore the significance of these enhancements for advancing model robustness and cross-modal safety.
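To make the idea of cross-modal embedding regularization concrete, the following is a minimal sketch of one way such a regularizer could be added to a training loss. It assumes PyTorch and paired, same-dimension text and image embeddings; the function name, cosine-based penalty, and weight are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def cross_modal_alignment_loss(text_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               weight: float = 0.1) -> torch.Tensor:
    """Penalize divergence between paired text and image embeddings.

    Both inputs are (batch, dim). Embeddings are L2-normalized so the
    penalty acts on direction rather than magnitude.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # 1 - cosine similarity for each aligned pair, averaged over the batch.
    alignment_penalty = (1.0 - (text_emb * image_emb).sum(dim=-1)).mean()
    return weight * alignment_penalty


# Hypothetical usage: add the regularizer to the usual task loss.
# total_loss = task_loss + cross_modal_alignment_loss(text_emb, image_emb)
```

The intent of a regularizer like this is to keep paired representations from different modalities close in embedding space, so that safety filters trained on one modality transfer more reliably to the others.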