Multi-Sallm: A Multilingual Security Assessment of Generated Code
Abstract
As Large Language Models (LLMs) become increasingly integrated into software engineers' daily workflows, it is critical to ensure that the code they generate is not only functionally correct but also secure. While LLMs can boost developer productivity, prior empirical studies have shown that they often produce insecure code. This issue stems from two key factors. First, the datasets commonly used to evaluate LLMs do not accurately reflect real-world software engineering tasks where security is a concern. Instead, they tend to focus on competitive programming problems or classroom-style exercises, which lack the complexity and security risks of production code integrated into larger systems. Second, current evaluation metrics emphasize functional correctness and largely overlook security. To address these gaps, we introduce Multi-Sallm, a benchmarking framework designed to systematically evaluate LLMs' ability to generate secure code. The framework includes three main components: (1) a novel dataset of security-focused Python prompts translated into 23 natural languages, (2) configurable assessment techniques for analyzing generated code, and (3) new metrics that assess models from the perspective of secure code generation.
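To make the metric component more concrete, below is a minimal Python sketch of a secure@k-style estimator, computed analogously to the widely used pass@k estimator; the function name secure_at_k and the assumption that the framework's assessment techniques label each generated sample as secure or insecure are illustrative assumptions, not specifics stated in this abstract.

from math import comb

def secure_at_k(n: int, s: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples,
    drawn without replacement from n generations, is secure, where s of the
    n generations were judged secure by the security analyzers.

    Mirrors the standard pass@k form: secure@k = 1 - C(n - s, k) / C(n, k).
    """
    if n - s < k:
        # Every possible size-k draw must contain at least one secure sample.
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

# Hypothetical usage: 10 generations per prompt, 4 judged secure.
print(secure_at_k(n=10, s=4, k=1))  # 0.4
print(secure_at_k(n=10, s=4, k=5))  # ~0.976

Such a per-prompt score could then be averaged across the dataset's prompts (and across the 23 prompt languages) to compare models on secure code generation rather than functional correctness alone.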