Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection


Abstract

AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and training-data quality on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding newly disclosed vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models, and find that encoder-based models outperform the other models in terms of accuracy and generalization. Our model achieves a 6.8% improvement in recall on the benchmark BigVul dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems, and our findings suggest a direction for future systems with high cross-project effectiveness.