Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection
Abstract
AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and training-data quality on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding newly reported vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models, and find that encoder-based models outperform the others in accuracy and generalization. Our model achieves a 6.8% improvement in recall on the BigVul benchmark dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems, and our findings suggest a direction for future systems with high cross-project effectiveness.