Enhancing malware detection reliability in non-executable files using confidence score prediction
Abstract
Malware attacks targeting widely used non-executable formats, namely Microsoft Office and PDF files, have become a prevalent threat. These files, which encompass a broad spectrum of data types, are classified as complex files. Existing malware detection models lack transparency: they provide only binary labels, without confidence scores. Incorporating a confidence score enhances interpretability and detection accuracy. This article proposes a learning-based malware detection approach with two complementary parts. The first part develops binary classifiers, trained on an enriched dataset of related files with an extended feature set, to achieve high accuracy. The second part employs regression models to assign a confidence score to each sample. A reliability score is assigned to each of the antivirus engines used, so that samples can be labeled accurately with confidence scores. On completion of the detection process, each sample receives a pair (x, y), where x is the binary classifier output and y is the regressor output, that is, the confidence score. Our findings demonstrate an improvement over existing malware detection classifiers of approximately 2.44% for PDF files and 2.27% for MS Office files. Using the confidence score together with binary classification raises detection accuracy to 99.74% for PDF files and 99.77% for MS Office files.
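The shape of the output pair, and one plausible way to derive a confidence label from antivirus verdicts, can be sketched as follows. This is an illustration only: the abstract does not give the labeling formula, so the reliability-weighted average below is an assumption, and the names `confidence_label`, `detection_pair`, and the engine names and reliability values are hypothetical.

```python
from typing import Dict, Tuple


def confidence_label(verdicts: Dict[str, bool],
                     reliability: Dict[str, float]) -> float:
    """Reliability-weighted share of engines that flag a sample.

    ASSUMPTION: a weighted average is used here for illustration;
    the article's actual labeling formula is not stated in the abstract.
    """
    total = sum(reliability[name] for name in verdicts)
    if total <= 0:
        raise ValueError("total reliability must be positive")
    flagged = sum(reliability[name] for name, hit in verdicts.items() if hit)
    return flagged / total


def detection_pair(classifier_output: int,
                   regressor_output: float) -> Tuple[int, float]:
    """Build the (x, y) pair: x is the binary label, y the confidence score.

    The regressor output is clamped to [0, 1] (an assumption about the
    score range, not something the abstract specifies).
    """
    if classifier_output not in (0, 1):
        raise ValueError("classifier output must be 0 or 1")
    y = min(1.0, max(0.0, regressor_output))
    return (classifier_output, y)


# Hypothetical engines and reliability scores, for illustration only.
reliability = {"av_a": 0.9, "av_b": 0.6, "av_c": 0.5}
verdicts = {"av_a": True, "av_b": True, "av_c": False}

label = confidence_label(verdicts, reliability)  # training target for a regressor
pair = detection_pair(1, label)                  # stand-in for real model outputs
```

In this sketch the weighted label would serve as the regression target during training, while at detection time both x and y would come from the trained models rather than from the antivirus verdicts.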