Explainable and Adversarial Robust Deep Learning for Malware Campaigns Forensic Attribution


Abstract

This research paper proposes a Random Forest-based machine learning framework for malware attribution in digital forensics. The model takes malware samples as input and classifies them into attribution labels such as APT1, APT28, APT33, and CyberGangX. The features include File_Size_KB, Num_Functions, Num_Imports, and Entropy. The model was trained on a dataset of 5,000 samples, whose features span a wide range of values: File_Size_KB varies from 11 KB to 4,998 KB, and Entropy varies from 1.5 to 8. The model's accuracy was 19.7%, with precision, recall, and F1-score averaging around 20% across attribution classes. According to the Random Forest feature importance plot, the key features were File_Size_KB, Num_Functions, and Entropy. Methodologically, the model was trained with a standard 80/20 train/test split; features were normalized with StandardScaler, and categorical labels were converted to numerical values with LabelEncoder. A decline in performance was observed during adversarial robustness testing, with the F1-score dropping from 20% on clean data to 15% on adversarially perturbed data. The model also suffers from class imbalance, which biases its predictions toward the most represented classes, such as CyberGangX and Unknown. While the model performed better on some classes (e.g., APT1), it had low precision and recall on many others. The authors note a further robustness challenge: the model can be fooled by small perturbations. In summary, while the proposed model can strengthen existing malware attribution processes, its scalability, performance, and adversarial defenses need improvement. Future work should focus on hyperparameter tuning, stronger model selection, and adversarial training to improve robustness.
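The described pipeline (80/20 split, StandardScaler normalization, LabelEncoder for labels, Random Forest classifier) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic feature values, class labels, and default hyperparameters are assumptions standing in for the unavailable dataset.

```python
# Minimal sketch of the described attribution pipeline.
# Synthetic data stands in for the paper's 5,000-sample dataset (assumption).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 5000
# Synthetic stand-ins for the four described features.
X = np.column_stack([
    rng.uniform(11, 4998, n),   # File_Size_KB
    rng.integers(1, 500, n),    # Num_Functions
    rng.integers(1, 200, n),    # Num_Imports
    rng.uniform(1.5, 8.0, n),   # Entropy
])
labels = rng.choice(["APT1", "APT28", "APT33", "CyberGangX", "Unknown"], n)

# Categorical labels -> integers, then a standard 80/20 split.
y = LabelEncoder().fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Normalize features, fit the Random Forest, evaluate macro F1.
scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(random_state=0)
clf.fit(scaler.transform(X_train), y_train)
pred = clf.predict(scaler.transform(X_test))
print(f"macro F1: {f1_score(y_test, pred, average='macro'):.3f}")
```

On features that carry no real signal, as in this synthetic stand-in, scores near chance (~20% for five classes) are expected, which mirrors the reported performance.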
Further, feature engineering and the inclusion of network traffic data could enhance model performance by increasing accuracy and enabling the classification of more complex malware. The findings indicate the importance of developing predictive models that are both accurate and interpretable, which will help cybersecurity professionals and law enforcement agencies in the field of digital forensics.
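The adversarial robustness test mentioned in the abstract can be approximated with a simple perturbation check: compare F1 on clean test features against F1 on features shifted by bounded random noise. The paper does not specify its attack, so the uniform-noise perturbation and the epsilon budget below are assumptions for illustration.

```python
# Hedged sketch of a robustness check: small bounded noise on
# standardized test features, then compare clean vs. perturbed F1.
# The noise model and epsilon are assumptions, not the paper's attack.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))          # synthetic stand-in features
y = rng.integers(0, 5, 5000)            # five attribution classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = StandardScaler().fit(X_tr)
clf = RandomForestClassifier(random_state=1).fit(scaler.transform(X_tr), y_tr)

X_te_s = scaler.transform(X_te)
eps = 0.1  # perturbation budget in standardized units (assumed)
X_adv = X_te_s + rng.uniform(-eps, eps, X_te_s.shape)

clean_f1 = f1_score(y_te, clf.predict(X_te_s), average="macro")
adv_f1 = f1_score(y_te, clf.predict(X_adv), average="macro")
print(f"clean F1={clean_f1:.3f}  perturbed F1={adv_f1:.3f}")
```

A drop between the two scores, like the reported 20% to 15% decline, indicates sensitivity to small input perturbations; adversarial training or smoothing-based defenses are the usual responses.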
