A Comparative Study of Linear and Non-Linear Dimensionality Reduction for Opcode-Frequency Malware Classification
Abstract
High-dimensional feature spaces in malware classification pose significant challenges for machine learning performance. To address these challenges, this paper presents a comparative evaluation of four dimensionality-reduction techniques—Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Uniform Manifold Approximation and Projection (UMAP), and Autoencoder-based reduction—applied to opcode-frequency representations of malware. Using a corpus comprising 82,569 samples and 1,796 opcodes, we analyze the effect of each reduction method across multiple target dimensions and two classifier architectures: Extreme Gradient Boosting (XGBoost) and a three-layer Multilayer Perceptron (MLP). Results show that LDA achieves strong separability at lower dimensions, while PCA performs best at higher dimensions where variance preservation is critical. Autoencoder-based reduction provides consistently high accuracy with compact representations, whereas UMAP exhibits limited benefits for tabular opcode data. The findings highlight trade-offs between linear and non-linear reduction strategies and provide guidance for selecting efficient feature compression methods in large-scale malware analysis.
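The abstract contrasts variance-preserving reduction (PCA) with class-aware reduction (LDA) on opcode-frequency vectors. A minimal sketch of that contrast, using scikit-learn on synthetic data (the sample counts, opcode counts, and family count below are illustrative, not the paper's 82,569 × 1,796 corpus):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_samples, n_opcodes, n_families = 600, 200, 5

# Synthetic opcode-count matrix: each malware family biases its own opcode mix.
y = rng.integers(0, n_families, n_samples)
family_rates = rng.poisson(3.0, (n_families, n_opcodes)).astype(float)
X = rng.poisson(family_rates[y] + 1.0).astype(float)
X /= X.sum(axis=1, keepdims=True)  # normalize raw counts to frequencies

# PCA: unsupervised, keeps the directions of maximal variance.
X_pca = PCA(n_components=10).fit_transform(X)

# LDA: supervised, yields at most (n_classes - 1) discriminative components,
# which is why it suits the low-dimensional regime described in the abstract.
X_lda = LinearDiscriminantAnalysis(n_components=n_families - 1).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (600, 10) (600, 4)
```

The reduced matrices would then be fed to a downstream classifier such as XGBoost or an MLP, as in the study's pipeline.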