A Comparative Study of Linear and Non-Linear Dimensionality Reduction for Opcode-Frequency Malware Classification

Abstract

High-dimensional feature spaces in malware classification pose significant challenges for machine learning performance. To address these challenges, this paper presents a comparative evaluation of four dimensionality-reduction techniques—Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Uniform Manifold Approximation and Projection (UMAP), and Autoencoder-based reduction—applied to opcode-frequency representations of malware. Using a corpus comprising 82,569 samples and 1,796 opcodes, we analyze the effect of each reduction method across multiple target dimensions and two classifier architectures: Extreme Gradient Boosting (XGBoost) and a three-layer Multilayer Perceptron (MLP). Results show that LDA achieves strong separability at lower dimensions, while PCA performs best at higher dimensions where variance preservation is critical. Autoencoder-based reduction provides consistently high accuracy with compact representations, whereas UMAP exhibits limited benefits for tabular opcode data. The findings highlight trade-offs between linear and non-linear reduction strategies and provide guidance for selecting efficient feature compression methods in large-scale malware analysis.
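The pipeline described above (reduce opcode-frequency vectors, then classify) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the corpus, feature counts, and classifier here are stand-ins (a logistic-regression classifier replaces XGBoost/MLP to keep the sketch self-contained), and only the PCA and LDA reducers from the comparison are shown.

```python
# Hypothetical sketch of the reduce-then-classify comparison from the abstract.
# Synthetic Poisson counts stand in for the 82,569 x 1,796 opcode-frequency matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_opcodes, n_classes = 600, 200, 3  # illustrative sizes only
X = rng.poisson(2.0, size=(n_samples, n_opcodes)).astype(float)
y = rng.integers(0, n_classes, size=n_samples)
X[:, :10] += y[:, None] * 1.5  # inject class-dependent signal to preserve

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LDA projects to at most (n_classes - 1) dimensions, which is why it is
# evaluated at low target dimensions; PCA can target any dimensionality.
reducers = {
    "PCA-16": PCA(n_components=16, random_state=0),
    "LDA": LinearDiscriminantAnalysis(n_components=n_classes - 1),
}
scores = {}
for name, reducer in reducers.items():
    clf = make_pipeline(reducer, LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
print(scores)
```

The same loop extends naturally to UMAP or an autoencoder by dropping in any reducer exposing `fit`/`transform`, which is how a multi-method, multi-dimension comparison like the one in the paper can be organized.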
