Addressing Class Imbalance in Network Intrusion Detection: An Enhanced Hybrid Deep Learning Framework with Advanced Sampling and Attention Mechanisms
Abstract
Network intrusion detection systems (NIDS) are critical components of cybersecurity infrastructure, yet they struggle with highly imbalanced datasets in which critical attack types make up less than 0.001% of network traffic. This severe class imbalance leads to poor detection rates for rare but potentially devastating attacks such as Heartbleed, SQL injection, and infiltration attempts. This paper proposes an adaptive SMOTE-based sampling strategy combined with feature selection to address extreme class imbalance in the CIC-IDS2017 dataset, which exhibits an imbalance ratio of 191,678:1. The methodology comprises comprehensive data preprocessing, XGBoost-based feature selection that reduces dimensionality from 78 to 50 features, and an adaptive SMOTE strategy that oversamples minority classes according to their severity and rarity. The proposed approach reduced the class imbalance ratio by 99.9% (from 191,678:1 to 204:1), increasing minority class samples by up to 1,916 times for Heartbleed attacks, 958 times for SQL injection, and 583 times for infiltration attempts. A Random Forest classifier trained on the resampled data achieved 99.79% overall accuracy on 504,473 test samples, with notable gains on specific minority classes, including SSH-Patator (97.5% accuracy) and Bot attacks (60.1% accuracy). Although some ultra-rare classes with fewer than 10 test samples remained difficult to detect, the study supports the effectiveness of adaptive sampling strategies for improving minority class representation in highly imbalanced network intrusion datasets. The framework is applicable to cloud computing environments, where diverse attack patterns must be detected despite severe data imbalance.
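To make the imbalance-ratio arithmetic concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the imbalance ratio is the largest class count divided by the smallest, which is consistent with the reported reduction: 1 - 204/191,678 ≈ 0.9989. The per-class counts, the `floor` parameter, and the helper names `imbalance_ratio`, `adaptive_targets`, and `smote_sample` are hypothetical. The simple "raise every rare class to a floor" rule stands in for the paper's severity- and rarity-based adaptation, whose exact rule the abstract does not specify.

```python
import random


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size."""
    return max(counts.values()) / min(counts.values())


def adaptive_targets(counts, floor):
    """Raise every class below `floor` up to `floor`; leave other classes unchanged.

    Assumption: a single floor is a simplification of an adaptive,
    per-class oversampling plan.
    """
    return {label: max(n, floor) for label, n in counts.items()}


def smote_sample(x, neighbor, rng):
    """Create one SMOTE-style synthetic point on the segment from x to a neighbor."""
    gap = rng.random()  # interpolation weight in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Hypothetical class counts, not taken from the paper.
counts = {"BENIGN": 1_000_000, "Bot": 1_500, "Heartbleed": 8}

targets = adaptive_targets(counts, floor=5_000)
before = imbalance_ratio(counts)    # 1,000,000 / 8
after = imbalance_ratio(targets)    # 1,000,000 / 5,000
reduction = 1 - after / before      # fractional reduction in the ratio

# One synthetic minority sample between two (hypothetical) feature vectors.
rng = random.Random(0)
synthetic = smote_sample([0.0, 0.0], [1.0, 2.0], rng)
```

In practice the per-class target counts from a plan like `targets` could be passed to an existing SMOTE implementation instead of interpolating by hand. The hand-written `smote_sample` only shows the interpolation step that SMOTE applies between a minority sample and one of its nearest minority-class neighbors.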