Toward AI‑Driven IoT Cybersecurity: A Preprocessing Framework for Benchmark Datasets

Abstract

The rapid expansion of Internet of Things (IoT) systems, found in environments such as smart homes, poses growing cybersecurity challenges. In response, research has examined the role of artificial intelligence, particularly machine learning, in enhancing IoT security. To support this effort, machine learning models have been developed and evaluated on benchmark datasets. However, preparing datasets for machine learning requires preprocessing techniques that are tailored to the specific characteristics of the data. In this context, exploratory data analysis provides insights into dataset structure and distribution, thereby supporting informed preprocessing decisions prior to modeling. Accordingly, this study introduces a reproducible five‑step preprocessing framework for IoT cybersecurity datasets and demonstrates its application to the NF‑ToN‑IoT V1 dataset. The proposed framework is organized into two phases: an exploratory data analysis phase consisting of (1) dataset overview and identification of categorical and numerical features, (2) analysis of missing and zero values, (3) assessment of categorical feature distributions, and (4) assessment of numerical feature distributions; and a preprocessing phase consisting of (5) proportional stratified random downsampling to produce a reduced dataset that preserves the original class distribution. By establishing a systematic, data-driven framework, this study contributes to the preparation of structured datasets for attack detection in IoT environments, with potential applications in smart homes.
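
The five steps above can be illustrated with a minimal sketch in Python using pandas. This is not the authors' implementation. The function names, the label column name `Attack`, the sampling fraction, and the random seed are all assumptions made for illustration, and the toy data frame merely stands in for a real benchmark file.

```python
import pandas as pd


def eda_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Steps 1-2 (sketch): feature kind, missing count, and zero count per column."""
    numeric_cols = set(df.select_dtypes(include="number").columns)
    kind = pd.Series(
        ["numerical" if c in numeric_cols else "categorical" for c in df.columns],
        index=df.columns,
    )
    return pd.DataFrame({
        "kind": kind,
        "missing": df.isna().sum(),   # NaN / None entries per column
        "zeros": (df == 0).sum(),     # exact zeros per column
    })


def stratified_downsample(df: pd.DataFrame, label_col: str,
                          frac: float, seed: int = 42) -> pd.DataFrame:
    """Step 5 (sketch): draw the same fraction at random from every class.

    Sampling each class with an identical fraction keeps the class
    proportions of the reduced dataset close to the original ones.
    Very small classes may round down to zero rows, so they should be
    checked separately.
    """
    return df.groupby(label_col).sample(frac=frac, random_state=seed)


# Toy data standing in for a benchmark dataset (hypothetical columns).
toy = pd.DataFrame({
    "bytes": range(100),
    "Attack": ["Benign"] * 80 + ["ddos"] * 20,
})

summary = eda_summary(toy)                                      # steps 1-2
class_shares = toy["Attack"].value_counts(normalize=True)       # step 3
numeric_stats = toy.describe()                                  # step 4
reduced = stratified_downsample(toy, "Attack", frac=0.5)        # step 5
reduced_shares = reduced["Attack"].value_counts(normalize=True)
```

Comparing `class_shares` with `reduced_shares` is a quick check that the downsampled data preserves the original class distribution. Fixing the seed makes the reduction repeatable across runs.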
