Domain Knowledge-Infused Synthetic Data Generation for LLM-Based ICS Intrusion Detection: Mitigating Data Scarcity and Imbalance

Seokhyun Ann
Hongeun Kim
Suhyeon Park
Seong-je Cho
Joonmo Kim
Harksu Cho

Abstract

Industrial control systems (ICSs) are increasingly interconnected with enterprise IT networks and remote services, which expands the attack surface of operational technology (OT) environments. However, collecting sufficient attack traffic from real OT/ICS networks is difficult, and the resulting scarcity and class imbalance of malicious data hinder the development of intrusion detection systems (IDSs). At the same time, large language models (LLMs) have shown promise for security analytics when system events are expressed in natural language. This study investigates an LLM-based network IDS for a smart-factory OT/ICS environment and proposes a synthetic data generation method that injects domain knowledge into attack samples. Using the ICSSIM simulator, we construct a bottle-filling smart factory, implement six MITRE ATT&CK for ICS-based attack scenarios, capture Modbus/TCP traffic, and convert each request–response pair into a natural-language description of network behavior. We then generate synthetic attack descriptions with GPT by combining (1) statistical properties of normal traffic, (2) MITRE ATT&CK for ICS tactics and techniques, and (3) expert knowledge obtained from executing the attacks in ICSSIM. The Llama 3.1 8B Instruct model is fine-tuned with QLoRA on a seven-class classification task (Benign vs. six attack types) and evaluated on a test set composed exclusively of real ICSSIM traffic. Experimental results show that synthetic data generated only from statistical information, or from statistics plus MITRE descriptions, yield limited performance, whereas incorporating environment-specific expert knowledge is associated with substantially higher performance on our ICSSIM-based expanded test set (100% accuracy in binary detection and 96.49% accuracy with a macro F1-score of 0.958 in attack-type classification). 
Overall, these findings suggest that domain-knowledge-infused synthetic data and natural-language traffic representations can support LLM-based IDSs in OT/ICS smart-factory settings; however, further validation on larger and more diverse datasets is needed to confirm generality.
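
As an illustration of the step in which each Modbus/TCP request–response pair is converted into a natural-language description, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the record fields or the sentence template, so the `ModbusExchange` fields, the `describe` function, the wording of the output sentence, and the example IP addresses are all assumptions made for illustration. Only the function-code names and the exception-response convention (function code with the high bit set) come from the Modbus protocol itself.

```python
from dataclasses import dataclass

# Standard public Modbus function codes and their names.
FUNCTION_NAMES = {
    1: "Read Coils",
    2: "Read Discrete Inputs",
    3: "Read Holding Registers",
    4: "Read Input Registers",
    5: "Write Single Coil",
    6: "Write Single Register",
    15: "Write Multiple Coils",
    16: "Write Multiple Registers",
}


@dataclass
class ModbusExchange:
    """One request-response pair. All field names are hypothetical."""
    src_ip: str
    dst_ip: str
    function_code: int       # function code carried by the request
    address: int             # starting address in the request
    quantity: int            # number of coils/registers addressed
    response_code: int       # function code carried by the response
    response_time_ms: float  # delay between request and response


def describe(ex: ModbusExchange) -> str:
    """Render one exchange as an English sentence (illustrative template)."""
    name = FUNCTION_NAMES.get(ex.function_code, f"function code {ex.function_code}")
    # A Modbus exception response echoes the function code with the high bit set.
    if ex.response_code == (ex.function_code | 0x80):
        outcome = "the server returned an exception response"
    elif ex.response_code == ex.function_code:
        outcome = "the server returned a normal response"
    else:
        outcome = f"the server replied with unexpected function code {ex.response_code}"
    return (
        f"Host {ex.src_ip} sent a Modbus/TCP {name} request to {ex.dst_ip} "
        f"for {ex.quantity} item(s) starting at address {ex.address}; "
        f"{outcome} after {ex.response_time_ms:.1f} ms."
    )


# Hypothetical addresses and values, chosen only to show the output shape.
example = ModbusExchange("192.168.0.21", "192.168.0.11", 3, 0, 4, 3, 2.5)
print(describe(example))
```

A template of this kind keeps the description deterministic, so the same exchange always maps to the same text; whether the study uses fixed templates or a richer description is not stated in the abstract.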

Version published to 10.3390/electronics15020371
Jan 14, 2026
Version published to 10.20944/preprints202512.2199.v1
Dec 24, 2025

Related articles

Enhancing Security in Distributed Event-Based Systems Using AI/ML Models

This article has 1 author:
1. Apeksha Bhuekar
This article has no evaluations. Latest version: Dec 24, 2025

Unified Anomaly Detection in IoT and Cyber-Physical Networks Using Evo-Transformer-LSTM: Validation on Four CIC Benchmarks

This article has 5 authors:
1. Pardis Sadatian Moghaddam
2. Mahyar Mahmoudi
3. Nuria Serrano
4. Francisco Hernando-Gallego
5. Diego Martín
This article has no evaluations. Latest version: Dec 9, 2025

SGA-FL NIDS: A Similarity-Gated Asynchronous Federated Learning for Network Intrusion Detection

This article has 6 authors:
1. XiaoFang Dong
2. Kai Yang
3. JingChao Liu
4. Wen Chen
5. HuiMin Hou
6. XiaoMing Liu
This article has no evaluations. Latest version: Jan 30, 2026
