Domain-Knowledge-Infused Synthetic Data Generation for LLM-based ICS Intrusion Detection: Mitigating Data Scarcity and Imbalance
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Industrial control systems (ICSs) are increasingly interconnected with enterprise IT networks and remote services, which expands the attack surface of operational technology (OT) environments. However, collecting sufficient attack traffic from real OT/ICS networks is difficult, and the resulting scarcity and class imbalance of malicious data hinder the development of intrusion detection systems (IDSs). At the same time, large language models (LLMs) have shown promise for security analytics when system events are expressed in natural language. This study investigates an LLM-based network IDS for a smart-factory OT/ICS environment and proposes a synthetic data generation method that injects domain knowledge into attack samples. Using the ICSSIM simulator, we construct a bottle-filling smart factory, implement six MITRE ATT&CK for ICS based attack scenarios, capture Modbus/TCP traffic, and convert each request–response pair into a natural-language description of network behavior. We then generate synthetic attack descriptions with GPT by combining (1) statistical properties of normal traffic, (2) MITRE ATT&CK for ICS tactics and techniques, and (3) expert knowledge obtained from executing the attacks in ICSSIM. The Llama 3.1 8B Instruct model is fine-tuned with QLoRA on a seven-class classification task (Benign vs. six attack types) and evaluated on a test set composed exclusively of real ICSSIM traffic. Experimental results show that synthetic data generated only from statistical information or from statistics plus MITRE descriptions yield limited performance, whereas incorporating environment-specific expert knowledge enables the model to achieve 99.61% accuracy in binary detection and 95.63% accuracy with a macro F1-score of 0.952 in attack-type classification. These results demonstrate that domain-knowledge–infused synthetic data and natural-language traffic representations can make LLM-based IDSs a practical option for deployment in OT/ICS smart-factory environments.