Let’s Read the Log: Root Cause Analysis of Railway Test Execution Logs with Large Language Models

Rahmanu Hermawan
Alessio Bucaioni
Eduard Enoiu
Wasif Afzal
Mehrdad Saadatmand
Nedim Zaimovic
Md Saleh Ibtasham

Abstract

Software quality assurance is pivotal in safety-critical domains such as railway systems, where failures could have catastrophic consequences. In this context, the train control and management system, which enables communication and control across multiple subsystems (such as doors and information panels) within a modern train, and its software must undergo rigorous validation. Alstom Rail Sweden AB employs a digital twin infrastructure to simulate and validate train control and management system software. While this setup improves system-level testing, root cause analysis of test failures remains a manual, time-consuming bottleneck. In this study, we explore the potential of large language models to automate root cause analysis by interpreting test execution logs generated during digital twin-based testing. We benchmark nine state-of-the-art large language models (Aion-1.0, DeepSeek R1, DeepSeek V3 0324, Mistral Small 3.1 24B, GPT o3-mini, Gemini 2.5 Pro Experimental, QwB 32B, Gemini 2.0 Flash Experimental, and Amazon Nova 2 Lite) using zero-shot chain-of-thought prompting to assess their ability to reason about fault patterns in real-world industrial test execution logs. The logs were sourced from Alstom’s digital twin-based testing environment and captured complex operational behaviour typical of embedded, safety-critical systems. Our results show that long-context large language models tended to achieve higher accuracy than smaller models. We also found that when a log exceeded a model’s context window, the model failed to reliably predict the root cause. Gemini 2.5 Pro Experimental achieved the best performance with 66.7% accuracy and produced strong reasoning in this domain, motivating further research on improving prediction accuracy for log-based root cause analysis.

Version published to 10.21203/rs.3.rs-8869896/v1 on Research Square
Mar 20, 2026

Related articles

ReATest: enhancing policy-as-code workflows through automated test case generation from Rego policies

This article has 2 authors:
1. Thanh-Binh Trinh
2. Ngoc-Minh Le
This article has no evaluations. Latest version: Apr 10, 2026

Test Case Generation with Hecate: To Infinity and Beyond!

This article has 8 authors:
1. Nunzio Marco Bisceglia
2. Michael Marzella
3. Daniele Lazzari
4. Marcello Minervini
5. Federico Formica
6. Angelo Gargantini
7. Claudio Menghi
8. Andrea Bombarda
This article has no evaluations. Latest version: Apr 13, 2026

System-Level Safety and Certification Implications of Linux in Airborne Avionics

This article has 1 author:
1. Haoran Lu
This article has no evaluations. Latest version: Apr 20, 2026
