Let’s Read the Log: Root Cause Analysis of Railway Test Execution Logs with Large Language Models

Abstract

Software quality assurance is pivotal in safety-critical domains such as railway systems, where failures could have catastrophic consequences. In this context, the train control and management system, which enables communication and control across multiple subsystems (such as doors and information panels) within a modern train, and its software must undergo rigorous validation. Alstom Rail Sweden AB employs a digital twin infrastructure to simulate and validate train control and management system software. While this setup improves system-level testing, root cause analysis of test failures remains a manual, time-consuming bottleneck. In this study, we explore the potential of large language models to automate root cause analysis by interpreting test execution logs generated during digital twin-based testing. We benchmark nine state-of-the-art large language models (Aion-1.0, DeepSeek R1, DeepSeek V3 0324, Mistral Small 3.1 24B, GPT o3-mini, Gemini 2.5 Pro Experimental, QwB 32B, Gemini 2.0 Flash Experimental, and Amazon Nova 2 Lite) using zero-shot chain-of-thought prompting to assess their ability to reason about fault patterns in real-world industrial test execution logs. The logs were sourced from Alstom’s digital twin-based testing environment and captured complex operational behaviour typical of embedded, safety-critical systems. Our results show that long-context large language models tended to achieve higher accuracy than smaller models. We also found that when a log exceeded a model’s context window, the model failed to reliably predict the root cause. Gemini 2.5 Pro Experimental achieved the best performance with 66.7% accuracy and produced strong reasoning in this domain, motivating further research on improving prediction accuracy for log-based root cause analysis.
