Integrating Large Language Models with Cloud-Native Observability for Automated Root Cause Analysis and Remediation

Abstract

Cloud-native systems based on microservices, containers, and serverless architectures present unprecedented challenges for observability and incident management. Traditional rule-based monitoring and manual root cause analysis are increasingly inadequate for handling the complexity and scale of modern distributed systems. This paper presents a novel framework that leverages large language models (LLMs) to enhance cloud-native observability, enabling automated root cause analysis and self-healing capabilities. Our system integrates OpenTelemetry-based telemetry collection with a domain-adapted LLM capable of performing multimodal analysis over metrics, logs, and traces. Through fine-tuning on operational data and chain-of-thought reasoning, the LLM generates explainable root cause hypotheses and actionable remediation plans. Experimental evaluation on public microservice datasets demonstrates that our approach reduces mean time to resolution (MTTR) by 84.2% compared to rule-based methods, achieving 95% F1-score in anomaly detection while maintaining low computational overhead. The system successfully automated 91% of common incidents without human intervention, significantly improving service reliability and reducing operational burden.
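For readers unfamiliar with the two headline metrics, the following minimal Python sketch shows how an F1-score and a relative MTTR reduction are conventionally computed. The function names and all input numbers are illustrative assumptions, not values or code from the paper; the abstract does not state the underlying confusion counts or absolute resolution times.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mttr_reduction(baseline_minutes: float, new_minutes: float) -> float:
    """Percentage reduction in mean time to resolution versus a baseline."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return 100.0 * (baseline_minutes - new_minutes) / baseline_minutes


if __name__ == "__main__":
    # Hypothetical counts and durations, chosen only for illustration.
    print(f"F1: {f1_score(tp=95, fp=5, fn=5):.3f}")
    print(f"MTTR reduction: {mttr_reduction(100.0, 15.8):.1f}%")
```

Under these definitions, an 84.2% MTTR reduction means the new mean resolution time is 15.8% of the rule-based baseline, whatever the absolute durations are.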
