A Survey on Distributed System Testing Techniques

Abstract

Modern digital infrastructure increasingly relies on distributed computing environments to achieve high availability and horizontal scalability. However, the inherent complexity of cloud-native architectures (network partitions, partial failures, and non-deterministic execution interleavings) renders traditional unit and integration testing largely ineffective. This paper proposes the Unified Resilience Testing Framework (URTF), a multi-module system designed to move distributed system verification from reactive debugging to proactive, framework-based reconstruction. URTF integrates cross-service observability with heuristic state space pruning to identify critical execution paths susceptible to "distributed system taxes" such as global consistency violations. The research gap addressed by this work is that current tools cannot handle the state space explosion caused by microservice interdependencies without sacrificing detection fidelity. By combining lineage-driven fault injection with real-time telemetry aggregation, the proposed framework enables systematic evaluation of system invariants under environmental stressors such as packet loss, node crashes, and clock drift. The framework also introduces a heuristic scoring mechanism that prioritizes execution paths by their probability of surfacing hidden concurrency bugs. Experimental benchmarks indicate that URTF significantly mitigates the state space explosion associated with exhaustive model checking while maintaining a 94% detection rate for intricate concurrency bugs; results are also reported on metrics such as AUC and robustness. This work provides a roadmap for architecting self-healing systems, offering actionable insights for researchers and practitioners aiming to improve the reliability of large-scale microservices through uncertainty-driven testing and automated recovery protocols. The impact of URTF extends to improving the resilience of critical digital services, ultimately reducing the downtime caused by unforeseen systemic failures in production cloud-native ecosystems.
