Cast: Automated Resilience Testing for Production Cloud Service Systems

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv paper.

The distributed nature of microservice architecture introduces significant resilience challenges. Traditional testing methods, limited by extensive manual effort and oversimplified test environments, fail to capture production system complexity. To address these limitations, we present Cast, an automated, end-to-end framework for microservice resilience testing in production. It achieves high test fidelity by replaying production traffic against a comprehensive library of application-level faults to exercise internal error-handling logic. To manage the combinatorial test space, Cast employs a complexity-driven strategy to systematically prune redundant tests and prioritize high-value tests targeting the most critical service execution paths. Cast automates the testing lifecycle through a three-phase pipeline (i.e., startup, fault injection, and recovery) and uses a multi-faceted oracle to automatically verify system resilience against nuanced criteria. Deployed in Huawei Cloud for over eight months, Cast has been adopted by many service teams to proactively address resilience vulnerabilities. Our analysis of four large-scale applications with millions of traces reveals 137 potential vulnerabilities, with 89 confirmed by developers. To further quantify its performance, Cast is evaluated on a benchmark set of 48 reproduced bugs, achieving a high coverage of 90%. The results show that Cast is a practical and effective solution for systematically improving the reliability of industrial microservice systems.


💡 Research Summary

The paper introduces Cast, an end‑to‑end automated framework designed to test the resilience of microservice‑based cloud systems directly in production‑like environments. Traditional resilience testing approaches either focus on coarse‑grained infrastructure faults (e.g., VM termination, network latency) or rely on manually crafted functional tests that cannot scale to the complexity of modern microservice deployments. Cast addresses these gaps by combining four core techniques: (1) Production traffic recording and replay, (2) Application‑level fault injection, (3) Complexity‑driven test selection, and (4) Fully automated execution with a multi‑faceted oracle.

Traffic Recording and Replay
Cast captures live production traffic using lightweight instrumentation based on Java agents and dynamic AOP. It records distributed traces (spans) that include HTTP/gRPC calls, database queries, message‑broker interactions, and cache accesses. To make replay feasible, Cast automatically identifies state‑dependent variables (e.g., session IDs, timestamps, idempotency keys) through a two‑stage heuristic: (i) intra‑span correlation finds tokens that appear in both request and response payloads, and (ii) inter‑span variability checks whether those tokens change across multiple instances of the same operation. Only variables that vary are marked as dynamic and are refreshed at replay time, eliminating the need for manual annotation or static taint analysis.
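The two‑stage heuristic above can be sketched as follows. This is a minimal illustration, not Cast's implementation: the class and method names are invented, and spans are modeled simply as token‑to‑value maps.

```java
import java.util.*;

/** Sketch of the two-stage dynamic-variable heuristic (names are illustrative). */
public class DynamicVarDetector {

    /** Stage (i): intra-span correlation — keep tokens present in both request and response. */
    static Set<String> correlatedTokens(Set<String> requestTokens, Set<String> responseTokens) {
        Set<String> shared = new HashSet<>(requestTokens);
        shared.retainAll(responseTokens);
        return shared;
    }

    /** Stage (ii): inter-span variability — a token is dynamic if its value differs
     *  across recorded instances of the same operation. */
    static boolean isDynamic(List<Map<String, String>> instances, String token) {
        Set<String> values = new HashSet<>();
        for (Map<String, String> span : instances) {
            if (span.containsKey(token)) values.add(span.get(token));
        }
        return values.size() > 1; // varies across spans -> refresh at replay time
    }

    public static void main(String[] args) {
        // Two recorded spans of the same operation: sessionId varies, apiVersion does not.
        List<Map<String, String>> spans = List.of(
            Map.of("sessionId", "abc-1", "apiVersion", "v2"),
            Map.of("sessionId", "xyz-9", "apiVersion", "v2"));
        System.out.println(isDynamic(spans, "sessionId"));  // true -> dynamic, refreshed at replay
        System.out.println(isDynamic(spans, "apiVersion")); // false -> static, replayed verbatim
    }
}
```

A token such as a session ID would pass both stages (it echoes back in responses and differs per span), while a constant like an API version fails stage (ii) and is replayed as recorded.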

Complexity‑Driven Test Selection
Because production traces can number in the millions, Cast aggregates similar user interactions and builds a call‑graph for each logical flow. It then computes a composite complexity metric that considers the number of nodes/edges, depth of dependency chains, and historical error frequencies. Flows with the highest complexity are prioritized, while redundant or low‑impact flows are pruned. This strategy reduces the combinatorial explosion of possible fault‑injection points while ensuring that the most failure‑prone execution paths are exercised.
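A composite metric of this kind can be sketched as a weighted sum of the three factors the text names. The weights and example flows below are assumptions for illustration only; the paper does not specify the exact formula.

```java
/** Illustrative composite complexity score; weights are assumed, not from the paper. */
public class FlowComplexity {
    // Hypothetical weights for the factors named in the text.
    static final double W_NODES = 1.0, W_EDGES = 0.5, W_DEPTH = 2.0, W_ERRORS = 4.0;

    /** Higher score -> flow is prioritized for fault injection. */
    static double score(int nodes, int edges, int maxDepth, double historicalErrorRate) {
        return W_NODES * nodes + W_EDGES * edges
             + W_DEPTH * maxDepth + W_ERRORS * historicalErrorRate;
    }

    public static void main(String[] args) {
        // A deep, historically error-prone flow outranks a shallow, stable one.
        double checkout = score(12, 18, 6, 0.08); // e.g. a multi-service checkout path
        double health   = score(3, 2, 1, 0.0);    // e.g. a trivial health-check path
        System.out.println(checkout > health); // true -> checkout is tested first
    }
}
```

Pruning then amounts to dropping flows whose call graphs are subsumed by (or near-duplicates of) higher-scoring flows, so each retained test covers a distinct, failure-prone path.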

Application‑Level Fault Library
Instead of generic network errors, Cast’s fault library contains realistic, application‑specific failure modes such as business‑logic exceptions, serialization errors, and third‑party SDK failures. Each fault is declaratively described with target method, injection point (pre‑call, post‑call, response generation), and parameter transformation rules. By coupling fault definitions with the dynamically identified variables, Cast can inject failures that faithfully mimic real production anomalies, thereby exercising internal error‑handling code (retries, circuit breakers, fallbacks).
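An in-memory form of such a declarative fault definition might look like the sketch below. The field names, injection-point enum, and example target method are hypothetical, chosen to mirror the three elements the text lists (target method, injection point, parameter transformation).

```java
/** Hypothetical in-memory shape of a declarative fault definition. */
public class FaultSpec {
    enum InjectionPoint { PRE_CALL, POST_CALL, RESPONSE_GENERATION }

    final String targetMethod;   // fully qualified method to intercept
    final InjectionPoint point;  // where in the call the fault fires
    final String faultType;      // e.g. business exception, serialization error
    final String paramTransform; // rule applied to dynamically identified variables

    FaultSpec(String targetMethod, InjectionPoint point, String faultType, String paramTransform) {
        this.targetMethod = targetMethod;
        this.point = point;
        this.faultType = faultType;
        this.paramTransform = paramTransform;
    }

    @Override public String toString() {
        return faultType + "@" + targetMethod + ":" + point;
    }

    public static void main(String[] args) {
        // Illustrative fault: corrupt the serialized response of an assumed payment call.
        FaultSpec f = new FaultSpec(
            "com.example.order.PaymentClient.charge", // assumed target, for illustration
            InjectionPoint.POST_CALL,
            "SerializationError",
            "corrupt(response.body)");
        System.out.println(f);
    }
}
```

Because the transformation rule can reference the dynamically identified variables from the replay stage, the injected payloads stay consistent with the live request context rather than being canned values.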

Three‑Phase Automated Pipeline and Multi‑Faceted Oracle
The testing lifecycle is orchestrated in three phases: (i) Startup – replay the selected traffic against a test cluster; (ii) Fault Injection – trigger the chosen faults at the precise moments identified in the trace; (iii) Recovery – remove the fault and observe the system’s return to a healthy state. Cast validates outcomes using a multi‑faceted oracle that goes beyond simple HTTP status checks. It aggregates service‑level metrics (error counters, circuit‑breaker states), log patterns, and data‑consistency checks (e.g., database invariants) to automatically decide whether resilience criteria are met.
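The verdict logic of such a multi-faceted oracle can be sketched as a conjunction of named checks: the run passes only if every facet passes. This is a minimal sketch under assumed facet names and readings, not Cast's actual oracle.

```java
import java.util.*;
import java.util.function.BooleanSupplier;

/** Minimal multi-faceted oracle sketch: all facets must hold for a resilient verdict. */
public class ResilienceOracle {
    private final Map<String, BooleanSupplier> facets = new LinkedHashMap<>();

    ResilienceOracle addFacet(String name, BooleanSupplier check) {
        facets.put(name, check);
        return this;
    }

    /** Returns the names of failed facets; empty means the resilience criteria are met. */
    List<String> verify() {
        List<String> failed = new ArrayList<>();
        facets.forEach((name, check) -> { if (!check.getAsBoolean()) failed.add(name); });
        return failed;
    }

    public static void main(String[] args) {
        // Illustrative readings taken after the recovery phase.
        int errorCount = 0;
        boolean circuitBreakerClosed = true;
        boolean dbInvariantHolds = true;

        List<String> failed = new ResilienceOracle()
            .addFacet("error-counter",             () -> errorCount == 0)
            .addFacet("circuit-breaker-recovered", () -> circuitBreakerClosed)
            .addFacet("db-invariant",              () -> dbInvariantHolds)
            .verify();
        System.out.println(failed.isEmpty()); // true -> system judged resilient
    }
}
```

Reporting the failed facet names, rather than a bare pass/fail, points developers at the specific resilience criterion (metrics, logs, or data consistency) that was violated.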

Empirical Evaluation
Cast was deployed in Huawei Cloud for eight months, serving multiple business domains (storage, networking, digital platforms, IoT). Four large‑scale applications, each comprising hundreds of microservices and handling millions of daily requests, were evaluated. Cast discovered 137 potential resilience vulnerabilities, of which 89 were confirmed by developers as real bugs. In a controlled experiment using a benchmark of 48 reproduced bugs, Cast achieved a 90% detection coverage, outperforming existing chaos‑engineering tools that typically target only infrastructure‑level faults. The complexity‑driven selection reduced the number of executed test cases to roughly 3% of the total trace volume, yet still uncovered the majority of critical issues.
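The headline figures above imply a developer confirmation rate of roughly 65% (89 of 137 reported vulnerabilities) and about 43 of the 48 benchmark bugs detected at 90% coverage:

```java
/** Quick arithmetic behind the reported evaluation figures. */
public class EvalNumbers {
    public static void main(String[] args) {
        double confirmationRate = 89.0 / 137.0;  // confirmed / reported vulnerabilities
        long detected = Math.round(0.90 * 48);   // 90% coverage of the 48-bug benchmark
        System.out.printf("confirmation rate = %.0f%%, detected = %d bugs%n",
                confirmationRate * 100, detected);
    }
}
```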

Discussion and Limitations
The authors acknowledge several challenges: (1) the heuristic for dynamic variable identification may miss encrypted or highly nested tokens, requiring occasional manual tuning; (2) the current fault library is tailored to Java‑based services, so extending to other runtimes (e.g., Go, Node.js) will need additional adapters; (3) storing and processing massive trace datasets incurs non‑trivial storage and compute costs, suggesting the need for efficient compression and sampling strategies.

Future Work
Planned directions include (i) applying machine‑learning techniques to improve dynamic variable detection and to predict high‑risk execution paths; (ii) automating the generation of fault libraries from service contracts or API specifications; (iii) extending Cast to multi‑cloud and hybrid environments; and (iv) tighter integration with CI/CD pipelines to provide continuous resilience feedback.

Conclusion
Cast demonstrates that high‑fidelity, production‑traffic‑based resilience testing can be fully automated and scaled to industrial microservice ecosystems. By unifying trace‑based replay, fine‑grained fault injection, complexity‑aware test selection, and a comprehensive verification oracle, Cast provides a practical solution that uncovers real‑world resilience bugs with high coverage while keeping the testing effort manageable. The successful eight‑month deployment and the substantial bug detection results validate Cast’s effectiveness and its potential to become a cornerstone of reliability engineering practices in cloud‑native organizations.

