Preventing Correlated Failures through Snap-Audits

18 September 2017

Correlated failures significantly affect the reliability of cloud services. Current post-failure forensics (e.g., diagnosis tools) require prolonged troubleshooting and recovery time in the face of complex network/software stacks, while pre-deployment analysis (e.g., INDaaS) cannot detect root causes introduced by network and software updates during service runtime. This paper argues for snap-audits, which aim to prevent correlated failures by enabling administrators to accurately, rapidly and continuously check potential risks in each important snapshot (e.g., a system update) during service runtime. We build CloudCanary, a system that uses snap-audits to analyze the network/software stack of any snapshot of interest and alert administrators to two types of root causes: 1) bugs that may result in correlated failures and 2) network fragility, i.e., simultaneous glitches of a few network components that result in correlated failures. CloudCanary offers a suite of efficient and accurate snap-audit implementations by adapting verification tools (e.g., MinCostSAT and model counters) in non-trivial ways. We further build two useful applications on CloudCanary: service deployment recommendation and network-fragility repair. Our experiments show that CloudCanary can determine the top 20 critical network fragilities in a 70,656-node system in 3 minutes, which is 500x faster than the state of the art.
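
To make the notion of network fragility concrete, the following is a minimal illustrative sketch, not CloudCanary's method: it uses brute-force enumeration instead of MinCostSAT, and the topology, component names, and function names (`DEPENDENCIES`, `service_down`, `smallest_risk_groups`) are all hypothetical. It lists the smallest sets of network components whose simultaneous failure would take down every replica of a service.

```python
from itertools import combinations

# Hypothetical dependency map (an assumption for illustration only):
# a replica becomes unreachable if ANY component it depends on fails.
DEPENDENCIES = {
    "replica_a": {"tor1", "agg1", "core1"},
    "replica_b": {"tor2", "agg1", "core1"},
    "replica_c": {"tor3", "agg2", "core1"},
}


def service_down(failed, deps):
    """The service is down when every replica has at least one failed dependency."""
    return all(deps[replica] & failed for replica in deps)


def smallest_risk_groups(deps, max_size=3):
    """Enumerate minimal failure sets, smallest first, up to max_size components.

    Brute force: exponential in max_size, so only suitable for toy inputs.
    """
    components = sorted(set().union(*deps.values()))
    minimal = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of an already-found group: they are not minimal.
            if any(found <= candidate for found in minimal):
                continue
            if service_down(candidate, deps):
                minimal.append(candidate)
    return minimal


groups = smallest_risk_groups(DEPENDENCIES)
for group in groups:
    print(sorted(group))
```

In this toy topology the single shared core switch is the most critical fragility, followed by pairs such as the two aggregation switches. Exhaustive enumeration of this kind does not scale to systems with tens of thousands of nodes, which is the setting where solver-based approaches such as the one described in the abstract are needed.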