SRAudit: Auditing the Structural Reliability of the Clouds to Ward Off Correlated Failures
07 September 2016
Cloud computing systems rely heavily on redundancy techniques to ensure reliability. Nevertheless, as cloud systems become ever more structurally complex, infrastructure components, e.g., replica servers, may unwittingly share deep dependencies, such as aggregation switches. These unexpected common dependencies may result in correlated failures that undermine redundancy efforts. Although existing diagnosis tools offer post-failure forensics, they typically lead to prolonged failure recovery time. This paper presents SRAudit, a practical framework that aims to prevent correlated failures before cloud outages occur, by allowing administrators to proactively audit the structural reliability of redundant systems of interest. SRAudit simultaneously offers expressive, accurate, and efficient structural reliability auditing by introducing three novel components: 1) a declarative domain-specific language, RAL, that enables administrators to easily write auditing programs expressing diverse auditing tasks; 2) a high-performance auditing engine that parses RAL programs and efficiently generates accurate auditing results by leveraging various verification tools (e.g., MinCostSAT and a model counter); and 3) a repair engine that automatically generates reliability improvement plans based on easily written specifications. Our experimental results show that SRAudit can determine the top-20 critical correlated failure root causes in a 70,656-node redundant system within 5 minutes, which is 300x more efficient in auditing time than previous systems.
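
To make the notion of a correlated failure root cause concrete, the following is a minimal illustrative sketch, not SRAudit's implementation and not RAL. It assumes a hypothetical two-replica deployment in which each server fails if any one of its dependencies fails, and it enumerates by brute force the minimal sets of components whose joint failure takes down every replica. All component names, the `DEPENDENCIES` map, and both helper functions are assumptions made for illustration; the abstract states that SRAudit itself relies on verification tools such as MinCostSAT rather than exhaustive enumeration.

```python
from itertools import combinations

# Hypothetical dependency map (illustrative names, not from the paper):
# each replica server fails if ANY one of its dependencies fails.
DEPENDENCIES = {
    "server_a": {"tor_switch_1", "agg_switch_1", "power_1"},
    "server_b": {"tor_switch_2", "agg_switch_1", "power_2"},
}


def system_fails(failed, dependencies):
    """The redundant system fails only if every replica has a failed dependency."""
    return all(deps & failed for deps in dependencies.values())


def minimal_risk_groups(dependencies, max_size=2):
    """Enumerate minimal component sets whose joint failure breaks all replicas.

    Brute force over subsets of increasing size; a candidate that contains an
    already-found group is skipped, so only minimal groups are reported.
    """
    components = sorted(set().union(*dependencies.values()))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(components, size):
            candidate = frozenset(combo)
            if any(group <= candidate for group in found):
                continue  # a smaller group already explains this failure
            if system_fails(candidate, dependencies):
                found.append(candidate)
    return found


groups = minimal_risk_groups(DEPENDENCIES)
for group in groups:
    print(sorted(group))
```

In this toy deployment the shared aggregation switch shows up as a risk group of size one: a single component whose failure defeats the redundancy. Exhaustive enumeration grows exponentially with the number of components, which is why a system at the scale reported above needs a solver-based approach instead.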