Measurement and Analysis of Cloud Service Reliability
12 May 2016
Cloud computing powers a diverse range of web services today. Cloud-based applications enable us to conduct e-commerce transactions, interact with others via social applications, track our fitness, and even help us organize and plan our daily lives. Given the growth of cloud services and our increasing reliance on them, it is crucial that we understand their reliability well. Moreover, a cloud computing service today is composed of a complex array of building blocks, any of which can fail, which makes reliability hard to assess. Prior empirical studies of cloud reliability have three major limitations: (1) They focus on the reliability of individual building blocks of a service (e.g., software bugs or hardware failures) instead of considering the cloud application as a whole. (2) They have mostly focused on a limited set of cloud service providers (e.g., IaaS) and fail to capture a broader view of the reliability of cloud services. (3) Prior methodologies for analyzing the reliability of multiple cloud services were limited in the amount of information they could obtain about failure incidents. Overcoming these limitations requires data about failure incidents in a large number of diverse cloud services. Our key insight is that cloud services today publicly expose detailed failure incident information on the web. We crawled such publicly available incident reports, published by a large number of cloud services, using our own web crawling framework. Using this approach, we conduct the first large-scale measurement study of the reliability of 160 cloud applications and providers, covering more than 12,500 incidents over a period of up to 3 years.