Automatic Inference of Cloud Application Performance Degradation
To help a cloud operator make effective decisions about horizontal scaling, migration, and root cause analysis, current cloud orchestration tools often need to monitor application- or service-level metrics, such as latency (response time) or throughput. Interpreting these metrics often requires significant experience and expertise with the particular application. For instance, a large number of stress tests with various types of load or traffic are required to understand the application's typical performance and its maximum capacity. Beyond the cost of running such tests, in many situations the application owner may not want to reveal such metrics to the cloud provider, because of potential privacy issues or monitoring overhead. Alternatively, deployed applications can be viewed as black boxes that reveal only system-level metrics (e.g., CPU, memory, or disk utilization) to the orchestration platform. Indeed, many existing approaches fall into this paradigm and apply threshold-based rules to system-level metrics to make orchestration decisions. However, creating effective rules also requires a deep understanding of the application, which is gained (often through trial and error) by searching for combinations of system-level metrics that serve as proxies for important service-level metrics. To reduce this overhead, we propose INDiCA (INference of Degradation in Cloud Applications), a framework that supports automated orchestration of black-box multi-service applications in the cloud without measuring service-level metrics. Our framework uses machine learning to infer resource saturation, an important cause of service degradation, from measured system-level metrics. We find that resource saturation can be accurately inferred for a new (unseen) application, running on identical hardware, based only on these system-level metrics.
Thus, our framework can significantly reduce the burden that stress testing cloud applications places on application developers and owners.