Measuring Resiliency through Field Data: Techniques, Tools and Challenges

03 October 2016

Data collected under real workload conditions provide valuable information about the stresses that systems encounter and their responses to them. Textual/numeric data and log files produced by applications, operating systems, networks, and other monitoring sources play a key role in assessing system reliability and resiliency properties. Practitioners, academia, and industry strongly recognize the inherent value of log data. Data-driven evaluation deepens our understanding of system dependability behavior and enables stronger design and better monitoring strategies.

The role of log files and data in measuring the dependability of production systems has been recognized for many years. Seminal contributions date back to the late 1970s, with studies on VAX mainframes [1]. Today, these studies have become particularly relevant for failure analysis and prediction in industrial systems and networks [2], [3]; logs are the primary source of data available to gain insight into runtime issues in these systems. The understanding that can be gained from logs on today's systems enables improved design and better monitoring and failure prediction strategies for future systems. However, in spite of recent advances, data-driven reliability evaluation still presents challenges due to the scale, complexity, and diversity of applications.