Domain-Specific Reliability Modeling

01 July 2001

We present a domain-specific reliability modeling framework that is tied to fault classification and fault-tolerance mechanisms and that can be used to model classes of systems. The basic structure of the models is an escalating fault recovery strategy modeled as a Markov chain. An escalating fault recovery strategy is an essential element of highly reliable systems, which are further characterized by fast, local, and specific error detection; error partitioning to determine the recovery action; and fast recovery with minimal system impact. We present empirical and analytic models for predicting outage duration distributions for systems with escalating recovery mechanisms. The models have been validated with historical data from several switching systems. We use the models to elucidate the structural properties of these recovery mechanisms that cause a system to appear either chaotic or well-behaved.
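To make the idea of an escalating recovery strategy modeled as a Markov chain concrete, the following is a minimal sketch and not the authors' model. It assumes a simple absorbing chain in which each recovery level is attempted in order, takes a fixed time, and either clears the fault or escalates to the next level, with the last level always succeeding. The level names, durations, and success probabilities are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryLevel:
    name: str
    duration_s: float  # assumed fixed time spent attempting this level
    p_success: float   # assumed probability that this level clears the fault


def outage_distribution(levels):
    """Return [(outage_duration_s, probability)] for an escalating strategy.

    Each entry corresponds to the chain being absorbed (fault cleared)
    at one level, after paying for every level attempted so far.
    """
    if not levels or levels[-1].p_success != 1.0:
        raise ValueError("the last recovery level must always succeed")
    dist = []
    p_reach = 1.0   # probability that escalation reaches the current level
    elapsed = 0.0   # cumulative time of all levels attempted so far
    for level in levels:
        elapsed += level.duration_s
        dist.append((elapsed, p_reach * level.p_success))
        p_reach *= 1.0 - level.p_success
    return dist


def mean_outage(levels):
    """Expected outage duration under the assumptions above."""
    return sum(t * p for t, p in outage_distribution(levels))


# Hypothetical three-level strategy: cheap local action first, reboot last.
levels = [
    RecoveryLevel("local retry", 0.1, 0.9),
    RecoveryLevel("process restart", 5.0, 0.8),
    RecoveryLevel("full reboot", 300.0, 1.0),
]

if __name__ == "__main__":
    for duration, prob in outage_distribution(levels):
        print(f"{duration:8.1f} s  with probability {prob:.3f}")
    print(f"mean outage: {mean_outage(levels):.2f} s")
```

Even in this toy version, most of the probability mass sits on short outages while a small fraction of faults escalates to a much longer one, which is the kind of structural effect on the outage duration distribution that the abstract refers to.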