
Architecting for Reliability - Detection and Recovery Mechanisms

01 January 2008


High availability, in the form of continuous service availability, is achieved in telecommunications systems by implementing extensive and effective error detection and recovery mechanisms with high coverage. Escalating detection and recovery mechanisms start with actions that deal with targeted errors at very low latency and impact, and can escalate to actions with longer recovery times and broader system impact. In this work we extend previous studies by combining the Markov models for escalating detection and escalating recovery into a unified model. The results show that this unified view aligns more closely with our experience. The model also shows, counterintuitively, that detection and recovery coverage should be balanced. Designers can use these models to evaluate alternative schemes for error detection and recovery to achieve a given system/service availability target.

Escalation is a technique in which the system attempts a local error recovery action, followed by more severe and more wide-ranging actions if the local actions do not succeed. When the fault and error are not covered, the system might enter a failed state that requires human intervention to recover. Human intervention lengthens the period of unavailability and generally results in a long outage. We combined the escalating detection and recovery Markov models from the previous papers into a comprehensive Markov model to provide insight into how escalating fault recovery mechanisms can be used optimally to achieve high system availability. The combined model begins with three levels of detection followed by three levels of recovery. If the earlier levels detect and recover from the error, the system returns to a working state more quickly and the later levels of escalation are avoided, resulting in higher system availability.
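
The escalation structure described above can be illustrated with a small calculation. The following is a minimal sketch, not the Markov model from the paper: it assumes each level is tried in order, that a level's mean time is spent whether or not it succeeds, that recovery always starts at its first level regardless of which detection level caught the error, and that any uncovered error falls through to manual repair. The names `Level`, `stage`, `expected_downtime`, and `availability`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Level:
    coverage: float    # probability this level succeeds, given it is reached
    mean_hours: float  # mean time spent at this level, success or not


def stage(levels: Sequence[Level]) -> Tuple[float, float]:
    """Return (expected hours spent in the stage, probability every level fails)."""
    reach = 1.0   # probability that the current level is reached
    hours = 0.0
    for lv in levels:
        hours += reach * lv.mean_hours
        reach *= 1.0 - lv.coverage  # escalate only if this level fails
    return hours, reach


def expected_downtime(detect: Sequence[Level],
                      recover: Sequence[Level],
                      manual_hours: float) -> float:
    """Expected outage per fault under the simplifying assumptions above."""
    det_hours, p_undetected = stage(detect)
    rec_hours, p_unrecovered = stage(recover)
    p_detected = 1.0 - p_undetected
    return (det_hours
            + p_detected * (rec_hours + p_unrecovered * manual_hours)
            + p_undetected * manual_hours)


def availability(mttf_hours: float,
                 detect: Sequence[Level],
                 recover: Sequence[Level],
                 manual_hours: float) -> float:
    """Steady-state availability as uptime / (uptime + expected downtime)."""
    down = expected_downtime(detect, recover, manual_hours)
    return mttf_hours / (mttf_hours + down)


# Illustrative (made-up) parameters: three detection and three recovery levels.
detect = [Level(0.90, 0.001), Level(0.80, 0.01), Level(0.50, 0.1)]
recover = [Level(0.90, 0.01), Level(0.80, 0.1), Level(0.50, 0.5)]
print(availability(1000.0, detect, recover, manual_hours=4.0))
```

Varying the coverage values at each level gives a rough feel for how much of the outage time comes from escalation and how much from the manual-repair fallback.
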
The results of the combined detection and recovery escalation model exhibit more of the expected behaviour in terms of system availability and recovery than the individual models did. For example, the combined model shows the range of availability that we expect from varying coverage factors. The model shows that both detection and recovery coverage factors must be high in order to achieve high availability: building a system with a low coverage factor for either detection or recovery will not result in high availability, which runs counter to conventional wisdom. Future work includes analyzing additional scenarios and considering cases in which the detection and recovery rates and coverage factors vary independently with the type of fault and error present.
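
One way to see why both coverage factors matter is a deliberately simplified single-level calculation. This is an illustrative assumption, not the paper's Markov model: let $c_d$ and $c_r$ be the detection and recovery coverage, $t_a$ the mean automatic recovery time, $t_m \gg t_a$ the mean outage when human intervention is needed, and $\mathrm{MTTF}$ the mean time to failure. All of these symbols are introduced here for illustration only.

```latex
\begin{aligned}
P_{\text{auto}} &= c_d \, c_r \\
E[D] &= c_d c_r \, t_a + \bigl(1 - c_d c_r\bigr)\, t_m \\
A &= \frac{\mathrm{MTTF}}{\mathrm{MTTF} + E[D]}
\end{aligned}
```

In this sketch the two coverage factors enter only through their product, so $P_{\text{auto}} \le \min(c_d, c_r)$: raising one coverage factor toward 1 cannot compensate for a low value of the other, and the long manual outage $t_m$ dominates the expected downtime $E[D]$.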