The availability boost provided by geographically redundant systems can finally be estimated and optimized using breakthrough mathematical modeling. Businesses invest in geographically redundant systems, also known as “georedundancy,” to enable business continuity following a force majeure or major disaster event, whether natural or man-made. Professionals have often suspected that investments in georedundancy could mitigate less dramatic failure events, but no quantitative modeling or predictions were available to characterize and maximize the benefits.
Further reading: For a book-length discussion of this topic, see Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability For Computer-Based Systems, by Eric Bauer, Randee Adams and Daniel Eustace, Wiley-IEEE Press, October 2011.
Now, mathematical modeling developed by Alcatel-Lucent shows how businesses have an opportunity to use their systems’ existing georedundancy to enhance service availability. This analysis has also identified specific methods that can be used for optimization, such as client-initiated recovery, faster time out and retry parameters, and rapid context restoration. As the name implies, client-initiated recovery enables client applications to automatically detect unavailability of a server (due to either a disaster event or an extended service outage) and automatically recover service to an alternate system, often the georedundant system used for disaster recovery. Client-initiated recovery is particularly important for promptly addressing “sleeping” failures, which go undetected by the failed system itself for minutes or longer. Because the failed system is unaware of the problem, it does not promptly initiate automatic service recovery, and users experience service downtime.
Expectations for availability
The primary purpose of georedundant systems is to enable business continuity following a fire, flood or other disaster that impacts an entire data center or multiple systems in a single office. In these situations, safety of people in the impacted facility is the first priority, followed by recovery of critical services. Georedundancy, combined with methodical disaster recovery plans and mostly manual procedures, enables service to be recovered in days or hours after a disaster event. High-availability systems and mechanisms, on the other hand, are concerned with automatically detecting and recovering service following a single failure event, usually a hardware or software failure. In these scenarios, high-availability systems must recover within seconds, and often strive to have no more than 315 seconds of service downtime per system per year (99.999% service availability). Many engineers are familiar with the following simple equation for estimating service availability across a pair of elements, each with availability A_element, like headlights on an automobile:
$$A_{\text{pair}} = 1 - (1 - A_{\text{element}})^2$$
This equation implies that a pair of 99.9% elements (each with almost 9 hours of annualized unplanned service downtime) will have a service availability of 99.9999% (six 9s), which equals 32 seconds of service downtime per year. While this formula may be reasonable for simple components like headlights, it makes several assumptions that do not hold for information and communications systems. For instance, the formula assumes instantaneous and perfect failure detection (that is, no silent failures), as well as instantaneous and perfect service recovery to the redundant system. However, for information and communications services, failure detection and service recovery are inherently complex. Therefore, they are neither instantaneous nor perfect, and some downtime is inevitable. In other words, having twice as much hardware in two locations doesn’t provide an immediate exponential boost in service availability. Georedundancy can only offer service availability improvement if failure detection and failover to the redundant system are rapid and reliable. This principle can be easily understood by imagining two laptop computers operating side by side. If one crashes, the other doesn’t instantly provide backup service. First the crash must be detected; then additional time is required to recreate the first computer’s pre-failure context and application state on the second computer.
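The downtime figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the familiar parallel-redundancy form of the pair formula (pair availability = 1 − (1 − element availability)²) and an average year of 365.25 days; the function names are illustrative and do not come from the original analysis.

```python
# Assumption: an average year of 365.25 days is used to annualize downtime.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def annual_downtime_seconds(availability: float) -> float:
    """Expected unavailable seconds per year for an availability in [0, 1]."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def ideal_pair_availability(element_availability: float) -> float:
    """Availability of a redundant pair under the idealized assumptions:
    instantaneous, perfect failure detection and failover."""
    return 1.0 - (1.0 - element_availability) ** 2


if __name__ == "__main__":
    cases = [
        ("single 99.9% element", 0.999),
        ("idealized pair of 99.9% elements", ideal_pair_availability(0.999)),
        ("five 9s target", 0.99999),
    ]
    for label, availability in cases:
        seconds = annual_downtime_seconds(availability)
        print(f"{label}: {seconds:,.1f} s/year ({seconds / 3600:.2f} h/year)")
```

The idealized pair result is an upper bound: it holds only if detection and failover cost no time and never fail, which is exactly the assumption the text goes on to question.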
Consequently, georedundant systems need to speed up failure detection and recovery at the element level to optimize availability. Alcatel-Lucent used continuous-time Markov availability models to quantitatively compare the service availability benefit of today’s three most widely used external recovery schemes to a standalone system configuration. As shown in Figure 1, client-initiated recovery clearly makes the greatest contribution to availability. Manual recovery is adequate for disaster recovery, and system-driven recovery provides some modest reduction in downtime through automation. But client-initiated recovery is the only scheme that can address uncovered failures, which make up most of the product-attributed service downtime accrued by individual high-availability system instances. These failures are neither automatically detected nor, therefore, automatically recovered by the failed system itself. However, they can be detected by a server’s clients, which enables the clients to automatically initiate service recovery to a redundant (external) system as soon as the failure is detected.
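To see why uncovered failures dominate downtime, a back-of-envelope expectation is enough. The sketch below is not the continuous-time Markov model referred to above; it is a much simpler expected-value calculation, and every number in it (failure rate, coverage, recovery durations) is a hypothetical input chosen only for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365.25 * 24 * 60


@dataclass
class RecoveryScheme:
    name: str
    covered_recovery_min: float    # outage length when the system detects its own failure
    uncovered_recovery_min: float  # outage length when the failure is silent (uncovered)


def annual_downtime_min(scheme: RecoveryScheme,
                        failures_per_year: float,
                        coverage: float) -> float:
    """Expected downtime per year: each failure is covered with probability
    `coverage`, otherwise it is uncovered and takes longer to recover."""
    covered = coverage * scheme.covered_recovery_min
    uncovered = (1.0 - coverage) * scheme.uncovered_recovery_min
    return failures_per_year * (covered + uncovered)


def availability(downtime_min: float) -> float:
    """Convert annual downtime in minutes to a fractional availability."""
    return 1.0 - downtime_min / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical inputs: 4 failures/year, 90% detected by the system itself,
    # 15 s internal recovery, 30 min to notice and fix a silent failure by hand,
    # 30 s for clients to time out and move to the redundant system.
    schemes = [
        RecoveryScheme("standalone", 0.25, 30.0),
        RecoveryScheme("client-initiated recovery", 0.25, 0.5),
    ]
    for scheme in schemes:
        downtime = annual_downtime_min(scheme, failures_per_year=4, coverage=0.9)
        print(f"{scheme.name}: {downtime:.2f} min/year, "
              f"availability {availability(downtime):.6%}")
```

With these made-up inputs, the small fraction of silent failures accounts for most of the standalone downtime, which is the effect that client-initiated recovery targets.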
Most widely used georedundant recovery schemes
Network elements have implemented georedundancy in many ways, but their recovery schemes can be divided into the following general categories:
- Manual georedundant recovery – This method is suited to disaster recovery, where service outages are typically lengthy. Notified by alarms, complaints or other means, a human detects that a data center is damaged or a system element has failed, and a decision is made to manually initiate a georedundant recovery by switching service to a redundant element, which could take hours.
- System-driven recovery – Automatic georedundant recovery mechanisms can usually identify system failures and recover traffic onto the georedundant system faster than manual recovery. With this automated scheme, server systems monitor each other, using a heartbeat between the active and standby elements. If one system detects a profound failure in the other (for example, after a few lost heartbeats), the redundant system takes over and traffic is automatically rerouted, which takes a few minutes. Unfortunately, system-driven recovery mechanisms are often ineffective at detecting uncovered failures, because the system instance experiencing the uncovered failure is, by definition, unaware of the failure, and the redundant system instance may not be effective at detecting the uncovered failure either. For example, if B1 (shown in Figure 2) experienced an uncovered failure, it would continue communicating with the external redundant system (B2), indicating that it is operational. Consequently, there is no trigger for an automatic system-driven recovery action. Another problematic scenario for system-driven recovery occurs when the systems are unable to communicate with each other and try to assess the situation independently. In this case, a “split brain” situation may occur, in which both systems believe they are active and that the other system is not operational.
- Client-initiated recovery – This automated recovery strategy can help mitigate uncovered system failures, because the system’s client elements (which are outside the system) can detect a server failure when it stops responding, even if the server itself or its georedundant mate cannot detect the failure. In this case, the client then redirects service to a redundant system, which takes only a few seconds. As an example, this recovery method is well suited to the Domain Name System (DNS). When a DNS client finds that a particular DNS server instance is non-responsive or returns RCODE 2 ‘Server Failure,’ the client simply retries its request to an alternate DNS server. In this way, highly available DNS service can be delivered by a pool of simplex DNS server instances. DNS has this strong potential for high availability because a client can easily choose another DNS server without needing to carry over context from the previous DNS server or reauthenticate with the alternate one. Client-initiated recovery is not necessarily as effective for some traditional applications. For some state- and context-driven systems in which key data may be lost in the switch to another system, the client may need to make additional checks to ensure that the primary system has truly failed rather than experiencing a momentary issue, such as network congestion, before attempting recovery to the alternate system.
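The DNS-style behavior described for client-initiated recovery can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real resolver: each "server" is a hypothetical Python callable that returns an (rcode, answer) pair or raises TimeoutError when it does not respond, and the names query_with_failover and ServerUnavailable are invented for the example. Only the RCODE value 2 ("Server Failure") is taken from the text.

```python
from typing import Callable, Sequence, Tuple

RCODE_NO_ERROR = 0
RCODE_SERVER_FAILURE = 2  # explicit failure indication named in the text

# Hypothetical server interface: takes a query, returns (rcode, answer),
# or raises TimeoutError if no response arrives in time.
Server = Callable[[str], Tuple[int, str]]


class ServerUnavailable(Exception):
    """Raised when no server in the pool produced a usable answer."""


def query_with_failover(query: str, servers: Sequence[Server]) -> str:
    """Try each server in order; move on after a timeout (implicit failure)
    or a Server Failure return code (explicit failure)."""
    for server in servers:
        try:
            rcode, answer = server(query)
        except TimeoutError:
            continue  # server is silent: the client detects what the server cannot
        if rcode == RCODE_SERVER_FAILURE:
            continue  # server reported failure: try the next instance
        return answer
    raise ServerUnavailable(f"no server could answer {query!r}")


# Simulated server instances for demonstration only.
def silent_primary(query: str) -> Tuple[int, str]:
    raise TimeoutError("no response")


def failing_secondary(query: str) -> Tuple[int, str]:
    return (RCODE_SERVER_FAILURE, "")


def healthy_georedundant(query: str) -> Tuple[int, str]:
    return (RCODE_NO_ERROR, "192.0.2.10")  # documentation-range address


if __name__ == "__main__":
    pool = [silent_primary, failing_secondary, healthy_georedundant]
    print(query_with_failover("www.example.com", pool))
```

The sketch works this simply because the request is stateless. A client of a stateful service would also need to reestablish its session on the alternate server and confirm that the primary has truly failed before switching.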
Recommendations for improving service availability
High-availability systems are optimized to minimize service disruption due to single hardware or software failures by automatically detecting the failure and recovering service to a redundant instance. On rare occasions, multiple failures will overwhelm or silent failures will confuse high-availability mechanisms and, thus, accrue outage downtime. While it is possible for georedundant recovery to mitigate multiple failure outages, client-initiated georedundant recovery can also mitigate silent failure downtime. Thus, to leverage georedundant system investments to augment standard high-availability mechanisms, network providers should consider the following recommendations:
- Use client-initiated recovery – Clients should detect critical failures explicitly indicated with return codes, as well as failures implicitly indicated by expiration of time outs and by exceeding the maximum number of retries, and support recovery to redundant system instances when internal high-availability mechanisms do not recover service fast enough.
- Optimize time out and retry parameters – Shorter time outs and fewer retries shorten failure detection latency, but the values must remain large enough to avoid false positives: retries should still absorb transient events, and time outs should still tolerate occasional service delays.
- Enable rapid session reestablishment – If a client’s session must be identified, authenticated and authorized to another server instance, this process should be made as brief as possible, so it adds minimal incremental latency to service recovery.
- Implement overload control – These mechanisms can prevent floods of primary and retried client-initiated recovery requests, which have the potential to crash the alternate system moments after the primary’s failure has been detected.
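The effect of time out, retry and session reestablishment settings on client-perceived recovery time can be estimated with simple arithmetic. The sketch below assumes a fixed time out per attempt, retries sent immediately after each time out, and no backoff; the parameter values in the usage example are hypothetical, not recommended settings.

```python
def detection_latency_s(timeout_s: float, max_retries: int) -> float:
    """Worst-case time before a client declares a silent server failed:
    the initial attempt plus every retry must each time out."""
    return timeout_s * (1 + max_retries)


def client_recovery_time_s(timeout_s: float,
                           max_retries: int,
                           session_setup_s: float) -> float:
    """Detection latency plus the time needed to identify, authenticate
    and authorize the session on the alternate server."""
    return detection_latency_s(timeout_s, max_retries) + session_setup_s


if __name__ == "__main__":
    # Hypothetical parameter sets, for comparison only.
    generous = client_recovery_time_s(timeout_s=5.0, max_retries=3, session_setup_s=2.0)
    tuned = client_recovery_time_s(timeout_s=1.0, max_retries=1, session_setup_s=0.5)
    print(f"generous settings: {generous:.1f} s until service resumes")
    print(f"tuned settings:    {tuned:.1f} s until service resumes")
```

Tightening the values shortens recovery, but a time out shorter than the server's normal worst-case response time would turn ordinary delays into needless failovers, which in turn adds to the load that overload control on the alternate system has to absorb.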
The Alcatel-Lucent IP Multimedia Subsystem (IMS) solution has been enhanced to follow these recommendations. Client-initiated recovery is used on most of the interfaces between components within the solution (for example, S-CSCF to HSS, S-CSCF to application servers, and BGCF to MGCF). Failure detection and failover times are often short enough that, if a component in the solution fails, the solution recovers automatically before sessions are affected, so sessions still succeed and IMS subscribers are not aware that a failure occurred.
Innovative mathematical modeling now provides a deeper view into the availability benefits of georedundant systems. By enabling a better understanding of the factors that enhance availability, this information supports a cost-effective, methodical approach to optimization. Alcatel-Lucent is researching how these insights can be applied to new and existing solutions and applications, particularly in the context of cloud computing.