
Reliable Cloud Computing – Key Considerations


Cloud-based services offer greater flexibility and economy than many traditional information services. But can they meet or even exceed end-users’ expectations for reliability and availability?

Cloud computing offers a compelling business model for information services. Consequently, many new applications are being developed explicitly for cloud deployment, while many traditional applications will eventually evolve to the cloud. End users want these cloud-based services to be at least as reliable and available as traditional offerings. And to meet these expectations, cloud service providers and cloud consumers need to gain a solid understanding of the unique challenges of cloud computing and learn how to mitigate risks.

The new challenges are primarily related to virtualization, rapid elasticity and resource sharing. These capabilities enable a new level of flexibility, convenience and economy, but they also make cloud computing inherently more complicated than traditional computing. This complexity creates more potential points of failure.

Delivering reliable and available cloud-based services must start with an awareness of how operations have changed in the cloud, including recognition of where new points of vulnerability lie. For example, load distribution, overload control and data management are all more complex in the cloud, and new usage models enabled by cloud computing can increase the impact of a site or server failure.

After carefully identifying these issues, cloud service providers and cloud consumers can then take advantage of architectural opportunities for mitigating the risks. When this approach is backed by traditional engineering diligence, cloud-based services have the potential to meet or exceed the service reliability and availability requirements of traditional deployments. Satisfying these requirements can be crucial for all players in the cloud environment, where accountability is often split between cloud service providers and cloud consumers — and where standards bodies are still working to establish clear outage measurement rules.

Evolving to the cloud

To benefit from elastic growth and other new capabilities offered by cloud computing, many traditional applications will evolve to a cloud environment over several releases. The following usage scenarios, organized from the simplest to the most complex, illustrate the advantages that virtualization brings at each stage.

  • Hardware independence – Virtualization minimizes the dependence of an application on its underlying hardware. The application may still require the same machine instruction set, such as an Intel architecture, but its software is decoupled from hardware-based details, such as physical memory and storage, so it can be easily moved onto new hardware.
  • Server consolidation – In this case, virtualization increases resource utilization because multiple applications can share hardware resources, including previously underutilized hardware.
  • Multi-tenant – With this usage scenario, multiple independent instances of an application, such as e-mail or web service, can be consolidated on a single virtualized platform. The instances are then available simultaneously to diverse user communities.
  • Virtual appliance – In this vision of virtualization, defined by the Distributed Management Task Force (DSP2017), applications are delivered as turnkey software, prepackaged with operating systems, protocol stacks and supporting software. This approach allows suppliers to thoroughly test the production configuration of all system software, while customers enjoy simpler installation and maintenance.
  • Cloud deployment – This usage scenario includes rapid elasticity and is the typical endpoint of the evolution of an application to the cloud. It offers the most flexible configuration, which can expand or contract automatically in response to changing workloads.

Understanding the impact

While the new usage scenarios of the cloud deliver important benefits, they also present new challenges. For example:

Co-residency: This type of server consolidation usage model makes it more difficult to predict application performance. Vulnerability to service impairments due to “noisy neighbor” applications is greater.

These challenges can be mitigated with a fully tested, high-availability architecture that supports failure containment and recovery of each of the applications.

Multi-tenancy: Multi-tenancy has the same cost benefits and challenges as co-residency. But the challenges are more pronounced because failures may impact multiple user populations. Multi-tenancy also poses a greater security challenge, as user populations must be kept completely separate.

To mitigate these challenges, a high-availability architecture is required. It should support rigid failure containment and independent service recovery. Workflows should be tested under various failure scenarios. Robustness testing must ensure that each tenant is appropriately isolated; security testing should make sure that there is no cross-tenant access to applications or resources.
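
The cross-tenant check described above can be sketched as a small automated probe. This is only an illustration under stated assumptions: `TenantStore`, `SharedStore` and `find_cross_tenant_leaks` are hypothetical names invented here, and a real test would exercise the actual service interfaces rather than an in-memory dictionary.

```python
class TenantStore:
    """Hypothetical in-memory store with one namespace per tenant."""

    def __init__(self):
        self._data = {}  # tenant_id -> {key: value}

    def put(self, tenant_id, key, value):
        self._data.setdefault(tenant_id, {})[key] = value

    def get(self, tenant_id, key):
        # Look up only inside the caller's own namespace.
        return self._data.get(tenant_id, {}).get(key)


class SharedStore(TenantStore):
    """Deliberately broken variant that ignores the tenant on reads."""

    def get(self, tenant_id, key):
        return next((ns[key] for ns in self._data.values() if key in ns), None)


def find_cross_tenant_leaks(store, tenants):
    """Write one probe per tenant, then try to read it as every other tenant.

    Returns a list of (reader, owner) pairs for which the read succeeded.
    """
    for owner in tenants:
        store.put(owner, f"probe-{owner}", "private")
    leaks = []
    for owner in tenants:
        for reader in tenants:
            if reader != owner and store.get(reader, f"probe-{owner}") is not None:
                leaks.append((reader, owner))
    return leaks


isolated = find_cross_tenant_leaks(TenantStore(), ["tenant-a", "tenant-b"])
broken = find_cross_tenant_leaks(SharedStore(), ["tenant-a", "tenant-b"])
print("isolated store leaks:", isolated)
print("shared store leaks:", broken)
```

Including a deliberately broken store is a way to confirm that the probe itself can detect a leak, so an empty result from the real system is meaningful.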

Establishing new ways to handle service load

In a cloud environment, service load can potentially be distributed seamlessly across multiple servers, locations and cloud providers, with the assistance of load balancing mechanisms and policies. The challenge is to satisfy wide-ranging requirements, such as subscriber affinity, redundancy, latency, availability, security, capacity, and even regulatory issues.

For example, appropriate load distribution architecture should consider the number of application instances, their proximity to end users, and application and data redundancy. Policies must also be clearly defined, so service distribution can be managed in accordance with latency, regulatory and security requirements. The distance between data centers should be considered, too, particularly when data exchanges are frequent and high transactional reliability is required.
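
As a rough illustration of such a policy, the sketch below filters candidate instances by regulatory region, latency and load, then picks the closest one. The `Instance` fields, the region names, the thresholds and the `choose_instance` function are all assumptions made for this example, not part of any particular load balancer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    region: str        # where the instance (and its data) resides
    latency_ms: float  # measured latency to the requesting user
    load: float        # fraction of capacity in use, 0.0 to 1.0


def choose_instance(instances, allowed_regions, max_latency_ms, max_load=0.8):
    """Return the lowest-latency instance that satisfies every policy, or None."""
    eligible = [
        i for i in instances
        if i.region in allowed_regions      # regulatory constraint
        and i.latency_ms <= max_latency_ms  # latency requirement
        and i.load < max_load               # leave headroom on each instance
    ]
    if not eligible:
        return None
    # Prefer the closest instance; break ties with the lighter load.
    return min(eligible, key=lambda i: (i.latency_ms, i.load))


instances = [
    Instance("app-eu-1", "eu", 40.0, 0.55),
    Instance("app-eu-2", "eu", 25.0, 0.90),
    Instance("app-us-1", "us", 15.0, 0.30),
]
choice = choose_instance(instances, allowed_regions={"eu"}, max_latency_ms=100.0)
print(choice.name)  # app-eu-1
```

Here the fastest instance is excluded by the regional policy and the second fastest by the load limit, which shows how policy constraints can override pure proximity.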

Overload control: To handle overload events, traditional systems set capacity thresholds, then shed or reject traffic as needed to keep the system from crashing. Cloud management mechanisms, however, can add new instances of the application to share the growing traffic load. For example, rapid elasticity can be used to address traffic spikes and shorten the time a system is in overload as extra service capacity is brought online. Native overload control mechanisms should still be present to handle any excess traffic during the interval before scaling activates and the new instances are sharing the traffic. The same mechanisms are needed to manage traffic when the offered load exceeds maximum elastic capacity (for example, because of license or policy limits).
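
The interplay between shedding and scaling can be sketched with a simple interval-based simulation. The function name, the capacity figures and the assumption that a requested instance becomes available exactly one interval later are all illustrative choices, not measurements from a real system.

```python
def handle_interval(offered, instances, per_instance_capacity,
                    max_instances, scale_up_at=0.8):
    """Process one interval of traffic.

    Returns (served, shed, instances_for_next_interval).
    """
    capacity = instances * per_instance_capacity
    served = min(offered, capacity)
    shed = offered - served  # native overload control rejects the excess
    next_instances = instances
    # Request one more instance when load passes the threshold, unless the
    # elastic limit (for example, a license or policy cap) has been reached.
    if offered > scale_up_at * capacity and instances < max_instances:
        next_instances = instances + 1
    return served, shed, next_instances


instances = 2
results = []
for offered in [150, 260, 260, 500, 500]:
    served, shed, instances = handle_interval(
        offered, instances, per_instance_capacity=100, max_instances=4)
    results.append((served, shed, instances))

for row in results:
    print(row)
```

In this run, traffic is shed in the second interval while the third instance is still starting, and again in the last two intervals, when demand exceeds the maximum of four instances. Both are cases where native overload control remains necessary.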

Rapid elasticity: Besides supporting overload control, this powerful mechanism enables more efficient use of hardware resources. It can automatically increase or decrease the resources of a virtual machine (vertical growth), or expand or reduce the number of virtual machines (horizontal growth). Horizontal growth can stay within the limits of a single data center or extend into additional cloud data centers; the latter is known as outgrowth.

Effective use of rapid elasticity is based on resource monitoring, policies and thresholds. Hysteresis (that is, different growth and shrink thresholds) should be used to prevent capacity oscillations. To mitigate the risks associated with rapid elasticity, systems must be thoroughly tested and cloud-based applications must be designed to:

  • Manage scaling and de-scaling
  • Accurately monitor resource utilization and performance
  • Support well-defined policies, backed by robust trigger mechanisms to control growth and contraction
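
Hysteresis can be sketched as a scaling decision with separate grow and shrink thresholds. The threshold values and the function name `next_instance_count` are assumptions chosen for illustration; real policies would be tuned to the application.

```python
def next_instance_count(current, utilization, grow_above=0.75,
                        shrink_below=0.40, min_instances=1, max_instances=10):
    """Decide the instance count from average utilization (0.0 to 1.0).

    The gap between shrink_below and grow_above is a dead band in which
    no scaling occurs, which prevents capacity oscillations.
    """
    if shrink_below >= grow_above:
        raise ValueError("shrink threshold must be below grow threshold")
    if utilization > grow_above and current < max_instances:
        return current + 1
    if utilization < shrink_below and current > min_instances:
        return current - 1
    return current


count = 2
history = []
for utilization in [0.80, 0.60, 0.50, 0.35, 0.45]:
    count = next_instance_count(count, utilization)
    history.append(count)

print(history)  # [3, 3, 3, 2, 2]
```

With a single threshold at, say, 0.5, the samples 0.60, 0.50 and 0.45 would trigger repeated growth and shrinkage. With the dead band, they leave capacity unchanged.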

Addressing widespread data storage

For service reliability, all data must be redundantly stored and managed to survive the failure of a component. In addition, data synchronization presents new challenges, because cloud transactions can span multiple application instances and be stored in several locations. ACID and BASE mechanisms are typically used to keep data synchronized.

  • ACID (atomicity, consistency, isolation, durability) properties are essential for transactional reliability and immediate consistency. However, they can be resource intensive and introduce latency into transactions.
  • BASE (basically available, soft state, eventual consistency) properties enable simpler solutions that are less resource intensive. They are appropriate when data consistency can be achieved over longer time periods. For example, they are well suited to many web services, such as e-mail.
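
To make the ACID side concrete, the sketch below uses Python's built-in `sqlite3` module to show atomicity and consistency: a transfer either applies both updates or neither. The `accounts` table, the names and the amounts are invented for this example, and a single in-memory database stands in for what would be a distributed store in a cloud deployment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically; return True on commit, False on rollback."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, rejected, balances)
```

The second transfer credits the destination first and then fails on the debit. The rollback removes the partial credit, which is the guarantee a BASE design relaxes in exchange for lower latency and resource use.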

Using additional mechanisms for high availability

Cloud services should be redundant at the software and hardware levels and incorporate high-availability mechanisms at their foundation, including automatic failure detection, reporting and recovery mechanisms. To enhance the internal mechanisms, the virtualization platform can provide an additional layer of failure detection and recovery at the virtual machine level. It is important to ensure that the two mechanisms coexist and do not conflict during failure recovery.
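
One possible way to keep the two layers from colliding is a simple escalation rule: the platform defers to application-level recovery for a bounded grace period and takes over only if the failure persists. This is a sketch of one such policy, not a description of any specific virtualization platform; the function name and the 30-second grace period are assumptions.

```python
def recovery_owner(failure_age_s, app_reports_recovering, app_grace_s=30.0):
    """Decide which layer should act on a failure that is still present.

    failure_age_s: seconds since the failure was first detected.
    app_reports_recovering: True if the application's own high-availability
        mechanism has signalled that it is handling the failure.
    """
    if app_reports_recovering and failure_age_s < app_grace_s:
        # Give the application's own failover time to finish.
        return "application"
    # Either the application did not react or it ran out of time,
    # so the platform restarts or relocates the virtual machine.
    return "platform"


print(recovery_owner(5.0, True))    # application
print(recovery_owner(45.0, True))   # platform
print(recovery_owner(5.0, False))   # platform
```

The grace period should be longer than the application's worst-case recovery time, otherwise the platform may restart a virtual machine in the middle of an application-level failover.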

Dealing with latency challenges

For isochronous applications like video calling, it’s crucial to prevent latencies that disrupt service quality. But with virtualized configurations, resource contention, real-time notification latency, and virtualization overhead can all add latency. To address these issues, architects need to take the following actions:

  • Carefully identify the real-time isochronous expectations for a virtualized platform. For example, the maximum notification latency must explicitly state how “late” a real-time notification interrupt can be.
  • Determine whether the target platform or infrastructure service can actually meet the identified requirements.
  • Establish a recommended architecture and configuration for optimal isochronous performance on the specified platform or infrastructure service.
  • Prototype and test the service to validate whether it is technically feasible to meet its requirements on a virtualized platform.
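
A first prototype step along these lines might simply measure how late periodic wake-ups arrive on the candidate platform and compare the worst case with the stated budget. The sketch below does this with Python's standard timers; the 10 ms interval and 5 ms budget are assumed values, and a user-space sleep is only a rough proxy for a real-time notification interrupt.

```python
import time


def measure_lateness(interval_s, samples, sleep=time.sleep,
                     clock=time.perf_counter):
    """Return how late (in seconds) each of `samples` timed wake-ups was."""
    lateness = []
    for _ in range(samples):
        start = clock()
        sleep(interval_s)
        # Time actually elapsed minus the time requested.
        lateness.append(clock() - start - interval_s)
    return lateness


def meets_budget(lateness, max_lateness_s):
    """True if the worst observed lateness is within the latency budget."""
    return max(lateness) <= max_lateness_s


observed = measure_lateness(interval_s=0.010, samples=20)
print(f"worst lateness: {max(observed) * 1000:.3f} ms")
print("within 5 ms budget:", meets_budget(observed, 0.005))
```

The `sleep` and `clock` parameters can be replaced with test doubles so the logic can be verified deterministically. On a real platform the measurement would be repeated under representative load, because contention from co-resident virtual machines is a likely source of lateness.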

For an in-depth analysis of these challenges, with recommendations for mitigating risks, see Reliability and Availability of Cloud Computing, which will be published by Wiley-IEEE Press in summer 2012.

Maintaining traditional engineering diligence

Cloud computing introduces new technologies with unique benefits and risks. But it does not change the basic structure or importance of the engineering diligence required to maintain reliability and availability. The process of maintaining this diligence can be summarized in the following steps:

  • Clearly define service reliability and availability requirements.
  • Model and analyze overall solution architecture to ensure that it is capable of meeting reliability requirements over the long term.
  • Carry out reliability diligence on individual components to make sure they can meet the overall solution requirements.
  • Test the solution thoroughly and make sure that automated methods of failure detection and recovery work effectively.
  • Track the performance of the solution in the field and follow up with corrective actions as needed.
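
The modeling step can start with the standard availability arithmetic for components in series and in parallel. The sketch below assumes independent failures, and the component availabilities are made-up figures for illustration only.

```python
from math import prod


def series(*availabilities):
    """All components are required: multiply their availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one redundant component suffices: 1 minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected annual downtime, using a 365.25-day year."""
    return (1.0 - availability) * 365.25 * 24 * 60


# Two redundant application servers (99.9% each) in front of one database (99.99%).
app_tier = parallel(0.999, 0.999)
solution = series(app_tier, 0.9999)

print(f"application tier: {app_tier:.6f}")
print(f"overall solution: {solution:.6f}")
print(f"expected downtime: {downtime_minutes_per_year(solution):.1f} minutes/year")
```

Under these assumptions the redundant application tier contributes very little downtime, and the single database dominates the result at roughly 53 minutes per year. A model like this helps show which component most needs reliability diligence.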

When this diligence process is applied to mitigate both traditional risks and the new challenges of the cloud, cloud-based services have the potential to meet or exceed the service reliability and availability requirements of traditional deployments.

To contact the authors or request additional information, please send an e-mail to

Eric Bauer

About Eric Bauer

Eric Bauer is reliability engineering manager in the IP Platforms Group of Alcatel-Lucent. He currently focuses on reliability and availability of Alcatel-Lucent's cloud-related offerings, IMS and other solutions. Before focusing on reliability engineering topics, Mr. Bauer spent two decades designing and developing embedded firmware, networked operating systems, IP PBXs, internet platforms, and optical transmission systems. He has been awarded more than a dozen US patents, authored four engineering books and has published several papers in the Bell Labs Technical Journal. Mr. Bauer holds a BS in Electrical Engineering from Cornell University, Ithaca, New York, and an MS in Electrical Engineering from Purdue University, West Lafayette, Indiana. He lives in Freehold, New Jersey.

Randee Adams

About Randee Adams

Randee Adams is a Consulting Member of Technical Staff in the IP Platforms group of Alcatel-Lucent. Currently she is focusing on the reliability of Alcatel-Lucent’s software applications. Randee has written Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems, and Reliability and Availability of Cloud Computing.
