Designing for software resiliency and robustness in high-performance routing environments
Digital communications networks are of growing importance to society at large, including a wide variety of key industries, as essential, mission-critical infrastructure. At the heart of today’s advanced communications networks are IP routers supporting an ever-growing number of IP-based applications. These high-performance networks are under constant risk of attack. As we evolve to 5G, society is growing more and more dependent on this routing infrastructure, and the impact of failures on business and end users will be increasingly grave. It is thus no surprise that these networks are under scrutiny from the highest levels of government. Their resilience and robustness are now becoming issues of national security.
These qualities are founded in the design architecture, design approach and testing methodologies of a router’s operating system. These fundamental principles determine how communications systems measure up on reliability, high availability and security, and they have to be built in from the get-go — they cannot be an afterthought.
In developing the Nokia Service Router Operating System (SR OS), these principles have been fundamental. For instance, one of the first principles of SR OS is to maintain a single development stream — avoiding custom streams or forks. This strengthens our ability to test the system for software quality. We make a heavy investment in test automation with similar numbers of software developers and test engineers. They work side by side to write automated test code for every new line of code in the OS, which is then continuously regression tested using tens of thousands of servers 24x7x365. With a single OS stream, all the test cycles focus on the same software image. This is a big reason our code is so robust, virtually eliminating the occurrence of major bugs in the field.
Part of what makes it possible to have a single OS stream across all platforms is the modular design with separation of control and forwarding planes. Nokia uses a hardware abstraction layer that allows components of the OS, such as the control protocols, to be developed independently of the hardware. The result is that a single SR OS image will run on any hardware or chipset across the entire routing portfolio.
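To make the hardware abstraction idea a little more concrete, here is a minimal Python sketch. It is only an illustration of the general pattern, not SR OS code: the names `ForwardingHal`, `ChipsetA`, `ChipsetB` and `RouteManager` are hypothetical. The control-plane class is written once against an abstract interface, and two drivers with different internals sit behind it.

```python
from abc import ABC, abstractmethod


class ForwardingHal(ABC):
    """Hypothetical abstraction layer: what control-plane code needs from any hardware."""

    @abstractmethod
    def install_route(self, prefix, next_hop):
        ...

    @abstractmethod
    def dump(self):
        ...


class ChipsetA(ForwardingHal):
    """Illustrative driver that keeps routes in a dictionary."""

    def __init__(self):
        self._table = {}

    def install_route(self, prefix, next_hop):
        self._table[prefix] = next_hop

    def dump(self):
        return dict(self._table)


class ChipsetB(ForwardingHal):
    """Illustrative driver with a different internal layout (a list of pairs)."""

    def __init__(self):
        self._entries = []

    def install_route(self, prefix, next_hop):
        # Replace any existing entry for the prefix, then append the new one.
        self._entries = [(p, n) for p, n in self._entries if p != prefix]
        self._entries.append((prefix, next_hop))

    def dump(self):
        return dict(self._entries)


class RouteManager:
    """Control-plane code: written once, unaware of which chipset is underneath."""

    def __init__(self, hal):
        self._hal = hal

    def learn(self, prefix, next_hop):
        self._hal.install_route(prefix, next_hop)
```

The same `RouteManager` runs unchanged on either driver, which is the property that lets one software image span different hardware.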
As well as being hardware independent, the Nokia SR OS also uses distributed processes for added reliability, scalability and performance. We design the code to be symmetric multi-processing (SMP) safe so that the software can scale out and run multiple threads in parallel, taking advantage of modern multi-core and multi-threaded processors.
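As a small illustration of what "SMP safe" means in practice, the Python sketch below protects shared state with a lock so that several threads can update it without losing writes. The `RouteCounter` class and `run_workers` helper are hypothetical names chosen for this example. Python threads are used here only to show the locking discipline, not to demonstrate multi-core speed-up.

```python
import threading


class RouteCounter:
    """Shared counter guarded by a lock so concurrent updates are never lost."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def add(self, n=1):
        # The read-modify-write happens while holding the lock.
        with self._lock:
            self._count += n

    def value(self):
        with self._lock:
            return self._count


def run_workers(counter, workers=4, updates_per_worker=1000):
    """Start several threads that all update the same counter, then wait for them."""

    def work():
        for _ in range(updates_per_worker):
            counter.add()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()
```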
We also use real-time scheduling of processes to efficiently manage the allocation of system resources based on priority and process state. This ensures that time-critical processes always have the CPU cycles they need, which prevents network meltdowns caused by misconfigurations, buggy code in other routers or malicious attacks.
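The priority idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the SR OS scheduler: it has no preemption and ignores process state, and the `PriorityScheduler` class and the task names are hypothetical. It shows only that urgent work is always served before less urgent work, with first-in, first-out order within a priority level.

```python
import heapq
import itertools


class PriorityScheduler:
    """Strict-priority run queue: lower priority number means more urgent."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def run_next(self):
        """Return the name of the most urgent pending task, or None if idle."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name

    def drain(self):
        """Run everything that is pending and return the execution order."""
        order = []
        while self._heap:
            order.append(self.run_next())
        return order
```

In this model, a protocol keepalive submitted at priority 0 runs before a statistics export at priority 5 even if the export was queued first.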
As the need for the constant availability of mission-critical IP applications grows, redundancy is essential. With SR OS, all state is replicated across redundant control processors. This is an inherent architectural feature that can’t be added after the fact. It enables us to do cool things such as in-service software upgrades (ISSU) and non-stop routing and services, which ensure no (or minimal) impact on network and service availability.
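A toy Python model shows why replicated state matters for switchover. It is a sketch under simplifying assumptions (synchronous copying, in-memory dictionaries), and `ControlProcessor` and `RedundantPair` are hypothetical names rather than SR OS components. Because every update reaches the standby before it is considered done, the standby can take over with the same state.

```python
class ControlProcessor:
    """One control processor holding protocol and service state."""

    def __init__(self, name):
        self.name = name
        self.state = {}


class RedundantPair:
    """Active/standby pair in which every state change is copied to the standby."""

    def __init__(self):
        self.active = ControlProcessor("A")
        self.standby = ControlProcessor("B")

    def update(self, key, value):
        # Apply on the active processor, then replicate to the standby.
        self.active.state[key] = value
        self.standby.state[key] = value

    def switchover(self):
        """Swap roles; the new active already holds the replicated state."""
        self.active, self.standby = self.standby, self.active
        return self.active
```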
High availability also depends on security. First, SR OS has a robust set of features to protect and secure the router from attacks. This set of features includes securing access to the router, out-of-band management to prevent unauthorized administrative access, and using hardware QoS to prioritize traffic to the control processors, which prevents DoS attacks aimed at the control plane.
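The control-plane protection idea can be modeled in software, although in a router it is done in hardware. The Python sketch below is an assumption-laden illustration: the `CpuQueues` class, the traffic class names and the queue depths are all hypothetical. Each traffic class gets its own bounded queue toward the control processor, and queues are served in strict priority, so a flood in one class cannot starve routing protocol traffic.

```python
from collections import deque


class CpuQueues:
    """Bounded per-class queues toward the control processor, served in strict priority."""

    def __init__(self, depth_per_class):
        # Keys are class names, listed from most to least important.
        self._queues = {name: deque() for name in depth_per_class}
        self._depth = dict(depth_per_class)
        self.dropped = {name: 0 for name in depth_per_class}

    def enqueue(self, traffic_class, packet):
        """Queue a packet, or drop and count it if that class's queue is full."""
        q = self._queues[traffic_class]
        if len(q) >= self._depth[traffic_class]:
            self.dropped[traffic_class] += 1
            return False
        q.append(packet)
        return True

    def dequeue(self):
        """Return the next packet from the most important non-empty queue."""
        for q in self._queues.values():
            if q:
                return q.popleft()
        return None
```

With depths such as `{"routing": 4, "management": 4, "other": 2}`, a flood of "other" packets fills only its own two slots, and a routing update that arrives later is still the first packet handed to the control processor.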
SR OS also makes the router part of the network defense with streaming telemetry data for comprehensive stats and counters, flow analysis, traffic mirroring and traffic filtering. Properly architected, routers should have the ability to filter out 90% of the nuisance traffic associated with today’s volumetric DDoS attacks, reducing the reliance on much more expensive mitigation hardware.
To do this filtering, SR OS is designed so that access control lists (ACLs) can be applied at the flow level on a per-service or per-interface basis with zero performance penalties. SR OS leverages the capabilities of the Nokia FP4 IP routing silicon with terabit-level forwarding capacity and enhanced packet inspection to the packet payload level. This enables SR OS to isolate and discard malicious traffic flows tied to volumetric DDoS attacks at the first stage, on the router line card at the network edge, before they do damage to the network or targeted applications and services.
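For readers less familiar with ACLs, here is a minimal first-match filter in Python. It is a conceptual sketch only and does not reflect SR OS syntax or how the FP4 silicon evaluates filters: `AclEntry`, `evaluate` and the example rules are hypothetical. The example drops UDP traffic with source port 123, the pattern typical of NTP reflection floods, while still forwarding replies from one trusted server.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AclEntry:
    src: str                  # source prefix, e.g. "0.0.0.0/0"
    protocol: str             # "udp", "tcp" or "any"
    src_port: Optional[int]   # None matches any source port
    action: str               # "drop" or "forward"

    def matches(self, src_ip, protocol, src_port):
        return (
            ipaddress.ip_address(src_ip) in ipaddress.ip_network(self.src)
            and self.protocol in ("any", protocol)
            and self.src_port in (None, src_port)
        )


def evaluate(acl, src_ip, protocol, src_port, default="forward"):
    """Return the action of the first matching entry, or the default action."""
    for entry in acl:
        if entry.matches(src_ip, protocol, src_port):
            return entry.action
    return default


# Hypothetical policy: trust one NTP server, drop all other UDP/123 sources.
EXAMPLE_ACL = [
    AclEntry(src="192.0.2.10/32", protocol="udp", src_port=123, action="forward"),
    AclEntry(src="0.0.0.0/0", protocol="udp", src_port=123, action="drop"),
]
```

Entry order matters in a first-match list: the trusted server must appear before the broad drop rule, otherwise its replies would be discarded too.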
These requirements, and many more too numerous to be covered in a blog post, are built into the Nokia SR OS by design. This is the basis of our rock-solid reputation with customers for a high level of software quality: resilience, robustness and security are built in at the architectural level. As we evolve to 5G and routed networks take on national-level importance as essential infrastructure for our society, we believe that our OS design architecture, design approach and testing methodologies will come to be seen as a benchmark.