Rethinking data center fabric operations
Data center operations must keep pace with evolving applications
Data centers have been at the center of an architectural shift from traditional business applications to more distributed applications offered as SaaS running in the cloud. The underlying deployment models for these cloud applications have evolved with the advent of distributed microservices. Kubernetes is a distributed microservices operating system (OS) and an intent-based platform that has contributed to the success of distributed applications by automating their life cycle management while providing much-needed observability.
In today’s 5G and cloud era, modern, highly distributed applications are being implemented by a wide range of businesses, including cloud providers, colocation and interconnection providers, Communication Service Providers (CSPs) and enterprises across many industries. Software-defined 5G networks are a good example of a modern, highly distributed implementation, where operators and vendors agree on the need for a cloud-native 5G stack. Additionally, 5G will bring the telco cloud and the public cloud together, pushing modern data centers to become a highly distributed set of data centers.
These modern, distributed data center ecosystems will combine the cloud-native telco cloud with insightful data coming from the public cloud. This in turn drives the build-out of hierarchical and distributed data centers, which will need end-to-end network automation combined with plug-and-play capabilities to help improve efficiency and reduce costs.
Reliability in these implementations is not only about applying network automation to normal deployment scenarios. It also requires ‘design for failure’ thinking. We need to rethink the role of network automation so that it proactively implements cloud-native stacks that are ‘designed for failure’.
Managing all phases of the data center fabric operations lifecycle
Large-scale data center IT infrastructures typically have more than 100,000 servers hosting distributed cloud applications. The data center networks that support these IT infrastructures need to provide scalable connectivity and operations. This is achieved by managing groups of switches in the data center network as a logical unit called a fabric, and by operating the fabric as a whole by introducing automation. The terms Day-0, Day-1 and Day-2+ refer to the different phases of the software life cycle, such as design, deployment and operations. These terms can also be used to refer to the design, deployment and operation phases of the data center fabric life cycle.
Enhancing Network Operations (NetOps) to help deliver business and operational goals was a key decision and design criterion for the Nokia Data Center Fabric solution. In addition, our primary focus was to offer an intent-based automation toolkit that benefits all levels of operations teams, without requiring a team of network specialists. For operations teams with specialist staff, we focused on providing the tools to fine-tune and customize the system to their unique requirements.
Let’s review some of the essential building blocks and capabilities for a modern fabric operations toolkit.
Automation delivered via intent
First, let’s look at large-scale infrastructures of 100K+ servers that host modern distributed applications. Managing the life cycle of data center fabrics at scale requires automation, and automation at scale can only be delivered via intent. As with applications, network infrastructure needs observability working together with intent-based automation.
Large scale data center fabric deployments can minimize fabric OPEX by using simpler designs, such as horizontally scalable CLOS designs. Additionally, we may limit the set of network protocols and features and create smaller fault domains such as smaller broadcast domains. To minimize OPEX, we need modular and abstract intent-based fabric automation for multi-layer CLOS networks that automates Day-0 to Day-2+ operations.
A fabric is a form of network virtualization, and modern data centers need different forms of ‘network virtualizations’ – for example, a ‘logical distributed switch’, or a ‘logical distributed router’. Network virtualization combined with network automation applied across physical and virtual networks reduces human errors in highly distributed modern data center stacks.
Applying intent-based automation to data center design and workload connectivity
To deliver automation at scale, the fabric operations toolkit must enable certified, template-based abstract intent. In this model, operations teams can use fabric design templates that have been tested for stability and certified in the vendor’s lab networks. Additionally, in order to scale, the ‘fabric intent’ needs to be abstracted to such a level that operations teams do not need to be aware of the underlying advanced networking details. The abstract intent focuses on generic constructs of data center infrastructure, such as the ‘number of racks’, ‘servers per rack’ and ‘dual-homing’, to automatically design and deploy standard BGP-based IP fabrics that maximize the bisection bandwidth via CLOS-based topologies.
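To make the idea of abstract intent more concrete, here is a minimal Python sketch of how a handful of generic inputs could be expanded into a two-tier leaf-spine design. Everything in it is an assumption made for illustration: the `FabricIntent` fields, the port speeds, the oversubscription target and the ASN numbering scheme are hypothetical and do not describe any particular product or certified template.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricIntent:
    """Abstract intent (hypothetical fields modeled on the constructs above)."""
    racks: int
    servers_per_rack: int
    dual_homed: bool
    server_port_gbps: int = 25      # assumed server NIC speed
    uplink_gbps: int = 100          # assumed leaf-to-spine link speed
    oversubscription: float = 3.0   # assumed downlink:uplink capacity target


def expand_intent(intent: FabricIntent) -> dict:
    """Derive a two-tier leaf-spine design from the abstract intent."""
    leaves_per_rack = 2 if intent.dual_homed else 1
    leaf_count = intent.racks * leaves_per_rack

    # Every server in a rack has one port on each leaf of that rack.
    downlink_gbps = intent.servers_per_rack * intent.server_port_gbps

    # One uplink per spine: pick enough spines to meet the target ratio,
    # with at least two spines so a single spine is never the only path.
    uplink_budget = intent.oversubscription * intent.uplink_gbps
    spine_count = max(2, math.ceil(downlink_gbps / uplink_budget))

    # Illustrative eBGP numbering: spines share one private ASN and each
    # leaf gets its own. A real design must stay within a valid ASN range.
    spine_asn = 64512
    leaf_asns = {f"leaf-{i + 1}": spine_asn + 1 + i for i in range(leaf_count)}
    spines = [f"spine-{s + 1}" for s in range(spine_count)]

    # Full mesh between tiers: every leaf connects to every spine.
    links = [(leaf, spine) for leaf in leaf_asns for spine in spines]
    return {
        "spines": spines,
        "spine_asn": spine_asn,
        "leaf_asns": leaf_asns,
        "links": links,
    }


design = expand_intent(FabricIntent(racks=4, servers_per_rack=40, dual_homed=True))
print(len(design["leaf_asns"]), "leaves,", len(design["spines"]), "spines")
```

With these assumed defaults, four dual-homed racks of 40 servers expand to 8 leaves and 4 spines. The operator only supplies rack-level facts, and the networking detail is derived from them.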
Modern application workloads need seamless connectivity for virtual machines (VMs) or containers across a multi-layer CLOS network. This requires standards-based Layer 2 or Layer 3 connectivity so that everything is ‘open on wire’ (that is, not subject to proprietary control or data plane protocols), which also contributes to minimizing OPEX. EVPN-VXLAN is becoming a building block for service networking. Service network automation also deserves an intent-based approach, with ‘abstract’ intent so that operations teams don’t need highly trained and certified personnel to provision a service.
Fabric as code
The positive impact of DevOps on the distributed applications world indicates that applying similar methodologies to networks using NetOps will have a similar impact. A NetOps-friendly approach for small or large-scale data center fabric operations will provide extendable automation platforms.
The fabric operations toolkit must ensure its intent-based automation can be expressed in a declarative form, so that it fits the bigger movement of ‘infrastructure as code’. This is important for solutions spanning on-premises and off-premises hybrid clouds.
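One way to picture the declarative model is as a comparison between the desired state held in a repository and the state observed on the network. The Python sketch below is an assumption-laden illustration of that idea, not a description of any specific toolkit: the `plan_changes` function and the dictionary layout are hypothetical.

```python
def plan_changes(desired: dict, observed: dict) -> list:
    """Return the actions needed to move the observed state to the desired state.

    Both arguments map a resource name to its configuration (hypothetical layout).
    """
    actions = []
    # Resources declared in code but missing from the network.
    for name in sorted(desired.keys() - observed.keys()):
        actions.append(("create", name, desired[name]))
    # Resources present in both places whose configuration has drifted.
    for name in sorted(desired.keys() & observed.keys()):
        if desired[name] != observed[name]:
            actions.append(("update", name, desired[name]))
    # Resources on the network that are no longer declared.
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(("delete", name, None))
    return actions


desired = {"leaf-1": {"asn": 64513}, "leaf-2": {"asn": 64514}}
observed = {"leaf-1": {"asn": 64999}, "leaf-3": {"asn": 64515}}
for action in plan_changes(desired, observed):
    print(action)
```

Because the plan is computed from state rather than from a list of commands, applying the same declaration twice produces no further actions, which is the property that makes ‘infrastructure as code’ workflows repeatable.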
Manage risk before applying changes with a ‘digital twin’ of the real network
The ability to make frequent changes to the network configuration while managing the risk of each change is also a key requirement for modern data center fabrics. Vendors develop and test various scenarios in their network labs. However, not every failure scenario can be created or validated there. Operations teams can benefit greatly from a ‘digital sandbox’, which creates a digital twin of the real network. This allows operations teams to experiment, test and validate various automation steps and, more importantly, validate failure scenarios and the associated closed-loop automation without the risk of trying them out on the production network.
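A real digital twin emulates the network operating system and its protocols, which is far beyond a short example. As a much simpler stand-in, the Python sketch below treats the fabric as a graph and checks which single node failures would cut off connectivity between the surviving leaves. The function names and the link format are hypothetical, chosen only to illustrate validating a failure scenario away from production.

```python
from collections import deque


def build_adjacency(links, failed_nodes=frozenset()):
    """Build an undirected adjacency map, skipping links that touch failed nodes."""
    adjacency = {}
    for a, b in links:
        if a in failed_nodes or b in failed_nodes:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def reachable(adjacency, start):
    """Return every node reachable from start using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, leaves):
    """Return the nodes whose failure would disconnect some surviving leaves."""
    nodes = {node for link in links for node in link}
    critical = []
    for failed in sorted(nodes):
        remaining = [leaf for leaf in leaves if leaf != failed]
        if not remaining:
            continue
        adjacency = build_adjacency(links, {failed})
        if not set(remaining) <= reachable(adjacency, remaining[0]):
            critical.append(failed)
    return critical


leaves = ["leaf-1", "leaf-2"]
one_spine = [("leaf-1", "spine-1"), ("leaf-2", "spine-1")]
two_spines = one_spine + [("leaf-1", "spine-2"), ("leaf-2", "spine-2")]
print(single_points_of_failure(one_spine, leaves))
print(single_points_of_failure(two_spines, leaves))
```

Running a proposed topology change through a check like this before deployment is a small-scale version of the sandbox workflow: the failure is injected into a model, never into the production fabric.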
Beyond basic telemetry – From raw data to ‘contextual insights’
Automation and observability go together. However, the traditional approach of simply collecting all kinds of data and just pushing ‘big-data’ at operations teams without interpretation makes the operator’s task complex, while providing little useful information. The industry refers to this as ‘telemetry’, but what is needed is not raw data but insights. Extracting and delivering ‘contextual insights’ that enable the operator to understand the root cause of an issue and perform corrective actions is the need of the day.
The modern data center operations platform must enable an ‘insight database’ that combines configuration and observability data to present contextual operational insights to the operator in an easy-to-understand fashion. These operational insights must also enable the operator to perform closed-loop automation in a programmable way. As the variability and complexity of the collected data increase, applying regular business logic will not be sufficient. Instead, advanced machine learning (ML) based baselining and analytics will provide further and deeper insights to a human operator. In this new approach, a ‘software operator’ can enable a ‘human operator’ to perform the advanced operations needed in modern data centers.
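The difference between raw telemetry and a contextual insight can be sketched in a few lines of Python. The example below uses a plain statistical baseline (mean and standard deviation) as a simplified stand-in for the ML-based baselining described above, and joins any deviation with configuration data. The function name, the interface keys and the configuration fields are all hypothetical.

```python
import statistics


def contextual_insights(history, latest, config, threshold=3.0):
    """Flag metrics that deviate from their baseline and attach config context.

    history: metric key -> list of past samples
    latest:  metric key -> most recent sample
    config:  metric key -> configuration facts about that interface
    """
    insights = []
    for key, value in latest.items():
        samples = history.get(key, [])
        if len(samples) < 2:
            continue  # not enough data to form a baseline
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        if stdev == 0:
            deviates = value != mean
        else:
            deviates = abs(value - mean) / stdev > threshold
        if deviates:
            insights.append({
                "interface": key,
                "value": value,
                "baseline": mean,
                # Configuration context turns a number into something actionable.
                "context": config.get(key, {}),
            })
    return insights


history = {
    "leaf-1:eth1": [10, 12, 11, 9, 10],   # e.g. errors per minute (assumed unit)
    "leaf-1:eth2": [5, 5, 6, 5, 5],
}
latest = {"leaf-1:eth1": 50, "leaf-1:eth2": 5}
config = {"leaf-1:eth1": {"peer": "spine-1", "service": "tenant-a"}}
for insight in contextual_insights(history, latest, config):
    print(insight)
```

Only the interface that departs from its own baseline is reported, and the report already names the peer and service affected, so the operator starts from a likely cause rather than from a stream of counters.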
Integration with software-defined data centers
Network infrastructure automation also needs to enable ‘invisible networks’ when there is value in integrating into a surrounding ecosystem, such as a software-defined data center (SDDC) platform or a cloud-native 5G stack. Here the network should align with the ecosystem so tightly that it follows the needs of applications and becomes invisible until a problem occurs. The fabric operations platform must adopt a loosely coupled cloud-native approach to enable pluggable ‘integrations’ with SDDC stacks such as VMware or Kubernetes-based stacks.
In summary, network automation is an important component of scalable modern data centers. It needs to be delivered via abstract intent, combined with innovative network virtualization. It must become invisible in an ecosystem when needed, be designed for failure, deliver plug-and-play capabilities and, most importantly, tightly combine observability with automation. Such an approach enables operations teams to deliver much-needed closed-loop automation for their data centers.
Nokia Fabric Services System
The building blocks and capabilities introduced in this blog, and in particular the ability to make them available in an easy-to-consume manner, are the driving force behind the Nokia Fabric Services System, a declarative, intent-based automation and operations toolkit that delivers agile and scalable network operations for data center and cloud environments.
Watch out for upcoming blogs that dive deeper into the capabilities introduced in this blog.