Scaling IP fabrics for the digital era
“Internet traffic is expected to grow at 30% year-over-year, give or take,” is a well-known truism in the industry. At that rate, internet traffic more than triples every five years. Since the main contributors to this steep curve are video and ramping-up 5G traffic, and both are expected to keep growing, the curve is very likely to steepen further.
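As a quick sanity check on the compounding, the short sketch below computes the cumulative growth factor for a constant annual rate. Only the 30% figure comes from the text above; the function names are illustrative, and real traffic growth is of course not perfectly constant.

```python
import math


def growth_factor(annual_rate: float, years: int) -> float:
    """Cumulative traffic multiplier after `years` of constant annual growth."""
    return (1.0 + annual_rate) ** years


def years_to_multiply(annual_rate: float, multiple: float) -> float:
    """Years needed for traffic to grow by `multiple` at a constant annual rate."""
    return math.log(multiple) / math.log(1.0 + annual_rate)


# Assumed input: the 30% year-over-year rate quoted in the text.
print(round(growth_factor(0.30, 5), 2))         # 3.71
print(round(years_to_multiply(0.30, 3.0), 1))   # 4.2
```

In other words, at a steady 30% per year, traffic triples in a little over four years and reaches roughly 3.7 times its starting volume after five.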
This insatiable demand has turned the internet into a giant content-delivery network and is driving the network evolution for Communication Service Providers (CSPs). Distributed internet peering, on-net video caching and the emerging edge cloud are shifting traffic from predominantly north-south flows through centralized peering points and data centers to east-west flows in metro/regional networks.
With this shift in traffic demands to the edge, the leaf-spine design approaches used in data center networks are becoming more attractive to help scale out CSPs’ metro aggregation, edge and peering infrastructure. While CSP leaf-spine fabrics share the design patterns and scale-out properties of data center fabrics, they are an entirely different species that has adapted to meet the vastly different deployment requirements of CSP network environments.
This blog examines the requirements, considerations and deployment options for adopting leaf-spine fabrics in CSP network designs.
Scalability has been a constant challenge since the dawn of networking, and most network innovations and investments are focused on this need. The quest for scalability is really a search for efficiency: increasing network capacity while lowering the delivery cost per bit. Investments in next-generation hardware and innovations in network design practices are equally important to efficiently scale network capacity and sustain profitable growth.
Leaf-spine fabrics are a new option for scaling high-capacity IP aggregation, core and peering nodes that would otherwise require massively scalable routers or even multi-chassis system configurations.
Scaling up or out?
Network capacity can be increased in two ways: by scaling up or scaling out. Both methods have merits and limitations and can be applied in concert (see Figure 1).
Scaling up grows capacity in the existing network footprint. This is done by adding line cards to existing chassis, by swapping in a bigger chassis, or by upgrading systems with higher-density hardware and larger-capacity links. The network topology is unchanged, and each system remains a single, integrated managed node with efficient internal connectivity via fabric cards and optimal use of port capacity for service traffic. There is an upfront investment in common hardware and significant cost breakpoints when system chassis capacity limits are reached and system hardware upgrades (or swap-outs) are needed. Adding line cards to an existing chassis is a low-cost investment, but the impact of failures will grow proportionally with system capacity, which can lead to a large “blast radius” in typical 1+1 system configurations.
Scaling out grows capacity by adding more systems. This is the horizontal scaling approach of leaf-spine network fabrics, in which leaves (and spines) are added to expand fabric capacity. Leaves can be thought of as line cards in a fixed-form chassis, while spines act as the switching backplane. Leaf-spine fabrics offer a near-linear pay-as-you-grow cost model, although incremental investments in leaf routers yield diminishing returns because the cost of fabric connectivity, control and management overhead grows as the fabric scales out. There is an upfront investment in spine switches, with significant cost breakpoints when more spine nodes are needed to expand the fabric. However, the availability requirements of individual leaf and spine nodes are reduced, and the impact of system outages is much smaller due to an N+1 leaf node redundancy model.
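The difference in blast radius between a 1+1 system pair and an N+1 set of leaves can be illustrated with a little arithmetic. The sketch below is a simplification under stated assumptions: load is shared equally across all nodes, and surviving nodes can absorb the traffic of a failed one. The function names are hypothetical.

```python
def capacity_lost_on_single_failure(nodes: int) -> float:
    """Fraction of total capacity lost when one of `nodes` equally loaded nodes fails."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    return 1.0 / nodes


def safe_utilization(nodes: int) -> float:
    """Highest per-node utilization at which the survivors can absorb one node failure."""
    if nodes < 2:
        raise ValueError("redundancy requires at least two nodes")
    return (nodes - 1) / nodes


# 1+1 pair of large systems: two nodes share the load.
print(capacity_lost_on_single_failure(2), safe_utilization(2))   # 0.5 0.5

# N+1 leaves with N = 7: eight smaller nodes share the load.
print(capacity_lost_on_single_failure(8), safe_utilization(8))   # 0.125 0.875
```

Under these assumptions, a single failure in the 1+1 pair removes half the capacity, while one failed leaf out of eight removes an eighth, and the fabric can be run at a higher utilization without losing headroom for a failure.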
Operational deployment characteristics of leaf-spine fabrics
A leaf-spine network fabric replaces a large 1+1 routing system with a set of leaves (modular or fixed-chassis “pizza boxes”) that connect to two or more spine routers (Figure 2). Leaf-spine fabrics scale by varying the number of leaf and spine devices and the interface capacity they provide.
Fabric capacity and performance can be optimized by using dedicated leaves for specific interface roles. Alternatively, the number of leaf routers and interface ports required can be reduced by directly connecting service edge (e.g. BNG) and/or peering routers to the spine.
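To make the sizing trade-offs concrete, here is a minimal sketch of the basic arithmetic for a two-tier fabric. It assumes every leaf connects to every spine with the same number of uplinks, which is a common design pattern but not the only one. The class, its fields and the example port counts are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricDesign:
    spines: int                 # number of spine routers
    spine_ports: int            # leaf-facing ports available on each spine
    leaf_access_gbps: float     # access (service-facing) capacity per leaf
    uplink_gbps: float          # speed of each leaf-to-spine link
    uplinks_per_spine: int = 1  # parallel links from each leaf to each spine

    def max_leaves(self) -> int:
        """Leaves the fabric can hold before spine ports run out."""
        return self.spine_ports // self.uplinks_per_spine

    def leaf_uplink_gbps(self) -> float:
        """Total fabric-facing capacity of one leaf."""
        return self.spines * self.uplinks_per_spine * self.uplink_gbps

    def oversubscription(self) -> float:
        """Ratio of access capacity to fabric capacity on a leaf (1.0 = non-blocking)."""
        return self.leaf_access_gbps / self.leaf_uplink_gbps()

    def max_access_gbps(self) -> float:
        """Total access capacity with the fabric fully populated."""
        return self.max_leaves() * self.leaf_access_gbps


# Illustrative numbers only: 4 spines with 32 ports each,
# leaves with 24 x 100G access ports and one 400G uplink per spine.
design = FabricDesign(spines=4, spine_ports=32,
                      leaf_access_gbps=2400.0, uplink_gbps=400.0)
print(design.max_leaves())        # 32
print(design.oversubscription())  # 1.5
print(design.max_access_gbps())   # 76800.0
```

The spine port count caps the number of leaves, so growing beyond it means adding or replacing spines. That is the cost breakpoint described for the scale-out model.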
Properly designing and engineering leaf-spine fabrics for growth is not a trivial task since there are several scaling factors, such as port capacity of leaf and spine switches, over-subscription ratios, protocol choices and failure recovery performance, that require careful consideration. Ironically, leaf-spine network fabrics also introduce several new scaling challenges since the number of managed devices will significantly increase:
- Increase in control plane signaling and state. When N chassis-based systems are replaced with leaf-spine fabrics consisting of M routers on average, the IGP topology grows from N nodes to N*M nodes (N*(M-1) more than before), with many more links as well. At the BGP layer there are now N*M IBGP next-hops. These changes result in more state information to be held in the IGP and BGP protocol databases, and some routers may not be able to handle this increase. Route churn will increase with connectivity, and convergence times will lengthen because there is more information to process after each change event.
- Traffic flow inefficiencies. When a chassis-based router is replaced with a set of leaf and spine devices and hop-by-hop shortest path forwarding is assumed to be adequate, certain traffic inefficiencies are unavoidable (especially with ECMP routes) and, in the worst case, packet loops may develop.
- Failure propagation and recovery. When a leaf node fails, this event must be propagated across the entire fabric via changed IGP link-state PDUs. Even with optimizations to the flooding algorithm and aggressive timer settings, the IGP PDUs propagate more slowly than the intra-chassis messages in scale-up models. Introducing BGP can result in other complications, as the withdrawal of the BGP FIB routes caused by the failure of a leaf or an external-facing leaf port will incur additional delays in the BGP path selection algorithm.
- Unintended traffic path changes. Ideally, the replacement of a traditional chassis by a scale-out leaf-spine system should not cause routing changes that impact the proportion of traffic sent by the “virtual chassis” to different peers. Overly simplified leaf-spine designs may cause original distinctions to be lost (e.g., between EBGP and IBGP peers) or new distinctions to be added (e.g., extra cluster IDs in the CLUSTER_LIST). Original traffic engineering goals may be compromised when these changes lead to different best-path selections.
- Additional management complexity. With far more nodes and links to manage, operations teams will have to understand and differentiate fabric-internal (leaves/spines) vs fabric-external alarms and be able to differentiate and correlate these alarms into root causes. Some operational procedures will require direct management access to leaf and spine elements, for example to troubleshoot connectivity issues within the logical system or to perform software upgrades or maintenance activities.
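The growth in control plane state listed above can be estimated with some simple counting. The sketch below assumes each fabric is a full leaf-to-spine mesh, and it uses an IBGP full mesh as a worst-case upper bound; route reflectors would reduce the session count considerably. Function names and example values are hypothetical.

```python
def igp_nodes(fabrics: int, routers_per_fabric: int) -> int:
    """IGP nodes after replacing each chassis with a fabric of M routers."""
    return fabrics * routers_per_fabric


def extra_igp_nodes(fabrics: int, routers_per_fabric: int) -> int:
    """Additional IGP nodes compared with one chassis per site."""
    return fabrics * (routers_per_fabric - 1)


def fabric_internal_links(leaves: int, spines: int) -> int:
    """Links inside one fabric when every leaf connects once to every spine."""
    return leaves * spines


def full_mesh_sessions(speakers: int) -> int:
    """IBGP sessions in a full mesh of `speakers` routers (no route reflection)."""
    return speakers * (speakers - 1) // 2


# Illustrative case: N = 10 sites, each fabric with 8 leaves + 2 spines (M = 10).
n, leaves, spines = 10, 8, 2
m = leaves + spines
print(igp_nodes(n, m))                            # 100
print(extra_igp_nodes(n, m))                      # 90
print(n * fabric_internal_links(leaves, spines))  # 160
print(full_mesh_sessions(n))                      # 45
print(full_mesh_sessions(n * m))                  # 4950
```

Even in this small example, node count grows tenfold while the worst-case session count grows by two orders of magnitude, which is why abstracting the fabric as a single entity matters.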
Ideally the use of a leaf-spine fabric is transparent to both the operator and the surrounding network by abstracting it as a single entity from a control and management plane perspective, while fully taking advantage of individual device capabilities. Otherwise the operational cost and complexity may increase considerably and outweigh any potential scaling benefits.
The right scaling approach
There are benefits and limitations to both vertical system scale-up and horizontal leaf-spine scale-out architectures that determine economies of scale, network availability and operational complexity. These aspects, along with power, space, cooling and incremental connectivity costs, must be considered at the node, fabric and network level based on projected traffic demands.
In general, chassis-based scale-up equipment practices enable simpler-to-maintain network designs with fewer high-capacity nodes to manage, at the cost of a larger traffic impact when equipment fails. The impact of equipment failures and upgrades can of course be mitigated by various high-availability features, including major and minor in-service software upgrades, hot-standby control redundancy and multi-chassis Link Aggregation Groups, to name a few options. Scale-up equipment practices are well understood and widely deployed.
Leaf and spine scale-out fabrics offer more granular scaling and a reduced impact of individual device failures, with the added cost and complexity of a bigger footprint with more nodes to manage. However, the deployment experience with leaf-spine fabrics in CSP production environments is still limited and there are several operational challenges to address.
Introducing next-generation routing silicon will benefit both scaling methods, even if only to reduce space, power and cooling requirements. Optimizing all scaling dimensions in next-generation networks will require a balanced approach that combines scale-up and scale-out architecture elements. Operational preferences, practices and equipment capabilities will determine up to which point it is prudent to continue scaling up chassis-based systems and when to start scaling out by introducing leaf-spine fabrics. Nokia offers solutions for both options and is actively engaged with CSPs to help them build scalable IP networks that can efficiently meet evolving demands from internet video and 5G traffic.
Future blogs in this series will examine the operational management and control plane scaling challenges and solutions of leaf-spine fabrics in more detail.