Building ultra-reliable AI and HPC Data Centers

Nokia team

As the ICT industry gathered at Supercomputing 2025 in St. Louis, advanced AI use cases and applications were top of mind. Attendees at this annual event included researchers and scientists from prestigious national and international universities, government research labs, and the broader AI/HPC data center community. Each has a unique role in accelerating AI innovation. To reach their goals, they’ll need high-capacity, high-performance computing and, equally important, reliable high-performance research and education networks (RENs) for securely transmitting their massive data sets to AI/HPC centers around the world.

Platinum sponsor of Supercomputing 2025

Nokia was proud to be, for the fourth year in a row, a platinum sponsor of and contributor to SCinet, the world’s highest-capacity (9 Tbps) temporary multi-vendor network, built for the event. We worked closely with industry partners, research institutions and over 200 volunteers to build this state-of-the-art network for Supercomputing 2025. It featured some of our newest data center switching, IP routing, and optical transport products. In addition, our DDoS mitigation system monitored the WAN traffic from SCinet and booth exhibitors to outside laboratories and data centers. Of note, our 7750 SR-1s connected SCinet over 800G ZR+ pluggable optics to the StarLight network point of presence at Northwestern University in Chicago, a distance of just over 300 miles.

Our presence at the event demonstrated our strategic ambition to address the explosive growth of AI/HPC data centers with high-capacity, resilient, secure and flexible networking. Of special interest to government agencies, research universities and enterprises was the new line of 7220 IXR-H6 data center fabric switches that we recently announced. These high-capacity switches support 102.4 Tb/s of throughput and up to 1.6 Tb/s per port, and are built on the latest generation of Broadcom silicon, Tomahawk 6. They run our modern, open and extensible SR Linux network operating system (NOS) and can also run open-source community SONiC.

One of the strategic partnership solutions we showcased this year was for sovereign AI clusters, which require optimized data center network architectures to meet industry-standard capacity and performance metrics. We teamed up with Lenovo and AMD to design, test and validate a modular and scalable blueprint solution that optimizes the compute and improves GPU processing performance. It also reduces the amount of network switching required, thus lowering the overall power consumption and cost of the solution.

Along with the sovereign AI solution, we showcased our data center fabric and wide area network solutions, including our industry-leading IP routing and 800G pluggable optics. We also showed a quantum-safe network solution that protects against the threat of a powerful quantum computer breaking today’s encryption algorithms.

Finally, we demonstrated how to improve data center network operational reliability using AI-enabled network automation.  

Operational reliability

As AI moves into the mainstream and applications become business- and mission-critical, governments and enterprises will need to change the way they operate their data centers. Automation of operations with the assistance of AI will improve operational reliability in several ways.  

Our Bell Labs Consulting group recently worked with Futurum to survey data center network operators and find out what keeps them up at night. Of the respondents, 86% pointed to data center reliability. They want solutions that address:

  1. Mean time to innocence (MTTI)
  2. Alert fatigue
  3. Barriers to automation

The last of these worries points us down the path to more open, automated data center fabrics. Automation that focuses on making changes safely addresses the principal cause of unreliability: human error. Automated workflows are repeatable and precise, and they avoid the slips that come with manual work. The more that routine tasks like configuration, provisioning and updates can be automated, the more reliable operations become. This is especially important for consistent implementation of security and compliance across the data center fabric. It becomes even more critical when deploying AI clusters, which can require massive scaling for training or inference across the network.

Automated self-healing fabrics help to reduce alert fatigue. There are several dimensions to proactive problem resolution. The first is real-time monitoring using advanced telemetry and AI/ML data analytics. Anomalies in fabric performance can be spotted early and can trigger predictive maintenance or automated incident responses such as re-routing traffic, initiating failover or isolating a sub-system.
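To make the monitoring idea concrete, here is a minimal sketch of anomaly detection on a telemetry stream, using a simple trailing-window z-score. It is an illustration under stated assumptions, not how any Nokia product works: the function name `detect_anomalies`, the window size, the threshold and the sample readings are all hypothetical, and the "re-route" action is only a placeholder print.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing window
    by more than `threshold` standard deviations (hypothetical helper)."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Guard against a flat window, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return flagged


# Made-up link utilisation readings in percent; index 7 is a spike.
link_util = [41.0, 42.5, 40.8, 41.9, 42.1, 41.4, 42.0, 97.3, 41.7, 42.2]
anomalies = detect_anomalies(link_util)
for idx in anomalies:
    print(f"sample {idx}: {link_util[idx]}% -> trigger re-route (placeholder)")
```

A production system would use richer models and streaming telemetry, but the shape is the same: build a baseline, flag deviations early, and hand them to an automated response instead of a human inbox.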

Through real-time monitoring and the use of AI for diagnosis and trouble resolution, the operations team can more rapidly determine the root cause of a problem, thus reducing MTTI, and work with the responsible organization on remedial actions, armed with real-time data. The point is to avoid an exhausting and endless cycle of fire drills so that operations staff can focus on higher-value and more complex projects.

Improving operational reliability is one of the goals of Nokia’s Event-Driven Automation (EDA) platform. With its self-healing architecture, intent-based automation, and robust validation tools that simplify operations, EDA supports our mission of “Human Error Zero”.

Besides reducing human error, the primary cause of data center fabric downtime, EDA introduces a structured process for all network changes. Operators simply define the desired state of the network (intent), and EDA translates this into the required configurations automatically.
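The intent-based pattern can be sketched in a few lines. This is a generic illustration of "declare the desired state, generate the configuration", not EDA's actual data model or API: the `FabricIntent` class, the `render_configs` function, the device names and the configuration line syntax are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricIntent:
    """Desired state (hypothetical schema): a VLAN that should exist
    on a given set of leaf switches."""
    vlan_id: int
    vlan_name: str
    leaves: tuple


def render_configs(intent):
    """Translate one intent into per-device configuration lines.
    The line syntax is made up for illustration."""
    if not 1 <= intent.vlan_id <= 4094:
        raise ValueError(f"VLAN id {intent.vlan_id} is outside 1-4094")
    return {
        leaf: [
            f"vlan {intent.vlan_id}",
            f"  name {intent.vlan_name}",
        ]
        for leaf in intent.leaves
    }


intent = FabricIntent(vlan_id=120, vlan_name="gpu-backend",
                      leaves=("leaf1", "leaf2"))
configs = render_configs(intent)
for device, lines in configs.items():
    print(device)
    print("\n".join(lines))
```

The operator edits only the intent. Validation happens once, in one place, and every device receives configuration generated from the same source, which is where the consistency benefit comes from.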

EDA also includes a digital twin of the actual data center network, which allows pre- and post-change testing of all proposed changes across multiple devices, with automatic rollback to the previous stable configuration. It also incorporates event-driven monitoring using advanced telemetry protocols, automated incident response for pre-defined corrective actions, and AI-driven insights to enable proactive maintenance.
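The twin-then-rollback workflow can be sketched as follows. This is a simplified illustration that assumes the network state fits in a Python dictionary; `apply_with_rollback`, the MTU check and the device names are hypothetical and do not represent EDA's implementation.

```python
import copy


def apply_with_rollback(network, change, checks):
    """Try `change` on a copy of the state (the "twin") first, then on the
    live state; restore the previous state if any post-check fails."""
    twin = copy.deepcopy(network)
    change(twin)
    if not all(check(twin) for check in checks):
        return False  # rejected before touching the live network
    snapshot = copy.deepcopy(network)
    change(network)
    if not all(check(network) for check in checks):
        # Post-check failed on the live state: roll back to the snapshot.
        network.clear()
        network.update(snapshot)
        return False
    return True


def mtus_match(net):
    """Check: every device must use the same MTU."""
    return len({dev["mtu"] for dev in net.values()}) == 1


def raise_mtu(net):
    for dev in net.values():
        dev["mtu"] = 9216


def break_leaf1(net):
    net["leaf1"]["mtu"] = 1500


network = {"leaf1": {"mtu": 9000}, "leaf2": {"mtu": 9000}}
ok_good = apply_with_rollback(network, raise_mtu, [mtus_match])
ok_bad = apply_with_rollback(network, break_leaf1, [mtus_match])
print(ok_good, ok_bad, network)
```

The consistent change is applied, while the inconsistent one is caught on the twin and never reaches the live state. The second check after the live change covers the case where the real network behaves differently from its model.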

At Supercomputing 2025, we demonstrated how our EDA network automation platform can rapidly construct a back-end AI data center network. We used AI to assist in designing and configuring the AI cluster, and used EDA’s digital twin capabilities to test and perform pre-checks before deployment. We also demonstrated AIOps and NetOps capabilities across the life cycle: from design (day 0) through deployment (day 1) to full operations (day 2). With agentic AI-driven real-time analysis of data center fabric health, logs, alarms, and events, EDA provides the observability needed to evolve from manual processes. It reduces human errors and improves operational reliability while giving network operators the ability to respond quickly and confidently to their organization’s needs.

Flavio Caracas

About Flavio Caracas

Flavio Caracas leads a team responsible for IP/MPLS, optics and data center networking consulting engineering and business development support for vertical markets in North America within the IP and Optical divisions. Previously he was the head of Product Line Management for the 7705 SAR IP/MPLS routing portfolio and the Multiservice WAN Switches product families, with responsibilities for product strategy, roadmap definition, future development, and portfolio life cycle.

Prior to Nokia, Flavio joined Alcatel-Lucent in 2000 through the acquisition of Newbridge Networks, and has held senior management positions in Business Management, Systems Engineering, Marketing and Sales. Flavio has broad experience in information and communications technologies, encompassing IP/MPLS networking, satellite communications, and radio transmission. He holds a bachelor’s degree in Electrical Engineering.
