AI is driving the evolution of autonomous networks
I can’t help but notice that amid the hype surrounding AI, chips and data centers, the network rarely seems to get a mention. As a Bell Labs researcher, I see this as a critical gap. Intelligent, agile and reliable connectivity that dynamically meets the needs of AI-driven applications will be essential. I would go even further and say that without autonomous, or mostly autonomous, networks, it’s difficult to see how AI will fully realize its potential.
Ironically, the key to achieving networks that can support AI will itself be AI; specifically, the development of specialized AI agents that work in concert to operate the network in all its aspects. Let’s see what that might look like.
AI demands network evolution
Imagine, for instance, that an enterprise AI application using a hybrid private-public cloud makes an API request for an on-demand network slice to connect a customer, employee or IoT device back to the enterprise’s private cloud for data security reasons. The slice might be needed only for the duration of the transaction or event. Now imagine this request for a slice being repeated by thousands of different AI applications across the same network in the same period.
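To make the idea concrete, here is a minimal sketch of what such an on-demand slice request might look like from the application side. The endpoint, field names and SLA values are hypothetical illustrations, not a reference to any standardized or operator-specific API.

```python
import requests

# Hypothetical on-demand slice request; the endpoint and fields are
# illustrative only, not an actual operator or 3GPP-defined API.
slice_request = {
    "tenant": "enterprise-42",                            # requesting enterprise (hypothetical)
    "endpoints": ["iot-device-17", "private-cloud-eu"],   # what the slice connects
    "sla": {"latency_ms": 20, "bandwidth_mbps": 50},      # performance the application needs
    "isolation": "private",                               # keep traffic off shared paths
    "lifetime_s": 300,                                    # slice lives only for the transaction
}

response = requests.post(
    "https://network-operator.example/api/v1/slices",     # placeholder URL
    json=slice_request,
    timeout=10,
)
response.raise_for_status()
slice_id = response.json().get("slice_id")                # handle used later to tear the slice down
print(f"Slice {slice_id} provisioned for {slice_request['lifetime_s']} seconds")
```

The point is less the syntax than the pattern: short-lived, application-driven requests arriving in large numbers, each expecting the network to respond in seconds.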
This represents a new paradigm for network operations, which has traditionally relied on predictable traffic patterns to plan and manually manage the deployment of network resources. Automation, though already in use today, will have to evolve massively. And since AI is forcing this evolution, it is perhaps unsurprising that network operators will turn to AI itself, especially AI agents, to create networks that self-operate.
Network operations
The building blocks of networks that self-operate (NSOs) will be network module agents (NMAs): AI agents that intelligently operate a network node, for instance in the RAN, core or optical domains. Each agent addresses a specific function, performing tasks such as predictive maintenance, root cause analysis or resource management. They receive inputs such as KPIs and alarms, analyze local log or sensor data, and take actions such as tuning parameters and self-healing the node.
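As a rough sketch of the idea, an NMA’s local control loop might look something like the following. The class, threshold and action names are hypothetical, intended only to illustrate the inputs-to-actions flow described above.

```python
from dataclasses import dataclass, field

@dataclass
class NodeTelemetry:
    """Snapshot of the inputs an NMA might receive from its node (illustrative)."""
    kpis: dict          # e.g. {"card_7_error_rate": 0.03}
    alarms: list        # active alarm identifiers
    sensor_logs: list   # raw log or sensor lines for local analysis

@dataclass
class NetworkModuleAgent:
    """Hypothetical NMA watching a single node and acting locally."""
    node_id: str
    error_rate_threshold: float = 0.02
    findings: list = field(default_factory=list)

    def observe(self, telemetry: NodeTelemetry) -> list:
        """Analyze local data, self-heal where possible and record findings to escalate."""
        actions = []
        for kpi, value in telemetry.kpis.items():
            if kpi.endswith("error_rate") and value > self.error_rate_threshold:
                # Local self-healing first: retune the affected component.
                actions.append(f"retune:{kpi}")
                # Keep a finding to report upward if the degradation persists.
                self.findings.append(
                    {"node": self.node_id, "kpi": kpi, "value": value,
                     "prediction": "possible card degradation"}
                )
        if telemetry.alarms:
            actions.append("run_root_cause_analysis")
        return actions
```

A real NMA would back this logic with learned models rather than a fixed threshold, but the shape of the loop, local observation, local action, escalation of findings, stays the same.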
NMAs can interact directly with each other, but they are also orchestrated by other agents, called large network agents (LNAs), each assigned to a network domain or service layer. LNAs aggregate the outputs of multiple NMAs and orchestrate their activities. They have their own internal AI components that reason, filter out hallucinations and remember the chain of interactions. They also manage communications between agents, including API calls to foundational AI models, local non-language models, applications, tools and digital twins, as well as interactions with human engineers and customers.
To provide fully autonomous predictive maintenance, for instance, an NMA might monitor the KPIs and sensors of a specific piece of network equipment. The NMA would report to its LNA that a particular card under its watch is starting to fail. The LNA, with its system-wide view across many NMAs, can reason about whether this is an isolated card failure or the sign of a more serious node failure. Depending on its prediction, the LNA may interact with a supply chain AI agent to order a new card and assign a human technician to replace it, or take more holistic actions, possibly involving human technicians, to prevent a service-impacting failure.
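A hedged sketch of the LNA side of this scenario could look like the code below. The correlation rule (three or more degrading components suggests node-level risk) and the action names are purely illustrative placeholders for the reasoning an LNA would actually apply.

```python
# Hypothetical LNA logic for the predictive-maintenance scenario above.
# Findings arrive from many NMAs; field names and thresholds are illustrative.

def correlate_findings(findings: list[dict]) -> dict:
    """Decide whether reports point to one failing card or a wider node problem."""
    by_node: dict[str, list[dict]] = {}
    for finding in findings:
        by_node.setdefault(finding["node"], []).append(finding)

    decisions = {}
    for node, reports in by_node.items():
        if len(reports) >= 3:
            # Many components degrading together suggests a node-level risk.
            decisions[node] = {
                "prediction": "potential node failure",
                "actions": ["notify_operations_team", "prepare_traffic_reroute"],
            }
        else:
            decisions[node] = {
                "prediction": "single card degradation",
                "actions": ["order_replacement_card",      # via a supply chain agent
                            "dispatch_field_technician"],   # human in the loop
            }
    return decisions

# Example: one isolated report, and one node with several degrading components.
findings = [
    {"node": "ran-07", "kpi": "card_3_error_rate", "value": 0.04},
    {"node": "core-02", "kpi": "card_1_error_rate", "value": 0.05},
    {"node": "core-02", "kpi": "card_2_error_rate", "value": 0.06},
    {"node": "core-02", "kpi": "fan_speed_variance", "value": 0.30},
]
print(correlate_findings(findings))
```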
Spatial awareness
Many things happen to networks in the physical world that are not properly captured by network equipment data, especially when a physical fault involves a complete loss of connection. Thus, to be completely self-operating, networks will also need AI agents that understand the physical layer of the network, that is, what is happening in the physical world.
These large world agents (LWAs) are a kind of embodied AI. They have physics models that enable them to analyze visual, temporal and spatial data from cameras, text, video/audio and sensors, and they take real-world actions using, for instance, robots, cars, drones or human personnel. They rely heavily on digital twins to model the world, especially those aspects of it that directly affect the network infrastructure.
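As an illustration of how an LWA might combine a physical-world observation with a digital twin, consider the minimal sketch below. The twin data, segment names and actions are hypothetical and stand in for far richer physics and geometry models.

```python
# A minimal sketch of an LWA fusing a camera observation with a digital twin
# of the outside plant; all data and action strings are hypothetical.

FIBER_TWIN = {
    # digital-twin entry: route segment -> approximate GPS span and services carried
    "segment-A12": {
        "span": ((40.712, -74.006), (40.715, -74.001)),
        "services": ["enterprise-slice-7", "backhaul-cell-231"],
    },
}

def handle_observation(segment: str, camera_event: str) -> list[str]:
    """Map a visual detection onto the twin and propose real-world actions."""
    twin_entry = FIBER_TWIN.get(segment)
    if twin_entry is None:
        return [f"log_unmapped_observation:{camera_event}"]
    actions = [
        f"assess_threat:{camera_event}",
        f"dispatch_drone_to:{twin_entry['span'][0]}",      # confirm the situation visually
    ]
    for service in twin_entry["services"]:
        actions.append(f"notify_lna_reroute:{service}")    # protect the affected traffic
    return actions

print(handle_observation("segment-A12", "excavator working near fiber duct"))
```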
Semantic communications
In the complex chain of AI agent interactions needed to self-operate the network, it is not simply a matter of machine-to-machine communication, since humans will often be in the decision-making loop. For the sake of transparency, among other things, it’s best that communications between agents are carried out in natural language. For this to be efficient, only the key information, with context where required, should be transmitted.
Let’s say a field engineer encounters an issue with the DSL cross-connect box in a customer’s backyard that requires specialized knowledge to resolve. The field engineer may use a smart tablet powered by an AI agent with semantic communication capabilities to initiate a video call with a remote expert. The AI agent on the tablet analyzes the video feed from its camera and extracts relevant information, such as the type of box and its wiring. Meanwhile, temperature, vibration and moisture sensors in the cabinet provide the environmental conditions. All these inputs are analyzed by the remote expert, whether human or AI agent, which then provides precise, step-by-step guidance to the field engineer.
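The sketch below illustrates the principle of sending only the key information with the necessary context. The fields, thresholds and helper function are hypothetical; a real agent would derive them from learned vision and language models rather than hand-written rules.

```python
# Illustrative only: condensing multimodal inputs into a short natural-language
# message for the remote expert. Field names and thresholds are hypothetical.

def build_semantic_message(vision: dict, sensors: dict) -> str:
    """Keep only the key facts, with just enough context to act on them."""
    parts = [f"Cross-connect box type: {vision['box_type']}"]

    if vision.get("anomalies"):
        parts.append("Visible issues: " + ", ".join(vision["anomalies"]))

    # Forward environmental readings only when they are outside normal range.
    if sensors["moisture_pct"] > 60:
        parts.append(f"High moisture in cabinet: {sensors['moisture_pct']}%")
    if sensors["temperature_c"] > 45:
        parts.append(f"Cabinet overheating: {sensors['temperature_c']} C")

    return "; ".join(parts)

message = build_semantic_message(
    vision={"box_type": "legacy DSL cross-connect",
            "anomalies": ["corroded pair 12", "loose bridge tap"]},
    sensors={"moisture_pct": 72, "temperature_c": 31},
)
print(message)
# -> "Cross-connect box type: legacy DSL cross-connect; Visible issues:
#     corroded pair 12, loose bridge tap; High moisture in cabinet: 72%"
```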
Trustworthy autonomy
As with every discussion around AI today, it’s important to acknowledge the elephant in the room: how do we ensure that handing over the operation of such critical infrastructure to AI agents is safe? There are five principles that we at Bell Labs see as critical:
- Multimodal integration and contextual awareness: seamless integration and reasoning across multiple modalities to provide context-aware insights and actions
- Explainability, safety and compliance: prioritization of safety and adherence to industry standards, ensuring explainable decision-making and incorporating human-in-the-loop validation for critical actions
- Robustness and reliability: high reliability for edge cases and implementation of fail-safe mechanisms
- Scalability and real-time performance: support for high data throughput with real-time inferencing available on devices or the edge cloud, as well as the ability to scale to multiple facilities and tasks
- Data privacy and security: protection of sensitive network data and network AI models
There can be enormous benefits in automating networks. The intelligent fusion of network operation agents (NMAs and LNAs), spatial intelligence agents (LWAs), and semantic communications will take us there. This integration promises to enhance operational efficiency and reduce downtime significantly. Most importantly, it will be essential for building the dynamic, resilient platform needed to realize the promises of an AI world.
If you are interested in a more in-depth discussion of AI and networks that self-operate, we have recently made available a white paper exploring our findings and recommendations: Advancing AI: networks that self-operate.