The illusion of internet resilience


We thought we'd solved this. After decades of building distributed systems and automated failover, the internet was supposed to be resilient by design. We could trust the automation to handle the edge cases.

Then October and November 2025 happened, and our assumptions were systematically disproven.

When protection becomes an attack

On October 20 at 06:48 UTC, a race condition in AWS's DynamoDB DNS automation deleted the DNS record for DynamoDB's regional endpoint. For three hours, DynamoDB was unreachable. The cascading failures took fourteen hours to fully resolve, paralyzing EC2 launches, Lambda executions, and dozens of dependent services across us-east-1. The automation designed to ensure resilience failed through an interaction between components that nobody had anticipated. One DNS record. Fourteen hours.
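The full postmortem is AWS's to tell, but the underlying pattern is worth sketching: two automation workers making check-then-act decisions against shared state, with nothing coordinating them. The toy Python below (hypothetical names and structure, not AWS's actual components) shows how a delayed update and an overeager cleanup can, between them, leave an endpoint with no DNS record at all.

```python
import threading
import time

# A schematic of a check-then-act race between two automation workers.
# Hypothetical names and structure -- not AWS's actual system.
zone = {"db.example.internal": "plan-1"}   # toy DNS "zone"

def apply_plan(plan_id: str, delay: float) -> None:
    """An 'enactor' that applies its DNS plan if a record is already live."""
    current = zone.get("db.example.internal")   # check
    time.sleep(delay)                           # this enactor is delayed
    if current is not None:                     # decide on a stale read...
        zone["db.example.internal"] = plan_id   # ...then act

def clean_up(active_plan: str) -> None:
    """A cleanup pass that deletes records it believes belong to retired plans."""
    if zone.get("db.example.internal") != active_plan:
        # From the cleaner's point of view this record is stale, but a delayed
        # enactor has just rewritten it -- so the live endpoint disappears.
        del zone["db.example.internal"]

slow = threading.Thread(target=apply_plan, args=("plan-1", 0.5))  # delayed, older plan
fast = threading.Thread(target=apply_plan, args=("plan-2", 0.0))  # newer plan
slow.start()
fast.start()
fast.join()
slow.join()

clean_up(active_plan="plan-2")   # cleanup keyed to the plan it believes is current
print(zone)                      # {} -- no DNS record left for the endpoint
```

Each worker behaves correctly on its own terms; only the interleaving is wrong.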

A month later, Cloudflare hit a similar failure mode. A database permissions change doubled the size of a Bot Management configuration file. When that file propagated globally, it exceeded a hard-coded limit (200 features) and caused the proxy software to panic. For three hours, DDoS protection systems designed to keep the internet online instead took down a significant portion of it. The operations team initially suspected a massive DDoS attack, since the symptoms were identical.
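Cloudflare's proxy is not written in Python and this is not its code, but the shape of the failure is easy to sketch: a hard limit sized for "normal" input, and an unexpected but otherwise well-formed input treated as unrecoverable.

```python
# Not Cloudflare's code: a schematic of how a hard-coded limit turns an
# oversized, but otherwise well-formed, config file into a crashing service.
MAX_FEATURES = 200   # capacity sized for the "normal" config

class BotModule:
    def __init__(self, config: dict):
        features = config["features"]
        if len(features) > MAX_FEATURES:
            # The out-of-range case is treated as unrecoverable, so every
            # instance that receives the new file fails the same way.
            raise RuntimeError(
                f"{len(features)} features exceeds limit of {MAX_FEATURES}"
            )
        self.features = features

# A change upstream doubles the generated feature list...
oversized = {"features": [f"f{i}" for i in range(400)]}

# ...and the module refuses to run wherever the file has propagated.
try:
    BotModule(oversized)
except RuntimeError as err:
    print(f"proxy instance down: {err}")
```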

Between these incidents, Nokia Deepfield observed a 33 terabits-per-second DDoS attack against a gaming provider—volume comparable to entire national internet backbones. The attack worked because the internet generally treats all packets equally. A legitimate request and a malicious packet get the same priority and routing. At 33 Tbps, this egalitarian design becomes a liability.

The economics are brutally asymmetric: the Aisuru botnet comprises roughly 500,000 compromised IoT devices generating more bandwidth than many countries' entire internet capacity. The attacker pays almost nothing. The target must provision defenses capable of absorbing nation-state levels of traffic. We've built an internet where defense is expensive and attack is cheap, where distinguishing legitimate from malicious traffic requires analysis that most infrastructure can't perform at line rate.
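The back-of-envelope arithmetic makes the asymmetry concrete. Assuming the figures above and an even split of traffic across the botnet (a simplification), each compromised device only has to contribute a fraction of an ordinary broadband uplink:

```python
# Back-of-envelope arithmetic, assuming the figures above and an even
# split of attack traffic across the botnet (a simplification).
attack_bps = 33e12        # ~33 Tbps of observed attack traffic
devices = 500_000         # rough size of the Aisuru botnet

per_device_bps = attack_bps / devices
print(f"{per_device_bps / 1e6:.0f} Mbps per device")   # ~66 Mbps

# The attacker rides on upstream bandwidth that the owners of the compromised
# devices already pay for; the target has to provision or rent tens of
# terabits of absorption capacity to survive the same traffic.
```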

The false promise of scale

The intuitive response is scale. If attacks exceed 30 Tbps, we need providers who can absorb 30+ Tbps. If DNS automation has race conditions, we need more sophisticated automation. If configuration files can exceed limits, we need bigger limits.

This is comforting because it suggests the problem is solvable through engineering. It's also wrong.

First, centralization reduces resilience. Concentrating traffic through mega-providers creates chokepoints. When one has problems (and they will), the impact spreads wider. Second, complexity doesn't scale linearly. AWS's race condition and Cloudflare's panic both likely emerged from interactions between components that had been thoroughly tested in isolation. As systems grow, the number of possible interactions grows combinatorially: ten components have 45 pairwise interactions, a hundred have nearly 5,000, and that's before considering higher-order effects. Third, centralized scrubbing centers are prohibitively expensive for most organizations, and even for those who can afford them, the attack traffic still floods the internet. You're just dealing with it downstream.

What we're actually defending

We're not actually defending networks anymore. We're defending against our own systems.

The AWS outage was caused by automation doing exactly what it was designed to do, in an unanticipated sequence. The Cloudflare outage stemmed from a reasonable database change propagating through a system that made an unreasonable assumption. Even the 33 Tbps attack succeeded because of systemic failures: insecure-by-default IoT devices, asymmetric bandwidth, misaligned incentives, and infrastructure treating all traffic equally.

We've built systems so complex their behavior is unpredictable, connected so tightly that failures propagate instantly, and automated to operate faster than human comprehension. We assumed scale and redundancy would protect us. They haven't.

The intelligence gap

The promise of AI networking is that intelligence will solve these problems. Machine learning will detect anomalies. Automated systems will respond faster than humans.

Maybe. But intelligence without understanding is just faster failure. If we can't design systems that handle a doubled configuration file or a DNS race condition, adding AI won't help.

What we need is different thinking. Not more automation, but automation that fails safely. Not bigger scrubbing centers, but distributed defense built into the network fabric. Not smarter algorithms, but simpler systems with fewer interaction effects.
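"Fails safely" can be made concrete. For configuration pushes, it might look like the sketch below (hypothetical structure, not any vendor's actual implementation): validate before activating, keep the last known-good version, and degrade one feature rather than crash the service.

```python
# A minimal sketch of fail-safe configuration handling: validate before
# activating, keep the last known-good version, and degrade one feature
# instead of crashing the service. Hypothetical structure, not any
# vendor's actual implementation.
MAX_FEATURES = 200

class BotModule:
    def __init__(self) -> None:
        self.features: list[str] = []   # last known-good feature set
        self.enabled = bool(self.features)

    def apply_config(self, config: dict) -> None:
        features = config.get("features", [])
        if len(features) > MAX_FEATURES:
            # Reject the bad config but keep serving with the previous one.
            # If there is no good version to fall back on, disable this one
            # feature -- the service itself keeps running either way.
            if not self.features:
                self.enabled = False
            print(f"rejected config with {len(features)} features; "
                  f"keeping last known-good ({len(self.features)})")
            return
        self.features = features
        self.enabled = True

module = BotModule()
module.apply_config({"features": [f"f{i}" for i in range(150)]})   # accepted
module.apply_config({"features": [f"f{i}" for i in range(400)]})   # rejected; service stays up
```

The same principle applies to the DNS automation case: a change that would leave zero records behind is exactly the kind of change that should require a second look rather than proceed automatically.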

Perfect reliability is impossible at internet scale. The question is whether we can build networks that fail gracefully, recover quickly, and minimize collateral damage.

The path forward

October and November 2025 taught us that our assumptions were wrong. The architecture isn't inherently robust. The automation isn't foolproof. The scale isn't protective.

The networks we're building for AI workloads will be even more demanding. AI traffic patterns are fundamentally non-deterministic. Training runs spike unpredictably. Inference workloads shift based on user behavior we can't forecast. Model updates trigger cascading reconfigurations. The relatively static traffic patterns we've optimized for disappear, replaced by constant flux.

This makes our current operational model obsolete. We can't debug these networks by stitching together weekly or monthly reports from islands of monitoring data. By the time we've assembled the picture, the network has moved on. We need minute-by-minute, multi-dimensional visibility into what's actually happening. Not dashboards showing yesterday's state, but real-time understanding of traffic flows, capacity utilization, attack patterns, and failure propagation as they occur.
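What that looks like in principle is simple, even if doing it at line rate is not: fold flow records into time buckets as they arrive, keyed by the dimensions you care about, so the current minute can be queried while it is still the current minute. A toy sketch, for illustration only; real telemetry pipelines do this in dedicated hardware and software:

```python
from collections import defaultdict

# A toy of minute-by-minute, multi-dimensional aggregation: flow records are
# folded into per-minute buckets keyed by source prefix and protocol as they
# arrive, so the current minute can be queried while it is still current.
buckets: dict[tuple[int, str, str], int] = defaultdict(int)

def ingest(flow: dict) -> None:
    minute = flow["timestamp"] // 60
    buckets[(minute, flow["src_prefix"], flow["protocol"])] += flow["bytes"]

def minute_view(minute: int) -> dict[tuple[str, str], int]:
    """Traffic in a given minute, broken down by prefix and protocol."""
    return {(p, proto): b for (m, p, proto), b in buckets.items() if m == minute}

# A few synthetic flow records arriving within the same minute.
for flow in [
    {"timestamp": 1_700_000_000, "src_prefix": "203.0.113.0/24", "protocol": "udp", "bytes": 9_000_000},
    {"timestamp": 1_700_000_030, "src_prefix": "203.0.113.0/24", "protocol": "udp", "bytes": 12_000_000},
    {"timestamp": 1_700_000_035, "src_prefix": "198.51.100.0/24", "protocol": "tcp", "bytes": 400_000},
]:
    ingest(flow)

print(minute_view(1_700_000_000 // 60))
# {('203.0.113.0/24', 'udp'): 21000000, ('198.51.100.0/24', 'tcp'): 400000}
```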

If we can't handle today's simpler cases (a DNS record, a configuration file, a large volumetric attack), we're certainly not ready for non-deterministic AI workloads operating at speeds that make human intervention impossible.

The solution isn't to do more of what we're already doing. It's to question whether our current approach can work at all. That's uncomfortable, because it means admitting that decades of architectural decisions might need rethinking. But the alternative is watching our protective systems become our biggest vulnerabilities.

The choice isn't between perfect security and no security. It's between true security and security theater.

October and November 2025 showed us which one we've been practicing.

Jeff Smith

About Jeff Smith

Jeff Smith is the GM and VP of Nokia Deepfield Business Line. 

Based in Ottawa, Canada, Jeff was the NI Business Strategy Director for Security and Analytics before transitioning in 2024 to lead the Deepfield business unit within IP Networks, part of Nokia's Network Infrastructure group. He has held several networking and security roles at Alcatel-Lucent and Nokia focused on ensuring products and solutions are secure and that Nokia remains innovative in helping protect telecommunications providers, mission-critical and large enterprise customers. Before joining Nokia, Jeff led product management and support organizations for start-ups in the security and software industry, including TimeStep, ServiceWare and Asset Software International. He has more than 30 years of experience in technology leadership.
