When half the internet goes dark: what the Cloudflare outage really teaches us
On Tuesday, the internet was reminded once again of how dependent we are on a very small number of infrastructure players. A single configuration issue at Cloudflare was enough to fully or partially take down thousands of sites for roughly six hours. And not obscure sites, but some of the most widely used services in the world.
When a single company sneezes and half the internet catches a cold, the fragility of the modern web becomes impossible to ignore.
This wasn’t the first wake-up call. Only recently, during an AWS outage, Elon Musk pointed out that Signal is fully dependent on AWS to stay online. In response, a developer dryly noted that X itself has a similar hard dependency, only with Cloudflare. Weeks later, the prediction materialised.
Yet the most remarkable part of this incident wasn’t the failure. It was the postmortem.
In less than twenty-four hours, Cloudflare published a detailed, transparent and deeply technical breakdown of what happened. For anyone who works with digital infrastructure, the level of clarity was impressive. Let’s walk through what happened, why it unfolded the way it did and what we should learn from it.
What actually happened inside Cloudflare
A few hours after containing the incident, Cloudflare’s CEO, Matthew Prince, published a full report explaining exactly what brought half the internet to its knees. The root cause came down to the propagation of a configuration file used by Cloudflare’s Bot Management module. That file broke the module, and the module broke something even more critical: the proxy layer. And when the proxy goes down, the castle goes down with it.
Before diving into the details, it’s worth remembering what this proxy actually does. It shields customers’ origin servers, filters out malicious traffic, blocks bot activity, reduces load and accelerates content delivery. It’s the frontline of Cloudflare’s defence.
And that frontline is precisely what failed.
The entire chain reaction began with something deceptively small: a database permission change in ClickHouse.
Here’s how it unfolded:
➥ The query responsible for retrieving feature data started returning far more entries than it should
➥ This inflated configuration file was passed to the Bot Management module
➥ The module has a hard cap of 200 features for performance reasons
➥ Exceeding this limit caused the system to panic
➥ That panic crashed edge nodes across Cloudflare’s global network
In short: a minor, well-intentioned database change triggered a silent avalanche, cascading into one of the most significant outages of the year.
A closer look at the domino effect
Everything began with something that looked deceptively small: a permission change in a ClickHouse database, which made an additional, underlying database visible to the account that generates the Bot Management feature file.
Before the change: the metadata query saw a single database and returned the expected set of feature entries.
After the change: the same query, which did not filter by database name, returned duplicate rows and produced a configuration file far larger than the module was designed to accept.
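To make the shape of the problem concrete, here is a rough reconstruction of what such a metadata query can look like. The table and column names are illustrative assumptions, not Cloudflare's actual schema or SQL:

```rust
// Hypothetical reconstruction of the feature-metadata query, written as the
// constants a Rust generator might embed. Not Cloudflare's actual SQL.

// Problematic form: no database filter. Every database visible to the
// account contributes rows, so when the permission change exposed a second,
// underlying database, each feature column showed up more than once.
const FEATURES_QUERY: &str =
    "SELECT name, type FROM system.columns WHERE table = 'bot_features'";

// Safer form: pin the query to a single database explicitly, so a
// permissions change cannot silently change the result set.
const FEATURES_QUERY_PINNED: &str =
    "SELECT name, type FROM system.columns \
     WHERE database = 'default' AND table = 'bot_features'";

fn main() {
    // Nothing is executed here; the point is only the shape of the query.
    println!("{FEATURES_QUERY}\n{FEATURES_QUERY_PINNED}");
}
```

A query that never names the database it expects is, in effect, trusting the permission model to stay frozen forever.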
And this is not a figure of speech: Cloudflare shared the exact code in its report.
The issue occurred because a section of that code called .unwrap(), assuming nothing could go wrong.
But something did go wrong.
And instead of handling the error, the process panicked.
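For readers who want to see that failure mode in miniature, here is a hedged sketch, not Cloudflare's actual code: a hard cap of 200 features, a Result that is assumed to always be Ok, and a panic the moment that assumption breaks.

```rust
// A minimal sketch of the failure mode, not Cloudflare's actual code.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
struct Feature {
    name: String,
}

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
}

fn load_features(parsed: Vec<Feature>) -> Result<Vec<Feature>, ConfigError> {
    if parsed.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures {
            got: parsed.len(),
            max: MAX_FEATURES,
        });
    }
    Ok(parsed)
}

fn main() {
    // Imagine the oversized file held roughly twice as many entries as expected.
    let oversized: Vec<Feature> = (0..400)
        .map(|i| Feature { name: format!("feature_{i}") })
        .collect();

    // The dangerous pattern: .unwrap() assumes the Err branch can never
    // happen. When it does, the whole process panics.
    let _features = load_features(oversized).unwrap();
}
```

Run against an oversized input, that final .unwrap() aborts the whole process, which is exactly the behaviour you do not want in software standing in front of millions of websites.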
The silent chaos at the edge
Cloudflare’s edge nodes began to fail gradually.
The configuration file was regenerated every few minutes, and only part of the ClickHouse cluster had picked up the permission change, so some runs produced a good file and others a bad one.
Nodes that loaded a good version came back to life; nodes that loaded a bad one crashed again.
It looked random, but it wasn’t. It was the worst possible combination: intermittent failures spreading slowly across the network.
And why did the investigation take so long?
Because, to make matters worse, Cloudflare’s status page, which runs entirely outside Cloudflare’s own infrastructure, also went down at the start of the incident.
The result: engineers assumed they were under attack.
They weren’t.
But the coincidence pulled their attention in the wrong direction.
The full recovery took roughly six hours, from the first crashes at the edge until every affected system was back to normal.
Why the postmortem was so fast
This part is genuinely unusual for a company of Cloudflare’s size.
The report carried the name of Matthew Prince, the company’s CEO, and it paired the technical detail with a direct, unqualified apology.
This level of transparency is rare. Most companies publish vague, sanitised, weeks-late summaries. Cloudflare did the opposite: fast, honest and technically rich. Whether you admire Cloudflare or not, this is a masterclass in accountability.
What this incident teaches us
There are several key lessons worth highlighting.
1. Errors must be logged, not buried
The offending function returned an error that wasn’t logged anywhere. Had it been, the investigation would have been significantly faster.
Logging feels like overhead, but it is the difference between clarity and guesswork during an incident.
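As an illustration only, reusing the hypothetical Feature, ConfigError and load_features from the sketch above, the same call site could surface the error in the logs and fall back to the last known-good configuration instead of panicking. The log crate is a common Rust logging facade, used here as an assumption rather than a statement about Cloudflare's stack:

```rust
// Hedged alternative to the .unwrap() pattern: log the failure and keep
// serving with the previous configuration. Builds on the earlier sketch.
use log::error;

fn reload_features(parsed: Vec<Feature>, current: Vec<Feature>) -> Vec<Feature> {
    match load_features(parsed) {
        Ok(features) => features,
        Err(e) => {
            // The failure is now visible to operators instead of being
            // converted into a process-wide panic.
            error!("rejected new feature file: {e:?}; keeping previous configuration");
            current
        }
    }
}
```

The trade-off is deliberate: serving slightly stale bot-detection data is painful, but far less painful than taking the proxy down entirely.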
2. Global database changes are inherently risky
The initial change was minor, routine and well-intentioned.
Yet it triggered system-wide effects that were impossible to predict fully. This is the nature of distributed systems: every dependency is a potential domino.
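One way to blunt that risk, sketched here as an assumption rather than a description of Cloudflare's pipeline, is to treat internally generated configuration the same way you would treat user input: validate it against hard expectations before it is ever propagated to the fleet.

```rust
// Illustrative guardrail, not Cloudflare's actual pipeline: check a freshly
// generated feature file against hard expectations before publishing it.
const MAX_FEATURES: usize = 200;

fn validate_feature_file(entries: &[String]) -> Result<(), String> {
    if entries.is_empty() {
        return Err("feature file is empty; refusing to publish".to_string());
    }
    if entries.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, expected at most {}; refusing to publish",
            entries.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}

fn main() {
    // A doubled file, like the one produced after the permission change,
    // would be rejected here instead of ever reaching the edge.
    let doubled: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    assert!(validate_feature_file(&doubled).is_err());
}
```

Rejecting a bad artifact at the source is far cheaper than debugging a flapping global network.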
3. Two simultaneous failures can mislead even the best engineers
The status page outage was unrelated, but it created the illusion of a coordinated attack. When teams are under pressure, they connect dots that shouldn’t be connected.
4. The internet rests on very few pillars
Cloudflare, AWS, Google Cloud, Fastly.
When one fails, we all feel it.
Redundancy is possible in theory, but in practice:
➥ running a backup CDN is expensive
➥ switching traffic to origin servers creates unpredictable load
➥ warming caches for alternative providers is slow and costly
Even Downdetector, the service people turn to when everything else is failing, went down during the Cloudflare outage. True independence is rare, and realistically unattainable for most organisations.
5. Transparency is still the most powerful tool for trust
Did Cloudflare make a mistake? Yes.
But the way it owned that mistake was exemplary.
A fast, thorough and direct postmortem, free from defensive language.
The company lost points for the failure and gained points for its maturity.
At the end of the day, the internet wants reliability.
But when something breaks, what it really wants is honesty.
What this all says about the future of the internet
This was not an isolated incident.
In recent years, outages at AWS, Fastly, Google Cloud and Cloudflare itself have each taken large parts of the web offline for hours at a time.
The truth is simple.
The internet functions like a giant castle resting on a handful of pillars.
And each pillar is a private company.
When one pillar fails, the castle shakes.
For most organisations, true redundancy costs far more than they can reasonably afford. Which is why the world will continue to depend on these centralised infrastructures.
It is a delicate, imperfect and deeply vulnerable balance.
And, paradoxically, what keeps everything running is precisely this: strong teams, good processes and honest postmortems.
When complexity comes calling
The Cloudflare failure shows two things at once.
When half the internet goes down, everyone feels it.
But when someone explains everything clearly, everyone learns.
And that is how an error becomes evolution.
There is also another reading here: complexity is inevitable, but disorganisation is optional. Organisations that grow without technical discipline end up relying on luck. Those that grow with clarity, method and solid architecture dramatically reduce the risk of becoming a headline for the wrong reasons.
This is where Devovea comes in
We help companies navigate exactly this kind of ambiguity, grounded in three fundamental pillars:
A well-designed architecture eliminates surprises. It grows from a deep understanding of the business, its dependencies and its risk pathways. No castles built on sand.
Changes to infrastructure, platforms or integrations are not tasks. They are turning points. Devovea helps transform this complexity into decisions with positive, predictable and sustainable impact.
Implementation needs direction, cadence and governance. We work alongside partners, technical teams and leadership to turn theory into practice, and practice into measurable results.
Mistakes happen. What must not happen is the same mistake twice.
And when it comes to commerce architecture, platforms and digital operations, you deserve a partner who treats every decision as a piece of your company’s future.
If your goal is to avoid systemic risks, strengthen your digital foundation and grow with confidence, Devovea is your next phase.
Ready to take the next step?
Build a digital operation that is safer, clearer and genuinely resilient
Incidents like Cloudflare’s show how small technical decisions can generate enormous business impact. If you want to strengthen your architecture, uncover hidden risks or ensure your digital operation grows with solidity, Devovea is the strategic partner you’ve been missing.
We work side by side with you to bring clarity, reduce complexity and turn critical decisions into safe, sustainable pathways forward.



