Today, the brave new, globally networked world has once again shown us how fragile it really is. At around 12:30 p.m. my site first began to slow down, and at some point nothing was reachable at all. HTTP errors, timeouts, everything you love. The cause, according to Cloudflare: unknown, as always. The status page initially offered vague talk of an “internal service degradation” affecting “some customers” while half the Internet was down.
In fact, this was not a small hiccup at the edge but a global outage that cut right through the services. Media houses and news portals, social media platforms like X, streaming services like Spotify, major retail brands including Ikea, financial services providers like Visa, telecoms providers like Vodafone, AI platforms like ChatGPT and other LLM services, gaming services like League of Legends and a whole host of other websites were either completely down or reliably producing 500 errors. That even major news providers and agencies had to publicly admit their own sites were unavailable because of the Cloudflare outage shows just how widespread the impact was today.
Officially, Cloudflare speaks of a “spike in unusual traffic” that led to an error storm and the well-known 500 responses. Whether that is a coded paraphrase for a misconfiguration, a bug in the company’s own infrastructure or a protection mechanism against DDoS traffic that misfired is, as expected, left unsaid. On the status page, communication ranges from “we are investigating the problem” to “we are seeing initial recovery” to “Incident now resolved”, while at the same time noting that some customers still cannot log into the dashboard and some services remain disrupted. That is exactly where I am at the moment: my site is up and running again, but my own access to Cloudflare remains reliably blocked.
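If you would rather not keep refreshing that status page by hand: Cloudflare hosts it on Atlassian Statuspage, which exposes a small public JSON API. A minimal polling sketch, assuming the standard Statuspage v2 endpoints (the paths below follow that generic convention, not any Cloudflare-specific documentation):

```python
import time
import requests

STATUS_URL = "https://www.cloudflarestatus.com/api/v2/status.json"
INCIDENTS_URL = "https://www.cloudflarestatus.com/api/v2/incidents/unresolved.json"

def poll_status(interval_s: int = 60) -> None:
    """Poll the public Statuspage API and print the overall indicator
    plus any unresolved incidents. Endpoint paths follow the standard
    Statuspage v2 convention; treat them as an assumption."""
    while True:
        status = requests.get(STATUS_URL, timeout=10).json()
        print(status["status"]["indicator"], "-", status["status"]["description"])

        incidents = requests.get(INCIDENTS_URL, timeout=10).json()
        for inc in incidents.get("incidents", []):
            print(f"  [{inc['impact']}] {inc['name']} ({inc['status']})")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_status()
```

The small irony: the status page lives outside Cloudflare’s own network, which is presumably why it stayed reachable while everything behind Cloudflare did not.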
The pattern is familiar. First the house is on fire, then the consequences are downplayed: talk of “some customers”, “increased error rates” or “degraded performance”, although downdetectors, media reports and our own experience make it quite clear that this is a systemic outage affecting a significant part of the network. Thousands of fault reports, countless services down at the same time, and yet the official communication never gets beyond an “internal service degradation”. That is about as honest as an electricity supplier speaking of “slightly limited availability” during a nationwide blackout.

This all seems particularly ironic against the backdrop of the much-vaunted, borderless digital world in which everything is getting faster, bigger and more distributed. In practice, however, a disproportionately large part of the infrastructure depends on a single service provider. According to various estimates, Cloudflare protects and accelerates roughly a fifth of the world’s websites, and significantly more in some segments. If this central node stumbles, it is not just “a little bit of something” that fails, but a relevant part of the global public sphere, including news delivery, payment processing, traffic information and communication platforms.
Nor is this an outlier; it fits neatly into the list of major infrastructure failures of recent years. We had the Dyn DNS attack, in which services from Amazon and Netflix to the BBC and the New York Times went down simultaneously because a central DNS backend was hit. We have had global outages at major cloud providers such as AWS and Azure, and most recently the CrowdStrike incident that paralyzed Windows systems worldwide. Now it is Cloudflare causing a kind of digital burst pipe on a Tuesday afternoon. Each time we are assured afterwards that lessons were “learned from the incident”, yet structurally little changes about the massive centralization and single-point-of-failure design we have grown comfortably accustomed to.
For operators like me, the whole thing is doubly annoying. You put real effort into servers, content, security and monitoring, only to watch an external provider turn out the lights with a much-cited “unusual traffic spike”. Your own error log fills up with generic 500 messages from a third party, the Cloudflare log stays meaningless, and the status text sounds as if it were all half as bad as it looks. Meanwhile you receive emails, comments and feedback from readers telling you that “the site is down again”. Explain that this time it is not your own server but the beautiful global infrastructure, and it quickly sounds like an excuse, even though it is exactly the reality.
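One thing that at least helps in those reader conversations is checking whether an error page actually comes from the Cloudflare edge or from your own origin. Cloudflare-served responses carry a cf-ray header and server: cloudflare, and the 520–526 range is reserved for Cloudflare’s own origin-side error codes; those headers and codes are real, while the classification heuristic below is merely a plausible sketch of mine:

```python
import requests

def classify_error(url: str) -> str:
    """Rough heuristic: does a server error come from the Cloudflare
    edge or from the origin behind it? cf-ray and server: cloudflare
    are real Cloudflare response headers; the decision logic itself
    is just a plausible guess, nothing official."""
    resp = requests.get(url, timeout=10)
    if resp.status_code < 500:
        return f"{resp.status_code}: no server error"

    via_cloudflare = (
        "cf-ray" in resp.headers
        or resp.headers.get("server", "").lower() == "cloudflare"
    )
    # 520-526 are Cloudflare-specific codes signalling origin trouble,
    # so these point back at your own server even when CF serves them.
    if via_cloudflare and resp.status_code in range(520, 527):
        return f"{resp.status_code}: Cloudflare reached, origin failing"
    if via_cloudflare:
        return f"{resp.status_code}: error generated at the Cloudflare edge"
    return f"{resp.status_code}: error from the origin (or another proxy)"

print(classify_error("https://example.com/"))
```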
To top it all off, even hours after the initial “all-clear”, some accounts still cannot reliably log into their own Cloudflare dashboard, which makes any meaningful follow-up, log analysis or configuration change difficult. Ideally, after an incident like this, you would want to understand exactly how badly your own traffic was hit, which regions suffered in particular, and whether certain rules or configurations mitigated or amplified the effects. In practice, you are faced with a login screen that either does not load at all or is answered with an error page, while the status page reassuringly notes that “some customers may still be experiencing problems”.
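Once the dashboard cooperates again, the natural follow-up is a query against Cloudflare’s GraphQL Analytics API for the error breakdown during the incident window. The endpoint is real; the dataset and field names below (httpRequestsAdaptiveGroups, edgeResponseStatus) are quoted from memory and should be checked against the live schema, so treat this as a hypothetical sketch:

```python
import os
import requests

# Endpoint is real; dataset and field names are from memory and may
# have drifted -- verify in the schema explorer once logins work again.
API = "https://api.cloudflare.com/client/v4/graphql"

QUERY = """
query ($zone: String!, $start: Time!, $end: Time!) {
  viewer {
    zones(filter: { zoneTag: $zone }) {
      httpRequestsAdaptiveGroups(
        filter: { datetime_geq: $start, datetime_leq: $end }
        limit: 100
      ) {
        count
        dimensions { edgeResponseStatus }
      }
    }
  }
}
"""

def error_breakdown(zone_tag: str, start: str, end: str) -> None:
    """Print request counts grouped by edge response status for the
    given window (ISO 8601 timestamps, e.g. 2024-01-01T12:00:00Z)."""
    resp = requests.post(
        API,
        headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
        json={"query": QUERY,
              "variables": {"zone": zone_tag, "start": start, "end": end}},
        timeout=30,
    )
    zone = resp.json()["data"]["viewer"]["zones"][0]
    for group in zone["httpRequestsAdaptiveGroups"]:
        print(group["dimensions"]["edgeResponseStatus"], group["count"])

# Hypothetical usage; zone tag and window are placeholders:
# error_breakdown("your-zone-tag", "YYYY-MM-DDT11:00:00Z", "YYYY-MM-DDT15:00:00Z")
```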
The much-vaunted “global, beautiful, networked world” shows its dark side at moments like this. For convenience, security and performance we have made ourselves dependent on a situation in which a single company, with one technical error or one piece of failed traffic handling, can make large parts of public discourse, the economy and the infrastructure falter. And when things go wrong, the explanation is a handful of PR phrases, a vague reference to “unusual traffic” and the quiet hope that by the next time, everyone will have forgotten how thin the foundations of this brave new world really are.