Cloudflare global outage — how a small database change brought services to a halt

Ethan Cole

The Cloudflare global outage that struck on November 18 revealed how fragile large-scale infrastructure can be when a tiny internal change spirals out of control. The disruption began around 11:20 UTC and quickly spread across Cloudflare’s CDN and security layers, producing waves of 5xx errors worldwide. For a period, even Cloudflare’s own team couldn’t sign into internal dashboards.

According to CEO Matthew Prince, the outage traced back to a regression introduced during a routine update to Cloudflare’s ClickHouse database cluster. The goal of the change was to tighten security by making table permissions fully explicit. However, the update had an unexpected consequence inside the Bot Management system.

How the Cloudflare global outage began with a subtle database regression

Engineers expected the metadata query to return a clean list of columns from the “default” database. Instead, once the new permissions also exposed the underlying r0 database shards, the same query began returning each column twice, a duplication no one anticipated.
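
To make the failure mode concrete, here is a minimal, purely illustrative sketch (not Cloudflare’s actual query or schema; the table and column names are invented) of how a metadata lookup that isn’t scoped to a single database starts returning each column twice once a second database becomes visible:

```python
# Toy stand-in for a ClickHouse system.columns lookup.
# Table names and the r0 shard layout here are assumptions for illustration only.
from typing import Optional

SYSTEM_COLUMNS = [
    # (database, table, column)
    ("default", "http_requests_features", "bot_score"),
    ("default", "http_requests_features", "ja3_hash"),
    # After the permissions change, the underlying shard tables become visible too:
    ("r0", "http_requests_features", "bot_score"),
    ("r0", "http_requests_features", "ja3_hash"),
]

def feature_columns(table: str, database: Optional[str] = None) -> list[str]:
    """Return column names for a table, optionally scoped to one database."""
    return [
        col
        for db, tbl, col in SYSTEM_COLUMNS
        if tbl == table and (database is None or db == database)
    ]

# Before the change, the unscoped query effectively saw only "default":
print(feature_columns("http_requests_features", database="default"))
# ['bot_score', 'ja3_hash']

# After the change, the same unscoped query returns every column twice:
print(feature_columns("http_requests_features"))
# ['bot_score', 'ja3_hash', 'bot_score', 'ja3_hash']
```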

This duplication inflated the “feature file,” which holds configuration data for identifying bot activity. The file became nearly twice its normal size. Cloudflare’s core proxy pre-allocates memory for performance reasons, but it enforces a strict 200-feature safety limit. When the oversized file hit the network, the module exceeded that cap and crashed.
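
The pattern is easy to picture in miniature. The sketch below is a hypothetical illustration rather than Cloudflare’s code: a module that pre-allocates room for a fixed maximum number of features and errors out when a configuration file exceeds that cap. The limit of 200 comes from the article; everything else is invented.

```python
# Hypothetical illustration of a pre-allocated feature table with a hard cap.
MAX_FEATURES = 200

class FeatureLimitExceeded(RuntimeError):
    """Raised when a feature file carries more entries than the module pre-allocated."""

def load_feature_file(lines: list[str]) -> list[str]:
    features = []
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if len(features) >= MAX_FEATURES:
            # In the real incident, the proxy module hit its limit and crashed,
            # which is what turned a bad config file into waves of 5xx errors.
            raise FeatureLimitExceeded(f"feature file has more than {MAX_FEATURES} entries")
        features.append(name)
    return features

# A duplicated file roughly doubles the entry count and blows past the cap:
normal = [f"feature_{i}" for i in range(150)]
duplicated = normal * 2          # ~300 entries
load_feature_file(normal)        # fine
load_feature_file(duplicated)    # raises FeatureLimitExceeded
```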

The issue presented itself unevenly, making debugging much harder. Since Cloudflare rolled out the database update gradually, systems would behave normally one minute and fail the next. Engineers initially suspected a massive DDoS attack rather than an internal regression. Confusion escalated when Cloudflare’s own status page went offline at the same time — an unrelated coincidence that only deepened the uncertainty.

Widespread impact shows why the Cloudflare global outage felt like “the internet breaking”

The outage’s scale surprised the public. On Reddit, one user summarized the experience:

“You don’t realize how many websites use Cloudflare until Cloudflare stops working.”

Because so many popular services rely on Cloudflare’s edge network, both users and companies felt the disruption instantly. Prince later wrote that any moment when Cloudflare’s network can’t route traffic is “deeply painful” for the entire team. He also acknowledged that this was the company’s most significant outage since 2019.

Cloudflare global outage reignites debate over single-vendor dependence

While customers struggled, industry voices revived a familiar discussion: Should critical systems rely on one vendor?

Syber Couture CEO Dicky Wong described the event as a reminder that even the best-performing platforms can become single points of failure. He compared the situation to “marriage without a prenup,” stressing that multi-vendor strategies are essential for resilience.

Developers on Reddit echoed the sentiment. One user pointed out that a handful of cloud providers now carry most of the internet, meaning a single outage can destabilize huge portions of the web. Senior technology leader Jonathan B. added on LinkedIn that organizations often choose simplicity over redundancy — until that simplicity becomes the outage everyone is talking about.

Cloudflare restores service and plans new safeguards

Engineers resolved the incident by manually distributing a known-good version of the feature file. Traffic returned to normal around 14:30 UTC, and the system stabilized fully later in the afternoon.

Cloudflare says it is now reviewing failure modes across all proxy modules, especially where pre-allocated memory limits can cause cascading failures. Future updates will aim to ensure that malformed or unexpected inputs cannot trigger similar outages.
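
One defensive pattern this points toward is treating an invalid configuration as a reason to keep serving the last known-good version rather than to crash. The sketch below is a generic illustration of that idea, with all names and validation rules invented; it is not Cloudflare’s implementation.

```python
# Hypothetical "fail safe" config loading: if a new feature file is oversized or
# malformed, keep the last known-good one instead of crashing the proxy module.
MAX_FEATURES = 200

class FeatureConfig:
    def __init__(self, features: list[str]):
        self.features = features

def validate(candidate: list[str]) -> FeatureConfig:
    if len(candidate) > MAX_FEATURES:
        raise ValueError(f"too many features: {len(candidate)} > {MAX_FEATURES}")
    return FeatureConfig(candidate)

def apply_update(current: FeatureConfig, candidate: list[str]) -> FeatureConfig:
    """Try to apply a new feature file; on any validation error, keep the old config."""
    try:
        return validate(candidate)
    except ValueError as err:
        # Log and alert instead of taking down the data path.
        print(f"rejected feature file update, keeping previous config: {err}")
        return current

config = FeatureConfig([f"feature_{i}" for i in range(150)])
config = apply_update(config, [f"feature_{i}" for i in range(300)])  # rejected, old config kept
```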

Conclusion

The Cloudflare global outage shows how a tiny internal change can ripple across one of the internet’s most critical infrastructures. It also highlights the importance of redundancy, resilience and defensive engineering — especially as dependency on large cloud vendors continues to grow.
