At a time when hyperscalers are facing a string of outages, cloud infrastructure company Cloudflare suffered an outage last Tuesday that took down several apps and websites and caused widespread disruption. The company now says a database issue was behind the downtime.
In a detailed postmortem on the incident, which crippled major services like OpenAI, Perplexity, Spotify, and Canva, the company explained that the failure was triggered by a permissions change in ClickHouse, one of the databases Cloudflare uses.
The change caused the database to emit duplicate entries into a “feature file” used by the bot management system; the file doubled in size and triggered a panic in the software running on Cloudflare’s servers, co-founder and CEO Matthew Prince explained in a detailed blog post published on the company’s website.
Cloudflare denies any suggestion of a DDoS attack
Prince began by reiterating that “this issue was not caused, directly or indirectly, by any type of cyber attack or malicious activity,” noting that the software that routes traffic on Cloudflare’s network relies on a regularly regenerated feature file to keep the bot management system up to date with ever-changing threats.
“That software had a limit on the size of the feature file that was below its doubled size, and that was the cause of the software failure,” Prince said, revealing that security analysts initially suspected the symptoms were the result of a hyper-scale DDoS attack. The team eventually fixed the issue by rolling back to an earlier version of the feature file.
What exactly happened? How is Cloudflare addressing it?
Cloudflare’s Bot Management system is built around a machine learning model that generates a bot score for every request traversing the company’s network; customers use these scores to control which automated traffic is allowed through. The feature configuration file is the input to that model: it defines the traffic features the model uses to predict whether a request is automated.
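The relationship between the feature file and the bot score can be sketched as follows. This is a minimal illustration only, assuming a simple weighted model; the feature names, weights, and scoring logic are hypothetical, not Cloudflare’s actual (non-public) model:

```python
# Hypothetical sketch: the feature configuration file names the traffic
# features the model may consult; the model scores each request using
# only those features. All names and weights here are made up.
FEATURE_CONFIG = ["ua_entropy", "tls_fingerprint_rarity", "req_rate"]
WEIGHTS = {"ua_entropy": 0.5, "tls_fingerprint_rarity": 0.3, "req_rate": 0.2}

def bot_score(request_features):
    """Score a request using only the features listed in the config file."""
    return sum(WEIGHTS[f] * request_features.get(f, 0.0) for f in FEATURE_CONFIG)

print(bot_score({"ua_entropy": 1.0, "req_rate": 0.5}))  # 0.6
```

The key point is that the config file, not the request, decides which features the model reads at runtime, which is why a malformed config file can break scoring for all traffic at once.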
This file is refreshed every few minutes and helps Cloudflare keep pace with shifting traffic patterns across its network. “This allows us to respond to new types of bots and new bot attacks. It is therefore important that this file is deployed frequently and quickly, as malicious actors can change their tactics fast,” Prince said.
“A change in the underlying ClickHouse query behaviour that generates this file caused it to have a large number of duplicate ‘feature’ rows. This changed the size of the previously fixed-size feature configuration file, which triggered an error in the bots module. As a result, the core proxy system that handles traffic processing for our customers returned HTTP 5xx error codes for traffic that depended on the bots module.”
In the post, Prince further explains that the database cluster ingests data into distributed tables via queries run under a shared system account. On the day of the outage, Cloudflare changed permissions so that users were granted explicit, rather than implicit, access to see the metadata of the underlying tables.
“With this change, all users had access to accurate metadata about the tables they have access to… [the query then returned] ‘duplicates’ of columns, because they also came from the underlying tables.”
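The mechanism Prince describes can be simulated in a few lines. This is a hypothetical sketch, not Cloudflare’s actual schema or query: a metadata query that never filtered by database returns each column twice once a second, previously hidden database becomes visible.

```python
# Column metadata as (database, table, column) rows, loosely mimicking a
# system.columns-style catalog. Names are illustrative.
VISIBLE_BEFORE = [
    ("default", "http_features", "bot_score_input_1"),
    ("default", "http_features", "bot_score_input_2"),
]

# After the permissions change, the underlying storage database is also
# visible, exposing a second copy of every column.
VISIBLE_AFTER = VISIBLE_BEFORE + [
    ("r0", "http_features", "bot_score_input_1"),
    ("r0", "http_features", "bot_score_input_2"),
]

def feature_rows(catalog):
    """Emulate a query that selects column names without filtering by database."""
    return [column for _db, _table, column in catalog]

print(len(feature_rows(VISIBLE_BEFORE)))  # 2
print(len(feature_rows(VISIBLE_AFTER)))   # 4 -- every feature now appears twice
```

Because the query was written when only one database was visible, it had no reason to deduplicate; the permissions change silently doubled its output.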
Because the bot management feature file is generated from this query, and its generation logic turned each returned row into an input feature, the duplication caused the number of features in the file to balloon. “The Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently, that limit is set to 200…When a bad file containing more than 200 features was propagated to the server, this limit was reached, resulting in the system panicking,” Prince noted.
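The failure mode — a hard limit that causes a crash rather than a recoverable error — can be sketched like this. The 200-feature limit comes from the blog post; the loader itself is a hypothetical stand-in for the actual (Rust) proxy code:

```python
# Illustrative sketch of a fixed-capacity feature loader that fails hard
# when the configuration file exceeds its preallocated limit.
FEATURE_LIMIT = 200

def load_features(feature_file_lines):
    features = []
    for line in feature_file_lines:
        if len(features) >= FEATURE_LIMIT:
            # In the real system this path was an unrecoverable error (a
            # panic), taking down the proxy for traffic using the module.
            raise RuntimeError("feature limit exceeded")
        features.append(line)
    return features

load_features([f"f{i}" for i in range(200)])      # fine: exactly at the limit
# load_features([f"f{i}" for i in range(400)])    # raises: doubled file
```

A file at or under the limit loads normally; the doubled file crosses the threshold and aborts, which is why every server that received the bad file began returning 5xx errors.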
To prevent similar failures, Cloudflare says it is hardening the ingestion of its own internally generated configuration files just as it does for user-generated input, and enabling more global kill switches for features. It is also eliminating the ability for core dumps and other error reports to overwhelm system resources, and reviewing failure modes for error conditions across all core proxy modules.
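The spirit of the first two mitigations — validating internally generated files as if they were untrusted input, and keeping a safe fallback — can be sketched as follows. Function names, the validation rules, and the fallback behaviour are illustrative assumptions, not Cloudflare’s actual implementation:

```python
# Hypothetical sketch: treat an internally generated config file like
# untrusted input, and keep the last known-good version rather than
# propagating a bad one.
FEATURE_LIMIT = 200

def validate_feature_file(lines):
    """Reject oversized or duplicate-laden generated configuration."""
    if len(lines) > FEATURE_LIMIT:
        return False
    if len(set(lines)) != len(lines):
        return False  # duplicates suggest an upstream query bug
    return True

def next_config(candidate, last_known_good):
    """Deploy the candidate only if it validates; otherwise keep the
    previous version (a crude stand-in for a kill switch / rollback)."""
    return candidate if validate_feature_file(candidate) else last_known_good

good = ["feat_a", "feat_b"]
bad = good * 150  # duplicated rows blow past the limit
print(next_config(bad, good))  # ['feat_a', 'feat_b'] -- bad file rejected
```

Under this scheme, the duplicated file from the outage would have been rejected at generation time and the fleet would have kept serving from the previous version, which is essentially the rollback Cloudflare performed manually.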
