Cloudflare Gets Transparent on DNS Resolver Outage

It’s a cautionary tale for those coding the complex algorithms that go into automated mitigation.

In a testament to transparency, Cloudflare has explained a 17-minute outage on its 1.1.1.1 resolver service last week: It was a glitch in its own systems, not a cyber-incident.

The 1.1.1.1 service is a public Domain Name System (DNS) resolver: It translates domain names (say, “cloudflare.com”) into their corresponding numerical IP addresses, so that browsers and other applications can find and connect to the servers behind a given site. When the resolver is unreachable, users who rely on it cannot look up domain names at all, even though the sites themselves are still up.

Cloudflare saw a global outage last Thursday, May 31, caused by a coding oversight in its Gatebot DDoS mitigation pipeline. Gatebot protects against different types of DDoS attacks at Layer 3, Layer 4 and Layer 7 by collecting live traffic measurements to detect malicious traffic, automatically choosing appropriate mitigations and then executing the mitigation logic on the edge.

However, after a new code release, Gatebot began interpreting resolver traffic on the Cloudflare network as an attack, and locked it down. That’s because one of the changes in the release automated the process by which Gatebot determines whether an IP address is a Cloudflare IP address or not, and the developers failed to carry over an IP range exception during the integration.

“Provision API is a simple RESTful API used to provide this kind of information,” explained the company, in a posting on Friday. “Prior to its existence, Gatebot had to figure out which IP addresses were Cloudflare addresses by reading a list of networks from a hard-coded file…[So] we deployed new code that introduced Gatebot to Provision API.”

What the coders failed to account for was that Gatebot’s hardcoded list of Cloudflare addresses contained a manual exception for the 1.1.1.0/24 and 1.0.0.0/24 recursive DNS resolver IP ranges.
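To make the failure mode concrete, here is a minimal Python sketch of how an exception can silently disappear when a hard-coded list is replaced by a lookup service. It is not Cloudflare's code. The function names are hypothetical, and the meaning given to the exception (excepted ranges are never automatically mitigated) is an assumption made for illustration; only the two resolver ranges come from the company's account.

```python
import ipaddress

# The two ranges the article says carried a manual exception in the
# old hard-coded list. Everything else in this sketch is hypothetical.
RESOLVER_EXCEPTIONS = [
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("1.0.0.0/24"),
]


def in_any(address, networks):
    """Return True if the address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


def may_mitigate_legacy(address):
    """Old behaviour (assumed): excepted ranges are never auto-mitigated."""
    return not in_any(address, RESOLVER_EXCEPTIONS)


def may_mitigate_migrated(address, exceptions=()):
    """New behaviour: only exceptions explicitly carried over still apply."""
    return not in_any(address, exceptions)


def find_regressions(samples, exceptions=()):
    """List sample addresses that the old and new code paths treat differently."""
    return [
        addr
        for addr in samples
        if may_mitigate_legacy(addr) != may_mitigate_migrated(addr, exceptions)
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address used as a neutral control.
    samples = ["1.1.1.1", "1.0.0.1", "192.0.2.10"]
    print(find_regressions(samples))                       # exception dropped
    print(find_regressions(samples, RESOLVER_EXCEPTIONS))  # exception carried over
```

A comparison like find_regressions, run against a handful of known-special addresses before a release, is one way a dropped exception could surface in testing rather than in production.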


“The whole idea of the fix was to remove the hardcoded gotchas!” the company said, adding, “while Gatebot, the DDoS mitigation system, has great power, we failed to test the changes thoroughly. We are using today’s incident to improve our internal systems…The next time we mitigate 1.1.1.1 traffic, we will make sure there is a legitimate attack hitting us.”
