MADRID—John Graham-Cumming presided over a confessional Wednesday at Virus Bulletin 2017.
Cloudflare’s chief technology officer was frank and apologetic about February’s Cloudbleed bug, which leaked memory from the content delivery network that included internal private keys and authentication secrets, in addition to customers’ HTTP headers and POST data, URI parameters, API authentication secrets and OAuth keys.
Graham-Cumming provided a blow-by-blow account in today’s keynote of Cloudflare’s shortcomings and accountability before and during the incident, as well as its response and relatively quick remediation, which took under an hour.
“We had it fixed in 47 minutes, but yeah, it was pretty bad,” Graham-Cumming said. “We had been leaking a lot of private stuff.”
Cloudflare not only faced internal remediation, but also had to work with major search engines that had been caching the leaks, in order to have the private and sensitive data removed.
“The ugly part of this was that it was cached all over the place,” Graham-Cumming said. “[The search engines] are not geared up for you calling them and saying can you purge this. We had to work closely with Google, Bing, Yahoo in particular to get these caches removed.”
The saga began Feb. 17 with a tweet from Google Project Zero’s Tavis Ormandy asking for someone from Cloudflare incident response to contact him. Ormandy had discovered uninitialized memory among valid data that he determined was coming from a Cloudflare reverse proxy.
“It looked like that if an html page hosted behind Cloudflare had a specific combination of unbalanced tags, the proxy would intersperse pages of uninitialized memory into the output (kinda like Heartbleed, but Cloudflare-specific and worse for reasons I’ll explain later),” Ormandy said at the time in a Project Zero bug report. “My working theory was that this was related to their ‘ScrapeShield’ feature which parses and obfuscates html – but because reverse proxies are shared between customers, it would affect *all* Cloudflare customers.”
As it turned out, a number of conditions had to be in place for Cloudbleed to begin leaking data. Within 21 minutes, Graham-Cumming said, Ormandy had shared what he knew; within another 10 minutes, a closed-door meeting was under way at Cloudflare’s San Francisco office.
Based on Ormandy’s disclosure, Graham-Cumming said, Cloudflare traced the bug to an HTML parser used by three of its features: Automatic HTTP Rewrites, Server-Side Excludes and Email Obfuscation. The leaks escalated once Email Obfuscation went live on Feb. 13, with more than 637,000 recorded in the five days between its rollout and Ormandy’s disclosure. By comparison, there were 605,000 leaks between Sept. 22, when the first of the features was put into production, and Feb. 13.
Cloudflare said in February that it had replaced its Ragel HTML parser a year earlier with a homemade parser called cf-html. The underlying bug, it said, was present in the Ragel-based parser as well, but had never been triggered because of the way its NGINX buffers were used; the new parser changed the buffering and caused the leakage. The three features using the parser (Automatic HTTP Rewrites, enabled Sept. 22; Server-Side Excludes, enabled Jan. 30; and Email Obfuscation, enabled Feb. 13) were globally disabled or patched upon learning of the bug, Graham-Cumming said today.
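Cloudflare’s public postmortem attributed the overrun to an end-of-buffer check in the generated parser code that tested for equality rather than “greater than or equal”: a tag truncated at the end of a buffer could push the parser’s pointer past the end marker, so the equality test never fired. A minimal sketch of that failure mode, illustrative only and not Cloudflare’s actual code:

```python
def scan(buf: bytes, length: int) -> int:
    """Sketch of a Ragel-style scanner that tests for end-of-buffer
    with equality instead of >=.

    `buf` is the whole memory region; the parser is only supposed to
    look at the first `length` bytes. Returns how many bytes it read.
    """
    p, pe = 0, length          # pe points one past the last valid byte
    read = 0
    while p != pe:             # BUG: should be p < pe
        if buf[p] == ord('<'):
            p += 2             # consume '<' plus the next byte: can overshoot pe
        else:
            p += 1
        read += 1
        if p >= len(buf):      # Python raises IndexError where C would read
            break              # adjacent heap memory; stop the demo here
    return read

memory = b"ab<cSECRET"         # SECRET sits right after the parser's buffer
print(scan(memory, 4))         # well-formed input stops exactly at pe -> 3
print(scan(memory, 3))         # buffer ends mid-tag: p jumps past pe and the
                               # scan runs on into SECRET -> 9
```

With a well-formed buffer the pointer lands exactly on the end marker and the loop stops; with a tag truncated at the buffer boundary it overshoots, and in C the loop keeps reading whatever memory follows the buffer, which is how other customers’ data could end up interspersed in responses.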
Customers, the CTO said, were impacted in proportion to the number of requests they were sending through Cloudflare’s infrastructure per month. A relatively small number of customers making 200 billion requests per month, for example, suffered between 22,000 and 33,000 leaks. Most of its customers make fewer than 10 million requests per month and statistically suffered zero leaks.
Most gutting to Graham-Cumming once the post-mortems were complete was the realization that Cloudflare had all along had the intelligence it needed to know its infrastructure had sprung a leak: logged NGINX core dumps showed large spikes that correlated with the leaks, a discovery that led to an internal zero-tolerance policy toward such dumps.
Those crashes were traced to an unpatched hardware issue related to Intel Broadwell processors. Once the update was applied, the core dumps disappeared as well.
“We should have been able to catch this. We should have known this was going on,” Graham-Cumming said. “We were getting core dumps, and ignoring them in production.”
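The zero-tolerance lesson is straightforward to automate. A minimal sketch of such a check, with a hypothetical dump directory and naming scheme rather than anything Cloudflare actually runs, treats any core file in production as an alert-worthy event:

```python
from pathlib import Path

# Hypothetical location where the kernel writes core dumps; real
# deployments configure this via kernel.core_pattern.
CORE_DIR = Path("/var/crash")

def core_dumps(core_dir: Path) -> list[str]:
    """Return the names of core files present, newest first."""
    dumps = sorted(core_dir.glob("core.*"),
                   key=lambda f: f.stat().st_mtime,
                   reverse=True)
    return [f.name for f in dumps]

if __name__ == "__main__":
    found = core_dumps(CORE_DIR)
    if found:  # zero tolerance: any dump at all should page a human
        print(f"ALERT: {len(found)} core dump(s), newest: {found[0]}")
```

The point is not the tooling but the policy: a crash counter that must stay at zero would have surfaced the spikes that Cloudflare’s logs were already recording.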