Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big chunk of the internet with it. The Register reports: We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global outage had been caused by an error in a single line of code in a system the company uses to push rapid software changes. […] First up the error itself — it was in this bit of code: .*(?:.*=.*). We won’t go into the full workings as to why because the post does so extensively (a Friday treat for coding nerds) but very broadly the code caused a lot of what’s called “backtracking,” basically repetitive looping. This backtracking got worse — exponentially worse — the more complex the request and very, very quickly maxed out the company’s CPUs.
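To make the backtracking concrete, here is a minimal sketch of a toy backtracking matcher that counts how many match attempts it makes. It is not Cloudflare's expression engine. The token list is a hand-written stand-in for the quoted pattern with the non-capturing group flattened (`.*.*=.*`), and `count_steps` and `TOKENS` are hypothetical names invented for this illustration. In this toy the growth is quadratic rather than exponential, so it only shows that the work grows much faster than the input when no `=` is present.

```python
# Toy backtracking matcher, for illustration only.
# Supports two token kinds: ".*" (greedy, any characters) and a single literal.

TOKENS = [".*", ".*", "=", ".*"]  # hand-flattened form of .*(?:.*=.*)


def count_steps(tokens, text):
    """Return (matched, steps), where steps counts every match attempt."""
    steps = 0

    def match(token_index, pos):
        nonlocal steps
        steps += 1
        if token_index == len(tokens):
            return True  # every token consumed: prefix match succeeded
        token = tokens[token_index]
        if token == ".*":
            # Greedy: try the longest span first, then give back one
            # character at a time. Each retry is a backtrack.
            for end in range(len(text), pos - 1, -1):
                if match(token_index + 1, end):
                    return True
            return False
        # Literal token: must equal the character at the current position.
        if pos < len(text) and text[pos] == token:
            return match(token_index + 1, pos + 1)
        return False

    return match(0, 0), steps


if __name__ == "__main__":
    for n in (10, 20, 40):
        matched, steps = count_steps(TOKENS, "x" * n)
        print(n, matched, steps)
```

Run against inputs of 10, 20 and 40 `x` characters with no `=`, the toy never matches and takes 78, 253 and 903 steps: doubling the input roughly quadruples the work.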
The impact wasn’t noticed for the simple reason that the test suite didn’t measure CPU usage. It soon will: Cloudflare has set itself an internal deadline of a week from now. The second problem was that a software protection system that would have prevented excessive CPU consumption had been removed “by mistake” just weeks earlier. That protection is now back in, although it clearly needs to be locked down. The software used to run the code, the expression engine, also doesn’t have the ability to check for the sort of backtracking that occurred. Cloudflare says it will shift to one that does. The post goes on to talk about the speed with which the outage hit everyone, why it took the company so long to fix it, and why it didn’t simply roll back within minutes to solve the issue while it figured out what was going on. You can read the full postmortem here.
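As a rough sketch of what measuring CPU usage in a test suite could look like, the snippet below times a pattern against a sample input and fails if it exceeds a budget. This is an assumption about the general technique, not a description of Cloudflare's tooling: `cpu_seconds`, `assert_cpu_budget`, the sample pattern and the budget values are all hypothetical, and it uses Python's standard `re` module rather than the company's expression engine.

```python
import re
import time


def cpu_seconds(func, *args):
    """Run func(*args) and return (result, CPU seconds used by this process)."""
    start = time.process_time()
    result = func(*args)
    return result, time.process_time() - start


def assert_cpu_budget(pattern, sample, budget_seconds):
    """Fail if searching `sample` with `pattern` uses more CPU than the budget.

    Hypothetical helper: a real suite would need representative worst-case
    samples and a budget tuned to its own hardware.
    """
    compiled = re.compile(pattern)
    _, used = cpu_seconds(compiled.search, sample)
    if used > budget_seconds:
        raise AssertionError(
            f"pattern {pattern!r} used {used:.3f}s CPU, "
            f"budget was {budget_seconds:.3f}s"
        )
    return used


if __name__ == "__main__":
    # A cheap pattern on a modest input should fit a generous budget.
    print(assert_cpu_budget(r"a=b", "x" * 1000, 5.0))
```

A check like this only catches regressions on the inputs it is given, so it complements, rather than replaces, a run-time CPU limit or an engine that avoids backtracking.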