What SREs Can Learn from Facebook’s Largest Outage

by | October 8, 2021

Source: https://rootly.io/blog/what-sres-can-learn-from-facebook-s-largest-outage?utm_source=reddit&utm_medium=blog

SRE (Site Reliability Engineer): an engineer whose main role is maximizing the reliability of IT systems.

Facebook’s October 2021 outage was the type of event that gives SREs nightmares: A series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.

It’s also a pretty big learning opportunity for SREs. The outage is a lesson in how even expertly planned systems can sometimes fail, despite having multiple layers of reliability built in.

Here’s an overview of what SREs can learn from the Facebook outage, based on the information reported so far by Facebook and the media about the event at the time of writing.

What happened: A “cascade of errors”

The outage wasn’t the result of one simple mistake or oversight. It was instead a “cascade of errors” that bred a critical disruption, as the New York Times put it.

That cascade started when an engineer ran a command that was supposed to assess capacity for Facebook’s data centers. For reasons that Facebook hasn’t fully explained (maybe it was a typo — a mistake that has caused more than one serious incident in the past — but we’re just guessing), the command disrupted the backbone network that connects Facebook’s data centers. Facebook’s DNS servers also became unreachable.

An auditing tool was supposed to have detected and blocked the errant command, but Facebook said that a “bug” prevented the tool from catching the issue.
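Facebook has not described how its auditing tool works, so the following is only a minimal sketch of the general idea: before a command runs, estimate how much of the backbone it would disable and refuse if that exceeds a limit. The names `ProposedChange` and `audit_change`, the link IDs, and the 25% threshold are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    """Hypothetical description of what a command would do if it ran."""
    command: str
    links_taken_down: frozenset  # backbone link IDs the command would disable


def audit_change(change, all_links, max_fraction_down=0.25):
    """Return (allowed, reason); block changes that disable too much at once."""
    if not all_links:
        # Without an inventory we cannot estimate impact, so fail closed.
        return False, "no link inventory available; refusing to guess"
    unknown = change.links_taken_down - all_links
    if unknown:
        return False, f"change references unknown links: {sorted(unknown)}"
    fraction = len(change.links_taken_down) / len(all_links)
    if fraction > max_fraction_down:
        return False, f"would disable {fraction:.0%} of backbone links"
    return True, "within blast-radius limit"


# Assumed example inventory: four links between four data centers.
links = frozenset({"dc1-dc2", "dc1-dc3", "dc2-dc3", "dc3-dc4"})
print(audit_change(ProposedChange("assess-capacity", links), links))
```

The sketch fails closed when it cannot evaluate a change. A guard like this is itself software, and a bug in it can let an errant command through, which is what Facebook reported.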

Troubled incident response

Facebook was clearly prepared to respond to this incident quickly and efficiently. If it weren’t, it would no doubt have taken days rather than just hours to restore service following a failure of this magnitude.

Nonetheless, Facebook has reported that troubleshooting and resolving the network connectivity issues between data centers proved challenging for three main reasons.

First and most obvious, engineers struggled to connect to data centers remotely without a working network. That’s not surprising; as an SRE, you’re likely to run into an issue like this sooner or later. Ideally, you’ll have some kind of secondary remote-access solution, but that’s hard to implement within the context of infrastructure like this.

The second challenge is more interesting. Because Facebook’s data centers “are designed with high levels of physical and system security in mind,” according to the company, it proved especially difficult for engineers to restore networking even after they went on-site at the data centers. Facebook hasn’t offered specific technical details on this topic, but we assume that access controls made it difficult for engineers to do their work.

Finally, even after data center networking was restored, restoring service to users took time because engineers couldn’t turn everything back on at once without risking excess electrical consumption in their data centers. For that reason, it took some time (Facebook hasn’t said exactly how long) to get apps back online even after the root cause issue had been addressed.
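The idea of bringing services back in stages can be sketched as a simple planning step: group services into waves so that the combined startup load of each wave stays under a budget. This is an illustration under stated assumptions, not Facebook’s procedure; the function name `plan_restore_waves`, the service names, and the kilowatt figures are all hypothetical.

```python
def plan_restore_waves(services, surge_budget_kw):
    """Group services into ordered waves whose combined startup surge fits the budget.

    services: list of (name, startup_surge_kw) tuples, already in priority order.
    """
    waves, current, used = [], [], 0.0
    for name, surge in services:
        if surge > surge_budget_kw:
            raise ValueError(f"{name} alone exceeds the surge budget")
        if used + surge > surge_budget_kw:
            # This service does not fit; close the current wave and start a new one.
            waves.append(current)
            current, used = [], 0.0
        current.append(name)
        used += surge
    if current:
        waves.append(current)
    return waves


# Assumed example figures, for illustration only.
plan = plan_restore_waves(
    [("dns", 40), ("backbone", 50), ("web", 30), ("cache", 60)],
    surge_budget_kw=100,
)
print(plan)  # [['dns', 'backbone'], ['web', 'cache']]
```

The sketch keeps services in priority order rather than packing waves as tightly as possible, on the assumption that restore order matters more than the number of waves.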

Takeaways for SREs

Apart from taking heart in knowing that even Web-scale companies that make huge investments in reliability engineering sometimes experience crippling disruptions, SREs can learn a few important lessons from the story of Facebook’s outage.

Reconcile reliability with security

Probably the most important takeaway is that SREs need to work closely with security teams.

As we’ve written before, reliability engineering and security engineering sometimes work at cross purposes. The most reliable system is not necessarily the most secure system, and vice versa.

Would Facebook’s engineers have been able to restore service a bit faster if rigid access controls hadn’t gotten in their way? It’s hard to say without more insight into exactly what happened, but we’re guessing the answer is yes. Security should be a priority, but so should reliability, and the two need to be reconciled with each other to the fullest extent possible.

Redundant infrastructure can have single points of failure

Like any good Web-scale company, Facebook spreads its workloads across multiple data centers. This tactic is part of SRE 101: Distributed infrastructure provides reliability advantages because if one data center fails, others should remain available.

Of course, that wasn’t the case in this outage. The backbone network that connected data centers proved to be a single point of failure that brought down all data centers at once, rendering the infrastructure redundancy meaningless.

It’s hard to fault Facebook here. You can’t exactly have multiple backbone networks operating at once. Still, the takeaway for SREs is that it’s important not to place blind faith in redundant data centers (or availability zones, or clusters, or whatever). Distributed infrastructures help to maximize reliability, but they don’t fully guarantee it. The outage is also a lesson in the risk of depending on your own internal incident management tools, and in planning for what to do when those go down as well.
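One simple way to look for this kind of hidden single point of failure is to list what each redundant site depends on and check which dependencies they all share. The sketch below assumes you can produce such a dependency map; the function name `shared_dependencies` and the site and dependency names are hypothetical.

```python
def shared_dependencies(site_deps):
    """Return dependencies that every site relies on (candidate single points of failure).

    site_deps: dict mapping site name -> iterable of dependency names.
    """
    if not site_deps:
        return set()
    return set.intersection(*(set(deps) for deps in site_deps.values()))


# Assumed example map, for illustration only.
deps = {
    "dc1": {"backbone", "power-a"},
    "dc2": {"backbone", "power-b"},
    "dc3": {"backbone", "power-c", "dns"},
}
print(shared_dependencies(deps))  # {'backbone'}
```

Anything this check returns is not necessarily avoidable, but it is a dependency whose failure the redundancy will not protect against.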

Transparency is key

One thing that Facebook deserves a lot of credit for is how transparent it was about this incident. Within days of the outage, it released information that explained, in a reasonable amount of technical detail, what went wrong, how Facebook fixed it and how it will work to prevent similar problems from recurring.

Facebook was under no obligation to do any of this. It could have just said “we had a data center problem and we fixed it; please move on to the next news cycle.” But by explaining in some detail what happened, the company instilled a greater sense of confidence among users in its ability to prevent similar issues.

It also got ahead of the speculation that the outage resulted from a massive cyber attack or ransomware incident — which inevitably would have been rumored, had Facebook not acted so quickly and transparently to explain the issue.