It is the morning after. Approximately 24 hours ago, Facebook, WhatsApp, Instagram, and more went down. Why? How? From the early indications, I don't think the general public will ever be told the whole truth. This is especially going to be the case if there is a design problem with Facebook's network. Would they admit it? Would they openly tell the world they don't know how to design networks? I doubt it.
Let's start with this blog post from Facebook this morning:
Configuration changes. Ok - hold that thought. (BTW - you have to giggle at the fact Santosh has a Twitter account)
We have also learned that whatever the real cause, the result of the problem is being well documented:
Right, so whatever the actual problem was, its result was that the routes covering Facebook's DNS systems were withdrawn from the global routing tables. Like having your name removed from the phone book, no system could find the addresses for these Facebook-owned services... all of them!
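To make that phone book analogy concrete, here is a toy sketch (not Facebook's actual systems; the prefixes, server names, and addresses are purely illustrative) of why withdrawing the routes to a domain's authoritative DNS servers makes every name under that domain fail, even though the web servers themselves may be perfectly healthy:

```python
# Toy model: a resolver can only query a DNS server if a route to it exists.
routing_table = {"129.134.30.0/24"}            # prefix covering the DNS servers (illustrative)
dns_servers = {"a.ns.example": "129.134.30.12"}
zone = {"www.example.com": "157.240.1.35",
        "api.example.com": "157.240.1.36"}

def reachable(ip, table):
    """Crude stand-in for a longest-prefix routing lookup (exact /24 match)."""
    return any(ip.rsplit(".", 1)[0] == pfx.split("/")[0].rsplit(".", 1)[0]
               for pfx in table)

def resolve(name):
    for server_ip in dns_servers.values():
        if reachable(server_ip, routing_table):
            return zone.get(name)
    return None  # no DNS server reachable: the name is "out of the phone book"

assert resolve("www.example.com") == "157.240.1.35"
routing_table.clear()                          # BGP withdraws the routes...
assert resolve("www.example.com") is None      # ...and every lookup now fails
assert resolve("api.example.com") is None
```

Note that nothing in the zone data changed; the records still exist, but with no route to the servers that hold them, every single lookup fails.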
Further, we have seen a number of pundits pointing to BGP as the culprit. BGP is a routing protocol that does exactly what you tell it to do; if you tell it the wrong thing, that is not a problem with BGP, but rather the result of a different problem.
How does something like this happen? To answer this question we have to take an educated guess. At the risk of being wrong, I am going to take a shot at what really happened.
One possibility is that Facebook was hacked. Frankly, at first, I suspected this as well. As soon as the DNS reports started coming in, though, I doubted it. While the DNS has been successfully attacked many times in history, that system has become more resilient and more distributed than ever, so taking it down for the entire world would be a monumental hack. I'm not saying it couldn't be done, just that I doubted this possibility.
The next possibility is that someone made a "typo". What I mean by this is that all network professionals end up typing into the command line interface of routers and switches, much as you would type commands into the Windows CMD prompt, and it is extremely easy to make a typo: an error that tells the device to do something other than what it is supposed to do, and chaos ensues. This possibility seems reasonable, especially given the Facebook blog post above. One of the places such an error could be made is in the "network" command under BGP or OSPF (see my article here). A typo would explain a single-command problem; however, there is a hole in my hypothesis. Affecting all of Facebook's systems with one command line entry should, I would hope, be nearly impossible, unless their design is actually so bad that Facebook had no redundancy in its addressing design for the DNS systems. If the design is that bad, they will never admit it.
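To see how small such a typo can be, here is a hypothetical Cisco-IOS-style snippet. The AS number is Facebook's public ASN (32934), but the prefix and the configuration itself are purely illustrative, not Facebook's actual config:

```
router bgp 32934
 ! Intended: advertise the prefix that holds the authoritative DNS servers
 network 129.134.30.0 mask 255.255.255.0
 !
 ! One slip of the keyboard, prefixing the line with "no",
 ! withdraws the advertisement instead of creating it:
 no network 129.134.30.0 mask 255.255.255.0
```

Two extra characters and the prefix disappears from the global routing tables. A fat-fingered octet in the prefix or mask would be just as quiet and just as damaging.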
In the past 5 to 7 years, this command line skill set has been dwindling as the networking industry has moved more and more toward automation. Scripts and Python applets now dominate the provisioning and maintenance processes: Software Defined Networking. This automation could be the culprit. That said, someone has to write the code or the script, and so the "typo" cause can raise its ugly head here too. If the typo was cut and pasted into multiple places in the script or code: boom, you could affect multiple systems with a single click or push of the Enter key. I can tell you that no one wants to blame SDN or automation for any of this. Companies like Cisco and Juniper, and even Facebook itself, are too invested in these automation and software-driven developments in the industry. If my hypothesis about automation is accurate, again I think we will never know the truth.
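Here is a minimal sketch of how that fan-out works. The template, device names, and prefixes are all hypothetical; the point is only that one bad line in a template becomes many bad commands when automation expands it across the fleet:

```python
# Hypothetical automation sketch: one typo in the template is multiplied
# across every router and every prefix in a single push.
TEMPLATE = "no network {prefix}"   # the typo: should have been "network {prefix}"

DNS_PREFIXES = ["129.134.30.0/24", "129.134.31.0/24"]   # illustrative
BACKBONE_ROUTERS = ["edge-1", "edge-2", "edge-3"]        # illustrative

def render_jobs(template, prefixes, routers):
    """Expand one template into a (router, command) job for every combination."""
    return [(r, template.format(prefix=p)) for r in routers for p in prefixes]

jobs = render_jobs(TEMPLATE, DNS_PREFIXES, BACKBONE_ROUTERS)

# One bad template line has become six bad commands, all pushed in one run:
for router, command in jobs:
    print(f"{router}: {command}")
```

Nobody typed six mistakes; they typed one, and the tooling faithfully delivered it everywhere at once. That is the double-edged sword of automation: the same leverage that makes provisioning fast makes a single error fleet-wide.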
Whether it was a single command line entry or an automation script/code process, I am going to put my money on this "typo" issue as the root cause of Facebook's outage this October 4th. I think Santosh from Facebook is all but admitting it while being as vague as possible.
Like my readers, I am staying tuned to see what, if anything, comes out over the coming days and weeks.