It is the morning after.  Approximately 24 hours ago, Facebook, WhatsApp, Instagram, and more, went down.  Why?  How?  From the early indications of what is being said, I don't think the general public will ever be told the whole truth.  This is especially going to be the case if there is a design problem with Facebook's network.  Would they admit it?  Would they openly tell the world they don't know how to design networks?  Doubt it.

Let's start with this blog post from Facebook this morning (https://engineering.fb.com/2021/10/04/networking-traffic/outage/):

[Screenshot, captured 2021-10-05 09:03:01]

Configuration changes.  Ok - hold that thought.  (BTW - you have to giggle at the fact Santosh has a Twitter account)

We have also learned that whatever the real cause, the result of the problem is being well documented: https://blog.cloudflare.com/october-2021-facebook-outage/  But keep in mind - the DNS issues are not the cause of the problem, they are the result.

Right, so the actual problem caused the removal of routes that included Facebook's DNS systems from the global routing tables.  Like having your name removed from the phonebook, no systems could find the numbers for these Facebook-owned systems...all of them!
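To make the phonebook analogy concrete, here is a minimal sketch in Python of a longest-prefix-match lookup before and after a route is withdrawn.  The prefixes and the DNS server address are hypothetical (taken from the documentation address ranges, not Facebook's real addressing), and the `lookup` helper is my own illustration, not how a real router implements its forwarding table.

```python
import ipaddress

def lookup(routing_table, address):
    """Return the most specific prefix covering address, or None if no route exists."""
    addr = ipaddress.ip_address(address)
    matches = [prefix for prefix in routing_table if addr in prefix]
    return max(matches, key=lambda prefix: prefix.prefixlen, default=None)

# Hypothetical prefixes from the documentation ranges, not real Facebook routes.
table = {
    ipaddress.ip_network("192.0.2.0/24"),     # assumed prefix holding the DNS servers
    ipaddress.ip_network("198.51.100.0/24"),  # some unrelated network
}
dns_server = "192.0.2.53"  # hypothetical DNS server address

before = lookup(table, dns_server)

# The route is withdrawn from the table.
table.discard(ipaddress.ip_network("192.0.2.0/24"))
after = lookup(table, dns_server)

print("before withdrawal:", before)
print("after withdrawal: ", after)
```

The DNS server itself can be perfectly healthy in the second lookup.  There is simply no longer a route that tells anyone how to reach it.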

Further, we have seen a number of pundits talking about BGP being the culprit.  BGP is a routing protocol that does what you tell it.  If you tell it the wrong thing, that is not a problem with BGP, but rather the result of a different problem.

How does something like this happen?  To answer this question we have to take an educated guess.  At the risk of being wrong, I am going to take a shot at what really happened.

One possibility is that Facebook was hacked. Frankly, at first, I suspected this as well.  As soon as the DNS reports started coming in, though, I doubted this was the case.  While the DNS has been successfully hacked many times in history, that system has become more resilient and more distributed than ever.  So to take down the entire world would be a monumental hack.  Not saying it couldn't be done, just saying I doubted this possibility.

The next possibility is that someone made a "typo".  What I mean by this is that all network professionals end up having to type into the command line interface of routers and switches - much like you would type commands in Windows CMD.  It is extremely easy to make a typo, an error that tells the device to do something other than what it is supposed to do, and chaos ensues.  This possibility seems reasonable, especially given the Facebook blog post above.  One of the places such an error could be made is in the "network command" under BGP or OSPF (see my article here).  A typo would explain a single command problem; however, there is a problem with my hypothesis.  Affecting all of Facebook's systems with one command line entry should, I would hope, be nearly impossible, unless their design is so bad that Facebook had no redundancy in their addressing design for the DNS systems.  If the design is that bad, they will never admit it.
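As an illustration of how small such a typo can be, here is a rough Python sketch that compares the prefixes someone intended to advertise with what was actually typed.  The `network <address> mask <netmask>` line shape is modeled on the Cisco-style BGP network statement; the prefixes, the typed line, and the `parse_network_statements` helper are all hypothetical and are not taken from any real Facebook configuration.

```python
import ipaddress

def parse_network_statements(config_lines):
    """Collect prefixes from lines shaped like 'network <address> mask <netmask>'."""
    prefixes = set()
    for line in config_lines:
        parts = line.split()
        if len(parts) == 4 and parts[0] == "network" and parts[2] == "mask":
            prefixes.add(ipaddress.ip_network(f"{parts[1]}/{parts[3]}"))
    return prefixes

# Hypothetical intent: advertise the prefix that holds the DNS servers.
intended = {ipaddress.ip_network("192.0.2.0/24")}

# What was actually typed: one digit is off (192.0.3.0 instead of 192.0.2.0).
typed_config = ["network 192.0.3.0 mask 255.255.255.0"]

configured = parse_network_statements(typed_config)
missing = intended - configured      # prefixes that will NOT be advertised
unexpected = configured - intended   # prefixes nobody meant to configure

print("missing:   ", sorted(str(p) for p in missing))
print("unexpected:", sorted(str(p) for p in unexpected))
```

The router accepts the typed line without complaint because it is syntactically valid.  Only a comparison against the intent reveals that the DNS prefix has dropped out.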

In the past 5 to 7 years this command line interface skill set has been dwindling.  More and more, the networking industry has been moving to automation.  Scripts and Python applets have come to dominate the provisioning and maintenance processes: Software Defined Networking.  This automation could be the culprit.  That said, someone has to write the code, or write the script, and so the "typo" cause can rear its ugly head here too.  If the typo was cut and pasted into multiple places in the script or code: boom, you could affect multiple systems with a simple click, or push of the enter key.  I can tell you that no one wants to blame SDN or automation or any of this.  Companies like Cisco and Juniper and even Facebook itself are too invested in these automation and software-driven developments in the industry.  If my hypothesis on automation here is accurate, again I think we will never know the truth.
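Here is a hedged Python sketch of that "boom" scenario: one mistaken change applied by a script to every router at once, plus the kind of pre-flight check that could refuse it.  Everything here is hypothetical (the router names, the prefixes, and the `apply_change` and `safe_push` helpers); it is a toy model of the idea, not a description of Facebook's or any vendor's tooling.

```python
DNS_PREFIX = "192.0.2.0/24"  # hypothetical prefix holding the DNS servers

# Hypothetical fleet: every router advertises the same two prefixes.
fleet = {f"router{i}": {DNS_PREFIX, "198.51.100.0/24"} for i in range(1, 5)}

def apply_change(fleet, change):
    """Return the fleet state after running change on a copy of every router's prefixes."""
    return {name: change(set(prefixes)) for name, prefixes in fleet.items()}

def routers_still_advertising(fleet, prefix):
    """List the routers that still advertise prefix."""
    return [name for name, prefixes in fleet.items() if prefix in prefixes]

def buggy_change(prefixes):
    # Meant to remove a retired prefix, but the wrong prefix was pasted in.
    prefixes.discard(DNS_PREFIX)
    return prefixes

def safe_push(fleet, change, critical_prefix):
    """Dry-run the change and refuse it if nobody would advertise critical_prefix afterwards."""
    candidate = apply_change(fleet, change)
    if not routers_still_advertising(candidate, critical_prefix):
        raise RuntimeError(f"change would withdraw {critical_prefix} everywhere")
    return candidate

# Without a check, one push removes the prefix from the whole fleet.
after_push = apply_change(fleet, buggy_change)
print("routers still advertising:", len(routers_still_advertising(after_push, DNS_PREFIX)))

# With the check, the same change is rejected before it is applied.
try:
    safe_push(fleet, buggy_change, DNS_PREFIX)
except RuntimeError as error:
    print("rejected:", error)
```

The point of the sketch is the blast radius.  The same one-line mistake that would hurt a single router at the command line hits every router the script touches, unless something checks the outcome first.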

Whether it was a single command line, or an automation script/code process, I am going to put my money on this "typo" issue as being the root cause of Facebook's issue this October 4th.  I think Santosh from Facebook is almost admitting it while being as vague as possible.

Like my readers, I am staying tuned to see what, if anything, comes out over the coming days and weeks.
