This is according to a Facebook post by Santosh Janardhan, the company’s VP of infrastructure, who wrote: “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”
Janardhan notes that Facebook has a system meant to audit these kinds of commands precisely to avoid incidents like this one, but a bug in the audit tool let the faulty command through. “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”
The result, in Janardhan’s words: “This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.” Things seem to be up and running again, but for several hours the outage left many users unable to access their messages and social networks.