The service outage at Facebook on October the 4th, 2021, which also took Instagram and WhatsApp offline, affected a very large number of people. Any outage of services with billions of users is bound to have an outsized effect. Beyond ordinary users, it also hit small businesses that rely on Facebook and Instagram as a sales pipeline and as a platform for targeted advertising. Even an outage of a few hours can significantly affect many of these businesses. The Facebook group of companies itself also lost advertising revenue, estimated to be in the region of $100 million for an outage of that length.
You don’t have to be a service the size of Facebook for an outage lasting a few hours to impact your business. So what happened to Facebook, and can you take steps to prevent something similar from happening to your online service presence?
What Caused the Facebook Outage?
We won’t do a deep dive into the causes of the Facebook outage in this article. There are plenty of those available already, including the post “More details about the October the 4th outage” by Facebook’s Santosh Janardhan.
The outage resulted from a mistake during scheduled maintenance on Facebook’s own global backbone network. System admins issued a command intended to test the capacity of the backbone. Instead, the command took down all the connections in the backbone network, and as a result, all the Facebook-hosted services became unreachable from the Internet. The services themselves were still running, but the Facebook routers stopped advertising routes for them through the Border Gateway Protocol (BGP). That included the routes to Facebook’s own DNS servers, so DNS resolvers elsewhere on the Internet could no longer reach those servers to look up the address of any Facebook service. Cloudflare’s blog post “Understanding How Facebook Disappeared from the Internet” has additional technical details.
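To make that chain of events concrete, here is a minimal sketch in Python of how withdrawing routes makes name resolution fail even though every server is still running. It is a toy model, not a description of Facebook's systems: the prefixes are documentation ranges, and the names `advertised`, `DNS_SERVER`, `ZONE`, `reachable`, and `resolve` are all hypothetical.

```python
import ipaddress

# Hypothetical prefixes advertised via BGP. These are documentation ranges,
# not real Facebook address space.
advertised = {
    ipaddress.ip_network("192.0.2.0/24"),     # holds the authoritative DNS server
    ipaddress.ip_network("198.51.100.0/24"),  # holds the application servers
}

DNS_SERVER = ipaddress.ip_address("192.0.2.53")
ZONE = {"app.example.com": ipaddress.ip_address("198.51.100.10")}


def reachable(address, routes):
    """An address is reachable only if some advertised prefix covers it."""
    return any(address in network for network in routes)


def resolve(name, routes):
    """Return the address for name, or None if the DNS server cannot be reached."""
    if not reachable(DNS_SERVER, routes):
        return None  # the server is running, but nobody can get to it
    return ZONE.get(name)


before = resolve("app.example.com", advertised)
after = resolve("app.example.com", set())  # every route withdrawn
print(before, after)
```

In this model the zone data never changes. Only the set of advertised routes does, which is enough to make every lookup fail.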
Preventing Similar Outages on Your Network
Many organizations operate networks with multiple locations connected over the Internet. These networks may not be global or anywhere near the scale of Facebook’s, but many small and medium-sized businesses run multi-location regional or country-specific networks. Losing access to these locations and online services is just as devastating for them as it is for a global entity like Facebook. Thankfully, it is possible to design and implement multi-location networks so that errors (including human errors) do not spread and take out all services. This is true for many types of issues, not just the combined BGP and DNS failure that Facebook suffered. Resilient network design and implementation guards against a wide range of problems.
How to Design a Network for Resilience
At a high level, the best approach to designing a resilient network is to split it into multiple Autonomous System (AS) networks. An AS is a set of Internet-routable IP prefixes owned and operated by a single entity. In most cases, this will be the organization or business designing the network. Each AS network uses a common routing policy controlled by its owner.
You could think of each AS on a network as a separate data center (even if it’s not in a physical data center). With the network split into logical AS instances, any routing change in a single AS will remain within that instance and not automatically propagate out to the others. Therefore, if a system admin makes a mistake in a configuration change, it will not spread to other AS instances.
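The containment idea can be illustrated with a small Python sketch. It assumes three hypothetical sites, each modeled as its own AS with a private-use AS number and a documentation prefix. The `Site` class and its methods are invented for illustration and do not correspond to any router or vendor API.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One location run as its own AS, with its own routing policy."""
    asn: int
    advertised: set = field(default_factory=set)

    def apply_routing_change(self, prefixes):
        # A change replaces only this site's advertisements.
        self.advertised = set(prefixes)

    def is_reachable(self):
        return len(self.advertised) > 0


# Hypothetical sites using private-use AS numbers and documentation prefixes.
sites = {
    "dublin": Site(64512, {"192.0.2.0/24"}),
    "new-york": Site(64513, {"198.51.100.0/24"}),
    "singapore": Site(64514, {"203.0.113.0/24"}),
}

# An admin mistakenly withdraws every route at one site.
sites["dublin"].apply_routing_change([])

still_reachable = sorted(name for name, site in sites.items() if site.is_reachable())
print(still_reachable)
```

Because each site holds its own routing state, the mistaken change at one site leaves the other two advertising their prefixes as before.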
Each AS is managed as a separate entity on the network, but the AS instances also need to communicate with each other so that users can reach services from wherever they connect. Kemp LoadMaster Global Server Load Balancing (GSLB) delivers this AS interconnectivity. GSLB routing uses defined policies that determine what can propagate between AS instances and what can’t. If a system admin working in one AS instance disrupts a service like BGP, GSLB rules will prevent the incorrect configuration from spreading to the others, and only the single instance with the wrong configuration will be impacted. Of course, any services running in that AS will be offline, but GSLB will route requests for those services to application instances in another AS, without users having to do anything special or even being aware of an outage.
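The failover behavior can be sketched in a few lines of Python. This is only an illustration of the general idea of health-aware site selection, not Kemp LoadMaster's configuration or API: the `SITES` list, the `health` dictionary, and the `pick_site` function are all assumptions made for the example.

```python
# Hypothetical sites, each running an instance of the same application.
SITES = [
    {"name": "dublin", "address": "192.0.2.10", "region": "eu"},
    {"name": "new-york", "address": "198.51.100.10", "region": "us"},
]


def pick_site(client_region, health):
    """Return the address of a healthy site, preferring the client's region.

    health maps site name to True/False, as a health check might report it.
    Returns None when no site is healthy.
    """
    healthy = [site for site in SITES if health.get(site["name"], False)]
    if not healthy:
        return None
    for site in healthy:
        if site["region"] == client_region:
            return site["address"]
    # No healthy site in the client's region: fall back to any healthy site.
    return healthy[0]["address"]


normal = pick_site("eu", {"dublin": True, "new-york": True})
failover = pick_site("eu", {"dublin": False, "new-york": True})
print(normal, failover)
```

A European client is normally sent to the Dublin instance. When Dublin fails its health check, the same request is answered with the New York address instead, and the client does nothing differently.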
This five-minute Kemp Light-board talk video provides a high-level overview of how to use GSLB to deliver resilience on a network that has multiple locations. For each reference to a data center in this video, you can substitute a location-based AS. Of course, if you have numerous data centers, this video and topic will also apply!
System admins will always make mistakes that lead to outages in particular parts of the network. Eliminating those mistakes is not possible, but it is possible to design the network so that their impact is localized and easier to fix without a network-wide loss of service. I’m sure Facebook is in the process of designing network changes to make sure they don’t suffer another outage like the last one. I expect that something very much like AS segmentation of the network and GSLB will be part of their solution. This approach can also ensure that your network has the resilience modern organizations need.