What is the BGP protocol and why did it cause WhatsApp, Facebook and Instagram to disappear from the internet for hours
Yesterday WhatsApp, Facebook and Instagram disappeared from the internet. These services were down for more than six hours, but finally everything returned to normal.
That huge outage was caused by the so-called BGP, or Border Gateway Protocol, one of the systems the internet uses to get traffic to where it is needed as quickly as possible. How does BGP work, and how could the outage be so severe? This is what we explain below.
What is BGP and how does it work
As Cloudflare explains, this protocol is a mechanism for exchanging routing information among the so-called autonomous systems (AS) on the internet.
The internet is a network of networks, and it relies on large routers that maintain huge, constantly updated lists of the possible routes for carrying a data packet from source to destination.
With BGP, a network (such as Facebook’s) can notify other networks that it is there, reachable, on the internet. The problem is that Facebook stopped notifying the other networks and the internet operators: it is as if it had disappeared from those lists and from that “map”.
Each of those individual networks (like Facebook’s or Cloudflare’s) is an autonomous system: a single network with a unified set of internal packet routing rules, identified by its so-called ASN (Autonomous System Number).
Each autonomous system (AS) can originate so-called prefixes, which means it controls a group of IP addresses, and transit prefixes, which means it knows how to reach certain specific groups of IPs. Each AS “announces” these prefix routes through BGP, and that lets other networks know how to communicate with it.
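To make the announce-and-withdraw idea concrete, here is a minimal sketch in Python of the list of routes a single router might keep. It is a toy model, not how real BGP software works: the `RoutingTable` class and its methods are invented for illustration, the prefixes come from the address ranges reserved for documentation, and the AS numbers are made up.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one router has learned from its BGP neighbors."""

    def __init__(self):
        # Maps each announced prefix to the (made-up) AS number that originated it.
        self.routes = {}

    def announce(self, prefix, origin_asn):
        """Record that an AS says it can deliver traffic for this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        """Forget a prefix once its AS stops announcing it."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is off the "map"
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", 64500)      # hypothetical network A
table.announce("198.51.100.0/24", 64501)   # hypothetical network B

print(table.lookup("192.0.2.53"))   # 64500
table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.53"))   # None
```

Once the prefix is withdrawn, the router simply has nowhere to send packets for those addresses, even though the machines behind them may still be running.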
BGPlay vizualization of AS32934 route withdraw between 15:43UTC and 15:54UTC. Not real time ~10x speed. pic.twitter.com/yTqyhks7FD
— GGreg (@GGreg) October 4, 2021
Facebook stopped announcing the routes to the prefixes of its domain name system (DNS) servers at 15:58 UTC. This meant that although other Facebook IP addresses were still routed, they could not be accessed: it did not matter that that part was still active, because the loss of DNS made them inaccessible.
At Cloudflare they monitor BGP updates so that they can act accordingly with their own services, and Facebook normally makes hardly any changes. Nevertheless, at 15:40 UTC they noticed a spike in routing changes, and those were what made the real problem show up on our computers and phones.
That failure caused DNS resolution to fail. As we have explained before, DNS services make sure that when we type, for example, “www.xataka.com” in the browser, it knows that the requests have to go to the machine with the IP address 184.108.40.206.
When Facebook stopped announcing its DNS prefix routes over BGP, DNS resolvers had no way to connect to its name servers: every query ended in an error, and that caused more and more side effects.
Among other things, traffic to sites such as Twitter and to messaging platforms such as Signal and Telegram increased, something Cloudflare also noticed. Twitter even made a little joke about it, tweeting “hello literally everyone”, because many users did indeed go to Twitter in search of answers. Even Facebook used that network to confirm that it had a technical problem and was trying to figure it out.
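The dependency between the two systems can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real resolver: the addresses, the `ZONE` table and the `resolve` function are hypothetical, and the `"SERVFAIL"` string only stands in for the kind of error a resolver reports when it cannot reach a name server.

```python
import ipaddress

# Hypothetical prefixes currently announced over BGP.
announced = {
    ipaddress.ip_network("192.0.2.0/24"),     # the name servers live here
    ipaddress.ip_network("198.51.100.0/24"),  # the web servers live here
}

# Hypothetical authoritative name server and the records it serves.
NAME_SERVER = ipaddress.ip_address("192.0.2.53")
ZONE = {"www.example.com": "198.51.100.10"}


def reachable(address, prefixes):
    """An address is reachable only if some announced prefix covers it."""
    return any(address in net for net in prefixes)


def resolve(name, prefixes):
    """Answer a query only if the name server itself can be reached."""
    if not reachable(NAME_SERVER, prefixes):
        return "SERVFAIL"
    return ZONE.get(name, "NXDOMAIN")


print(resolve("www.example.com", announced))  # 198.51.100.10

# The DNS prefix is withdrawn; the web prefix is still announced.
after_withdrawal = announced - {ipaddress.ip_network("192.0.2.0/24")}
print(resolve("www.example.com", after_withdrawal))  # SERVFAIL
```

In this sketch the web server at 198.51.100.10 is still routed after the withdrawal, but nobody can learn its address by name, which is the situation described above.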
hello literally everyone
— Twitter (@Twitter) October 4, 2021
Fortunately, Facebook managed to restore the situation at 21:20 UTC. According to Cloudflare, its BGP activity became significant again around 21:00 UTC, peaking at 21:17 UTC.
That made it clear that Facebook was re-announcing all of its routing prefixes, and by approximately 21:28 UTC normal access to Facebook, WhatsApp and Instagram had been restored.
What does Facebook say about the problem?
Facebook’s engineers also briefly explained the causes of the problem that affected them. They did so on their Facebook Engineering blog.
There they first apologized for the inconvenience that this problem may have caused users. According to that article, the problem was caused by the following:
“Configuration changes to the backbone routers that coordinate network traffic between our data centers. That disruption to network traffic had a cascading effect on the way our data centers communicate, causing the shutdown of our services”.
There were no further details, and Facebook wanted to make clear that the outage was at no point due to a cyberattack: “at this time we believe that the root cause of the crash was a wrong configuration change”.
Facebook also wanted to make clear that “we have no evidence that user data has been compromised as a result of this service outage.”