On Monday, June 24th, a major BGP leak occurred. A large part of the internet was sent down a black hole after a major ISP accepted routes leaked through a network misconfiguration at a downstream ASN of a Noction user. This caused major outages at Cloudflare, Amazon, Facebook, and a number of other companies. The incident caught the attention of many network professionals, drawing them into a heated discussion about BGP vulnerabilities, proper BGP filtering and the use of BGP optimization tools.
As they say, there are three sides to every story: your side, my side, and the truth. We stepped back and tried to look at every facet of the situation, rather than pointing fingers at a particular party or flatly denying any link between us and the leak.
As is always the case in such instances, several factors combined to amplify the impact:
- More specific prefixes were generated in our client’s network.
- These more specific prefixes leaked to a particular downstream ASN and subsequently got announced to a major ISP, which accepted these more specific prefixes.
- Filtering at all 3 ASNs was inadequate at the time in question.
The first one is easily explained. Using more specifics is a common practice among the many networks on the Internet that do traffic shaping. This practice is not limited to those using BGP optimizers. In fact, the use of more specific prefixes is only going to increase, whether or not a network uses any BGP tools.
In this specific case, the more specific prefixes were generated by Noction IRP. The platform optimizes network traffic by adjusting BGP routing in real time. IRP uses the optional more specific prefixes feature to influence traffic flow without modifying the original BGP information. From a technical point of view, breaking prefixes down into more specific ones is only one of the methods IRP uses to make its improvements while maintaining the original table. IRP can also apply improvements by setting a higher Local Preference, without splitting prefixes. The improvements can be marked with a specific community agreed with the customer, and the customer decides which method works best in each individual network.
Besides this, during the deployment phase, when switching IRP into intrusive mode, we always check that prefixes announced by IRP stay within the customer’s network. We do this by making test announcements towards our own ASN and confirming that the improvement is propagated within the client’s network, while also checking that it is not leaked directly upstream, or via a downstream and onto the Internet at large.
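To illustrate why more specific prefixes steer traffic without touching the original route, here is a minimal Python sketch of longest-prefix matching. It is not IRP code: the prefix is a documentation range, and the next-hop names `provider-A` and `provider-B` and both helper functions are hypothetical.

```python
import ipaddress


def split_into_more_specifics(prefix):
    """Split a prefix into its two halves, each one bit longer."""
    return list(ipaddress.ip_network(prefix).subnets(prefixlen_diff=1))


def best_route(table, destination):
    """Longest-prefix match: pick the matching entry with the longest mask."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, table[best]


# The original /24 stays in the table, pointing at a hypothetical provider.
original = ipaddress.ip_network("198.51.100.0/24")
table = {original: "provider-A"}

# The two /25 halves are added alongside it, pointing at another provider.
for more_specific in split_into_more_specifics("198.51.100.0/24"):
    table[more_specific] = "provider-B"

net, hop = best_route(table, "198.51.100.42")
print(f"{net} via {hop}")  # prints: 198.51.100.0/25 via provider-B
```

The same property is what makes a leak harmful: any router that accepts the more specifics will prefer them over the legitimate, shorter prefix.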
Noction provides support 24/7 and we are always ready to assist clients with network changes. People can make mistakes when either creating or maintaining peers and miss some filtering.
Normally, ISPs have filters to make sure they only accept the correct prefixes from their customers. This was not the case this time. Multiple concurrent failures to filter routes, or a complete lack of BGP filtering, were the major cause of the incident and its prolonged duration. We are surprised that this could happen at a large US-based ISP.
Unfortunately, BGP is not perfect. Almost 2,300 leaks or hijacks happened over the past 7 months. Poor use of filters at the Tier 1, Tier 2 and Tier 3 levels is linked to all of them.
Although IRP was one link in a rather long chain in one of them, we are taking the situation seriously. Nobody found this leak pleasant.
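The kind of customer-facing filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any ISP's actual policy: the allow-list `CUSTOMER_ALLOWED`, its maximum prefix lengths and the function `accept_from_customer` are all hypothetical, and the prefixes are documentation ranges.

```python
import ipaddress

# Hypothetical allow-list for one customer:
# (registered prefix, longest prefix length the ISP will accept inside it).
CUSTOMER_ALLOWED = [
    (ipaddress.ip_network("203.0.113.0/24"), 24),
    (ipaddress.ip_network("2001:db8::/32"), 48),
]


def accept_from_customer(announced, allowed=CUSTOMER_ALLOWED):
    """Accept an announcement only if it falls inside a registered prefix
    and is not more specific than the agreed maximum length."""
    net = ipaddress.ip_network(announced)
    for registered, max_len in allowed:
        if net.version != registered.version:
            continue  # subnet_of() raises TypeError on mixed IP versions
        if net.subnet_of(registered) and net.prefixlen <= max_len:
            return True
    return False


print(accept_from_customer("203.0.113.0/24"))   # the customer's own prefix
print(accept_from_customer("203.0.113.0/25"))   # too specific
print(accept_from_customer("198.51.100.0/25"))  # someone else's address space
```

With a filter of this shape on the session, more specifics of third-party prefixes would be rejected at the first hop, regardless of how they were generated.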
We routinely discuss filters with a client during deployment. In many cases, where appropriate, we will also recommend the use of NO_EXPORT within the IRP configuration. Much has been made of NO_EXPORT, but the commentary is largely misleading. There are scenarios where its use works well and there are scenarios where it doesn’t.
NO_EXPORT is not a good option for companies operating multiple ASNs, be it multiple public or a combination of private and public.
NO_EXPORT can be, and routinely is, lost between routers connected via iBGP and eBGP. There aren’t many single-router networks in operation.
NO_EXPORT is also not suitable when a client decides not to use more specific prefixes, as NO_EXPORT will then stop the original routes from propagating downstream. The end result is a downstream that expects a full routing table but has tens of thousands of destinations unavailable to it.
Even in the minority of scenarios where NO_EXPORT is suitable and can realistically be maintained, the use of filters is still compulsory, and filters are something every network administrator is familiar with.
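A small Python model can show why filters remain necessary even where NO_EXPORT is in use. It is a simplified sketch, not router or IRP behaviour: the `Route` class, `may_export_to_ebgp` and the `OWN_PREFIXES` set are hypothetical, and only the NO_EXPORT community value (65535:65281, from RFC 1997) is taken from the standard.

```python
import ipaddress
from dataclasses import dataclass

# Well-known NO_EXPORT community from RFC 1997 (0xFFFFFF01).
NO_EXPORT = (65535, 65281)

# Hypothetical set of prefixes this network is entitled to announce.
OWN_PREFIXES = {ipaddress.ip_network("203.0.113.0/24")}


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: frozenset = frozenset()


def may_export_to_ebgp(route, own_prefixes=OWN_PREFIXES):
    """Decide whether a route may be announced to an external peer."""
    # First guard: honour NO_EXPORT, if the community is still attached.
    if NO_EXPORT in route.communities:
        return False
    # Second guard: explicit outbound filter, exact own prefixes only.
    return ipaddress.ip_network(route.prefix) in own_prefixes


tagged = Route("198.51.100.0/25", frozenset({NO_EXPORT}))
stripped = Route("198.51.100.0/25")  # community lost on an internal hop
own = Route("203.0.113.0/24")

print(may_export_to_ebgp(tagged))    # blocked by the community
print(may_export_to_ebgp(stripped))  # blocked only by the prefix filter
print(may_export_to_ebgp(own))       # allowed
```

In this model the `stripped` route is stopped by the outbound filter alone, which is the case the community cannot cover once it has been lost between routers.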
While we disagree with certain accusations and the way the incident has been presented to the general public, the scale of the incident is not lost on us. We will investigate what more we can do with current deployments and future setups to reduce the chance of such incidents.