Is BGP multi-homing enough for WAN network performance?

BGP multihoming has become as necessary for networks connected to the Internet as redundant power sources or multiple data centers. No business can afford prolonged outages, and the first, and surprisingly most effective, way to maximize uptime is a robust BGP implementation with multiple transit providers. BGP relies on Autonomous System numbers (ASNs), which are assigned to individual networks or relatively large network segments. An “AS hop” is a transition from one AS to another. BGP assumes that, for any given destination, the route with the fewest AS hops is preferable. In addition to this dynamic information, BGP allows administrators to define static path preferences using weights, local preferences, MEDs, etc.
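
To make the selection logic concrete, here is a minimal sketch in Python of a heavily simplified best-path comparison. It assumes only three attributes (local preference, AS path length, MED) and ignores the rest of the real decision process, such as vendor-specific weight, origin type, eBGP versus iBGP, and router ID tie-breakers. It also compares MED across all routes, which real routers do not do by default. The `Route` class, the provider labels, and the AS numbers (taken from the range reserved for documentation) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str          # hypothetical label for the transit provider
    as_path: tuple         # AS numbers traversed toward the destination
    local_pref: int = 100  # static preference, higher wins
    med: int = 0           # multi-exit discriminator, lower wins


def best_path(routes):
    """Pick a route using a simplified subset of BGP's tie-breakers."""
    # Highest local preference first, then fewest AS hops, then lowest MED.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


# With default preferences, the shorter AS path wins.
routes = [
    Route("transit_a", as_path=(64500, 64501, 64510)),
    Route("transit_b", as_path=(64502, 64510)),
]
print(best_path(routes).provider)

# A static local preference overrides AS path length.
preferred = [
    Route("transit_a", as_path=(64500, 64501, 64510), local_pref=200),
    Route("transit_b", as_path=(64502, 64510)),
]
print(best_path(preferred).provider)
```

Note that nothing in the sort key describes how the path actually performs; every input is either a hop count or a value an administrator typed in.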

BGP routing information is largely based on AS hops and manually configured static preferences. The protocol has no capability to discover any other performance characteristics. As a result, metrics such as packet loss, latency, throughput, link capacity and congestion, historical reliability, and other business characteristics are not addressed by it, and it cannot make routing decisions based on them. Routers relying on BGP alone therefore cannot make dynamic, performance-optimized decisions.
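
The gap can be illustrated with a small Python sketch that contrasts a hop-count choice with a choice based on measured performance. The measurements are made-up numbers, and the scoring formula (latency plus a fixed penalty per percent of loss) is a hypothetical stand-in for whatever metric an operator actually cares about; it is not part of BGP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathSample:
    provider: str      # hypothetical transit provider label
    as_hops: int       # AS path length as seen by BGP
    latency_ms: float  # illustrative measured round-trip time
    loss_pct: float    # illustrative measured packet loss


def bgp_choice(samples):
    """What hop count alone would pick: the fewest AS hops."""
    return min(samples, key=lambda s: s.as_hops)


def performance_choice(samples, loss_weight=50.0):
    """Pick by a hypothetical score: latency plus a penalty per % of loss."""
    return min(samples, key=lambda s: s.latency_ms + loss_weight * s.loss_pct)


samples = [
    PathSample("transit_a", as_hops=2, latency_ms=180.0, loss_pct=3.0),
    PathSample("transit_b", as_hops=3, latency_ms=95.0, loss_pct=0.1),
]

print("BGP picks:        ", bgp_choice(samples).provider)
print("Performance picks:", performance_choice(samples).provider)
```

In this example the two choices disagree: the path with fewer AS hops is also the slower and lossier one, and hop count gives no way to notice.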

Settlement-free peering and best-effort traffic delivery are vital for the efficiency and relatively low cost of operating and connecting to the Internet. Best-effort delivery, however, has a flaw: congestion. Congestion occurs because of port oversubscription at some transit providers, DDoS attacks, daily traffic peaks, and even congested public traffic exchanges. Other problems can be caused by BGP’s inherent trust between peering partners. This implied trust means that all route updates are considered valid and are treated as such. However, due to convergence delay, misconfiguration, external protocol interaction, and many other reasons, not all updates are valid. In the worst cases, invalid updates can lead to routing loops or blackouts. Blackouts happen during an outage in a transit provider’s network when the upstream provider still announces the routes to its customers, causing them to send traffic into a black hole. If the blackout is total, network engineers will notice it and shut down the BGP session. A partial routing blackout is hard to diagnose and troubleshoot because of routing asymmetry on the Internet.

Since BGP is focused on reachability and its own stability, when problems occur traffic is rerouted only in response to hard failures. Hard failures are total losses of reachability, as opposed to degradation. This means that even though service may be so degraded that it is unusable for an end user, BGP will continue to treat a degraded route as valid until the route is invalidated by a total lack of reachability. As a dynamic routing protocol, BGP is unfortunately reactive only in cases of total failure.
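
The difference between a hard failure and degradation can be shown in a few lines of Python. This is a conceptual sketch, not a model of real BGP session handling: the observations are invented, and the 2% loss threshold for "usable" is an arbitrary assumption chosen for illustration.

```python
def bgp_keeps_route(reachable: bool) -> bool:
    """Reachability is the only signal: the route stays until it is gone."""
    return reachable


def usable_for_users(reachable: bool, loss_pct: float,
                     max_loss_pct: float = 2.0) -> bool:
    """Hypothetical user-facing test: reachable and loss under a threshold."""
    return reachable and loss_pct <= max_loss_pct


# (label, reachable, packet loss in percent); illustrative values only
observations = [
    ("healthy", True, 0.1),
    ("degraded", True, 35.0),
    ("down", False, 100.0),
]

for label, reachable, loss in observations:
    print(f"{label:9} kept_by_bgp={bgp_keeps_route(reachable)} "
          f"usable={usable_for_users(reachable, loss)}")
```

The "degraded" row is the problem case: the route is still kept even though it fails the usability test, and traffic moves only once the path reaches the "down" state.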

Multihoming avoids downtime by providing redundancy; however, it does not address performance and congestion-related problems that occur in the “middle mile” linking backbone networks. Therefore, simple BGP multihoming is not enough.