BGP and equal-cost multipath (ECMP)

    BGP and parallel links

    One given in the network world is that bandwidth demand will increase. A Gigabit Ethernet link that provided bandwidth to spare a year ago may be insufficient today. The simple solution is to upgrade to a faster link. In the Ethernet world this would mean a jump from 1 Gbps to 10 Gbps, an upgrade that can be rather costly. But wait, our existing Gigabit Ethernet router has a second GE port that isn’t being used. So if we can use two GE links in parallel, that saves us from upgrading to 10GE right away.

    It’s very common to use parallel links to increase bandwidth. This mechanism is often called equal-cost multipath (ECMP). ECMP often works well, but there are a few caveats. Before we get to the issue of running BGP over parallel links, it’s important to look at how traffic is split over multiple parallel links.

    The simplest way to do this is to transmit packet 1 over link A, packet 2 over link B, packet 3 over link A again, packet 4 over link B, and so on. This is per-packet load balancing. The problem with it is that packets differ in size. Suppose packets 1, 3, and 4 are 1500-byte data packets that belong to the same TCP session, while packet 2 is a small TCP ACK packet. On link A, packet 3 has to wait for packet 1 to be transmitted. On link B, packet 4 doesn’t have to wait nearly as long behind the much shorter packet 2, so packet 4 ends up being transmitted before packet 3.
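
    The reordering described above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: all four packets are already queued at time 0, both links run at 1 Gbps, and the ACK is assumed to be 64 bytes. The function name departure_order is made up for this illustration.

```python
LINK_BPS = 1_000_000_000  # assumed link speed: 1 Gbps per link


def departure_order(packet_sizes, num_links=2, link_bps=LINK_BPS):
    """Return packet numbers (1-based) in the order they finish transmitting.

    Packets are assigned to links round-robin (per-packet load balancing).
    Assumes every packet is queued at time 0 and each link sends its
    packets back to back.
    """
    link_free_at = [0.0] * num_links  # time at which each link becomes idle
    finished = []
    for index, size_bytes in enumerate(packet_sizes):
        link = index % num_links  # packet 1 -> link A, packet 2 -> link B, ...
        done = link_free_at[link] + size_bytes * 8 / link_bps
        link_free_at[link] = done
        finished.append((done, index + 1))
    return [number for _, number in sorted(finished)]


# Packets 1, 3 and 4 are 1500-byte data packets; packet 2 is a small ACK.
order = departure_order([1500, 64, 1500, 1500])
print(order)  # [2, 1, 4, 3]: packet 4 leaves before packet 3
```

    With four equal-sized packets the same function returns them in their original order, which shows that the reordering comes from the size mismatch and not from the round-robin scheme as such.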

    So per-packet load balancing causes reordering. In theory, that’s fine, as TCP will simply buffer the reordered packets and deliver the data inside to the receiving application in the right order. However, the receiver responds to out-of-order packets with duplicate ACKs, which the sender interprets as a sign of packet loss, so it retransmits packets and slows down its transmission rate. To avoid such issues, routers and switches work hard to make sure that all packets that belong to the same TCP session (or UDP flow) are transmitted over the same link. Of course, this in turn has the downside that a single TCP session can only make use of one link; thus ECMP only adds usable bandwidth if the traffic consists of multiple TCP sessions.

    Routers and switches usually aren’t in a position to keep track of individual TCP sessions so their packets can be transmitted over the same link. Instead, they look at a few fields in the IP/TCP/UDP/ICMP header and group packets together based on the contents of those fields. Some switches can’t even look inside the IP header, and thus perform load balancing based on the Ethernet MAC addresses. That doesn’t work so well: every frame exchanged between two routers carries the same pair of MAC addresses, so all the traffic between those routers is transmitted over the same link.

    A more granular way to perform load balancing is based on the 3-tuple: the protocol number in the IP header (i.e., TCP, UDP, ICMP, …) and the IP source and destination addresses. This works better than just using the MAC addresses to determine which packets should be transmitted over the same link. The best way to perform ECMP is using the 5-tuple: the protocol number, the IP addresses, and the TCP or UDP source and destination port numbers. Routers and switches implementing ECMP calculate a hash function over these fields and then use (part of) the resulting hash value to select the link to transmit the packet over. (See RFC 2992.) As the fields in the 5-tuple are the same for all packets belonging to the same session and thus the hash is the same, all packets belonging to the same session end up using the same link. This works well, but in practice, it can still take as many as a thousand TCP sessions before all the links are utilized equally.
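
    To make the hashing step concrete, here is a minimal sketch of hash-based link selection. It is not how any particular router does it: real devices use their own, usually much cheaper, hash functions, and SHA-256 is only a stand-in here. The helper names pick_link and five_tuple and all addresses and ports are assumptions made for this example.

```python
import hashlib
from collections import Counter


def pick_link(key_fields, num_links):
    """Hash the chosen header fields and map the result onto a link index."""
    key = "|".join(str(field) for field in key_fields).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for a vendor-specific hash
    return int.from_bytes(digest[:4], "big") % num_links


def five_tuple(proto, src_ip, dst_ip, src_port, dst_port):
    return (proto, src_ip, dst_ip, src_port, dst_port)


# Every packet of one session carries the same 5-tuple, so it maps to the same link.
session = five_tuple(6, "192.0.2.10", "198.51.100.20", 49152, 443)
assert pick_link(session, 2) == pick_link(session, 2)

# 1000 hypothetical TCP sessions between the same two hosts,
# differing only in their source port.
sessions = [
    five_tuple(6, "192.0.2.10", "198.51.100.20", 49152 + i, 443)
    for i in range(1000)
]

by_5_tuple = Counter(pick_link(s, 2) for s in sessions)
by_3_tuple = Counter(pick_link(s[:3], 2) for s in sessions)  # protocol + IP addresses only

print("5-tuple:", dict(by_5_tuple))  # sessions spread over both links
print("3-tuple:", dict(by_3_tuple))  # every session lands on the same link
```

    Because the 3-tuple is identical for all sessions between the same pair of hosts, hashing on it puts them all on one link, while the 5-tuple lets the differing port numbers spread them out.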

    However, we have been getting ahead of ourselves. Before the ECMP algorithm can distribute packets over parallel links, routing protocols such as BGP must first be convinced to use multiple links in parallel. There are three ways BGP and ECMP can work together:

    1. Bundling the links at the Ethernet level, using IEEE 802.3ad or EtherChannel
    2. Running one BGP session over multiple links, using loopback addresses
    3. Running a separate BGP session over each of the parallel links

    EtherChannel is a proprietary Cisco mechanism that allows multiple Ethernet ports to be used as if they were a single, higher-bandwidth port. It has been available since the 1990s. In 2000, the IEEE standardized a very similar mechanism under the catchy name 802.3ad. 802.3ad comes with the Link Aggregation Control Protocol (LACP), which negotiates the bundling of Ethernet ports. By forgoing LACP and grouping the ports statically, it’s usually possible to make different implementations, such as 802.3ad, EtherChannel, and link bundling mechanisms from other vendors, work together.

    As far as the IP layer is concerned, the bundled ports are a single interface with a single IP address. So BGP can simply be configured as usual and the traffic is distributed over the ports using the ECMP algorithm.

    It’s not always possible to use 802.3ad or EtherChannel or a similar protocol, for instance because the ports aren’t Ethernet ports, or because of limitations on which ports can be grouped together. An alternative is to configure each port with its own IP subnet, but then still run a single BGP session over the collection of ports. That works as follows:

    ! Each GE link gets its own /30 subnet
    interface GigabitEthernet0/1
     ip address 10.0.1.1 255.255.255.252
    !
    interface GigabitEthernet0/2
     ip address 10.0.2.1 255.255.255.252
    !
    ! Local loopback address, used as the source of the BGP session
    interface loopback0
     ip address 192.168.0.1 255.255.255.255
    !
    ! Reach the neighbor's loopback address over both links
    ip route 172.16.31.2 255.255.255.255 10.0.1.2
    ip route 172.16.31.2 255.255.255.255 10.0.2.2
    !
    router bgp 123
     neighbor 172.16.31.2 remote-as 456
     neighbor 172.16.31.2 update-source loopback0
     neighbor 172.16.31.2 ebgp-multihop 2
    !

    In this example, the two GE interfaces have subnets 10.0.1.0/30 and 10.0.2.0/30, respectively. The local router also has IP address 192.168.0.1/32 configured on its loopback interface. (Remember, on routers a loopback interface doesn’t have address 127.0.0.1 for local communication, but rather, a “real” address that remains reachable even as physical interfaces go down and come back up. Loopback addresses are often used for management and for iBGP sessions.)

    Address 172.16.31.2 is the loopback address of the router at the other end of both Gigabit Ethernet links. Static routes route 172.16.31.2 over both GE ports. We can then set up a BGP session towards 172.16.31.2. To make sure the BGP updates use the loopback address on our end, we configure update-source loopback0. We also need ebgp-multihop 2. By default, an eBGP neighbor is expected to be directly connected, and because the neighbor’s loopback address isn’t on a directly connected subnet, the extra level of indirection may lead the router to think there’s an extra router on the path. Normally this isn’t allowed, but with ebgp-multihop 2 in effect, this won’t cause any problems.

    With these settings, the BGP session will come up and prefixes learned over the session will have as their next hop address 172.16.31.2. This address points to both of the GE ports, so packets will be load balanced over both ports using ECMP.
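
    The way the BGP next hop ends up pointing at both ports can be sketched as a recursive lookup. The code below is a simplified model and not router code: it uses the subnets and static routes from the configuration above, ignores longest-prefix matching, and the names CONNECTED, STATIC, and resolve are invented for this illustration.

```python
import ipaddress

# Connected subnets and static routes taken from the example configuration.
CONNECTED = {
    ipaddress.ip_network("10.0.1.0/30"): "GigabitEthernet0/1",
    ipaddress.ip_network("10.0.2.0/30"): "GigabitEthernet0/2",
}
STATIC = {
    ipaddress.ip_network("172.16.31.2/32"): ["10.0.1.2", "10.0.2.2"],
}


def resolve(next_hop):
    """Return the (interface, gateway) pairs that a next hop resolves to."""
    address = ipaddress.ip_address(next_hop)
    # Directly connected: the next hop itself is the gateway on that interface.
    for network, interface in CONNECTED.items():
        if address in network:
            return [(interface, next_hop)]
    # Otherwise follow matching static routes and resolve each gateway in turn.
    paths = []
    for network, gateways in STATIC.items():
        if address in network:
            for gateway in gateways:
                paths.extend(resolve(gateway))
    return paths


# The next hop of every prefix learned over the BGP session is 172.16.31.2.
print(resolve("172.16.31.2"))
# [('GigabitEthernet0/1', '10.0.1.2'), ('GigabitEthernet0/2', '10.0.2.2')]
```

    The BGP next hop resolves to two outgoing interfaces, and it is across those two entries that the ECMP hash distributes the traffic.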

    The third way to run BGP over multiple links is to simply configure a BGP session over each link:

    ! Each GE link gets its own /30 subnet
    interface GigabitEthernet0/1
     ip address 10.0.1.1 255.255.255.252
    !
    interface GigabitEthernet0/2
     ip address 10.0.2.1 255.255.255.252
    !
    ! One BGP session per link; maximum-paths allows both paths to be installed
    router bgp 123
     neighbor 10.0.1.2 remote-as 456
     neighbor 10.0.2.2 remote-as 456
     maximum-paths 2
    !

    Under normal circumstances, BGP would now have two copies of each prefix, one learned from neighbor 10.0.1.2 and one from neighbor 10.0.2.2, and would then try to figure out which of these is best. Eventually this will come down to the tie breaker rules and one will win.

    However, with maximum-paths 2 in effect, the router will install both paths for a prefix in the main routing table, which will trigger ECMP between the two routes. Depending on the IOS version, the number of paths that can be configured may be limited to 6. For load balancing to happen, the following path attributes need to be the same for the prefixes learned over the parallel BGP sessions (see Cisco’s documentation):

    • weight
    • local preference
    • AS path length
    • origin
    • MED
    • the neighbor AS or the entire AS path (depending on IOS version)
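
    The attribute comparison in the list above can be sketched as follows. This is an illustration of the idea only and not Cisco’s implementation: the Path class, its field names, and the functions multipath_key and multipath_set are hypothetical, the first path in the list is assumed to be the already-selected best path, and the sketch uses the variant that compares the neighbor AS rather than the entire AS path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    neighbor: str
    weight: int
    local_pref: int
    as_path: tuple
    origin: str
    med: int


def multipath_key(path):
    """Attributes that must match for two paths to be used in parallel."""
    return (
        path.weight,
        path.local_pref,
        len(path.as_path),  # AS path length
        path.origin,
        path.med,
        path.as_path[0] if path.as_path else None,  # neighbor AS
    )


def multipath_set(paths, maximum_paths):
    """Return up to maximum_paths paths whose attributes match the first path.

    Assumes paths[0] is the best path chosen by the normal selection process.
    """
    if not paths:
        return []
    reference = multipath_key(paths[0])
    equal = [p for p in paths if multipath_key(p) == reference]
    return equal[:maximum_paths]


via_link_1 = Path("10.0.1.2", 0, 100, (456, 789), "igp", 0)
via_link_2 = Path("10.0.2.2", 0, 100, (456, 789), "igp", 0)
other_med = Path("10.0.2.2", 0, 100, (456, 789), "igp", 50)

print(len(multipath_set([via_link_1, via_link_2], 2)))  # 2: both paths installed
print(len(multipath_set([via_link_1, other_med], 2)))   # 1: MED differs, no ECMP
```

    In the second call the MED differs between the two sessions, so only the best path would be installed and no load balancing takes place.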

    Using a separate BGP session for each of the parallel links uses more memory and CPU cycles, so in general this is not the preferred option. However, this option does have an important benefit: unlike the two other options, this one also works if the links connect to different routers on the other side, as long as the attributes listed above are the same.

