Overview #
IRP offers failover capabilities that ensure Improvements are preserved in case of planned or unplanned downtime of the IRP server.
IRP’s failover feature uses a master-slave configuration. A second instance of IRP needs to be deployed in order to enable failover. For details about failover configuration and troubleshooting refer to Failover Configuration.
A failover license is required for the second node. Check with Noction’s sales team for details.
- slave node running the same version of IRP as the master node,
- MySQL Multi-Master replication of ‘irp’ database,
- announcement of the replicated improvements with different LocalPref and/or communities by both nodes,
- monitoring by the slave node of BGP announcements originating from the master node, relying on the higher precedence of the master’s announced prefixes,
- activating/deactivating slave IRP components when the master fails or resumes work,
- syncing master configuration to slave node.
For exact details about the IRP failover solution refer to the configuration guides (Failover Configuration, Setup Failover wizard), template files, and (if available) working IRP configurations. For example, some ‘irp’ database tables are not replicated, the ‘mysql’ system database is replicated too, and some IRP components are stopped.
IRP versions 3.5 and earlier do not offer failover capabilities for Inbound improvements. It is advised that in these versions only one of the IRP instances is configured to perform inbound optimization in order to avoid contradictory decisions. In case of a failure of this instance inbound improvements are withdrawn.
- two IRP nodes – Master and Slave,
- grayed-out components are in stand-by mode – services are stopped or operating in limited ways. For example, the Frontend detects that it runs on the slave node and prohibits any changes to configuration while still offering access to reports, graphs or dashboards.
- configuration changes are pushed by master to slave during synchronization. SSH is used to connect to the slave.
- MySQL Multi-Master replication is set up for the ‘irp’ database between the master and slave nodes. Existing MySQL Multi-Master replication functionality is used.
- master IRP node is fully functional and collects statistics, queues prefixes for probing, probes them and eventually makes Improvements. All the intermediate and final results are stored in MySQL and, due to replication, will make it into the slave’s database as well.
- Bgpd works on both master and slave IRP nodes. They make the same announcements with different LocalPref/communities.
- Bgpd on the slave node monitors the number of master announcements on the router (master announcements have higher priority than the slave’s),
- Timers are used to prevent flapping between failover and failback (a minimal sketch of this logic follows the list).
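The flap-prevention behaviour can be pictured as a small state machine: the slave only changes role after the observed condition has persisted for a hold interval. The sketch below is a hypothetical illustration of that logic; the class, parameter names and timer values are assumptions made for the example, not actual IRP settings.

```python
import time

# Hypothetical sketch of the slave-side decision described above: the slave's
# monitoring counts the master's announcements seen on the router and only
# changes role after a hold timer expires, preventing failover/failback flapping.
# Class, parameter names and timer values are illustrative, not IRP settings.

FAILOVER_HOLD_SEC = 180   # master must be absent this long before the slave takes over
FAILBACK_HOLD_SEC = 300   # master must be back this long before the slave stands down


class SlaveFailoverMonitor:
    def __init__(self):
        self.active = False            # True once the slave has taken over
        self.condition_since = None    # when the contradicting condition was first observed

    def observe(self, master_announcement_count, now=None):
        """Process one polling cycle; return True if the slave should be active."""
        now = time.time() if now is None else now
        master_present = master_announcement_count > 0

        if self.active != master_present:
            # Observation agrees with the current role (standby while the master
            # announces, or active while it does not); reset the hold timer.
            self.condition_since = None
            return self.active

        # Observation contradicts the current role; wait out the hold timer.
        if self.condition_since is None:
            self.condition_since = now
        hold = FAILBACK_HOLD_SEC if self.active else FAILOVER_HOLD_SEC
        if now - self.condition_since >= hold:
            self.active = not self.active   # fail over or fail back
            self.condition_since = None
        return self.active
```

A monitoring loop on the slave would call observe() periodically with the number of master announcements it sees on the router and start or stop the local services accordingly.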
Requirements #
- a second server to install the slave on,
- MySQL Multi-Master replication for the irp database.
MySQL replication is not configured by default. Configuration of MySQL Multi-Master replication is a mandatory requirement for a failover IRP configuration. Failover setup, and specifically MySQL Multi-Master replication, should follow the provided failover script. Only a subset of tables in the ‘irp’ database is replicated. Replication requires extra storage space for replication logs on both failover nodes, depending on the overall traffic and platform activity.
- a second set of BGP sessions will be established,
- a second set of PBR IP addresses is required for the slave node in order to perform probing,
- a second set of improvements will be announced to the router,
- a failover license for the slave node,
- key-based SSH authentication from master to slave is required. It is used to synchronize IRP configuration from master to slave (a pre-flight check sketch follows this list),
- MySQL Multi-Master replication of ‘irp’ database,
- IRP set up in Intrusive mode on the master node.
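Two of these requirements, key-based SSH authentication from master to slave and healthy MySQL replication, lend themselves to a simple pre-flight check. The sketch below is only an illustration under assumptions: the host name, credentials, thresholds and the pymysql driver are placeholders and not part of IRP; the failover script shipped with IRP remains the authoritative procedure.

```python
import subprocess

import pymysql  # third-party MySQL driver, used here only for illustration

# Hypothetical pre-flight check for two of the requirements listed above:
# key-based SSH access from master to slave, and MySQL replication health.
# Host names, credentials and thresholds are placeholders, not IRP defaults.

SLAVE_HOST = "irp-slave.example.net"


def ssh_key_auth_works(host):
    """Return True if passwordless (key-based) SSH to the slave succeeds."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
        capture_output=True,
    )
    return result.returncode == 0


def replication_healthy(host, user, password):
    """Return True if the replication threads on `host` are running and caught up."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if not status:
        return False
    return (status["Slave_IO_Running"] == "Yes"
            and status["Slave_SQL_Running"] == "Yes"
            and (status["Seconds_Behind_Master"] or 0) < 60)


if __name__ == "__main__":
    print("SSH key auth to slave:", ssh_key_auth_works(SLAVE_HOST))
    print("Replication healthy:  ", replication_healthy(SLAVE_HOST, "monitor", "secret"))
```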
In case IRP failover is set up in a multiple Routing Domain configuration and the IRP instances are hosted by different RDs, this must be specified in the IRP configuration too. Refer to Optimization for Multiple Routing Domains, global.master_rd, global.slave_rd.
Failover #
In order for this mechanism to work IRP needs to operate in Intrusive mode and the master node’s announcements must have higher priority than the slave’s.
- master synchronizes its configuration to slave. An SSH channel is used to sync configuration files from master to slave and to restart the necessary services,
- MySQL Multi-Master replication is configured on relevant irp database tables so that the data is available immediately in case of emergency,
- components of IRP such as Core, Explorer, Irppushd are stopped or standing by on slave to prevent split-brain or duplicate probing and notifications,
- slave node runs Bgpd and makes exactly the same announcements with a lower BGP LocalPref and/or other communities thus replicating Improvements too.
It is imperative that the master’s LocalPref value is greater than the slave’s value. This ensures that the master’s announcements are preferred and enables the slave to also observe them as part of monitoring.
In networks with multiple edge routers the slave node considers the master down and takes over only if the master’s Improvements are withdrawn from all edge routers.
This is true only if the LocalPref and/or communities assigned to the slave node make its announcements preferred once the master’s are gone. If more preferable announcements are sent by other network elements, the slave node’s announcements will no longer be best. This defeats the purpose of using IRP failover.
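As a hypothetical illustration of the multi-router condition above, the slave treats the master as down only when master-originated announcements have disappeared from every edge router, not just one of them. The function and data structure below are assumptions made for the example.

```python
# Hypothetical illustration: the slave takes over only when the master's
# Improvements are withdrawn from ALL edge routers, not just one of them.

def master_considered_down(master_announcements_per_router):
    """Map of edge router name -> number of master-originated Improvements
    still visible on it; True means the slave may take over."""
    return all(count == 0 for count in master_announcements_per_router.values())


print(master_considered_down({"edge1": 0, "edge2": 42}))   # False: edge2 still sees the master
print(master_considered_down({"edge1": 0, "edge2": 0}))    # True: withdrawn everywhere
```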
Failback #
During failback it is recommended that both IRP nodes are monitored by network administrators to confirm the system is stable.
Recovery of failed node #
Recovery speed is constrained by restoring replication of MySQL databases. On 1Gbps non-congested links replication for a full day of downtime takes approximately 30-45 minutes with 200-250Mbps network bandwidth utilization between the two IRP nodes. During this time the operational node continues running IRP services too.
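For a rough sense of these numbers, catch-up time is essentially the replication backlog divided by the available bandwidth. The calculation below assumes a hypothetical backlog of about 60 GB for a day of downtime, which is consistent with the 30-45 minute estimate above but will vary with traffic and platform activity.

```python
# Back-of-envelope catch-up estimate: backlog divided by link bandwidth.
# The 60 GB/day backlog is an assumed example, not a guaranteed figure.

def catchup_minutes(backlog_gb, link_mbps):
    backlog_megabits = backlog_gb * 8 * 1000   # GB -> megabits (decimal units)
    return backlog_megabits / link_mbps / 60   # seconds -> minutes

print(round(catchup_minutes(60, 250)))   # ~32 minutes at 250 Mbps
print(round(catchup_minutes(60, 200)))   # ~40 minutes at 200 Mbps
```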
Upgrades #
It is imperative that the master and slave nodes are not upgraded at the same time. Update one node first, give the system some time to stabilize and only after that update the second node.