Understanding Graceful Routing Engine Switchover

 


Graceful Routing Engine Switchover Concepts

The graceful Routing Engine switchover (GRES) feature in Junos OS enables a router with redundant Routing Engines to continue forwarding packets, even if one Routing Engine fails. GRES preserves interface and kernel information. Traffic is not interrupted. However, GRES does not preserve the control plane.

Note

On T Series routers, TX Matrix routers, and TX Matrix Plus routers, the control plane is preserved when GRES is combined with nonstop active routing (NSR), and traffic equal to nearly 75 percent of line rate per Packet Forwarding Engine continues uninterrupted during GRES.

Neighboring routers detect that the router has experienced a restart and react to the event in a manner prescribed by individual routing protocol specifications.

To preserve routing during a switchover, GRES must be combined with either:

  • Graceful restart protocol extensions

  • Nonstop active routing (NSR)

Any updates to the master Routing Engine are replicated to the backup Routing Engine as soon as they occur.

Note

Because of its synchronization requirements and logic, NSR/GRES performance is limited by the slowest Routing Engine in the system.

Mastership switches to the backup Routing Engine if:

  • The master Routing Engine kernel stops operating.

  • The master Routing Engine experiences a hardware failure.

  • The administrator initiates a manual switchover.

Note

To quickly restore or to preserve routing protocol state information during a switchover, GRES must be combined with either graceful restart or nonstop active routing, respectively. For more information about graceful restart, see Graceful Restart Concepts. For more information about nonstop active routing, see Nonstop Active Routing Concepts.

If the backup Routing Engine does not receive a keepalive from the master Routing Engine after 2 seconds (4 seconds on M20 routers), it determines that the master Routing Engine has failed and assumes mastership.
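The timeout rule above can be sketched as a small Python model. This is an illustration only and is not Junos code: the names `keepalive_timeout` and `backup_should_assume_mastership` are hypothetical, and the sketch assumes that "after 2 seconds" means strictly more than the timeout has elapsed. Only the 2-second and 4-second values come from the text.

```python
# Illustrative model only; not Junos code. All names are hypothetical.
DEFAULT_TIMEOUT_SECONDS = 2.0  # from the text: 2 seconds on most platforms
M20_TIMEOUT_SECONDS = 4.0      # from the text: 4 seconds on M20 routers


def keepalive_timeout(platform: str) -> float:
    """Return the keepalive loss timeout, in seconds, for a platform name."""
    if platform.upper() == "M20":
        return M20_TIMEOUT_SECONDS
    return DEFAULT_TIMEOUT_SECONDS


def backup_should_assume_mastership(seconds_since_last_keepalive: float,
                                    platform: str) -> bool:
    """True once the silence from the master exceeds the platform timeout.

    Assumption: the comparison is strict (elapsed > timeout).
    """
    return seconds_since_last_keepalive > keepalive_timeout(platform)
```

For example, 2.5 seconds of silence would trigger a switchover in this model on most platforms but not on an M20 router.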

The Packet Forwarding Engine:

  • Seamlessly disconnects from the old master Routing Engine

  • Reconnects to the new master Routing Engine

  • Does not reboot

  • Does not interrupt traffic

The new master Routing Engine and the Packet Forwarding Engine then become synchronized. If the new master Routing Engine detects that the Packet Forwarding Engine state is not up to date, it resends state update messages.

Note

Starting with Junos OS Release 12.2, if adjacencies between the restarting router and the neighboring peer 'helper' routers time out, graceful restart protocol extensions are unable to notify the peer 'helper' routers about the impending restart. Graceful restart can then stop and cause interruptions in traffic.

To ensure that these adjacencies are maintained, change the hold-time for IS-IS protocols from the default of 27 seconds to a value higher than 40 seconds.

Note

Successive Routing Engine switchover events must be a minimum of 240 seconds (4 minutes) apart after both Routing Engines have come up.

If the router or switch displays a warning message similar to "Standby Routing Engine is not ready for graceful switchover. Packet Forwarding Engines that are not ready for graceful switchover might be reset", do not attempt a switchover. If you choose to proceed with the switchover, only the Packet Forwarding Engines that were not ready for graceful switchover are reset. None of the FPCs should spontaneously restart. We recommend that you wait until the warning no longer appears and then proceed with the switchover.
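The minimum spacing between switchovers can be expressed as a simple guard. The following Python sketch is illustrative and is not part of Junos OS: the function name and its parameters are hypothetical. The 240-second value comes from the note above, and the 900-second value comes from the note in this topic about a routing matrix with a TX Matrix Plus router with 3D SIBs.

```python
# Illustrative guard only; not Junos code. All names are hypothetical.
MIN_INTERVAL_SECONDS = 240            # from the text: 4 minutes
MIN_INTERVAL_TX_MATRIX_PLUS_3D = 900  # from the text: 15 minutes


def seconds_until_switchover_allowed(elapsed_since_both_up: float,
                                     tx_matrix_plus_3d_sibs: bool = False) -> float:
    """Return how long to wait before the next switchover.

    elapsed_since_both_up: seconds since both Routing Engines came up
    after the previous switchover. A result of 0 means the minimum
    interval has already passed.
    """
    if tx_matrix_plus_3d_sibs:
        minimum = MIN_INTERVAL_TX_MATRIX_PLUS_3D
    else:
        minimum = MIN_INTERVAL_SECONDS
    return max(0.0, minimum - elapsed_since_both_up)
```

An operations script could call a check like this before requesting a manual switchover, alongside the readiness warning described above.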

Note

Starting from Junos OS Release 14.2, when you perform GRES on MX Series routers, you must execute the clear synchronous-ethernet wait-to-restore operational mode command on the new master Routing Engine to clear the wait-to-restore timer on it. This is because the clear synchronous-ethernet wait-to-restore operational mode command clears the wait-to-restore timer only on the local Routing Engine.

Note

In a routing matrix with a TX Matrix Plus router with 3D SIBs, successive Routing Engine switchover events must be a minimum of 900 seconds (15 minutes) apart after both Routing Engines have come up.

To avoid synchronization issues, perform GRES on only one line-card chassis (LCC) of a TX Matrix router with 3D SIBs at a time.

Note

  • We do not recommend performing a commit operation on the backup Routing Engine when GRES is enabled on the router or switch.

  • We do not recommend enabling GRES on the backup Routing Engine in any scenario.

Note

On QFX10000 switches, we strongly recommend that you configure the nsr-phantom-holdtime seconds statement at the [edit routing-options] hierarchy level when nonstop routing is enabled with GRES. Doing so helps to prevent traffic loss. When you configure this statement, phantom IP addresses remain in the kernel during a switchover until the specified hold-time interval expires. After the interval expires, these routes are added to the appropriate routing tables. In an Ethernet VPN (EVPN)/VXLAN environment, we recommend that you specify a hold-time value of 300 seconds (5 minutes).

Figure 1 shows the system architecture of graceful Routing Engine switchover and the process a routing platform follows to prepare for a switchover.

Figure 1: Preparing for a Graceful Routing Engine Switchover

Note

Check GRES readiness by executing both:

  • The request chassis routing-engine master switch check command from the master Routing Engine

  • The show system switchover command from the backup Routing Engine

The switchover preparation process for GRES is as follows:

  1. The master Routing Engine starts.

  2. The routing platform processes (such as the chassis process [chassisd]) start.

  3. The Packet Forwarding Engine starts and connects to the master Routing Engine.

  4. All state information is updated in the system.

  5. The backup Routing Engine starts.

  6. The system determines whether GRES has been enabled.

  7. The kernel synchronization process (ksyncd) synchronizes the backup Routing Engine with the master Routing Engine.

  8. After ksyncd completes the synchronization, all state information and the forwarding table are updated.
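The eight preparation steps above can be modeled as an ordered checklist. The Python sketch below is a simplified illustration, not a description of how Junos OS tracks this internally: the `PrepStep` names and both functions are hypothetical, and the model assumes the steps complete strictly in the listed order.

```python
# Illustrative model only; not Junos code. All names are hypothetical.
from enum import IntEnum
from typing import Optional, Set


class PrepStep(IntEnum):
    MASTER_RE_STARTED = 1
    PLATFORM_PROCESSES_STARTED = 2          # for example, chassisd
    PFE_CONNECTED_TO_MASTER = 3
    STATE_UPDATED = 4
    BACKUP_RE_STARTED = 5
    GRES_CHECKED = 6                        # system determines whether GRES is enabled
    KSYNCD_SYNCHRONIZED = 7
    STATE_AND_FORWARDING_TABLE_UPDATED = 8


def next_step(completed: Set[PrepStep]) -> Optional[PrepStep]:
    """Return the first step not yet completed, or None when all are done."""
    for step in PrepStep:
        if step not in completed:
            return step
    return None


def ready_for_graceful_switchover(completed: Set[PrepStep],
                                  gres_enabled: bool) -> bool:
    """Ready only when GRES is enabled and every step has completed."""
    return gres_enabled and next_step(completed) is None
```

In this model, a system that has finished only the first two steps reports `PFE_CONNECTED_TO_MASTER` as the next step and is not yet ready for a graceful switchover.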

Figure 2 shows the effects of a switchover on the routing (or switching) platform.

Figure 2: Graceful Routing Engine Switchover Process

A switchover process comprises the following steps:

  1. When keepalives from the master Routing Engine are lost, the system switches over gracefully to the backup Routing Engine.

  2. The Packet Forwarding Engine connects to the backup Routing Engine, which becomes the new master.

  3. Routing platform processes that are not part of GRES (such as the routing protocol process rpd) restart.

  4. State information learned from the point of the switchover is updated in the system.

  5. If configured, graceful restart protocol extensions collect and restore routing information from neighboring peer helper routers.
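The switchover steps above can be summarized as an ordered event list in which the last step is conditional. This Python sketch is illustrative only: the function name is hypothetical, and the event strings simply paraphrase the steps in this topic.

```python
# Illustrative model only; not Junos code. The function name is hypothetical.
from typing import List


def switchover_events(graceful_restart_configured: bool) -> List[str]:
    """Return the switchover steps in order.

    The final step is included only when graceful restart is configured.
    """
    events = [
        "keepalives from the master are lost; system switches to the backup Routing Engine",
        "Packet Forwarding Engine connects to the backup, which becomes the new master",
        "processes that are not part of GRES (such as rpd) restart",
        "state learned from the point of the switchover is updated",
    ]
    if graceful_restart_configured:
        events.append(
            "graceful restart extensions collect and restore routing "
            "information from neighboring helper routers"
        )
    return events


if __name__ == "__main__":
    for number, event in enumerate(switchover_events(True), start=1):
        print(f"{number}. {event}")
```

The sketch makes the dependency explicit: without graceful restart, the sequence ends once post-switchover state has been updated.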

Note

For MX Series routers using enhanced subscriber management, the new backup Routing Engine (the former master Routing Engine) will reboot when a graceful Routing Engine switchover is performed. This cold restart resynchronizes the backup Routing Engine state with that of the new master Routing Engine, preventing discrepancies in state that might have occurred during the switchover.

Note

During GRES on T Series and M320 routers, the Switch Interface Boards (SIBs) are taken offline and restarted one by one. This gives the Switch Processor Mezzanine Board (SPMB) that manages each SIB enough time to populate state information for its associated SIB. However, on a fully populated chassis where all FPCs are sending traffic at full line rate, there might be momentary packet loss during the switchover.

Note

When GRES is configured and the restart chassis-control command is executed on a TX Matrix Plus router with 3D SIBs, you cannot predict which Routing Engine becomes the master. This is because the restart chassis-control command restarts the chassisd process, which is responsible for maintaining and retaining mastership. When chassisd restarts, the new chassisd process is handled based on the router or switch load. As a result, either Routing Engine can become the master.

Effects of a Routing Engine Switchover

Table 1 describes the effects of a Routing Engine switchover when different features are enabled:

  • No high availability features

  • Graceful Routing Engine switchover

  • Graceful restart

  • Nonstop active routing

Table 1: Effects of a Routing Engine Switchover

Feature: Dual Routing Engines only (no features enabled)

Benefits:

  • When the switchover to the new master Routing Engine is complete, routing convergence takes place and traffic is resumed.

Considerations:

  • All physical interfaces are taken offline.

  • Packet Forwarding Engines restart.

  • The backup Routing Engine restarts the routing protocol process (rpd).

  • All hardware and interfaces are discovered by the new master Routing Engine.

  • The switchover takes several minutes.

  • All of the router's adjacencies are aware of the physical (interface alarms) and routing (topology) changes.

Feature: GRES enabled

Benefits:

  • During the switchover, interface and kernel information is preserved.

  • The switchover is faster because the Packet Forwarding Engines are not restarted.

Considerations:

  • The new master Routing Engine restarts the routing protocol process (rpd).

  • All hardware and interfaces are acquired by a process that is similar to a warm restart.

  • All adjacencies are aware of the router's change in state.

Feature: GRES and NSR enabled

Benefits:

  • Traffic is not interrupted during the switchover.

  • Interface and kernel information is preserved.

Considerations:

  • Unsupported protocols must be refreshed using the normal recovery mechanisms inherent in each protocol.

Feature: GRES and graceful restart enabled

Benefits:

  • Traffic is not interrupted during the switchover.

  • Interface and kernel information is preserved.

  • Graceful restart protocol extensions quickly collect and restore routing information from the neighboring routers.

Considerations:

  • Neighbors are required to support graceful restart, and a wait interval is required.

  • The routing protocol process (rpd) restarts.

  • For certain protocols, a significant change in the network can cause graceful restart to stop.

  • Starting with Junos OS Release 12.2, if adjacencies between the restarting router and the neighboring peer 'helper' routers time out, graceful restart can stop and cause interruptions in traffic.
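For planning scripts, part of Table 1 can be encoded as data. The Python sketch below is one possible encoding and is not an official data model: the class, keys, and field names are hypothetical. A value of `None` means that the table row does not state the property, and the `False` traffic value for the dual Routing Engine row is inferred from the statement that traffic is resumed after convergence.

```python
# Illustrative encoding of Table 1; not Junos code. All names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SwitchoverEffect:
    pfe_restarts: Optional[bool]           # do Packet Forwarding Engines restart?
    rpd_restarts: Optional[bool]           # does the routing protocol process restart?
    traffic_uninterrupted: Optional[bool]  # is traffic uninterrupted during switchover?


# None marks a property that the corresponding table row does not state.
TABLE_1: Dict[str, SwitchoverEffect] = {
    "dual-re-only": SwitchoverEffect(True, True, False),
    "gres": SwitchoverEffect(False, True, None),
    "gres+nsr": SwitchoverEffect(None, None, True),
    "gres+graceful-restart": SwitchoverEffect(None, True, True),
}


def features_where_rpd_restarts() -> List[str]:
    """List the feature sets for which the table states that rpd restarts."""
    return [name for name, effect in TABLE_1.items() if effect.rpd_restarts]
```

Keeping unstated properties as `None` rather than guessing a value keeps the encoding faithful to what the table actually says.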

Graceful Routing Engine Switchover on Aggregated Services Interfaces

If a graceful Routing Engine switchover (GRES) is triggered by an operational mode command, the state of aggregated services interfaces (ASIs) is not preserved. For example:

However, if GRES is triggered by a CLI commit or FPC restart or crash, the backup Routing Engine updates the ASI state. For example:

Or:

Release History Table
Release: 14.2
Description: Starting from Junos OS Release 14.2, when you perform GRES on MX Series routers, you must execute the clear synchronous-ethernet wait-to-restore operational mode command on the new master Routing Engine to clear the wait-to-restore timer on it.

Release: 12.2
Description: Starting with Junos OS Release 12.2, if adjacencies between the restarting router and the neighboring peer 'helper' routers time out, graceful restart protocol extensions are unable to notify the peer 'helper' routers about the impending restart.

Release: 12.2
Description: Starting with Junos OS Release 12.2, if adjacencies between the restarting router and the neighboring peer 'helper' routers time out, graceful restart can stop and cause interruptions in traffic.