Understanding Graceful Routing Engine Switchover

Graceful Routing Engine Switchover Concepts

The graceful Routing Engine switchover (GRES) feature in Junos OS and Junos OS Evolved enables a device with redundant Routing Engines to continue forwarding packets even if one Routing Engine fails. GRES preserves interface and kernel information, and traffic is not interrupted. However, GRES does not preserve the control plane.

On PTX10004, PTX10008, and PTX10016 platforms running Junos OS Evolved, GRES is enabled by default and cannot be disabled.

Note:

On T Series routers, TX Matrix routers, and TX Matrix Plus routers, the control plane is preserved when GRES is combined with nonstop active routing (NSR), and nearly 75 percent of line-rate traffic per Packet Forwarding Engine remains uninterrupted during GRES.

Neighboring devices detect that the device has experienced a restart and react to the event in a manner prescribed by individual routing protocol specifications.

To preserve routing during a switchover, GRES must be combined with either:

  • Graceful restart protocol extensions

  • Nonstop active routing (NSR)

Any updates to the primary Routing Engine are replicated to the backup Routing Engine as soon as they occur.

Note:

Because of its synchronization requirements and logic, NSR/GRES performance is limited by the slowest Routing Engine in the system.

The primary role switches to the backup Routing Engine if:

  • The primary Routing Engine kernel stops operating.

  • The primary Routing Engine experiences a hardware failure.

  • The administrator initiates a manual switchover.

Note:

To quickly restore or to preserve routing protocol state information during a switchover, GRES must be combined with either graceful restart or nonstop active routing, respectively. For more information about graceful restart, see Graceful Restart Concepts. For more information about nonstop active routing, see Nonstop Active Routing Concepts.

If the backup Routing Engine does not receive a keepalive from the primary Routing Engine for 2 seconds (4 seconds on M20 routers), it determines that the primary Routing Engine has failed and assumes the primary role.
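The keepalive rule above can be modeled in a few lines. The following Python sketch is illustrative only: the function and constant names are hypothetical, and it assumes that a gap exactly equal to the timeout already counts as a failure.

```python
# Illustrative model only: these names are hypothetical, not a Junos API.
# Timeout values follow the text: 2 seconds by default, 4 seconds on M20 routers.

DEFAULT_KEEPALIVE_TIMEOUT_S = 2.0
PLATFORM_KEEPALIVE_TIMEOUT_S = {"M20": 4.0}


def keepalive_timeout(platform: str) -> float:
    """Return the keepalive timeout, in seconds, for the given platform."""
    return PLATFORM_KEEPALIVE_TIMEOUT_S.get(platform, DEFAULT_KEEPALIVE_TIMEOUT_S)


def backup_should_take_over(platform: str, last_keepalive_s: float, now_s: float) -> bool:
    """Return True once the backup has gone a full timeout without a keepalive.

    Treating "exactly at the timeout" as a failure is an assumption of this sketch.
    """
    return (now_s - last_keepalive_s) >= keepalive_timeout(platform)


# 2.5 seconds of silence exceeds the default timeout but not the M20 timeout.
print(backup_should_take_over("MX960", last_keepalive_s=10.0, now_s=12.5))  # True
print(backup_should_take_over("M20", last_keepalive_s=10.0, now_s=12.5))    # False
```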

The Packet Forwarding Engine:

  • Seamlessly disconnects from the old primary Routing Engine

  • Reconnects to the new primary Routing Engine

  • Does not reboot

  • Does not interrupt traffic

The new primary Routing Engine and the Packet Forwarding Engine then become synchronized. If the new primary Routing Engine detects that the Packet Forwarding Engine state is not up to date, it resends state update messages.
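The resynchronization step can be pictured as the new primary replaying whatever the Packet Forwarding Engine missed. This Python sketch assumes, purely for illustration, that each state update carries a sequence number; the text does not say how Junos tracks outstanding updates, and the class names and interface keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical model: each state update carries a sequence number so the new
# primary can tell which updates the Packet Forwarding Engine (PFE) missed.
Update = Tuple[int, str, str]  # (sequence number, key, value)


@dataclass
class PacketForwardingEngine:
    applied_seq: int = 0                                  # last update applied
    state: Dict[str, str] = field(default_factory=dict)

    def apply(self, update: Update) -> None:
        seq, key, value = update
        self.state[key] = value
        self.applied_seq = seq


@dataclass
class RoutingEngine:
    update_log: List[Update] = field(default_factory=list)  # ordered by sequence

    def resync(self, pfe: PacketForwardingEngine) -> int:
        """Resend every update newer than the PFE's state; return the count sent."""
        missing = [u for u in self.update_log if u[0] > pfe.applied_seq]
        for update in missing:
            pfe.apply(update)
        return len(missing)


new_primary = RoutingEngine(update_log=[
    (1, "ge-0/0/0", "up"),
    (2, "ge-0/0/1", "up"),
    (3, "ge-0/0/1", "down"),
])
pfe = PacketForwardingEngine(applied_seq=1, state={"ge-0/0/0": "up"})
sent = new_primary.resync(pfe)
print(sent, pfe.state)  # 2 {'ge-0/0/0': 'up', 'ge-0/0/1': 'down'}
```

If the PFE is already up to date, resync sends nothing, which mirrors the text's "if ... not up to date" condition.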

Note the following GRES behaviors, recommendations, or requirements:

  • Starting with Junos OS Release 12.2, if adjacencies between the restarting device and the neighboring peer 'helper' devices time out, graceful restart protocol extensions are unable to notify the peer 'helper' devices about the impending restart. Graceful restart can then stop and cause interruptions in traffic.

    To ensure that these adjacencies are maintained, increase the IS-IS hold time from the default of 27 seconds to a value greater than 40 seconds.

  • Successive Routing Engine switchover events must be a minimum of 240 seconds (4 minutes) apart after both Routing Engines have come up.

    If the device displays a warning message similar to:

    then do not attempt a switchover. If you choose to proceed with switchover, the device resets only the Packet Forwarding Engines that were not ready for graceful switchover. None of the FPCs should spontaneously restart. We recommend that you wait until the warning no longer appears and then proceed with the switchover.

  • Starting with Junos OS Release 14.2, when you perform GRES on MX Series routers, you must run the clear synchronous-ethernet wait-to-restore operational mode command on the new primary Routing Engine to clear the wait-to-restore timer there, because the command clears the wait-to-restore timer only on the local Routing Engine.

  • In a routing matrix with a TX Matrix Plus router with 3D SIBs, successive Routing Engine switchover events must be a minimum of 900 seconds (15 minutes) apart after both Routing Engines have come up.

    To avoid synchronization issues, you must perform GRES on only one line-card chassis (LCC) of a TX Matrix Plus router with 3D SIBs at a time.

  • We do not recommend:

    • Doing a commit operation on the backup Routing Engine when GRES is enabled on the device.

    • Enabling GRES on the backup Routing Engine in any scenario.

  • When you enable nonstop routing with GRES on switches in the QFX10000 line that have redundant Routing Engines, we strongly recommend that you configure the nsr-phantom-holdtime seconds statement at the [edit routing-options] hierarchy level. Doing so helps to prevent traffic loss during a switchover.

    If you configure this statement, phantom IP addresses remain in the kernel during the switchover until the specified hold-time interval expires. After the interval expires, the device adds the corresponding routes to the appropriate routing tables. In an Ethernet VPN (EVPN)-VXLAN environment, we recommend that you specify a hold-time value of 300 seconds (5 minutes).

    This option doesn't apply to QFX10002 switches, which don't have redundant Routing Engines and don't support GRES.
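Several of the items above are numeric thresholds. A minimal Python sketch of a pre-switchover check might look like the following, assuming you already know the configured IS-IS hold time and how long both Routing Engines have been up. The function is hypothetical and does not query a device.

```python
from typing import List, Optional

# Thresholds taken from the recommendations above.
ISIS_HOLD_TIME_FLOOR_S = 40   # hold time should be greater than this
SWITCHOVER_SPACING_S = 240    # 4 minutes after both Routing Engines are up
TXP_3D_SIB_SPACING_S = 900    # 15 minutes for TX Matrix Plus with 3D SIBs


def switchover_warnings(
    isis_hold_time_s: Optional[int],
    seconds_since_both_res_up: float,
    tx_matrix_plus_3d_sibs: bool = False,
) -> List[str]:
    """Return reasons to postpone a switchover; an empty list means none found.

    Pass isis_hold_time_s=None when IS-IS graceful restart is not in use.
    This is a hypothetical helper, not a Junos command.
    """
    warnings: List[str] = []
    if isis_hold_time_s is not None and isis_hold_time_s <= ISIS_HOLD_TIME_FLOOR_S:
        warnings.append(
            f"IS-IS hold time {isis_hold_time_s}s should be greater than {ISIS_HOLD_TIME_FLOOR_S}s"
        )
    spacing = TXP_3D_SIB_SPACING_S if tx_matrix_plus_3d_sibs else SWITCHOVER_SPACING_S
    if seconds_since_both_res_up < spacing:
        remaining = spacing - seconds_since_both_res_up
        warnings.append(f"wait {remaining:.0f}s more before switching over")
    return warnings


print(switchover_warnings(27, 100))
# ['IS-IS hold time 27s should be greater than 40s', 'wait 140s more before switching over']
```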

Figure 1 shows the system architecture of graceful Routing Engine switchover and the process a routing platform follows to prepare for a switchover.

Figure 1: Preparing for a Graceful Routing Engine Switchover

Note:

Check GRES readiness by executing both:

  • The request chassis routing-engine master switch check command from the primary Routing Engine

  • The show system switchover command from the backup Routing Engine

The switchover preparation process for GRES is as follows:

  1. The primary Routing Engine starts.

  2. The routing platform processes (such as the chassis process [chassisd]) start.

  3. The Packet Forwarding Engine starts and connects to the primary Routing Engine.

  4. All state information is updated in the system.

  5. The backup Routing Engine starts.

  6. The system determines whether GRES has been enabled.

  7. The kernel synchronization process (ksyncd) synchronizes the backup Routing Engine with the primary Routing Engine.

  8. After ksyncd completes the synchronization, all state information and the forwarding table are updated.
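Steps 6 through 8, together with the earlier point that updates are replicated as soon as they occur, amount to a bulk copy followed by incremental replication. This Python sketch models that idea with a dictionary standing in for kernel state; KsyncModel is a hypothetical name and does not reflect how ksyncd is actually implemented.

```python
from typing import Dict


class KernelState:
    """Toy stand-in for kernel tables such as interfaces and forwarding entries."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}


class KsyncModel:
    """Hypothetical model of ksyncd: one bulk copy, then per-change replication."""

    def __init__(self, primary: KernelState, backup: KernelState, gres_enabled: bool) -> None:
        self.primary = primary
        self.backup = backup
        self.gres_enabled = gres_enabled  # step 6: is GRES configured?
        self.synced = False

    def initial_sync(self) -> None:
        # Steps 7-8: copy the primary's full state to the backup.
        if self.gres_enabled:
            self.backup.entries = dict(self.primary.entries)
            self.synced = True

    def update(self, key: str, value: str) -> None:
        # Later changes on the primary are replicated as soon as they occur.
        self.primary.entries[key] = value
        if self.synced:
            self.backup.entries[key] = value


primary, backup = KernelState(), KernelState()
ksync = KsyncModel(primary, backup, gres_enabled=True)
ksync.update("10.0.0.0/24", "ge-0/0/0.0")  # learned before the backup synced
ksync.initial_sync()
ksync.update("10.0.1.0/24", "ge-0/0/1.0")  # replicated immediately
print(backup.entries == primary.entries)    # True
```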

Figure 2 shows the effects of a switchover on the routing (or switching) platform.

Figure 2: Graceful Routing Engine Switchover Process

A switchover process comprises the following steps:

  1. When keepalives from the primary Routing Engine are lost, the system switches over gracefully to the backup Routing Engine.

  2. The Packet Forwarding Engine connects to the backup Routing Engine, which becomes the new primary.

  3. Routing platform processes that are not part of GRES (such as the routing protocol process rpd) restart.

  4. State information learned from the point of the switchover is updated in the system.

  5. If configured, graceful restart protocol extensions collect and restore routing information from neighboring peer helper devices.
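The switchover steps can be summarized as an ordered list of actions. In this Python sketch, the process named example-gres-aware-daemon is a hypothetical placeholder for any process whose state GRES preserves; rpd is the one process the text names as restarting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Process:
    name: str
    part_of_gres: bool  # True if its state survives the switchover


def switchover_actions(processes: List[Process], graceful_restart: bool) -> List[str]:
    """Return, in order, the actions described after keepalives are lost."""
    actions = [
        "switch to backup Routing Engine",
        "PFE connects to new primary Routing Engine",
    ]
    for proc in processes:
        if not proc.part_of_gres:
            actions.append(f"restart {proc.name}")
    actions.append("update state learned since the switchover")
    if graceful_restart:
        actions.append("restore routing information from helper neighbors")
    return actions


procs = [
    Process("rpd", part_of_gres=False),
    Process("example-gres-aware-daemon", part_of_gres=True),  # hypothetical name
]
for step in switchover_actions(procs, graceful_restart=True):
    print(step)
```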

Note:

For MX Series routers using enhanced subscriber management, the new backup Routing Engine (the former primary Routing Engine) will reboot when a graceful Routing Engine switchover is performed. This cold restart resynchronizes the backup Routing Engine state with that of the new primary Routing Engine, preventing discrepancies in state that might have occurred during the switchover.

Note:

During GRES on T Series and M320 routers, the Switch Interface Boards (SIBs) are taken offline and restarted one by one. This gives the Switch Processor Mezzanine Board (SPMB) that manages each SIB enough time to populate state information for that SIB. However, on a fully populated chassis where all FPCs are sending traffic at full line rate, there might be momentary packet loss during the switchover.

Note:

When GRES is configured and the restart chassis-control command is executed on a TX Matrix Plus router with 3D SIBs, you cannot predict which Routing Engine becomes the primary. This is because the restart chassis-control command restarts the chassisd process, which is responsible for maintaining and retaining the primary role. When chassisd restarts, the new chassisd process assigns the primary role based on the device load, so either Routing Engine can become the primary.

Effects of a Routing Engine Switchover

Table 1 describes the effects of a Routing Engine switchover when different features are enabled:

  • No high availability features

  • Graceful Routing Engine switchover

  • Graceful restart

  • Nonstop active routing

Table 1: Effects of a Routing Engine Switchover

Dual Routing Engines only (no features enabled)

  Benefits:

    • When the switchover to the new primary Routing Engine is complete, routing convergence takes place and traffic is resumed.

  Considerations:

    • All physical interfaces are taken offline.

    • Packet Forwarding Engines restart.

    • The backup Routing Engine restarts the routing protocol process (rpd).

    • All hardware and interfaces are discovered by the new primary Routing Engine.

    • The switchover takes several minutes.

    • All of the device's adjacencies are aware of the physical (interface alarms) and routing (topology) changes.

GRES enabled

  Benefits:

    • During the switchover, interface and kernel information is preserved.

    • The switchover is faster because the Packet Forwarding Engines are not restarted.

  Considerations:

    • The new primary Routing Engine restarts the routing protocol process (rpd).

    • All hardware and interfaces are acquired by a process that is similar to a warm restart.

    • All adjacencies are aware of the device's change in state.

GRES and NSR enabled

  Benefits:

    • Traffic is not interrupted during the switchover.

    • Interface and kernel information are preserved.

  Considerations:

    • Unsupported protocols must be refreshed using the normal recovery mechanisms inherent in each protocol.

GRES and graceful restart enabled

  Benefits:

    • Traffic is not interrupted during the switchover.

    • Interface and kernel information are preserved.

    • Graceful restart protocol extensions quickly collect and restore routing information from the neighboring devices.

  Considerations:

    • Neighbors are required to support graceful restart, and a wait interval is required.

    • The routing protocol process (rpd) restarts.

    • For certain protocols, a significant change in the network can cause graceful restart to stop.

    • Starting with Junos OS Release 12.2, if adjacencies between the restarting device and the neighboring peer 'helper' devices time out, graceful restart can stop and cause interruptions in traffic.
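Table 1 can be condensed into a lookup keyed by which features are enabled. In this Python sketch the field names are hypothetical, and the NSR row's rpd_restarts=False reflects NSR keeping the routing protocol process synchronized on the backup Routing Engine, which Table 1 does not state explicitly.

```python
from typing import NamedTuple


class SwitchoverEffects(NamedTuple):
    pfe_restarts: bool
    rpd_restarts: bool
    interface_kernel_state_preserved: bool


# (gres, nsr, graceful_restart) -> effects, one entry per row of Table 1.
# Field names are hypothetical labels for this sketch.
TABLE_1 = {
    (False, False, False): SwitchoverEffects(True, True, False),
    (True, False, False): SwitchoverEffects(False, True, True),
    (True, True, False): SwitchoverEffects(False, False, True),
    (True, False, True): SwitchoverEffects(False, True, True),
}


def switchover_effects(gres: bool, nsr: bool = False,
                       graceful_restart: bool = False) -> SwitchoverEffects:
    """Look up the Table 1 row for a feature combination."""
    try:
        return TABLE_1[(gres, nsr, graceful_restart)]
    except KeyError:
        raise ValueError("combination not described in Table 1") from None


print(switchover_effects(True, nsr=True))
# SwitchoverEffects(pfe_restarts=False, rpd_restarts=False, interface_kernel_state_preserved=True)
```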

Graceful Routing Engine Switchover on Aggregated Services Interfaces

If a graceful Routing Engine switchover (GRES) is triggered by an operational mode command, the device does not preserve the state of aggregated services interfaces (ASIs). For example:

However, if GRES is triggered by a CLI commit or FPC restart or crash, the backup Routing Engine updates the ASI state. For example:

Or:

Graceful Routing Engine Switchover System Requirements

Graceful Routing Engine switchover is supported on all routing (or switching) platforms that contain dual Routing Engines. All Routing Engines configured for graceful Routing Engine switchover must run the same Junos OS release. Hardware and software support for graceful Routing Engine switchover is described in the following sections:

Graceful Routing Engine Switchover Platform Support

To enable graceful Routing Engine switchover, your system must meet these minimum requirements:

  • M20 and M40e routers—Junos OS Release 5.7 or later

  • M10i router—Junos OS Release 6.1 or later

  • M320 router—Junos OS Release 6.2 or later

  • T320 router, T640 router, and TX Matrix router—Junos OS Release 7.0 or later

  • M120 router—Junos OS Release 8.2 or later

  • MX960 router—Junos OS Release 8.3 or later

  • MX480 router—Junos OS Release 8.4 or later (8.4R2 recommended)

  • MX240 router—Junos OS Release 9.0 or later

  • PTX5000 router—Junos OS Release 12.1X48 or later

  • Standalone T1600 router—Junos OS Release 8.5 or later

  • Standalone T4000 router—Junos OS Release 12.1R2 or later

  • TX Matrix Plus router—Junos OS Release 9.6 or later

  • TX Matrix Plus router with 3D SIBs—Junos OS Release 13.1 or later

  • EX Series switches with dual Routing Engines or in a Virtual Chassis—Junos OS Release 9.2 or later

  • QFX Series switches in a Virtual Chassis—Junos OS Release 13.2 or later

  • EX Series or QFX Series switches in a Virtual Chassis Fabric—Junos OS Release 13.2X51-D20 or later
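A quick way to check the release requirements above, including the rule that all Routing Engines must run the same release, is sketched below in Python. It covers only a subset of platforms and compares only the major and minor release numbers, so suffixes such as R2 or X51-D20 are ignored; the helper names are hypothetical.

```python
import re
from typing import Tuple

# Subset of the minimum releases listed above, reduced to (major, minor).
# Simplification (assumption): R/X/D suffixes such as "8.4R2" are ignored.
MIN_RELEASE = {
    "M20": (5, 7),
    "M40e": (5, 7),
    "M10i": (6, 1),
    "M320": (6, 2),
    "T640": (7, 0),
    "M120": (8, 2),
    "MX960": (8, 3),
    "MX480": (8, 4),
    "MX240": (9, 0),
    "TX Matrix Plus": (9, 6),
}


def major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as "9.0R1"."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def gres_release_ok(platform: str, primary_release: str, backup_release: str) -> bool:
    """Hypothetical check: same release on both Routing Engines, at or above the minimum."""
    if primary_release != backup_release:
        return False
    return major_minor(primary_release) >= MIN_RELEASE[platform]


print(gres_release_ok("MX480", "8.4R2", "8.4R2"))  # True
print(gres_release_ok("MX240", "8.5R1", "8.5R1"))  # False: MX240 needs 9.0
print(gres_release_ok("MX960", "9.1R1", "9.2R1"))  # False: releases differ
```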

For more information about support for graceful Routing Engine switchover, see the sections that follow.

Graceful Routing Engine Switchover Feature Support

Graceful Routing Engine switchover supports most Junos OS features in Release 5.7 and later. Particular Junos OS features require specific versions of Junos OS. See Table 2.

Table 2: Graceful Routing Engine Switchover Feature Support

Each entry lists the application followed by the Junos OS release in which graceful Routing Engine switchover support begins.

  • Aggregated Ethernet interfaces with Link Aggregation Control Protocol (LACP) and aggregated SONET interfaces: 6.2

  • Asynchronous Transfer Mode (ATM) virtual circuits (VCs): 6.2

  • Logical systems: 6.3

    Note:

    In Junos OS Release 9.3 and later, the logical router feature is renamed to logical system.

  • Multicast: 6.4 (7.0 for TX Matrix router)

  • Multilink Point-to-Point Protocol (MLPPP) and Multilink Frame Relay (MLFR): 7.0

  • Automatic Protection Switching (APS): 7.4

    The currently active interface (either the designated working or the designated protect interface) remains the active interface during a Routing Engine switchover.

  • Point-to-multipoint Multiprotocol Label Switching (MPLS) LSPs (transit only): 7.4

  • Compressed Real-Time Transport Protocol (CRTP): 7.6

  • Virtual private LAN service (VPLS): 8.2

  • Ethernet Operation, Administration, and Management (OAM) as defined by IEEE 802.3ah: 8.5

  • Extended DHCP relay agent: 8.5

  • Ethernet OAM as defined by IEEE 802.1ag: 9.0

  • Packet Gateway Control Protocol (PGCP) process (pgcpd) on Multiservices 500 PICs on T640 routers: 9.0

  • Subscriber access: 9.4

  • Layer 2 circuit and LDP-based VPLS pseudowire redundant configuration: 9.6
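Table 2 plus its one platform-specific exception (multicast on the TX Matrix router) can be expressed as a lookup. The dictionary keys in this Python sketch are hypothetical labels, and releases are given as (major, minor) tuples.

```python
from typing import Dict, Tuple

Release = Tuple[int, int]

# Minimum release per application, from Table 2 (keys are hypothetical labels).
FEATURE_MIN_RELEASE: Dict[str, Release] = {
    "aggregated-ethernet-lacp-and-sonet": (6, 2),
    "atm-vcs": (6, 2),
    "logical-systems": (6, 3),
    "multicast": (6, 4),
    "mlppp-mlfr": (7, 0),
    "aps": (7, 4),
    "p2mp-mpls-lsp-transit": (7, 4),
    "crtp": (7, 6),
    "vpls": (8, 2),
    "ethernet-oam-802.3ah": (8, 5),
    "extended-dhcp-relay": (8, 5),
    "ethernet-oam-802.1ag": (9, 0),
    "pgcpd-ms500-t640": (9, 0),
    "subscriber-access": (9, 4),
    "l2circuit-ldp-vpls-pw-redundancy": (9, 6),
}

# Platform-specific exceptions noted in Table 2.
PLATFORM_OVERRIDES: Dict[Tuple[str, str], Release] = {
    ("multicast", "TX Matrix"): (7, 0),
}


def feature_supported(feature: str, platform: str, release: Release) -> bool:
    """Return True if `release` meets the minimum for `feature` on `platform`."""
    minimum = PLATFORM_OVERRIDES.get((feature, platform), FEATURE_MIN_RELEASE[feature])
    return release >= minimum


print(feature_supported("multicast", "M320", (6, 4)))       # True
print(feature_supported("multicast", "TX Matrix", (6, 4)))  # False
```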

The following constraints apply to graceful Routing Engine switchover feature support:

  • When graceful Routing Engine switchover and aggregated Ethernet interfaces are configured on the same system, the aggregated Ethernet interfaces must not be configured for fast-polling LACP. With fast polling, the LACP polls time out at the remote end during the Routing Engine primary-role switchover, and when LACP polling times out, the aggregated link and interface are disabled. The primary-role change is fast enough that standard and slow LACP polling do not time out. However, this restriction does not apply to MX Series routers that run Junos OS Release 9.4 or later with distributed periodic packet management (PPM) enabled, which is the default configuration. On such routers, you can configure both graceful Routing Engine switchover and fast-polling LACP on aggregated Ethernet interfaces.

    Note:

    MACsec sessions flap during a graceful Routing Engine switchover.

    Starting with Junos OS Release 13.2, when a graceful Routing Engine switchover occurs, the VRRP state does not change. Graceful Routing Engine switchover supports VRRP only when PPM delegation is enabled (which is the default).
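The LACP and VRRP constraints above combine a platform, a release, and the periodic packet management (PPM) setting. This Python sketch encodes them as predicates over a hypothetical DeviceConfig record; it treats distributed PPM and PPM delegation as the same setting, which is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DeviceConfig:
    """Hypothetical configuration summary; field names are illustrative."""
    platform_family: str          # labels such as "MX" or "M"
    release: Tuple[int, int]      # (major, minor) Junos OS release
    gres: bool
    lacp_fast_polling: bool       # aggregated Ethernet uses fast-polling LACP
    distributed_ppm: bool = True  # the text notes this is the default


def lacp_fast_polling_conflict(cfg: DeviceConfig) -> bool:
    """True if fast-polling LACP may time out during a GRES switchover."""
    if not (cfg.gres and cfg.lacp_fast_polling):
        return False
    exempt = cfg.platform_family == "MX" and cfg.release >= (9, 4) and cfg.distributed_ppm
    return not exempt


def vrrp_state_preserved(cfg: DeviceConfig) -> bool:
    """True if VRRP state should not change across GRES (13.2+ with PPM delegation).

    Treating distributed PPM and PPM delegation as one flag is an assumption.
    """
    return cfg.gres and cfg.release >= (13, 2) and cfg.distributed_ppm


print(lacp_fast_polling_conflict(DeviceConfig("MX", (9, 4), True, True)))  # False
print(lacp_fast_polling_conflict(DeviceConfig("M", (9, 4), True, True)))   # True
```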

Graceful Routing Engine Switchover DPC Support

Graceful Routing Engine switchover supports all Dense Port Concentrators (DPCs) on the MX Series 5G Universal Routing Platforms running the appropriate version of Junos OS as shown in Graceful Routing Engine Switchover Platform Support. For more information about DPCs, see the MX Series DPC Guide.

Graceful Routing Engine Switchover and Subscriber Access

Graceful Routing Engine switchover currently supports most of the features directly associated with dynamic DHCP and dynamic PPPoE subscriber access. Graceful Routing Engine switchover also supports the unified in-service software upgrade (ISSU) for the DHCP access model and the PPPoE access model used by subscriber access.

Note:

When graceful Routing Engine switchover is enabled for subscriber management, all Routing Engines in the router must have the same amount of DRAM for stable operation.

Graceful Routing Engine Switchover PIC Support

Graceful Routing Engine switchover is supported on most PICs, except for the services PICs listed in this section. The PIC must be on a supported routing platform running the appropriate version of Junos OS. For information about FPC types, FPC/PIC compatibility, and the initial Junos OS Release in which an FPC supported a particular PIC, see the PIC guide for your router platform.

The following constraints apply to graceful Routing Engine switchover support for services PICs:

  • You can include the graceful-switchover statement at the [edit chassis redundancy] hierarchy level on a router with Adaptive Services, Multiservices, and Tunnel Services PICs configured on it and successfully commit the configuration. However, all services on these PICs—except the Layer 2 service packages and extension-provider and SDK applications on Multiservices PICs—are reset during a switchover.

  • Graceful Routing Engine switchover is not supported on any Monitoring Services PICs or Multilink Services PICs. If you include the graceful-switchover statement at the [edit chassis redundancy] hierarchy level on a router with either of these PIC types configured on it and issue the commit command, the commit fails.

  • Graceful Routing Engine switchover is not supported on Multiservices 400 PICs configured for monitoring services applications. If you include the graceful-switchover statement, the commit fails.

Note:

When an unsupported PIC is online, you cannot enable graceful Routing Engine switchover. If graceful Routing Engine switchover is already enabled, an unsupported PIC cannot come online.
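The services-PIC rules above can be read as a classification of what happens when graceful-switchover is configured. This Python sketch uses free-form PIC labels (hypothetical, not Junos identifiers) and returns one of three outcomes.

```python
from enum import Enum


class PicGresResult(Enum):
    SUPPORTED = "switchover supported"
    SERVICES_RESET = "commit succeeds, but services reset during a switchover"
    COMMIT_FAILS = "commit with graceful-switchover fails"


# Hypothetical labels for the PIC types named in the text.
UNSUPPORTED_PICS = {"Monitoring Services", "Multilink Services"}
RESET_PICS = {"Adaptive Services", "Tunnel Services"}


def gres_pic_check(pic_type: str, monitoring_app: bool = False,
                   layer2_or_sdk_app: bool = False) -> PicGresResult:
    """Classify a PIC against the services-PIC constraints listed above.

    Any label starting with "Multiservices" is treated as a Multiservices PIC
    (an assumption of this sketch).
    """
    if pic_type in UNSUPPORTED_PICS:
        return PicGresResult.COMMIT_FAILS
    is_multiservices = pic_type.startswith("Multiservices")
    if pic_type == "Multiservices 400" and monitoring_app:
        return PicGresResult.COMMIT_FAILS
    if is_multiservices and layer2_or_sdk_app:
        # Layer 2 service packages and extension-provider/SDK apps are not reset.
        return PicGresResult.SUPPORTED
    if is_multiservices or pic_type in RESET_PICS:
        return PicGresResult.SERVICES_RESET
    return PicGresResult.SUPPORTED


print(gres_pic_check("Multilink Services").value)
print(gres_pic_check("Tunnel Services").value)
```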

Change History Table

Feature support is determined by the platform and release you are using. Use Feature Explorer to determine if a feature is supported on your platform.

  • 14.2: Starting with Junos OS Release 14.2, when you perform GRES on MX Series devices, you must run the clear synchronous-ethernet wait-to-restore operational mode command on the new primary Routing Engine to clear the wait-to-restore timer on it.

  • 13.2: Starting with Junos OS Release 13.2, when a graceful Routing Engine switchover occurs, the VRRP state does not change.

  • 12.2: Starting with Junos OS Release 12.2, if adjacencies between the restarting device and the neighboring peer 'helper' devices time out, graceful restart protocol extensions are unable to notify the peer 'helper' devices about the impending restart.

  • 12.2: Starting with Junos OS Release 12.2, if adjacencies between the restarting device and the neighboring peer 'helper' devices time out, graceful restart can stop and cause interruptions in traffic.