MX Series Routers Fabric Resiliency

MX Series routers provide intelligent mechanisms to reduce packet loss in hardware failure scenarios. They ensure network and service availability with a broad set of multilayered physical, logical, and protocol-level resiliency features.

The MX10008 provides redundancy and resiliency. All major hardware components, including the power system, the cooling system, and the control board, are fully redundant.

The MX10004 power system and the Routing Control Board (RCB) provide redundancy and resiliency.

The MX2020 and MX2010 chassis provide redundancy and resiliency. All major hardware components, including the power system, the cooling system, the control board, and the switch fabrics, are fully redundant.

Switch Fabric Boards (SFBs) are the data plane for the subsystems in the MX router chassis. SFBs create a highly scalable and resilient “all-active” centralized switch fabric that delivers up to 4 Tbps of full duplex switching capacity to each MPC slot in an MX2000 router.

The MX240, MX480, and MX960 chassis provide redundancy and resiliency. The hardware system is fully redundant, including power supplies, fan trays, Routing Engines, and Switch Control Boards.

The MX304 router contains redundant, pluggable Routing Engines and supports up to three line-card MICs (LMICs).

This topic contains the following sections that describe fabric resiliency options, failure detection methods used, and corrective actions:

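  • Fabric Connectivity Restoration

  • Line Cards with Degraded Fabric

  • Connectivity Loss Towards a Single Destination Only

  • Redundancy Fabric Mode on Active Control Boards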

Fabric Connectivity Restoration

Packet Forwarding Engine destinations can become unreachable for the following reasons:

  • The control boards go offline as a result of a CLI command or a pressed physical button.

  • The fabric control boards are turned offline because of high temperature.

  • Voltage or polled I/O errors occur in the fabric.

  • All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the fabrics are online.

  • Complete fabric loss occurs because of destination timeouts, even when the fabrics are online.

When the system detects any unreachable Packet Forwarding Engine destinations, it attempts fabric connectivity restoration. If restoration fails, the system turns off the interfaces to trigger local protection action or traffic rerouting on the adjacent routers.

The recovery process consists of the following phases:

  1. Fabric plane restart phase: Restoration is attempted by restarting the fabric planes one by one. This phase does not start if the fabric plane is functioning properly and an error is reported by one line card only. An error message is generated to specify that a connectivity loss is the reason for the fabric plane being turned offline. This phase is performed for fabric plane errors only.

  2. Fabric plane and line card restart phase: The system waits for the first phase to be completed before examining the system state again. If connectivity is not restored after the first phase, or if the problem occurs again within 10 minutes, connectivity restoration is attempted by restarting both the fabric planes and the line cards. If you configure the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level to disable restart of the line cards when a recovery is attempted, an alarm is triggered to indicate that connectivity loss has occurred. In this second phase, three steps are taken:

    1. All the line cards that have destination errors on a Packet Forwarding Engine are turned offline.

    2. The fabric planes are turned offline and brought back online, one by one, starting with the spare plane.

    3. The line cards that were turned offline are brought back online.

  3. Line card offline phase: The system waits for the second phase to be completed before examining the system state again. This phase is performed if the problem is not resolved by restarting the line cards or if the problem recurs within 10 minutes after the line cards are restarted. Because the previous recovery attempts have failed, connectivity loss is limited by turning the line cards offline and by turning off interfaces.

The three phases are controlled by timers. During these phases, if an event (such as taking line cards or fabric planes offline or bringing them online) times out, the phase skips that event and proceeds to the next event. The timer control has a timeout value of 10 minutes. If the first fabric error occurs in a system with two or more line cards, the fabric planes are restarted. If another fabric error occurs within the next 10 minutes, the fabric planes and line cards are restarted. However, if the second fabric error occurs after the 10-minute timeout period has elapsed, the first phase is performed again, which restarts only the fabric planes.
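
The escalation between phases described above can be sketched as a small state tracker. This is an illustrative Python model only, not Junos code: the names RecoveryTracker, PHASES, and ESCALATION_WINDOW_S are hypothetical, and the sketch assumes, as a simplification, that the 10-minute window is measured from the previous fabric error.

```python
from dataclasses import dataclass
from typing import Optional

# 10-minute window taken from the text; measuring it from the
# previous error is a simplifying assumption of this sketch.
ESCALATION_WINDOW_S = 10 * 60

# The three recovery phases, in escalation order.
PHASES = (
    "restart fabric planes",
    "restart fabric planes and line cards",
    "take line cards offline",
)


@dataclass
class RecoveryTracker:
    """Hypothetical model of which recovery phase runs next."""

    last_error_at: Optional[float] = None
    phase_index: int = -1

    def on_fabric_error(self, now: float) -> str:
        """Return the phase to run for a fabric error seen at time `now` (seconds)."""
        recurring = (
            self.last_error_at is not None
            and now - self.last_error_at <= ESCALATION_WINDOW_S
        )
        if recurring:
            # Error recurred inside the window: escalate, capped at the last phase.
            self.phase_index = min(self.phase_index + 1, len(PHASES) - 1)
        else:
            # First error, or an error outside the window: start from phase 1.
            self.phase_index = 0
        self.last_error_at = now
        return PHASES[self.phase_index]


tracker = RecoveryTracker()
print(tracker.on_fabric_error(0))     # restart fabric planes
print(tracker.on_fabric_error(300))   # restart fabric planes and line cards
print(tracker.on_fabric_error(2000))  # restart fabric planes
```

The third call falls outside the 10-minute window, so the model returns to the first phase rather than escalating.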

In cases where all the destination timeouts are traced to a single line card, for example, one source line card or one destination line card, only that line card is restarted (turned offline and then brought back online). The fabric planes are not restarted. If another fabric fault occurs within 10 minutes, the line card is turned offline.

By default, the system limits connectivity loss time by detecting severely degraded fabric. No user interaction is necessary.

Line Cards with Degraded Fabric

You can configure a line card with degraded fabric to be moved to the offline state. On an MX10008, MX10004, MX2020, MX2010, MX960, MX480, MX304, or MX240 router, this applies to line cards whose fabric is degraded by link errors or bad fabric planes. This configuration is particularly useful in partial connectivity loss scenarios where bringing the line card offline results in faster rerouting. To configure this option on a line card, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level.

Connectivity Loss Towards a Single Destination Only

In certain deployments, a line card indicates a complete connectivity loss towards a single destination only, but it functions properly for other destinations. Such cases are identified and the affected line card is recovered. Consider a sample scenario in which the active planes are 0, 1, 2, and 3 and the spare planes are 4, 5, 6, and 7 in the connection between line card 0 and line card 1. If line card 0 has single link failures for planes 0 and 1 and line card 1 has single link failures for planes 2 and 3, no active plane has working links on both line cards, so a complete connectivity loss occurs between the two line cards. Both line card 0 and line card 1 undergo a phased mode of recovery and fabric healing takes place.
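
The sample scenario can be checked with a few lines of set arithmetic. This is a minimal Python sketch, assuming that an active plane carries traffic between two line cards only when neither card has a link failure on that plane. The function name usable_planes is hypothetical, and the sketch models the active planes only, as the scenario does.

```python
def usable_planes(active, failed_on_a, failed_on_b):
    """Return the active planes with no link failure on either line card."""
    return set(active) - set(failed_on_a) - set(failed_on_b)


active_planes = {0, 1, 2, 3}
lc0_failed = {0, 1}  # single link failures on line card 0
lc1_failed = {2, 3}  # single link failures on line card 1

remaining = usable_planes(active_planes, lc0_failed, lc1_failed)
connectivity_lost = not remaining

print(sorted(remaining))   # []
print(connectivity_lost)   # True
```

Each line card still has two healthy active planes of its own, which is why it continues to reach other destinations while the pair has no plane in common.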

Redundancy Fabric Mode on Active Control Boards

You can configure the active control board to be in redundancy mode or in increased fabric bandwidth mode. To configure redundancy mode for the active control board, use the redundancy-mode redundant statement at the [edit chassis fabric] hierarchy level.
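
The statements named in this topic can be written as set-style configuration commands by joining the hierarchy path and the statement. The following Python sketch shows that mapping. The helper to_set_command is hypothetical and is not part of any Juniper tool, and slot number 2 is an illustrative value only. Verify statement availability on your platform and release before committing any configuration.

```python
def to_set_command(hierarchy: str, statement: str) -> str:
    """Join an '[edit ...]' hierarchy level and a statement into a set command."""
    path = hierarchy.strip().strip("[]").split()
    if not path or path[0] != "edit":
        raise ValueError(f"expected an '[edit ...]' hierarchy, got {hierarchy!r}")
    return " ".join(["set", *path[1:], statement])


commands = [
    # Redundancy mode for the active control board.
    to_set_command("[edit chassis fabric]", "redundancy-mode redundant"),
    # Disable line card restart during recovery.
    to_set_command("[edit chassis fabric degraded]", "action-fpc-restart-disable"),
    # Take a line card offline on degraded fabric (slot 2 is hypothetical).
    to_set_command("[edit chassis fpc 2]", "offline-on-fabric-bandwidth-reduction"),
]

for command in commands:
    print(command)
# set chassis fabric redundancy-mode redundant
# set chassis fabric degraded action-fpc-restart-disable
# set chassis fpc 2 offline-on-fabric-bandwidth-reduction
```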