Fabric Resiliency

Fabric Resiliency and Degradation

Juniper routers and switches have built-in resiliency to handle failures and error conditions encountered during normal operation. Junos OS takes immediate action to remedy the failure conditions and minimize traffic loss. No manual intervention is needed. Fabric degradation can be one of the causes of such error conditions. The following sections explain how the PFEs recover from these failures.

Packet Forwarding Engine Errors and Recovery on PTX Series Routers
Fabric Resiliency and Automatic Recovery of Degraded Fabric
Packet Forwarding Engine Errors and Recovery on T640, T1600 or TX Matrix Routers

Packet Forwarding Engine Errors and Recovery on PTX Series Routers

Packet Forwarding Engine destinations can become unreachable on PTX Series routers for the following reasons:

The fabric Switch Interface Boards (SIBs) are offline as a result of a CLI command.
The fabric SIBs are turned offline by the control board because of high temperature conditions.
Voltage or polled I/O errors in the SIBs are detected by the control board.
Unexpected link-training errors occur on all connected planes.
Two Packet Forwarding Engines can reach the fabric but not each other.
Link errors occur where two Packet Forwarding Engines have connectivity with the fabric but not through a common plane.

Starting with Junos OS Release 13.3, you can configure Packet Forwarding Engine (PFE)-related error levels on PTX Series routers, along with the actions to perform when a specified threshold is reached.

If error levels are not defined, a PTX Series router begins the following phases in the recovery process:

SIB restart phase: The router attempts to resolve the issue by restarting the SIBs one by one. This phase does not start if the SIBs are functioning properly and a single line card is facing an issue.
SIB and line card restart phase: The router restarts both the SIBs and the line card. Line cards that cannot initiate high-speed links to the fabric after the reboot do not cause loss of live traffic, because no interfaces are created for those line cards, which protects the system from further issues.
Line card offline phase: Because previous attempts at recovery failed, the line cards and interfaces are turned off so that the system avoids further issues and error conditions.

Fabric Resiliency and Automatic Recovery of Degraded Fabric

Starting in Junos OS Evolved Release 23.4R1, the fabric automatic recovery feature is available to limit data loss. Recovery actions include FRU restart and link restart, among others.

The following three-phase fabric recovery actions are attempted at FRU level:

1. FRU level recovery using SIB restart.

2. FRU level recovery using FPC restart or PFE restart.

3. Action for unrecoverable PFEs: IFD disable or PFE offline.

Note: For platforms that do not have PFE-restart support, FPC restart is provided as the default action.

Fabric recovery action for SIB fault conditions: For reachability faults due to an absent SIB (user driven offline or SIB not present during system power up), Fabric resiliency does not attempt recovery. In systems that do not support fabric recovery, chassis alarms are generated for reachability faults.

PFE Level Recovery Action on PTX Series Routers (PTX10004, PTX10008, and PTX10016 Routers)

For platforms that can support PFE restart, PFE restart will be added as the default phase 2 recovery action.

Note: In ASICs with multiple PFEs, the restart affects PPFEs (Per-plane PFEs), similar to PFE offline action.

Recovery decision for phase 2 action is made for either of the following scenarios:

PFEs with reachability faults all reside in a single FPC.
PFEs with reachability faults reside in one or more FPCs and have no common point of failure.

Phase 2 recovery is attempted on PPFEs that have not recovered from reachability faults after phase 1 recovery.

If the number of PFEs with self-reachability faults in an FPC equals or exceeds 50% of the PFEs in that FPC, the FPC is restarted.

Use the following CLI option to manually configure the default PFE restart action:

The following table shows the actions on phase 2 recovery, based on the configuration and number of PFEs in fault in an FPC.

Recovery decision	Number of implicated PFEs in FPC	PFE restart supported	PFE restart disable	FPC restart disable	Action
Phase 2 action	<= 50%	Yes	No	x	PFE restart
Phase 2 action	<= 50%	Yes	Yes	No	FPC restart
Phase 2 action	<= 50%	Yes	Yes	Yes	PFE restart
Phase 2 action	>50%	Yes	x	No	FPC restart
Phase 2 action	>50%	Yes	Yes	Yes	PFE restart
Phase 2 action	>50%	Yes	No	Yes	PFE restart
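
The table rows can be read as a small decision function. The following Python sketch is illustrative only: the function name and parameters are hypothetical, and the branch for platforms without PFE-restart support is an assumption based on the earlier note that FPC restart is the default on such platforms.

```python
def phase2_action(implicated_pct, pfe_restart_supported=True,
                  pfe_restart_disabled=False, fpc_restart_disabled=False):
    """Return the phase 2 action for one FPC, mirroring the table rows.

    implicated_pct is the percentage of PFEs in the FPC that are in fault.
    """
    if not pfe_restart_supported:
        # Assumption: taken from the note on platforms without PFE restart.
        return "FPC restart"
    if implicated_pct > 50:
        # More than half implicated: FPC restart unless it is disabled.
        return "PFE restart" if fpc_restart_disabled else "FPC restart"
    # At most half implicated: PFE restart unless it is disabled.
    if not pfe_restart_disabled:
        return "PFE restart"
    # PFE restart disabled: use FPC restart if allowed; the table lists
    # PFE restart when both actions are disabled.
    return "PFE restart" if fpc_restart_disabled else "FPC restart"
```

For example, `phase2_action(40, pfe_restart_disabled=True)` returns `"FPC restart"`, which corresponds to the second row of the table.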

Packet Forwarding Engine Errors and Recovery on T640, T1600 or TX Matrix Routers

Packet Forwarding Engine destinations can become unreachable on T640, T1600 or TX Matrix routers for the following reasons:

The fabric Switch Interface Boards (SIBs) are offline as a result of a CLI command or a pressed physical button.
The fabric SIBs are turned offline by the Switch Processor Mezzanine Board (SPMB) because of high temperature conditions.
Voltage or polled I/O errors in the SIBs are detected by the SPMB.
All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the SIBs are online.
Complete fabric loss is caused by destination timeouts, even when the SIBs are online.

The recovery process consists of the following phases:

Fabric plane restart phase: The router restarts the fabric planes one by one. This phase does not start if the fabric plane is functioning properly and a single line card has issues.
Fabric plane and line card restart phase: The router restarts both the SIBs and the line cards. Line cards that cannot initiate high-speed links to the fabric after the reboot do not cause loss of live traffic, because no interfaces are created for those line cards, which protects the system from further issues.
Line card offline phase: Because previous attempts at recovery failed, the line cards and interfaces are turned off so that the system avoids issues and error conditions that could lead to serious consequences.

Note:

Starting in Junos OS Release 14.2R6, if a SIB becomes offline because of extreme conditions such as high voltage or high temperature, then as part of the recovery process, the router does not restart the fabric plane for that SIB.

The phased recovery mechanism mentioned above is exhaustive unless there are other errors which could be correlated to these issues.

Starting in Junos OS Release 14.2R6, you can manage fabric degradation in single-chassis systems better by incorporating fabric self-ping and Packet Forwarding Engine liveness mechanisms. Fabric self-ping is a mechanism to detect issues in the fabric data path. Using the fabric self-ping mechanism, every Packet Forwarding Engine ascertains that a packet destined to itself is reaching it when the packet is sent over the fabric path. Packet Forwarding Engine liveness is a mechanism to detect whether a Packet Forwarding Engine is reachable on the fabric plane. To verify that it is reachable, the Packet Forwarding Engine sends a self-destined packet over the fabric plane periodically. If any error is detected by these two mechanisms, the fabric manager raises a fabric degraded alarm and initiates recovery by restarting the line card.

MX Series Routers Fabric Resiliency

MX Series routers provide intelligent mechanisms to reduce packet loss in hardware failure scenarios. They ensure network and service availability with a broad set of multilayered physical, logical, and protocol-level resiliency mechanisms.

The MX10008 provides redundancy and resiliency. All major hardware components, including the power system, the cooling system, and the control board, are fully redundant.

The MX10004 power system and the Routing Control Board (RCB) provide redundancy and resiliency.

The MX2020 and MX2010 chassis provide redundancy and resiliency. All major hardware components including the power system, the cooling system, the control board and the switch fabrics are fully redundant.

Switch Fabric Boards (SFBs) are the data plane for the subsystems in the MX router chassis. SFBs create a highly scalable and resilient “all-active” centralized switch fabric that delivers up to 4 Tbps of full duplex switching capacity to each MPC slot in an MX2000 router.

The MX240, MX480, and MX960 chassis provide redundancy and resiliency. The hardware system is fully redundant, including power supplies, fan trays, Routing Engines, and Switch Control Boards.

The MX304 router contains redundant, pluggable Routing Engines and supports up to three line-card MICs (LMICs).

This topic contains the following sections that describe fabric resiliency options, failure detection methods used, and corrective actions:

Fabric Connectivity Restoration
Line Cards with Degraded Fabric
Connectivity Loss Towards a Single Destination Only
Redundancy Fabric Mode on Active Control Boards

Fabric Connectivity Restoration

Packet Forwarding Engine destinations can become unreachable for the following reasons:

The control boards go offline as a result of a CLI command or a pressed physical button.
The fabric control boards are turned offline because of high temperature.
Voltage or polled I/O errors occur in the fabric.
All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the fabrics are online.
Complete fabric loss is caused by destination timeouts, even when the fabrics are online.

When the system detects any unreachable Packet Forwarding Engine destinations, fabric connectivity restoration is attempted. If restoration fails, the system turns off the interfaces to trigger local protection action or traffic re-route on the adjacent routers.

The recovery process consists of the following phases:

Fabric plane restart phase: Restoration is attempted by restarting the fabric planes one by one. This phase does not start if the fabric plane is functioning properly and an error is reported by one line card only. An error message is generated to specify that a connectivity loss is the reason for the fabric plane being turned offline. This phase is performed for fabric plane errors only.
Fabric plane and line card restart phase: The system waits for the first phase to be completed before examining the system state again. If the connectivity is not restored after the first phase is performed or if the problem occurs again within a duration of 10 minutes, connectivity restoration is attempted by restarting both the fabric planes and the line cards. If you configure the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level to disable restart of the line cards when a recovery is attempted, an alarm is triggered to indicate that connectivity loss has occurred. In this second phase, three steps are taken:
1. All the line cards that have destination errors on a PFE are turned offline.
2. The fabric planes are turned offline and brought back online, one by one, starting with the spare plane.
3. The line cards that were turned offline are brought back online.
Line card offline phase: The system waits for the second phase to be completed before examining the system state again. Connectivity loss is limited by turning the line cards offline and by turning off interfaces because previous attempts at recovery have failed. If the problem is not resolved by restarting the line cards or if the problem recurs within 10 minutes after restarting the line cards, this phase is performed.

The three phases are controlled by timers. During these phases, if an event (such as offlining/onlining line cards or fabric planes) times out, then the phase skips that event and proceeds to the next event. The timer control has a timeout value of 10 minutes. If the first fabric error occurs in a system with two or more line cards, the fabric planes are restarted. If another fabric error occurs within the next 10 minutes, the fabric planes and line cards are restarted. However, if the second fabric error occurs outside of the timeout period of 10 minutes, then the first phase is performed, which is the restart of only the fabric planes.
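
The timer-driven escalation described above can be sketched as a small state machine. This Python sketch is illustrative only: the class and method names are hypothetical, and it assumes the 10-minute window is measured from the previous fabric error, which the text does not state explicitly.

```python
WINDOW_SECONDS = 10 * 60  # 10-minute timer described in the text

PHASES = [
    "restart fabric planes",
    "restart fabric planes and line cards",
    "offline line cards",
]


class RecoveryEscalation:
    """Tracks which recovery phase applies to each new fabric error."""

    def __init__(self):
        self.phase = -1            # no phase has run yet
        self.last_error_at = None  # time of the previous error, in seconds

    def on_fabric_error(self, now):
        """Return the phase to run for an error seen at time `now` (seconds)."""
        recurring = (self.last_error_at is not None
                     and now - self.last_error_at <= WINDOW_SECONDS)
        if recurring:
            # Error recurred inside the window: escalate, capped at phase 3.
            self.phase = min(self.phase + 1, len(PHASES) - 1)
        else:
            # First error, or the window expired: start again at phase 1.
            self.phase = 0
        self.last_error_at = now
        return PHASES[self.phase]
```

With this model, errors at 0 s and 300 s run the first and second phases, while a later error at 2000 s falls outside the window and runs the first phase again.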

In cases where all the destination timeouts are traced to a certain line card, for example, one source line card or one destination line card, only that line card is turned offline and online. The fabric planes are not turned offline and online. If another fabric fault occurs within the period of 10 minutes, the line card is turned offline.

By default, the system limits connectivity loss time by detecting severely degraded fabric. No user interaction is necessary.

Line Cards with Degraded Fabric

You can configure a line card with degraded fabric to be moved to the offline state. On an MX10008, MX10004, MX2020, MX2010, MX960, MX480, MX304, or MX240 router, you can configure link errors or bad fabric planes. This configuration is particularly useful in partial connectivity loss scenarios where bringing the line card offline results in faster re-routing. To configure this option on a line card, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level. For details, see Fabric Plane Management on MX304 Routers, Fabric Plane Management on MX10K-LC9600 and SFB2 (Model Number: JNP10008-SF2), Fabric Plane Management on MX10004 Devices, Fabric Plane Management on JNP10K-LC2101 and JNP10K-LC480, Fabric-Plane-Management-on-MX10004 and MX10008-Devices and Fabric Plane Management on AS MLC Modular Carrier Card.

Connectivity Loss Towards a Single Destination Only

In certain deployments, a line card indicates a complete connectivity loss towards a single destination only, but it functions properly for other destinations. Such cases are identified and the affected line card is recovered. Consider a sample scenario in which the active planes are 0,1,2,3 and the spare planes are 4,5,6,7 in the connection between line card 0 and line card 1. If line card 0 has single link failures for planes 0 and 1 and if line card 1 has single link failures for planes 2 and 3, a complete connectivity loss occurs between the two line cards. Both line card 0 and line card 1 undergo a phased mode of recovery and fabric healing takes place.
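
The sample scenario can be checked with simple set arithmetic. The following Python sketch is illustrative only; the function name and data layout are hypothetical and do not reflect how Junos OS represents planes internally.

```python
def usable_common_planes(active_planes, failed_by_card):
    """Return the active planes on which every listed line card still has a working link."""
    usable = set(active_planes)
    for failed in failed_by_card.values():
        usable -= set(failed)
    return usable


active = {0, 1, 2, 3}
failed = {"line card 0": {0, 1}, "line card 1": {2, 3}}
print(usable_common_planes(active, failed))  # prints set()
```

The empty result shows why the two line cards lose connectivity to each other even though each card still has two working links to the fabric.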

Redundancy Fabric Mode on Active Control Boards

You can configure the active control board to be in redundancy mode or in increased fabric bandwidth mode. To configure redundancy mode for the active control board, use the redundancy-mode redundant statement at the [edit chassis fabric] hierarchy level.

Detection and Recovery of Fabric-Related Failures Caused by Loss of Connectivity on MX Series Routers

Connectivity loss in a router occurs when the router is unable to transmit data packets to other neighboring routers, although the interfaces on that router continue to be in the active state. As a result, the other neighboring routers continue to forward traffic to the impacted router, which drops the arriving packets without sending a notification to the other routers.

When a Packet Forwarding Engine in a router is unable to send traffic to other Packet Forwarding Engines over the data plane within the same router, the router is unable to transmit any packets to a neighboring router, although the interfaces are advertised as active on the control plane. Fabric failure can be one of the reasons for the loss of connectivity.

The following fabric failure scenarios can occur:

Removal of the control board
High-speed link 2 (HSL2) training failures
Single link failure on a line card
Multiple link failures on the same line card or the same fabric plane
Multiple link failures randomly on a line card or a fabric plane
Intermittent cyclic redundancy check (CRC) errors
A complete loss of connectivity for only one destination and not to other destinations

When a line card stops forwarding traffic to other line cards within the device, the control protocol on the Routing Engine is unable to detect this condition. Traffic is not diverted to the functional, active line cards. Instead, packets continue to be sent to the affected line card and are dropped at that point. The following are possible causes for a line card being unable to forward traffic:

All the planes in the system are in the Offline or Fault state.
All the Packet Forwarding Engines on the line card might have disabled the fabric streams due to destination errors.

If all the Switch Control Boards (SCBs) lose connectivity to the line cards, then all the interfaces are brought down. If a Packet Forwarding Engine of a line card loses complete connectivity to or from the fabric, then that line card is brought down.

System hardware failures can be of the following types:

A single occurrence or a rare failure for a brief period (such as environmental spikes). This failure is healed without manual intervention by restarting the fabric plane and, if necessary, restarting both the line cards and the fabric plane.
Repeated failures that occur frequently.
A permanent failure.

Recovery is not attempted for cases of reduced throughput, such as multiple Packet Forwarding Engine destination timeouts on multiple planes. Restoration of connectivity is attempted only when all the planes are in the Offline or Fault state or when the destinations are unreachable on all active planes.
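
The condition for attempting restoration can be expressed as a predicate. This Python sketch is illustrative only: the function name, arguments, and state strings are hypothetical stand-ins for whatever the fabric manager tracks internally.

```python
def should_attempt_restoration(plane_states, active_planes, unreachable_planes):
    """Decide whether connectivity restoration applies.

    plane_states: mapping of plane number to a state string.
    active_planes: planes currently carrying traffic.
    unreachable_planes: planes on which the destinations are unreachable.
    """
    # Case 1: every plane in the system is in the Offline or Fault state.
    all_down = bool(plane_states) and all(
        state in ("Offline", "Fault") for state in plane_states.values()
    )
    # Case 2: destinations are unreachable on all active planes.
    unreachable_everywhere = bool(active_planes) and (
        set(active_planes) <= set(unreachable_planes)
    )
    return all_down or unreachable_everywhere
```

Timeouts on only some of the active planes reduce throughput but return `False` here, which matches the statement that reduced-throughput cases do not trigger recovery.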

If connectivity loss occurs because of a certain line card, which is either a common source or common destination of the destination timeout, and if you have configured the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level, no recovery action is taken. The show chassis fabric reachability command output can be used to verify the status of the fabric and the line card. An alarm is triggered to indicate that the particular line card is causing the connectivity loss.

Fabric-Failure Detection Methods on MX Series Routers

The chassis daemon (chassisd) process detects the removal of a control board. The removal of the control board causes all the active planes that reside on that board to be disabled, and a switchover is performed. If the active Routing Engine is also unplugged along with the control board, the detection of the control board removal is delayed until the switchover of the Routing Engine occurs and the reconnection in the primary, backup Routing Engine pair occurs. If the control board is turned offline gracefully, either by issuing the request chassis cb slot slot-number offline command or by pressing the physical button, a fabric failure does not occur, even though the control board is moved to the offline state.

If you remove the control board on the primary Routing Engine, resulting in removal of active fabric planes, the line card takes the local action of disabling the removed planes. If spare planes are available, the line card initiates switchover to spare planes. If an active control board on a backup Routing Engine is removed, the primary Routing Engine disables the removed planes and performs the switchover to spare planes, if available. The software attempts to minimize the duration of connectivity loss by disabling all removed planes. The spare planes are transitioned to the online state one by one.

Fabric self-ping is a mechanism to detect any issues in the fabric data path. Each Packet Forwarding Engine forwards fabric data cells that are destined to itself over all active fabric planes. To transmit the data cell, the Packet Forwarding Engine fabric sends the request cells over an active plane and waits for a grant packet. The destination Packet Forwarding Engine sends a grant packet over the same plane on which the request cell is received. When the grant cell is received, the source Packet Forwarding Engine sends the data cell.

The Packet Forwarding Engine fabric contains the capability to detect grant delays. If grants are not received within a certain period of time, a destination timeout is declared. Destination timeout on a certain plane by a Packet Forwarding Engine on two or more line cards is considered as an indication for plane failures. Even if one Packet Forwarding Engine on a line card flashes an error, the line card is considered to be in error. Destination timeouts are noticed when the Packet Forwarding Engine sends traffic actively because requests are sent only for valid data cells. The software takes an appropriate action based on the destination timeout. For self-ping, a data cell is destined to the source Packet Forwarding Engine only.

Fabric ping failure messages are sent to the fabric manager on the Routing Engine, which collates all of the errors reported by all the line cards and takes a corrective action. For example, a ping failure for all links of the same line card might indicate a problem on the line card. Ping failure for multiple line cards for the same fabric plane might indicate a problem with the fabric.

If the Routing Engine determines that a fabric plane is down, based on the information on errors it receives from the line cards or the Packet Forwarding Engines, over a period of 5 seconds, it indicates a fabric failure. The duration of 5 seconds is the period for which the Routing Engine collates the errors from all of the line cards.
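
The collation step can be illustrated with a short Python sketch that groups the failures reported during one 5-second window. It is illustrative only: the function name and data layout are hypothetical, and the threshold of two line cards per plane follows the earlier statement that timeouts on two or more line cards indicate a plane failure.

```python
from collections import defaultdict


def classify_ping_failures(failures, planes_per_card):
    """Group (line_card, plane) failures from one collation window.

    Returns (suspect_cards, suspect_planes):
    - a card is suspect if it reported failures on all of its planes;
    - a plane is suspect if two or more cards reported failures on it.
    """
    by_card = defaultdict(set)
    by_plane = defaultdict(set)
    for card, plane in failures:
        by_card[card].add(plane)
        by_plane[plane].add(card)
    suspect_cards = sorted(
        card for card, planes in by_card.items()
        if planes >= set(planes_per_card[card])
    )
    suspect_planes = sorted(
        plane for plane, cards in by_plane.items() if len(cards) >= 2
    )
    return suspect_cards, suspect_planes
```

For instance, if line card 0 fails on planes 0 and 1 and line card 1 fails on plane 1 only, the function flags line card 0 and plane 1.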

Fabric self-ping packets are sent periodically to check the sanity of the fabric links. Self-pings are sent at intervals of 500 ms. The destination timeout is also checked at intervals of 500 ms. If two timeouts occur successively, a self-ping failure is detected. When a destination timeout is received, the Packet Forwarding Engine fabric stops sending packets to the fabric. To examine the link condition again, the software resets the credits to ensure that new requests are sent again. When a self-ping failure occurs, the line card removes the affected plane from sending data to all destinations. This method ensures that self-ping is not attempted again on the defective plane.
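
The detection rule (two successive timeouts at 500 ms intervals) can be sketched as a per-plane counter. This Python sketch is illustrative only: the class and method names are hypothetical, and scheduling of the 500 ms checks is left to the caller.

```python
SELF_PING_INTERVAL_MS = 500  # check interval from the text; not used for scheduling here
FAILURE_THRESHOLD = 2        # two successive timeouts mark a self-ping failure


class SelfPingMonitor:
    """Counts successive destination timeouts per active plane."""

    def __init__(self, active_planes):
        self.active_planes = set(active_planes)
        self.consecutive_timeouts = {plane: 0 for plane in self.active_planes}

    def record(self, plane, timed_out):
        """Record one periodic check; return True if the plane was just removed."""
        if plane not in self.active_planes:
            return False  # removed planes are no longer pinged
        if not timed_out:
            self.consecutive_timeouts[plane] = 0  # a success breaks the run
            return False
        self.consecutive_timeouts[plane] += 1
        if self.consecutive_timeouts[plane] >= FAILURE_THRESHOLD:
            # Self-ping failure: stop sending data on this plane.
            self.active_planes.discard(plane)
            return True
        return False
```

A timeout followed by a success resets the count, so only back-to-back timeouts remove a plane.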

The following guidelines apply to the self-ping capability:

By default, self pings are not sent on spare fabric planes because spare planes do not carry traffic.
The size of self-ping packets is large enough to enable the cells to be loaded over all the active fabric planes (MX2020 supports 24 fabric planes and MX10008 supports 12 fabric planes).
A detection of received self-ping packets is not performed.
High priority queue is used to enable self-ping to be sent for oversubscription cases.

Detection and Corrective Actions of Line Cards on MX Series Routers

You can configure a line card to be moved to the offline state on MX Series routers (such as the MX10008, MX10004, MX2020, MX2010, MX2008, MX960, MX480, MX304, and MX240). Configuring this feature does not affect the system. You can configure this feature without restarting the line card or restarting the system.

The following scenarios can occur when you configure the feature to disable line cards:

If a line card has been brought offline because of fabric errors and this functionality to move the line card to offline state is disabled, the line card is transitioned to the online state automatically.
If a line card has been brought offline because of fabric errors and this functionality to move the line card to offline state is disabled or configured for some other line card, the line card that was turned offline is transitioned to the online state automatically.
All the line cards that were brought offline when you configured this setting are brought back online when you commit any configuration under the [edit chassis] hierarchy level. Similarly, a restart of the chassis daemon or a graceful Routing Engine switchover (GRES) operation also causes a line card that is disabled because of degraded fabric to be moved to the online state.

A line card can operate with fewer than the required number of active fabric planes. If a line card is operating with fewer than four planes, the fabric traffic operates at a reduced bandwidth.

The following conditions can result in reduced operating bandwidth in fabric:

The fabric control boards go offline as a result of an unintentional, abrupt power shutdown.
An application-specific integrated circuit (ASIC) error, which causes a plane of a control board to be automatically turned offline.
Manually bringing the fabric plane or the control board to the offline state.
Removal of the control board.
Self-ping failure on any plane.
HSL2 training failure for an active plane.
If a spare fabric plane has CRC errors and this spare plane is brought online, the link with the CRC error is disabled. This mechanism might cause a degradation in fabric in one direction and might cause a null route in the other direction.
When a self-ping or HSL2 training failure occurs, the fabric plane is disabled for a particular line card and it is online for other line cards. This condition can also cause a null route.

If you need to remove the control board or move a fabric plane to the offline state during a system maintenance, you must enable the functionality to turn the line cards with degraded bandwidth to the offline state (by using the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level).

The following corrective actions are performed when a null route or reduced operating bandwidth occurs in the fabric:

Regardless of whether a spare control board is available or not, self-ping state for each line card is monitored at intervals of 5 seconds at the Routing Engine. Fabric manager determines the presence of spare control boards.
The switch fabric is hosted on the Switch Fabric Boards (SFBs) on MX10008, MX10004, MX2020, MX2010 and MX2000 devices:
- The MX10008 router has eight slots for the line cards that can support a maximum of 768 100-Gigabit Ethernet ports (4x100), 192 40-Gigabit Ethernet ports, 192 100-Gigabit Ethernet ports, or 192 400-Gigabit Ethernet ports with line card slots 0-7 that combine Packet Forwarding Engine (PFE) and Ethernet interfaces enclosed in a single assembly. The MX10008 supports six Switch Fabric Boards (SFBs). There are two models of SFBs: the JNP10008-SF and the JNP10008-SF2. All SFBs installed in a running chassis must be of the same model type.
  
  For details, see Fabric-Plane-Management-on-MX10004 and MX10008-Devices
- The MX10004 features a compact 7-U modular chassis with line card slots 0-3 and silicon line cards (2.4 Tbps, 480 Gbps, and 9.6 Tbps throughput), with full hardware redundancy. Switch Fabric Boards (SFBs) create the switch fabric for the MX10004. Each SFB has a set of connectors that connect the line cards and the Routing and Control Board (RCB) to the switch fabric. Three SFBs provide reduced switching functionality to an MX10004 router. Six SFBs provide full throughput. Each MX10004 SFB has four connectors. Each connector matches up with a line card slot, eliminating the need for a backplane.
  
  For details on fabric plane management, see Fabric Plane Management on MX10004 Devices.
- The MX10003 router contains modular routing engines and PFEs. The single PFE performs both ingress and egress packet forwarding. The router provides two dedicated line card slots. The router supports one primary and two redundant Routing and Control Boards (RCBs).
- The MX2020 and MX2010 devices support 8 SFBs. The MX2020 has 20 dedicated line card slots. The MX2010 router has 10 dedicated line card slots. The host subsystem consists of two Control Boards with Routing Engines (CBREs) and eight Switch Fabric Boards (SFBs). Data packets are transferred across the backplane between the MPCs through the fabric ASICs on the SFBs.
  
  Switch Fabric Boards (SFBs) provide increased fabric bandwidth per slot. Up to eight SFBs, SFB2s, or SFB3s can be installed in an MX2020 or MX2010 router. All switch fabric boards in the chassis must be the same type. Mixed mode is not supported.
- MX960 routers with I-chip or I-chip and Trio-chip-based line cards that contain three control boards.
- MX240 or MX480 routers with I-chip or I-chip and Trio-chip-based line cards that contain two control boards.
- MX960, MX480, or MX240 routers that contain only Trio-based line cards are not considered to contain a spare control board.
If, during any such interval of 5 seconds, two line cards indicate a failure for the same plane, a switchover to the spare control board is performed. In this case, the control board that reported errors is turned offline and the spare control board is turned online.
If a spare control board is available, and if you configure the functionality to disable line cards, self-ping state for each line card is monitored at intervals of 5 seconds at the Routing Engine. The following conditions can occur:
- During any 5-second interval, if only one line card indicates a failure for a plane, the fabric manager waits for the next interval. During the subsequent interval, if no other line card indicates a failure for the same plane, switchover of the control board is performed.
- During any 5-second interval, if multiple line cards show failures for multiple control boards, the fabric manager waits for the next interval. During the subsequent interval, if the same condition remains, all the failing line cards are turned offline even if the spare control board is present.
- During any 5-second interval, if any line card shows a failure for multiple planes on multiple control boards, the fabric manager waits for the next interval. During the subsequent interval, if the same condition persists, the line card is turned offline even if the spare control board is present.
If spare planes are not available, the line card is turned offline when it displays a failure for a single plane or multiple planes. The line card is brought offline only if you previously configured the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level.

Understanding Fabric Fault Handling on T4000 Router

The T4000 router consists of a Switch Interface Board (SIB) with fabric bandwidth double the capacity of the T1600 router. The fabric fault management functionality is similar to that in T1600 routers. This topic describes the fabric fault handling functionality on T4000 routers.

The fabric fault management functionality involves monitoring all high-speed links connected to the fabric and the ones within the fabric core for link failures and link errors.

Action is taken based on the fault and its location. The actions include:

Reporting link errors in system log files and sending this information to the Routing Engine.
Reporting link failures at the Flexible Port Concentrator (FPC) or at the SIB and sending this information to the Routing Engine.
Marking a SIB in Check state.
Moving a SIB into Fault state.

The SIB in T4000 routers forms the core of the fabric with 4:1 redundancy—the redundant SIB becomes active when the active SIB becomes nonfunctional, is deactivated, or is removed. The following are the high-level indications of fabric faults that are monitored by Junos OS:

An SNMP trap is generated whenever a SIB is reported as Check or Fault.
show chassis alarms—Indicates that a SIB is in Check or Fault state.
show chassis sibs—Indicates that a SIB is in Check or Fault state or that a SIB is in Offline state when the SIB initializes (this occurs when the SIB does not power on fully).
show chassis fabric fpcs—Indicates whether any fabric links are in error on the FPCs’ side.
show chassis fabric sibs—Indicates whether any fabric links are in error on the SIBs’ side.
The /var/log/messages system log messages file at the Routing Engine has error messages with the prefix CHASSISD_FM_ERROR.
The SIBs display the FAIL LED.

Note:

The fabric planes in the chassis determine whether the chassis is a T640 router, a T1600 router, or a T4000 router. Power entry modules (PEMs), FPCs, or fan trays do not determine chassis personality. Alarms are raised if the old PEMs or fan trays are present in a T4000 chassis. You can identify a router based on its fabric planes:

If all planes present are F16-based SIBs, the chassis is a T640 chassis.
If all planes present are SF-based SIBs, the chassis is a T1600 chassis.
If all planes present are XF-based SIBs, the chassis is a T4000 chassis.

Note that mixing of fabric planes is not a supported configuration except during upgrade. You can change the personality of a chassis without a reboot by changing all the fabric planes and by issuing the set chassis fabric upgrade-mode CLI command to check the personality. If you do not issue the set chassis fabric upgrade-mode CLI command, the personality does not change until the next boot.
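The plane-to-personality rule above can be sketched as a small lookup. This is an illustrative model only, not Junos OS code: the function name `chassis_personality` and the short family labels `"F16"`, `"SF"`, and `"XF"` are assumptions made for the example. It returns no personality for an empty or mixed set of planes, mirroring the statement that mixed planes are not a supported configuration.

```python
from typing import Iterable, Optional

# Mapping taken from the list above: SIB family -> chassis personality.
PERSONALITY_BY_SIB = {"F16": "T640", "SF": "T1600", "XF": "T4000"}


def chassis_personality(sib_families: Iterable[str]) -> Optional[str]:
    """Return the personality implied by the installed planes.

    Returns None when no planes are present, when the planes are mixed,
    or when the family label is unknown (hypothetical helper).
    """
    families = set(sib_families)
    if len(families) != 1:
        return None
    return PERSONALITY_BY_SIB.get(next(iter(families)))
```

For example, a chassis populated only with XF-based SIBs maps to a T4000, while a chassis holding both F16-based and SF-based SIBs yields no result.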

The following faults can occur in T4000 routers:

Board-level faults: These faults occur during initialization or during runtime. Examples include power failure during board initialization, high-speed link transmit errors, and polled I/O errors during runtime.
Link-level faults: These faults occur during initialization or during runtime. Examples include link training failure at initialization time (failure of the data plane links between an FPC and a SIB to be trained when the FPC or SIB is initialized), errors detected on the channel between the SIB and a Packet Forwarding Engine, cyclic redundancy check (CRC) errors detected at runtime, and Packet Forwarding Engine destination errors.
Faults based on environmental conditions: These faults occur during runtime. Sudden removal of an FPC or a SIB might result in an operator error. When a SIB becomes too hot or when SIB voltages are beyond thresholds, the errors generated are classified as environmental errors.

You can implement one of the following options to handle the faults:

Log the error and raise an alarm.
Switch over to the spare plane, if available.
Continue with a reduced number of parts of a plane.
Continue with a reduced number of usable planes.
Use polling-based fault handling.
Monitor high-speed link errors and manually bring the link down to a suitable threshold.

The polled I/O errors and the link errors are monitored every 500 milliseconds, and the board exhaust temperature and board voltages are monitored every 10 seconds.

Understanding Fabric Fault Handling on PTX5000 Packet Transport Router

Starting with Junos OS Release 14.1, the PTX5000 Packet Transport Router supports nine Switch Interface Boards (SIBs). Each FPC2-PTX-P1A FPC supports a capacity of 1 Tbps per slot, resulting in a fabric bandwidth of 16 terabits per second (Tbps) full-duplex (8 Tbps of any-to-any, nonblocking, half-duplex switching).

The fabric fault management functionality involves monitoring all high-speed links connected to the fabric and the ones within the fabric core for link failures and link errors.

The faults that occur in a PTX5000 can be broadly categorized into:

Board faults: Faults that arise in a SIB or in a Flexible Port Concentrator (FPC) during initialization or during runtime, including issues that arise when a router component is accessing the SIB or FPC or issues that arise out of midplane failures.
Link faults: Faults that occur on high-level links in a router during initialization or during runtime.
Faults due to environmental conditions: Faults that occur because of overvoltage or over-temperature, or because an operator mishandles a SIB or an FPC.

The router takes action on the basis of the fault category and the fault location. The actions include:

Reporting link errors in system log files and sending this information to the Routing Engine.

Displaying the link errors when you run one of the operational commands listed in Table 1:

Table 1: List of Operational Mode Commands
Operational mode command	Description
`show chassis sibs`	Displays Switch Interface Boards (SIBs) status information.
`show chassis fabric fpcs <slot number>`	Displays the fabric state of the specified FPC slot. If no slot number is provided, it displays the status of all FPCs.
`show chassis fabric sibs <slot number>`	Displays the state of the electrical switch fabric link between the SIBs and the FPCs.
`show chassis fabric reachability <detail>`	Displays the current state of fabric destination reachability.
`show chassis fabric unreachable-destinations`	Displays the list of destinations that have transitioned from a reachable state to an unreachable state.
`show pfe statistics error`	Displays Packet Forwarding Engine error statistics.
`show chassis fabric topology <sib_slot>`	Displays the input-output link topology.
`show chassis fabric summary`	Displays the state of all fabric planes and the elapsed uptime.

Reporting link failures at the FPC level or at the SIB level and sending this information to the Routing Engine.
Reporting link error information in the show chassis alarms operational command.
Moving a SIB into fault state.

The following sections explain fabric fault handling functionality on the PTX5000:

SIB-Level Faults
FPC-Level Faults

SIB-Level Faults

The following sections give a brief overview on the types of faults that occur on a SIB and how to handle them:

Types of Faults That Occur on a SIB
Handling SIB-Level Faults

Types of Faults That Occur on a SIB

Board faults and link faults occur on a SIB during initialization and during runtime. Some faults occur because of environmental conditions such as overvoltage or over-temperature, or when an operator mishandles the SIB.

Note:

Run the operational mode commands listed in Table 1 to detect faults.

During SIB initialization and runtime, the following faults might occur:

Board faults, such as failure of SIBs to power up, ASICs reset failure, Switch Processor Mezzanine Board (SPMB) polled I/O access failure to ASICs, board component failures such as PIC failures, or router component access failures.
Link faults such as high-level link errors that occur during link training.
Faults that occur because of environmental conditions or because of mishandling of the SIB by the operator.

Handling SIB-Level Faults

The following list illustrates how the router handles a fault that occurs on a SIB during initialization, during runtime, because of environmental conditions, and because of mishandling of the SIB by the operator:

To handle a board fault on a SIB during initialization, the chassis daemon (chassisd) marks the SIB to be in fault state. After the SIB is marked as faulty, no operation occurs on this SIB.
To handle a board fault on a SIB during runtime, chassisd logs an error in the system log file, raises an alarm indication error type, and marks the SIB as faulty. After the SIB is marked as faulty, no operation occurs on this SIB.
To handle a link fault on a SIB during initialization, when a link error comes up during link training, chassisd informs the FPC corresponding to the link on which the error occurred to disable the links to the affected SIB. chassisd then sends an error message to all the other FPCs in the router to stop using the failed SIB link, and a link error alarm is generated. Note that when more than one FPC reports errors for a given SIB, the SIB is disabled for all FPCs and no traffic is sent by the Packet Forwarding Engine through the affected SIB.
To handle a link fault on a SIB during runtime, chassisd marks the SIB as faulty and specifies a reason for the error, and the SIB is disabled.
In case of an environmental fault—overvoltage or over-temperature—the SIB is immediately taken offline. Note that an error is logged periodically as the temperature or voltage rises, and the SIB is taken offline when it crosses a certain threshold voltage or temperature.
When a SIB is abruptly removed or dislodged, all the affected Packet Forwarding Engines stop using that plane to reach other Packet Forwarding Engines in the router.

FPC-Level Faults

The following sections give a brief overview of the types of faults that occur on an FPC and how to handle them:

Types of Faults That Occur on an FPC
Handling FPC-Level Faults

Types of Faults That Occur on an FPC

Board faults and link faults occur on an FPC during initialization and during runtime. Some faults also occur because of environmental conditions such as overvoltage, over-temperature, or when the operator mishandles the FPC.

Note:

Run the operational commands listed in Table 1 to detect faults.

During FPC initialization and runtime, the following faults might occur:

Board faults such as failure of FPCs to power up, failure of ASICs to come out of reset phase, PMB polled I/O access failure to ASICs, board component failures such as PIC failure, or router component access failures.
Link faults such as high-level link errors that occur during link training.
Faults that occur because of environmental conditions or because of mishandling of an FPC by the operator.

Handling FPC-Level Faults

The following list illustrates how the router handles a fault that occurs on an FPC during initialization, during runtime, because of environmental conditions, and because of mishandling of the FPC by the operator:

To handle a board fault on an FPC during initialization, chassisd marks the FPC to be in fault state. After the FPC is marked as faulty, no operation occurs on this FPC.
To handle a board fault on an FPC during runtime, chassisd logs an error in the system log file, raises an alarm indication error type, and marks the FPC as faulty. After the FPC is marked as faulty, no operation occurs on this FPC.
To handle onboard link errors on an FPC during initialization or during runtime, the FPC is taken down and all the affected Packet Forwarding Engines stop using that plane to reach other Packet Forwarding Engines in the router.

Note:
No planes are taken down during initialization because the link training process for the fabric is not yet complete.

Onboard link errors during runtime are resolved on the basis of current configuration; either the FPC is rebooted or the error is logged and the FPC continues with initialization.
In case of an environmental fault (overvoltage or over-temperature), the FPC is immediately taken offline. Note that an error is logged periodically as the temperature or voltage rises, and the FPC is taken offline when it crosses a certain threshold voltage or temperature.
When an FPC is abruptly removed or dislodged, all the other Packet Forwarding Engines stop sending traffic to the Packet Forwarding Engines in this FPC.

Understanding Fabric Fault Handling on Enhanced Switch Fabric Board (SFB2)

The MX2000 line of routers supports Switch Fabric Boards (SFBs) and enhanced SFBs (SFB2s), but not both at the same time. The SFB and SFB2 host three fabric planes each, so the chassis supports a total of 24 planes. Junos OS Release 15.1F6 and 16.1R1 support fabric fault handling for each plane in both SFB and SFB2. In earlier releases, fabric fault handling is supported for each SFB, not for each plane.

Table 2 lists the differences between fabric fault handling per plane and per SFB.

Table 2: SFB Versus SFB2 Fabric Fault Handling
SFB Level (SFB)	Plane Level (SFB and SFB2)
Cyclic redundancy check (CRC) errors on any link on the SFB are indicated on the SFB.	CRC errors on any link on the SFB or SFB2 are indicated on the plane.
On encountering destination errors, the line card isolates the SFB (all 3 planes).	On encountering destination errors, the line card isolates the corresponding plane. Other planes continue to operate.
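The difference in isolation scope shown in Table 2 can be sketched as follows. This is a hedged illustration, not the router's implementation: the function name `planes_to_isolate` is hypothetical, and the example assumes that planes are numbered 0 through 23 with each board hosting three consecutive plane numbers, which the text does not state.

```python
from typing import List

PLANES_PER_SFB = 3  # from the text: each SFB or SFB2 hosts three planes
TOTAL_PLANES = 24   # from the text: the chassis supports 24 planes


def planes_to_isolate(faulty_plane: int, per_plane: bool) -> List[int]:
    """Return the planes a line card would isolate on a destination error.

    Assumption: board n hosts planes 3n, 3n + 1, and 3n + 2.
    """
    if not 0 <= faulty_plane < TOTAL_PLANES:
        raise ValueError("plane number out of range")
    if per_plane:
        # Plane-level handling: only the affected plane is isolated.
        return [faulty_plane]
    # SFB-level handling: all three planes on the same board are isolated.
    first = (faulty_plane // PLANES_PER_SFB) * PLANES_PER_SFB
    return list(range(first, first + PLANES_PER_SFB))
```

With plane-level handling, an error on plane 7 isolates only plane 7. With SFB-level handling, the same error isolates planes 6, 7, and 8 under the numbering assumed here.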

Fabric fault handling per-plane provides the following benefits:

Increased granularity, which helps identify, isolate, and repair faults.
Alarms and log messages provide fault information per plane instead of per SFB, which makes debugging easier.
If an SFB has a single faulty plane, the other two planes can continue to operate. There is no need to take the entire SFB offline.
In case of transient errors, you can isolate a single plane during repair instead of bouncing the entire SFB.

To view fabric fault handling information for all 24 planes, use the extended option with the existing fabric commands.

Managing Bandwidth Degradation

Certain errors result in packets being dropped by a system without notification. Other connected systems continue to forward traffic to the affected system, impacting network performance. A severely degraded fabric plane can be one cause of such errors.

By default, Juniper Networks routers attempt to start healing from such situations when the system detects issues with Packet Forwarding Engines. If the healing fails, the system turns off the interfaces, thereby preventing further escalations.

On Junos OS, you can use the bandwidth-degradation configuration statement at the [edit chassis fpc slot-number fabric] hierarchy level to detect and respond to fabric plane degradation. You can specify which healing actions the router should take once such a condition is detected. You can also use the blackhole-action statement to determine how the line card responds to a 100 percent fabric degradation scenario. This statement is optional and overrides the default fabric hardening procedures.

Note:

The bandwidth-degradation statement and the offline-on-fabric-bandwidth-reduction statement are mutually exclusive. If both statements are configured, an error is issued during the commit check.

The bandwidth-degradation statement is configured with a percentage and an action. The percentage value can range from 1 to 99, and it represents the percentage of fabric degradation needed to trigger a response from the line card. The action attribute determines the type of response the line card performs once fabric degradation reaches the configured percentage.

The blackhole-action statement is configured only with an action attribute, which triggers when the percentage of fabric degradation reaches 100 percent.

The following actions can be applied to either configuration statement:

log-only: A message gets logged in the chassisd and message files when the fabric degradation threshold is reached. No other actions are taken.
restart: The line card with a degraded fabric plane is restarted once the threshold is reached.
offline: The line card with a degraded fabric plane is taken offline once the threshold is reached. The line card requires manual intervention to be brought back online. This is the default action if no action attribute is configured.
restart-then-offline: The line card with a degraded fabric plane is restarted once the threshold is reached, and if fabric plane degradation is detected again within 10 minutes, the line card is taken offline. The line card requires manual intervention to be brought back online.
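The four actions can be modeled as a small decision function. The sketch below is illustrative only and makes several assumptions: the function name, its parameters, and the returned strings are hypothetical, timestamps are plain seconds, and "within 10 minutes" is read as strictly less than 600 seconds after the previous restart.

```python
from typing import Optional

RESTART_WINDOW_SECONDS = 10 * 60  # "within 10 minutes", from the text


def degradation_response(
    degraded_percent: int,
    threshold_percent: int,
    now: float,
    action: str = "offline",  # the text names offline as the default action
    last_restart: Optional[float] = None,
) -> str:
    """Return 'none', 'log', 'restart', or 'offline' for one line card."""
    if degraded_percent < threshold_percent:
        return "none"  # threshold not reached, nothing to do
    if action == "log-only":
        return "log"
    if action in ("restart", "offline"):
        return action
    if action == "restart-then-offline":
        # Restart first; go offline if degradation recurs inside the window.
        recurred = (
            last_restart is not None
            and now - last_restart < RESTART_WINDOW_SECONDS
        )
        return "offline" if recurred else "restart"
    raise ValueError(f"unknown action: {action}")
```

For blackhole-action, the same function applies with `threshold_percent` fixed at 100, because that statement takes only an action.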

Note:

This feature is available in Junos OS Release 15.1R1.

Fabric Hardening and Recovery on PTX10001-36MR, PTX10004, PTX10008, and PTX10016 with PTX10K-LC1202-36MR Line Card

PTX10001-36MR, PTX10004, PTX10008, and PTX10016 routers support fabric hardening. Fabric hardening is a resiliency feature that detects fabric blackholing and attempts automatic recovery to restore the Packet Forwarding Engines from the blackhole condition.

We’ve enabled fabric hardening by default. When the system detects any unreachable Packet Forwarding Engine destination, this feature attempts automatic fabric connectivity restoration.

If restoration fails, the system turns off the interfaces to limit the blackholing and raises an alarm to indicate the unreachable Packet Forwarding Engine destinations. Instead of turning off the interfaces, you can configure the router to take the Packet Forwarding Engine offline by using the set chassis fabric event reachability-fault actions recovery-failure pfe-offline statement at the [edit chassis fabric event] hierarchy level.

Packet Forwarding Engine destinations can become unreachable for the following reasons:

Complete self-blackhole: Complete connectivity loss occurs on all fabric planes.
Complete peer-blackhole: Two Packet Forwarding Engines can reach the fabric but not each other.

You can configure a router to trigger fabric recovery when it detects degradation in fabric bandwidth by using the degraded statement at the [edit chassis fabric event reachability-fault] hierarchy level. The degraded statement is configured with a percentage value that can range from 1 to 99. The percentage value represents the error threshold for fabric bandwidth degradation, and the router starts the recovery once the threshold is reached.

When the degraded error threshold is configured, the router can also attempt fabric recovery for the following reasons:

Self degradation: Degraded fabric condition in a Packet Forwarding Engine destination.
Peer degradation: Degraded fabric condition between two Packet Forwarding Engines.

The fabric recovery process involves one or more of the following phases:

SIB restart phase: If Packet Forwarding Engine destinations across multiple line cards have fabric connectivity failures on planes, then the router attempts to resolve the issue by restarting the SIBs. If multiple SIBs require a restart, the router restarts the SIBs one by one.
FPC restart phase: The router attempts automatic recovery by restarting the FPCs for the following scenarios:
- All Packet Forwarding Engine destinations having complete or partial blackhole conditions are in a single FPC.
- Packet Forwarding Engine destinations with complete or partial blackhole conditions are spread across different FPCs, but none of the Packet Forwarding Engines share a common plane of failure.
- The SIB restart phase failed to recover the Packet Forwarding Engines.
You can disable restarting of FPCs to limit recovery actions from a degraded fabric condition. To disable restarting of FPCs, use the set chassis fabric event reachability-fault actions fpc-restart-disable statement at the [edit chassis fabric event] hierarchy level.
Packet Forwarding Engine offline phase: If the previous recovery phases fail or the recovery action is disabled in the configuration, the router turns off the interfaces by default to limit the blackholing. Instead of turning off the interfaces, you can configure the router to take the Packet Forwarding Engine offline by using the set chassis fabric event reachability-fault actions recovery-failure pfe-offline statement at the [edit chassis fabric event] hierarchy level.

If the router has only Packet Forwarding Engines with peer blackhole or peer degradation condition, then the router attempts recovery through link autoheal by restarting fabric links on the planes.
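The choice of the first recovery step described above can be sketched as follows. This is a simplified model under stated assumptions, not the router's algorithm: the function name and return strings are hypothetical, affected Packet Forwarding Engines are keyed by an (FPC, PFE) pair, and "share a common plane of failure" is interpreted as the same plane failing for Packet Forwarding Engines on at least two different FPCs. The sketch does not model the fallback from a failed SIB restart to an FPC restart, or the final offline phase.

```python
from typing import Dict, Set, Tuple


def first_recovery_phase(
    failed_planes: Dict[Tuple[int, int], Set[int]],
    peer_only: bool = False,
) -> str:
    """Pick the first recovery step for the affected PFEs.

    failed_planes maps (fpc, pfe) to the planes on which that PFE has lost
    fabric connectivity. peer_only marks the case where every affected PFE
    has only a peer blackhole or peer degradation condition.
    """
    if not failed_planes:
        return "none"
    if peer_only:
        return "link-autoheal"
    # Collect, for each failed plane, the FPCs that report it.
    fpcs_by_plane: Dict[int, Set[int]] = {}
    for (fpc, _pfe), planes in failed_planes.items():
        for plane in planes:
            fpcs_by_plane.setdefault(plane, set()).add(fpc)
    # A plane failing on several line cards points at the SIB.
    shared = any(len(fpcs) > 1 for fpcs in fpcs_by_plane.values())
    return "sib-restart" if shared else "fpc-restart"
```

When all affected Packet Forwarding Engines sit in one FPC, no plane is reported by more than one FPC, so the sketch selects the FPC restart, matching the first scenario in the list.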

Benefits

Attempts automatic recovery of the Packet Forwarding Engines from degraded fabric conditions to minimize traffic loss.
Raises alarms that identify the unreachable Packet Forwarding Engine destinations if the recovery fails.

Disabling Line Card Restart to Limit Recovery Actions from Degraded Fabric Conditions

You can disable line card restarts to limit recovery actions from a degraded fabric condition. On T640 and T1600 routers, only the fabric plane is restarted. On PTX Series routers, only the Switch Interface Boards (SIBs) are restarted. To disable the restarting of line cards, use the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level:

Whenever a line card restart is disabled, an alarm is raised when there are unreachable destinations present in the router, and you must restart the line cards manually.

To ensure that both the fabric planes (T640 and T1600 routers) or the SIBs (PTX Series routers) and the line cards are restarted during the recovery process, do not configure the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level.

Disabling an FPC with Degraded Fabric Bandwidth

You can bring an FPC with degraded fabric bandwidth offline to avoid causing a null route in the chassis for an extended time. To configure the option to disable an FPC with degraded bandwidth, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level:

The fabric manager checks the number of current active planes periodically. If the number of active planes is lower than the required number of active planes for a particular router, the system waits 10 seconds before it takes any corrective action. If the reduced bandwidth condition persists for an FPC and if this feature has been configured for the FPC, the system brings the FPC offline.
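The check described above can be sketched as a function over periodic samples. This is an illustrative model only: the function name, the sample format of (timestamp in seconds, active plane count), and the reading that the condition must hold continuously for the full 10 seconds are assumptions, since the text does not give the polling interval or the exact persistence rule.

```python
from typing import Optional, Sequence, Tuple

WAIT_SECONDS = 10  # from the text: the system waits 10 seconds before acting


def should_offline_fpc(
    samples: Sequence[Tuple[float, int]],
    required_planes: int,
    feature_configured: bool,
) -> bool:
    """Decide whether an FPC would be brought offline.

    samples is a chronological list of (timestamp, active_planes) readings.
    The FPC goes offline only if offline-on-fabric-bandwidth-reduction is
    configured and the shortfall has persisted for at least WAIT_SECONDS.
    """
    if not feature_configured or not samples:
        return False
    degraded_since: Optional[float] = None
    for timestamp, active_planes in samples:
        if active_planes < required_planes:
            if degraded_since is None:
                degraded_since = timestamp  # start of the current shortfall
        else:
            degraded_since = None  # bandwidth recovered, reset the timer
    if degraded_since is None:
        return False
    return samples[-1][0] - degraded_since >= WAIT_SECONDS
```

A shortfall that clears before the wait expires resets the timer, so a brief dip in active planes does not take the FPC offline in this model.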

Error Handling by Fabric OAM

Fabric Operation, Administration, and Maintenance (OAM) helps detect failures in fabric paths. Whenever a new fabric path is brought up for a PFE, fabric OAM validates the fabric connectivity before traffic is sent on that fabric plane. If a failure is detected, the software reports the fault and avoids using that fabric plane for that PFE. This feature works by sending self-destined OAM traffic at a very low packets-per-second (PPS) rate over each of the available fabric planes and detecting any loss of traffic at the end points (fabric self-ping check).

Note:

In Junos OS Evolved Release 20.4R1, the fabric OAM feature is enabled by default. You can disable the feature by using the CLI command set chassis fabric oam detection-disable.
In Junos OS Evolved Releases 20.4R2 and 21.1R1, the fabric OAM feature is disabled by default.
In Junos OS Evolved Release 22.1R1, the runtime fabric OAM feature is enabled by default. You can disable the feature by using the CLI command edit chassis fabric oam runtime-disable. The runtime fabric OAM feature is supported on PTX10004, PTX10008, and PTX10016 routers.

The Fabric OAM checks are done at boot time. The failed paths are disabled. The system does not do any recovery action. However, you can try to recover the affected fabric planes by restarting the SIBs. The recovery steps depend on the nature of the failure.

A fabric plane represents an independent bidirectional path between a PFE and fabric ASIC. Runtime Fabric OAM periodically checks fabric connectivity and helps detect and report failures in fabric planes during system runtime. Runtime Fabric OAM detects the fabric reachability of each PFE.

When the same fabric planes fail on one or more FPCs, restart the SIB containing the failed planes by using the following commands:

user@host> request chassis sib slot slot-number offline

user@host> request chassis sib slot slot-number online

When random fabric planes fail on multiple FPCs, the fault cannot be isolated to a specific FPC or SIB. However, you can try to recover the planes by restarting the SIBs that contain the affected planes in a sequential manner.

For each error detected by the fabric OAM feature, a syslog is generated. The following is an example:

The following syslog message indicates that a fabric OAM-related error was cleared.

Also, you can use the CLI commands show system errors active detail and show system alarms to view the Fabric OAM-related errors.

The following output shows details for both single fabric plane failure (on Packet Forwarding Engine 0) and all fabric planes failure (on Packet Forwarding Engine 1).

You can use the CLI command show chassis fabric fpcs to view the fabric OAM self-ping state of each fabric plane.

The show chassis fabric fpcs command displays the following output when the fabric OAM feature is disabled:
