Fabric Resiliency and Degradation

Juniper routers and switches have built-in resiliency to handle failures and error conditions encountered during normal operation. Junos OS takes immediate action to remedy failure conditions and minimize traffic loss, and no manual intervention is needed. Fabric degradation is one possible cause of such error conditions. The following sections explain how the Packet Forwarding Engines (PFEs) recover from these failures.

Packet Forwarding Engine Errors and Recovery on PTX Series Routers

Packet Forwarding Engine destinations can become unreachable on PTX Series routers for the following reasons:

The fabric Switch Interface Boards (SIBs) are offline as a result of a CLI command.
The fabric SIBs are turned offline by the control board because of high temperature conditions.
Voltage or polled I/O errors in the SIBs are detected by the control board.
Unexpected link-training errors occur on all connected planes.
Two Packet Forwarding Engines can reach the fabric but not each other.
Link errors occur where two Packet Forwarding Engines have connectivity with the fabric but not through a common plane.
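The last two conditions in this list amount to a set-intersection check: each Packet Forwarding Engine has working links to some fabric planes, but the two sets share no plane. The following is a minimal sketch in Python, assuming a hypothetical mapping from a PFE name to the set of plane numbers it can reach. The names and the data shape are illustrative only and are not a Junos OS interface.

```python
def common_planes(reachable, pfe_a, pfe_b):
    """Return the fabric planes that both PFEs have working links to."""
    return reachable[pfe_a] & reachable[pfe_b]


def can_reach_each_other(reachable, pfe_a, pfe_b):
    """Two PFEs can exchange traffic only if they share at least one plane."""
    return bool(common_planes(reachable, pfe_a, pfe_b))


# Hypothetical state: every PFE reaches the fabric, but pfe0 and pfe1
# have no plane in common.
reachable = {
    "pfe0": {0, 1},
    "pfe1": {2, 3},
    "pfe2": {1, 2},
}
```

In this example, pfe0 and pfe1 both have fabric connectivity but cannot reach each other, whereas pfe0 and pfe2 share plane 1.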

Starting with Junos OS Release 13.3, you can use PTX Series routers to configure Packet Forwarding Engine (PFE)-related error levels and the actions to perform when a specified threshold is reached.

If error levels are not defined, a PTX Series router begins the following phases in the recovery process:

SIB restart phase: The router attempts to resolve the issue by restarting the SIBs one by one. This phase does not start if the SIBs are functioning properly and only a single line card has an issue.
SIB and line card restart phase: The router restarts both the SIBs and the line card. If a line card cannot initiate high-speed links to the fabric after the reboot, no interfaces are created for it, so it carries no live traffic and cannot cause traffic loss.
Line card offline phase: Because the previous recovery attempts failed, the router takes the affected line cards and their interfaces offline to avoid further errors.
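The phases above form an escalation sequence: each phase runs only if the previous one did not clear the fault, and the first phase is skipped when the SIBs are healthy and a single line card is at fault. The following Python sketch models that ordering. The enum, the function names, and the `fault_cleared_after` callback are hypothetical and stand in for router behavior that the source does not describe in detail.

```python
from enum import Enum


class Phase(Enum):
    SIB_RESTART = 1
    SIB_AND_LINE_CARD_RESTART = 2
    LINE_CARD_OFFLINE = 3


def recovery_phases(sibs_healthy, single_line_card_fault):
    """Return the ordered phases that apply to the current fault."""
    phases = []
    # The SIB restart phase is skipped when the SIBs work properly and
    # only a single line card has an issue.
    if not (sibs_healthy and single_line_card_fault):
        phases.append(Phase.SIB_RESTART)
    phases.append(Phase.SIB_AND_LINE_CARD_RESTART)
    phases.append(Phase.LINE_CARD_OFFLINE)
    return phases


def run_recovery(phases, fault_cleared_after):
    """Escalate through the phases and return the phase where recovery stops.

    fault_cleared_after(phase) is a hypothetical callback that returns True
    when the fault is gone after that phase. Taking the line card offline is
    the terminal action, so escalation always stops there.
    """
    for phase in phases:
        if phase is Phase.LINE_CARD_OFFLINE or fault_cleared_after(phase):
            return phase
    return None  # not reached when phases ends with LINE_CARD_OFFLINE
```

For example, `run_recovery(recovery_phases(False, False), lambda phase: False)` escalates through both restart phases and ends with the line card offline phase.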

Fabric Resiliency and Automatic Recovery of Degraded Fabric

Starting in Junos OS Evolved Release 23.4R1, the fabric automatic recovery feature is available to limit data loss. Recovery actions include FRU restart and link restart, among others.

The following three-phase fabric recovery actions are attempted at FRU level:

1. FRU level recovery using SIB restart.

2. FRU level recovery using FPC restart or PFE restart.

3. Action for unrecoverable PFEs: IFD disable or PFE offline.

Note: For platforms that do not have PFE-restart support, FPC restart is provided as the default action.

Fabric recovery action for SIB fault conditions: For reachability faults caused by an absent SIB (taken offline by the user or not present during system power-up), fabric resiliency does not attempt recovery. In systems that do not support fabric recovery, chassis alarms are generated for reachability faults.

PFE Level Recovery Action on PTX Series Routers (PTX10004, PTX10008, and PTX10016 Routers)

For platforms that support PFE restart, PFE restart is the default phase 2 recovery action.

Note: In ASICs with multiple PFEs, the restart affects PPFEs (Per-plane PFEs), similar to PFE offline action.

The recovery decision for the phase 2 action is made in either of the following scenarios:

The PFEs with reachability faults all reside in a single FPC.
The PFEs with reachability faults reside in one or more FPCs and have no common point of failure.

Phase 2 recovery is attempted on PPFEs that have not recovered from reachability faults after phase 1 recovery.

If the number of PFEs with self-reachability faults in an FPC equals or exceeds 50% of the PFEs in that FPC, the FPC is restarted.

Use the following CLI option to manually configure the default PFE restart action:

The following table shows the actions on phase 2 recovery, based on the configuration and number of PFEs in fault in an FPC.

Recovery decision	Number of implicated PFEs in FPC	PFE restart supported	PFE restart disable	FPC restart disable	Action
Phase 2 action	<= 50%	Yes	No	x	PFE restart
Phase 2 action	<= 50%	Yes	Yes	No	FPC restart
Phase 2 action	<= 50%	Yes	Yes	Yes	PFE restart
Phase 2 action	>50%	Yes	x	No	FPC restart
Phase 2 action	>50%	Yes	Yes	Yes	PFE restart
Phase 2 action	>50%	Yes	No	Yes	PFE restart
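The table can be read as a small decision function. The following Python sketch encodes its six rows and follows the table's boundaries (<= 50% versus > 50%). The function and parameter names are hypothetical. The branch for platforms without PFE restart support is an assumption based on the earlier note that FPC restart is the default there; the table itself covers only platforms that support PFE restart.

```python
def phase2_action(implicated_pfes, total_pfes, pfe_restart_supported,
                  pfe_restart_disabled, fpc_restart_disabled):
    """Return the phase 2 recovery action for one FPC, per the table above."""
    if not pfe_restart_supported:
        # Assumption: not a table row. Taken from the note that FPC restart
        # is the default on platforms without PFE restart support.
        return "FPC restart"

    # True when more than 50% of the PFEs in the FPC are implicated.
    over_half = implicated_pfes * 2 > total_pfes

    if fpc_restart_disabled:
        # Every row with FPC restart disabled ends in a PFE restart.
        return "PFE restart"
    if over_half:
        # More than 50% implicated and FPC restart allowed.
        return "FPC restart"
    # At most 50% implicated: PFE restart unless it is disabled.
    return "FPC restart" if pfe_restart_disabled else "PFE restart"
```

For example, with 2 of 8 PFEs implicated and no restart action disabled, `phase2_action(2, 8, True, False, False)` returns `"PFE restart"`.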

Packet Forwarding Engine Errors and Recovery on T640, T1600 or TX Matrix Routers

Packet Forwarding Engine destinations can become unreachable on T640, T1600 or TX Matrix routers for the following reasons:

The fabric Switch Interface Boards (SIBs) are offline as a result of a CLI command or a pressed physical button.
The fabric SIBs are turned offline by the Switch Processor Mezzanine Board (SPMB) because of high temperature conditions.
Voltage or polled I/O errors in the SIBs are detected by the SPMB.
All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the SIBs are online.
Complete fabric loss is caused by destination timeouts, even when the SIBs are online.

The recovery process consists of the following phases:

Fabric plane restart phase: The router restarts the fabric planes one by one. This phase does not start if the fabric planes are functioning properly and only a single line card has issues.
Fabric plane and line card restart phase: The router restarts both the SIBs and the line cards. If a line card cannot initiate high-speed links to the fabric after the reboot, no interfaces are created for it, so it carries no live traffic and cannot cause traffic loss.
Line card offline phase: Because the previous recovery attempts failed, the router takes the affected line cards and their interfaces offline to avoid further errors.

Note:

Starting in Junos OS Release 14.2R6, if a SIB goes offline because of extreme conditions such as high voltage or high temperature, the router does not restart the fabric plane for that SIB as part of the recovery process.
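The rule in this note can be sketched as a filter over the SIBs before the plane restart phase. The record layout and the reason strings below are hypothetical and chosen only for illustration; they do not correspond to Junos OS output.

```python
# Hypothetical offline reasons that count as extreme conditions.
EXTREME_REASONS = {"high-voltage", "high-temperature"}


def planes_to_restart(sibs):
    """Return the planes eligible for restart during recovery.

    Each SIB is a dict with the keys 'plane', 'state', and, when the SIB
    is offline, 'offline_reason'. Planes whose SIB went offline because of
    an extreme condition are excluded.
    """
    eligible = []
    for sib in sibs:
        extreme = (sib["state"] == "offline"
                   and sib.get("offline_reason") in EXTREME_REASONS)
        if not extreme:
            eligible.append(sib["plane"])
    return eligible


sibs = [
    {"plane": 0, "state": "online"},
    {"plane": 1, "state": "offline", "offline_reason": "high-temperature"},
    {"plane": 2, "state": "online"},
]
```

With this sample data, plane 1 is left out of the restart sequence because its SIB is offline for a high-temperature condition.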

The phased recovery mechanism described above is exhaustive unless other errors can be correlated with these issues.

Starting in Junos OS Release 14.2R6, you can manage fabric degradation in single-chassis systems better by incorporating fabric self-ping and Packet Forwarding Engine liveness mechanisms. Fabric self-ping is a mechanism to detect issues in the fabric data path. Using the fabric self-ping mechanism, every Packet Forwarding Engine ascertains that a packet destined to itself is reaching it when the packet is sent over the fabric path. Packet Forwarding Engine liveness is a mechanism to detect whether a Packet Forwarding Engine is reachable on the fabric plane. To verify that it is reachable, the Packet Forwarding Engine sends a self-destined packet over the fabric plane periodically. If any error is detected by these two mechanisms, the fabric manager raises a fabric degraded alarm and initiates recovery by restarting the line card.
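The detection loop described in this paragraph can be sketched as follows. Each Packet Forwarding Engine sends a self-destined packet over each fabric plane, and any failure leads to a fabric degraded alarm and a restart of the owning line card. This is a simplified model, not the fabric manager's implementation: the `delivers` callback, the action tuples, and the PFE and line card names are all hypothetical.

```python
def failed_planes(pfe, planes, delivers):
    """Return the planes on which a self-destined packet did not come back.

    delivers(pfe, plane) is a hypothetical probe that returns True when a
    packet sent by the PFE to itself over that plane arrives.
    """
    return [plane for plane in planes if not delivers(pfe, plane)]


def detect_and_recover(pfe_to_line_card, planes, delivers):
    """Probe every PFE and return the resulting alarms and restarts."""
    actions = []
    restarted = set()
    for pfe, line_card in pfe_to_line_card.items():
        failed = failed_planes(pfe, planes, delivers)
        if not failed:
            continue
        actions.append(("alarm", "fabric degraded", pfe, tuple(failed)))
        # Restart each affected line card only once.
        if line_card not in restarted:
            actions.append(("restart", line_card))
            restarted.add(line_card)
    return actions


# Hypothetical fault: pfe1 cannot reach itself over plane 2.
broken = {("pfe1", 2)}
topology = {"pfe0": "fpc0", "pfe1": "fpc0", "pfe2": "fpc1"}
actions = detect_and_recover(
    topology, [0, 1, 2], lambda pfe, plane: (pfe, plane) not in broken
)
```

Here `actions` holds one alarm for pfe1 on plane 2 followed by one restart of fpc0, the line card that hosts pfe1.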

Change History Table

Feature support is determined by the platform and release you are using. Use Feature Explorer to determine if a feature is supported on your platform.

Release

Description

14.2R6

Starting in Junos OS Release 14.2R6, you can manage fabric degradation in single-chassis systems better by incorporating fabric self-ping and Packet Forwarding Engine liveness mechanisms.

13.3