Fabric Resiliency

 

Fabric Resiliency and Degradation

Juniper routers and switches have built-in resiliency to handle failures and error conditions encountered during normal operation. Junos OS takes immediate action to remedy failure conditions and minimize traffic loss; no manual intervention is needed. Fabric degradation can be one cause of such error conditions. The following sections explain how the Packet Forwarding Engines (PFEs) recover from these failures in a resilient manner.

Packet Forwarding Engine Errors and Recovery on PTX Series Routers

Packet Forwarding Engine destinations can become unreachable on PTX Series routers for the following reasons:

  • The fabric Switch Interface Boards (SIBs) are taken offline as a result of a CLI command or a press of the physical offline button.

  • The fabric SIBs are turned offline by the control board because of high temperature conditions.

  • Voltage or polled I/O errors in the SIBs are detected by the control board.

  • Unexpected link-training errors occur on all connected planes.

  • Two Packet Forwarding Engines can reach the fabric but not each other.

  • Link errors occur where two Packet Forwarding Engines have connectivity with the fabric but not through a common plane.

Starting with Junos OS Release 13.3, on PTX Series routers you can configure Packet Forwarding Engine (PFE)-related error levels and the actions to perform when a specified threshold is reached.
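
For example, the following set-style sketch shows how such thresholds might be configured. The slot number, threshold values, and the exact severity and action keywords are illustrative assumptions; verify the statement names against your Junos OS release.

    # Illustrative sketch only; slot, thresholds, and keyword spellings are assumptions.
    # Raise an alarm after 10 major errors on FPC 0.
    set chassis fpc 0 error major threshold 10 action alarm
    # Log minor errors without taking further action.
    set chassis fpc 0 error minor threshold 1 action log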

If error levels are not defined, a PTX Series router begins the following phases in the recovery process:

  1. SIB restart phase: The router attempts to resolve the issue by restarting the SIBs one by one. This phase does not start if the SIBs are functioning properly and only a single line card is experiencing an issue.

  2. SIB and line card restart phase: The router restarts both the SIBs and the line card. If any line cards are unable to establish high-speed links to the fabric after the reboot, no interfaces are created for those line cards, so live traffic is not attracted to them and further issues are avoided.

  3. Line card offline phase: Because the previous recovery attempts failed, the affected line cards and their interfaces are taken offline, which prevents further issues and error conditions.

Packet Forwarding Engine Errors and Recovery on T640, T1600, or TX Matrix Routers

Packet Forwarding Engine destinations can become unreachable on T640, T1600, or TX Matrix routers for the following reasons:

  • The fabric Switch Interface Boards (SIBs) are taken offline as a result of a CLI command or a press of the physical offline button.

  • The fabric SIBs are turned offline by the Switch Processor Mezzanine Board (SPMB) because of high temperature conditions.

  • Voltage or polled I/O errors in the SIBs are detected by the SPMB.

  • All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the SIBs are online.

  • Complete fabric loss is caused by destination timeouts, even when the SIBs are online.

The recovery process consists of the following phases:

  1. Fabric plane restart phase: The router restarts the fabric planes one by one. This phase does not start if the fabric planes are functioning properly and only a single line card has issues.

  2. Fabric plane and line card restart phase: The router restarts both the SIBs and the line cards. If any line cards are unable to establish high-speed links to the fabric after the reboot, no interfaces are created for those line cards, so live traffic is not attracted to them and further issues are avoided.

  3. Line card offline phase: Because the previous recovery attempts failed, the affected line cards and their interfaces are taken offline, which prevents issues and error conditions that could lead to serious consequences.

Note

Starting in Junos OS Release 14.2R6, if a SIB goes offline because of extreme conditions such as high voltage or high temperature, then as part of the recovery process, the router does not restart the fabric plane for that SIB.

The phased recovery mechanism described above is exhaustive unless other errors can be correlated with these issues.

Starting in Junos OS Release 14.2R6, you can better manage fabric degradation in single-chassis systems through the fabric self-ping and Packet Forwarding Engine liveness mechanisms. Fabric self-ping detects issues in the fabric data path: each Packet Forwarding Engine verifies that a packet destined to itself arrives when sent over the fabric path. Packet Forwarding Engine liveness detects whether a Packet Forwarding Engine is reachable on the fabric plane: to verify reachability, the Packet Forwarding Engine periodically sends a self-destined packet over the fabric plane. If either mechanism detects an error, the fabric manager raises a fabric degraded alarm and initiates recovery by restarting the line card.

Connectivity loss in a router occurs when the router is unable to transmit data packets to other neighboring routers, although the interfaces on that router continue to be in the active state. As a result, the other neighboring routers continue to forward traffic to the impacted router, which drops the arriving packets without sending a notification to the other routers.

When a Packet Forwarding Engine in a router is unable to send traffic to other Packet Forwarding Engines over the data plane within the same router, the router is unable to transmit any packets to a neighboring router, although the interfaces are advertised as active on the control plane. Fabric failure can be one of the reasons for the loss of connectivity.

The following fabric failure scenarios can occur:

  • Removal of the control board

  • High-speed link 2 (HSL2) training failures

  • Single link failure on a line card

  • Multiple link failures on the same line card or the same fabric plane

  • Multiple link failures randomly on a line card or a fabric plane

  • Intermittent cyclic redundancy check (CRC) errors

  • A complete loss of connectivity for only one destination and not to other destinations

When a line card is unable, for some reason, to forward traffic to other line cards within the device, the control protocol on the Routing Engine cannot detect this condition. Traffic is not diverted to the functional, active line cards; instead, packets continue to be sent to the affected line card and are dropped there. The following might be the causes of a line card being unable to forward traffic:

  • All the planes in the system are in the Offline or Fault state.

  • All the Packet Forwarding Engines on the line card might have disabled the fabric streams due to destination errors.

If all the Switch Control Boards (SCBs) lose connectivity to the line cards, then all the interfaces are brought down. If a Packet Forwarding Engine of a line card loses complete connectivity to or from the fabric, then that line card is brought down.
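
To check the state of the fabric planes, you can use the show chassis fabric summary command. The output below is an illustrative sketch only; the actual columns and values vary by platform and release.

    user@router> show chassis fabric summary
    Plane   State    Uptime
     0      Online   34 days, 2 hours, 51 minutes
     1      Online   34 days, 2 hours, 51 minutes
     2      Fault
     3      Spare    34 days, 2 hours, 49 minutes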

System hardware failures can be of the following types:

  • A single occurrence or a rare failure lasting a brief period (such as an environmental spike). This type of failure is healed without manual intervention by restarting the fabric planes and, if necessary, the line cards as well.

  • Repeated failures that occur frequently.

  • A permanent failure.

A recovery from any case of reduced throughput, such as multiple Packet Forwarding Engine destination timeouts on multiple planes, is not attempted. Restoration of connectivity is attempted only when all the planes are in the Offline or Fault state or when the destinations are unreachable on all active planes.

If connectivity loss is traced to a specific line card, which is either a common source or a common destination of the destination timeouts, and you have configured the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level, no recovery action is taken. Use the show chassis fabric reachability command to verify the status of the fabric and the line card. An alarm is triggered to indicate that the particular line card is causing the connectivity loss.
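
For example, after a fabric event you might verify reachability and check for the alarm as follows (a sketch; output formats vary by platform and release):

    user@router> show chassis fabric reachability
    user@router> show chassis alarms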

Fabric-Failure Detection Methods on MX Series Routers

The chassis daemon (chassisd) process detects the removal of a control board. The removal disables all the active planes that reside on that board, and a switchover is performed. If the active Routing Engine is unplugged along with the control board, detection of the control board removal is delayed until the Routing Engine switchover occurs and the new primary and backup Routing Engines reconnect. If the control board is taken offline gracefully, by issuing the request chassis cb slot slot-number offline command or by pressing the physical offline button, a fabric failure does not occur even though the control board moves to the offline state.

If active fabric planes are removed because of the removal of the control board on the master RE, the line card takes the local action of disabling the removed planes. If spare planes are available, the line card initiates a switchover to the spare planes. If an active control board on a backup RE is removed, the master RE performs the switchover. The software attempts to minimize the duration of connectivity loss by disabling all removed planes. The spare planes are transitioned to the online state one by one.

Fabric self-ping is a mechanism to detect issues in the fabric data path. Each Packet Forwarding Engine forwards fabric data cells destined to itself over all active fabric planes. To transmit a data cell, the Packet Forwarding Engine fabric sends request cells over an active plane and waits for a grant. The destination Packet Forwarding Engine sends a grant cell over the same plane on which the request cell was received. When the grant cell is received, the source Packet Forwarding Engine sends the data cell.

The Packet Forwarding Engine fabric can detect grant delays. If grants are not received within a certain period of time, a destination timeout is declared. A destination timeout on a given plane reported by Packet Forwarding Engines on two or more line cards is considered an indication of a plane failure. Even if only one Packet Forwarding Engine on a line card reports an error, the line card is considered to be in error. Destination timeouts are observed only while the Packet Forwarding Engine is actively sending traffic, because requests are sent only for valid data cells. The software takes an appropriate action based on the destination timeout. For self-ping, a data cell is destined to the source Packet Forwarding Engine only.

Fabric ping failure messages are sent to the fabric manager on the Routing Engine, which collates all of the errors reported by the line cards and takes corrective action. For example, a ping failure on all links of the same line card might indicate a problem on the line card, whereas ping failures on multiple line cards for the same fabric plane might indicate a problem with the fabric.

If the Routing Engine determines, based on the error information it receives from the line cards or the Packet Forwarding Engines over a period of 5 seconds, that a fabric plane is down, it indicates a fabric failure. The 5-second duration is the period over which the Routing Engine collates the errors from all of the line cards.

Fabric self-ping packets are sent periodically to check the sanity of the fabric links. Self-pings are sent at intervals of 500 ms, and the destination timeout is also checked at 500-ms intervals. If two timeouts occur in succession, a self-ping failure is declared. When a destination timeout is received, the Packet Forwarding Engine fabric stops sending packets to the fabric. To examine the link condition again, the software resets the credits to ensure that new requests are sent. When a self-ping failure occurs, the line card removes the affected plane from use for sending data to all destinations. This ensures that self-pings are not sent again on the defective plane.

The following guidelines apply to the self-ping capability:

  • By default, self-pings are not sent on spare fabric planes because spare planes do not carry traffic.

  • Self-ping packets are large enough for their cells to be distributed over all the active fabric planes (a maximum of 8 on MX Series routers).

  • Detection of received self-ping packets is not performed.

  • A high-priority queue is used so that self-pings can still be sent in oversubscription cases.

MX Series Routers Fabric Resiliency

MX Series routers provide intelligent mechanisms to reduce packet loss in hardware failure scenarios.

This topic contains the following sections, which describe fabric resiliency options, the failure detection methods used, and corrective actions.

Fabric Connectivity Restoration

Packet Forwarding Engine destinations can become unreachable for the following reasons:

  • The control boards are taken offline as a result of a CLI command or a press of the physical offline button.

  • The fabric control boards are turned offline because of high temperature conditions.

  • Voltage or polled I/O errors are detected in the fabric.

  • All Packet Forwarding Engines receive destination errors on all planes from remote Packet Forwarding Engines, even when the fabrics are online.

  • Complete fabric loss is caused by destination timeouts, even when the fabrics are online.

When the system detects any unreachable Packet Forwarding Engine destinations, fabric connectivity restoration is attempted. If restoration fails, the system turns off the interfaces to trigger local protection action or traffic re-route on the adjacent routers.

The recovery process consists of the following phases:

  1. Fabric plane restart phase: Restoration is attempted by restarting the fabric planes one by one. This phase does not start if the fabric planes are functioning properly and an error is reported by only one line card. An error message is generated to specify that connectivity loss is the reason for the fabric plane being taken offline. This phase is performed for fabric plane errors only.

  2. Fabric plane and line card restart phase: The system waits for the first phase to be completed before examining the system state again. If the connectivity is not restored after the first phase is performed or if the problem occurs again within a duration of 10 minutes, connectivity restoration is attempted by restarting both the fabric planes and the line cards. If you configure the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level to disable restart of the line cards when a recovery is attempted, an alarm is triggered to indicate that connectivity loss has occurred. In this second phase, three steps are taken:

    1. All the line cards that have destination errors on a PFE are turned offline.

    2. The fabric planes are turned offline and brought back online, one by one, starting with the spare plane.

    3. The line cards that were turned offline are brought back online.

  3. Line card offline phase: The system waits for the second phase to be completed before examining the system state again. Connectivity loss is limited by turning the line cards offline and by turning off interfaces because previous attempts at recovery have failed. If the problem is not resolved by restarting the line cards or if the problem recurs within 10 minutes after restarting the line cards, this phase is performed.

The three phases are controlled by timers. During these phases, if an event (such as taking line cards or fabric planes offline or online) times out, the phase skips that event and proceeds to the next one. The timer has a timeout value of 10 minutes. If the first fabric error occurs in a system with two or more line cards, the fabric planes are restarted. If another fabric error occurs within the next 10 minutes, the fabric planes and line cards are restarted. However, if the second fabric error occurs outside of the 10-minute timeout period, the first phase is performed again, that is, only the fabric planes are restarted.

In cases where all the destination timeouts are traced to a certain line card, for example, one source line card or one destination line card, only that line card is turned offline and online. The fabric planes are not turned offline and online. If another fabric fault occurs within the period of 10 minutes, the line card is turned offline.

By default, the system limits connectivity loss time by detecting severely degraded fabric. No user interaction is necessary.

Line Cards with Degraded Fabric

You can configure a line card with degraded fabric to be taken offline. On an MX960, MX480, or MX240 router, degraded fabric can result from link errors or bad fabric planes. This configuration is particularly useful in partial connectivity loss scenarios, where taking the line card offline results in faster rerouting. To configure this option on a line card, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level. For more information, see Detection and Corrective Actions of Line Cards with Degraded Fabric on MX Series Routers.

Connectivity Loss Towards a Single Destination Only

In certain deployments, a line card indicates a complete connectivity loss towards a single destination only, but functions properly for other destinations. Such cases are identified, and the affected line card is recovered. Consider a sample scenario in which, for the connection between line card 0 and line card 1, the active planes are 0, 1, 2, and 3 and the spare planes are 4, 5, 6, and 7. If line card 0 has single link failures for planes 0 and 1, and line card 1 has single link failures for planes 2 and 3, a complete connectivity loss occurs between the two line cards. Both line card 0 and line card 1 undergo a phased mode of recovery, and fabric healing takes place.

Redundancy Fabric Mode on Active Control Boards

You can configure the active control boards to operate in redundancy mode or in increased fabric bandwidth mode. To configure redundancy mode for the active control boards, use the redundancy-mode redundant statement at the [edit chassis fabric] hierarchy level. In redundancy mode, all the line cards use 4 fabric planes as active planes, regardless of the type of line card. You can instead enable increased fabric bandwidth mode for optimal and efficient performance and traffic handling. On an MX960, MX480, or MX240 router, use the redundancy-mode increased-bandwidth statement at the [edit chassis fabric] hierarchy level to cause all the available fabric planes to be used. In this mode, the maximum number of available fabric planes is used for MX routers and the MPC3E: on MX960 routers, 6 active planes are used; on MX240 and MX480 routers, 8 active planes are used.
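
A minimal configuration sketch for the two modes described above:

    # Redundancy mode: all line cards use 4 active fabric planes.
    [edit chassis fabric]
    redundancy-mode redundant;

    # Alternatively, increased fabric bandwidth mode: all available
    # fabric planes are used (MX960, MX480, and MX240 routers).
    [edit chassis fabric]
    redundancy-mode increased-bandwidth;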

Increased fabric bandwidth mode is enabled by default on MX routers with the Switch Control Board (SCB). On MX routers with the Enhanced SCB (SCBE) and the MPC3E, redundancy mode is enabled by default. For more information, see Configuring Fabric Redundancy Mode for Active Control Boards on MX Series Routers.

Detection and Corrective Actions of Line Cards with Degraded Fabric on MX Series Routers

You can configure a line card with degraded fabric to be taken offline on an MX960, MX480, or MX240 router. Configuring this feature does not disrupt the system: you can configure it without restarting the line card or the system.

The following scenarios can occur when you configure the feature to disable line cards with degraded fabric:

  • If a line card is already operating with degraded fabric bandwidth when you configure this capability to turn off such a line card, the corrective action is still taken.

  • If a line card has been taken offline because of fabric errors and this functionality is later disabled, or is left configured only for some other line card, the line card that was taken offline is brought back online automatically.

  • All line cards that were taken offline because of degraded fabric while this setting was configured are brought back online when you commit any configuration under the [edit chassis] hierarchy level. Similarly, a restart of the chassis daemon or a graceful Routing Engine switchover (GRES) operation also causes line cards that were disabled because of degraded fabric to be brought back online.

Degraded fabric means that a line card is operating with fewer than the required number of active fabric planes. If a line card is operating with fewer than four planes, it is considered degraded. This rule applies to all types of line cards and fabric. A degraded condition denotes that fabric traffic still flows, but at a reduced bandwidth.

The following conditions can result in degradation of fabric:

  • The fabric control boards go offline as a result of an unintentional, abrupt power shutdown.

  • An application-specific integrated circuit (ASIC) error, which causes a plane of a control board to be automatically turned offline.

  • Manually bringing the fabric plane or the control board to the offline state.

  • Removal of the control board.

  • Self-ping failure on any plane.

  • HSL2 training failure for an active plane.

  • If a spare fabric plane has CRC errors and this spare plane is brought online, the link with the CRC errors is disabled. This mechanism might cause fabric degradation in one direction and a traffic black hole in the other direction.

  • When a self-ping or HSL2 training failure occurs, the fabric plane is disabled for the affected line card while remaining online for other line cards. This condition can also cause a traffic black hole.

If you need to remove a control board or take a fabric plane offline during system maintenance, you must first enable the functionality that takes line cards with degraded bandwidth offline (by using the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level).

The following corrective actions are performed when a traffic black hole or fabric degradation occurs:

  • Regardless of whether a spare control board is available, the self-ping state of each line card is monitored at 5-second intervals on the Routing Engine. The fabric manager uses the following rules to determine the presence of a spare control board:

    • An MX960 router with I-chip-based, or both I-chip-based and Trio-chip-based, line cards is considered to have a spare control board when it contains three control boards.

    • An MX240 or MX480 router with I-chip-based, or both I-chip-based and Trio-chip-based, line cards is considered to have a spare control board when it contains two control boards.

    • An MX960, MX480, or MX240 router that contains only Trio-based line cards is not considered to have a spare control board.

    If, during any such 5-second interval, two line cards indicate a failure for the same plane, a switchover to the spare control board is performed. In this case, the control board that reported errors is turned offline and the spare control board is brought online.

  • If a spare control board is available, and you have configured the functionality to disable line cards with degraded fabric, the self-ping state of each line card is monitored at 5-second intervals on the Routing Engine. The following conditions can occur:

    • During any 5-second interval, if only one line card indicates a failure for a plane, the fabric manager waits for the next interval. During the subsequent interval, if no other line card indicates a failure for the same plane, a switchover of the control board is performed.

    • During any 5-second interval, if multiple line cards show failures for multiple control boards, the fabric manager waits for the next interval. During the subsequent interval, if the same condition persists, all the failing line cards are turned offline even if a spare control board is present.

    • During any 5-second interval, if any line card shows a failure for multiple planes on multiple control boards, the fabric manager waits for the next interval. During the subsequent interval, if the same condition persists, the line card is turned offline even if a spare control board is present.

  • If spare planes are not available, a line card is turned offline when it shows a failure for a single plane or for multiple planes. The line card is taken offline only if you previously configured the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level.

Managing Bandwidth Degradation

Certain errors cause a system to drop packets without notification. Other connected systems continue to forward traffic to the affected system, impacting network performance. A severely degraded fabric plane can be one cause of such behavior.

By default, Juniper Networks routers attempt to heal such situations when the system detects issues with Packet Forwarding Engines. If the healing fails, the system turns off the interfaces, thereby preventing further escalation.

Junos OS provides the bandwidth-degradation configuration statement so that you can detect and respond to fabric plane degradation in the way you see fit. You can configure the router to specify which healing actions it should take once such a condition is detected.

The bandwidth-degradation statement is configured with a percentage and an action. The percentage value can range from 1 through 99 and represents the percentage of fabric degradation needed to trigger a response from the line card. The action attribute determines the type of response the line card performs once fabric degradation reaches the configured percentage (see the configuration sketch after the action list below).

A second form of the statement is configured with only an action attribute; it triggers when the percentage of fabric degradation reaches 100 percent.

The following actions can be applied to either configuration statement:

  • log-only: A message is logged in the chassisd and messages files when the fabric degradation threshold is reached. No other action is taken.

  • restart: The line card with a degraded fabric plane is restarted once the threshold is reached.

  • offline: The line card with a degraded fabric plane is taken offline once the threshold is reached. The line card requires manual intervention to be brought back online. This is the default action if no action attribute is configured.

  • restart-then-offline: The line card with a degraded fabric plane is restarted once the threshold is reached, and if fabric plane degradation is detected again within 10 minutes, the line card is taken offline. The line card requires manual intervention to be brought back online.
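
A minimal configuration sketch for the bandwidth-degradation statement. The placement at the [edit chassis fabric degraded] hierarchy level and the attribute spellings are assumptions based on the hierarchy used elsewhere in this topic; verify them against your Junos OS release.

    # Sketch only; hierarchy placement and attribute names are assumptions.
    [edit chassis fabric degraded]
    bandwidth-degradation {
        percentage 60;                 # trigger at 60 percent degradation
        action restart-then-offline;   # restart, then offline if it recurs
    }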

Note

This feature is available starting in Junos OS Release 15.1R1.

Disabling Line Card Restart to Limit Recovery Actions from Degraded Fabric Conditions

You can disable line card restarts to limit recovery actions from a degraded fabric condition. On T640 and T1600 routers, only the fabric plane is restarted. On PTX Series routers, only the Switch Interface Boards (SIBs) are restarted. To disable the restarting of line cards, use the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level:
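
    [edit chassis fabric degraded]
    action-fpc-restart-disable;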

When line card restarts are disabled, an alarm is raised if unreachable destinations are present in the router, and you must restart the line cards manually.

To ensure that both the fabric planes (T640 and T1600 routers) or the SIBs (PTX Series routers) and the line cards are restarted during the recovery process, do not configure the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy level.

Disabling an FPC with Degraded Fabric Bandwidth

You can bring an FPC with degraded fabric bandwidth offline to avoid causing a traffic black hole in the chassis for an extended time. To configure the option to disable an FPC with degraded bandwidth, use the offline-on-fabric-bandwidth-reduction statement at the [edit chassis fpc slot-number] hierarchy level:
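
    [edit chassis fpc 2]                     # slot number 2 is illustrative
    offline-on-fabric-bandwidth-reduction;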

The fabric manager periodically checks the number of currently active planes. If the number of active planes is lower than the required number for a particular router, the system waits 10 seconds before taking any corrective action. If the reduced-bandwidth condition persists for an FPC and this feature has been configured for that FPC, the system takes the FPC offline.

Enabling Fabric Header Protection to Prevent Wedges

Starting in Junos OS Release 17.3R1, you can configure fabric header protection for MPCs to prevent wedges (blackholes).

Note

Fabric header protection is not supported on the Multiservices DPC (MS-DPC) or on MX104 and MX80 routers.

If the internal fabric header of a packet is corrupted, the packet itself can be corrupted. When corrupted packets are transmitted within the chassis, from the ingress Packet Forwarding Engine (PFE) to the egress PFE, they can cause application-specific integrated circuit (ASIC) wedges (blackholes). To protect the internal fabric header from corruption, enable fabric header protection. When you enable fabric header protection, a 32-bit cyclic redundancy check (CRC) is added to each packet sent from the ingress PFE to the egress PFE. When the packet is received at the egress PFE, the CRC is validated; if the check fails, the packet is dropped. This protects the egress PFE from corrupted packets.

To enable fabric header protection on MPCs to prevent wedges:

  1. Enable fabric header protection by including the fabric-header-crc-enable statement at the [edit chassis] hierarchy level (see the configuration sketch after this procedure).
  2. After enabling fabric header protection, commit the configuration.
    Note

    After you enable fabric header protection and commit the configuration, the router displays the following warning message:

    [edit]
    'chassis'
    warning: Chassis configuration for fabric header crc has been changed. A system reboot is mandatory. Please reboot the system NOW. Continuing without a reboot might result in unexpected system behavior.
    commit complete


  3. Reboot the router for the configuration to take effect.
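
A minimal end-to-end sketch of this procedure, using the fabric-header-crc-enable statement from step 1 and the standard request system reboot command:

    [edit]
    user@router# set chassis fabric-header-crc-enable
    user@router# commit

    user@router> request system reboot
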
Release History Table

Release   Description

17.3R1    Starting in Junos OS Release 17.3R1, you can configure fabric header protection for MPCs to prevent wedges (blackholes).

14.2R6    Starting in Junos OS Release 14.2R6, if a SIB goes offline because of extreme conditions such as high voltage or high temperature, then as part of the recovery process, the router does not restart the fabric plane for that SIB.

14.2R6    Starting in Junos OS Release 14.2R6, you can better manage fabric degradation in single-chassis systems by incorporating fabric self-ping and Packet Forwarding Engine liveness mechanisms.

13.3      Starting with Junos OS Release 13.3, on PTX Series routers you can configure Packet Forwarding Engine (PFE)-related error levels and the actions to perform when a specified threshold is reached.