Connectivity loss in a router occurs when the router is unable
to transmit data packets to other neighboring routers, although the
interfaces on that router continue to be in the active state. As a
result, the other neighboring routers continue to forward traffic
to the impacted router, which drops the arriving packets without sending
a notification to the other routers.
When a Packet Forwarding Engine in a router is unable to send
traffic to other Packet Forwarding Engines over the data plane within
the same router, the router is unable to transmit any packets to a
neighboring router, although the interfaces are advertised as active
on the control plane. Fabric failure can be one of the reasons for
the loss of connectivity.
The following fabric failure scenarios can occur:
Removal of the control board
High-speed link 2 (HSL2) training failures
Single link failure on a line card
Multiple link failures on the same line card or the same
fabric plane
Multiple link failures randomly on a line card or a fabric
plane
Intermittent cyclic redundancy check (CRC) errors
A complete loss of connectivity to only one destination
and not to other destinations
When a line card stops forwarding traffic to other line cards
within the device, the control protocol on the Routing Engine
cannot detect this condition. Traffic is not diverted to the
functional, active line cards; instead, packets continue to be
sent to the affected line card and are dropped at that point.
The following might be the causes for a
line card being unable to forward traffic:
If all the Switch Control Boards (SCBs) lose connectivity to
the line cards, then all the interfaces are brought down. If a Packet
Forwarding Engine of a line card loses complete connectivity to or
from the fabric, then that line card is brought down.
System hardware failures can be of the following types:
A single occurrence or a rare failure for a brief period
(such as environmental spikes). This failure is healed
without manual intervention by restarting the fabric plane and,
if necessary, restarting both the line cards and the fabric plane.
Repeated failures that occur frequently.
A permanent failure.
Recovery is not attempted for cases of reduced throughput, such as
multiple Packet Forwarding Engine destination timeouts on multiple
planes. Restoration of connectivity is attempted only when
all the planes are in the Offline or Fault state or when the destinations are unreachable
on all active planes.
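
The restoration rule above can be sketched as a small predicate. This is only an illustrative model, not the router's implementation: the state names other than Offline and Fault, the function name, and the data shapes are assumptions made for the example.

```python
from enum import Enum


class PlaneState(Enum):
    # Offline and Fault come from the text; Online and Spare are
    # assumed names for the remaining states in this sketch.
    ONLINE = "Online"
    SPARE = "Spare"
    OFFLINE = "Offline"
    FAULT = "Fault"


def should_attempt_restoration(plane_states, unreachable_on):
    """Decide whether connectivity restoration would be attempted.

    plane_states:   dict of plane id -> PlaneState (hypothetical shape)
    unreachable_on: set of plane ids on which destinations are unreachable
    """
    if not plane_states:
        return False
    down = (PlaneState.OFFLINE, PlaneState.FAULT)
    all_planes_down = all(s in down for s in plane_states.values())
    active = {p for p, s in plane_states.items() if s is PlaneState.ONLINE}
    # Reduced throughput (timeouts on only some active planes) does not
    # qualify; destinations must be unreachable on every active plane.
    unreachable_everywhere = bool(active) and active <= set(unreachable_on)
    return all_planes_down or unreachable_everywhere
```

With two online planes and timeouts on only one of them, the predicate returns False, which corresponds to the reduced-throughput case for which no recovery is attempted.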
If connectivity loss occurs because of a particular line card
that is either a common source or a common destination of the destination
timeouts, and if you have configured the action-fpc-restart-disable statement at the [edit chassis fabric degraded] hierarchy
level, no recovery action is taken. You can use the output of the show chassis fabric reachability command to verify the status of the fabric and
the line card. An alarm is triggered to indicate that the particular
line card is causing the connectivity loss.
Fabric-Failure Detection Methods on MX Series Routers
The chassis daemon (chassisd) process detects the removal of
a control board. The removal of the control board causes all the
active planes that reside on that board to be disabled, and a switchover
is performed. If the active Routing Engine is also unplugged along
with the control board, the detection of the control board removal
is delayed until the Routing Engine switchover occurs and
the primary and backup Routing Engine pair reconnects.
If the control board is taken offline gracefully, either by issuing the request
chassis cb slot slot-number offline command or
by pressing the physical button, a fabric failure
does not occur, even though the control board moves to the offline
state.
If you remove the control board on the primary Routing Engine,
resulting in removal of active fabric planes, the line card takes
the local action of disabling the removed planes. If spare planes
are available, the line card initiates switchover to spare planes.
If an active control board on a backup Routing Engine is removed,
the primary Routing Engine disables the removed planes and performs
the switchover to spare planes, if available. The software attempts
to optimize the duration of connectivity loss by disabling all removed
planes. The spare planes are transitioned to the online state one
by one.
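
The ordering described above (disable every removed plane first, then bring spare planes online one at a time) can be modeled roughly as follows. The data layout, the state strings, and the choice to activate one spare per lost active plane are assumptions made for illustration only.

```python
def handle_control_board_removal(planes, removed_board):
    """Return the ordered list of actions for a control board removal.

    planes: dict of plane id -> {"board": int, "state": str}, where state
            is "Online", "Spare", or "Offline" (hypothetical representation).
    """
    events = []
    lost_active = 0

    # Step 1: disable all planes on the removed board up front, so the
    # duration of connectivity loss is kept short.
    for plane_id in sorted(planes):
        plane = planes[plane_id]
        if plane["board"] == removed_board:
            if plane["state"] == "Online":
                lost_active += 1
            plane["state"] = "Offline"
            events.append(("disable", plane_id))

    # Step 2: transition spare planes to the online state one by one.
    # Assumption: one spare is activated per lost active plane.
    for plane_id in sorted(planes):
        if lost_active == 0:
            break
        plane = planes[plane_id]
        if plane["state"] == "Spare":
            plane["state"] = "Online"
            events.append(("online", plane_id))
            lost_active -= 1

    return events
```

Spare planes that sat on the removed board are marked offline in the first step, so the second step never selects them.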
Fabric self-ping is a mechanism to detect any issues in the
fabric data path. Each Packet Forwarding Engine forwards fabric data
cells that are destined to itself over all active fabric planes.
To transmit the data cell, the Packet Forwarding Engine fabric sends
a request cell over an active plane and waits for a grant cell.
The destination Packet Forwarding Engine sends the grant cell over
the same plane on which the request cell was received. When the grant
cell is received, the source Packet Forwarding Engine sends the data
cell.
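
As a rough illustration of the request and grant exchange, the sketch below walks one self-ping over each active plane. The `grant_returned` callable is a hypothetical stand-in for the fabric hardware, and the result strings are invented for the example.

```python
def send_self_ping(pfe_id, active_planes, grant_returned):
    """Simulate one self-ping round for a Packet Forwarding Engine.

    grant_returned: callable(plane, source, destination) -> bool, a
                    hypothetical hook that reports whether a grant cell
                    came back on the plane that carried the request cell.
    """
    results = {}
    for plane in active_planes:
        # For self-ping, the destination is the source itself.
        source = destination = pfe_id
        if grant_returned(plane, source, destination):
            # Grant received: the data cell is sent on this plane.
            results[plane] = "data cell sent"
        else:
            # No grant within the allowed time: destination timeout.
            results[plane] = "destination timeout"
    return results
```

For example, `send_self_ping(3, [0, 1, 2], lambda plane, src, dst: plane != 1)` models a fabric in which only plane 1 fails to return a grant.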
The Packet Forwarding Engine fabric can
detect grant delays. If a grant is not received within a certain
period of time, a destination timeout is declared. A destination timeout
on a given plane reported by Packet Forwarding Engines on two or more line
cards is considered an indication of a plane failure. If even one
Packet Forwarding Engine on a line card reports an error, the line
card is considered to be in error. Destination timeouts are noticed only
when the Packet Forwarding Engine is actively sending traffic, because requests
are sent only for valid data cells. The software takes an appropriate
action based on the destination timeout. For self-ping, a data cell
is destined only to the source Packet Forwarding Engine.
Fabric ping failure messages are sent to the fabric manager
on the Routing Engine, which collates all of the errors reported by
all the line cards and takes a corrective action. For example, a
ping failure for all links of the same line card might indicate a
problem on the line card. Ping failure for multiple line cards for
the same fabric plane might indicate a problem with the fabric.
If the Routing Engine determines that a fabric plane is down,
based on the error information it receives from the line cards
or the Packet Forwarding Engines over a period of 5 seconds, it reports
a fabric failure. The 5-second period is the interval over which
the Routing Engine collates the errors from all of the line cards.
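
The collation logic can be approximated with a short sketch. The 5-second window and the two heuristics (all links of one line card, or several line cards on one plane) come from the text; the report format, the function name, and the two-card threshold for "multiple line cards" are assumptions.

```python
from collections import defaultdict

COLLATION_WINDOW_S = 5.0  # collation period stated in the text


def classify_ping_failures(reports, active_planes, window_start=0.0):
    """Collate ping-failure reports and flag likely culprits.

    reports: iterable of (timestamp_s, line_card, plane) tuples
             (hypothetical report format).
    Returns (suspect_line_cards, suspect_planes).
    """
    window_end = window_start + COLLATION_WINDOW_S
    planes_by_card = defaultdict(set)
    cards_by_plane = defaultdict(set)
    for timestamp, line_card, plane in reports:
        if window_start <= timestamp < window_end:
            planes_by_card[line_card].add(plane)
            cards_by_plane[plane].add(line_card)

    all_planes = set(active_planes)
    # Failures on every link of one line card point at that line card.
    suspect_cards = {
        card for card, planes in planes_by_card.items()
        if all_planes and planes >= all_planes
    }
    # Failures from multiple line cards on one plane point at the fabric.
    suspect_planes = {
        plane for plane, cards in cards_by_plane.items() if len(cards) >= 2
    }
    return suspect_cards, suspect_planes
```

Both heuristics only suggest where the problem might lie; the sketch does not model the corrective action that the fabric manager takes afterward.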
Fabric self-ping packets are sent periodically to check the
sanity of the fabric links. Self-pings are sent at intervals of 500
ms. The destination timeout is also checked at intervals of 500 ms.
If two timeouts occur successively, a self-ping failure is detected.
When a destination timeout is received, the Packet Forwarding Engine
fabric stops sending packets to the fabric. To examine the
link condition again, the software resets the credits to ensure that
new requests are sent again. When a self-ping failure occurs, the
line card stops using the affected plane to send data to all destinations.
This ensures that self-ping is not attempted again
on the defective plane.
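
A minimal model of the detection rule (checks every 500 ms, failure after two successive timeouts, affected plane removed) might look like the following. The class and method names are hypothetical, and the credit reset is not modeled.

```python
SELF_PING_INTERVAL_MS = 500  # send and check interval stated in the text
FAILURE_THRESHOLD = 2        # two successive timeouts mean failure


class SelfPingMonitor:
    """Track successive destination timeouts per active plane."""

    def __init__(self, active_planes):
        self.active_planes = set(active_planes)
        self.successive_timeouts = {p: 0 for p in self.active_planes}

    def check(self, timed_out_planes):
        """Run one 500 ms check; return the planes removed in this check."""
        removed = set()
        for plane in sorted(self.active_planes):
            if plane in timed_out_planes:
                self.successive_timeouts[plane] += 1
            else:
                # A successful self-ping breaks the run of timeouts.
                self.successive_timeouts[plane] = 0
            if self.successive_timeouts[plane] >= FAILURE_THRESHOLD:
                removed.add(plane)
        # Removed planes are no longer used, so no further self-pings
        # are attempted on them.
        self.active_planes -= removed
        return removed
```

Under these assumptions a failing plane is removed roughly one second after it first stops returning grants, because two checks spaced `SELF_PING_INTERVAL_MS` apart must both time out.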
The following guidelines apply to the self-ping capability:
By default, self pings are not sent on spare fabric planes
because spare planes do not carry traffic.
The self-ping packets are large enough for their cells to be loaded
over all the active fabric planes (MX2020 supports 24 fabric planes
and MX10008 supports 12 fabric planes).
Detection of received self-ping packets is not performed.
A high-priority queue is used so that self-pings can be
sent in oversubscription cases.